Human communication is a rather messy ordeal. What we say often requires real world knowledge, we often use wordplay, and the way we communicate can change rapidly and depend on the medium. Just think about the emergence of Twitter and how we communicate there!
For a human this is effortless: we can quickly adapt to new rules of communication, understand puns and put sentences into context without any active effort. But for a computer, these are extremely complex and computationally expensive tasks.
Natural language processing, or NLP, sits at the intersection of machine learning, statistics and linguistics, and aims to help computers decipher the context of written or spoken language as communicated by humans.
Since the amount of content available to us continues to explode, search engines play a central role in modern society – from the Googles of the world, to internal search in your CRM, or that on-site product search engine that helps you find the product you’re looking for.
Since using a search engine means that you are in fact communicating with a system, NLP seems like a perfect fit for search engines. And in many cases it is.
We expect to be able to give Google a rather natural query, like “best cheap compact cameras of 2017”, and get relevant results.
We expect Google to figure out that “compact cameras” is our subject, that we want them to be cheap, and that the camera should have been released in 2017 or late 2016. For a computer, however, this is a tough task to handle.
This is the primary reason thousands of SEO agencies still exist. Their main role is to make website content easier for Google’s algorithm to index and understand: creating content headings and keyword-friendly meta descriptions that make NLP possible for Google.
But one has to understand that the way we communicate with search engines depends largely on the medium and intent – just like normal communication between humans.
When searching on Google, we tend to use a sort of keyword-heavy natural language, like the example above. When using voice commands, we will often use more filler words like “let”, “do” and “uhm”.
There has been a recent upswing in the number of search engines focused on e-commerce product search that claim to be using advanced NLP, which lets them understand naturally written or spoken queries. They use examples such as “red dress for under $200” as queries which they handle well.
Now that’s all fine and well, but how useful is it? What rules of communication do we use when communicating with product search engines?
A recent study showed that in Q1 of 2017, 82% of all search queries contained two or fewer words, and 93% contained three or fewer. The way we communicate with an e-commerce website is extremely keyword-heavy and concise – not very “natural” at all.
So Why Are We Searching for "Cheap Cameras from 2017" on Google but Not on E-Commerce Sites Directly?
One reason may be that our intent for Google search is to find discussions and do research in our purchase process. And even though we expect “cheap” to be subjective, we’ve learned that content creators and SEO agencies (especially product top lists) usually use words like “cheap” and “best” in their communication.
In an e-commerce store we understand that using a word like “cheap” is too subjective. We can’t simply translate “cheap” into a rule such as “less than $100”, since what counts as cheap depends on the type of product and on your perception of cheap. If we had some contextual knowledge about the product type, we could create a relative measure of “cheapness” – but it would be conditioned on the products held by this specific store, unless the store has worldwide market knowledge about the actual price range of the product type.
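To make the idea of a relative “cheapness” measure concrete, here is a minimal sketch. It assumes we already have all prices for a given product type in the store’s catalogue, and arbitrarily defines “cheap” as the bottom quartile within that type – the quartile cutoff is an illustrative choice, not a rule from the article.

```python
import statistics

def is_cheap(price, category_prices):
    # A product counts as "cheap" relative to its own product type:
    # here, at or below the first quartile of prices in that category.
    # (The quartile threshold is an arbitrary, illustrative choice.)
    first_quartile = statistics.quantiles(category_prices, n=4)[0]
    return price <= first_quartile
```

Note that the result is conditioned entirely on this store’s assortment: a $400 camera may be “cheap” in a store full of high-end gear and “expensive” elsewhere, which is exactly the limitation described above.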
So some search engines are trying to create a language where the consumer can enter an almost rule-based query such as “compact camera < $200” or “compact camera under $200”. In many cases these are just that – rules. They detect the presence of “<” or “under” and create a rule to filter out products priced above $200, much like a normal price facet works.
This is a trivial task, but not very useful since nobody uses this type of rule-based communication in an e-commerce store. You would have to teach the consumer how to communicate with your site before this would become efficient, and the likelihood of this happening is slim since there is no standardized way of how this form of communication would look.
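A sketch of how such a rule-based parser might look – a simple regular expression, not any vendor’s actual implementation – shows just how shallow the “NLP” can be:

```python
import re

def parse_price_filter(query):
    # Detect a rule-based price constraint like "under $200" or "< $200"
    # and split the query into keywords plus a max-price filter.
    # (Illustrative sketch only; real engines may use different patterns.)
    match = re.search(r"(?:<|under|below|less than)\s*\$?(\d+(?:\.\d+)?)",
                      query, re.IGNORECASE)
    if not match:
        return query.strip(), None
    max_price = float(match.group(1))
    keywords = query[:match.start()].strip()
    return keywords, max_price
```

For example, `parse_price_filter("compact camera under $200")` yields the keywords `"compact camera"` and a maximum price of `200.0` – functionally identical to the shopper clicking a price facet.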
The beauty of keyword heavy communication is that the format of communication is standardized – you know it works on most e-commerce sites.
One also has to realise that, as with all Machine Learning models, the quality of your NLP model will always depend on the quality and amount of data you have at your disposal.
Solving many of the problems NLP tackles in e-commerce, such as synonyms (is it “waterproof” or “gore-tex”?), becomes very hard: a single store in isolation will never have enough data to learn synonyms reliably.
Lexical databases such as WordNet can help, but they are limited: WordNet mainly covers English synonyms, making it unfeasible for international stores, and each e-commerce store likely has many unique terms – SKUs, specific brands and product names – that are not part of any lexical database.
At the same time, solving the problem with synonyms is extremely important for an e-commerce store since there will be a discrepancy between what a consumer calls a product and how the metadata describes the product.
You also have the problem of ambiguity in common words – what does “action” mean? It can be highly personal, where some people may like sci-fi action, while others prefer superhero action.
But If NLP Can’t Take Us All the Way (Yet), What Can We Do?
Due to the way e-commerce catalogues are structured, and because of the way consumers communicate with stores, we can shift focus: instead of putting our energy into training NLP models, we can exploit e-commerce stores’ extensive labeling of products and build models that find meaningful relations between products. These relations can then be used to create a similar, but more reliable, effect to what we hope to achieve with NLP.
For example, if we understand that there is a strong relation between "gore-tex" products and “waterproof” we can choose to treat them as similar in their meaning.
And if a consumer searches for “action”, we can identify the different types of action movies based on the relations found between products – using their descriptions, listed actors, directors and so on.
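One simple way to find such relations – a toy sketch, not Loop54’s actual method – is to measure how often two labels co-occur on the same products. If “gore-tex” and “waterproof” are attached to largely the same items, we can treat them as related in meaning:

```python
from collections import defaultdict
from itertools import combinations

def related_attributes(catalogue, min_jaccard=0.5):
    # catalogue maps product id -> set of attribute labels.
    # For each attribute, collect the set of products carrying it,
    # then score attribute pairs by Jaccard similarity of those sets.
    products_with = defaultdict(set)
    for product_id, attrs in catalogue.items():
        for attr in attrs:
            products_with[attr].add(product_id)

    related = {}
    for a, b in combinations(sorted(products_with), 2):
        overlap = products_with[a] & products_with[b]
        union = products_with[a] | products_with[b]
        score = len(overlap) / len(union)
        if score >= min_jaccard:
            related[(a, b)] = score
    return related
```

With a catalogue where most “gore-tex” jackets are also labeled “waterproof”, the pair scores highly and can be treated as a learned synonym – entirely from the store’s own labeled data, with no external lexical database.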
In a way, this simplifies the problem of training models, since we build them on the kind of data e-commerce sites already have (i.e. labeled and categorized product data) instead of data where e-commerce is traditionally weak (i.e. exhaustive product descriptions).
Modeling the relations and structure between products also gives us a foundation to build on, where we can add functionality such as one-to-one personalised search results – an area where NLP is not a natural fit.
So Is NLP the Future of E-Commerce Product Search?
NLP will continue to deliver valuable functionality, such as identifying and separating an item (e.g. shirt) from an attribute (e.g. white). And, as with almost all machine learning, it’s hard to know exactly how research will evolve – NLP may very well be the future.
But in order to maximize your profits today and in the foreseeable future, it is important to understand that consumers are not communicating with your online shop in a natural way, and there are no signs that this will change soon.
Because of this, we at Loop54 believe that the most valuable systems today will instead be modeling the relations between products, personalising sorting and filtering and innovating new ways to explore a product catalogue through search.
Here's how Loop54 works: