Multimodal Search and Browsing in the Product Catalogue — A Primer for Conversational Search


By Ricardo Sousa and Pedro Miguel Ferreira

This is the third part of the series about iFetch, our multimodal conversational agent for the online fashion marketplace. If you have not read this series from the start, check the first article before proceeding.

Painting the picture for enabling navigation in a conversational system

Customers spend much of their time on product listing pages (PLPs), searching and browsing for a desired product or seeking inspiration. Navigation alternates between the PLP and individual product display pages (PDPs). This is an important point in the customer journey: as our behavioural data shows, the more a customer “adds to the bag” (or cart), the more engaged they are with the desired products, and the higher the transaction’s conversion rate.

Customers’ behaviour throughout these two stages is critical because it helps our teams fine-tune the UI design and the machine learning algorithms behind it. As much as these patterns drive revenue, all engagement must be tailored to the customer’s specific needs. For example, if we fail to provide basic information such as sizes, styles, patterns, and materials, our customers will become dissatisfied and avoid future interactions with our business, and they may swamp our customer service centre with messages.

Figure 1: What individuals love most about their interactions as customers (PwC, 2018).

Speed, convenience, and thoughtful, attentive service are the most critical characteristics of a positive customer experience (see Figure 1). One must therefore also examine what customers expect from a high degree of automation. With this in mind, technical developments should focus on delivering these attributes rather than adopting new technologies merely to stay ahead (PwC, 2018), as covered in the second part of this series and shown in the animation below.

Mockups by Sérgio Pires

Visualising holistic navigation across the conversion funnel makes the message of this post clear: we want to convey a framework for conversational search and discovery.

Enabling search in a conversational setting — From text to multimodal queries: a primer …
Traditional search engines lead to unduly complex technical solutions or ad hoc rules, e.g., based on lexical matches. Instead, we propose to handle this problem with a semantic search strategy, exploring robust, production-ready embedding models that deal better with the nuances of language. (The platform that governs this process is out of the scope of this article; readers are encouraged to check out our blog post, “Powering AI With Vector Databases: A Benchmark.”)
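
As a minimal sketch of what embedding-based retrieval looks like, the snippet below encodes a toy catalogue and a query with an off-the-shelf sentence-transformers model and ranks products by cosine similarity; the model name and product texts are illustrative, not the production iFetch setup, where the embeddings live in a vector database.

```python
# Minimal sketch of semantic (embedding-based) product retrieval.
# Assumption: the sentence-transformers library and the "all-MiniLM-L6-v2"
# checkpoint stand in for whatever encoder powers the real system.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy catalogue: in production, product texts come from catalogue metadata
# and their embeddings are served from a vector database.
products = [
    "Red chequered midi dress in cotton",
    "Black leather ankle boots",
    "Blue denim jacket with patch pockets",
]
product_embeddings = model.encode(products, convert_to_tensor=True)

query = "Show me dresses with a chequered pattern"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank products by cosine similarity to the query.
hits = util.semantic_search(query_embedding, product_embeddings, top_k=3)[0]
for hit in hits:
    print(products[hit["corpus_id"]], round(hit["score"], 3))
```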

Creating a multimodal representation for a retrieval task is very challenging, so we first focus on solving the textual conversational search problem: starting small but aiming big.

That is, given a textual product reference (e.g., “Show me dresses with a chequered pattern,” <system retrieves products with message>, “in red”), we must provide a list of products ranked by their similarity to the query while keeping the context of the dialogue. This gives us a first understanding of how the system responds to such queries.
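
One simple way to keep that dialogue context, shown purely as an illustration (the actual iFetch dialogue handling is more involved), is to fold the constraints of earlier turns into the current query before embedding it:

```python
# Illustrative only: compose the current turn with recent turns so that
# "in red" is retrieved as "dresses with a chequered pattern in red"
# rather than as an isolated fragment.
def compose_query(history: list[str], current_turn: str) -> str:
    context = " ".join(history[-2:])  # keep only the most recent turns
    return f"{context} {current_turn}".strip()

history = ["Show me dresses with a chequered pattern"]
print(compose_query(history, "in red"))
# -> "Show me dresses with a chequered pattern in red"
```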

… it is all about data …
Regarding data and evaluation protocol, we took advantage of the wide variety of queries our customers issue on the FARFETCH platform. We extended this groundwork with artificially generated user queries designed to retrieve the correct listing of products under the different product encodings. By removing stop words, introducing grammar errors, or even creating random token permutations, we injected into our dataset the kind of noise that governs real-world user queries.
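
The perturbations described above can be sketched roughly as follows; the helper functions, stop-word list, and noise rates are assumptions for illustration, not the exact recipe used to build the iFetch dataset.

```python
import random

STOP_WORDS = {"a", "an", "the", "with", "in", "of", "for", "me", "show"}

def drop_stop_words(tokens):
    # Remove common stop words to mimic terse, keyword-style queries.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def swap_adjacent(tokens, p=0.2):
    # Randomly permute neighbouring tokens to mimic sloppy word order.
    tokens = tokens[:]
    for i in range(len(tokens) - 1):
        if random.random() < p:
            tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def add_typo(tokens, p=0.1):
    # Drop a random character from some tokens to imitate spelling errors.
    noisy = []
    for t in tokens:
        if len(t) > 3 and random.random() < p:
            i = random.randrange(len(t))
            t = t[:i] + t[i + 1:]
        noisy.append(t)
    return noisy

query = "Show me dresses with a chequered pattern".split()
print(" ".join(add_typo(swap_adjacent(drop_stop_words(query)))))
```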

Figure 2: Illustration of a fully functioning navigation in a conversation within iFetch.

… collaboration that leads to breakthroughs
Our retrieval task over textual product representations helps us understand and use our catalogue’s product attributes. However, effective multimodal product representations can improve the system’s textual retrieval by leveraging visual information that does not exist in the product metadata.

Taking CLIP (Radford et al., 2021) as inspiration, our partners at IST explored how to map product metadata and product images to a common embedding space. CLIP learns multimodal representations through contrastive training on image-text pairs, aligning the visual embeddings of images with the text embeddings of the corresponding product information. This enables querying for images with a text input (and vice versa), since the text embedding of the query can be used to retrieve the closest visual embeddings in the catalogue.
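
As a rough illustration of this kind of text-to-image querying, the sketch below uses a public CLIP checkpoint through the Hugging Face transformers API; the checkpoint, image paths, and query are placeholders rather than the model our partners trained.

```python
# Sketch of text-to-image retrieval with a public CLIP checkpoint.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["dress_1.jpg", "dress_2.jpg", "boots_1.jpg"]  # placeholders
images = [Image.open(p) for p in image_paths]

inputs = processor(
    text=["a red dress with a chequered pattern"],
    images=images,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between the text query and every catalogue image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
ranking = scores.argsort(descending=True).tolist()
print([image_paths[i] for i in ranking])
```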

We extended CLIP to incorporate additional taxonomic information about the fashion domain, which relates products according to a domain-specific hierarchical tree. The contrastive loss function used to train CLIP assumes that there is only one valid textual description for each product image in the catalogue. We relaxed this rule by lowering the penalty for acceptable image-description pairs (Morais et al., 2022). Combining the contrastive loss with label relaxation ensured taxonomically relevant results.
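
To make this concrete, here is a toy PyTorch sketch of one plausible way to soften the one-hot contrastive targets with an acceptability matrix derived from the taxonomy; it is an illustration under our own assumptions, not the exact formulation of Morais et al. (2022).

```python
import torch
import torch.nn.functional as F

def relaxed_contrastive_loss(image_emb, text_emb, acceptable,
                             temperature=0.07, eps=0.2):
    """CLIP-style contrastive loss with softened (relaxed) targets.

    image_emb, text_emb: (N, d) L2-normalised embeddings of paired items.
    acceptable: (N, N) boolean matrix; acceptable[i, j] is True when text j
        is an acceptable description of image i (e.g. same taxonomy node).
    eps: fraction of the target mass redistributed over acceptable pairs.
    """
    logits = image_emb @ text_emb.T / temperature  # (N, N) similarity scores

    # Standard CLIP target: identity matrix (exactly one positive per image).
    hard = torch.eye(len(logits), device=logits.device)

    # Label relaxation: move eps of the mass onto other acceptable pairs,
    # lowering the penalty when the model ranks them highly.
    soft = acceptable.float()
    soft = soft / soft.sum(dim=1, keepdim=True).clamp(min=1)
    targets = (1 - eps) * hard + eps * soft

    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets.T * F.log_softmax(logits.T, dim=1)).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```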

The fine-tuned models using label relaxation outperform our baselines for most queries, be they textual, visual, or both. However, text queries involving brands have significantly worse results, failing to surface products from the brand referenced in the query. When a brand is mentioned in a text query, we must retrieve products from that brand.

We alleviated this problem by exploring different training strategies. One combined different text input combinations with data augmentation techniques during training, while consistently extending the label relaxation loss to keep results relevant taxonomy-, brand-, and colour-wise. That led to superior results, as shown in the snapshots below from a live demonstrator (see Figure 2).
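
The “different text input combinations” can be pictured as sampling which metadata fields (brand, category, colour, description) form the text side of each training pair; the field names and sampling probabilities below are hypothetical and only illustrate the idea.

```python
import random

def build_text_input(product: dict) -> str:
    # Hypothetical field names; the real catalogue schema may differ.
    parts = [product["category"], product["short_description"]]
    # Randomly include brand and colour so the model also learns to match
    # queries that mention (or omit) them.
    if random.random() < 0.7:
        parts.insert(0, product["brand"])
    if random.random() < 0.7:
        parts.append(product["colour"])
    return " ".join(parts)

product = {
    "brand": "Acme Couture",  # hypothetical product record
    "category": "midi dress",
    "colour": "red",
    "short_description": "chequered pattern, cotton blend",
}
print(build_text_input(product))
```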

Concluding remarks

Mockups by Sérgio Pires

The results thus far include a system capable of greeting the customer, assisting them in navigating the plethora of products available in our catalogue, and providing the correct information about the fashion items under consideration. One of these capabilities, letting a customer convey the product they have in mind by uploading an image, is a critical milestone in our project.

How was this achieved, and how was it incorporated into a dialogue? Feedback from a small group of clients guided the ongoing research and development while paving the way for upcoming work.

We will keep you updated in forthcoming posts.

References

Morais, Joao, et al. (2022). “Say yes to the fetch: Product retrieval on a structured multimodal catalog.” In 1st Annual AAAI Workshop on AI to Accelerate Science and Engineering.

PwC (2018). Experience is everything. Get it right.

Radford, Alec, et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” In: CoRR abs/2103.00020.

Originally published at https://www.farfetchtechblog.com on March 7th, 2023.
