Scaling Recommender Systems with Vespa

FARFETCH Technology · Mar 26, 2024


By Ricardo Rossi Tegão

Inspiring customers is part of Farfetch’s mission. We use real-time personalized recommendations to match customers with items that are aligned with their style and persona. Our services provide recommendations not only to FARFETCH but also to several other online retailers. This poses a unique challenge: building machine learning services scalable and resilient enough to serve recommendations to a myriad of online retailers.

One of our major challenges is to do this without growing our infrastructure indefinitely. To tackle this scalability challenge, we chose to use a vector database, and Vespa matched our needs because of all the features it offers:

  • Tensor Operations: Dense and sparse vector operations (e.g. map, reduce, slice, etc.).
  • Complex Ranking: Compute different signals (e.g. bm25, dot-product, cosine, etc.) and use them freely to rank document matches.
  • Keyword Search: You can use Vespa like Elasticsearch.
  • LTR: Native model serving that can be used for ranking (e.g. learning-to-rank models).
  • Filtering: Pre and post-filtering that you can tune to fit your use cases.
  • Performance: C++ code optimised to use SIMD operations when possible.

With this, we intend to serve recommendations for all platforms using Vespa, without scalability issues when a new online retailer joins our services.

This blog post details how we integrate one of our recommenders with Vespa.

Recommender Review

We use a model to decompose the feature interactions (i.e., user interactions) into latent factors. This decomposition allows the model to learn a set of latent factors for each feature (capturing the interactions among them), which are then used to provide recommendations for different use cases.

The training process is illustrated in the figure below. As training input, it receives two matrices:

  • UserProductInteractions: the interactions that the user had with the products in the catalogue.
  • UserProductFeatures: users projected into the product-features space; we obtain this matrix by multiplying the user-product interactions matrix by the product-features matrix (written out below).
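In matrix notation (ours, not the original figure’s), this is simply:

\[
\text{UserProductFeatures} = \text{UserProductInteractions} \times \text{ProductFeatures}
\]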

As a result of the training process we have:

  • UserFeaturesEmbeddingsMatrix
  • ItemsEmbeddingsMatrix

The content of the matrices above will be detailed in the following section.

The inference process follows the equation below to provide user-product recommendations, according to Kula (here).
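For reference, the prediction in Kula’s model takes (roughly) the form

\[
\hat{r}_{ui} = f\left(\mathbf{q}_u \cdot \mathbf{p}_i + b_u + b_i\right)
\]

where \(\mathbf{q}_u\) and \(\mathbf{p}_i\) are the user and item representations obtained by summing the latent factors of their features, and \(b_u\), \(b_i\) are bias terms. This is the structure the inference pipeline below reproduces.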

The entire inference process is represented in the image below.

In the image above, there are four different data sources. Let’s understand what they mean:

  • Real-time user product scores: a service that holds information about the user’s previous interactions with products.
  • ProductMatrix: products in the rows and product-features in the columns.
  • UserFeaturesEmbeddingsMatrix: product-features in the rows and latent factors in the columns.
  • ItemsEmbeddingsMatrix: items in the rows and latent factors in the columns.

With the necessary data clear, let’s go through the process of generating user-product recommendations:

  1. Fetch user-product similarity (scores) from a real-time service (highlighted in green).
  2. Multiply the user-product similarity vector by the ProductMatrix (highlighted in blue); as a result, we have the user’s representation in the product-features space.
  3. Multiply this vector representation of the user by the UserFeaturesEmbeddingsMatrix (highlighted in yellow) to produce the product-features representation in the latent-factor space.
  4. Multiply the product-features representation in the latent-factor space by the ItemsEmbeddingsMatrix (highlighted in red); this gives a vector of products and scores, which need to be ordered to provide the user-product recommendations (the full chain is summarised below).
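Writing \(\mathbf{s}\) for the vector of real-time user-product scores (our notation, not taken from the original figures), the whole chain is:

\[
\mathbf{u} = \mathbf{s} \cdot \text{ProductMatrix}, \qquad
\mathbf{z} = \mathbf{u} \cdot \text{UserFeaturesEmbeddings}, \qquad
\text{scores} = \mathbf{z} \cdot \text{ItemsEmbeddings}^{\top} + \mathbf{b}
\]

where \(\mathbf{b}\) holds the item bias values added in the final step; sorting the entries of scores yields the recommendation list.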

How to do it with Vespa?

Vespa helps us in the inference process: we store the matrices from the figure above (ProductMatrix, UserFeaturesEmbeddings, and ItemsEmbeddings) in Vespa and make HTTP requests to produce user-product recommendations.

The trickiest part of this process was figuring out how to store the data in a way that lets us perform the necessary calculations and meet our latency requirements (in our case, under 100 ms per request). This took several iterations until we found our optimal solution, which is the one we share here.

Product Matrix

There are a few aspects that we had to take into consideration when developing the schema and rank-profile for this matrix:

  • This matrix is very sparse (only 0.1% of the data is filled in)
  • We need to multiply the user-product similarity (scores) by this matrix. However, considering the size of our catalogue, it’s unlikely that a user has interacted with all the products in it, so we need to perform a dot-product where the shapes of the “vector” and the matrix are different (shape mismatch).

The figure below illustrates how we chose to store it. Each document in our schema represents a feature of the matrix and has a field with key-value pairs (a mapped tensor) of product_id and the feature value. The main goal is to store only the non-zero data, to avoid unnecessary memory consumption and improve dot-product performance.
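A minimal sketch of what such a schema could look like (schema, field, and dimension names here are illustrative, not our exact production configuration):

```
schema product_feature {
    document product_feature {
        # One document per feature, i.e. one row of the ProductMatrix.
        field feature_id type string {
            indexing: summary | attribute
        }
        # Mapped (sparse) tensor: only the non-zero values of this feature
        # are stored, keyed by product_id.
        field product_values type tensor<float>(product{}) {
            indexing: attribute | summary
        }
    }
}
```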

The figure below illustrates the dot-product with the shape mismatch. In the case where a user has interacted with just four products (P0, P1, P2, and P3), we need to perform a dot-product with our ProductMatrix, which contains millions of products.

To make this possible, we propose the following rank-profile. The inputs enable us to send key-value pairs (product scores), which are then multiplied by the corresponding product entries in each feature document, if they exist.
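A sketch of what this rank-profile could look like, assuming the mapped tensor field from the schema sketch above (again, the names are illustrative):

```
# Inside the product_feature schema sketched above.
rank-profile user_projection {
    inputs {
        # Key-value pairs of product_id -> interaction score, sent at query time.
        query(user_product_scores) tensor<float>(product{})
    }
    first-phase {
        # Sparse join on the shared "product" dimension, then sum: products
        # missing from either tensor simply contribute nothing.
        expression: sum(query(user_product_scores) * attribute(product_values))
    }
}
```

Each matched feature document then carries its dot-product contribution as its relevance score, and together these scores form the user’s vector in the product-features space.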

The result of this calculation is a dense vector representing the user in the product-features space, with a size equal to the number of features that have non-zero values for the products the user has interacted with.

User Features Embeddings Matrix

The output of the previous step, where s is the score for each feature, is used to multiply the UserFeaturesEmbeddingsMatrix. The same shape-mismatch problem occurs here, so the same rank-profile strategy is applied. The only difference is that the UserFeaturesEmbeddingsMatrix is dense, so we chose to store the whole matrix in a single document (storing the data and performing the necessary calculations this way gave us better performance).

The figure below illustrates the calculation to be performed. The output of this step is a dense vector with a size equal to the number of latent factors we chose in the model training step.
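A sketch of what this could look like, assuming the whole matrix is stored in one document as a tensor with a mapped feature dimension and an indexed factor dimension (64 factors is only a placeholder), and that the resulting vector is read back through summary-features rather than as the relevance score:

```
schema feature_embeddings {
    document feature_embeddings {
        # The whole dense matrix in a single document:
        # mapped "feature" dimension x indexed "factor" dimension.
        field embeddings type tensor<float>(feature{}, factor[64]) {
            indexing: attribute
        }
    }
    rank-profile latent_projection {
        inputs {
            # Output of the previous step: one score per product-feature.
            query(user_feature_scores) tensor<float>(feature{})
        }
        function latent_vector() {
            # Join on "feature" and reduce it away, leaving a factor[64] vector.
            expression: sum(query(user_feature_scores) * attribute(embeddings), feature)
        }
        # The relevance score itself is not interesting here; the vector is
        # returned through summary-features instead.
        first-phase {
            expression: 1
        }
        summary-features {
            latent_vector
        }
    }
}
```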

Items Embeddings Matrix

This is the final step, and the output will be the ordered product recommendations for our users. The output of the previous step was a dense vector with a size equal to the number of latent factors, which matches the latent-factor dimension of the ItemsEmbeddingsMatrix, so this time there is no shape mismatch in the dot-product.

The figure below illustrates the way we stored the matrix and performed the calculation. Each product is stored as a single document, so when the dot-product is performed we leverage Vespa’s implementation of HNSW to calculate the vector closeness, and then add the necessary bias values.
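A sketch of how the product documents and ranking could be set up (field names, the 64-factor size, and the HNSW parameters are illustrative):

```
schema product {
    document product {
        # Latent-factor embedding of the product, indexed with HNSW
        # for approximate nearest-neighbour search.
        field embedding type tensor<float>(factor[64]) {
            indexing: attribute | index
            attribute {
                distance-metric: dotproduct
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 200
                }
            }
        }
        # Bias value learned for this product during training.
        field bias type float {
            indexing: attribute
        }
    }
    rank-profile recommendation {
        inputs {
            query(user_latent_vector) tensor<float>(factor[64])
        }
        first-phase {
            # closeness() is filled in by the nearestNeighbor query operator;
            # the product bias is added on top, as described above.
            expression: closeness(field, embedding) + attribute(bias)
        }
    }
}
```

At query time, the nearestNeighbor query operator would be used with query(user_latent_vector) as the query tensor to retrieve the closest products before this ranking is applied.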

Summary and Future Work

Using Vespa to perform matrix multiplication can be challenging, especially when you have performance requirements to meet, but it becomes a powerful tool once you figure out the critical points of how to store the data and develop the necessary rank-profiles. Our team successfully implemented the entire recommendation process of one algorithm with Vespa, meeting both the latency requirement (recommendations served in under 100 ms) and our scalability needs.

One important thing to keep in mind is that each step/calculation demonstrated here is a distinct HTTP request. While Vespa’s query-processing time plays a significant role in the overall response latency, network latency also needs to be taken into account.

In future work, we could consider performing all of the necessary calculations in a single HTTP request, avoiding unnecessary network latency.
