01 Jul 2024
Collaborative filtering is a popular method for solving recommendation tasks. Is it a good idea to build it around a language model? Let’s find out.
We will use the classic MovieLens 100k dataset to evaluate the performance of the language-model-based recommendation. The dataset consists of 100,000 movie ratings from 943 users on 1,682 movies. Each data point is a tuple of three values: `user_id, movie_id, rating`. The `rating` value is an integer between 1 and 5. The data are split into 80,000 training and 20,000 test samples. The task is simple: predict the `rating` for a given `user_id, movie_id` pair.
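For concreteness, here is a minimal loading sketch, assuming the `u1.base`/`u1.test` files that ship with the ml-100k archive (they hold exactly this 80k/20k split):

```python
import pandas as pd

# ml-100k files are tab-separated: user_id, movie_id, rating, timestamp.
cols = ["user_id", "movie_id", "rating", "timestamp"]
train = pd.read_csv("ml-100k/u1.base", sep="\t", names=cols)
test = pd.read_csv("ml-100k/u1.test", sep="\t", names=cols)
```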
In general, collaborative filtering exploits a simple idea: if Alice likes movies 1, 2, 3, and 4 and Ben likes movies 1, 2, and 3, there is a good chance that Ben will like movie 4 as well.
To help the language model produce relevant predictions, we apply retrieval-augmented generation to create the model instructions. When asking the model to predict ratings for a user $i$, we tell it how the user, as well as users with similar preferences, have rated movies in the past. This is of course not my original invention; there are many papers online describing this approach. The whole setup is described in the following diagram:
The process starts with transforming the training tuples into a user-item matrix $R$ where $r_{ij}$ is the rating that user $i$ has given to item (movie) $j$. The shape of $R$ is $n \times m$ because there are $n$ users and $m$ movies. With the user-item matrix ready, we can compute user embeddings by factorizing $R$ with singular value decomposition into user factors $U$ and item factors $V$:

\[R = U \Sigma V^T\]

where $\Sigma$ is the diagonal matrix of singular values. In our case, we are interested only in the $n \times n$ matrix of user factors $U$. We take the first 100 columns of $U$ to get an $n \times 100$ embedding matrix whose $i$-th row stores the $1 \times 100$ embedding vector of the $i$-th user. All this can be done in two lines of Python:
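A sketch of the whole step, assuming the `train` frame from the loading snippet above (the factorization itself is the last two lines):

```python
import numpy as np

# n x m user-item matrix; ids in ml-100k are 1-based, unrated cells stay 0.
n_users, n_movies = 943, 1682
R = np.zeros((n_users, n_movies))
R[train["user_id"] - 1, train["movie_id"] - 1] = train["rating"]

# The two lines: factorize R and keep the first 100 columns of U.
U, _, _ = np.linalg.svd(R, full_matrices=False)
user_embeddings = U[:, :100]  # one 1 x 100 embedding per user
```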
The cosine distance between the embeddings of two users who have rated movies in a similar way will be low; if the users frequently disagreed, the distance will be high. This allows us to take a user $i$ and find their $k=20$ nearest neighbors (kNN) with similar historical ratings.
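One way to do the lookup, assuming scikit-learn and the `user_embeddings` matrix from above (`nearest_neighbors` is an illustrative helper, not the notebook's exact code):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Pairwise cosine distances between all user embeddings (n x n).
D = cosine_distances(user_embeddings)

def nearest_neighbors(i, k=20):
    # Users sorted by distance to user i; position 0 is user i itself.
    return np.argsort(D[i])[1 : k + 1]
```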
Now we have everything we need to construct the instructions for the language model. You can see the exact prompt in the notebook linked at the end; the snippet below sketches its general shape.
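This is an illustrative paraphrase, not the exact prompt; the helper and its arguments are assumptions:

```python
def build_prompt(target_movie, own_history, neighbor_history):
    # own_history / neighbor_history: lists of (movie_title, rating) pairs
    # for the user and for their nearest neighbors respectively.
    own = "\n".join(f"- {title}: {rating}" for title, rating in own_history)
    neigh = "\n".join(f"- {title}: {rating}" for title, rating in neighbor_history)
    return (
        "You are a movie recommender system.\n\n"
        f"The user has rated these movies in the past:\n{own}\n\n"
        f"Users with similar preferences have rated movies as follows:\n{neigh}\n\n"
        f"Predict the user's rating of the movie '{target_movie}' "
        "as a single integer between 1 and 5. Answer with the number only."
    )
```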
I have tried two different models from OpenAI, gpt-4o and gpt-4o-mini, and compared their test-set performance with the Surprise SVD algorithm and with random predictions. The results show that the language models underperform the Surprise baseline.
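For reference, a sketch of how each prediction could be requested with the OpenAI Python client (not the notebook's exact loop; parsing the reply as a bare integer is an assumption that the model follows the prompt):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_rating(prompt, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```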
A notebook with all the code is available on my GitHub.