Using a Language Model for Collaborative Filtering

01 Jul 2024

Collaborative filtering is a popular method for solving recommendation tasks. Can a language model do it? Let’s find out.

The Task

We will use the classic MovieLens 100k dataset to evaluate the performance of the language model based recommendation. The dataset consists of 100,000 movie ratings from 1000 users. Each data point is a tuple of three values: user_id, movie_id, rating. The rating value is an integer between 1 and 5. The data are split into 80,000 training and 20,000 test samples. The task is simple, predict the rating for a given user_id, movie_id pair.

In general, collaboarative filtering exploits a simple idea: if Alice likes movies 1, 2, 3, 4 and Ben likes movies 1, 2, 3, there is a good chance that Ben will like movie 4 as well.

Retrieval-Augmented Generation

To help the language model produce relevant predictions, we apply retrieval-augmented generation to create the model instructions. When asking the model to predict ratings from a user $i$, we tell it how the user, as well as users with similar preferences, have rated movies in the past. This is of course not my original invetion, there are many papers online describing this approach. The whole setup is described in the following diagram:

LLM Recsys

The process starts with transforming the training tuples into a user-item matrix $R$ where $r_{ij}$ is the rating that the user $i$ has given to the item (movie) $j$. The shape of $R$ is $n \times m$ because there are $n$ users and $m$ movies. With the user-item matrix ready, we can proceed to compute user embeddings by factorizing $R$ with singular value decomposition into user factors $U$ and item factors $V$

\[R = U \Sigma V^T\]

where $\Sigma$ are the singluar values. In our case, we are interested only in the first $n \times n$ matrix $U$. We take the first 100 columns of $U$ to get an $n \times 100$ embedding matrix where the $i$-th row stores the $1 \times 100$ embedding vector of the $i$-th user. All this can be done in two lines of Python:

n_dims = 100
u, s, v = np.linalg.svd(train_user_item, full_matrices=True)
train_user_embeddings = u[:, :n_dims]

The cosine distance between a pair of vectors representing users that have rated movies in a similar way will be low. If the users frequently disagreed, the distance would be high. This allows us to take a user $i$ and find their $k=20$ nearest neighbors (kNN) with similar historical ratings.

from sklearn.neighbors import NearestNeighbors

train_knn = NearestNeighbors(n_neighbors=20, metric="cosine")
train_knn.fit(train_user_embeddings)

user_id = 23
vec = train_user_embeddings[user_id, np.newaxis]
nearest_user_ids = train_knn.kneighbors(vec, return_distance=False)

Now we have everyting we need to construct the instructions for the language model. You can see the exact prompt in the snippet bellow:

def get_instruction(
    user_history: UserHistory,
    nearest_user_histories: dict[int, UserHistory],
    movies_to_predict: list[int],
) -> str:
    
    formatted_nearest_user_histories = "\n".join(
        f"user {i}: {fmt_user_history(near_user_history)}."
        for i, near_user_history
        in enumerate(nearest_user_histories.values())
    )
    
    return f"""
        A user has seen and rated the following movies:
        {fmt_user_history(user_history)}.

        Users with similar preferences have seen and rated
        these movies: {formatted_nearest_user_histories}.

        Each comma-separated value in the lists above represents
        a movie id followed by a numerical user rating.

        Use the data above to predict the ratings the user has
        given to the movies: {fmt_list(movies_to_predict)}.
        Give exactly one rating to each of the movies.
        
        Respond with just a list of ratings and nothing else.
        Your response needs to have exactly
        {len(movies_to_predict)} ratings.
    """

Performance

I have tried two different models from OpenAI: gpt-4o and gpt-4o-mini and compared their test set performance with Surprise SVD algorithm and random predictions. The results show that the language models underperform the Surprise algorithm.

Eval

Notebook with all the code is available on my GitHub.