04 May 2025
It is well known that language model performance on tasks such as information retrieval decreases as context length increases. In this post, we are going to set up a simple experiment to study this behavior in more detail.
We will provide a language model with a system message containing a list of many different questions as context. Each question will have an associated unique UUID. Then, we will repeat 10 questions from the list in our user message and ask the model to retrieve the correct UUID for each one. The accuracy of the response (the share of correct UUIDs) will measure the model’s performance on the task. We will use the gpt-4o-mini language model for this test and repeat it 10 times to get 100 retrievals in total.
You are a helpful assistant. You are given a database of many different questions
each associated with a UUID.
Users will provide a list of 10 questions and your goal is to retrieve the correct UUID
from the database for each of the questions.
The database is included in this system prompt. Return just comma separated
list of 10 UUIDs and nothing else.
The database:
[
{
"uuid": "5babf985-30ac-48f1-a678-977d5ba39b54",
"question": "Who won the Oesper Award in 1983?"
},
{
"uuid": "3b3242e5-6327-4eca-9351-09772c9b722f",
"question": "What EP did Machine Girl release in 2016?"
},
// many more questions...
]
We will run multiple iterations of the experiment where we modify: database size (from 100 to 1500 questions), language (English questions as well as the same questions translated to Czech), and serialization format (JSON and YAML).
The main expectation is that the retrieval accuracy will be worse for the large databases (long context) than for the smaller ones (short context).
The questions used in the experiment come from my Czech-SimpleQA eval project, which I have written about in a previous post.
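Before getting to the results, here is a rough sketch of what a single iteration looks like. This is only an outline under my own assumptions: the helper run_iteration, the sampling, and the scoring are illustrative, not the exact notebook code.
import random
from openai import OpenAI

client = OpenAI()

def run_iteration(database: list[dict], system_prompt: str, n_asked: int = 10) -> float:
    # illustrative sketch: sample 10 questions and ask the model for their UUIDs
    asked = random.sample(database, n_asked)
    user_message = "\n".join(q["question"] for q in asked)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    # the prompt asks for a comma-separated list of 10 UUIDs
    answered = [u.strip() for u in response.choices[0].message.content.split(",")]
    expected = [q["uuid"] for q in asked]
    # accuracy = share of correctly retrieved UUIDs
    return sum(a == e for a, e in zip(answered, expected)) / n_asked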
The length of the context that a language model is working with is measured in number of tokens. The issue is that the same information can be encoded into countless different strings of characters, each with a different token count.
To visualize this, we can use basic HTML and OpenAI’s o200k_base tokenizer from their tiktoken library. This tokenizer is used in gpt-4o-mini.
import tiktoken
from IPython.display import HTML, display

tokenizer = tiktoken.get_encoding("o200k_base")

def html_tokens(text: str) -> str:
    opacity = 0.3
    colors = [
        f"rgba(164, 12, 43, {opacity})",
        f"rgba(35, 93, 156, {opacity})",
        f"rgba(12, 164, 133, {opacity})",
    ]

    def get_color(i: int) -> str:
        # cycle through the three background colors
        j = i % len(colors)
        color = colors[j]
        return f'style="background-color:{color}"'

    tokens = tokenizer.encode(text)
    # wrap each decoded token in a colored <span> so token boundaries are visible
    html_elements = [
        f'<span {get_color(i)}>{tokenizer.decode([token])}</span>'
        for i, token in enumerate(tokens)
    ]
    return "<pre>" + "".join(html_elements) + "</pre>"

display(HTML(html_tokens("this is tokenized hello world!")))
We can use this function to show the difference in tokenization of JSON and YAML data. The same two questions in the example below produce 41 and 27 tokens when formatted as JSON and YAML, respectively. That is a 34% difference in context length!
The explanation is simple: JSON contains characters such as {}",: and lots of whitespace indentation, which consume additional tokens compared to YAML.
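We can verify the gap by serializing the same two questions both ways and counting tokens. A minimal sketch, assuming json.dumps with an indent of 4 and yaml.safe_dump, so the exact counts may differ slightly from the 41 and 27 quoted above:
import json
import yaml

pair = [
    {"uuid": "5babf985-30ac-48f1-a678-977d5ba39b54",
     "question": "Who won the Oesper Award in 1983?"},
    {"uuid": "3b3242e5-6327-4eca-9351-09772c9b722f",
     "question": "What EP did Machine Girl release in 2016?"},
]

as_json = json.dumps(pair, indent=4)
as_yaml = yaml.safe_dump(pair, sort_keys=False, allow_unicode=True)

# JSON spends tokens on braces, quotes and indentation; YAML mostly does not
print(len(tokenizer.encode(as_json)), len(tokenizer.encode(as_yaml)))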
We can try the same thing again, this time comparing identical questions written in English and Czech (both formatted as YAML). English gives us 27 tokens, while Czech produces 42 tokens.
The reason for this difference is that the tokenizer was trained on mostly English content from the Internet. It uses the byte-pair encoding algorithm that merges frequently occurring character sequences into single tokens. There is an implementation of the algorithm on my GitHub, if you are interested in more details.
The algorithm tokenizes the English word original into a single token because it has seen the word many times during the training. On the other hand, the Czech word původní is encoded into three separate tokens, as it was not very common in the training data.
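We can see this directly with the tokenizer defined above:
for word in ["original", "původní"]:
    pieces = [tokenizer.decode([t]) for t in tokenizer.encode(word)]
    print(word, len(pieces), pieces)
# "original" comes back as a single token,
# while "původní" is split into three smaller pieces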
If we take a sample of Czech and English questions from Czech-SimpleQA, we can see that even though Czech questions have on average lower word count, they produce higher token counts than their English counterparts.
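The comparison itself is a few lines of code (a sketch; english_questions and czech_questions stand for parallel lists of question strings sampled from Czech-SimpleQA and are not variables defined in this post):
def avg_tokens(questions: list[str]) -> float:
    # average number of o200k_base tokens per question
    return sum(len(tokenizer.encode(q)) for q in questions) / len(questions)

print(avg_tokens(english_questions), avg_tokens(czech_questions))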
The overall experiment results are summarized in the table below. As expected, retrieval accuracy decreases as the database size increases. It is also not surprising that the additional JSON tokens produce a longer context (column db_token_count), resulting in worse accuracy than YAML.
db_size | language | db_format | db_token_count | accuracy |
---|---|---|---|---|
100 | cs | json | 7156 | 0.98 |
100 | cs | yaml | 6539 | 0.98 |
100 | en | json | 6327 | 0.98 |
100 | en | yaml | 5727 | 1.00 |
500 | cs | json | 33964 | 0.63 |
500 | cs | yaml | 30900 | 0.97 |
500 | en | json | 30316 | 0.67 |
500 | en | yaml | 27345 | 0.95 |
1500 | cs | json | 101143 | 0.14 |
1500 | cs | yaml | 91975 | 0.71 |
1500 | en | json | 90357 | 0.20 |
1500 | en | yaml | 81437 | 0.66 |
However, it’s interesting that the Czech language shows performance very similar to English, even though it generates a significantly longer context. A possible explanation is that the additional Czech tokens still carry useful information, as opposed to the JSON tokens {}",:, which are mostly just clutter or syntactic noise, as the name of this section suggests.
We can test this hypothesis by randomly removing the noisy tokens {}",: from the JSON database and checking whether the accuracy improves. The function remove_noisy_tokens looks at every occurrence of the most common clutter tokens and removes each one with probability remove_noisy_token_prob. A higher probability gives us a lower db_token_count, because we delete more and more tokens.
import numpy as np

def remove_noisy_tokens(message_content: str, prob: float) -> str:
    # token ids of the most common JSON "clutter" pieces
    noisy_tokens = {
        tokenizer.encode(token)[0]
        for token in [
            ' "',
            '":',
            ' ',
            'uuid',
            'question',
            '",\n',
            ' {\n',
            ' },\n',
            '?"\n',
        ]
    }
    space_token = tokenizer.encode(" ")[0]
    tokens = tokenizer.encode(message_content)
    # replace noisy tokens with a space with some probability
    noisy_tokens_removed = [
        space_token
        if token in noisy_tokens and np.random.rand() < prob
        else token
        for token in tokens
    ]
    # decode and replace any double/triple spaces
    # that could have been created by inserting
    # a space token
    return (
        tokenizer
        .decode(noisy_tokens_removed)
        .replace("    ", " ")
        .replace("   ", " ")
        .replace("  ", " ")
    )
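To see the effect on context length, we can apply the function with increasing probabilities and watch the token count shrink (a sketch; system_message stands for the full JSON system prompt built earlier and is not a variable from the original notebook):
for prob in [0.0, 0.1, 0.3, 0.5, 0.8, 1.0]:
    degraded = remove_noisy_tokens(system_message, prob)
    print(prob, len(tokenizer.encode(degraded)))
# higher probability -> more clutter tokens replaced -> lower db_token_count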
The results in the following table don’t support the hypothesis stated above: the more clutter we remove, the worse accuracy we get. It’s possible that without the seemingly useless tokens, the structure of the database is compromised, making it harder for the model to find the correct questions to retrieve because it no longer knows where each question starts and ends.
db_size | language | db_format | remove_noisy_token_prob | db_token_count | accuracy |
---|---|---|---|---|---|
500 | en | json | 0.0 | 30316 | 0.73 |
500 | en | json | 0.1 | 28656 | 0.40 |
500 | en | json | 0.3 | 27433 | 0.17 |
500 | en | json | 0.5 | 26348 | 0.04 |
500 | en | json | 0.8 | 24631 | 0.01 |
500 | en | json | 1.0 | 23158 | 0.02 |
We can also try removing clutter in a way that does not alter the database structure, by using a custom database format that simply separates the questions with a delimiter of our own choice.
def delimited_formatting(
    database: list[dict],
    delimiter: str,
) -> str:
    formatted_questions = [
        f"uuid: {q['uuid']}, question: {q['question']}"
        for q in database
    ]
    return delimiter.join(formatted_questions)
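For illustration, formatting the two example questions from the beginning of the post with the shortest delimiter looks like this:
sample = [
    {"uuid": "5babf985-30ac-48f1-a678-977d5ba39b54",
     "question": "Who won the Oesper Award in 1983?"},
    {"uuid": "3b3242e5-6327-4eca-9351-09772c9b722f",
     "question": "What EP did Machine Girl release in 2016?"},
]
print(delimited_formatting(sample, " |\n"))
# uuid: 5babf985-30ac-48f1-a678-977d5ba39b54, question: Who won the Oesper Award in 1983? |
# uuid: 3b3242e5-6327-4eca-9351-09772c9b722f, question: What EP did Machine Girl release in 2016?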
We will use 5 different delimiters of varying length. The longest consists of 482 characters, and the shortest is just three characters.
delimiters = [
    " |\n",
    " " + "$~*\t@#" + "\n",
    " " + "$~*\t@#" * 20 + "\n",
    " " + "$~*\t@#" * 40 + "\n",
    " " + "$~*\t@#" * 80 + "\n",
]
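To see how much context each delimiter adds, we can format the full database with each one and count the tokens (a sketch; database stands for the list of question records and is not a variable defined in this post):
for delimiter in delimiters:
    formatted = delimited_formatting(database, delimiter)
    print(len(delimiter), len(tokenizer.encode(formatted)))
# longer delimiters inflate the token count without adding any information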
The idea is that by moving from longer to shorter delimiters, we delete uninformative tokens from the context while still preserving the structure of the data. This should improve the retrieval accuracy.
db_size | language | db_format | delimiter_length | total_token_count | accuracy |
---|---|---|---|---|---|
100 | en | delimited | 482 | 112844 | 0.44 |
100 | en | delimited | 242 | 63044 | 0.68 |
100 | en | delimited | 122 | 38144 | 0.86 |
100 | en | delimited | 8 | 14489 | 0.96 |
100 | en | delimited | 3 | 13244 | 0.96 |
We already knew that adding tokens and increasing the context length hurts language model information retrieval performance. What we have learned, though, is that not all tokens are equal: tokens that still carry useful information (such as the extra tokens produced by the Czech questions) hurt retrieval far less than tokens that are pure syntactic noise (such as {}",: in JSON).
Notebook with all the code is available on my GitHub.