04 May 2025
It is well known that language model performance on tasks such as information retrieval decreases as context length increases. In this post, we are going to set up a simple experiment to study this behavior in more detail.
We will provide a language model with a system message containing a list of many different questions as context. Each question will have an associated unique UUID. Then, we will repeat 10 questions from the list in our user message and ask the model to retrieve the correct UUID for each one. The accuracy of the response (the share of correct UUIDs) will measure the model’s performance on the task. We will use the gpt-4o-mini language model for this test and repeat it 10 times to get 100 retrievals in total.
You are a helpful assistant. You are given a database of many different questions
each associated with a UUID.
Users will provide a list of 10 questions and your goal is to retrieve the correct UUID
from the database for each of the questions.
The database is included in this system prompt. Return just comma separated
list of 10 UUIDs and nothing else.
The database:
[
{
"uuid": "5babf985-30ac-48f1-a678-977d5ba39b54",
"question": "Who won the Oesper Award in 1983?"
},
{
"uuid": "3b3242e5-6327-4eca-9351-09772c9b722f",
"question": "What EP did Machine Girl release in 2016?"
},
// many more questions...
]
We will run multiple iterations of the experiment where we modify: database size (from 100 to 1500 questions), language (English questions as well as the same questions translated to Czech), and serialization format (JSON and YAML).
The main expectation is that the retrieval accuracy will be worse for the large databases (long context) than for the smaller ones (short context).
The questions used in the experiment come from my Czech-SimpleQA eval project, which I have written about in a previous post.
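Before getting to the results, here is a rough sketch of what a single iteration looks like. This is only an outline under my own assumptions: the helper run_iteration, the sampling, and the scoring are illustrative, not the exact notebook code.
import random
from openai import OpenAI

client = OpenAI()

def run_iteration(database: list[dict], system_prompt: str, n_asked: int = 10) -> float:
    # illustrative sketch: sample 10 questions and ask the model for their UUIDs
    asked = random.sample(database, n_asked)
    user_message = "\n".join(q["question"] for q in asked)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    # the prompt asks for a comma-separated list of 10 UUIDs
    answered = [u.strip() for u in response.choices[0].message.content.split(",")]
    expected = [q["uuid"] for q in asked]
    # accuracy = share of correctly retrieved UUIDs
    return sum(a == e for a, e in zip(answered, expected)) / n_asked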
The length of the context that a language model is working with is measured in number of tokens. The issue is that the same information can be encoded into countless different strings of characters, each with a different token count.
To visualize this, we can use basic HTML and OpenAI’s o200k_base tokenizer from their tiktoken library. This tokenizer is used in gpt-4o-mini.
import tiktoken
from IPython.display import HTML, display

tokenizer = tiktoken.get_encoding("o200k_base")

def html_tokens(text: str) -> str:
    opacity = 0.3
    colors = [
        f"rgba(164, 12, 43, {opacity})",
        f"rgba(35, 93, 156, {opacity})",
        f"rgba(12, 164, 133, {opacity})",
    ]

    def get_color(i: int) -> str:
        # cycle through the three background colors
        j = i % len(colors)
        color = colors[j]
        return f'style="background-color:{color}"'

    tokens = tokenizer.encode(text)
    # wrap each decoded token in a colored <span> so token boundaries are visible
    html_elements = [
        f'<span {get_color(i)}>{tokenizer.decode([token])}</span>'
        for i, token in enumerate(tokens)
    ]
    return "<pre>" + "".join(html_elements) + "</pre>"

display(HTML(html_tokens("this is tokenized hello world!")))
We can use this function to show the difference in tokenization of JSON and YAML data. The same two questions in the example below produce 41 and 27 tokens when formatted as JSON and YAML, respectively. That is a 34% difference in context length!
The explanation is simple: JSON contains characters such as {}",: and lots of whitespace indentation, which consume additional tokens compared to YAML.
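We can verify the gap by serializing the same two questions both ways and counting tokens. A minimal sketch, assuming json.dumps with an indent of 4 and yaml.safe_dump, so the exact counts may differ slightly from the 41 and 27 quoted above:
import json
import yaml

pair = [
    {"uuid": "5babf985-30ac-48f1-a678-977d5ba39b54",
     "question": "Who won the Oesper Award in 1983?"},
    {"uuid": "3b3242e5-6327-4eca-9351-09772c9b722f",
     "question": "What EP did Machine Girl release in 2016?"},
]

as_json = json.dumps(pair, indent=4)
as_yaml = yaml.safe_dump(pair, sort_keys=False, allow_unicode=True)

# JSON spends tokens on braces, quotes and indentation; YAML mostly does not
print(len(tokenizer.encode(as_json)), len(tokenizer.encode(as_yaml)))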
We can try the same thing again, this time comparing identical questions written in English and Czech (both formatted as YAML). English gives us 27 tokens, while Czech produces 42 tokens.
The reason for this difference is that the tokenizer was trained on mostly English content from the Internet. It uses the byte-pair encoding algorithm that merges frequently occurring character sequences into single tokens. There is an implementation of the algorithm on my GitHub, if you are interested in more details.
The algorithm tokenizes the English word original into a single token because it has seen the word many times during the training. On the other hand, the Czech word původní is encoded into three separate tokens, as it was not very common in the training data.
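We can see this directly with the tokenizer defined above:
for word in ["original", "původní"]:
    pieces = [tokenizer.decode([t]) for t in tokenizer.encode(word)]
    print(word, len(pieces), pieces)
# "original" comes back as a single token,
# while "původní" is split into three smaller pieces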
If we take a sample of Czech and English questions from Czech-SimpleQA, we can see that even though Czech questions have on average lower word count, they produce higher token counts than their English counterparts.
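The comparison itself is a few lines of code (a sketch; english_questions and czech_questions stand for parallel lists of question strings sampled from Czech-SimpleQA and are not variables defined in this post):
def avg_tokens(questions: list[str]) -> float:
    # average number of o200k_base tokens per question
    return sum(len(tokenizer.encode(q)) for q in questions) / len(questions)

print(avg_tokens(english_questions), avg_tokens(czech_questions))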
The overall experiment results are summarized in the table below. As expected, retrieval accuracy decreases as the database size increases. It is also not surprising that the additional JSON tokens produce a longer context (column db_token_count), resulting in worse accuracy than YAML.
db_size | language | db_format | db_token_count | accuracy |
---|---|---|---|---|
100 | cs | json | 7156 | 0.98 |
100 | cs | yaml | 6539 | 0.98 |
100 | en | json | 6327 | 0.98 |
100 | en | yaml | 5727 | 1.00 |
500 | cs | json | 33964 | 0.63 |
500 | cs | yaml | 30900 | 0.97 |
500 | en | json | 30316 | 0.67 |
500 | en | yaml | 27345 | 0.95 |
1500 | cs | json | 101143 | 0.14 |
1500 | cs | yaml | 91975 | 0.71 |
1500 | en | json | 90357 | 0.20 |
1500 | en | yaml | 81437 | 0.66 |
However, it’s interesting that the Czech language shows performance very similar to English, even though it generates a significantly longer context. A possible explanation is that the additional Czech tokens still carry useful information, as opposed to the JSON tokens {}",:, which are mostly just clutter or syntactic noise, as the name of this section suggests.
We can test this hypothesis by randomly removing the noisy tokens {}",: from the JSON database and checking whether the accuracy improves. The function remove_noisy_tokens looks at every occurrence of the most common clutter tokens and removes each one with probability remove_noisy_token_prob. A higher probability gives us a lower db_token_count, because we delete more and more tokens.
import numpy as np

def remove_noisy_tokens(message_content: str, prob: float) -> str:
    # token ids of the most common JSON "clutter" pieces
    noisy_tokens = {
        tokenizer.encode(token)[0]
        for token in [
            ' "',
            '":',
            ' ',
            'uuid',
            'question',
            '",\n',
            ' {\n',
            ' },\n',
            '?"\n',
        ]
    }
    space_token = tokenizer.encode(" ")[0]
    tokens = tokenizer.encode(message_content)
    # replace noisy tokens with a space with some probability
    noisy_tokens_removed = [
        space_token
        if token in noisy_tokens and np.random.rand() < prob
        else token
        for token in tokens
    ]
    # decode and replace any double/triple spaces
    # that could have been created by inserting
    # a space token
    return (
        tokenizer
        .decode(noisy_tokens_removed)
        .replace("    ", " ")
        .replace("   ", " ")
        .replace("  ", " ")
    )
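To see the effect on context length, we can apply the function with increasing probabilities and watch the token count shrink (a sketch; system_message stands for the full JSON system prompt built earlier and is not a variable from the original notebook):
for prob in [0.0, 0.1, 0.3, 0.5, 0.8, 1.0]:
    degraded = remove_noisy_tokens(system_message, prob)
    print(prob, len(tokenizer.encode(degraded)))
# higher probability -> more clutter tokens replaced -> lower db_token_count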
The results in the following table don’t support the hypothesis stated above: the more clutter we remove, the worse accuracy we get. It’s possible that without the seemingly useless tokens, the structure of the database is compromised, making it harder for the model to find the correct questions to retrieve because it no longer knows where each question starts and ends.
db_size | language | db_format | remove_noisy_token_prob | db_token_count | accuracy |
---|---|---|---|---|---|
500 | en | json | 0.0 | 30316 | 0.73 |
500 | en | json | 0.1 | 28656 | 0.40 |
500 | en | json | 0.3 | 27433 | 0.17 |
500 | en | json | 0.5 | 26348 | 0.04 |
500 | en | json | 0.8 | 24631 | 0.01 |
500 | en | json | 1.0 | 23158 | 0.02 |
We can also try removing clutter in a way that does not alter the database structure, by using a custom database format that simply separates the questions with a delimiter of our own choice.
def delimited_formatting(
    database: list[dict],
    delimiter: str,
) -> str:
    formatted_questions = [
        f"uuid: {q['uuid']}, question: {q['question']}"
        for q in database
    ]
    return delimiter.join(formatted_questions)
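For illustration, formatting the two example questions from the beginning of the post with the shortest delimiter looks like this:
sample = [
    {"uuid": "5babf985-30ac-48f1-a678-977d5ba39b54",
     "question": "Who won the Oesper Award in 1983?"},
    {"uuid": "3b3242e5-6327-4eca-9351-09772c9b722f",
     "question": "What EP did Machine Girl release in 2016?"},
]
print(delimited_formatting(sample, " |\n"))
# uuid: 5babf985-30ac-48f1-a678-977d5ba39b54, question: Who won the Oesper Award in 1983? |
# uuid: 3b3242e5-6327-4eca-9351-09772c9b722f, question: What EP did Machine Girl release in 2016?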
We will use 5 different delimiters of varying length. The longest consists of 482 characters, and the shortest is just three characters.
delimiters = [
    " |\n",
    " " + "$~*\t@#" + "\n",
    " " + "$~*\t@#" * 20 + "\n",
    " " + "$~*\t@#" * 40 + "\n",
    " " + "$~*\t@#" * 80 + "\n",
]
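To see how much context each delimiter adds, we can format the full database with each one and count the tokens (a sketch; database stands for the list of question records and is not a variable defined in this post):
for delimiter in delimiters:
    formatted = delimited_formatting(database, delimiter)
    print(len(delimiter), len(tokenizer.encode(formatted)))
# longer delimiters inflate the token count without adding any information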
The idea is that by moving from longer to shorter delimiters, we delete uninformative tokens from the context while still preserving the structure of the data. This should improve the retrieval accuracy.
db_size | language | db_format | delimiter_length | total_token_count | accuracy |
---|---|---|---|---|---|
100 | en | delimited | 482 | 112844 | 0.44 |
100 | en | delimited | 242 | 63044 | 0.68 |
100 | en | delimited | 122 | 38144 | 0.86 |
100 | en | delimited | 8 | 14489 | 0.96 |
100 | en | delimited | 3 | 13244 | 0.96 |
We already knew that adding tokens and increasing the context length hurts language model information retrieval performance. What we have learned, though, is that not all tokens are equal: tokens that still carry useful information (such as the extra tokens produced by the Czech questions) hurt retrieval far less than tokens that are pure syntactic noise (such as {}",: in JSON).
Notebook with all the code is available on my GitHub.