How Well Can Language Models Answer Questions in Czech?

12 Jan 2025

And how likely are they to hallucinate? In this blog post, I am setting up a Czech language eval based on SimpleQA to investigate this. You can find it on my GitHub.

OpenAI SimpleQA

I took the problems and answers from OpenAI’s SimpleQA eval and translated them into Czech language. The original eval I am using was published along with this paper:

Measuring short-form factuality in large language models
Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus
arXiv preprint arXiv:2411.04368, 2024. https://arxiv.org/abs/2411.04368

Data Preparation

After experimenting with gpt-4o-mini and gpt-4o, I realized that language models will not be able to translate the questions with sufficient quality. I ended up with the following approach:

The whole process took way longer than I initially expected. I think I spent around 40 hours in total editing and cleaning the data as there were hundreds of problems that required doing at least few minutes of research into the topic to come up with the correct translation. Working on this project definitely helped me appreciate anyone who works as a data annotator or AI trainer.

I want to give an honorable mention to the following gotchas I came across in the data:

And I need to shout out the page biolib.cz that have the Czech names for even the most obscure animals and plants I needed to translate. Even more impressive that it’s mostly a hobby project, as far as I understand.

How the Eval Works

Czech-SimpleQA works the same as SimpleQA. It asks a language model the questions and then uses another language model to grade the answers. The grading model responds either with CORRECT, INCORRECT, or NOT_ATTEMPTED. The final eval metric is calculated as $F_1$ score. More information can be found in the original paper.

I am using a slightly modified grading template with additional in-context examples specifying how to grade answers containing names. The idea is to accept a translated name only if it has a well-known Czech version. The full template is available here.

predicted_answer target grade
Jhelum River řeka Džihlam CORRECT
UN General Assembly Valné shromáždění OSN CORRECT
Grey’s Anatomy Chirurgové CORRECT
Pozornost je vše potřebné Attention Is All You Need INCORRECT

Results

When comparing the $F_1$ scores with the original results, we can see a decrease in performance on the Czech-translated questions. The reported SimpleQA scores come from the README.md and the paper.

model SimpleQA Czech-SimpleQA
gpt-4o-mini-2024-07-18 9.5 8.1
gpt-4o-2024-11-20 38.8 31.4
claude-3-5-sonnet-20240620 35.0 25.8
claude-3-5-sonnet-20241022 N/A 31.1
claude-3-5-haiku-20241022 N/A 9.3

Full results are in the table below. Claude models are more likely to give up and not attempt to answer. The same finding was reported in the original paper.

model correct incorrect not_attempted
gpt-4o-mini-2024-07-18 347 3913 66
gpt-4o-2024-11-20 1345 2904 77
claude-3-5-sonnet-20240620 828 1277 2221
claude-3-5-sonnet-20241022 1225 2329 772
claude-3-5-haiku-20241022 338 2575 1413

When looking at the accuracy $P(\text{correct}|\text{attempted})$, we again see worse performance on Czech-SimpleQA. This suggests the models are more likely to hallucinate when attempting to answer a question in Czech.

model SimpleQA Czech-SimpleQA
gpt-4o-mini 8.7 8.1
gpt-4o 38.0 31.7
claude-3-5-sonnet-20240620 44.5 39.3

Questions All the Models Got Wrong

problem czech_problem
What was the original name of the town of Dolores, Abra, Philippines? Jaké bylo původní jméno města Dolores v provincii Abra na Filipínách?
What did Alison Garrs overdose on in the last episode of Season 2 of Happy Valley? Čím se Alison Garrs předávkovala v poslední epizodě 2. série Happy Valley?
Which commune in France is The Pont Serme located in? Ve které francouzské obci se nachází Pont Serme?

Questions All the Models Got Right

problem czech_problem
What month and date did Tina Turner, the singer, die? Který den, měsíc a rok zemřela zpěvačka Tina Turner?
What cabinet position did Sir Hector-Louis Langevin hold in 1870? Jakou vládní pozici zastával Sir Hector-Louis Langevin v roce 1870?
What was the population count in the 2011 census of the Republic of Nauru? Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru?

Questions gpt-4o Got Right in English and Wrong in Czech

problem czech_problem
How many kilobytes of memory did the IBM 7030 Stretch have? Kolik kilobytů paměti měl IBM 7030 Stretch?
Jak se jmenovalo vůbec první jehně ovce Dolly? What was the name of Dolly the sheep’s very first lamb?
Kdo byl headliner na Arc stage během ArcTanGent 2013 v pátek večer? Who was the Friday night headliner of ArcTanGent 2013 on the Arc stage?

Answers with their grades from all the evaluated models can be found in the GitHub repository.