How Well Can Language Models Answer Questions in Czech?

12 Jan 2025

And how likely are they to hallucinate? In this blog post, I am setting up a Czech language eval based on SimpleQA to investigate this. You can find it on my GitHub.

OpenAI SimpleQA

I took the problems and answers from OpenAI’s SimpleQA eval and translated them into Czech. The original eval was published along with this paper:

Measuring short-form factuality in large language models
Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus
arXiv preprint arXiv:2411.04368, 2024. https://arxiv.org/abs/2411.04368

Data Preparation

After experimenting with gpt-4o-mini and gpt-4o, I realized that language models alone could not translate the questions with sufficient quality. I ended up with the following approach:

Generate two Czech translations of the eval with gpt-4o, each using a slightly different prompt. You can see the details in this notebook.
Go over every single problem and answer, pick the better translation, and manually edit it to fix any errors.

The whole process took way longer than I initially expected. I think I spent around 40 hours in total editing and cleaning the data, as there were hundreds of problems that required at least a few minutes of research into the topic to come up with the correct translation. Working on this project definitely helped me appreciate anyone who works as a data annotator or AI trainer.

I want to give an honorable mention to the following gotchas I came across in the data:

Elizabeth of Austria is known in Czechia as Alžběta Bavorská (Bavorská = of Bavaria) and not as Alžběta Rakouská (Rakouská = of Austria).
English uses the term spacecraft to refer to both spaceships (kosmická loď) and artificial satellites (družice), but the Czech terms can’t be interchanged. It’s not possible to correctly translate a question about the spacecraft Foton 8 without researching what kind of spacecraft it is.
The question What day, month, and year was the Retlaw 1 combine car sold to the Carolwood Foundation? seems to be about a combine car named Retlaw 1, but it actually refers to a car that was part of a passenger train called Retlaw 1.
Maybe my favorite problem from the eval is: What actor appears in the playbill of Norman Rockwell’s illustration “Family Night Out”? I was very confused about what the question was asking until I looked up the painting and saw that there are people literally holding playbills in it (I also didn’t know what the word playbill meant).

And I need to shout out the site biolib.cz, which has Czech names for even the most obscure animals and plants I needed to translate. It is even more impressive given that it’s mostly a hobby project, as far as I understand.

How the Eval Works

Czech-SimpleQA works the same way as SimpleQA. It asks a language model the questions and then uses another language model to grade the answers. The grading model responds with one of CORRECT, INCORRECT, or NOT_ATTEMPTED. The final eval metric is the $F_1$ score, which SimpleQA defines as the harmonic mean of the overall share of correct answers and the share of correct answers among attempted questions. More information can be found in the original paper.
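The ask-then-grade flow described above can be sketched roughly as follows. This is only an illustration of the control flow, not the code from the repository: `run_eval`, `answer_fn`, and `grade_fn` are hypothetical names, and the two demo functions are stand-ins for real model calls.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

GRADES = ("CORRECT", "INCORRECT", "NOT_ATTEMPTED")


def run_eval(
    problems: Iterable[Tuple[str, str]],
    answer_fn: Callable[[str], str],
    grade_fn: Callable[[str, str, str], str],
) -> Counter:
    """Ask every question, grade every answer, and tally the grades.

    answer_fn and grade_fn are hypothetical wrappers around the evaluated
    model and the grading model.
    """
    counts = Counter({grade: 0 for grade in GRADES})
    for question, target in problems:
        predicted = answer_fn(question)
        grade = grade_fn(question, target, predicted).strip().upper()
        if grade not in GRADES:
            raise ValueError(f"unexpected grade: {grade!r}")
        counts[grade] += 1
    return counts


# Stand-ins for real model calls, used only to exercise the loop.
def demo_answer(question: str) -> str:
    known = {"Jaké je hlavní město Česka?": "Praha"}
    return known.get(question, "Nevím")


def demo_grade(question: str, target: str, predicted: str) -> str:
    if predicted == "Nevím":
        return "NOT_ATTEMPTED"
    return "CORRECT" if predicted == target else "INCORRECT"


demo_counts = run_eval(
    [
        ("Jaké je hlavní město Česka?", "Praha"),
        ("Kdo napsal R.U.R.?", "Karel Čapek"),
    ],
    demo_answer,
    demo_grade,
)
print(dict(demo_counts))
```

Passing the two models in as plain callables keeps the loop independent of any particular API client, so the same tallying code would work for both the OpenAI and the Claude models.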

I am using a slightly modified grading template with additional in-context examples specifying how to grade answers containing names. The idea is to accept a translated name only if it has a well-known Czech version. The full template is available here.

predicted_answer	target	grade
Jhelum River	řeka Džihlam	CORRECT
UN General Assembly	Valné shromáždění OSN	CORRECT
Grey’s Anatomy	Chirurgové	CORRECT
Pozornost je vše potřebné	Attention Is All You Need	INCORRECT

Results

Comparing the $F_1$ scores with the original results, every model that has a reported SimpleQA score performs worse on the Czech-translated questions. The reported SimpleQA scores come from the README.md and the paper.

model	SimpleQA	Czech-SimpleQA
gpt-4o-mini-2024-07-18	9.5	8.1
gpt-4o-2024-11-20	38.8	31.4
claude-3-5-sonnet-20240620	35.0	25.8
claude-3-5-sonnet-20241022	N/A	31.1
claude-3-5-haiku-20241022	N/A	9.3

Full results are in the table below. Claude models are more likely to give up and not attempt to answer. The same finding was reported in the original paper.

model	correct	incorrect	not_attempted
gpt-4o-mini-2024-07-18	347	3913	66
gpt-4o-2024-11-20	1345	2904	77
claude-3-5-sonnet-20240620	828	1277	2221
claude-3-5-sonnet-20241022	1225	2329	772
claude-3-5-haiku-20241022	338	2575	1413
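The scores reported in this post can be recomputed from these counts. Below is a minimal sketch, assuming the SimpleQA definition of the F-score as the harmonic mean of overall correct and correct given attempted. The function name `simpleqa_metrics` is my own for illustration and not necessarily what the repository uses.

```python
def simpleqa_metrics(correct: int, incorrect: int, not_attempted: int) -> dict:
    """Compute SimpleQA-style metrics from grade counts."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    overall_correct = correct / total
    correct_given_attempted = correct / attempted if attempted else 0.0
    denominator = overall_correct + correct_given_attempted
    # Harmonic mean of the two accuracies.
    f_score = (
        2 * overall_correct * correct_given_attempted / denominator
        if denominator
        else 0.0
    )
    return {
        "overall_correct": overall_correct,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# (correct, incorrect, not_attempted) copied from the table above.
counts = {
    "gpt-4o-mini-2024-07-18": (347, 3913, 66),
    "gpt-4o-2024-11-20": (1345, 2904, 77),
    "claude-3-5-sonnet-20240620": (828, 1277, 2221),
    "claude-3-5-sonnet-20241022": (1225, 2329, 772),
    "claude-3-5-haiku-20241022": (338, 2575, 1413),
}

for model, grade_counts in counts.items():
    metrics = simpleqa_metrics(*grade_counts)
    print(
        f"{model}: F={100 * metrics['f_score']:.1f} "
        f"P(correct|attempted)={100 * metrics['correct_given_attempted']:.1f}"
    )
```

Because the F-score combines both accuracies, a model that declines many questions (like claude-3-5-sonnet-20240620) can have a high accuracy on attempted questions and still a lower F-score.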

When looking at the accuracy $P(\text{correct}|\text{attempted})$, we again see worse performance on Czech-SimpleQA. This suggests the models are more likely to hallucinate when attempting to answer a question in Czech.

model	SimpleQA	Czech-SimpleQA
gpt-4o-mini	8.7	8.1
gpt-4o	38.0	31.7
claude-3-5-sonnet-20240620	44.5	39.3

Questions All the Models Got Wrong

problem	czech_problem
What was the original name of the town of Dolores, Abra, Philippines?	Jaké bylo původní jméno města Dolores v provincii Abra na Filipínách?
What did Alison Garrs overdose on in the last episode of Season 2 of Happy Valley?	Čím se Alison Garrs předávkovala v poslední epizodě 2. série Happy Valley?
Which commune in France is The Pont Serme located in?	Ve které francouzské obci se nachází Pont Serme?

Questions All the Models Got Right

problem	czech_problem
What month and date did Tina Turner, the singer, die?	Který den, měsíc a rok zemřela zpěvačka Tina Turner?
What cabinet position did Sir Hector-Louis Langevin hold in 1870?	Jakou vládní pozici zastával Sir Hector-Louis Langevin v roce 1870?
What was the population count in the 2011 census of the Republic of Nauru?	Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru?

Questions gpt-4o Got Right in English and Wrong in Czech

problem	czech_problem
How many kilobytes of memory did the IBM 7030 Stretch have?	Kolik kilobytů paměti měl IBM 7030 Stretch?
What was the name of Dolly the sheep’s very first lamb?	Jak se jmenovalo vůbec první jehně ovce Dolly?
Who was the Friday night headliner of ArcTanGent 2013 on the Arc stage?	Kdo byl headliner na Arc stage během ArcTanGent 2013 v pátek večer?

Answers with their grades from all the evaluated models can be found in the GitHub repository.