12 Jan 2025
And how likely are they to hallucinate? In this blog post, I set up a Czech-language eval based on SimpleQA to investigate this. You can find it on my GitHub.
I took the problems and answers from OpenAI’s SimpleQA eval and translated them into Czech. The original eval was published along with this paper:
Measuring short-form factuality in large language models
Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, William Fedus
arXiv preprint arXiv:2411.04368, 2024. https://arxiv.org/abs/2411.04368
After experimenting with gpt-4o-mini and gpt-4o, I realized that language models would not be able to translate the questions with sufficient quality. I ended up with the following approach:
The whole process took way longer than I initially expected. I think I spent around 40 hours in total editing and cleaning the data, as there were hundreds of problems that required at least a few minutes of research into the topic to come up with the correct translation. Working on this project definitely helped me appreciate anyone who works as a data annotator or AI trainer.
I want to give an honorable mention to the following gotchas I came across in the data:
And I need to shout out biolib.cz, which has the Czech names for even the most obscure animals and plants I needed to translate. It is even more impressive that it’s mostly a hobby project, as far as I understand.
Czech-SimpleQA works the same as SimpleQA. It asks a language model the questions and then uses another language model to grade the answers. The grading model responds with either CORRECT, INCORRECT, or NOT_ATTEMPTED. The final eval metric is calculated as an $F_1$ score. More information can be found in the original paper.
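As a rough illustration of the metric, the sketch below follows the F-score definition from the SimpleQA paper: the harmonic mean of overall accuracy (correct answers over all questions) and accuracy on attempted questions (correct over correct plus incorrect). The function name `simpleqa_f1` and the `Counter`-based tallying are illustrative assumptions, not code taken from the repository.

```python
from collections import Counter


def simpleqa_f1(grades):
    """Compute the SimpleQA-style F-score from a list of grade strings.

    `grades` holds one of "CORRECT", "INCORRECT", "NOT_ATTEMPTED" per question.
    Hypothetical helper for illustration only.
    """
    counts = Counter(grades)
    correct = counts["CORRECT"]
    incorrect = counts["INCORRECT"]
    total = sum(counts.values())
    attempted = correct + incorrect

    # With no correct answers both accuracies are zero, so the score is zero.
    if correct == 0:
        return 0.0

    overall_correct = correct / total            # correct over all questions
    correct_given_attempted = correct / attempted  # correct over attempted only

    # Harmonic mean of the two accuracies.
    return (
        2 * overall_correct * correct_given_attempted
        / (overall_correct + correct_given_attempted)
    )


# Two correct, one incorrect, one not attempted: 2 * 2 / (4 + 3) = 4/7.
example = ["CORRECT", "CORRECT", "INCORRECT", "NOT_ATTEMPTED"]
print(round(simpleqa_f1(example), 3))  # 0.571
```

Because the harmonic mean simplifies to `2 * correct / (total + attempted)`, a model that declines to answer is penalized less than one that answers incorrectly, but more than one that answers correctly.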
I am using a slightly modified grading template with additional in-context examples specifying how to grade answers containing names. The idea is to accept a translated name only if it has a well-known Czech version. The full template is available here.
predicted_answer | target | grade |
---|---|---|
Jhelum River | řeka Džihlam | CORRECT |
UN General Assembly | Valné shromáždění OSN | CORRECT |
Grey’s Anatomy | Chirurgové | CORRECT |
Pozornost je vše potřebné | Attention Is All You Need | INCORRECT |
When comparing the $F_1$ scores with the original results, we can see a decrease in performance on the Czech-translated questions. The reported SimpleQA scores come from the README.md and the paper.
model | SimpleQA | Czech-SimpleQA |
---|---|---|
gpt-4o-mini-2024-07-18 | 9.5 | 8.1 |
gpt-4o-2024-11-20 | 38.8 | 31.4 |
claude-3-5-sonnet-20240620 | 35.0 | 25.8 |
claude-3-5-sonnet-20241022 | N/A | 31.1 |
claude-3-5-haiku-20241022 | N/A | 9.3 |
Full results are in the table below. Claude models are more likely to give up and not attempt to answer. The same finding was reported in the original paper.
model | correct | incorrect | not_attempted |
---|---|---|---|
gpt-4o-mini-2024-07-18 | 347 | 3913 | 66 |
gpt-4o-2024-11-20 | 1345 | 2904 | 77 |
claude-3-5-sonnet-20240620 | 828 | 1277 | 2221 |
claude-3-5-sonnet-20241022 | 1225 | 2329 | 772 |
claude-3-5-haiku-20241022 | 338 | 2575 | 1413 |
When looking at the accuracy $P(\text{correct}|\text{attempted})$, we again see worse performance on Czech-SimpleQA. This suggests the models are more likely to hallucinate when attempting to answer a question in Czech.
model | SimpleQA | Czech-SimpleQA |
---|---|---|
gpt-4o-mini | 8.7 | 8.1 |
gpt-4o | 38.0 | 31.7 |
claude-3-5-sonnet-20240620 | 44.5 | 39.3 |
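The Czech-SimpleQA accuracies can be recomputed from the counts in the full results table. A minimal sketch, with the counts copied from that table; `RESULTS` and `accuracy_given_attempted` are illustrative names, not part of the repository:

```python
# Counts copied from the full results table:
# model -> (correct, incorrect, not_attempted)
RESULTS = {
    "gpt-4o-mini-2024-07-18": (347, 3913, 66),
    "gpt-4o-2024-11-20": (1345, 2904, 77),
    "claude-3-5-sonnet-20240620": (828, 1277, 2221),
}


def accuracy_given_attempted(correct, incorrect):
    """Return P(correct | attempted), ignoring NOT_ATTEMPTED answers."""
    attempted = correct + incorrect
    return correct / attempted if attempted else 0.0


for model, (correct, incorrect, _not_attempted) in RESULTS.items():
    accuracy = 100 * accuracy_given_attempted(correct, incorrect)
    print(f"{model}: {accuracy:.1f}")
# gpt-4o-mini-2024-07-18: 8.1
# gpt-4o-2024-11-20: 31.7
# claude-3-5-sonnet-20240620: 39.3
```

Since unanswered questions are excluded from the denominator, a model that abstains often, such as claude-3-5-sonnet-20240620, can score much higher on this measure than on the $F_1$ score.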
problem | czech_problem |
---|---|
What was the original name of the town of Dolores, Abra, Philippines? | Jaké bylo původní jméno města Dolores v provincii Abra na Filipínách? |
What did Alison Garrs overdose on in the last episode of Season 2 of Happy Valley? | Čím se Alison Garrs předávkovala v poslední epizodě 2. série Happy Valley? |
Which commune in France is The Pont Serme located in? | Ve které francouzské obci se nachází Pont Serme? |
problem | czech_problem |
---|---|
What month and date did Tina Turner, the singer, die? | Který den, měsíc a rok zemřela zpěvačka Tina Turner? |
What cabinet position did Sir Hector-Louis Langevin hold in 1870? | Jakou vládní pozici zastával Sir Hector-Louis Langevin v roce 1870? |
What was the population count in the 2011 census of the Republic of Nauru? | Jaký byl počet obyvatel při sčítání lidu v roce 2011 v Republice Nauru? |
problem | czech_problem |
---|---|
How many kilobytes of memory did the IBM 7030 Stretch have? | Kolik kilobytů paměti měl IBM 7030 Stretch? |
What was the name of Dolly the sheep’s very first lamb? | Jak se jmenovalo vůbec první jehně ovce Dolly? |
Who was the Friday night headliner of ArcTanGent 2013 on the Arc stage? | Kdo byl headliner na Arc stage během ArcTanGent 2013 v pátek večer? |
Answers with their grades from all the evaluated models can be found in the GitHub repository.