Zhou, Lexin https://orcid.org/0000-0003-1161-4270
Pacchiardi, Lorenzo https://orcid.org/0000-0003-4760-7638
Martínez-Plumed, Fernando https://orcid.org/0000-0003-2902-6477
Collins, Katherine M. https://orcid.org/0000-0002-7032-716X
Moros-Daval, Yael https://orcid.org/0000-0001-5442-2055
Zhang, Seraphina https://orcid.org/0009-0009-3587-385X
Zhao, Qinlin
Huang, Yitian
Sun, Luning https://orcid.org/0000-0002-2470-4278
Prunty, Jonathan E. https://orcid.org/0000-0002-9180-1932
Li, Zongqian
Sánchez-García, Pablo
Jiang-Chen, Kexin https://orcid.org/0009-0005-2492-6531
Casares, Pablo A. M. https://orcid.org/0000-0001-5500-9115
Zu, Jiyun
Burden, John
Mehrbakhsh, Behzad https://orcid.org/0000-0001-9017-989X
Stillwell, David
Cebrian, Manuel https://orcid.org/0000-0002-3681-7982
Wang, Jindong
Henderson, Peter
Wu, Sherry Tongshuang
Kyllonen, Patrick C.
Cheke, Lucy
Xie, Xing https://orcid.org/0009-0009-3257-3077
Hernández-Orallo, José https://orcid.org/0000-0001-9746-7632
Article History
Received: 29 March 2025
Accepted: 19 February 2026
First Online: 1 April 2026
Competing interests
We received support and free tokens from some of the providers of the LLMs evaluated in this paper or some of their direct competitors, namely OpenAI, Microsoft Research and Google. OpenAI, Microsoft Corporation and Google Inc. had no role in the ideas and research questions, study design, data collection and analysis, decision to publish or preparation of the manuscript.