‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for
Crossref DOI link: https://doi.org/10.1007/s00146-024-02113-9
Published Online: 2024-11-27
Update policy: https://doi.org/10.1007/springer_crossmark_policy
Arvan, Marcus https://orcid.org/0000-0001-5683-1055
Text and Data Mining valid from 2024-11-27
Version of Record valid from 2024-11-27
Article History
Received: 15 April 2024
Accepted: 16 October 2024
First Online: 27 November 2024
Declarations
The author has no conflicts of interest to report.