A framework for mitigating malicious RLHF feedback in LLM training using consensus based reward
Crossref DOI link: https://doi.org/10.1038/s41598-025-92889-7
Published Online: 2025-03-17
Update policy: https://doi.org/10.1007/springer_crossmark_policy
Haider, Zafaryab
Rahman, Md Hafizur
Devabhaktuni, Vijay
Moeykens, Shane
Chakraborty, Prabuddha
Text and Data Mining valid from 2025-03-17
Version of Record valid from 2025-03-17
Article History
Received: 29 December 2024
Accepted: 3 March 2025
First Online: 17 March 2025
Declarations
Competing interests: The authors declare no competing interests.