Funding for this research was provided by:
MobileTeleSystems (MTS-Skoltech laboratory on AI)
Accepted: 13 July 2023
First Online: 21 October 2023
This work has not been submitted to any other journal or conference before. Part of this work has already been published in the Workshop on Balto-Slavic Natural Language Processing (Babakov et al., CitationRef removed). The current work contains new results, descriptions of new methods, and new extended datasets. The differences from the previous work are listed in Sect. InternalRef removed. This manuscript describes the complete work on collecting datasets of inappropriate messages and sensitive topics and on using them to train classification models. The results presented in the manuscript can be replicated using the provided datasets, the code for training the models, and the pre-trained models, all of which are made available.
The data collection process involved human participants. These were workers hired via the Toloka crowdsourcing platform. They were informed, and accepted, that any data they produce can be used by employers in the public or private domain. The projects created within Toloka for data annotation fully comply with the rules of this service. Our research was motivated by the need to moderate content automatically generated by neural models, such as the GPT family of decoder-based Transformers, in order to avoid reputational risks (including PR risks) for companies deploying such models.