Carneiro, Fernando
Vianna, Daniela
Carvalho, Jonnathan https://orcid.org/0000-0003-0983-2308
Plastino, Alexandre
Paes, Aline
Funding for this research was provided by:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (311275/2020-6)
Conselho Nacional de Desenvolvimento Científico e Tecnológico (315750/2021-9)
Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (SEI-260003/000614/2023)
Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (E-26/201.139/2022)
Article History
Received: 30 January 2024
Accepted: 19 September 2024
First Online: 21 December 2024
Declarations
: The authors have no conflict of interest to disclose.
: All datasets considered in this manuscript were gathered from previous work that made them publicly available. Although we did not directly collect any tweets, we are aware that using data collected from the Twitter platform raises ethical questions. Even though Twitter users assume their posts are not private, they are usually not explicitly informed that what they write can be used for scientific (our case) or commercial (not our case) purposes. Besides, they may assume that their tweets are ephemeral when, in fact, they can be collected and stored by anyone, anywhere. We did our best not to include sensitive content in our examples and not to disclose the identity of their authors.
: Given that this work relies heavily on large-scale language models and datasets composed of social media texts, we anticipate, despite the best intentions, possible ethical and social risks from perpetuating social biases and providing false or misleading information. In the case of language models, these risks usually stem from the corpora chosen to pre-train such large models. If you intend to use our pre-trained model or a fine-tuned version in production, please be aware that, while BERTweet.BR, like many other models, is a powerful tool, it comes with limitations. To enable pre-training on large amounts of data, we scraped all the content we could find from Twitter up to the year 2020, taking the best as well as the worst of what was available on this social media platform.