Crossmark

BERTweet.BR: a pre-trained language model for tweets in Portuguese

Crossref DOI link: https://doi.org/10.1007/s00521-024-10711-3

Published Online: 2024-12-21

Published Print: 2025-02

Update policy: https://doi.org/10.1007/springer_crossmark_policy

Authors

Carneiro, Fernando

Vianna, Daniela

Carvalho, Jonnathan https://orcid.org/0000-0003-0983-2308

Plastino, Alexandre

Paes, Aline

Funding

Funding for this research was provided by:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (311275/2020-6)

Conselho Nacional de Desenvolvimento Científico e Tecnológico (315750/2021-9)

Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (SEI-260003/000614/2023)

Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (E-26/201.139/2022)

License Information

Text and Data Mining valid from 2024-12-21

Version of Record valid from 2024-12-21

More Information

Article History

Received: 30 January 2024

Accepted: 19 September 2024

First Online: 21 December 2024

Declarations

The authors have no conflict of interest to disclose.

All datasets considered in this manuscript were gathered from previous work that made them publicly available. Although we have not directly collected any tweets, we are aware that using data collected from the Twitter platform should prompt ethical reflection. Even though Twitter users assume their posts are not private, they are usually not explicitly informed that what they write can be used for scientific (our case) or commercial (not our case) purposes. Besides, they may assume that their tweets are ephemeral when, in fact, they can be collected and stored by anyone, anywhere. We tried our best not to include sensitive content in our examples and not to disclose the identity of their authors.

Given that this work relies heavily on large-scale language models and datasets composed of social media texts, we anticipate, despite the best intentions, possible ethical and social risks from perpetuating social biases and providing false or misleading information. In the case of language models, these risks usually stem from the corpora chosen to pre-train such large models. If you intend to use our pre-trained model or a fine-tuned version in production, please be aware that, while BERTweet.BR, like many other models, is a powerful tool, it comes with limitations. To enable pre-training on large amounts of data, we scraped all the content we could find from Twitter until the year 2020, taking the best as well as the worst of what was available on this social media platform.

Document is current

Any future updates will be listed below