BERTweet: A Pre-trained Language Model for English Tweets
Introduction
BERTweet is the first publicly available large-scale language model pre-trained for English Tweets. It is trained following the RoBERTa pre-training procedure on a corpus of 850 million English Tweets (about 16 billion word tokens, roughly 80GB of data). The corpus consists of Tweets streamed between January 2012 and August 2019 together with 5 million Tweets related to the COVID-19 pandemic. Detailed information and experimental results are documented in the paper "BERTweet: A Pre-trained Language Model for English Tweets," presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).
Main Results
BERTweet achieves strong results on several Tweet NLP tasks, including part-of-speech tagging, named entity recognition, sentiment analysis, and irony detection. Detailed performance figures for these tasks are reported in the BERTweet paper.
Using BERTweet with transformers
Installation
To use BERTweet, first install the transformers library, for example with pip:
pip install transformers
For a faster ("fast") tokenizer, it is also possible to clone the relevant branch of the transformers repository and install it from source (a rough sketch follows the command below). Tokenization additionally requires the tokenizers package:
pip3 install tokenizers
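As a rough sketch, installing transformers from source typically looks as follows; the branch name here is a placeholder, so check the BERTweet documentation for the exact branch to use:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout <branch-with-fast-tokenizer>   # placeholder branch name
pip3 install -e .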
Pre-trained Models
BERTweet offers several pre-trained models, each tailored to different needs:
- vinai/bertweet-base: the base model, pre-trained on the full corpus of everyday English Tweets.
- vinai/bertweet-covid19-base-cased and vinai/bertweet-covid19-base-uncased: base models further pre-trained on COVID-19 related Tweets.
- vinai/bertweet-large: a larger model based on the RoBERTa-large architecture.
Example Usage
BERTweet can be integrated into Python projects through the transformers library. Here is an example to get started:
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained BERTweet-large model and its tokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-large")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

# The input Tweet is already normalized: the user mention and URL have been
# replaced by the special tokens @USER and HTTPURL, and the emoji by a text string
line = "DHEC confirms HTTPURL via @USER :crying_face:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # contextual embeddings of the input Tweet
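With recent transformers versions the model returns an output object; its last-layer hidden states can be accessed as sketched below (the attribute name and shape comment assume the default output format of AutoModel):
# Last-layer contextual embeddings, shape (batch_size, sequence_length, hidden_size)
embeddings = features.last_hidden_state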
Normalize Raw Input Tweets
BERTweet was pre-trained on normalized Tweets: user mentions and URLs are converted into the special tokens @USER and HTTPURL, and emoji are translated into text strings. Applying the same normalization to raw input Tweets before feeding them to BERTweet is recommended; a usage sketch follows the install command below. The normalization step relies on additional dependencies, which can be installed with:
pip3 install nltk emoji==0.6.0
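As a minimal sketch (assuming a transformers version whose BERTweet tokenizer accepts the normalization argument), normalization can be applied automatically by the tokenizer when encoding a raw Tweet; the Tweet text and handle below are purely illustrative:
import torch
from transformers import AutoModel, AutoTokenizer

# normalization=True asks the BERTweet tokenizer to normalize raw Tweets:
# user mentions become @USER, URLs become HTTPURL, emoji become text strings
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# A raw, un-normalized Tweet (handle and emoji are illustrative)
line = "DHEC confirms first two presumptive cases of coronavirus via @SCDHEC 😢"

input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)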
Using BERTweet with fairseq
For integration with the fairseq library, refer to the detailed instructions in the BERTweet documentation online; a rough usage sketch is given below.
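As a rough sketch only, assuming the fairseq checkpoint has been downloaded into a local directory (the directory and file names here are placeholders; the actual download links and layout are given in the BERTweet documentation), a RoBERTa-style checkpoint can typically be loaded as follows:
from fairseq.models.roberta import RobertaModel

# Placeholder path to the downloaded BERTweet fairseq checkpoint directory
bertweet = RobertaModel.from_pretrained("./BERTweet_fairseq", checkpoint_file="model.pt")
bertweet.eval()  # disable dropout for evaluation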
License
BERTweet is released under the MIT License, making it freely available for use, modification, and distribution. This flexible licensing encourages wide application in varied linguistic and computational contexts.