BERTweet: A Pre-trained Language Model for English Tweets
Introduction
BERTweet is the first publicly available large-scale language model pre-trained for English Tweets. It is trained following the RoBERTa pre-training procedure on a corpus of 850 million English Tweets (about 16 billion word tokens, roughly 80GB of data). The corpus consists of Tweets streamed between January 2012 and August 2019 together with 5 million Tweets related to the COVID-19 pandemic. Detailed information and experimental results are documented in the paper "BERTweet: A Pre-trained Language Model for English Tweets," presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).
Main Results
BERTweet achieves strong results on several Tweet NLP tasks, including part-of-speech tagging, named entity recognition, sentiment analysis, and irony detection. Detailed performance figures for these tasks are reported in the BERTweet paper.
Using BERTweet with transformers
Installation
To use BERTweet, first install the transformers library, for example with pip:
pip install transformers
For a faster ("fast") tokenizer, it is also possible to clone the relevant branch of the transformers repository and install it from source (a rough sketch follows the command below). Tokenization additionally requires the tokenizers package:
pip3 install tokenizers
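As a rough sketch, installing transformers from source typically looks as follows; the branch name here is a placeholder, so check the BERTweet documentation for the exact branch to use:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout <branch-with-fast-tokenizer>   # placeholder branch name
pip3 install -e .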
Pre-trained Models
BERTweet offers several pre-trained models, each tailored to different needs:
- vinai/bertweet-base: the base model, pre-trained on the full corpus of everyday English Tweets.
- vinai/bertweet-covid19-base-cased and vinai/bertweet-covid19-base-uncased: base models further pre-trained on COVID-19 related Tweets.
- vinai/bertweet-large: a larger model based on the RoBERTa-large architecture.
Example Usage
BERTweet can be integrated into Python projects through the transformers library. Here is an example to get started:
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained BERTweet-large model and its tokenizer
bertweet = AutoModel.from_pretrained("vinai/bertweet-large")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

# The input Tweet is already normalized: the user mention and URL have been
# replaced by the special tokens @USER and HTTPURL, and the emoji by a text string
line = "DHEC confirms HTTPURL via @USER :crying_face:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # contextual embeddings of the input Tweet
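With recent transformers versions the model returns an output object; its last-layer hidden states can be accessed as sketched below (the attribute name and shape comment assume the default output format of AutoModel):
# Last-layer contextual embeddings, shape (batch_size, sequence_length, hidden_size)
embeddings = features.last_hidden_state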
Normalize Raw Input Tweets
BERTweet was pre-trained on normalized Tweets: user mentions and URLs are converted into the special tokens @USER and HTTPURL, and emoji are translated into text strings. Applying the same normalization to raw input Tweets before feeding them to BERTweet is recommended; a usage sketch follows the install command below. The normalization step relies on additional dependencies, which can be installed with:
pip3 install nltk emoji==0.6.0
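As a minimal sketch (assuming a transformers version whose BERTweet tokenizer accepts the normalization argument), normalization can be applied automatically by the tokenizer when encoding a raw Tweet; the Tweet text and handle below are purely illustrative:
import torch
from transformers import AutoModel, AutoTokenizer

# normalization=True asks the BERTweet tokenizer to normalize raw Tweets:
# user mentions become @USER, URLs become HTTPURL, emoji become text strings
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

# A raw, un-normalized Tweet (handle and emoji are illustrative)
line = "DHEC confirms first two presumptive cases of coronavirus via @SCDHEC 😢"

input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = bertweet(input_ids)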
Using BERTweet with fairseq
For integration with the fairseq library, refer to the detailed instructions in the BERTweet documentation online; a rough usage sketch is given below.
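As a rough sketch only, assuming the fairseq checkpoint has been downloaded into a local directory (the directory and file names here are placeholders; the actual download links and layout are given in the BERTweet documentation), a RoBERTa-style checkpoint can typically be loaded as follows:
from fairseq.models.roberta import RobertaModel

# Placeholder path to the downloaded BERTweet fairseq checkpoint directory
bertweet = RobertaModel.from_pretrained("./BERTweet_fairseq", checkpoint_file="model.pt")
bertweet.eval()  # disable dropout for evaluation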
License
BERTweet is released under the MIT License, making it freely available for use, modification, and distribution. This flexible licensing encourages wide application in varied linguistic and computational contexts.