BETO: Spanish BERT
BETO is a Spanish language model built on the BERT architecture. It was trained on a large Spanish text corpus so that it captures the specifics of the language, is similar in size to BERT-Base, and was pretrained with the Whole Word Masking technique, in which all subword pieces of a word are masked together rather than independently.
Model Availability
BETO comes in two versions, uncased and cased, both accessible via the HuggingFace Model Hub:
- `dccuchile/bert-base-spanish-wwm-uncased`
- `dccuchile/bert-base-spanish-wwm-cased`
Both models use a vocabulary of approximately 31,000 subwords built with the SentencePiece tool, and each was trained for 2 million steps.
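As a quick sketch, assuming the `transformers` library and the uncased model ID listed above, the subword vocabulary can be inspected directly through the tokenizer:

```python
from transformers import AutoTokenizer

# The uncased BETO checkpoint on the HuggingFace Hub (listed above).
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

print(tokenizer.vocab_size)  # roughly 31,000 subword entries
# Words outside the vocabulary are segmented into known subword pieces.
print(tokenizer.tokenize("tokenización de subpalabras"))
```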
Performance Benchmarks
BETO has been evaluated on a range of Spanish-language tasks, with results that compare favorably to Multilingual BERT and other models:
- Part-of-Speech Tagging (POS): the cased version scored 98.97%.
- Named Entity Recognition (NER-C): the cased model scored 88.43%.
- MLDoc (document classification): the uncased version scored 96.12%.
- PAWS-X (paraphrase identification): both versions scored around 89%.
- XNLI (natural language inference): the cased version scored 82.01%.
These results highlight BETO's strength on tasks requiring understanding of Spanish.
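Scores like these come from fine-tuning the pretrained model on each task. As a hedged sketch of the setup for a sentence-classification task in the style of MLDoc (the label count of 4 is an assumption drawn from MLDoc's four document categories; the training loop and dataset handling are omitted):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "dccuchile/bert-base-spanish-wwm-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# num_labels=4 assumes MLDoc's four document categories; the classification
# head is randomly initialized and must be fine-tuned before use.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=4)

inputs = tokenizer("Una noticia sobre economía y mercados.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 4); meaningful only after fine-tuning
```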
Using BETO
For those interested in using BETO in projects or research, guidance is available in the Huggingface Transformers documentation, particularly its Quickstart section. BETO models can be loaded directly through the Transformers library, as in the sketch below.
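A minimal sketch, assuming `transformers` with a PyTorch backend and the cased model ID listed above:

```python
from transformers import pipeline

# Load the cased BETO checkpoint for masked-token prediction.
fill_mask = pipeline("fill-mask", model="dccuchile/bert-base-spanish-wwm-cased")

# BETO was pretrained with masked language modeling, so it can fill in [MASK].
for prediction in fill_mask("Madrid es la capital de [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

The same model IDs also work with `AutoModel` and `AutoTokenizer` for feature extraction or fine-tuning.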
An example application using these models can be explored in this Colab notebook.
Acknowledgments and Citation
The development of BETO was supported by several organizations. Appreciation goes to Adereso, the Millennium Institute for Foundational Research on Data, and Google through the TensorFlow Research Cloud initiative.
For academic citations, reference the publication titled "Spanish Pre-Trained BERT Model and Evaluation Data" presented at PML4DC at ICLR 2020:
```bibtex
@inproceedings{CaneteCFP2020,
  title={Spanish Pre-Trained BERT Model and Evaluation Data},
  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
  booktitle={PML4DC at ICLR 2020},
  year={2020}
}
```
License Information
BETO's creators indicate that the CC BY 4.0 license best describes their work, though some of the datasets used for training may carry different licensing terms. Users should verify dataset licenses for compatibility with their intended use.
BETO provides a strong pretrained foundation for Spanish language processing, useful to researchers and developers working in Spanish and multilingual contexts.