Introduction to Attention Is All You Need: A PyTorch Implementation
This project is a PyTorch implementation of the Transformer model described in the paper "Attention is All You Need" by Ashish Vaswani et al. The Transformer dispenses with recurrent and convolutional operations entirely, relying instead on self-attention, and set new benchmarks in machine translation, demonstrated on the WMT 2014 English-to-German task.
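For readers new to the mechanism, the core building block is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The following is a minimal illustrative sketch of that formula in PyTorch, not code taken from this repository:
# Minimal sketch of scaled dot-product attention (illustrative only, not this repo's implementation).
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    # Similarity scores between every query and key position, scaled by sqrt(d_k).
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn = F.softmax(scores, dim=-1)
    # Weighted sum of the values, returned together with the attention weights for inspection.
    return torch.matmul(attn, v), attn
# Example: a batch of 2 sequences, 5 positions, d_k = 64.
q = k = v = torch.randn(2, 5, 64)
output, attn = scaled_dot_product_attention(q, k, v)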
The original TensorFlow implementation can be found in the tensorflow/tensor2tensor repository. For more background on self-attention, see the paper "A Structured Self-attentive Sentence Embedding".
The project is still a work in progress; some components, such as BPE, are still being tested. Feedback and bug reports are welcome on the project's issue tracker.
Usage Instructions
WMT'16 Multimodal Translation: de-en
Below is a step-by-step guide on how to run a training example for the WMT'16 Multimodal Translation task:
Step 0: Install spaCy Language Models
Begin by downloading the required language models with spaCy, an NLP library, using the following commands:
# Install spacy using conda
# conda install -c conda-forge spacy
python -m spacy download en
python -m spacy download de
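To check that the downloads succeeded, the models can be loaded directly in Python (assuming a spaCy version that still supports the en/de shortcut names used by the commands above):
# Quick sanity check that the language models are installed.
# Assumes the 'en'/'de' shortcut names created by the download commands above.
import spacy
nlp_en = spacy.load('en')
nlp_de = spacy.load('de')
print([token.text for token in nlp_en('A quick tokenization test.')])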
Step 1: Data Preprocessing
Run the preprocessing script to prepare the dataset with spaCy and torchtext:
python preprocess.py -lang_src de -lang_trg en -share_vocab -save_data m30k_deen_shr.pkl
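The layout of the resulting pickle is defined entirely by preprocess.py, but a generic inspection in the same environment (so that any torchtext objects can be unpickled) is an easy sanity check:
# Generic sanity check of the preprocessed file; its structure is whatever preprocess.py saved.
import pickle
with open('m30k_deen_shr.pkl', 'rb') as f:
    data = pickle.load(f)
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))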
Step 2: Model Training
With the data preprocessed, train the model using the command below:
python train.py -data_pkl m30k_deen_shr.pkl -log m30k_deen_shr -embs_share_weight -proj_share_weight -label_smoothing -output_dir output -b 256 -warmup 128000 -epoch 400
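The -warmup flag controls the learning-rate schedule from the paper: the rate increases linearly over the first warmup steps and then decays with the inverse square root of the step count, scaled by the lr_mul factor listed in the training parameters below. A small sketch of the formula (function and argument names here are illustrative, not the repository's):
# Learning-rate schedule from "Attention is All You Need" (illustrative sketch).
# lr = lr_mul * d_model^-0.5 * min(step^-0.5, step * n_warmup_steps^-1.5)
def transformer_lr(step, d_model=512, n_warmup_steps=128000, lr_mul=0.5):
    return lr_mul * (d_model ** -0.5) * min(step ** -0.5, step * n_warmup_steps ** -1.5)
print(transformer_lr(1))       # very small rate at the first step
print(transformer_lr(128000))  # peak rate, reached at the end of warmup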
Step 3: Model Testing
Finally, test the trained model using the provided script:
python translate.py -data_pkl m30k_deen_shr.pkl -model trained.chkpt -output prediction.txt
WMT'17 Multimodal Translation: de-en with BPE (Work In Progress)
For those wishing to experiment with Byte Pair Encoding (BPE), the instructions differ slightly and are still under development.
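As background, BPE builds a subword vocabulary by repeatedly merging the most frequent pair of adjacent symbols in the training corpus. A toy sketch of how one merge candidate is found (the actual implementation used by this project is subword-nmt):
# Toy sketch of choosing a BPE merge (the project itself relies on subword-nmt for this).
from collections import Counter
# Words as sequences of symbols, with corpus frequencies; '</w>' marks a word boundary.
vocab = {('l', 'o', 'w', '</w>'): 5, ('l', 'o', 'w', 'e', 'r', '</w>'): 2}
pairs = Counter()
for word, freq in vocab.items():
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq
best = max(pairs, key=pairs.get)
print(best)  # one of the most frequent pairs, e.g. ('l', 'o'), which BPE would merge into 'lo'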
Step 1: Data Download and Preprocessing
Start by switching the main function call in preprocess.py from main_wo_bpe to main, then run:
python preprocess.py -raw_dir /tmp/raw_deen -data_dir ./bpe_deen -save_data bpe_vocab.pkl -codes codes.txt -prefix deen
Step 2: Model Training
Proceed with training while noting the different dataset paths:
python train.py -data_pkl ./bpe_deen/bpe_vocab.pkl -train_path ./bpe_deen/deen-train -val_path ./bpe_deen/deen-val -log deen_bpe -embs_share_weight -proj_share_weight -label_smoothing -output_dir output -b 256 -warmup 128000 -epoch 400
Step 3: Model Testing (Upcoming)
Testing for the BPE pipeline is not ready yet; it still needs additional setup, including loading the BPE vocabulary and decoding the subword output after translation.
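For reference, the decoding part of that setup usually just reverses the BPE segmentation; with subword-nmt's conventions, subword units carry a trailing "@@" marker, so a plain string replacement is enough (a hedged sketch, assuming that marker is used here):
# Undo BPE segmentation after translation (assumes subword-nmt's '@@ ' continuation marker).
def remove_bpe(line, separator='@@'):
    return line.replace(separator + ' ', '').replace(separator, '')
print(remove_bpe('ein sch@@ önes Haus'))  # -> 'ein schönes Haus'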
Project Performance and Testing
Training Parameters
The model employs the following parameter settings during training:
- Batch size: 256
- Warmup steps: 4000
- Number of epochs: 200
- Learning rate multiplier: 0.5
- Label smoothing applied
- No BPE, shared vocabulary
- Shared weights between target embedding and pre-softmax linear layer (a sketch of this follows the list)
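The last item, enabled by the -proj_share_weight flag in the training commands above, reuses a single parameter matrix for the target embedding and the output projection. A minimal sketch of the idea (module names here are illustrative, not the repository's):
# Minimal sketch of tying the target embedding and pre-softmax linear weights (illustrative names).
import torch.nn as nn
vocab_size, d_model = 10000, 512
trg_embedding = nn.Embedding(vocab_size, d_model)
generator = nn.Linear(d_model, vocab_size, bias=False)
# Both modules now share one (vocab_size, d_model) parameter tensor.
generator.weight = trg_embedding.weight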
Training performance plots are pending some enhancements and will be shared once available.
Testing (Future Release)
Testing outcomes and evaluations are yet to be released.
Future Work
Upcoming tasks include:
- Evaluation techniques for the generated text
- Plotting attention weights (see the sketch after this list)
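For the attention-plotting item, a natural first step is a heatmap of a single attention matrix; below is a hedged sketch using matplotlib, with random stand-in data rather than output from this model:
# Sketch of plotting one attention-weight matrix as a heatmap (random stand-in data, not model output).
import matplotlib.pyplot as plt
import torch
src_tokens = ['ein', 'schönes', 'Haus', '</s>']
trg_tokens = ['a', 'beautiful', 'house', '</s>']
attn = torch.softmax(torch.randn(len(trg_tokens), len(src_tokens)), dim=-1)
fig, ax = plt.subplots()
ax.imshow(attn.numpy(), cmap='viridis')
ax.set_xticks(range(len(src_tokens)))
ax.set_xticklabels(src_tokens)
ax.set_yticks(range(len(trg_tokens)))
ax.set_yticklabels(trg_tokens)
ax.set_xlabel('source tokens')
ax.set_ylabel('target tokens')
plt.show()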
Acknowledgements
The Byte Pair Encoding components of this project are adapted from subword-nmt. Additionally, significant elements including the project infrastructure, scripts, and data preprocessing steps derive from OpenNMT/OpenNMT-py. The project has benefited from suggestions by notable contributors like @srush, @iamalbert, @Zessay, @JulesGM, @ZiJianZhao, and @huanghoujing.