Introduction to Attention Is All You Need: A PyTorch Implementation
This project is a PyTorch implementation of the Transformer model described in the paper "Attention is All You Need" by Ashish Vaswani et al. The Transformer dispenses with recurrent and convolutional operations entirely, relying instead on self-attention, and set new benchmarks in machine translation, demonstrated on the WMT 2014 English-to-German task.
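For readers new to the mechanism, the core building block is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The following is a minimal illustrative sketch of that formula in PyTorch, not code taken from this repository:
# Minimal sketch of scaled dot-product attention (illustrative only, not this repo's implementation).
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    # Similarity scores between every query and key position, scaled by sqrt(d_k).
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attn = F.softmax(scores, dim=-1)
    # Weighted sum of the values, returned together with the attention weights for inspection.
    return torch.matmul(attn, v), attn
# Example: a batch of 2 sequences, 5 positions, d_k = 64.
q = k = v = torch.randn(2, 5, 64)
output, attn = scaled_dot_product_attention(q, k, v)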
The original TensorFlow implementation can be found in the tensorflow/tensor2tensor repository. For more background on self-attention, see the paper "A Structured Self-attentive Sentence Embedding".
The project is still a work in progress; some components, such as BPE, are still being tested. Feedback and bug reports are welcome on the project's issue tracker.
Usage Instructions
WMT'16 Multimodal Translation: de-en
Below is a step-by-step guide on how to run a training example for the WMT'16 Multimodal Translation task:
Step 0: Install spaCy Language Models
Begin by downloading the required language models with spaCy, an NLP library, using the following commands:
# Install spacy using conda
# conda install -c conda-forge spacy
python -m spacy download en
python -m spacy download de
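To check that the downloads succeeded, the models can be loaded directly in Python (assuming a spaCy version that still supports the en/de shortcut names used by the commands above):
# Quick sanity check that the language models are installed.
# Assumes the 'en'/'de' shortcut names created by the download commands above.
import spacy
nlp_en = spacy.load('en')
nlp_de = spacy.load('de')
print([token.text for token in nlp_en('A quick tokenization test.')])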
Step 1: Data Preprocessing
Run the preprocessing script to prepare the dataset with spaCy and torchtext:
python preprocess.py -lang_src de -lang_trg en -share_vocab -save_data m30k_deen_shr.pkl
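The layout of the resulting pickle is defined entirely by preprocess.py, but a generic inspection in the same environment (so that any torchtext objects can be unpickled) is an easy sanity check:
# Generic sanity check of the preprocessed file; its structure is whatever preprocess.py saved.
import pickle
with open('m30k_deen_shr.pkl', 'rb') as f:
    data = pickle.load(f)
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))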
Step 2: Model Training
With the data preprocessed, train the model using the command below:
python train.py -data_pkl m30k_deen_shr.pkl -log m30k_deen_shr -embs_share_weight -proj_share_weight -label_smoothing -output_dir output -b 256 -warmup 128000 -epoch 400
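The -warmup flag controls the learning-rate schedule from the paper: the rate increases linearly over the first warmup steps and then decays with the inverse square root of the step count, scaled by the lr_mul factor listed in the training parameters below. A small sketch of the formula (function and argument names here are illustrative, not the repository's):
# Learning-rate schedule from "Attention is All You Need" (illustrative sketch).
# lr = lr_mul * d_model^-0.5 * min(step^-0.5, step * n_warmup_steps^-1.5)
def transformer_lr(step, d_model=512, n_warmup_steps=128000, lr_mul=0.5):
    return lr_mul * (d_model ** -0.5) * min(step ** -0.5, step * n_warmup_steps ** -1.5)
print(transformer_lr(1))       # very small rate at the first step
print(transformer_lr(128000))  # peak rate, reached at the end of warmup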
Step 3: Model Testing
Finally, test the trained model using the provided script:
python translate.py -data_pkl m30k_deen_shr.pkl -model trained.chkpt -output prediction.txt
WMT'17 Multimodal Translation: de-en with BPE (Work In Progress)
For those wishing to experiment with Byte Pair Encoding (BPE), the instructions differ slightly and are still under development.
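As background, BPE builds a subword vocabulary by repeatedly merging the most frequent pair of adjacent symbols in the training corpus. A toy sketch of how one merge candidate is found (the actual implementation used by this project is subword-nmt):
# Toy sketch of choosing a BPE merge (the project itself relies on subword-nmt for this).
from collections import Counter
# Words as sequences of symbols, with corpus frequencies; '</w>' marks a word boundary.
vocab = {('l', 'o', 'w', '</w>'): 5, ('l', 'o', 'w', 'e', 'r', '</w>'): 2}
pairs = Counter()
for word, freq in vocab.items():
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq
best = max(pairs, key=pairs.get)
print(best)  # one of the most frequent pairs, e.g. ('l', 'o'), which BPE would merge into 'lo'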
Step 1: Data Download and Preprocessing
Start by switching the main function call in preprocess.py from main_wo_bpe to main, then run:
python preprocess.py -raw_dir /tmp/raw_deen -data_dir ./bpe_deen -save_data bpe_vocab.pkl -codes codes.txt -prefix deen
Step 2: Model Training
Proceed with training while noting the different dataset paths:
python train.py -data_pkl ./bpe_deen/bpe_vocab.pkl -train_path ./bpe_deen/deen-train -val_path ./bpe_deen/deen-val -log deen_bpe -embs_share_weight -proj_share_weight -label_smoothing -output_dir output -b 256 -warmup 128000 -epoch 400
Step 3: Model Testing (Upcoming)
Testing for the BPE pipeline is not ready yet; it still needs additional setup, including loading the BPE vocabulary and decoding the subword output after translation.
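For reference, the decoding part of that setup usually just reverses the BPE segmentation; with subword-nmt's conventions, subword units carry a trailing "@@" marker, so a plain string replacement is enough (a hedged sketch, assuming that marker is used here):
# Undo BPE segmentation after translation (assumes subword-nmt's '@@ ' continuation marker).
def remove_bpe(line, separator='@@'):
    return line.replace(separator + ' ', '').replace(separator, '')
print(remove_bpe('ein sch@@ önes Haus'))  # -> 'ein schönes Haus'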
Project Performance and Testing
Training Parameters
The model employs the following parameter settings during training:
- Batch size: 256
- Warmup steps: 4000
- Number of epochs: 200
- Learning rate multiplier: 0.5
- Label smoothing applied
- No BPE, shared vocabulary
- Shared weights between target embedding and pre-softmax linear layer (a sketch of this follows the list)
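The last item, enabled by the -proj_share_weight flag in the training commands above, reuses a single parameter matrix for the target embedding and the output projection. A minimal sketch of the idea (module names here are illustrative, not the repository's):
# Minimal sketch of tying the target embedding and pre-softmax linear weights (illustrative names).
import torch.nn as nn
vocab_size, d_model = 10000, 512
trg_embedding = nn.Embedding(vocab_size, d_model)
generator = nn.Linear(d_model, vocab_size, bias=False)
# Both modules now share one (vocab_size, d_model) parameter tensor.
generator.weight = trg_embedding.weight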
Training performance plots are pending some enhancements and will be shared once available.
Testing (Future Release)
Testing outcomes and evaluations are yet to be released.
Future Work
Upcoming tasks include:
- Evaluation techniques for the generated text
- Plotting attention weights (see the sketch after this list)
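For the attention-plotting item, a natural first step is a heatmap of a single attention matrix; below is a hedged sketch using matplotlib, with random stand-in data rather than output from this model:
# Sketch of plotting one attention-weight matrix as a heatmap (random stand-in data, not model output).
import matplotlib.pyplot as plt
import torch
src_tokens = ['ein', 'schönes', 'Haus', '</s>']
trg_tokens = ['a', 'beautiful', 'house', '</s>']
attn = torch.softmax(torch.randn(len(trg_tokens), len(src_tokens)), dim=-1)
fig, ax = plt.subplots()
ax.imshow(attn.numpy(), cmap='viridis')
ax.set_xticks(range(len(src_tokens)))
ax.set_xticklabels(src_tokens)
ax.set_yticks(range(len(trg_tokens)))
ax.set_yticklabels(trg_tokens)
ax.set_xlabel('source tokens')
ax.set_ylabel('target tokens')
plt.show()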
Acknowledgements
The Byte Pair Encoding components of this project are adapted from subword-nmt. Additionally, significant elements including the project infrastructure, scripts, and data preprocessing steps derive from OpenNMT/OpenNMT-py. The project has benefited from suggestions by notable contributors like @srush, @iamalbert, @Zessay, @JulesGM, @ZiJianZhao, and @huanghoujing.