LongNet
LongNet, an advanced Transformer variant, scales sequence lengths to 1 billion tokens without sacrificing performance on shorter sequences. Using dilated attention, it maintains linear computational complexity and a logarithmic dependency between any two tokens, making it suitable for distributed training over extremely long sequences. Because dilated attention is a drop-in replacement for standard attention, the model integrates with existing Transformer optimizations and delivers strong results on both long-sequence and general language tasks. This opens up the possibility of modeling vast sequences, such as an entire corpus or the Internet, with improved efficiency and expressivity.
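To make the mechanism concrete, here is a minimal sketch of dilated attention for a single (segment length, dilation rate) pair in PyTorch. The function name `dilated_attention` and the parameters `segment_length` and `dilation_rate` are illustrative assumptions, not the official LongNet API; the real method mixes several such pairs with learned weighting.

```python
# Minimal sketch of dilated attention (single segment-length / dilation-rate pair).
# Illustrative only: names and signature are assumptions, not the LongNet implementation.
import torch
import torch.nn.functional as F


def dilated_attention(q, k, v, segment_length, dilation_rate):
    """Sparse attention over dilated segments.

    q, k, v: (batch, seq_len, dim); seq_len is assumed divisible by segment_length.
    Each segment is subsampled every `dilation_rate` positions, attention is
    computed within the sparsified segment, and outputs are scattered back.
    """
    b, n, d = q.shape
    out = torch.zeros_like(q)
    # Relative indices of the positions kept inside each segment.
    keep = torch.arange(0, segment_length, dilation_rate, device=q.device)
    for start in range(0, n, segment_length):
        idx = start + keep                           # absolute positions in this segment
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        attn = F.softmax(qs @ ks.transpose(-2, -1) / d ** 0.5, dim=-1)
        out[:, idx] = attn @ vs                      # scatter back to original positions
    return out


# Usage: 8 segments of length 128, keeping every 4th token per segment.
x = torch.randn(2, 1024, 64)
y = dilated_attention(x, x, x, segment_length=128, dilation_rate=4)
print(y.shape)  # torch.Size([2, 1024, 64])
```

Each segment of length w costs roughly (w/r)^2 attention operations, so the total cost over n tokens is on the order of n * w / r^2, i.e. linear in sequence length, which is why the approach scales to very long inputs and partitions naturally across devices.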