Transformer

The Transformer is a neural network architecture that processes sequential data with self-attention rather than recurrence, so all positions in a sequence can be computed in parallel.

Introduced in 2017 by Vaswani et al., a team of authors from Google Brain, Google Research, and the University of Toronto, the Transformer reshaped sequence-to-sequence modeling, particularly in natural language processing. Its core innovation is to build the entire model around self-attention, which lets the model weigh every part of the input sequence when computing the representation for each position. By contrast, recurrent neural networks (RNNs) process tokens one at a time, and convolutional neural networks (CNNs) rely on local receptive fields. The original Transformer consists of an encoder and a decoder, each a stack of identical layers. An encoder layer contains a multi-head self-attention sublayer and a position-wise feed-forward network. A decoder layer contains masked self-attention, attention over the encoder output, and a feed-forward network. Every sublayer is wrapped in a residual connection followed by layer normalization.
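As a rough illustration of the layer structure described above, the following NumPy sketch wires together one encoder layer: a self-attention sublayer and a feed-forward sublayer, each followed by a residual connection and layer normalization. It is a minimal sketch under several assumptions: single-head attention for brevity, no learned layer-norm gain or bias, random weights, and hypothetical names (encoder_layer, init_params, d_ff) that are chosen for illustration and do not come from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector.
    # Assumption: no learned gain/bias, to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention (simplification).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise feed-forward network with a ReLU in between.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, p):
    # Sublayer 1: self-attention, then residual connection + layer norm.
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # Sublayer 2: feed-forward, then residual connection + layer norm.
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

def init_params(d_model, d_ff, rng):
    # Hypothetical helper: small random weights, zero biases.
    scale = 0.1
    return {
        "w_q": rng.normal(0, scale, (d_model, d_model)),
        "w_k": rng.normal(0, scale, (d_model, d_model)),
        "w_v": rng.normal(0, scale, (d_model, d_model)),
        "w1": rng.normal(0, scale, (d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "w2": rng.normal(0, scale, (d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, d_model = 8
params = init_params(d_model=8, d_ff=32, rng=rng)
y = encoder_layer(x, params)
print(y.shape)  # (5, 8)
```

The output has the same shape as the input, which is what allows identical layers to be stacked. The normalization here is applied after the residual addition; many later models apply it before each sublayer instead.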

The self-attention mechanism computes attention scores between every pair of positions in the input sequence, so any two positions are connected in a single step. This lets the model capture long-range dependencies directly, without passing information through the long chain of recurrent steps that makes RNNs prone to vanishing gradients. The trade-off is that compute and memory grow quadratically with sequence length. Multi-head attention runs several attention operations in parallel, allowing the model to attend to information from different representation subspaces. Positional encodings are added to the input embeddings to provide information about token order, because self-attention on its own is permutation-equivariant: reordering the input tokens merely reorders the outputs in the same way.
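The mechanism described above can be sketched in a few lines of NumPy. The example below is an illustrative sketch, not a reference implementation: it computes multi-head scaled dot-product self-attention with random weights, then checks that, without positional encodings, shuffling the input tokens only shuffles the output rows in the same way. The function name multi_head_self_attention and the parameter names (w_q, w_k, w_v, w_o, num_heads) are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scores between every pair of positions, separately per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(0, 0.1, (d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # (6, 8) (2, 6, 6)

# Without positional encodings, permuting the input rows permutes the
# output rows identically: the layer cannot tell token order apart.
perm = rng.permutation(seq_len)
out_perm, _ = multi_head_self_attention(x[perm], w_q, w_k, w_v, w_o, num_heads)
print(np.allclose(out_perm, out[perm]))  # True
```

The (seq, seq) score matrix per head is also where the quadratic cost in sequence length comes from.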

The Transformer’s ability to process all tokens in parallel during training significantly reduces training time compared to sequential models. This scalability has enabled the development of large pre-trained models like BERT and GPT. The architecture has been adapted for domains beyond text, including computer vision (Vision Transformer) and audio processing. Its design principles have influenced many subsequent models and remain foundational in modern deep learning.

Why it matters

The Transformer matters because it provides a highly parallelizable architecture that captures long-range dependencies in sequential data and runs efficiently on modern hardware. This has enabled the training of much larger models on massive datasets, leading to breakthroughs in machine translation, text generation, and other NLP tasks. Its versatility has extended to computer vision and multimodal applications, making it a cornerstone of modern AI systems.

First appeared

Vaswani et al., 2017 (“Attention Is All You Need”, NIPS 2017), with authors from Google Brain, Google Research, and the University of Toronto.

FAQ

How does it work?

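For each position, self-attention compares that position's query vector with the key vectors of all positions, turns the resulting scores into weights with a softmax, and outputs the corresponding weighted sum of value vectors. The whole sequence is processed in parallel, and multi-head attention repeats this with several sets of projections to capture different types of relationships. Positional encodings are added to the input embeddings to retain order information.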
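One way to build the positional encodings mentioned above is the fixed sinusoidal scheme used in the original paper; learned position embeddings are a common alternative. The sketch below assumes an even model dimension, and the function name sinusoidal_positional_encoding and the zero-valued stand-in embeddings are illustrative choices, not part of any library.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumption: d_model is even, so sine and cosine columns pair up.
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    # Wavelengths form a geometric progression across dimensions.
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
embeddings = np.zeros((4, 6))  # stand-in for real token embeddings
inputs = embeddings + pe       # encodings are added, not concatenated
print(pe.shape)  # (4, 6)
```

Because every position receives a distinct vector, two otherwise identical tokens at different positions produce different inputs to the attention layers.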

What is the main advantage over RNNs?

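The main advantage is parallelization during training: the Transformer does not have to process tokens one after another, which allows much faster training on modern hardware. In addition, any two positions interact directly through attention, so long-range dependencies do not have to survive a long chain of recurrent steps, which is where RNNs tend to suffer from vanishing gradients.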
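The difference in data dependencies can be made concrete with a small NumPy sketch. It is illustrative only, with random weights and hypothetical function names: the recurrent update needs the previous hidden state before it can compute the next one, whereas the attention output at any position can be computed on its own from the inputs, so all positions can be evaluated at once.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rnn_forward(x, w_x, w_h):
    # Each hidden state depends on the previous one, so the time steps
    # have to be computed one after another.
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_x + h @ w_h)
        states.append(h)
    return np.stack(states)

def attention_all_positions(x, w_q, w_k, w_v):
    # Every position is handled by the same few matrix products.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def attention_one_position(x, t, w_q, w_k, w_v):
    # Output for position t alone: it needs the inputs, but not the
    # outputs of any other position.
    q_t = x[t] @ w_q
    k, v = x @ w_k, x @ w_v
    return softmax(q_t @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
seq_len, d = 7, 4
x = rng.normal(size=(seq_len, d))
w_x, w_h, w_q, w_k, w_v = (rng.normal(0, 0.5, (d, d)) for _ in range(5))

rnn_states = rnn_forward(x, w_x, w_h)               # seq_len sequential steps
attn_out = attention_all_positions(x, w_q, w_k, w_v)  # one batched computation

# Computing each position independently gives the same result as the
# batched version, which is why the work can be done in parallel.
rows = np.stack([attention_one_position(x, t, w_q, w_k, w_v)
                 for t in range(seq_len)])
print(np.allclose(rows, attn_out))  # True
```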

When should I use a Transformer instead of a CNN?

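Use a Transformer when the input has long-range dependencies or when global context is important, such as in text or in images where distant regions must be related to each other. Keep in mind that the cost of self-attention grows quadratically with the number of tokens, which matters for long sequences and high-resolution images. CNNs are often more efficient for tasks dominated by local patterns, where a limited receptive field is sufficient.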