Attention Sink

Attention sinks address a critical issue observed when window attention is used in autoregressive language models: these models often exhibit a sudden decline in fluency as soon as the first token leaves the context window. The underlying reason lies in an intriguing property of LLMs: an overwhelming majority of attention is allocated to the first few tokens of the sequence, termed "attention sinks." These tokens soak up a disproportionate share of the attention score even when they are not semantically relevant.
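
The effect is easy to observe by inspecting attention weights directly. Below is a minimal sketch, assuming the Hugging Face transformers library with GPT-2 as a stand-in model (any causal LM that returns attention weights works); it measures how much of the final query's attention mass lands on the very first token. Exact numbers vary by model and prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer(
    "The quick brown fox jumps over the lazy dog near the river bank.",
    return_tensors="pt",
)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, shape (batch, heads, query_pos, key_pos)
attn = torch.stack(out.attentions)   # (layers, batch, heads, q, k)
mass_on_first = attn[..., -1, 0]     # last query's attention on token 0, per layer/head
print(f"mean attention mass on the first token: {mass_on_first.mean():.3f}")
```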

Why does this happen?

The model relies heavily on these "sink" tokens because the softmax operation in the attention mechanism enforces a sum-to-one constraint. When no token in the context is strongly relevant to the token being generated, the model compensates by dumping attention scores onto these first few tokens. When window attention is employed and the first token exits the window, the model loses its default "sink" for offloading attention, and the scores disperse across all remaining tokens. Consequently, tokens that should not have high attention scores end up receiving them, causing the model to "collapse" and lose fluency.
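
To make the sum-to-one intuition concrete, here is a toy sketch with made-up logit values: a single dominant "sink" logit soaks up most of the probability mass, and dropping that token forces the same mass onto tokens that earned only weak scores.

```python
import torch

# Raw attention logits for one query over six cached tokens; none is a strong match.
logits = torch.tensor([0.2, 0.1, 0.0, 0.1, 0.2, 0.1])
weights = torch.softmax(logits, dim=-1)
print(weights.sum())          # always 1.0: the mass has to go somewhere

# A model that has learned to park excess attention on token 0 (the "sink")
# gives it a much larger logit than the rest.
logits_with_sink = torch.tensor([4.0, 0.1, 0.0, 0.1, 0.2, 0.1])
w = torch.softmax(logits_with_sink, dim=-1)
print(w[0])                   # most of the mass lands on the sink token

# Plain window attention eventually evicts token 0; the softmax is then taken
# over the remaining logits, and the sink's mass is spread across weak matches.
w_no_sink = torch.softmax(logits_with_sink[1:], dim=-1)
print(w_no_sink)              # roughly uniform over tokens that earned low scores
```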

The Solution

To mitigate this issue, the authors propose an adaptation of traditional window attention. The revised model always keeps the initial four tokens, i.e., the attention sink tokens, within the window. Moreover, instead of using positions from the original text, positional information is assigned from each token's position within the cache. This keeps the sink tokens positionally close to the rest of the cached tokens, so they can continue to serve as attention offloading points.
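
Below is a minimal sketch of this cache-eviction policy, assuming a KV cache laid out as (..., seq_len, head_dim); the function name streaming_kv_update and the parameters n_sink and window are illustrative, not the authors' API.

```python
import torch

def streaming_kv_update(k_cache, v_cache, k_new, v_new, n_sink=4, window=1024):
    """Append new key/value pairs, then evict from the middle of the cache:
    always keep the first `n_sink` entries (the attention sinks) plus the
    most recent `window - n_sink` entries."""
    k = torch.cat([k_cache, k_new], dim=-2)   # (..., seq_len, head_dim)
    v = torch.cat([v_cache, v_new], dim=-2)

    if k.size(-2) > window:
        keep_recent = window - n_sink
        k = torch.cat([k[..., :n_sink, :], k[..., -keep_recent:, :]], dim=-2)
        v = torch.cat([v[..., :n_sink, :], v[..., -keep_recent:, :]], dim=-2)

    # Positional information is assigned by slot in the cache, not by position
    # in the original text: the sinks sit at 0..n_sink-1 and the recent tokens
    # follow contiguously, no matter how much text was evicted in between.
    positions = torch.arange(k.size(-2))
    return k, v, positions
```

Feeding these cache-relative positions into the positional encoding is what keeps the sink tokens "close" to the current window even after thousands of intervening tokens have been evicted.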