arxiv:2410.05258

Differential Transformer

Published on Oct 7
Submitted by unilm on Oct 8
#1 Paper of the day

Abstract

Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which has long been a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture for advancing large language models.

Community


Does the differential transformer get rid of the attention sink?

Paper author

We observe that Diff Transformer allocates much lower attention scores to attention sinks, i.e., the first few tokens in the sequence.
Specifically, in the language modeling task, Diff Transformer allocates less than 5% of the attention score to the BOS token, while Transformer allocates about 25%. For the key information retrieval task, please refer to Figure 1 in the paper. We find that models attend to the BOS token more when there is less useful information in the context.

Great stuff. I would love to see comparisons against MöbiusAttention, which learns to forget... but this seems way more computationally efficient.

Paper author

Thanks for pointing out this paper. We will look into it.

It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads.

I wonder if a different approach without that issue exists. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way when the query is orthogonal to the key (or worse), it does not contribute.
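
For concreteness, a small sketch of the variant proposed here (this is the suggestion above, not the mechanism used in the paper), where scores is the usual scaled QK^T matrix:

```python
import torch

def relu_exp_attention_weights(scores, eps=1e-9):
    # Proposed variant: use max(0, exp(x) - 1) instead of exp(x) before normalizing,
    # so a query that is orthogonal to a key (score <= 0) contributes nothing.
    w = torch.clamp(torch.exp(scores) - 1.0, min=0.0)
    return w / (w.sum(dim=-1, keepdim=True) + eps)
```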

Paper author

In Diff Transformer, we split heads instead of doubling them, so no extra QK projection parameters are introduced. The Q and K heads are split into two groups and computed in pairs, and the two maps in a pair share the same V with dimension 2d. With this design, we match the FLOPs and parameter count of Transformer.
Using max(0, exp(x)-1) might be an approach that solves the problem. We didn't try it because we believe the properties of exp() are important for learning.
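
To make the head-splitting concrete, here is a minimal single-pair sketch of differential attention as described above (our reading of the mechanism; it ignores causal masking, multi-head packing, and the per-head normalization used in the paper, and the shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def diff_attention_pair(q, k, v, lam, d_head):
    # q, k: (batch, seq_len, 2 * d_head), each split into a pair of halves
    # v:    (batch, seq_len, 2 * d_head), shared by the pair
    # lam:  scalar lambda for this head/layer
    q1, q2 = q.chunk(2, dim=-1)
    k1, k2 = k.chunk(2, dim=-1)
    scale = d_head ** -0.5
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) * scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) * scale, dim=-1)
    # The difference of the two softmax maps weights the shared value vectors.
    return (a1 - lam * a2) @ v

# Example with illustrative sizes: batch 2, sequence 16, per-half head dim 32.
q = torch.randn(2, 16, 64)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)
out = diff_attention_pair(q, k, v, lam=0.5, d_head=32)  # shape (2, 16, 64)
```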

Great work! Just wondering, do you have any idea why the two learned attention maps tend to cancel noise rather than cancel the signal? For instance, if attention 1 learns S + N_1 and attention 2 learns S + N_2 (where S is signal and N_1, N_2 are different noises), then subtracting the two cancels the signal S while the noise becomes N_1 - N_2, which could be more complicated. Is there any reason why the model would not do this instead?

Paper author

It's a good question. Our observation is that the model knows what is signal and what is noise. Since attention_1 and attention_2 are both calculated with learnable parameters, they can "perceive" each other during training and adjust themselves relative to each other to achieve a lower loss. The result is that the model chooses to preserve signal and cancel out noise, as long as we give it the chance to do so. A single softmax, by contrast, has difficulty learning the same solution because of its formulation and gradient properties.


Exciting work! Do the authors plan to release model weights on Hugging Face?

Paper author

No, we won't release model weights on Hugging Face.

Could this also be applied to linear attention variants for acceleration?

Paper author

That's an interesting problem to explore. We haven't tried that yet. We will look into it in the future.

Can you provide an intuition for why \lambda is re-parameterized in the form shown in the paper?

Paper author

Sure. lambda is multiplied with the softmax, where softmax = exp(qk) / Sigma(exp(qk)). The parameters inside lambda learn at the same rate as the other parameters in the model, so lambda should take a form similar to the softmax. That's why lambda = exp(lambda_q * lambda_k) + lambda_init. Moreover, to enable lambda to learn values smaller than lambda_init, we add a second term, i.e., lambda = exp(lambda_q1 * lambda_k1) - exp(lambda_q2 * lambda_k2) + lambda_init.
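
A small sketch of this re-parameterization (names and shapes are illustrative; the lambda_q*/lambda_k* vectors are learnable and lambda_init is a constant per layer):

```python
import torch

def compute_lambda(lambda_q1, lambda_k1, lambda_q2, lambda_k2, lambda_init):
    # lambda = exp(lambda_q1 . lambda_k1) - exp(lambda_q2 . lambda_k2) + lambda_init
    return (torch.exp(torch.dot(lambda_q1, lambda_k1))
            - torch.exp(torch.dot(lambda_q2, lambda_k2))
            + lambda_init)

# Example with illustrative head dimension 32 and lambda_init = 0.8.
d = 32
lam = compute_lambda(*(torch.randn(d) * 0.1 for _ in range(4)), lambda_init=0.8)
```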

What kind of hardware was required to train this, and how did the tokens-per-second throughput compare with standard Transformers?

Paper author

There are no special hardware requirements if you use the naive implementation. If you use flashdiff, refer to the FlashAttention repo (https://github.com/Dao-AILab/flash-attention) for hardware and data-type requirements.
Our speed tests were performed on NVIDIA H100-80GB GPUs, measuring throughput in tokens per second. The same cards and environment were used for both Diff Transformer and Transformer.

The work looks exciting and I really like the motivation coming from noise cancellation!
I have a few questions -

  1. Won't this model let the post-attention weight (softmax(...) - \lambda * softmax(...)) for some value vectors be negative? Is that a design choice? One explanation does come to mind, i.e., wanting opposing contributions from some tokens specifically, but I am unsure whether this is desirable (see the small numeric check after this list).

  2. This recent work (https://arxiv.org/pdf/2410.01104) shows that attention will disperse given a few conditions (see Lemma 2.1, Page 3). Do you think differential attention is any different? If I understand the proposal correctly, I think it still satisfies Lemma 2.1 with some minor modifications in the proof.
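
A toy numeric check of point 1 (illustrative values only):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
s1 = torch.randn(4)  # toy scores for one query over 4 keys, attention map 1
s2 = torch.randn(4)  # toy scores for the same query, attention map 2
lam = 0.8            # illustrative lambda

w = F.softmax(s1, dim=-1) - lam * F.softmax(s2, dim=-1)
print(w)        # some entries can be negative
print(w.sum())  # sums to 1 - lam rather than 1
```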

Thanks again for your wonderful work!
