Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

Posts 49

💥 L-Mul: Addition-Only Multiplication can slash computational costs by 80%!

Researchers from BitEnergy AI dropped a technique that could slash the energy use of transformer computations: their novel "linear-complexity multiplication" (L-Mul) algorithm approximates floating-point multiplication using energy-efficient integer addition instead of costly multiplications.

💡 Quick reminder on how floats are coded on 8 bits (FP8):
In the e4m3 FP8 standard, you encode a number as:
Sign (1 bit) | Exponent (4 bits) | Mantissa (3 bits)
Example: 0 (positive) | 1000 (8) | 101 (1/2 + 1/8 = 0.625)
Calculation: you add one to the mantissa, and multiply it by 2 to the power of (the exponent minus a bias term, which is 7 for e4m3):

➡️ You get (1 + 0.625) × 2^(8-7) = 3.25
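To make the decoding rule above concrete, here is a minimal Python sketch. The function name `decode_e4m3` is my own, and it only handles "normal" values: it ignores special cases such as subnormals (exponent field 0) and NaN, so treat it as an illustration rather than a full FP8 decoder.

```python
def decode_e4m3(byte: int) -> float:
    """Decode a normal FP8 e4m3 bit pattern (hypothetical helper, no special cases)."""
    sign = (byte >> 7) & 0b1          # 1 sign bit
    exponent = (byte >> 3) & 0b1111   # 4 exponent bits
    mantissa_bits = byte & 0b111      # 3 mantissa bits
    mantissa = mantissa_bits / 8      # 101 -> 5/8 = 0.625
    bias = 7                          # bias term for e4m3
    value = (1 + mantissa) * 2 ** (exponent - bias)
    return -value if sign else value


# The example from the post: 0 | 1000 | 101
print(decode_e4m3(0b0_1000_101))  # 3.25
```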

Now back to the paper. Key insights:

⚡️ Multiplication is extremely energy-intensive compared to addition. For 32-bit operations, a floating-point multiplication (3.7 pJ) uses 37x as much energy as an integer addition (0.1 pJ)!

🧮 Traditional floating-point multiplication goes like this (writing xm for the mantissa and xe for the exponent): Mul(x,y) = (1 + xm) · 2^xe · (1 + ym) · 2^ye = (1 + xm + ym + xm · ym) · 2^(xe+ye)

💡 L-Mul cleverly approximates this as: L-Mul(x,y) = (1 + xm + ym + 2^-l(m)) · 2^(xe+ye), eliminating the costly xm · ym term

🔧 The l(m) term is adaptively set based on the mantissa size m for optimal accuracy
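Here is a rough numerical sketch of the two formulas above, in Python. It simulates L-Mul with ordinary floats, so it shows the approximation error but not the energy savings, which would come from doing the work with integer adders in hardware. The helper names (`split`, `l_of_m`, `exact_mul`, `l_mul`) are mine, and the piecewise definition of l(m) is my reading of the paper, so take it as an assumption.

```python
import math


def l_of_m(m: int) -> int:
    # Assumed offset rule: l(m) = m if m <= 3, 3 if m == 4, 4 if m > 4
    if m <= 3:
        return m
    return 3 if m == 4 else 4


def split(x: float, m: int):
    """Write a nonzero x as sign * (1 + frac) * 2**exp, with frac truncated to m bits."""
    sign = -1.0 if x < 0 else 1.0
    frac, exp = math.frexp(abs(x))       # abs(x) = frac * 2**exp, 0.5 <= frac < 1
    frac, exp = frac * 2 - 1, exp - 1    # abs(x) = (1 + frac) * 2**exp, 0 <= frac < 1
    frac = math.floor(frac * 2**m) / 2**m
    return sign, frac, exp


def exact_mul(x: float, y: float, m: int) -> float:
    if x == 0 or y == 0:
        return 0.0
    sx, xm, xe = split(x, m)
    sy, ym, ye = split(y, m)
    return sx * sy * (1 + xm + ym + xm * ym) * 2.0 ** (xe + ye)


def l_mul(x: float, y: float, m: int) -> float:
    if x == 0 or y == 0:
        return 0.0
    sx, xm, xe = split(x, m)
    sy, ym, ye = split(y, m)
    # The xm * ym product is replaced by the constant 2**-l(m)
    return sx * sy * (1 + xm + ym + 2.0 ** -l_of_m(m)) * 2.0 ** (xe + ye)


print(exact_mul(3.25, 1.5, 3))  # 4.875
print(l_mul(3.25, 1.5, 3))      # 4.5
```

With 3 mantissa bits, 3.25 × 1.5 comes out as 4.5 instead of 4.875: the xm · ym term (0.625 × 0.5 = 0.3125) is swapped for 2^-3 = 0.125.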

📊 Benchmarks on the Llama-3.1-8B-Instruct model show L-Mul preserves precision across various NLP tasks, with performance nearly identical to full BFloat16 precision

💬 Authors claim: "We can achieve the same model inference performance while reducing the energy cost of attention computations by 80%."

This breakthrough is still theoretical and would need implementation on dedicated hardware to confirm real-world gains, but it’s a really exciting path for more sustainable AI! 🌱

Read the paper here 👉 Addition is All You Need for Energy-efficient Language Models (2410.00907)

📜 Old-school RNNs can actually rival fancy transformers!

Researchers from Mila and Borealis AI have just shown that simplified versions of good old Recurrent Neural Networks (RNNs) can match the performance of today's transformers.

They took a fresh look at LSTMs (from 1997!) and GRUs (from 2014). They stripped these models down to their bare essentials, creating "minLSTM" and "minGRU". The key changes:
❶ Removed dependencies on previous hidden states in the gates
❷ Dropped the tanh that had been added to restrict output range in order to avoid vanishing gradients
❸ Ensured outputs are time-independent in scale (not sure I understood that well either, don't worry)

⚡️ As a result, you can use a “parallel scan” algorithm to train these new, minimal RNNs in parallel. This takes 88% more memory, but also makes them 200x faster than their traditional counterparts for long sequences
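Why does removing the hidden state from the gates unlock a parallel scan? Once the gate z_t and the candidate state depend only on the input x_t, the minGRU update h_t = (1 - z_t) · h_(t-1) + z_t · h̃_t has the form h_t = a_t · h_(t-1) + b_t, and pairs (a, b) can be combined associatively. Below is a toy, pure-Python sketch with scalar states. The weights `w_z` and `w_h` are made-up values, and the paper's actual implementation may differ (for instance in numerical-stability tricks), so this only illustrates the idea.

```python
import math


def sequential(a, b, h0=0.0):
    """Reference loop: h_t = a_t * h_(t-1) + b_t, one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine steps h -> a*h + b (left is applied first)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan(a, b, h0=0.0):
    """Hillis-Steele inclusive scan: about log2(T) rounds.
    Within a round every combine is independent, so it could run in parallel."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(len(elems))
        ]
        step *= 2
    return [A * h0 + B for A, B in elems]


# Toy minGRU with scalar state and hypothetical weights
w_z, w_h = 0.7, -1.3
xs = [0.5, -1.0, 2.0, 0.1, 1.5, -0.4, 0.9]
z = [1 / (1 + math.exp(-w_z * x)) for x in xs]   # gate depends on x_t only
h_tilde = [w_h * x for x in xs]                  # candidate state, no tanh
a = [1 - z_t for z_t in z]
b = [z_t * h_t for z_t, h_t in zip(z, h_tilde)]

seq_out = sequential(a, b)
scan_out = scan(a, b)
print(all(abs(p - q) < 1e-12 for p, q in zip(seq_out, scan_out)))  # True
```

In a classic GRU, z_t also depends on h_(t-1), so a_t and b_t are not known up front and this trick does not apply.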

🔥 The results are mind-blowing! Performance-wise, they go toe-to-toe with Transformers or Mamba.

And for Language Modeling, they need 2.5x fewer training steps than Transformers to reach the same performance! 🚀

🤔 Why does this matter?

By showing that simpler models can reach similar performance to transformers, this work challenges the narrative that we need advanced architectures for better performance!

💬 François Chollet wrote in a tweet about this paper:

“The fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning)”

“Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape.”

It’s the Bitter Lesson by Rich Sutton striking again: you don’t need fancy architectures, just scale up your model and data!

Read the paper 👉 Were RNNs All We Needed? (2410.01201)