Attention Is All You Need

How can we improve the efficiency and accuracy of sequence-based models while overcoming the limitations of recurrence?

In plain English: how can we make AI more useful?

That was the question a team of researchers at Google set out to answer in 2017.

At the time, language models faced a problem.

Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks (yes, that’s probably the most engineer-sounding name you could possibly come up with) forced models to process one word at a time, in sequential order.

The result was that, given a long sentence, models would often forget the context from its earlier parts.

Eventually there was a breakthrough, and it probably involved a lot of caffeine.

A strong but simple thesis was presented.

Attention Is All You Need.

Instead of relying on RNNs and LSTMs to carry a running memory forward one word at a time, they designed the model to consider all words at once and weigh each word’s importance, even when the words were far apart.
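That weighing step is the attention mechanism. As a rough illustration (not the paper’s full multi-head version), here is a minimal sketch of scaled dot-product self-attention in Python with NumPy, using a handful of made-up word vectors:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Every position weighs every other position at once,
    no matter how far apart the words are."""
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled to keep values stable
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns similarities into attention weights that sum to 1 per row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted blend of all the value vectors
    return weights @ V

# Toy example: 4 "words", each a 3-dimensional vector (made-up embeddings)
np.random.seed(0)
x = np.random.randn(4, 3)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # (4, 3): each word now carries context from all four
```

The key point of the sketch: nothing is processed one step at a time, so a word at the end of the sentence can attend directly to a word at the beginning.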

That architecture is known as the Transformer, and it has transformed how we work and live.
