Transcript
Title text: This is how you all fucking sound
[A smug tech bro wearing a sideways cap, a watch, and a chain around his neck stands in front of a data center by a lake with dead fish. A smokestack blows pollution into the air]
Tech bro: AI is already here, there’s no going back.
[A smug man in a suit with cigarette in hand stands in a restaurant while two disgruntled diners cough from the smoke]
Suit: Smoking indoors is already here, there’s no going back.
[A smug man in a top hat and suit stands in a factory with two sad and dirty children]
Hat: Child labor is already here, there’s no going back.
[A smug plantation owner stands in front of a field with two angry slaves]
Plantation owner: The Atlantic Slave trade is already here, there’s no going back.


I’m saying there is no “big leap” necessary. As the paper that introduced the transformer said, attention is all you need.
If we’re going to pull up other people’s pithy phrases that aren’t intended to be taken entirely literally, then the relevant one here is “machine learning is the second-best solution to any problem.” In the century or so (depending on how you define it) that people have been thinking about computers, we’ve already found better solutions to lots of problems. If a transformer-based neural network can get 99% accuracy in sixty seconds on 92 billion transistors of GPU and billions more for its VRAM, that’s pretty useless if we can also do it with 100% accuracy in sixty microseconds on a $1 microcontroller, or even faster on a less constrained device.
The “attention is all you need” phrase is specifically about sequence transduction models, and refers to the discovery that they don’t need a combination of attention, recurrence and convolution, but only need attention if it’s used in the novel way introduced by the paper. If you don’t need to transduce any sequences, then this isn’t necessarily relevant, and, critically, it’s not a claim that you can do everything by transducing sequences. It was a surprise that applying it to generating new text instead of just converting it worked as well as it did, a surprise that it kept getting better with larger models instead of plateauing around the GPT-1 and GPT-2 era, and a surprise that the text generation could be used to do other things, even ones as basic as addition. None of these things were predicted by the Attention Is All You Need paper.