Transformers and LLMs · Module 18
Transformers → Large Language Models
Stack attention layers, train on a large fraction of the public text on the internet, and remarkable things happen.
Scale
GPT-2 (2019): 1.5 B params. Today: 1–10 T params. Cost: $100 M+ per training run. The phenomenon scales with compute.
Emergence
Capabilities like multi-step reasoning, code generation, and abstract analogy appear only above certain scales. They were not engineered in.
Generality
One model, many tasks. Translate, summarise, write code, answer medical questions, plan a trip — same weights.
Alignment
After pre-training, models are tuned with human preference data (RLHF) to be helpful, honest, and harmless.
Reader progress is stored locally in this browser.
Source slide 19