Transformers and LLMs · Module 18

Transformers → Large Language Models

Stack attention layers, train on a large fraction of the public text on the internet, and remarkable things happen.

Scale

GPT-2 (2019): 1.5 B params. Today's largest models are estimated at roughly 1–10 T params, with training runs estimated to cost $100 M or more. Performance improves steadily as training compute grows.
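To make the 1.5 B figure concrete, here is a rough sketch of where the parameters of a GPT-style model live. It assumes the publicly released configuration of the largest GPT-2 model (48 layers, width 1600, vocabulary 50257, context 1024) and counts weight matrices only, ignoring biases and layer norms. The `training_flops` helper uses the common "6 × params × tokens" rule of thumb, and the token budget passed to it is a hypothetical number chosen purely for illustration.

```python
def transformer_param_count(n_layer, d_model, vocab_size, n_ctx):
    """Approximate weight count of a GPT-style decoder (no biases, no layer norms)."""
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 8 * d_model * d_model         # two linear layers with a 4x hidden width
    blocks = n_layer * (attention + mlp)
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model
    return blocks + embeddings


def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


# Assumed GPT-2 (largest variant) configuration.
gpt2_xl = transformer_param_count(n_layer=48, d_model=1600, vocab_size=50257, n_ctx=1024)
print(f"{gpt2_xl / 1e9:.2f} B parameters")  # prints "1.56 B parameters"

# Hypothetical token budget, for illustration only.
hypothetical_tokens = 40e9
print(f"{training_flops(gpt2_xl, hypothetical_tokens):.2e} training FLOPs")
```

Almost all of the parameters sit in the stacked blocks, which grow with the square of the width. Under this approximation, scaling a model up means paying for both more parameters and more training tokens.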

Emergence

Capabilities like multi-step reasoning, code generation, and abstract analogy appear only above certain scales. They were not explicitly engineered in; they arise from training at scale.

Generality

One model, many tasks. Translate, summarise, write code, answer medical questions, plan a trip — same weights.

Alignment

After pre-training, models are fine-tuned on human preference data, for example with reinforcement learning from human feedback (RLHF), to be helpful, honest, and harmless.
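As a minimal sketch of how human preference data can become a training signal, the snippet below implements a pairwise (Bradley-Terry style) loss, which is one common way to train a reward model in RLHF pipelines. It is not necessarily what any particular model uses. The function name and the example reward values are hypothetical. In practice the rewards would come from a neural network that scores full responses.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for two responses to the same prompt.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_human < disagrees_with_human)  # prints "True"
```

The loss is small when the reward model scores the human-preferred response higher, and large when it ranks the pair the wrong way round. Minimising it pushes the reward model towards human judgements, and the language model is then tuned to produce responses that score well.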


Source slide 19