Andrej Karpathy
United States
Joined Sep 7, 2013
FAQ
Q: How can I pay you? Do you have a Patreon or etc?
A: As a YouTube partner I do share in a small amount of the ad revenue on the videos, but I don't maintain any other extra payment channels. I would prefer that people "pay me back" by using the knowledge to build something great.
Let's build the GPT Tokenizer
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
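The training algorithm named above (Byte Pair Encoding) fits in a few lines; here is a minimal, hedged sketch of byte-level BPE in the spirit of minbpe (the function names `get_stats`/`merge` follow the lecture, but the details may differ from the actual repo code):

```python
def get_stats(ids):
    """Count occurrences of each consecutive pair of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # start from raw bytes: token ids 0..255
merges = {}                       # (id, id) -> new id, in creation order
for step in range(3):             # 3 merges for this toy example
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most frequent adjacent pair
    new_id = 256 + step               # new ids start after the 256 byte values
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
```

Each iteration replaces the most frequent adjacent pair with a fresh token id, which is exactly why the compression ratio improves as the vocabulary grows.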
Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary set? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
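The decoding and encoding chapters above reduce to two short functions. A hedged sketch, assuming a `merges` dict (pair of ids -> new token id, kept in creation order) produced by training; the toy `merges` table here is made up purely for illustration:

```python
def build_vocab(merges):
    """Token id -> byte string; ids 0..255 are the raw bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():  # relies on insertion/creation order
        vocab[idx] = vocab[a] + vocab[b]
    return vocab

def decode(ids, vocab):
    """Tokens -> string (invalid UTF-8 replaced rather than erroring)."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

def encode(text, merges):
    """String -> tokens: repeatedly apply the earliest-created applicable merge."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        candidates = [p for p in set(zip(ids, ids[1:])) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=merges.get)  # earliest-created merge first
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids

# toy merge table, made up for illustration: the bytes of "hi" -> token 256
merges = {(104, 105): 256}
vocab = build_vocab(merges)
ids = encode("hihi", merges)
```

Applying merges in creation order during encoding mirrors training, so `decode(encode(s))` round-trips for any valid UTF-8 string.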
Exercises:
- Advised flow: reference this document and try to implement the steps before I give away the partial solutions in the video. The full solutions if you're getting stuck are in the minbpe code github.com/karpathy/minbpe/blob/master/exercise.md
Links:
- Google colab for the video: colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing
- GitHub repo for the video: minBPE github.com/karpathy/minbpe
- Playlist of the whole Zero to Hero series so far: ua-cam.com/video/VMj-3S1tku0/v-deo.html
- our Discord channel: discord.gg/3zy8kqD9Cp
- my Twitter: karpathy
Supplementary links:
- tiktokenizer tiktokenizer.vercel.app
- tiktoken from OpenAI: github.com/openai/tiktoken
- sentencepiece from Google github.com/google/sentencepiece
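On the regex-forced splits covered in the chapters: the real GPT-2 pattern (shipped with tiktoken) uses the third-party `regex` module, Unicode categories, and contraction handling, so the following is only a simplified stdlib approximation of the idea:

```python
import re

# Simplified stand-in for GPT-2's pre-tokenization split: chunk text into
# letter runs, digit runs, other-symbol runs, and whitespace, each with an
# optional leading space, so BPE merges never cross category boundaries.
SIMPLE_PAT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

chunks = SIMPLE_PAT.findall("Hello world!! 123")
```

BPE then runs independently within each chunk; the actual GPT-2 pattern additionally splits off contractions like 's and 't, and the GPT-4 pattern changes the case and digit handling, as discussed in the video.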
Views: 458,073
Videos
[1hr Talk] Intro to Large Language Models
Views: 1.8M · 5 months ago
This is a 1 hour general-audience introduction to Large Language Models: the core technical component behind systems like ChatGPT, Claude, and Bard. What they are, where they are headed, comparisons and analogies to present-day operating systems, and some of the security-related challenges of this new computing paradigm. As of November 2023 (this field moves fast!). Context: This video is based...
Let's build GPT: from scratch, in code, spelled out.
Views: 4.2M · 1 year ago
We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!) . I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framewo...
Building makemore Part 5: Building a WaveNet
Views: 154K · 1 year ago
We take the 2-layer MLP from previous video and make it deeper with a tree-like structure, arriving at a convolutional neural network architecture similar to the WaveNet (2016) from DeepMind. In the WaveNet paper, the same hierarchical architecture is implemented more efficiently using causal dilated convolutions (not yet covered). Along the way we get a better sense of torch.nn and what it is ...
Building makemore Part 4: Becoming a Backprop Ninja
Views: 168K · 1 year ago
We take the 2-layer MLP (with BatchNorm) from the previous video and backpropagate through it manually without using PyTorch autograd's loss.backward(): through the cross entropy loss, 2nd linear layer, tanh, batchnorm, 1st linear layer, and the embedding table. Along the way, we get a strong intuitive understanding about how gradients flow backwards through the compute graph and on the level o...
Building makemore Part 3: Activations & Gradients, BatchNorm
Views: 242K · 1 year ago
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, backward pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you'd want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and ...
Building makemore Part 2: MLP
Views: 271K · 1 year ago
We implement a multilayer perceptron (MLP) character-level language model. In this video we also introduce many basics of machine learning (e.g. model training, learning rate tuning, hyperparameters, evaluation, train/dev/test splits, under/overfitting, etc.). Links: - makemore on github: github.com/karpathy/makemore - jupyter notebook I built in this video: github.com/karpathy/nn-zero-to-hero/...
The spelled-out intro to language modeling: building makemore
Views: 580K · 1 year ago
We implement a bigram character-level language model, which we will further complexify in followup videos into a modern Transformer language model, like GPT. In this video, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently evaluating neural networks and (2) the overall framework of language modeling that includes model training, sampling, and the evaluation ...
Stable diffusion dreams of psychedelic faces
Views: 33K · 1 year ago
Prompt: "psychedelic faces" Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced by one A100 GPU taking about 10 tabs and dreaming about the prompt overnight (~8 hours). While I slept and dreamt about other things. Music: Stars ...
Stable diffusion dreams of steampunk brains
Views: 24K · 1 year ago
Prompt: "ultrarealistic steam punk neural network machine in the shape of a brain, placed on a pedestal, covered with neurons made of gears. dramatic lighting. #unrealengine" Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced ...
Stable diffusion dreams of tattoos
Views: 64K · 1 year ago
Dreams of tattoos. (There are a few discrete jumps in the video because I had to erase portions that got just a little 🌶️, believe I got most of it) Links - Stable diffusion: stability.ai/blog - Code used to make this video: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355 - My twitter: karpathy
The spelled-out intro to neural networks and backpropagation: building micrograd
Views: 1.5M · 1 year ago
This is the most step-by-step spelled-out explanation of backpropagation and training of neural networks. It only assumes basic knowledge of Python and a vague recollection of calculus from high school. Links: - micrograd on github: github.com/karpathy/micrograd - jupyter notebooks I built in this video: github.com/karpathy/nn-zero-to-hero/tree/master/lectures/micrograd - my website: karpathy.a...
Stable diffusion dreams of "blueberry spaghetti" for one night
Views: 47K · 1 year ago
Prompt: "blueberry spaghetti" Stable diffusion takes a noise vector as input and samples an image. To create this video I simply smoothly interpolate between randomly chosen noise vectors and render frames along the way. Links - Stable diffusion: stability.ai/blog - Code used to make this video: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355 - My twitter: karpathy
Stable diffusion dreams of steam punk neural networks
Views: 36K · 1 year ago
A stable diffusion dream. The prompt was "ultrarealistic steam punk neural network machine in the shape of a brain, placed on a pedestal, covered with neurons made of gears. dramatic lighting. #unrealengine" the new and improved v2 version of this video is now here: ua-cam.com/video/2oKjtvYslMY/v-deo.html generated with this hacky script: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af55335...
speak louder please
You've inspired me to start my own channel. This is the best video I've seen all year! Thank you for your contributions and for continuing to inspire us!
Amazing video, thank you so much
Amazing video...
Thank you Mr. Karpathy. I am in love with your teaching. Given how accomplished and experienced you are, you are still teaching with such grace. I dream about sitting in one of your classes and learning from you, and I hope to meet you one day. May you be blessed with good health. Lots of love and respect.
You are incredibly generous. Thanks immensely for sharing your knowledge.
I tried to apply the reversal curse, but ChatGPT is giving the right answer both ways. Can anyone tell me why?
This tutorial was perfect, so helpful! thank you so much for sharing!!!
Thanks Andrej, great video! You made it very easy to understand a complex subject.
Such a superstar!!!!!!
Thank you so much Andrej!
I love how you take neural nets and explain them to us, not via the already built-in functions in PyTorch, but by how things work, then give us the equivalent of it in PyTorch.
GOD
This is the best back propagation explanation I ever got. Congratulations! This is great!
In this lecture, at around 18:50, Andrej says that you get white when the statement is true and black if the statement is false. This is actually the opposite. The dead neuron problem occurs when you get completely black neurons under imshow. In those cases the gradient does not change for the upstream parameters. Does anybody agree?
The legend that introduced me to DL all those years ago continues to awe us. Bravo
Hi! Can you make a video explanation about the new Kolmogorov-Arnold Network (KAN)? It seems to be a new revolution in NN 🤔
This was very effective, knowledgeable and overwhelming at the same time 😂
Brilliant video!
extremely simple and interesting! Thank you!
does anyone else hear that crazy loud ringing noise?
thank-you! audio is very low though
My tiny brain just blew up!
hi everyone, if I have losses like this (train loss 0.6417, val loss 6.7432) but the output is not bad, is that normal?
Hi Andrej, thanks a ton for the free education, I am enjoying learning through the lessons! A quick observation: the output of the explicit counting method and the nn method seem different in the end: counting table sampling -> nn sampling: koneraisah. -> kondlaisah. andhumizarie. -> anchthizarie. Can you comment? My intuition tells me that the nn is just an approximation of the counting table, never exact. Not sure if this is the right way of reasoning. Thanks!
Thank you so much Mr. Andrej Karpathy. I watched and practiced a PyTorch course of like 52 hours and it was awesome, but after watching your video, it seems that I was learning more how to build a neural network than how a neural network works. With your video I know exactly how it works, and I am planning to watch all of this playlist and almost all of your blog posts ❤ Thank you and have a nice day.
Brilliant
Thank you so much!!! For democratizing education and this technology for all of us! AMAZING! Much much love!
❤ > It's (a) pleasure
fine tuning next please
Is it possible to just calculate the gradients once and then you know them? Not resetting and recalculating. What am I missing?
Thanks!! I have never understood better how a neural network works!
One of the best videos on under the hood look in LLMs. Love the clarity and patience Andrej imparts considering he’s such a legend in AI.
Was I the only one invested in the leaf story 😭
Hi, not sure what happened, but I can't join Discord; it says "unable to accept invite". I also have an issue getting the "m" from the multinomial distribution: I get 3 instead of 13 if I use num_samples=1; if I use num_samples>1 I get 13, but I get "mi." if num_samples=2. People online suggest using torch 1.13.1, but I don't have this old version on my macOS.
Wow!!! This is extremely helpful and well-made. Thank you so much!
This guy has become my favorite tutor .
I love this learning resource and the simplicity of the tutorial style. Thanks, Andrej Karpathy.
In the case of a simple Bigram model @32:38 we are sampling only one character; however, while calculating loss we consider the character with the highest probability. The character sampled is unlikely to be the same as the character with the highest probability in the row unless we sample a large number of characters from the multinomial distribution. So, my question is: does the loss function reflect the correct loss? Can anyone help me understand this?
Thanks I really need this level of details to understand what’s going on. ❤
1:25:23 Why is the last layer made "less confident like we saw" and where did we see this?
This is the best video ever about Intro to LLM.
How were you able to run the loop without adding a requires_grad command in the "implementing the training loop, overfitting one batch" section of the video? For me it only worked when I changed the lines to:
g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 2), generator=g, requires_grad=True)
W1 = torch.randn((6, 100), generator=g, requires_grad=True)
b1 = torch.randn(100, generator=g, requires_grad=True)
W2 = torch.randn((100, 27), generator=g, requires_grad=True)
b2 = torch.randn(27, generator=g, requires_grad=True)
parameters = [C, W1, b1, W2, b2]
This is the most amazing video on neural network mathematics knowledge I've ever seen; thank you very much, Andrej!
Something you didn't explain - 51:30 - if we want L to go up we simply need to increase the variables in the direction of the gradient? How come it is so if some gradients are negative?
First question: yes. Second question: because that's the definition of a gradient -> if the gradient is negative, this means that if you make the value smaller, the loss will increase.
This is the single best explanation of backpropagation in code that I've seen so far. Thanks Andrej.
Thanks a lot for the insight and demonstration. I really look forward to more videos from you, Andrej!
Just a random fun fact: with gen = torch.Generator().manual_seed(2147483647), the bigram-generated name I got was `c e x z e .`, amazing.
Thanks a ton for this Andrej! Explained and presented in such simple and relatable terms. Gives confidence to get into the weeds now.
The best ML tutorial video I have watched this year. I really like detailed example, and how these difficult concepts are explained in a simple manner. What a treat for me to watch and learn!