Andrej Karpathy
Let's build the GPT Tokenizer
The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
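For concreteness, a minimal Python sketch of the ideas above: BPE training as repeated pair-merging over the UTF-8 bytes of some text, followed by encode()/decode(). This is illustrative only; the tested implementation lives in the minbpe repo linked below, and the helper names here just follow its spirit.

    def get_stats(ids):
        # count how often each consecutive pair of token ids occurs
        counts = {}
        for pair in zip(ids, ids[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    def merge(ids, pair, idx):
        # replace every occurrence of `pair` in `ids` with the new token `idx`
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(idx)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        return out

    # training: start from raw UTF-8 bytes (ids 0..255) and mint new tokens for frequent pairs
    text = "some training text for the tokenizer"
    ids = list(text.encode("utf-8"))
    merges = {}
    for i in range(20):                           # 20 merges -> vocabulary of 256 + 20 tokens
        stats = get_stats(ids)
        if not stats:
            break
        pair = max(stats, key=stats.get)          # most frequent pair
        idx = 256 + i
        ids = merge(ids, pair, idx)
        merges[pair] = idx

    # decode(): token ids -> string
    vocab = {i: bytes([i]) for i in range(256)}
    for (p0, p1), idx in merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]
    def decode(ids):
        return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

    # encode(): string -> token ids, applying merges in the order they were learned
    def encode(s):
        ids = list(s.encode("utf-8"))
        while len(ids) >= 2:
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: merges.get(p, float("inf")))
            if pair not in merges:
                break
            ids = merge(ids, pair, merges[pair])
        return ids

    print(decode(encode("some training text")) == "some training text")   # True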
Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex (see the short tiktoken sketch after this chapter list)
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set the vocabulary size? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
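For the tiktoken chapters above, a short usage sketch comparing the GPT-2 and GPT-4 encodings (this assumes the tiktoken package is installed; the exact token ids depend on the encoding files it loads):

    import tiktoken

    gpt2 = tiktoken.get_encoding("gpt2")           # BPE used by GPT-2
    gpt4 = tiktoken.get_encoding("cl100k_base")    # BPE used by GPT-4 / ChatGPT

    s = "    hello world 123"
    print(gpt2.encode(s))                          # GPT-2 tends to keep runs of spaces as separate tokens
    print(gpt4.encode(s))                          # GPT-4's regex lets whitespace and digits group differently
    print(gpt4.decode(gpt4.encode(s)) == s)        # encoding then decoding round-trips the string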
Exercises:
- Advised flow: reference the exercise document and try to implement the steps yourself before I give away the partial solutions in the video. If you get stuck, the full solutions are in the minbpe repo: github.com/karpathy/minbpe/blob/master/exercise.md
Links:
- Google colab for the video: colab.research.google.com/drive/1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L?usp=sharing
- GitHub repo for the video: minBPE github.com/karpathy/minbpe
- Playlist of the whole Zero to Hero series so far: ua-cam.com/video/VMj-3S1tku0/v-deo.html
- our Discord channel: discord.gg/3zy8kqD9Cp
- my Twitter: karpathy
Supplementary links:
- tiktokenizer tiktokenizer.vercel.app
- tiktoken from OpenAI: github.com/openai/tiktoken
- sentencepiece from Google github.com/google/sentencepiece
Views: 458,073

Videos

[1hr Talk] Intro to Large Language Models
Views: 1.8M, 5 months ago
This is a 1 hour general-audience introduction to Large Language Models: the core technical component behind systems like ChatGPT, Claude, and Bard. What they are, where they are headed, comparisons and analogies to present-day operating systems, and some of the security-related challenges of this new computing paradigm. As of November 2023 (this field moves fast!). Context: This video is based...
Let's build GPT: from scratch, in code, spelled out.
Views: 4.2M, 1 year ago
We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!) . I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framewo...
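As a taste of what gets built in the video, a minimal single-head causal self-attention sketch in PyTorch; the sizes and variable names are placeholders rather than the video's exact code.

    import torch
    import torch.nn.functional as F

    B, T, C = 4, 8, 32                                 # batch, time (context length), channels
    head_size = 16
    x = torch.randn(B, T, C)

    key = torch.nn.Linear(C, head_size, bias=False)
    query = torch.nn.Linear(C, head_size, bias=False)
    value = torch.nn.Linear(C, head_size, bias=False)

    k, q, v = key(x), query(x), value(x)               # each (B, T, head_size)
    wei = q @ k.transpose(-2, -1) * head_size**-0.5    # scaled attention scores, (B, T, T)
    tril = torch.tril(torch.ones(T, T))
    wei = wei.masked_fill(tril == 0, float("-inf"))    # causal mask: tokens cannot attend to the future
    wei = F.softmax(wei, dim=-1)
    out = wei @ v                                      # weighted aggregation of values, (B, T, head_size)
    print(out.shape)                                   # torch.Size([4, 8, 16])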
Building makemore Part 5: Building a WaveNet
Views: 154K, 1 year ago
We take the 2-layer MLP from previous video and make it deeper with a tree-like structure, arriving at a convolutional neural network architecture similar to the WaveNet (2016) from DeepMind. In the WaveNet paper, the same hierarchical architecture is implemented more efficiently using causal dilated convolutions (not yet covered). Along the way we get a better sense of torch.nn and what it is ...
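For readers curious about the WaveNet detail mentioned above, a rough PyTorch sketch of a causal dilated 1D convolution; the class and its name are illustrative assumptions (the lecture itself builds the hierarchy with plain linear layers).

    import torch
    import torch.nn as nn

    class CausalDilatedConv1d(nn.Module):
        # a 1D convolution that only sees the past, via left-padding
        def __init__(self, channels, kernel_size=2, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

        def forward(self, x):                          # x: (batch, channels, time)
            x = nn.functional.pad(x, (self.pad, 0))    # pad on the left only
            return self.conv(x)

    x = torch.randn(2, 16, 32)
    layer = CausalDilatedConv1d(16, kernel_size=2, dilation=4)
    print(layer(x).shape)                              # torch.Size([2, 16, 32]), same length as the input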
Building makemore Part 4: Becoming a Backprop Ninja
Views: 168K, 1 year ago
We take the 2-layer MLP (with BatchNorm) from the previous video and backpropagate through it manually without using PyTorch autograd's loss.backward(): through the cross entropy loss, 2nd linear layer, tanh, batchnorm, 1st linear layer, and the embedding table. Along the way, we get a strong intuitive understanding about how gradients flow backwards through the compute graph and on the level o...
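A tiny illustration of the exercise's flavor (not the notebook's code): backpropagate through one linear + tanh layer by hand and check the result against autograd.

    import torch

    torch.manual_seed(42)
    x = torch.randn(4, 3, requires_grad=True)
    W = torch.randn(3, 5, requires_grad=True)
    b = torch.randn(5, requires_grad=True)

    h = torch.tanh(x @ W + b)            # forward pass
    loss = h.sum()
    loss.backward()                      # autograd's answer

    dh = torch.ones_like(h)              # dloss/dh
    dpre = dh * (1 - h**2)               # tanh'(z) = 1 - tanh(z)^2
    dW = x.t() @ dpre                    # chain rule through the matmul
    db = dpre.sum(0)
    dx = dpre @ W.t()

    print(torch.allclose(dW, W.grad), torch.allclose(db, b.grad), torch.allclose(dx, x.grad))  # True True True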
Building makemore Part 3: Activations & Gradients, BatchNorm
Views: 242K, 1 year ago
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, backward pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you'd want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and ...
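A small sketch of the kind of diagnostic discussed, printing activation statistics and tanh saturation layer by layer; the initialization and sizes here are illustrative, not the lecture's code.

    import torch

    torch.manual_seed(0)
    x = torch.randn(32, 100)
    gain = 5 / 3                                     # recommended gain for tanh

    for layer in range(1, 5):
        W = torch.randn(100, 100) * gain / 100**0.5  # fan-in scaled init
        x = torch.tanh(x @ W)
        saturated = (x.abs() > 0.97).float().mean()  # fraction of near-saturated units
        print(f"layer {layer}: mean {x.mean():+.3f}, std {x.std():.3f}, saturated {saturated:.2%}")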
Building makemore Part 2: MLP
Views: 271K, 1 year ago
We implement a multilayer perceptron (MLP) character-level language model. In this video we also introduce many basics of machine learning (e.g. model training, learning rate tuning, hyperparameters, evaluation, train/dev/test splits, under/overfitting, etc.). Links: - makemore on github: github.com/karpathy/makemore - jupyter notebook I built in this video: github.com/karpathy/nn-zero-to-hero/...
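A compressed sketch of the model shape described above, with placeholder sizes (the notebook linked in the description has the real thing):

    import torch
    import torch.nn.functional as F

    vocab_size, block_size, emb_dim, hidden = 27, 3, 10, 200
    g = torch.Generator().manual_seed(2147483647)

    C  = torch.randn((vocab_size, emb_dim), generator=g)          # character embedding table
    W1 = torch.randn((block_size * emb_dim, hidden), generator=g)
    b1 = torch.randn(hidden, generator=g)
    W2 = torch.randn((hidden, vocab_size), generator=g)
    b2 = torch.randn(vocab_size, generator=g)

    X = torch.randint(0, vocab_size, (32, block_size))            # a fake batch of 3-character contexts
    Y = torch.randint(0, vocab_size, (32,))                       # next-character targets

    emb = C[X]                                                    # (32, block_size, emb_dim)
    h = torch.tanh(emb.view(32, -1) @ W1 + b1)                    # hidden layer
    logits = h @ W2 + b2
    print(F.cross_entropy(logits, Y).item())                      # the loss we would minimize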
The spelled-out intro to language modeling: building makemore
Views: 580K, 1 year ago
We implement a bigram character-level language model, which we will further complexify in followup videos into a modern Transformer language model, like GPT. In this video, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently evaluating neural networks and (2) the overall framework of language modeling that includes model training, sampling, and the evaluation ...
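A minimal sketch of the counting flavor of the bigram model; the word list below is a tiny stand-in for the names dataset used in the video.

    import torch

    words = ["emma", "olivia", "ava"]
    chars = sorted(set("".join(words)))
    stoi = {s: i + 1 for i, s in enumerate(chars)}
    stoi["."] = 0                                          # start/end marker
    itos = {i: s for s, i in stoi.items()}

    N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
    for w in words:
        chs = ["."] + list(w) + ["."]
        for ch1, ch2 in zip(chs, chs[1:]):
            N[stoi[ch1], stoi[ch2]] += 1                   # count each bigram

    P = (N + 1).float()
    P /= P.sum(1, keepdim=True)                            # each row: distribution over the next character

    g = torch.Generator().manual_seed(2147483647)
    ix, out = 0, []
    while True:
        ix = torch.multinomial(P[ix], num_samples=1, replacement=True, generator=g).item()
        if ix == 0:
            break
        out.append(itos[ix])
    print("".join(out))                                    # one sampled 'name'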
Stable diffusion dreams of psychedelic faces
Views: 33K, 1 year ago
Prompt: "psychedelic faces" Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced by one A100 GPU taking about 10 tabs and dreaming about the prompt overnight (~8 hours). While I slept and dreamt about other things. Music: Stars ...
Stable diffusion dreams of steampunk brains
Views: 24K, 1 year ago
Prompt: "ultrarealistic steam punk neural network machine in the shape of a brain, placed on a pedestal, covered with neurons made of gears. dramatic lighting. #unrealengine" Stable diffusion takes a noise vector as input and samples an image. To create this video I smoothly (spherically) interpolate between randomly chosen noise vectors and render frames along the way. This video was produced ...
Stable diffusion dreams of tattoos
Views: 64K, 1 year ago
Dreams of tattoos. (There are a few discrete jumps in the video because I had to erase portions that got just a little 🌶️, believe I got most of it) Links - Stable diffusion: stability.ai/blog - Code used to make this video: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355 - My twitter: karpathy
The spelled-out intro to neural networks and backpropagation: building micrograd
Views: 1.5M, 1 year ago
This is the most step-by-step spelled-out explanation of backpropagation and training of neural networks. It only assumes basic knowledge of Python and a vague recollection of calculus from high school. Links: - micrograd on github: github.com/karpathy/micrograd - jupyter notebooks I built in this video: github.com/karpathy/nn-zero-to-hero/tree/master/lectures/micrograd - my website: karpathy.a...
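A tiny illustration of the core idea (manual chain rule on a scalar expression, verified numerically); this is not micrograd's actual code.

    # d = a*b + c; backpropagate by hand, then sanity-check with a finite difference
    a, b, c = 2.0, -3.0, 10.0
    d = a * b + c

    dd_da = b                              # chain rule: d depends on a through the product a*b
    dd_db = a
    dd_dc = 1.0

    h = 1e-6
    num_da = ((a + h) * b + c - d) / h     # numerical derivative with respect to a
    print(dd_da, num_da)                   # both approximately -3.0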
Stable diffusion dreams of "blueberry spaghetti" for one night
Views: 47K, 1 year ago
Prompt: "blueberry spaghetti" Stable diffusion takes a noise vector as input and samples an image. To create this video I simply smoothly interpolate between randomly chosen noise vectors and render frames along the way. Links - Stable diffusion: stability.ai/blog - Code used to make this video: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355 - My twitter: karpathy
Stable diffusion dreams of steam punk neural networks
Views: 36K, 1 year ago
A stable diffusion dream. The prompt was "ultrarealistic steam punk neural network machine in the shape of a brain, placed on a pedestal, covered with neurons made of gears. dramatic lighting. #unrealengine" the new and improved v2 version of this video is now here: ua-cam.com/video/2oKjtvYslMY/v-deo.html generated with this hacky script: gist.github.com/karpathy/00103b0037c5aaea32fe1da1af55335...

COMMENTS

  • @nikakutaladze5544, 29 minutes ago

    speak louder please

  • @DigitalMirrorComputing, 7 hours ago

    You've inspired me to start my own channel. This is the best video I've seen all year! Thank you for your contributions and for continuing to inspire us!

  • @alexanderbowler1701, 15 hours ago

    Amazing video, thank you so much

  • @IshiShiv, 17 hours ago

    Amazing video...

  • @ragibshahriyear3682, 18 hours ago

    Thank you Mr. Karpathy. I am in love with your teaching. Given how accomplished and experienced you are, you are still teaching with such grace. I dream about sitting in one of your classes and learning from you, and I hope to meet you one day. May you be blessed with good health. Lots of love and respect.

  • @albertoprb1, 21 hours ago

    You are incredibly generous. Thanks immensely for sharing your knowledge.

  • @stutisrivastava2925, 1 day ago

    I tried to apply the reversal curse, but ChatGPT gives the right answer both ways. Can anyone tell me why?

  • @despoinakallirroikosmopoul4042

    This tutorial was perfect, so helpful! thank you so much for sharing!!!

  • @user-ep6mb5kw7r, 1 day ago

    Thanks Andrej, great video! You make it very easy to understand a complex subject.

  • @Mahi-sf8ck, 1 day ago

    Such a superstar!!!!!!

  • @Ali-lt1kb, 1 day ago

    Thank you so much Andrej!

  • @FireFly969, 1 day ago

    I love how you take neural nets and explain them to us not with the functions already built into PyTorch, but by showing how things work, and then giving us the PyTorch equivalent.

  • @mnbvzxcv1, 1 day ago

    GOD

  • @PopescuAlexandruCristian, 2 days ago

    This is the best back propagation explanation I ever got. Congratulations! This is great!

  • @AndresNamm, 2 days ago

    In this lecture, at around 18:50, Andrej says that you get white when the statement is true and black when the statement is false. This is actually the opposite. The dead neuron problem occurs when you get completely black neurons under imshow. In those cases the gradient does not change for the upstream parameters. Does anybody agree?

  • @Alley00Cat, 2 days ago

    The legend that introduced me to DL all those years ago continues to awe us. Bravo

  • @dead_d20dice67, 2 days ago

    Hi! Can you make a video explanation about the new Kolmogorov-Arnold Network (KAN)? It seems to be a new revolution in NNs 🤔

  • @drtech6521, 2 days ago

    This was very effective, knowledgeable and overwhelming at the same time 😂

  • @alexanderchernikov7497, 2 days ago

    Brilliant video!

  • @jianjielu9835, 2 days ago

    extremely simple and interesting! Thank you!

  • @matthewhendricks648, 2 days ago

    does anyone else hear that crazy loud ringing noise?

  • @herashak, 3 days ago

    thank-you! audio is very low though

  • @percevil8050, 3 days ago

    My tiny brain just blew up!

  • @GlitchiPitch, 3 days ago

    hi everyone, if I have losses like this (train loss 0.6417, val loss 6.7432) but the output is not bad, is it normal?))

  • @hongwuhuai, 3 days ago

    Hi Andrej, thanks a ton for the free education, I am enjoying learning through the lessons! A quick observation: the output of the explicit counting method and the nn method seem different in the end (counting table sampling -> nn sampling: koneraisah. -> kondlaisah., andhumizarie. -> anchthizarie.). Can you comment? My intuition is telling me that the nn is just an approximation of the counting table, never exact. Not sure if this is the right way of reasoning. Thanks!

  • @FireFly969, 3 days ago

    Thank you so much Mr. Andrej Karpathy. I watched and practiced a PyTorch course of like 52 hours and it was awesome, but after watching your video, it seems that I was learning more how to build a neural network than how a neural network works. With your video I know exactly how it works, and I am planning to watch all of this playlist and almost all of your blog posts ❤ thank you and have a nice day.

  • @warwicknexus160, 3 days ago

    Brilliant

  • @oleksandrasaskia, 3 days ago

    Thank you so much!!! For democratizing education and this technology for all of us! AMAZING! Much much love!

  • @monocles.IcedPeaksOfFire, 3 days ago

    ❤ > It's (a) pleasure

  • @akhilphilnat, 4 days ago

    fine tuning next please

  • @user-cb3pf6qf2z, 4 days ago

    Is it possible to just calculate the gradients once and then you know them? Not resetting and recalculating. What am I missing?

  • @lielbn0, 4 days ago

    Thanks!! I have never understood better how a neural network works!

  • @Clammer999, 4 days ago

    One of the best under-the-hood looks at LLMs. Love the clarity and patience Andrej imparts, considering he's such a legend in AI.

  • @anthonyjackson7644, 4 days ago

    Was I the only one invested in the leaf story 😭

  • @mikezhao1838, 4 days ago

    Hi, not sure what happened, but I can't join the Discord; it says "unable to accept invite". I also have an issue getting the "m" from the multinomial distribution: I get 3 instead of 13 if I use num_samples=1; if I use num_samples>1 I get 13, but I get "mi." if num_samples=2. People online suggest using torch 1.13.1, but I don't have that old version on my macOS.

  • @AsmaKhan-lk3wb, 5 days ago

    Wow!!! This is extremely helpful and well-made. Thank you so much!

  • @AdityaAVG, 5 days ago

    This guy has become my favorite tutor.

  • @ezekwu77, 5 days ago

    I love this learning resource and the simplicity of the tutorial style. Thanks, Andrej Karpathy.

  • @switchwithSagar, 5 days ago

    In the case of the simple bigram model @32:38 we are sampling only one character; however, while calculating loss we consider the character with the highest probability. The character sampled is unlikely to be the same as the character with the highest probability in the row unless we sample a large number of characters from the multinomial distribution. So, my question is: does the loss function reflect the correct loss? Can anyone help me understand this?

  • @meow-mi333, 5 days ago

    Thanks, I really need this level of detail to understand what's going on. ❤

  • @MagicBoterham, 5 days ago

    1:25:23 Why is the last layer made "less confident like we saw" and where did we see this?

  • @marcelomenezes3796, 5 days ago

    This is the best video ever about Intro to LLM.

  • @MrManlify, 5 days ago

    How were you able to run the loop without adding a requires_grad command in the "implementing the training loop, overfitting one batch" section of the video? For me it only worked when I changed the lines to:

    g = torch.Generator().manual_seed(2147483647)  # for reproducibility
    C = torch.randn((27, 2), generator=g, requires_grad=True)
    W1 = torch.randn((6, 100), generator=g, requires_grad=True)
    b1 = torch.randn(100, generator=g, requires_grad=True)
    W2 = torch.randn((100, 27), generator=g, requires_grad=True)
    b2 = torch.randn(27, generator=g, requires_grad=True)
    parameters = [C, W1, b1, W2, b2]

  • @AlexTang99, 6 days ago

    This is the most amazing video on neural network mathematics knowledge I've ever seen; thank you very much, Andrej!

  • @adirmashiach4639, 6 days ago

    Something you didn't explain (51:30): if we want L to go up, we simply need to nudge the variables in the direction of the gradient? How come that is so if some gradients are negative?

    • @yourxylitol, 6 days ago

      First question: yes. Second question: because that's the definition of a gradient: if the gradient is negative, making the variable smaller increases the loss.
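      A quick numeric sketch of the point above (illustrative only, not from the video): stepping a parameter along its gradient raises the loss even when that gradient is negative.

        import torch

        x = torch.tensor(-3.0, requires_grad=True)
        loss = x ** 2
        loss.backward()
        print(x.grad)                      # tensor(-6.), a negative gradient

        with torch.no_grad():
            x += 0.1 * x.grad              # step *along* the gradient
        print((x ** 2).item())             # 12.96 > 9.0, so the loss went up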

  • @howardbaek5413, 6 days ago

    This is the single best explanation of backpropagation in code that I've seen so far. Thanks Andrej.

  • @ThefirstrobloxCEO989, 6 days ago

    Thanks a lot for the insight and demonstration. I really look forward to more videos from you, Andrej!

  • @debdeepsanyal9030, 6 days ago

    Just a random fun fact: with gen = torch.Generator().manual_seed(2147483647), the bigram-generated name I got was `c e x z e .`, amazing.

  • @mehulchopra1517, 7 days ago

    Thanks a ton for this Andrej! Explained and presented in such simple and relatable terms. Gives confidence to get into the weeds now.

  • @wangcwy, 7 days ago

    The best ML tutorial video I have watched this year. I really like the detailed examples and how these difficult concepts are explained in a simple manner. What a treat for me to watch and learn!