This post wraps up my notes on chapter 5 of Sebastian Raschka’s book “Build a Large Language Model (from Scratch)”. Understanding cross-entropy loss and perplexity was the hard part for me in this chapter — the remaining 28 pages were more a case of plugging bits together and running the code to see what happens.

The shortness of this post almost feels like a damp squib. After writing so much in the last 22 posts, there’s really not all that much to say — but that hides the fact that this part of the book is probably the most exciting to work through. All these pieces developed with such care, and with so much to learn, over the preceding 140 pages, with not all that much to show — and suddenly, we have a codebase that we can let rip on a training set — and our model starts talking to us!

I trained my model on the sample dataset that we use in the book, the 20,000 characters of “The Verdict” by Edith Wharton, and then ran it to predict next tokens after “Every effort moves you”. I got:

The next step was to download the weights for the original 124M-parameter version of GPT-2 from OpenAI, following the instructions in the book, and then to load them into my model. With those weights, against the same prompt, I got this:

That’s amazingly cool. Coherent enough that you could believe it’s part of the instructions for a game.

Now, I won’t go through the remainder of the chapter in detail — as I said, it’s essentially just plugging together the various bits that we’ve gone through so far, even though the results are brilliant. In this post I’m just going to make a few brief notes on the things that I found interesting.

One thing I really do recommend to anyone working through the book is that you type in all of the code, and run it yourself — it really will help you remember how stuff fits together.

There is one slight issue I found with that, however: the book has a number of examples where you get output from code that uses randomness — for example, where you look at the loss on some sample text before you start training, or have the model generate samples during training.

Now, in theory, because Raschka puts torch.manual_seed calls before all of these, the results you get should be exactly the same as the outputs in the book. However, the amount of code we’re working with at this stage is quite large — we have various helper functions that were created in earlier sections, for example. And some of these use randomness.

That means that to get the same results as the ones in the book, you would need to ensure that all of the code that uses randomness was running in exactly the same order as it was when Raschka did it for the book. That turns out to be surprisingly hard!
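To make that concrete, here’s a minimal sketch (not code from the book) showing why call order matters: any extra consumer of PyTorch’s global RNG stream changes everything drawn after it, even when you start from the same seed.

```python
import torch

# Same seed, same sequence of calls: identical results.
torch.manual_seed(123)
a = torch.rand(3)

torch.manual_seed(123)
b = torch.rand(3)
assert torch.equal(a, b)

# Same seed, but an extra random call sneaks in first (say, a helper
# function that initialises some weights): the "same" draw now gives
# different numbers, because the RNG stream has advanced.
torch.manual_seed(123)
_ = torch.rand(1)  # an extra consumer of the RNG stream
c = torch.rand(3)
assert not torch.equal(a, c)
```

So one stray helper that initialises weights or shuffles data before the call you care about is enough to desynchronise your outputs from the book’s.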

When I have built simple backpropagation through neural networks in the past, I’ve generally updated parameters by multiplying the gradients by a small number, the learning rate, and then subtracting them from their respective parameters to get updated ones — classic stochastic gradient descent.
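That classic update can be sketched in a few lines of PyTorch — this is a toy example with made-up tensors, not the book’s training code:

```python
import torch

# A hypothetical linear model, just to show the classic update rule.
w = torch.randn(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.01  # the learning rate: a small scaling factor for gradients

x = torch.randn(8, 3)
y = torch.randn(8)

loss = ((x @ w + b - y) ** 2).mean()
loss.backward()  # populates w.grad and b.grad

with torch.no_grad():
    # Classic SGD: parameter -= learning_rate * gradient
    w -= lr * w.grad
    b -= lr * b.grad
    # Clear the gradients ready for the next step
    w.grad.zero_()
    b.grad.zero_()
```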

In practice, with AdamW, you initialise it at the start of your training loop, with a learning rate (which I imagine is similar to the one my older code used, a scaling factor for gradients) and a weight decay (:shrug:). You also provide it with the parameters it’s going to be managing.

In the training loop, at the start of each input batch, you tell it to zero out the gradients it’s managing with optimizer.zero_grad(), run the data through your model and calculate your loss, and then after calling loss.backward() to get your gradients, you just call optimizer.step(), and that does the parameter update.
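Putting those pieces together, the skeleton of the loop looks something like this — a hedged sketch with a stand-in model and made-up hyperparameter values, not the book’s exact code:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the GPT model
optimizer = torch.optim.AdamW(
    model.parameters(),  # the parameters AdamW will manage
    lr=4e-4,             # illustrative learning rate
    weight_decay=0.1,    # illustrative weight decay
)

for _ in range(3):  # stand-in for iterating over input batches
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 1)

    optimizer.zero_grad()  # zero out the gradients it's managing
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()        # compute the gradients
    optimizer.step()       # let AdamW do the parameter update
```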

Again, I want to dig into how optimisers work in more detail in the future. But for now, I think that’s all I need to know.

The book tells you how to train on a public domain book, “The Verdict” by Edith Wharton. Full training on the hardware that people are likely to have to hand would be extremely expensive, so we just train on that short example, then later on learn how to download and use the weights that OpenAI made available for their GPT-2 models.

This makes perfect sense, of course — there’s a really good reason why AI training is normally done on GPUs or custom hardware, and the MacBook Air would presumably be training on the CPU. But I was a little surprised at how huge the difference was in this simple example!

Andrej Karpathy was able to train a 124M GPT-2 model for $20, using his hand-written C/CUDA LLM system llm.c. That is undoubtedly more efficient than the PyTorch code that we’re working on in this book. But it really would be interesting to find out whether it would be doable for me at all! The training data he used is the 10B-token version of the FineWeb collection, which is freely available. [1]

One thing I found a little confusing in this chapter — and this is very much a nit — was the section on preventing “memorisation”; I think this was due to a mismatch in the meaning I attach to the word, and the way it’s used here.

In the book, “memorisation” is being used to mean something more like what I’d call “parroting” — issues with the model just repeating the stuff that it has memorised, because it was always choosing the most-probable next word. Avoiding this is super-important, of course! It’s just the framing that confused me a little.

And finally, the top-k technique — only consider the k most probable tokens, and then do the temperature/softmax calculations — is a sensible thing to layer on top of that. The code is clever: identify the top k logits, get the value of the lowest one of them, and then replace every logit less than that with minus infinity. When you run that through softmax, you get zeros for the ones that were replaced, and the probability distribution is based on the remainder.
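The trick can be sketched like this, with a made-up logits vector (the book’s actual implementation may differ in the details):

```python
import torch

logits = torch.tensor([4.0, 1.0, 3.5, 0.5, 2.0])  # hypothetical logits
k = 3

# topk returns values sorted in descending order, so the last one
# is the smallest of the top k.
top_logits, _ = torch.topk(logits, k)
min_of_top_k = top_logits[-1]

# Replace every logit below that threshold with -inf...
masked = torch.where(
    logits < min_of_top_k,
    torch.tensor(float("-inf")),
    logits,
)
# ...so that softmax assigns those positions exactly zero probability,
# and the distribution is spread over the remaining k tokens only.
probs = torch.softmax(masked, dim=-1)
```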

So: excellent stuff, and very well explained in the book — it just didn’t feel like preventing “memorisation” specifically was what it was doing, at least based on what I take the word to mean.

At the end of the chapter, we download the weights for the original GPT-2 model that OpenAI produced from their site, and load them into our own model.

One thing I did notice while going through that section was that I’d been making a mistake as I wrote up this series; I’d thought that all GPT-2 models had 768 embedding dimensions. It turns out that this is only true of the 124M model in that series, and the larger ones have more. That makes a lot of sense — and I’ve updated the older posts to reflect it.
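For reference, the embedding dimensions across the GPT-2 family look like this — only the smallest model uses 768:

```python
# Embedding dimensions for the four GPT-2 model sizes.
gpt2_emb_dims = {
    "gpt2-small (124M)": 768,
    "gpt2-medium (355M)": 1024,
    "gpt2-large (774M)": 1280,
    "gpt2-xl (1558M)": 1600,
}
```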

Next up: using it to classify text. Will this be quick and easy? Or will it lead down another fascinating rabbit hole? Time will tell…

My instinct is that it doesn’t actually matter all that much. So long as the loss numbers that you see are in the same ballpark as the ones in the book, and the outputs you see are roughly equally incoherent (before training) and become more coherent at what feels like the same kind of rate, you’re fine. Probably the most important one to look out for is when the training run starts — you should see loss on the training set decreasing steadily, just like in the book, and likewise as in the book, the validation loss should plateau out pretty early.

I don’t understand how optimisers work in any detail, and I’m going to have to dig into that in the future. However, my high-level simplified picture right now is that they dynamically adjust the learning rate over time, so that it’s easier to take big “jumps” downwards on the gradients when you start, and then smaller ones later. I believe they can also sometimes avoid local minima in the loss landscape — a nice metaphor I read somewhere (lost the source, sadly) was that simple gradient descent was like rolling a ball down a hill, but (some?) optimisers give the ball a bit of momentum so that it can coast over a small uphill portion, so long as the general slope is downwards.
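The momentum part of that metaphor, at least, is simple enough to sketch — this is a toy 1-D example of a momentum update, not how AdamW actually works (AdamW also adapts per-parameter learning rates, among other things):

```python
# Toy 1-D loss landscape f(x) = x^2, whose gradient is 2x.
# The velocity term is the "ball" from the metaphor: each step blends
# the fresh gradient with the direction of travel so far, which is
# what lets the ball coast through small uphill bumps.
def momentum_step(x, velocity, lr=0.1, beta=0.9):
    gradient = 2 * x  # gradient of x^2 at the current position
    velocity = beta * velocity - lr * gradient
    return x + velocity, velocity

x, v = 1.0, 0.0
for _ in range(20):
    x, v = momentum_step(x, v)
# x overshoots and oscillates a bit, but heads towards the minimum at 0.
```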

I think I have a good candidate for a next project when I’ve finished the book; see how many tokens/second I can train on locally — that will allow me to estimate how long it would take to train one epoch over the whole training set. I imagine that will be longer than I’m willing to leave my desktop machine tied up doing this, but then I can try mixing in the lessons I learned doing fine-tuning, and see if I can get it up and running on Lambda Labs. If the cost is in the tens of dollars, or even a hundred or so, I really think it would be worthwhile!

The techniques are nifty, anyway. The first cut — just use the softmaxed logits as a probability distribution and sample from it — is obvious enough. Temperature is a clever trick on top of that — just divide the logits by some number greater than one before softmax, and you can make the distribution that comes out flatter (or you can make it more “pointy” by dividing by a number less than 1). The graphs in the book showing how that works are great, but I asked Claude to knock together a temperature playground website, which I found made things even clearer to me.
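Here’s a small sketch of the temperature trick with a made-up logits vector — dividing by a larger number flattens the distribution, dividing by a smaller one sharpens it:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical logits

def temperature_probs(logits, temperature):
    # Divide the logits by the temperature before softmax.
    return torch.softmax(logits / temperature, dim=-1)

cold = temperature_probs(logits, 0.5)   # "pointier": top token dominates
plain = temperature_probs(logits, 1.0)  # unchanged distribution
hot = temperature_probs(logits, 2.0)    # flatter: more diverse samples

# Higher temperature lowers the top token's probability...
assert float(hot[0]) < float(plain[0]) < float(cold[0])
# ...and the next token is then sampled from the distribution.
next_token = torch.multinomial(hot, num_samples=1)
```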

Not bad for a model trained on such a small amount of data (in just over ten seconds).

On my machine using CUDA on an RTX 3090, it took just less than eleven seconds.

Now, while the book mentions that Llama 2 probably cost hundreds of thousands of dollars to train, I must admit that I do wonder how much it really would cost to train a 124M parameter model on my own hardware — or, indeed, on the machines with 8x 80GiB A100 GPUs that I rented from Lambda Labs during my fine-tuning experiments.

[1] His new nanochat — a from-scratch trainable chatbot — is even cooler.

Writing an LLM from scratch, part 22 — finally training our LLM!

