Language models and transformers from scratch
I recently did some exercises on (small) language models.
It is still quite a foreign field to me, so the best way for me to appreciate it was to start from a very simple structure, implement each piece of the engine directly, and spell out all the matrices and vectors. I found Andrej Karpathy’s lectures useful: they let me learn the basics and implement a few simple models, including the transformer. There are also many other good resources (see the references at the end).
All of my exercises are done in scratch notebooks. Each is self-contained and includes narrative, so I won’t copy them here. Instead, I will highlight only a few points that I found interesting.
Sample results
Let’s first see what I got in the end. Here are sample results: names generated by the different models. The names generated by the MLP are clearly more realistic than those from the trigram or bigram models.
Bigram | Trigram | MLP |
---|---|---|
ya | se | yeesha |
syahavilin | emahle | leilanni |
dleekahmangwnya | em | camryn |
tryahdachen | jlbim | maryda |
ena | bannya | trysten |
da | oryah | hender |
amiiae | niewena | ellysoniah |
a | va | abdijas |
keles | fajiah | louanne |
ly | fakeles | alivia |
Next, here is a sample of Shakespeare-like text generated by a simple (decoder-only) transformer. Many of the words are not actual English words, mainly because I tokenized each character rather than each word (a minimal sketch of such a tokenizer follows the sample), but I think this artificial writing is not super crazy :)
I which I long them-fares here surked,
Whil with pursection my faid do.
Shephered' and my stande couse.
All, I you dreece! the nature! If 'He even grong mise's suck which you more
You evens as thou somet. I do hear,
That at untery that longs doin the,
For my uncred-hen arming to the conked.
vire some, becurie 'tis queel!
QUEEN VI ELIVA:
Yill greaste at same insomest my my diest,
That's but comelful; she appayon'd take to Musire herse;
Too did and was for undom to by do eath that Edvil
He lornk to end not one breada master: hold,
inle your shighter thee of right
Hast onchemes mytredom of ac his dardient him.
AUTOLUMBERLA:
It lord?
But SBoht that you unsues I dare,
Hut stren tongentord of him distrain.
Thou do to his it; I to I know out to his lose grast,
Whom shame do lady the comegare. Where Cath wive, your prince the is incice
Look him, if with I will love!
...
UKE OF YORK:
Wife our will looke.
AUTOLYCUS:
Is them sidoim arming na
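For reference, the character-level tokenizer itself is tiny. Here is a minimal sketch (not the exact notebook code; it assumes the Shakespeare corpus sits in a text file, called input.txt here purely for illustration):

```python
# A character-level tokenizer: the vocabulary is just every distinct character.
with open("input.txt", "r", encoding="utf-8") as f:   # hypothetical path to the corpus
    text = f.read()

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(decode(encode("hii there")))   # round-trips back to "hii there"
```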
Bigram
This is one of the simplest ideas in language models, but I found that it is not bad at all, even with character-level tokens. For example, if someone gives me the letter q and asks what the next letter would be, I would be highly likely to say u. This is simply because the transition likelihood from one letter to another is NOT uniform in English. See the frequency table.
Frequency table. The likelihood of one character following another is not homogeneous.
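As a rough sketch of this idea (not the exact notebook code; it assumes a names.txt file with one name per line and uses . as the start/end marker), the whole bigram "model" is just a normalized count matrix over consecutive character pairs:

```python
import torch

# Hypothetical file with one name per line; "." marks the start/end of a name.
words = open("names.txt").read().splitlines()
chars = ["."] + sorted(set("".join(words)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# Count how often character c2 follows character c1.
N = torch.zeros((len(chars), len(chars)), dtype=torch.int32)
for w in words:
    padded = ["."] + list(w) + ["."]
    for c1, c2 in zip(padded, padded[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Row-normalize the counts into next-character probabilities (+1 for smoothing).
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)

# Sample a name: repeatedly draw the next character until "." comes up again.
g = torch.Generator().manual_seed(42)
ix, out = stoi["."], []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == stoi["."]:
        break
    out.append(itos[ix])
print("".join(out))
```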
Trigram
This was not covered in Andrej’s lectures, but I simply revised the bigram code so that the context is two characters instead of one. Although this performs better than the bigram, it suffers from sparsity in the mapping table (e.g., there is no name containing xmk), so further naive extensions to longer contexts would not be workable (the sketch below illustrates this). That is one of the motivations for the MLP approach (the neural probabilistic language model).
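As a rough sketch of this change (again assuming a names.txt file as in the bigram sketch above), the count tensor simply gets one more axis, which is exactly where the sparsity bites:

```python
import torch

words = open("names.txt").read().splitlines()        # same hypothetical file as above
chars = ["."] + sorted(set("".join(words)))
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

# Count how often a character follows a *pair* of characters.
N3 = torch.zeros((V, V, V), dtype=torch.int32)       # (prev2, prev1, next)
for w in words:
    padded = [".", "."] + list(w) + ["."]
    for c1, c2, c3 in zip(padded, padded[1:], padded[2:]):
        N3[stoi[c1], stoi[c2], stoi[c3]] += 1

# Many of the V*V two-character contexts never occur in the data, so the count
# table is sparse; naive extensions to 4-grams, 5-grams, ... only make this worse.
never_seen = (N3.sum(dim=-1) == 0).float().mean()
print(f"fraction of never-seen two-character contexts: {never_seen:.2f}")
```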
MLP
Core idea
The implementation of a multi-layer perceptron (MLP) language model here is based on the paper A Neural Probabilistic Language Model by Bengio et al. What is fascinating is that it turns tokens into a representation the neural network can ingest, and lets the network do the work of emitting the next token.
To be specific, this model associates each character with a feature vector. For example, consider the string emma. Each window of 3 consecutive characters (with . as the empty/padding character) can be represented by embedding each character into a 2D-vector1:
.em = [ 0 5 13] --> [ 0.43047658 -0.34990272 -2.3860264 0.69941294 -1.1255676 0.02481801]
emm = [ 5 13 13] --> [-2.3860264 0.69941294 -1.1255676 0.02481801 -1.1255676 0.02481801]
mma = [13 13 1] --> [-1.1255676 0.02481801 -1.1255676 0.02481801 0.47494158 0.9040745 ]
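Here is a rough sketch of that embedding lookup in PyTorch (the embedding table below is randomly initialized, so the exact numbers will differ from those above):

```python
import torch

V, d, block_size = 27, 2, 3       # vocabulary size, embedding dimension, context length
C = torch.randn((V, d))           # one (learnable) 2-D feature vector per character

# The three contexts for "emma", using "." = 0, "a" = 1, ..., "z" = 26.
X = torch.tensor([
    [0, 5, 13],    # ".em"
    [5, 13, 13],   # "emm"
    [13, 13, 1],   # "mma"
])

emb = C[X]                         # (3, block_size, d): look up each character's vector
flat = emb.view(emb.shape[0], -1)  # (3, block_size*d): concatenate the three vectors
print(flat)                        # each row is a 6-number representation, as shown above
```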
Why does this work?
- Curse of dimensionality: the dimension of the feature vectors (<100) is much smaller than the vocabulary size (>10000), yet with iterative training the vector representations can capture the relevant knowledge and contexts effectively.
- Generalizability: the numerical vectors can capture similarities between tokens. For example, dog/cat or the/a have semantic and syntactic similarities, so the neural model naturally generalizes from The dog was walking in the room to A cat was walking in a room, even if the latter sentence is not in the training data (a small similarity check is sketched after the figure below).
The implemented architecture. Image from A Neural Probabilistic Language Model, Y. Bengio, et al., JMLR 2003
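Here is a tiny sketch of how such similarities could be checked once the embedding table is trained. The table C below is just a random stand-in for a trained one, and the character-to-index mapping follows the convention used above:

```python
import torch
import torch.nn.functional as F

# Stand-in for a *trained* embedding table; in practice C comes out of training.
V, d = 27, 2
C = torch.randn(V, d)
stoi = {ch: i for i, ch in enumerate(["."] + [chr(ord("a") + k) for k in range(26)])}

def similarity(a, b):
    """Cosine similarity between the embedding vectors of two characters."""
    return F.cosine_similarity(C[stoi[a]], C[stoi[b]], dim=0).item()

# With a trained table, characters the model treats similarly (e.g. vowels)
# tend to end up with more similar vectors than unrelated ones.
print(similarity("a", "e"), similarity("a", "x"))
```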
Side-track: monitoring neurons
Aside from building language models, Andrej also gave a separate lecture on how to diagnose neural networks. It taught me to observe how the values and gradients of the neurons in each layer behave. For example, we can see that the distribution of activation values shrinks from one layer to the next. This is due to the Tanh layers, which squash the distributions more and more.
Before normalizations and corrections are applied
But we don’t want this shrinking distribution. Why? Because we want some information (fluctuations) to keep flowing from one layer to the next. After adding batch normalization and a gain factor for the Tanh activation function2, the activation values are much more stable.
After normalizations and corrections are applied
I personally really loved this part: having better control over the neurons.
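To give a flavor of this monitoring, here is a rough sketch (a toy stack, not the notebook’s exact code) that tracks the spread and saturation of the Tanh activations layer by layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 100)                   # a fake batch of inputs

# A toy stack of Linear -> Tanh layers. Without normalization or a gain factor,
# the activations get squashed more and more as they pass through the Tanh layers.
layers = []
for _ in range(5):
    layers += [nn.Linear(100, 100), nn.Tanh()]
model = nn.Sequential(*layers)

h = x
for i, layer in enumerate(model):
    h = layer(h)
    if isinstance(layer, nn.Tanh):
        std = h.std().item()                          # spread of the activations
        sat = (h.abs() > 0.97).float().mean().item()  # fraction of saturated units
        print(f"tanh layer {i // 2}: std={std:.3f}, saturated={sat:.2%}")
```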
Transformer
The key innovation of the transformer is that it overcomes the limitation of previous neural network models in capturing long-range correlations. Suppose a batch of data contains a sequence of 9 tokens:
[1, 58, 46, 43, 47, 56, 1, 39, 45]
Then the inputs (X) and targets (Y) we want to train on would be:
when input is [1] the target: 58
when input is [1, 58] the target: 46
when input is [1, 58, 46] the target: 43
when input is [1, 58, 46, 43] the target: 47
when input is [1, 58, 46, 43, 47] the target: 56
when input is [1, 58, 46, 43, 47, 56] the target: 1
when input is [1, 58, 46, 43, 47, 56, 1] the target: 39
when input is [1, 58, 46, 43, 47, 56, 1, 39] the target: 45
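In code, such input/target pairs can be generated with a few lines (a rough sketch, not necessarily the exact notebook code):

```python
block_size = 8
seq = [1, 58, 46, 43, 47, 56, 1, 39, 45]

x = seq[:block_size]        # inputs
y = seq[1:block_size + 1]   # targets: the same sequence shifted by one position
for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
```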
That is, we expect the transformer to learn both short- and long-range correlations between tokens. Note that we don’t want to look ahead (i.e., the next-character prediction is based solely on previous characters). For training, the authors introduce the following attention formula[^5]:
\[\textsf{softmax} \Big(\frac{QK^T}{\sqrt{d}}\Big)V\]
where d is the dimension of the keys. Its implementation is surprisingly simple:
k = self.key(x) # (B,T,C)
q = self.query(x) # (B,T,C)
v = self.value(x) # (B,T,C)
wei = q @ k.transpose(-2, -1) * C ** (-0.5) # dot-product with normalization
wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf")) # (B,T,T)
wei = F.softmax(wei, dim=-1) # probabilities
out = wei @ v # (B,T,C): weighted aggregation of the values
where each batch slice of the wei tensor is a lower-triangular matrix whose rows sum to 1. For example, wei[0] looks like this:
wei[0] =
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.1236, 0.8764, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.1274, 0.0991, 0.7735, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.1018, 0.5800, 0.0192, 0.2989, 0.0000, 0.0000, 0.0000, 0.0000],
[0.1007, 0.3380, 0.2122, 0.2630, 0.0862, 0.0000, 0.0000, 0.0000],
[0.2143, 0.2812, 0.2176, 0.1199, 0.0668, 0.1003, 0.0000, 0.0000],
[0.1067, 0.1181, 0.0842, 0.1078, 0.0097, 0.5417, 0.0319, 0.0000],
[0.0138, 0.0266, 0.0078, 0.0239, 0.6866, 0.0381, 0.1671, 0.0361]]
)
That’s it! Once you implement this core part, the rest is mostly3 about increasing the number of heads and stacking them with linear and normalization layers (a rough sketch of this stacking is given at the end of this section).
The implemented decoder-only transformer. It only contains self-attention.
Note that it is called “self-attention” because the keys, queries, and values all come from the same source, x.
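To make that stacking concrete, here is a rough sketch of a multi-head attention layer and one transformer block in the usual decoder-only layout (the hyperparameters are illustrative, not necessarily the ones used in my notebook):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

block_size, n_embd, n_heads = 8, 32, 4   # illustrative hyperparameters

class Head(nn.Module):
    """One head of masked self-attention (the snippet shown earlier, wrapped in a module)."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        return F.softmax(wei, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    """Several heads in parallel; their outputs are concatenated and projected back."""
    def __init__(self, n_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)])
        self.proj = nn.Linear(n_heads * head_size, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class Block(nn.Module):
    """One transformer block: attention + feed-forward, each wrapped in a residual."""
    def __init__(self):
        super().__init__()
        self.sa = MultiHeadAttention(n_heads, n_embd // n_heads)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd)
        )
        self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # residual around attention (pre-norm)
        x = x + self.ffwd(self.ln2(x))   # residual around the feed-forward MLP
        return x

# The full model is then roughly: token + position embeddings
# -> a stack of Blocks -> a final LayerNorm -> a linear head over the vocabulary.
x = torch.randn(4, block_size, n_embd)   # a fake batch of already-embedded tokens
print(Block()(x).shape)                  # torch.Size([4, 8, 32])
```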
Summary
- This exercise touches some of the nuts and bolts of language models.
- For more practical use, we would need to tokenize words, not characters.
- Throughout the exercise, the main challenge for me personally was figuring out the dimensions of the tensors. It took time for me to learn how to manipulate tensors in PyTorch.
- Applying batch / layer norm blindly is not enough. Monitoring the activation values and gradients of each neuron at each layer is useful and necessary.
- The transformer architecture goes beyond the language field. It is a very general and powerful machine.
References
- The Annotated Transformer, Harvard NLP group [link]
- An even more annotated Transformer, \pi-tau [link]
- The Transformer Family Version 2.0, Lilian Weng [link]
- The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture, Jean Nyandwi [link]
- Andrej Karpathy’s YouTube lecture [link] and nanoGPT [link]
- Understanding Large Language Models [link]
- Attention Is All You Need, A. Vaswani, et al., NeurIPS 2017 [link]
- A Neural Probabilistic Language Model, Y. Bengio, et al., JMLR 2003 [link]
Code: https://github.com/uriyeobi/makemore
Notes