Language models and transformers from scratch

I recently did some exercises on (small) language models.

It is still quite a foreign field to me, so the only way for me to appreciate it better was to start from a very simple structure, directly implement each piece of the engine, and spell out all the matrices and vectors. I found Andrej Karpathy’s lectures useful - I was able to learn the basics and implement a few simple models, including the transformer. There are also many other good resources I found (see the references at the end).

All of my exercises are done in scratch notebooks. Each is self-contained and includes its own narrative, so I won’t copy them here. Instead, I will highlight only a few points that I found interesting.



Sample results

Let’s first see what I got in the end. Here are sample results - names generated from different models. Obviously, the names generated by the MLP are more realistic than those from the Bigram or Trigram.

Bigram           Trigram    MLP
ya               se         yeesha
syahavilin       emahle     leilanni
dleekahmangwnya  em         camryn
tryahdachen      jlbim      maryda
ena              bannya     trysten
da               oryah      hender
amiiae           niewena    ellysoniah
a                va         abdijas
keles            fajiah     louanne
ly               fakeles    alivia

Next, here is a sample of generated Shakespeare-like text from a simple (decoder-only) transformer. Many words are not actual English words, mainly because I tokenized by character rather than by word, but I think this artificial writing is not super crazy :)

I which I long them-fares here surked,
Whil with pursection my faid do.
Shephered' and my stande couse.
All, I you dreece! the nature! If 'He even grong mise's suck which you more
You evens as thou somet. I do hear,
That at untery that longs doin the,
For my uncred-hen arming to the conked.
vire some, becurie 'tis queel!

QUEEN VI ELIVA:
Yill greaste at same insomest my my diest,
That's but comelful; she appayon'd take to Musire herse;
Too did and was for undom to by do eath that Edvil
He lornk to end not one breada master: hold,
inle your shighter thee of right
Hast onchemes mytredom of ac his dardient him.

AUTOLUMBERLA:
It lord?
But SBoht that you unsues I dare,
Hut stren tongentord of him distrain.
Thou do to his it; I to I know out to his lose grast,
Whom shame do lady the comegare. Where Cath wive, your prince the is incice
Look him, if with I will love!

...

UKE OF YORK:
Wife our will looke.

AUTOLYCUS:
Is them sidoim arming na


Bigram

[Notebook]

This is one of the simplest ideas in language models, but I found that it is not really bad, even with character-level tokens. For example, if someone gives me the letter q and asks what the next letter would be, I would very likely say u. This is simply because the likelihood of moving from one letter to the next is NOT uniform in English. See the frequency table.

Frequency table. The likelihood from one character to the other is not homogeneous.
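To make the idea concrete, here is a minimal sketch of a count-based bigram model (the variable names and the tiny placeholder word list are my own, not necessarily what the notebook uses): build a 27x27 table of character-pair counts, normalize each row into probabilities, and sample the next character from the row of the current one.

import torch

words = ["emma", "olivia", "ava"]  # placeholder; the notebook uses a full list of names
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0  # '.' marks the start/end of a name
itos = {i: s for s, i in stoi.items()}

# count how often each character follows each character
N = torch.zeros((27, 27), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# normalize rows into probabilities (+1 smoothing avoids zero rows) and sample a name
P = (N + 1).float()
P /= P.sum(1, keepdim=True)

g = torch.Generator().manual_seed(2147483647)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))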

Trigram

[Notebook]

This was not covered in Andrej’s lectures, but I simply modified the bigram code so that the context is not a single character but two characters. Although this performs better than the bigram, it suffers from sparsity in the mapping table (e.g., no name contains xmk), so naive extensions to longer contexts would not be workable. That is one of the motivations for the MLP approach (the neural probabilistic language model).
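A minimal sketch of the change (again, the variable names and the placeholder word list are my own): the count table is now indexed by the previous two characters, and most of its rows stay empty even on a realistic dataset.

import torch

words = ["emma", "olivia", "ava"]  # placeholder; the notebook uses a full list of names
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}

# 27 x 27 x 27 count tensor: (previous two characters) -> next character
N = torch.zeros((27, 27, 27), dtype=torch.int32)
for w in words:
    chs = [".", "."] + list(w) + ["."]
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        N[stoi[ch1], stoi[ch2], stoi[ch3]] += 1

# the sparsity problem: many of the 27*27 = 729 two-character contexts are never observed,
# and this gets exponentially worse as the context grows
observed = (N.sum(dim=2) > 0).sum().item()
print(f"{observed} / {27 * 27} contexts observed")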

MLP

[Notebook]

Core idea

The implementation of a multi-layer perceptron language model here is based on the paper A Neural Probabilistic Language Model (Bengio et al., JMLR 2003). What is fascinating is that it feeds the neural network a certain representation of words (here, characters) that it can ingest, and lets the network work internally to emit the next one.

To be specific, this model associates each character with a feature vector. For example, consider the string emma. Each window of 3 consecutive characters (with . as an empty character) can be represented by embedding each character as a 2D vector1 and concatenating the results:

.em = [ 0  5 13] --> [ 0.43047658 -0.34990272 -2.3860264   0.69941294 -1.1255676   0.02481801]
emm = [ 5 13 13] --> [-2.3860264   0.69941294 -1.1255676   0.02481801 -1.1255676   0.02481801]
mma = [13 13  1] --> [-1.1255676   0.02481801 -1.1255676   0.02481801  0.47494158  0.9040745 ]
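A minimal sketch of how that representation is built (the embedding values will differ from the example above, since C is randomly initialized here, and the hidden-layer sizes are illustrative, not the notebook’s exact choices):

import torch

# embedding table: 27 possible characters ('.' = 0, 'a' = 1, ..., 'z' = 26), each mapped to a 2D vector
C = torch.randn((27, 2))

# the three context windows for "emma", as integer indices
X = torch.tensor([
    [0, 5, 13],   # .em
    [5, 13, 13],  # emm
    [13, 13, 1],  # mma
])

emb = C[X]                       # (3, 3, 2): one 2D vector per character in each window
x = emb.view(emb.shape[0], -1)   # (3, 6): concatenate the three vectors of each window
print(x)

# the rest of the model is an ordinary MLP on top of this representation
W1, b1 = torch.randn((6, 100)), torch.randn(100)
W2, b2 = torch.randn((100, 27)), torch.randn(27)
logits = torch.tanh(x @ W1 + b1) @ W2 + b2   # (3, 27): scores for the next character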

Why does this work?

The implemented architecture. Image from A Neural Probabilistic Language Model, Y. Bengio, et al., JMLR 2003

Side-track: monitoring neurons

Aside from building language models, Andrej also gave a separate lecture on how to diagnose neural networks. It allowed me to observe how the values and gradients of the neurons in each layer behave. For example, we see that the distribution of activation values shrinks from one layer to the next. This is due to the Tanh layers, which squash the distributions more and more.

Before normalizations and corrections are applied

But we don’t want this shrinking distribution. Why? Because we want some information (fluctuations) to be transferred from one layer to the next. After adding batch normalization and a gain factor for the Tanh activation function2, the activation values are much more stable.

After normalizations and corrections are applied

I personally really loved this part - having better control over the neurons.
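As a minimal sketch of the kind of diagnostic I mean (not the notebook code; the layer sizes are illustrative), one can push a random batch through a stack of Linear -> (BatchNorm) -> Tanh layers and track the standard deviation of the activations at each layer:

import torch

torch.manual_seed(42)
n_hidden, n_layers, batch = 100, 6, 32

def activation_stds(gain, use_bn):
    # return the std of the Tanh activations at each layer
    x = torch.randn(batch, n_hidden)
    stds = []
    for _ in range(n_layers):
        W = torch.randn(n_hidden, n_hidden) * gain / n_hidden ** 0.5  # Kaiming-style init with a gain
        h = x @ W
        if use_bn:
            h = (h - h.mean(0, keepdim=True)) / h.std(0, keepdim=True)  # batch normalization
        x = torch.tanh(h)
        stds.append(round(x.std().item(), 3))
    return stds

print("gain=1.0, no batchnorm:", activation_stds(1.0, False))    # stds shrink layer by layer
print("gain=5/3, no batchnorm:", activation_stds(5 / 3, False))  # stds settle at a stable level
print("with batchnorm:        ", activation_stds(1.0, True))     # stds also stay stable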

Transformer

[Notebook]

The key innovation of the transformer is that it overcomes the limitation of previous neural network models in capturing long-range correlations. Suppose a batch of data contains a sequence of 9 tokens:

[1, 58, 46, 43, 47, 56, 1, 39, 45]

Then the inputs (X) and targets (Y) we want to train on would be as follows (a sketch of how to construct these pairs appears after the list):

when input is [1] the target: 58
when input is [1, 58] the target: 46
when input is [1, 58, 46] the target: 43
when input is [1, 58, 46, 43] the target: 47
when input is [1, 58, 46, 43, 47] the target: 56
when input is [1, 58, 46, 43, 47, 56] the target: 1
when input is [1, 58, 46, 43, 47, 56, 1] the target: 39
when input is [1, 58, 46, 43, 47, 56, 1, 39] the target: 45
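A minimal sketch of producing these pairs from a block of tokens (block_size and the variable names here are my own):

import torch

data = torch.tensor([1, 58, 46, 43, 47, 56, 1, 39, 45])
block_size = 8

x = data[:block_size]       # inputs
y = data[1:block_size + 1]  # targets: the same sequence shifted by one position
for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(f"when input is {context.tolist()} the target: {target.item()}")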

That is, we expect the transformer to learn about both short- and long-range correlations of tokens. Note that we don’t want to look ahead (i.e., the next-character prediction is based solely on previous characters). For training, the authors of the original transformer paper (Attention Is All You Need) introduce the following attention formula:

\[\textsf{softmax} \Big(\frac{QK^T}{\sqrt{d}}\Big)V\]

and its implementation is surprisingly simple:

k = self.key(x)  # (B,T,C)
q = self.query(x)  # (B,T,C)
v = self.value(x)  # (B,T,C)
wei = q @ k.transpose(-2, -1) * C ** (-0.5)  # dot-product with normalization
wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf")) # (B,T,T)
wei = F.softmax(wei, dim=-1)  # probabilities
out = wei @ v  # (B,T,T) @ (B,T,C) -> (B,T,C)

where each batch slice of the wei tensor is a lower-triangular matrix (after the mask and the softmax). For example, wei[0] looks like this:

wei[0] = 
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1236, 0.8764, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1274, 0.0991, 0.7735, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1018, 0.5800, 0.0192, 0.2989, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1007, 0.3380, 0.2122, 0.2630, 0.0862, 0.0000, 0.0000, 0.0000],
        [0.2143, 0.2812, 0.2176, 0.1199, 0.0668, 0.1003, 0.0000, 0.0000],
        [0.1067, 0.1181, 0.0842, 0.1078, 0.0097, 0.5417, 0.0319, 0.0000],
        [0.0138, 0.0266, 0.0078, 0.0239, 0.6866, 0.0381, 0.1671, 0.0361]]
       )

That’s it! Once you implement this core part, the rest is mostly3 about increasing the number of heads and stacking them with linear and normalization layers.

The implemented decoder-only transformer. It only contains self-attention.

Note that this is called “self-attention” because the keys, queries, and values all come from the same source, x.
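To make the stacking concrete, here is a rough sketch of what it looks like (module names, hyperparameters, and some details are my own illustrative choices, not the exact notebook code):

import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8  # illustrative sizes

class Head(nn.Module):
    # one head of masked self-attention (the same logic as the snippet above)
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v

class MultiHeadAttention(nn.Module):
    # several heads in parallel, concatenated and projected back to n_embd
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList(Head(head_size) for _ in range(num_heads))
        self.proj = nn.Linear(num_heads * head_size, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class Block(nn.Module):
    # self-attention followed by a feed-forward MLP, each with a residual connection and layer norm
    def __init__(self, n_embd, num_heads):
        super().__init__()
        self.sa = MultiHeadAttention(num_heads, n_embd // num_heads)
        self.ffwd = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd))
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual connection around attention
        x = x + self.ffwd(self.ln2(x))  # residual connection around the MLP
        return x

# the full model stacks a few of these, e.g. nn.Sequential(*[Block(n_embd, num_heads=4) for _ in range(4)])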

Summary

References


Code: https://github.com/uriyeobi/makemore

Notes

  1. I used a vector dimension of 10, instead of 2, in the actual implementation. 

  2. A gain of 5/3 for Tanh is suggested empirically; there is no theoretical proof for it. 

  3. I skip the technical implementation details in this article.