The Anatomy of an LLM

Greetings! In our last post we looked at a cool piece of hardware called an FPGA which I want to eventually use to try to run LLMs more efficiently. To get there we need to build a fairly deep understanding of what an LLM even is so we can take one apart and put it back together on different hardware. Today we're gonna start to do just that!

Over the next couple of posts we will take a moderately sized LLM, examine its parts, build a solid mental model of how they interact, and ultimately write some code that can run the LLM with barebones python without the assistance of any libraries. This is certainly the harder and less performant way to run an LLM. We won't win any benchmarks any time soon, but the goal is for us to tinker and build our own understanding, not build something production-ready. This understanding will be a stepping stone to building something wholly new. We're aiming for that "aha! that's how it works!" moment, so let's get to it.

Breaking Down an LLM

TinyStories-33M is a good place to start. According to its original research paper, it's a modest model that can still produce coherent English because it was trained on a dataset of simple short stories. The principles we get from dissecting this model should translate well to understanding much larger models like GPT-x.

Starting off with a basic sanity check, we'll "run" the model by making the following script

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "roneneldan/TinyStories-33M"

model = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

inputs = tokenizer("Once upon a time", return_tensors="pt")
print(inputs)
outputs = model.generate(**inputs, max_new_tokens=50)
print(outputs[0])
print(tokenizer.decode(outputs[0]))

and running it to produce this text

$ python llm_run_tiny_llm.py
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a bite. It was so juicy and delicious!

So this gives us a basic black box look at what our model is. We give it some seed text, our "Once upon a time", and using the model our program is able to predict what would logically follow that text based on what the model has been trained on. In this case it gave us a cute little story. Cool.
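Under the hood, generate builds the text one token at a time: run the model, take the most likely next token, append it, repeat. Here's a sketch of that loop with a stand-in predictor (fake_logits below is a made-up toy, not the real network, so we can see the mechanics without downloading anything):

```python
# A stand-in next-token predictor: mimics the shape of what model.generate()
# does, minus the real network. `fake_logits` is a hypothetical placeholder
# for model(ids).logits[0, -1].
def fake_logits(ids):
    # pretend the model always wants to continue with token (last + 1)
    vocab = [0.0] * 10
    vocab[(ids[-1] + 1) % 10] = 1.0
    return vocab

def greedy_generate(ids, steps):
    for _ in range(steps):
        logits = fake_logits(ids)
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        ids = ids + [next_id]
    return ids

print(greedy_generate([3], 4))  # [3, 4, 5, 6, 7]
```

The real model does exactly this shape of loop, only its "logits" come from the network we're about to dissect, and max_new_tokens=50 controls how many trips around the loop we take.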

Now let's get an overview of the parts of our model that allow us to do this.

Black Box Model

To start, we'll use some third-party libraries to summarize our LLM (we'll be using these a lot to poke around, but our eventual goal is to run our model with just barebones python).

from transformers import AutoModelForCausalLM

MODEL = "roneneldan/TinyStories-33M"

model = AutoModelForCausalLM.from_pretrained(MODEL)

print(model)

Running this we get quite an output

$ python llm_summarize_model.py
GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-3): 4 x GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=False)
            (q_proj): Linear(in_features=768, out_features=768, bias=False)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

It's a lot to process, but we'll work at it chunk by chunk and it'll all make sense. Each of these lines corresponds to a part of the language model that we can treat as a black box with some input and some output. If we visualize the model this way it becomes easier to understand how these parts interact.

Each class of black box behaves slightly differently and thus serves a different function in the overall model so let's take each in turn.

Neural Networks in Review

Before we get to that, let's do the briefest review of how neural networks work. Explainers on this tend to start with the perceptron, an early artificial neuron model (the underlying threshold neuron dates to McCulloch and Pitts in 1943; Rosenblatt built the perceptron itself in 1958), so we'll start there:

This system used "threshold logic", much like what was discovered to work in the brain, to perform basic fuzzy recognition tasks like recognizing shapes. We can visualize this component as though it is a neuron with dendrites sticking out, each of which gets some input activation signal. We multiply these inputs with weights and sum the results. If this combination meets some threshold the perceptron itself is "activated" - so far so simple.

As it turns out, we can take this whole computation and translate it into a single equivalent matrix operation

\[ y = \sigma\!\left(\mathbf{w}^\top \mathbf{x} + b\right), \quad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \]
\[ \mathbf{w}^\top\mathbf{x} = \begin{bmatrix} w_1 & w_2 & \ldots & w_n \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n \]

Recognizing this has been an important part of the story of AI because in parallel (pun intended) we have gotten very good at high performance matrix multiplication with things like GPUs and SIMD.
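The whole computation fits in a few lines. Here's a minimal sketch in plain python (weights and bias are made-up values chosen so the neuron behaves like an AND gate):

```python
import math

def perceptron(x, w, b):
    # weighted sum of inputs plus bias, squashed by a sigmoid "activation"
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 / (1 + math.exp(-z))

# two inputs: fires (output near 1) only when both are active, like an AND gate
w, b = [10.0, 10.0], -15.0
print(round(perceptron([1, 1], w, b), 3))  # close to 1
print(round(perceptron([1, 0], w, b), 3))  # close to 0
```

The `sum(wi * xi ...)` line is exactly the \( \mathbf{w}^\top\mathbf{x} \) dot product above; stack many of these rows into a matrix and you get the Linear layers we saw in the model summary.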

Now in principle you can imagine how this basic building block, when repeated, can allow us to build much more interesting systems. Put together a cluster of neurons that can recognize whether an image features a nose or a trunk or a snout with one that can classify the size of the main object in an image and one that can distinguish bipeds from quadrupeds, and you have yourself a pretty good animal classifier. You get the idea.

More Sophisticated Architectures

The perceptron in principle can give us any operation we like, but modern neural network architectures feature more specialized components. These components efficiently handle tasks we have found to be common for particular applications like interpreting language or recognizing images. This is where our summary above comes in. Each of the boxes listed has a specialized function for modeling language, and while we could in principle get the same behavior from some cluster of artificial neurons, it's simpler to reason about the system in terms of the more abstract mathematical behavior each component has and how that helps the overall model. The architecture as a whole is what we call a transformer, and each component transforms its input data in some way that is unique to its particular function within the system. So let's work our way from the inside out, tackling these:

Embedding

We'll start with the simplest of these, our "embeddings" - these are the two transformations that happen way at the start of the model, called wte (weight matrix for token embedding) and wpe (weight matrix for positional embedding).

So first, what's a token? We hear this word a lot when discussing LLMs, as in "how many tokens can fit into its context window?" Think of tokens loosely as the words in the sentence you're trying to interpret, only sometimes we group a few words into the same token because they're more meaningful when combined (in practice tokenizers also split rare words into sub-word pieces). Say we want to interpret the sentence

The dog ran for many miles and then it rested

We might clump this into the tokens "The dog", "ran for", "many", "miles", "and then" - you get the picture. Each of these clumps should tease out a part of the sentence that, regardless of context, has a distinct base meaning of its own. That base meaning is what we then get from our token embedding. The token embedding takes the token and converts it into a vector, in our case a 768-dimensional vector, that the rest of the model can then process. This sounds complicated and mathy, but really you can think of this step as taking the word and associating 768 different categories (or "features") with it. Maybe you have one slot for part of speech where .1 = noun, .2 = verb, etc., and maybe you have one slot for the emotional tone of the word where -1 = very negative, 0 = neutral, and 1 = very positive. Put 768 of these categories side by side and you get a vector that distinctly captures the meaning of your word and can be processed by the rest of the model. How does this stage of the model actually do the transformation? It just has a lookup table that translates known tokens ("English") into vectors ("neural-net-ese") - all of the hard work was already done during training.
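The lookup really is that simple. A toy sketch with a made-up vocabulary and 3-dimensional vectors (the real wte is a 50257 x 768 matrix, and the "lookup" is selecting a row by token id):

```python
# A toy embedding table: token string -> token id -> feature vector.
# Vocabulary and vector values below are made up for illustration.
vocab = {"Once": 0, "upon": 1, "a": 2, "time": 3}
wte = [
    [0.1, -0.3, 0.7],   # row 0: vector for "Once"
    [0.5,  0.2, -0.1],  # row 1: vector for "upon"
    [0.0,  0.9,  0.4],  # row 2: vector for "a"
    [-0.2, 0.6,  0.3],  # row 3: vector for "time"
]

tokens = ["Once", "upon", "a", "time"]
embedded = [wte[vocab[t]] for t in tokens]  # one vector per token
print(embedded[0])  # [0.1, -0.3, 0.7]
```

This is why the summary shows `Embedding(50257, 768)`: 50257 known token ids, each mapped to a 768-dimensional row.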

Positional embeddings are even simpler. Once we convert a single context-less token into a vector we've lost an important part of its meaning: word order. A token embedding cannot tell you where in the sentence that token occurred. Hence we add "positional embeddings" - vectors that indicate where in the sentence the token occurred. For instance, we might indicate the first position by having the first dimension of the vector be high and the rest be low, and so on.

By combining these two vectors we have our full embedding that is able to be processed by the rest of the network.
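"Combining" here is just elementwise addition. A toy 4-dimensional sketch (the values are made up; in the real model these would be 768-dimensional rows of wte and wpe):

```python
# Combining token meaning and position is elementwise addition.
token_vec = [0.5, 0.2, -0.1, 0.8]   # "what the token means"
pos_vec   = [1.0, 0.0,  0.0, 0.0]   # "this token is at position 0"

full_embedding = [t + p for t, p in zip(token_vec, pos_vec)]
print(full_embedding)  # [1.5, 0.2, -0.1, 0.8]
```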

GPTNeoSelfAttention

So our model has four of these GPTNeo blocks that seem to do the bulk of the work. Each of these has an attention component and an MLP component. So what are these?

Self attention was a breakthrough idea in neural network architecture popularized by the paper Attention Is All You Need. It simplified earlier approaches to interpreting language like Luong attention, Bahdanau attention, and older stateful approaches built on recurrent neural networks.

Previous methods didn't really distinguish earlier parts of the text based on what we were trying to interpret at the moment. It was all treated as one big blur. So imagine we are interpreting the sentence

The dog ran for many miles and then it...

and want to predict what follows (e.g. rested). When we get to the word "it," it's important for us to know what "it" refers to. So imagine we've preserved the meaning of all of the text that preceded, but as a jumble of facts, having lost the context of what language actually gave us those facts. That's a bit of what it's like to interpret language without attention modules. An attention module lets you query the previous language for which bits of it may be most relevant to the text you are trying to understand now (e.g. "it" refers to the dog that ran for many miles) and then weight that information much more heavily when determining what should follow.

How it works

Attention modules have three parameters of interest:

  * Query (Q): what am I looking for?
  * Key (K): what do I match against?
  * Value (V): what information do I retrieve?

So you'll notice in our above summary architecture that our attention modules feature similarly named linear transformations called q_proj, k_proj, and v_proj. These allow us to calculate our values for Q, K, and V.

So back to our analogy: we start with the token "it" and run it through q_proj to get our query vector Q, a vector representation of what we're looking for. Think of this Q as saying "pay more attention to entities earlier in the sentence, because that's what the word 'it' refers to". We similarly take the individual tokens up to "it" and run them through k_proj; those that are entities ("the dog") will score higher in the "entity" dimension of the resulting vector, so their resulting K will match better with our Q. The values V represent the meaning of the tokens when we pay attention to them (for simplicity, imagine values are just the meanings of the tokens themselves). So finally we combine all the values we've seen so far into a vector representation of the information we have, but each value is weighted by how relevant it is. And that, in a nutshell (we'll go into more detail next time), is how attention works.

\[ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
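That formula is short enough to sketch directly. Here's a minimal single-head, single-query version in plain python, with made-up 2-dimensional vectors standing in for "it", "dog", and "ran":

```python
import math

def softmax(xs):
    # subtract the max for numerical stability, then normalize exponentials
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, keys, values):
    # score each earlier token's key against our query, normalize the scores
    # into weights, then return the weighted sum of the values
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# "it" queries for an entity; "dog" has a matching key, "ran" does not
q = [1.0, 0.0]                        # query from "it"
keys = [[5.0, 0.0], [0.0, 5.0]]       # keys for "dog", "ran"
values = [[1.0, 1.0], [-1.0, -1.0]]   # meanings of "dog", "ran"
out = attention(q, keys, values)
print(out)  # dominated by the value for "dog"
```

Because the query lines up with "dog"'s key, softmax puts nearly all of the weight on "dog"'s value, which is exactly the "pay attention to the right earlier token" behavior described above.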

GPTNeoMLP

Next we see that the attention block feeds into the MLP block. This block takes the vector we generated before, which holds the meaning of the text we have seen up to this point weighted by the particular bits of it we should be paying attention to, and infers what additional meaning we can make of it. Think of this module as where the bulk of the actual new conclusions are drawn, which will allow us to predict what text should come next. This layer recombines the categories we've seen as relevant so far (e.g. "animate subject", "is tired") into richer, more abstract categories ("in need of rest", "likely to act on its needs") and then outputs the most salient of these new categories back in the form of a new feature vector for us to convert to tokens ("rested").

For this reason, you can see our MLP block expand our input vector from 768 features to a much larger 3072 features in the first linear transformation c_fc (the c_ prefix is a holdover from GPT-2's convolution-style layer names; here it's a plain fully connected layer). This operation introduces a lot more of these abstract categories. Then the c_proj transformation turns this intermediate representation back into our standard, smaller set of concepts for which we have a vocabulary. Finally this component has a non-linear activation function act, which you can think of as letting us go from a fuzzy notion of categories being relevant to a binary is/is-not relevant. In general this turns out to give us much more richness in how our neural network can behave when we have multiple layers. Without it, many layers combined would behave no differently than an equivalent single layer, so we would never get that logic of "categories x, y, z are relevant to what I'm observing, therefore I'll invoke categories a, b, c to make sense of it".
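The expand-activate-contract pattern is easy to sketch. Here's a toy version in plain python that expands 2 features to 4 and back to 2 (all weights are made up; the real block does 768 -> 3072 -> 768, and the gelu below is the tanh approximation used by NewGELUActivation):

```python
import math

def gelu(x):
    # the tanh approximation of GELU, as used by NewGELUActivation
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def linear(x, weight, bias):
    # y = Wx + b, with weight given as one row per output feature
    return [sum(wij * xj for wij, xj in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def mlp(x, w_fc, b_fc, w_proj, b_proj):
    hidden = [gelu(h) for h in linear(x, w_fc, b_fc)]  # expand: c_fc + act
    return linear(hidden, w_proj, b_proj)              # contract: c_proj

# toy sizes: 2 features expanded to 4, then projected back down to 2
x = [1.0, -1.0]
w_fc = [[1, 0], [0, 1], [1, 1], [1, -1]]
b_fc = [0, 0, 0, 0]
w_proj = [[1, 0, 0, 0], [0, 0, 0, 1]]
b_proj = [0, 0]
print(mlp(x, w_fc, b_fc, w_proj, b_proj))
```

Note how the hidden features here are combinations of the inputs (sums and differences): those are the "richer abstract categories" the expansion buys us before c_proj compresses them back down.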

ModuleList

Stepping out one level, we see that we take the GPTNeoBlock process we just described and repeat it four times. This allows each block to add its own attention, deeper abstract concepts, and its own inference. This sort of repeated pattern is where we get the deep in deep learning. That being said, in principle it's no different from what we've already described, so we don't need to cover anything more to understand it.

Normalization and Dropout

You'll notice that leaves us with two final classes of components to describe: normalization and dropout. Normalization means we take our vector representation of meaning and rescale it so that the features keep their proportions to each other but the whole thing has one standard magnitude. We mainly need this because we feed our vectors through non-linear activation functions like those described before. Those steps take a fuzzy vector with different magnitudes for different categories and draw a clean line saying "these features are present and these are not", which, as we described, is what allows a deep neural network to behave more richly than a purely linear one. The thresholds for these activation functions are absolute (the cut line sits at a fixed input value), which means we need to scale our inputs to all be in the neighborhood of that threshold for it to be meaningful. That's what normalization is. Dropout, meanwhile, randomly zeroes out features during training so the network doesn't become over-reliant on any one of them; with p=0.0, as in our summary above, it does nothing at inference time.
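A minimal sketch of the normalization step (this is the mean/variance part of LayerNorm; the real layer also applies a learned per-feature scale and shift afterwards, which we omit here):

```python
import math

def layer_norm(x, eps=1e-5):
    # center the features, then scale them to have unit variance; eps keeps
    # us from dividing by zero for a perfectly flat vector
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

# wildly different magnitudes come out at one standard scale
print(layer_norm([100.0, 102.0, 98.0]))
print(layer_norm([0.10, 0.12, 0.08]))  # same shape, nearly identical output
```

Both inputs have the same relative shape, so after normalization they land in the same neighborhood, which is exactly what keeps the downstream activation thresholds meaningful.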

Putting it All Together

So there you have it. We have in principle taken apart our neural network and understood how all the parts function independently and combine to process and predict language. Now that we have our conceptual understanding, let's put it into practice and take a real network apart and put it back together.

Writing it Ourselves

As we saw at the start of the post, it's fairly easy to invoke and run our language model. This process is what we ultimately want to reproduce on different hardware. I like to think of this process as having four layers. Over the next few posts we'll strip away each of these layers.

  1. In the packaging layer we take our model and can share it with others or we can take a model someone else has made and reproduce it on our machine.
  2. In the transformer layer we have our transformers, specialized components that can be assembled into sophisticated network architectures
  3. The torch layer allows us to do fairly high level mathematical operations efficiently
  4. The hardware layer takes our basic linear operations like matrix multiplication and does them efficiently using things like CUDA

So to start with, we're going to re-write the packaging layer. Our goal will be to take a running neural network and export/import it to/from a file. If we're successful we'll be able to poke around the file we exported so that we can see all the things that make our particular neural network unique and thus what we'll need to reproduce fully later.

I vetted this idea by taking our network and seeing what the essential pieces are that need to be copied in order to reproduce it in a different environment (our eventual goal). Working from the inside out, I noted the core-most piece we can use to vet this idea is the self attention transformer. The code for this is readily available, so I took a peek and saw how it worked. Each module in our architecture has a forward method that transforms its input tensor into an output tensor; for self attention, that method only touches the projection layers and a handful of attributes like num_heads and head_dim.

In the next post we'll dive deep into these particular transformation methods, but for now we're just trying to copy the essential bits of the components as they are to see if we can export/import a neural network. I should note that the transformers package already does this; our goal is to understand how it works by recreating what it does. We're not trying to reverse engineer the exact format of this particular package but to get a lens into what a neural network looks like when it's actually alive in the machine. And then we want to convince ourselves that our understanding is right by actually testing it via export/import. I feel like this approach should come with a disclaimer: this is not how you would build a resilient production system. This is a hack to help illuminate our understanding of how this thing works.

As a sanity check, I wanted to see if I just copied all the attributes referenced in this method from one self attention block to another, would I get the same behavior?

import torch

from torch import nn
from transformers import AutoModelForCausalLM
from transformers.models.gpt_neo import modeling_gpt_neo as neo

MODEL = "roneneldan/TinyStories-33M"


class ImportedGPTNeoSelfAttention(neo.GPTNeoSelfAttention):
    def __init__(self, original):
        nn.Module.__init__(self)
        self.attn_dropout = original.attn_dropout
        self.resid_dropout = original.resid_dropout
        self.k_proj = original.k_proj
        self.v_proj = original.v_proj
        self.q_proj = original.q_proj
        self.out_proj = original.out_proj
        self.num_heads = original.num_heads
        self.head_dim = original.head_dim
        self.bias = original.bias

def _test_copy_model(model):
    self_attention_0 = model.transformer.h[0].attn.attention

    copy = ImportedGPTNeoSelfAttention(self_attention_0)
    batch_size, seq_len = 2, 16 # arbitrary for testing
    test_input = torch.randn(batch_size, seq_len, self_attention_0.config.hidden_size)
    # all() over the nested lists from tolist() would always be truthy, so
    # compare the tensors directly instead
    print(torch.equal(copy(test_input)[0], self_attention_0(test_input)[0]))

def main():
    torch.manual_seed(0)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    _test_copy_model(model)

if __name__ == "__main__":
    main()

This code snippet takes just the instantiated attributes of the neo self attention block that are used during the forward pass of the component and sets them on a new self attention block and then confirms that when we provide it a random input we get the same output. And lo it worked! Cool.

$ python llm_anatomy_copy_self_attention.py
True

Next I wanted to vet that we can take these attributes and actually export them to some serial, human-readable format so that we mere mortals can gaze upon it and see its guts firsthand. To do so I worked from the inside out, took the linear transformations that are the most fundamental elements of our network, and tried to export them to and import them from JSON.

import json
import torch

from torch import nn
from torch.nn.modules.linear import Linear
from transformers import AutoModelForCausalLM
from transformers.models.gpt_neo import modeling_gpt_neo as neo

MODEL = "roneneldan/TinyStories-33M"


class ExportableModelComponent(object):
    pass

class ExportableLinearTransformation(object):
    def __init__(self, transformation):
        self.transformation = transformation

    def get_exports(self):
        # a multi-element tensor can't be truth-tested, so check for None
        bias = None
        if self.transformation.bias is not None:
            bias = self.transformation.bias.detach().cpu().tolist()
        return {
            "shape": {
                "in": self.transformation.in_features,
                "out": self.transformation.out_features,
            },
            "params": {
                "weight": self.transformation.weight.detach().cpu().tolist(),
                "bias": bias,
            },
        }

class ModelExporter(object):
    def __init__(self):
        self.exportable_classes = {
            Linear: ExportableLinearTransformation,
        }

    def make_model_exportable(self, model):
        exportable_class = self.exportable_classes[type(model)]
        exportable = exportable_class(model)
        exports = exportable.get_exports()
        return exports


def _import_layer(export):
    # To vet the idea that we can actually export/import layers we convert to/from JSON here
    data = json.loads(json.dumps(export))
    layer_in = Linear(data["shape"]["in"], data["shape"]["out"], bias = data["params"]["bias"] is not None)
    layer_in.weight.data = torch.tensor(data["params"]["weight"])
    if data["params"]["bias"] is not None:
        layer_in.bias.data = torch.tensor(data["params"]["bias"])
    return layer_in

def _export_model(model, path):
    exporter = ModelExporter()
    layer_out = model.transformer.h[0].attn.attention.k_proj
    export = exporter.make_model_exportable(layer_out)
    layer_in = _import_layer(export)

    x = torch.randn(layer_in.in_features)
    y1 = layer_out(x)
    y2 = layer_in(x)
    print(torch.equal(y1, y2))


def main():
    torch.manual_seed(0)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    _export_model(model, 'model.json')

if __name__ == "__main__":
    main()

This snippet takes one of the copied attributes of the self attention block, a particular linear transformation, exports it, imports it, and again tests that the behavior is the same with a random input. It also sets up a bit of scaffolding code that we can extend as we work outward to include more and more components. Run it and we get...

$ python llm_anatomy_export_linear_component.py
True

Awesome!

So finally, I extended this proof of concept to work across the entire network. Let's see what we get.

import json
import torch

from torch import nn
from torch.nn.modules.linear import Linear
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import activations
from transformers.generation import configuration_utils
from transformers.models.gpt_neo import modeling_gpt_neo as neo
from transformers.models.gpt_neo import configuration_gpt_neo

MODEL = "roneneldan/TinyStories-33M"


class ExportableNeuralNetworkModule(nn.Module):

    def __init__(self, exported_attrs):
        nn.Module.__init__(self)
        for attr in self.attributes:
            setattr(self, attr, exported_attrs[attr])

    @classmethod
    def get_exports(cls, component):
        return {attr: getattr(component, attr) for attr in cls.attributes}


class ExportableGPTNeoSelfAttention(ExportableNeuralNetworkModule,
                                    neo.GPTNeoSelfAttention):
    attributes = [
        "attn_dropout", "resid_dropout", "k_proj", "v_proj", "q_proj",
        "out_proj", "num_heads", "head_dim", "bias", "layer_id",
    ]


class ExportableGPTNeoAttention(ExportableNeuralNetworkModule,
                                neo.GPTNeoAttention):
    attributes = ["attention"]


class ExportableGPTNeoBlock(ExportableNeuralNetworkModule, neo.GPTNeoBlock):
    attributes = ["ln_1", "attn", "ln_2", "mlp"]


class ExportableGPTNeoMLP(ExportableNeuralNetworkModule, neo.GPTNeoMLP):
    attributes = ["c_fc", "c_proj", "act", "dropout"]


class ExportableGPTNeoModel(ExportableNeuralNetworkModule, neo.GPTNeoModel):
    attributes = ["wte", "wpe", "drop", "h", "ln_f", "config", "gradient_checkpointing", "embed_dim"]


class ExportableGPTNeoForCausalLM(ExportableNeuralNetworkModule,
                                  neo.GPTNeoForCausalLM):
    attributes = ["transformer", "generation_config", "config", "lm_head"]


class ExportableLinearTransformation(ExportableNeuralNetworkModule, nn.Linear):
    attributes = ["in_features", "out_features", "weight", "bias"]


class ExportableDropout(ExportableNeuralNetworkModule, nn.Dropout):
    attributes = ["p", "inplace"]


class ExportableGELUActivation(ExportableNeuralNetworkModule,
                               activations.NewGELUActivation):
    attributes = []


def _generation_config_from_dict(d):
    c = configuration_utils.GenerationConfig.from_dict(d)
    # The public initializer doesn't set this necessary private attribute. Sigh.
    c._original_object_hash = hash(c)
    # HACK disable some legacy behavior
    c._from_model_config = False
    return c


def dump_config(c):
    attrs = [
        "vocab_size", "max_position_embeddings", "hidden_size", "num_layers",
        "attention_types", "num_heads", "intermediate_size", "window_size",
        "activation_function", "resid_dropout", "embed_dropout",
        "attention_dropout", "classifier_dropout", "layer_norm_epsilon",
        "initializer_range", "use_cache", "bos_token_id", "eos_token_id"
    ]
    return {attr: getattr(c, attr) for attr in attrs}

def dump_embedding(e):
    return {
        "num_embeddings": e.num_embeddings,
        "embedding_dim": e.embedding_dim,
        "weight": e.weight.detach().cpu().tolist(),
    }

def load_embedding(e):
    return nn.Embedding(
        num_embeddings=e['num_embeddings'],
        embedding_dim=e['embedding_dim'],
        _weight=torch.tensor(e['weight'])
    )

class ModelExporter(object):
    exportable_classes = {
        activations.NewGELUActivation: ExportableGELUActivation,
        Linear: ExportableLinearTransformation,
        neo.GPTNeoAttention: ExportableGPTNeoAttention,
        neo.GPTNeoBlock: ExportableGPTNeoBlock,
        neo.GPTNeoSelfAttention: ExportableGPTNeoSelfAttention,
        neo.GPTNeoMLP: ExportableGPTNeoMLP,
        neo.GPTNeoModel: ExportableGPTNeoModel,
        neo.GPTNeoForCausalLM: ExportableGPTNeoForCausalLM,
        nn.Dropout: ExportableDropout,
        nn.modules.normalization.LayerNorm: None,
    }
    exporter_functions = {
        nn.Parameter:
        (lambda p: p.detach().cpu().tolist(), lambda p: torch.tensor(p)),
        torch.Tensor:
        (lambda t: t.detach().cpu().tolist(), lambda t: torch.tensor(t)),
        nn.modules.normalization.LayerNorm:
        ((lambda ln: [ln.normalized_shape, ln.eps]),
         (lambda attrs: nn.modules.normalization.LayerNorm(attrs[0],
                                                           eps=attrs[1]))),
        nn.Embedding: (dump_embedding, load_embedding),
        configuration_utils.GenerationConfig: ((lambda c: c.to_dict()),
                                               _generation_config_from_dict),
        configuration_gpt_neo.GPTNeoConfig:
        ((lambda c: dump_config(c)), lambda c: 
         configuration_gpt_neo.GPTNeoConfig(**c)),
    }
    primitive_types = [float, int, type(None), bool]

    def __init__(self):
        self.exportable_classes_by_name = {
            cls.__name__: exportable_class
            for cls, exportable_class in self.exportable_classes.items()
        }
        self.exporter_functions_by_name = {
            cls.__name__: pair
            for cls, pair in self.exporter_functions.items()
        }

    def export_model(self, model):
        model_class = type(model)
        if model_class in self.primitive_types:
            return model
        if model_class in self.exporter_functions:
            exporter, _ = self.exporter_functions[model_class]
            exportable_attributes = exporter(model)
        elif model_class == nn.modules.container.ModuleList:
            exportable_attributes = [
                self.export_model(module) for module in model
            ]
        else:
            exportable_class = self.exportable_classes[model_class]
            exports = exportable_class.get_exports(model)
            exportable_attributes = {
                name: self.export_model(attr)
                for name, attr in exports.items()
            }
        return {
            "type": model_class.__name__,
            "attributes": exportable_attributes,
        }

    def import_model(self, exported_model):
        model_class = type(exported_model)
        if model_class in self.primitive_types:
            return exported_model
        if model_class != dict:
            raise ValueError("Cannot infer original type for non-dict",
                             exported_model)
        original_type_name = exported_model['type']
        if original_type_name in self.exporter_functions_by_name:
            _, importer = self.exporter_functions_by_name[original_type_name]
            return importer(exported_model['attributes'])
        if original_type_name == nn.modules.container.ModuleList.__name__:
            return nn.modules.container.ModuleList([
                self.import_model(attr)
                for attr in exported_model['attributes']
            ])
        importer = self.exportable_classes_by_name[original_type_name]
        imported_attributes = {
            name: self.import_model(attr)
            for name, attr in exported_model['attributes'].items()
        }
        return importer(imported_attributes)


def _test_export_model(model_out, path):
    exporter = ModelExporter()
    export = exporter.export_model(model_out)
    with open(path, 'w') as f:
        json.dump(export, f, indent=4)
    with open(path) as f:
        imported = json.load(f)
    model_in = exporter.import_model(imported)

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    inputs = tokenizer("Once upon a time", return_tensors="pt")
    print(inputs)
    outputs = model_in.generate(**inputs, max_new_tokens=50)
    print(outputs[0])
    print(tokenizer.decode(outputs[0]))


def main():
    torch.manual_seed(0)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    _test_export_model(model, 'model.json')


if __name__ == "__main__":
    main()

Did it Work?

At the end of all this armchair philosophizing about the anatomy of neural networks, I had to ask myself: was my exporter-importer actually successful? To check, I compared the output of the original with the new output. Running the imported model I get

$ python llm_anatomy_export_linear_component.py
...
Once upon a time, there was a little girl named Lily. She loved to play outside with her friends. One day, they decided to play hide-and seek. They ran and ran until they found a hidden spot behind a tree. Lily hid behind a bush.

which is different from our original

$ python llm_run_tiny_llm.py
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red apple on the ground. She picked it up and took a bite. It was so juicy and delicious!

Interestingly, while the outputs differ, the imported model's output is still coherent, valid English. I attribute the difference to some loss of numerical precision during the export/import round trip; the overall structure and function of the model were preserved, so I'll call it a win. Maybe in a future post I'll chase down a perfectly deterministic recreation. With that, let's take a look at what we made.
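One plausible culprit for the precision loss: the model's weights are stored as 32-bit floats, but Python floats are 64-bit, so casting between the two widths can shift the low digits. This stdlib sketch demonstrates the general phenomenon, not necessarily the exact path my exporter takes:

```python
import json
import struct

def to_float32(x):
    # Round-trip a Python float (64-bit) through 32-bit storage
    return struct.unpack('f', struct.pack('f', x))[0]

x = 0.1
print(json.loads(json.dumps(x)) == x)  # True: JSON text round-trips exactly
print(to_float32(x) == x)              # False: 32-bit storage loses low digits
print(to_float32(x))                   # 0.10000000149011612
```

So JSON itself is lossless for Python floats; it's the width changes on either side of the serialization that can nudge the values.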

The Innards of a Model

Now that we've taken a snapshot of our model and successfully written it to a file, we can poke around and see with our own eyes what this thing is made of. Let's see what we made!

First question: how big is this thing?

$ ls -lh model.json
-rw-rw-r-- 1 rabrams rabrams 7.3G Apr 15 08:37 model.json

A whopping 7.3 GB! Holy moly! Even though we didn't write many individual components and attributes to our snapshot, those attributes (the matrices that model our linear transformations) amount to a lot of data. Mind you, this is especially true when we encode them in space-inefficient JSON with no compression. So let's take a look inside.
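Before we do, a quick back-of-envelope on where those gigabytes come from. A weight that costs 4 bytes as a raw 32-bit float costs around 20 bytes printed as full-precision JSON text (the value below is copied from the snapshot; the multiplier is an estimate, not a measurement):

```python
import json

# One typical weight from the snapshot, as binary float32 vs. JSON text
value = 0.026861567050218582
as_json = json.dumps(value)
print(len(as_json), "bytes as JSON text vs. 4 bytes as float32")

# With indent=4 and deep nesting, each number also drags along a newline and a
# dozen-plus spaces of indentation, multiplying the blowup even further.
```

Stack a 5x text blowup on top of per-line indentation overhead and a few tens of millions of parameters, and multiple gigabytes of JSON is about what you'd expect.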

This jq command gives us the contours of the model by replacing every numerical array with a description of its size.

$ jq 'walk(if type == "array" then {"_summary":"array","length":length} else . end)' model.json
{
  "type": "GPTNeoForCausalLM",
  "attributes": {
    "transformer": {
      "type": "GPTNeoModel",
      "attributes": {
        "wte": {
          "type": "Embedding",
          "attributes": {
            "num_embeddings": 50257,
            "embedding_dim": 768,
            "weight": {
              "_summary": "array",
              "length": 50257
            }
          }
        },
        "wpe": {
          "type": "Embedding",
          "attributes": {
            "num_embeddings": 2048,
            "embedding_dim": 768,
            "weight": {
              "_summary": "array",
              "length": 2048
            }
          }
        },
        "drop": {
          "type": "Dropout",
          "attributes": {
            "p": 0,
            "inplace": false
          }
        },
        "h": {
          "type": "ModuleList",
          "attributes": {
            "_summary": "array",
            "length": 4
          }
        },
        "ln_f": {
          "type": "LayerNorm",
          "attributes": {
            "_summary": "array",
            "length": 2
          }
        },
        "config": {
          "type": "GPTNeoConfig",
          "attributes": {
            "vocab_size": 50257,
            "max_position_embeddings": 2048,
            "hidden_size": 768,
            "num_layers": 4,
            "attention_types": {
              "_summary": "array",
              "length": 1
            },
            "num_heads": 16,
            "intermediate_size": null,
            "window_size": 256,
            "activation_function": "gelu_new",
            "resid_dropout": 0,
            "embed_dropout": 0,
            "attention_dropout": 0,
            "classifier_dropout": 0.1,
            "layer_norm_epsilon": 1e-05,
            "initializer_range": 0.02,
            "use_cache": true,
            "bos_token_id": 50256,
            "eos_token_id": 50256
          }
        },
        "gradient_checkpointing": false,
        "embed_dim": 768
      }
    },
    "generation_config": {
      "type": "GenerationConfig",
      "attributes": {
        "max_length": 20,
        "max_new_tokens": null,
        "min_length": 0,
        "min_new_tokens": null,
        "early_stopping": false,
        "max_time": null,
        "stop_strings": null,
        "do_sample": false,
        "num_beams": 1,
        "num_beam_groups": 1,
        "penalty_alpha": null,
        "dola_layers": null,
        "use_cache": true,
        "cache_implementation": null,
        "cache_config": null,
        "return_legacy_cache": null,
        "prefill_chunk_size": null,
        "temperature": 1,
        "top_k": 50,
        "top_p": 1,
        "min_p": null,
        "typical_p": 1,
        "epsilon_cutoff": 0,
        "eta_cutoff": 0,
        "diversity_penalty": 0,
        "repetition_penalty": 1,
        "encoder_repetition_penalty": 1,
        "length_penalty": 1,
        "no_repeat_ngram_size": 0,
        "bad_words_ids": null,
        "force_words_ids": null,
        "renormalize_logits": false,
        "constraints": null,
        "forced_bos_token_id": null,
        "forced_eos_token_id": null,
        "remove_invalid_values": false,
        "exponential_decay_length_penalty": null,
        "suppress_tokens": null,
        "begin_suppress_tokens": null,
        "forced_decoder_ids": null,
        "sequence_bias": null,
        "token_healing": false,
        "guidance_scale": null,
        "low_memory": null,
        "watermarking_config": null,
        "num_return_sequences": 1,
        "output_attentions": false,
        "output_hidden_states": false,
        "output_scores": false,
        "output_logits": null,
        "return_dict_in_generate": false,
        "pad_token_id": null,
        "bos_token_id": 50256,
        "eos_token_id": 50256,
        "encoder_no_repeat_ngram_size": 0,
        "decoder_start_token_id": null,
        "is_assistant": false,
        "num_assistant_tokens": 20,
        "num_assistant_tokens_schedule": "constant",
        "assistant_confidence_threshold": 0.4,
        "prompt_lookup_num_tokens": null,
        "max_matching_ngram_size": null,
        "assistant_early_exit": null,
        "assistant_lookbehind": 10,
        "target_lookbehind": 10,
        "disable_compile": false,
        "generation_kwargs": {},
        "_from_model_config": true,
        "transformers_version": "4.52.4"
      }
    },
    "config": {
      "type": "GPTNeoConfig",
      "attributes": {
        "vocab_size": 50257,
        "max_position_embeddings": 2048,
        "hidden_size": 768,
        "num_layers": 4,
        "attention_types": {
          "_summary": "array",
          "length": 1
        },
        "num_heads": 16,
        "intermediate_size": null,
        "window_size": 256,
        "activation_function": "gelu_new",
        "resid_dropout": 0,
        "embed_dropout": 0,
        "attention_dropout": 0,
        "classifier_dropout": 0.1,
        "layer_norm_epsilon": 1e-05,
        "initializer_range": 0.02,
        "use_cache": true,
        "bos_token_id": 50256,
        "eos_token_id": 50256
      }
    },
    "lm_head": {
      "type": "Linear",
      "attributes": {
        "in_features": 768,
        "out_features": 50257,
        "weight": {
          "type": "Parameter",
          "attributes": {
            "_summary": "array",
            "length": 50257
          }
        },
        "bias": null
      }
    }
  }
}
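If you'd rather stay in Python, the same summarization is a short recursive function. `summarize` is my own name for it; to run it against the real file, load model.json and pass in the result:

```python
import json

def summarize(node):
    # Collapse every array to a {_summary, length} stub, like the jq walk above
    if isinstance(node, list):
        return {"_summary": "array", "length": len(node)}
    if isinstance(node, dict):
        return {k: summarize(v) for k, v in node.items()}
    return node

sample = {"weight": [[0.1] * 768] * 4, "p": 0}
print(json.dumps(summarize(sample)))
# {"weight": {"_summary": "array", "length": 4}, "p": 0}
```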

This approach hides the contents of our module list since it is also an array, so let's take a look at one of its elements separately

$ jq '.attributes.transformer.attributes.h.attributes[0] | walk(if type == "array" then {"_summary":"array","length":length} else . end)' model.json
{
  "type": "GPTNeoBlock",
  "attributes": {
    "ln_1": {
      "type": "LayerNorm",
      "attributes": {
        "_summary": "array",
        "length": 2
      }
    },
    "attn": {
      "type": "GPTNeoAttention",
      "attributes": {
        "attention": {
          "type": "GPTNeoSelfAttention",
          "attributes": {
            "attn_dropout": {
              "type": "Dropout",
              "attributes": {
                "p": 0,
                "inplace": false
              }
            },
            "resid_dropout": {
              "type": "Dropout",
              "attributes": {
                "p": 0,
                "inplace": false
              }
            },
            "k_proj": {
              "type": "Linear",
              "attributes": {
                "in_features": 768,
                "out_features": 768,
                "weight": {
                  "type": "Parameter",
                  "attributes": {
                    "_summary": "array",
                    "length": 768
                  }
                },
                "bias": null
              }
            },
            "v_proj": {
              "type": "Linear",
              "attributes": {
                "in_features": 768,
                "out_features": 768,
                "weight": {
                  "type": "Parameter",
                  "attributes": {
                    "_summary": "array",
                    "length": 768
                  }
                },
                "bias": null
              }
            },
            "q_proj": {
              "type": "Linear",
              "attributes": {
                "in_features": 768,
                "out_features": 768,
                "weight": {
                  "type": "Parameter",
                  "attributes": {
                    "_summary": "array",
                    "length": 768
                  }
                },
                "bias": null
              }
            },
            "out_proj": {
              "type": "Linear",
              "attributes": {
                "in_features": 768,
                "out_features": 768,
                "weight": {
                  "type": "Parameter",
                  "attributes": {
                    "_summary": "array",
                    "length": 768
                  }
                },
                "bias": {
                  "type": "Parameter",
                  "attributes": {
                    "_summary": "array",
                    "length": 768
                  }
                }
              }
            },
            "num_heads": 16,
            "head_dim": 48,
            "bias": {
              "type": "Tensor",
              "attributes": {
                "_summary": "array",
                "length": 1
              }
            },
            "layer_id": 0
          }
        }
      }
    },
    "ln_2": {
      "type": "LayerNorm",
      "attributes": {
        "_summary": "array",
        "length": 2
      }
    },
    "mlp": {
      "type": "GPTNeoMLP",
      "attributes": {
        "c_fc": {
          "type": "Linear",
          "attributes": {
            "in_features": 768,
            "out_features": 3072,
            "weight": {
              "type": "Parameter",
              "attributes": {
                "_summary": "array",
                "length": 3072
              }
            },
            "bias": {
              "type": "Parameter",
              "attributes": {
                "_summary": "array",
                "length": 3072
              }
            }
          }
        },
        "c_proj": {
          "type": "Linear",
          "attributes": {
            "in_features": 3072,
            "out_features": 768,
            "weight": {
              "type": "Parameter",
              "attributes": {
                "_summary": "array",
                "length": 768
              }
            },
            "bias": {
              "type": "Parameter",
              "attributes": {
                "_summary": "array",
                "length": 768
              }
            }
          }
        },
        "act": {
          "type": "NewGELUActivation",
          "attributes": {}
        },
        "dropout": {
          "type": "Dropout",
          "attributes": {
            "p": 0,
            "inplace": false
          }
        }
      }
    }
  }
}

And we see that despite the enormous size of the file as a whole, the structure that determines what our computation looks like is quite simple. The layout of our snapshot mirrors the model structure we have seen so far, with each component mapping to an operation we've seen the model perform.

If we take a look at one of these individual transformations, we see, unsurprisingly, a list of numbers.

$ jq .attributes.transformer.attributes.wte.attributes.weight model.json | head -10
[
  [
    -0.013794858939945698,
    0.06520742177963257,
    -0.14263242483139038,
    -0.024976538494229317,
    0.026861567050218582,
    0.06825800985097885,
    0.008955910801887512,
    -0.02322300523519516,

These parameters are the overwhelming majority of what makes a language model work, which is great news because they are in principle quite simple to understand. All we need to do is look at how these numbers feed into the computations that transform our language inputs.
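As a sanity check on "overwhelming majority," we can tally the parameter counts implied by the shapes we just saw. This is a rough count that includes every weight and bias listed in the snapshot and ignores any weight tying between wte and lm_head; the exact accounting behind the model's "33M" name may differ:

```python
# Shapes taken from the config and components in our snapshot
hidden, vocab, positions, layers, mlp_width = 768, 50257, 2048, 4, 3072

embeddings = vocab * hidden + positions * hidden  # wte + wpe
per_block = (
    4 * hidden * hidden + hidden           # q/k/v/out projections (+ out_proj bias)
    + 2 * (2 * hidden)                     # ln_1 and ln_2 (weight + bias each)
    + hidden * mlp_width + mlp_width       # c_fc weight + bias
    + mlp_width * hidden + hidden          # c_proj weight + bias
)
total = embeddings + layers * per_block + 2 * hidden  # + ln_f
print(f"{total:,} parameters")  # 68,514,048 parameters
```

Nearly all of those numbers sit in a handful of big matrices; everything else in the file is a thin layer of configuration around them.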

What's Next?

We've taken our first steps! At the highest level we have taken a language model apart and put it back together. We are one post closer to being able to run a model on different hardware. Join us next time, where we'll dive even deeper into the individual parts of a language model and see in detail how their computations work. Layer by layer we will break it down and hopefully build a rich understanding in the process. Until next time!