LLMs contain a LOT of parameters. But what’s a parameter?
The Parameters of Language Models: Unpacking the Math Behind AI's Greatest Achievements
Imagine a planet-sized pinball machine, with billions of paddles and bumpers set just so, sending balls pinging from one end to the other. This is roughly the scale of a large language model (LLM), a type of artificial intelligence that has revolutionized the way we interact with technology. But what makes these models tick? What are the "dials and levers" that control their behavior?
At the heart of every LLM are its parameters: a mind-boggling number of numerical values, spread throughout the model, that together determine how it turns input text into output. But while a typical math equation might have a handful of parameters, an LLM can have hundreds of billions.
What are Parameters?
Think back to middle school algebra and an expression like 2a + b. Those letters are parameters: assign them values and you get a result. In math or coding, parameters are used to set limits or determine output. The parameters inside LLMs work in a similar way, just on a mind-boggling scale.
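To make the algebra analogy concrete, here is a minimal sketch in Python. The function name evaluate is made up for illustration; the only thing taken from the text is the expression 2a + b.

```python
def evaluate(a, b):
    """Compute 2a + b for the given parameter values."""
    return 2 * a + b

# Same expression, different parameter values, different results.
print(evaluate(a=3, b=4))   # 10
print(evaluate(a=3, b=-1))  # 5
```

An LLM is the same idea with hundreds of billions of letters instead of two.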
How are Parameters Assigned?
Short answer: an algorithm. At the start of training, each parameter is set to a random value. The training process then involves an iterative series of calculations (known as training steps) that update those values. In the early stages of training, a model will make errors. The training algorithm looks at each error and goes back through the model, tweaking the value of each of the model's many parameters so that next time that error is smaller. This happens over and over again until the model behaves in the way its makers want it to.
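The loop described above can be sketched with a toy model that has just two parameters. Everything here is an assumption made for illustration: the data follows the rule y = 2x + 1, the error is measured as mean squared error, and the tweaking is done with plain gradient descent. Real LLM training uses far more elaborate versions of the same idea.

```python
import random

random.seed(0)

# Toy training data generated by the rule y = 2x + 1 (illustrative assumption).
data = [(x, 2 * x + 1) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

# Every parameter starts out at a random value.
w = random.uniform(-1, 1)
b = random.uniform(-1, 1)

learning_rate = 0.05  # how big each tweak is (hypothetical choice)

def mean_squared_error(w, b):
    """Average squared gap between the model's guesses and the right answers."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_error = mean_squared_error(w, b)

for step in range(500):  # each pass through this loop is one training step
    # Work out how the error changes as each parameter changes...
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # ...then nudge each parameter in the direction that shrinks the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = mean_squared_error(w, b)
print(round(w, 3), round(b, 3))
print(initial_error, final_error)
```

After 500 steps the two parameters should land very close to 2 and 1, the values hidden in the data, and the error should be far smaller than where it started.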
The Three Types of Parameters
There are three different types of parameters inside an LLM that get their values assigned through training: embeddings, weights, and biases. Let's take each of those in turn.
Embeddings
An embedding is the mathematical representation of a word (or part of a word, known as a token) in an LLM's vocabulary. An LLM's vocabulary, which might contain up to a few hundred thousand unique tokens, is set by its designers before training starts. But there's no meaning attached to those words. That comes during training.
When a model is trained, each word in its vocabulary is assigned a numerical representation that captures the meaning of that word in relation to all the other words, based on how the word appears in countless examples across the model's training data. In effect, each word gets replaced by a kind of code: a list of numbers rather than a single value.
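As a rough illustration, suppose each embedding is a short list of numbers. The three words and their four-number codes below are invented for this sketch; real models learn lists that are thousands of numbers long. Cosine similarity, a standard way to compare two lists of numbers, is used here as an assumed measure of how close two meanings are.

```python
import math

# Hypothetical embeddings: the values are made up, not taken from any model.
embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(u, v):
    """Return a score near 1 when two vectors point the same way, near 0 when unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))
print(cosine_similarity(embeddings["cat"], embeddings["car"]))
```

With these made-up values, "cat" scores much closer to "dog" than to "car", which is the kind of relationship training is meant to bake into the numbers.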
Weights
A weight is a parameter that represents the strength of a connection between different parts of a model—and one of the most common types of dial for tuning a model's behavior. Weights are used when an LLM processes text.
When an LLM reads a sentence (or a book chapter), it first looks up the embeddings for all the words and then passes those embeddings through a series of neural networks, known as transformers, that are designed to process sequences of data (like text) all at once. Every word in the sentence gets processed in relation to every other word.
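The idea of processing every word in relation to every other word can be sketched with a stripped-down version of the attention step used in transformers. This is a simplification under stated assumptions: a real transformer first multiplies each embedding by learned weight matrices, which are left out here, and the two-number embeddings are invented for the example.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into positive fractions that add up to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Rebuild each word's vector as a blend of all the vectors in the sentence."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this word against every word in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        attention = softmax(scores)
        # Blend all the vectors, paying most attention to the best-matching words.
        blended = [
            sum(a * vec[i] for a, vec in zip(attention, vectors))
            for i in range(dim)
        ]
        outputs.append(blended)
    return outputs

# Three hypothetical word embeddings standing in for a three-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(sentence)
for vector in contextual:
    print(vector)
```

Each output vector now carries a little of every other word in the sentence, which is what lets the model read a word in context.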
Biases
Biases are another type of dial, one that complements the effects of the weights. Weights set the thresholds at which different parts of a model fire (and thus pass data on to the next part). Biases are used to adjust those thresholds so that an embedding can trigger activity even when its value is low.
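Here is a minimal sketch of how a weight and a bias might work together in one small part of a model. It assumes the common recipe of a weighted sum plus a bias, followed by a rule (known as ReLU) that passes the result on only if it is above zero. The input values, weights, and biases are made up.

```python
def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias; fires only if the total is positive."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)  # ReLU: anything at or below zero is silenced

inputs = [0.5, 0.25]
weights = [1.0, -4.0]

# With no bias, the weighted sum is -0.5, so nothing gets passed on.
print(neuron(inputs, weights, bias=0.0))  # 0.0

# A bias of 1.0 shifts the threshold, so the same input now gets through.
print(neuron(inputs, weights, bias=1.0))  # 0.5
```

The inputs and weights are identical in both calls; only the bias changes whether the signal makes it to the next part of the model.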
Neurons: The Containers for Weights and Biases
Neurons are best thought of as a way to organize all this math: containers for the weights and biases, strung together by a web of pathways between them. It's all very loosely inspired by biological neurons inside animal brains, with signals from one neuron triggering new signals from the next and so on.
How it All Fits Together
When an LLM processes a piece of text, the numerical representation of that text—the embedding—gets passed through multiple layers of the model. In each layer, the value of the embedding gets updated many times by a series of computations involving the model's weights and biases until it gets to the final layer.
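The layer-by-layer journey can be sketched as a loop. This toy version is an assumption-heavy simplification: each layer is just a grid of weights plus a list of biases, the values are random rather than trained, and the attention machinery of a real transformer layer is left out. It does show where the parameter count comes from.

```python
import random

random.seed(0)

def make_layer(n_inputs, n_outputs):
    """Create one layer: a row of weights and one bias for each output."""
    weights = [
        [random.uniform(-1, 1) for _ in range(n_inputs)]
        for _ in range(n_outputs)
    ]
    biases = [random.uniform(-1, 1) for _ in range(n_outputs)]
    return weights, biases

def apply_layer(vector, layer):
    """Update the vector: weighted sums plus biases, with negatives set to zero."""
    weights, biases = layer
    return [
        max(0.0, sum(x * w for x, w in zip(vector, row)) + b)
        for row, b in zip(weights, biases)
    ]

# Three hypothetical layers, each mapping 4 numbers to 4 numbers.
layers = [make_layer(4, 4) for _ in range(3)]

vector = [0.9, 0.8, 0.1, 0.0]  # a made-up embedding
for layer in layers:
    vector = apply_layer(vector, layer)  # the embedding is updated at each layer

# Every weight and every bias counts as one parameter.
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(vector)
print(n_parameters)  # 60
```

Three tiny layers already need 60 parameters. Make the embeddings thousands of numbers long and stack up many more layers, and the count climbs into the billions.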
The Hyperparameters: Temperature, Top-p, and Top-k
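LLM designers can also specify a handful of other parameters, known as hyperparameters. Unlike embeddings, weights, and biases, these are not learned during training but set by hand. The main ones are called temperature, top-p, and top-k. Temperature acts as a kind of creativity dial: low values make the model favor the most likely next word, while higher values give less likely words more of a chance. Top-k limits the choice to the k most likely next words, and top-p limits it to the smallest set of most likely words whose combined probability reaches p.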
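A sketch of how these three dials might shape the choice of the next word is below. The candidate words and their scores are invented, the function names are hypothetical, and the order of operations (temperature, then top-k, then top-p) is one common arrangement rather than a fixed standard. Temperature must be greater than zero in this version.

```python
import math
import random

def next_token_probabilities(scores, temperature=1.0, top_k=None, top_p=None):
    """Turn raw scores into next-word probabilities, applying the three dials."""
    # Temperature: dividing by a small number exaggerates the gaps between
    # scores; dividing by a large number flattens them.
    scaled = {tok: s / temperature for tok, s in scores.items()}
    highest = max(scaled.values())
    exps = {tok: math.exp(s - highest) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )

    # Top-k: keep only the k most likely candidates.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p: keep the most likely candidates until their combined
    # probability reaches p, then drop the rest.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # Rescale so the surviving probabilities add up to 1 again.
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def sample(probabilities):
    """Pick one word at random, in proportion to its probability."""
    tokens = list(probabilities)
    weights = [probabilities[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the word after "The cat sat on the".
scores = {"mat": 2.0, "sofa": 1.0, "moon": 0.0, "piano": -1.0}

print(next_token_probabilities(scores, temperature=0.5))
print(next_token_probabilities(scores, temperature=2.0))
print(next_token_probabilities(scores, top_k=2))
print(sample(next_token_probabilities(scores, top_p=0.9)))
```

Turning the temperature down piles probability onto "mat"; turning it up gives "moon" and "piano" more of a chance. Top-k and top-p simply cut off the unlikely tail before the model picks.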
The Future of LLMs: Smaller Models, More Efficient
Researchers are still figuring out ways to get the most out of a model's parameters. As the gains from straight-up scaling tail off, jacking up the number of parameters no longer seems to make the difference it once did. It's not so much how many you have, but what you do with them.
Conclusion
The parameters of language models are a complex and fascinating topic. By understanding how these parameters work, we can gain insights into the inner workings of AI and how it can be used to improve our lives. As researchers continue to push the boundaries of what is possible with LLMs, we can expect to see even more impressive achievements in the years to come.
Source: https://www.technologyreview.com/2026/01/07/1130795/what-even-is-a-parameter/




