What this blog is not

There is plenty of thoughtful journalism out there about DeepSeek, much of it based on hard work by investigative reporters contextualizing the sociological, political and sector-level impacts of the DeepSeek organization and the immediate impact of the R1 model drop. This blog is not about those topics.

Instead, in this piece we focus on introducing the machine learning concepts relevant to DeepSeek and present some practical, actionable facts and empirical results that can be gleaned from the models in the R1 release. The R1 paper is very readable, so don't be afraid to read it, especially after reading this blog. Here is the link: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

The first two parts of this blog introduce and explore the relevant terminology and concepts related to the R1 model, while the third part presents some results gleaned from testing R1-enabled models in WWT's AI Proving Ground.

Training a foundational LLM is hard

What does it take to create one of today's "foundational" large language models (LLMs)?

The Llama 3 Herd of Models paper from Meta gives us a glimpse into this complex process. The paper has well over 500 authors, and the breadth and depth of the expertise implied by that number speaks to the level of effort required to achieve the capabilities we've come to expect from foundational LLMs.

Beyond the human capital, estimates (like this one) indicate the cost of the training compute for Llama 3 to be in the $60m ballpark. Training compute costs are on top of the very significant costs to gather and prepare the training data set. Ultimately, Llama 3 models were trained on about 10 trillion words of unlabeled text (estimated from 15.6 trillion training tokens @ 1.5 tokens per word), most of which were gathered from public data sources.
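For readers who like to check the arithmetic, here is the token-to-word conversion as a quick sketch; the 1.5 tokens-per-word ratio is the rough estimate cited above, not an exact figure:

```python
# Back-of-the-envelope check of the word count behind the 15.6T-token figure
training_tokens = 15.6e12      # Llama 3 training tokens (from the paper)
tokens_per_word = 1.5          # rough average for English text (an estimate)
words = training_tokens / tokens_per_word
print(f"{words / 1e12:.1f} trillion words")  # ~10.4 trillion words
```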

As outlined in the paper, merely gathering the raw data was not enough. Extensive data preparation was required to ensure a successful outcome. The complexity of that methodology leaves no doubt that not only the execution but also the development of the training pipeline was a highly non-trivial and expensive task.

Yet, sourcing, storing, curating and preparing this volume of training data is far from the only challenge faced by creators of foundational models.

Training is a balancing act

A crucial aspect of training an LLM is the machine learning methodology used to learn the model weights from the training data. Successful machine learning requires a careful balance between the various attributes of the training methodology and aspects of the model's structure. These methodological attributes are called "hyperparameters."

An example of a hyperparameter is the number of times each piece of training data is used during the training process (the number of "epochs"). This is analogous to the number of times a child might review a specific word while studying for a spelling bee. Interestingly, LLMs tend to be trained by reviewing each element of the bulk training corpus only once (i.e., this hyperparameter is set to 1).
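To make this concrete, here is a hypothetical hyperparameter configuration of the kind a training run might use; the names and values are purely illustrative and not taken from any particular model:

```python
# Illustrative (hypothetical) hyperparameters for an LLM training run
hyperparameters = {
    "epochs": 1,              # passes over the training corpus (LLMs typically use 1)
    "learning_rate": 3e-4,    # how large each weight update is
    "batch_size": 4_000_000,  # tokens processed per weight update
    "warmup_steps": 2_000,    # steps before the learning rate reaches its peak
}
```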

Separately from hyperparameters, model architects must first decide on the structure of the model itself. One such decision is the size of the model — the number of internal parameters (numbers), also called weights. When you see a model name such as "Llama 70B," the 70B indicates the model has 70 billion internal weights.
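To make "number of weights" concrete, here is a minimal sketch (using PyTorch, purely for illustration) that counts the parameters of a toy model; a 70B model is the same idea at vastly larger scale:

```python
import torch.nn as nn

# A toy two-layer network, standing in for a real LLM
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Count every learnable weight in the model
num_weights = sum(p.numel() for p in model.parameters())
print(f"{num_weights:,} parameters")  # ~8.4 million here; "Llama 70B" has ~70 billion
```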

Generalization is the goal

There is a balancing act between hyperparameters, training set size and model size. Training will be unsuccessful for a model that has too few weights given the size of the training data set or, conversely, when an adequately sized model is trained on too small a data set. It is also often necessary to search for the right hyperparameter settings to achieve optimal results.

One example of unsuccessful training is when the model ends up memorizing the training data rather than learning more general patterns within it. When learning a multiplication table, we may be OK with a child memorizing the table, but we ultimately want the child to figure out the relationships between the numbers in the table. Any particular multiplication table is merely an example of a more general pattern. 

Similarly, imagine what you might think of an AI that could only regurgitate copies of web pages or snippets of text from its training data. Would you consider it to be more useful than a search engine?

Instead of reproducing training examples, we expect models to generate novel sequences of text. In fact, our expectations for these models have advanced to the stage where model performance is measured against expert humans. This ability to provide answers that go beyond the training data is known as "generalization." Generalization is what allows models to acquire capabilities beyond memorization, sometimes even on par with humans.

For example, OpenAI's "Deep Research" model, reportedly built on the company's yet-to-be-released o3 model, has achieved a score of nearly 27 percent on "Humanity's Last Exam," more than doubling the next nearest competitor's score. While 27 percent may not sound like a high score, keep in mind that correctly answering any question on that exam requires the model to have capabilities well beyond simply memorizing its training set.

Unsupervised learning, the key to LLM training

Another key ingredient in a model's training architecture is the training algorithm itself. The bulk of LLM training happens using unsupervised learning algorithms that use unlabeled datasets.

We won't get into the weeds about learning algorithms. However, it is important to understand two things: unsupervised learning in LLMs requires very large training datasets (recall that Llama 3 was trained on roughly 10 trillion words), and "unsupervised" algorithms use training examples that are not independently labeled. When independent labels are necessary, they are frequently generated through human review of the training data, which is impractical at this scale.

Labeled data means that each input training example can somehow be provided to the model together with a "ground truth" label. The label is used to guide the learning process by determining if the model has correctly processed the training input to arrive at the same label. That determination allows the training algorithm to modify the model's weights according to its success via a technique called "gradient descent."

Gradient descent is relatively easy to understand: If the model output matches the label, we increase the model weights that contributed to that good result. This is done in accordance with the "gradient" (think "derivative" from calculus) of the functions that compose the model itself. Conversely, if the model output does not match the label, those same weights are reduced. By processing trillions of inputs, the model weights slowly converge to weights that tend to produce good results.

Technically, the words "increase" and "decrease" in the foregoing description are not strictly correct, but you get the idea: The weights are changed in a way that either increases or decreases the likelihood of a correct response.
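Here is a minimal, purely illustrative sketch of one gradient-descent step on a single labeled example; it uses a tiny linear model (not an LLM) so the arithmetic stays visible:

```python
import numpy as np

# Tiny "model": a single weight vector; the label is what we want the model to output
weights = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 2.0, 3.0])   # one training input
label = 1.0                     # ground-truth output for this input

prediction = weights @ x        # model output
error = prediction - label      # how far off we are
gradient = error * x            # gradient of the squared error with respect to the weights
weights -= 0.01 * gradient      # nudge the weights against the gradient

print(prediction, weights)      # after many such steps, predictions converge toward labels
```

An LLM's training loop is the same idea applied to billions of weights and trillions of tokens, with the gradients computed automatically through every layer of the model.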

Examples of supervised vs unsupervised data

Supervised training data for a model that classifies images would consist of images of a thing plus labels that identify what class each image falls in. For example, an image of a bird plus the word "bird." Supervised training of mathematical word problems might consist of the problem plus the answer, possibly also including the reasoning steps.

Unsupervised training data would consist only of the images and the math problems, without the classification labels, the answers or the reasoning.
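Put in code form, the difference is simply the presence or absence of a label field; the records below are hypothetical examples, not drawn from any real dataset:

```python
# Supervised examples: each input is paired with a ground-truth label
supervised = [
    {"input": "photo_of_bird.jpg", "label": "bird"},
    {"input": "What is 6 x 7?", "label": "42"},
]

# Unsupervised examples: the same kinds of inputs, with no labels attached
unsupervised = [
    {"input": "photo_of_bird.jpg"},
    {"input": "What is 6 x 7?"},
]
```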

Note: While an image classifier trained without labels could still learn to distinguish between birds and boats, it could not learn the text labels themselves, and the training process could not benefit from the ground truth those labels provide. Similarly, a model could learn to answer math problems without explicitly seeing answers if its training set included math facts or explanatory text, such as textbooks. Learning without supervision is possible: a model may learn to distinguish between animals without ever knowing their names, much as an infant can register surprise at the "odd thing" in a group of objects before acquiring language.

Unsupervised learning makes today's LLMs possible because it allows large datasets to be prepared for training using automation, without human labelers in the loop. It is particularly well suited to training LLMs because language is self-labeling via its own structure. Techniques such as blanking out the last word of a sentence can teach a model how to guess the missing word, making that missing word akin to a label for the input sentence.

Crucially, in this example, that word already exists in the data, and the training process itself automatically causes it to act like a label.
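Here is a minimal sketch of how raw, unlabeled text can be turned into (input, label) pairs automatically, in the spirit of the blanking-out technique described above:

```python
def make_examples(sentence: str):
    """Split a sentence into (input, label) pairs for next-word prediction."""
    words = sentence.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for prefix, next_word in make_examples("the cat sat on the mat"):
    print(f"input: {prefix!r:24} label: {next_word!r}")
# The "labels" were already sitting in the text; no human labeler was needed.
```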

Scaling laws

Researchers have identified "scaling laws" for language models. These scaling laws allow us to project how a model of a given size will perform on standardized tests of model success. In other words, scaling laws quantify the relationship between the size of models and the expected performance against success metrics "at convergence."

Convergence is the point at which continuing to train a model no longer appreciably improves it. As alluded to above, larger models need more data to reach convergence. This points to one of the challenges with unsupervised learning.
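As an illustration of the kind of relationship these laws describe — expected loss at convergence as a function of model size and data size — here is a sketch based on the compute-optimal scaling law from the Chinchilla paper (Hoffmann et al., 2022). The constants are approximately those reported there and are shown only to convey the shape of the curve, not as exact values:

```python
def expected_loss(num_params: float, num_tokens: float) -> float:
    """Chinchilla-style scaling law: loss falls as model size and data volume grow."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28  # approximate published fits
    return E + A / num_params**alpha + B / num_tokens**beta

print(expected_loss(70e9, 15.6e12))  # a 70B-parameter model on ~15.6T tokens
print(expected_loss(7e9, 15.6e12))   # a smaller model on the same data converges to a higher loss
```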

Larger models require piles of new input data, and we are running out of sources for it. Furthermore, by its very nature, unsupervised learning is about learning patterns that exist in the input data and generalizing from those examples. Yet other types of learning algorithms can yield successful models without ever seeing a single example of what is eventually learned.

Trial-and-error in humans is one example. Another is reinforcement learning, which we will discuss later in this series. Briefly, you can think of it as a carrot-and-stick, or reward-and-punish, type of method.

Next token prediction

After training, a "model runner" is employed to use the model. The runner's job is called "inference," which involves computing the mathematical functions of the model using the model weights plus the inputs to calculate the outputs.

When an LLM is prompted, the runner passes that prompt to the model as its initial input and calculates an output token. For the purposes of this discussion, think of a token as a word. That first output token constitutes the first element of the prompt response. The model runner then uses the entire initial input plus that new token as a new (longer) input to guess the following token.

Inference is repeated, tacking on additional tokens to the output, until some stopping condition is met. For example, in some cases, the LLM emits a token that signals the "end of output." This repeated process, where an input sequence of ever-increasing length is fed back into the same model, is called an "autoregressive" loop.
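The autoregressive loop can be sketched in a few lines of Python; `model` here is a hypothetical stand-in for a real model runner's internals that returns a single next-token id:

```python
def generate(model, prompt_tokens, max_tokens=256, end_token=0):
    """Minimal autoregressive loop: feed the growing sequence back into the model."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = model(tokens)     # predict one token from the whole sequence so far
        tokens.append(next_token)      # tack it onto the input for the next step
        if next_token == end_token:    # stop when the model signals "end of output"
            break
    return tokens[len(prompt_tokens):] # the response is everything after the prompt
```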

Many possibilities

When an LLM predicts the next token, it produces a raw score for every token in its entire vocabulary of possible output tokens; these raw values are known as the logits. After normalization (via a function called softmax), the logits become a probability distribution over that vocabulary.

You may be wondering why you don't see multiple token (word) possibilities in the output of LLMs. This is because the logits are further processed by the output stage of the model runner to select a single token for display. One simple way to do this is to select the most probable token. Another way is to sample from the distribution (a fancy way of saying "throw weighted dice, weighted according to the distribution"), selecting a single token at random.

At one end of the spectrum, the runner always selects a token deterministically (e.g., the one with the largest logit); at the other end, it may (with low likelihood) give you a token having a small logit. The "knob" that moves the runner along this spectrum is called the "temperature" of the output stage of the model runner.
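Here is a minimal sketch of how a model runner might turn logits into a single token, with temperature controlling where we sit on that spectrum; the logits shown are made up for illustration:

```python
import numpy as np

def pick_token(logits, temperature):
    """Greedy selection at temperature 0; weighted random sampling otherwise."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        return int(np.argmax(logits))        # always the most likely token
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                     # softmax over the vocabulary
    return int(np.random.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]        # made-up scores for a 3-token vocabulary
print(pick_token(logits, 0))    # deterministic: always token 0
print(pick_token(logits, 1.5))  # higher temperature: other tokens show up more often
```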

Note that because LLMs are autoregressive, their output may diverge significantly when a different (especially low-likelihood) token is selected early in the output sequence. For use cases like creative writing, that's a good thing; for use cases like running a repeatable demo with an LLM component, it can be a bad thing.

Tip: To get deterministic outputs from an LLM, set the output temperature to zero.
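In practice, most model runners and hosted APIs expose this knob directly. For example, with an OpenAI-compatible endpoint the call might look like the sketch below; the model name is just an example, and the snippet assumes an API key is already configured in the environment:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key are configured

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; substitute whatever your endpoint serves
    messages=[{"role": "user", "content": "Who directed Jurassic Park?"}],
    temperature=0,        # greedy decoding: the same prompt yields the same answer
)
print(response.choices[0].message.content)
```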

Reasoning models and chain of thought (CoT)

Vast amounts of compute go into training baseline models like OpenAI's GPT-4o. Models like OpenAI's o1, o3 and Deep Research improve on these baseline models by incorporating "reasoning" capabilities. To elicit reasoning capabilities, the baseline models are further trained via supervised learning using labeled data sets.

Each training input includes a correct answer prefixed by the reasoning required to arrive at that answer. The goal is to create a model that performs well via "chain of thought" (CoT) prompting. CoT prompts ask the model to emit its thinking before generating the final answer, mimicking its training.
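A single CoT training example might look something like the following; the format and field names are hypothetical, shown only to convey the "reasoning first, answer second" structure:

```python
# Hypothetical supervised CoT training example: the reasoning precedes the final answer
cot_example = {
    "prompt": "Who directed the movie 'Jurassic Park'?",
    "target": (
        "<think>The film is a 1993 adaptation of Michael Crichton's novel, "
        "and its large-scale, effects-driven style points to Steven Spielberg.</think>\n"
        "The movie 'Jurassic Park' was directed by Steven Spielberg."
    ),
}
```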

Some researchers hypothesize that CoT training surfaces latent capabilities already present in bulk pre-trained LLMs: the unsupervised nature of their original training simply never explicitly tuned the models to emit the CoT tokens that condition a better final result.

Leading up to DeepSeek R1, it was believed that CoT training might require extensive supervised fine-tuning to elicit desirable behaviors. For more on this subject, see Part 3 of this blog series.

Chain of thought is not human reasoning

Here is an example output from a model (let's say Llama) not trained for reasoning and without CoT prompting. The prompt is not important to this discussion:

The movie 'Jurassic Park' was directed by Steven Spielberg. Released in 1993, it is a science fiction adventure film based on the novel of the same name by Michael Crichton. The film was a massive commercial success and received widespread critical acclaim, winning several awards, including three Academy Awards.

Now, let's take that Llama model and somehow add R1's reasoning capabilities, and let's prompt it to emit its reasoning as a prelude to its answer. Here's the output now:

<think>Okay, so I need to figure out who directed the movie 'Jurassic Park.' I remember that it's a pretty famous movie [324 words removed] So, putting it all together: 'Jurassic Park' was a major film directed by a well-known director, likely Spielberg. His style matches the movie's themes and effects, and he's known for handling such large-scale projects. I don't recall any other big-name directors being associated with the original 'Jurassic Park' film, so I'm pretty confident it was Spielberg.
</think>

The movie 'Jurassic Park' was directed by Steven Spielberg. Spielberg, renowned for his work on films like 'Jaws' and 'E.T. the Extra-Terrestrial,' brought the sci-fi adventure to life with groundbreaking special effects and a thrilling narrative, aligning perfectly with his directing style.

As humans, it's natural for us to interpret what's going on above as the model "thinking out loud" to generate a better answer. However, that's not really the case. Instead, we should think of this result in simpler terms: The <think> text increased the probability of the output response. Remember that LLMs are always simply predicting the next token.

Perhaps surprisingly, researchers have shown that CoT improves answers even when the CoT "thinking" tokens contain incorrect "reasoning." This indicates the model is not performing a reasoning process the way a logician might. Instead, the model has been conditioned to emit better answers when those CoT tokens exist in its context (more on this later).

A more rigorous way to think about CoT is: "We need certain tokens in the first part of the context to get a better stream of tokens out later." Training the model to "reason" yields a model that knows which tokens to emit first to get a better answer, even if those tokens talk about strawberries when answering a prompt about the moon.

For more, continue to Part 2 of this series.