DeepSeek R1: Technical Insights (Part 2)
Read Part 1 of this three-part blog series here.
Less training, more inference
We can think of the computation that goes into generating a response from a reasoning large language model (LLM) as occurring in several layers. At the bottom layer, we have the model runner (e.g., NVIDIA's Triton Inference Server with TensorRT-LLM, or the open-source vLLM) running on the CPU and GPU, passing the input tokens through the model to get output tokens. Even when using a reasoning model, this is no different from what happens with a baseline LLM.
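As a concrete illustration of that bottom layer, here is a minimal sketch of driving a model through the open-source vLLM runner; the checkpoint name and sampling settings are placeholders, not recommendations:

```python
# Minimal sketch of the "runner" layer: tokens in, tokens out.
# Assumes vLLM is installed and the (placeholder) checkpoint fits on the local GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")  # placeholder checkpoint
params = SamplingParams(temperature=0.6, max_tokens=1024)

prompt = "What is the capital of France?"
outputs = llm.generate([prompt], params)

# The runner simply returns generated tokens; whether the model "reasons" first
# is a property of the model and the prompt, not of the runner.
print(outputs[0].outputs[0].text)
```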
However, in an LLM that has learned to reason, and/or if our prompt explicitly requests it, the model will first generate chain of thought (CoT) tokens before proceeding to generate responsive tokens. We could characterize this, taking some poetic liberties, as the runner executing the LLM and the LLM then executing its own "CoT engine."
The model is not really reasoning in the way a human does, but it is certainly doing more work. Once that step is finished, the LLM's context is conditioned such that the tokens that follow are more likely to form a good answer.
Irrespective of whether this constitutes actual reasoning, more tokens have been generated in exchange for higher success metrics on the requested end task.
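To make that extra work visible, the sketch below splits an R1-style completion into its chain-of-thought and answer portions. It assumes the model wraps its CoT in `<think>...</think>` tags (as DeepSeek R1 does) and that `completion` already holds the raw generated text:

```python
# Split an R1-style completion into CoT text and answer text.
# Assumes the model emits its chain of thought inside <think>...</think> tags.
def split_cot(completion: str) -> tuple[str, str]:
    if "</think>" in completion:
        cot, answer = completion.split("</think>", 1)
        return cot.replace("<think>", "").strip(), answer.strip()
    return "", completion.strip()  # non-reasoning output: everything is the answer

completion = "<think>France's capital... that is Paris.</think>The capital of France is Paris."
cot, answer = split_cot(completion)
print(f"CoT: {len(cot.split())} words, answer: {len(answer.split())} words")
```

Serving stacks commonly separate the two portions in this way so that applications can show, hide or log the reasoning independently of the final answer.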
Rewards for reasoning
DeepSeek R1's reasoning stems from reinforcement learning (RL) techniques applied to instill CoT capability into DeepSeek's prior (non-reasoning) V3 LLM. RL uses a reward and penalty mechanism to guide a model's learning, without the need for the large curated datasets that supervised (or unsupervised) learning requires. OpenAI's o1 model is also trained to reason via RL.
Quoting from the R1 research paper: "Rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies."
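A rough sketch of what such an incentive can look like: the R1 paper describes simple rule-based rewards (answer accuracy plus a format check on the reasoning tags) rather than a learned reward model. The reward values and string checks below are illustrative assumptions, not DeepSeek's published code:

```python
# Illustrative rule-based reward in the spirit of DeepSeek-R1-Zero's RL setup:
# reward well-formed <think>...</think> output and reward correct final answers.
import re

def reward(completion: str, ground_truth: str) -> float:
    score = 0.0
    # Format reward: the chain of thought should be wrapped in <think> tags.
    if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        score += 0.5  # assumed weighting, for illustration only
    # Accuracy reward: compare whatever follows the reasoning to the reference answer.
    answer = completion.split("</think>")[-1].strip()
    if answer == ground_truth.strip():
        score += 1.0
    return score

print(reward("<think>2 + 2 is 4</think>4", "4"))  # 1.5: correct format and answer
print(reward("5", "4"))                           # 0.0: no reasoning tags, wrong answer
```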
However, that is not the end of the story: the methodology used to create R1 is not pure RL. Instead, it is a multi-stage training strategy. Applying RL directly to the V3 LLM produces an intermediate model, DeepSeek-R1-Zero; that model is then used to derive a training set, which is ultimately used to re-train the original non-reasoning V3 LLM via non-RL (supervised fine-tuning) methods.
This diagram depicts the training pipeline for R1:
The success of RL in eliciting CoT capabilities from the DeepSeek V3 LLM bolsters the hypothesis that CoT capability is latent in baseline LLMs: they just need additional training to learn to leverage CoT conditioning for better answers. The hypothesis is further supported by the small size of the derived supervised fine-tuning (SFT) training set used to instill reasoning into the original V3 model, producing DeepSeek-R1 in the bottom right of the diagram above.
Non-RL techniques for training CoT are more challenging (and potentially more costly) because the process of creating the necessary training set is non-trivial. Certainly, one of the contributions from the DeepSeek team is to have openly proved that RL is a viable mechanism to attain CoT without access to another CoT model, and without the trouble and expense of generating a training set manually.
This SFT training set, derived from the predecessor models, could be interpreted as a distillation, or exfiltration, of capabilities.
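One way to picture how such a training set can be derived is rejection sampling: draw many CoT completions from the RL-trained model, keep only those whose final answers check out, and fine-tune on the survivors. The `generate` and `is_correct` helpers below are hypothetical stand-ins for the model runner and the answer checker; this is a sketch, not DeepSeek's published pipeline:

```python
# Sketch of deriving an SFT set from a reasoning model via rejection sampling.
def build_sft_set(prompts_with_answers, generate, is_correct, samples_per_prompt=4):
    sft_examples = []
    for prompt, reference in prompts_with_answers:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)          # CoT + answer from the RL-trained model
            if is_correct(completion, reference):  # keep only verifiably good traces
                sft_examples.append({"prompt": prompt, "completion": completion})
    return sft_examples
```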
Note: DeepSeek has also released open-source Llama- and Qwen-based reasoning models (see the bottom middle of the diagram). More on these in Part 3 of this blog series.
The rise of inference compute, continued…
As discussed above, developments in AI have shifted compute from training to inference. Note that, in the literature, inference is sometimes referred to as "test-time" compute. In the past 12 months alone, the academic paper hub lists nearly 400 papers that contain the phrase "test-time" in their title.
Inference compute is on the rise for a number of reasons, including:
- Model-level/technical reasons
  - Chain of thought (CoT) and other developments that improve model success by increasing test-time compute (e.g., the number of generated tokens).
  - Retrieval-augmented generation (RAG) and other similar model use cases that dramatically increase context length when prompting models.
  - Agentic and multi-step systems that make multiple inference requests stemming from a single prompt or command (see the sketch after this list).
  - Multi-model and mixture of experts (MoE) methods that run inference across more than one model to improve results.
  - Continuous fine-tuning, adaptation and other techniques which, while technically still using compute for training, nevertheless shift that compute out of the initial model-development stage and onto hardware outside of large-scale training clusters.
- Population/societal reasons
  - Infrequent training at the "large foundational model" stage.
  - Growing demand for real-time AI applications.
  - The rise of AI at the edge.
  - The ubiquitous adoption of AI into applications' computer-use modalities.
  - Cost considerations.
  - Specialized AI hardware for inference.
  - The proliferation of AI APIs.
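As an example of how these factors compound, the sketch below shows how a single user command in an agentic system can fan out into several inference requests. The `call_llm` function is a hypothetical wrapper around any inference endpoint, and the prompts are placeholders:

```python
# Sketch of why agentic systems multiply inference: one user command can fan out
# into several model calls (plan, execute each step, then summarize).
def handle_command(command: str, call_llm) -> str:
    plan = call_llm(f"Break this task into steps:\n{command}")          # call 1: plan
    results = [call_llm(f"Carry out this step:\n{step}")                # calls 2..N: execute
               for step in plan.splitlines() if step.strip()]
    return call_llm("Summarize these results:\n" + "\n".join(results))  # final call: synthesize
```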
To sum it up
DeepSeek's models, and DeepSeek R1 in particular, represent an important step toward democratizing the development of models capable of achieving high success metrics on a variety of end tasks. DeepSeek's R1 model, as well as the Qwen- and Llama-derived CoT models, are proof that CoT reasoning can be instilled at a much lower training-time compute cost than that of competing models.
Continue on to Part 3 for the results of testing certain DeepSeek models in WWT's AI Proving Ground.