Please consider reading Parts 1 and 2 of this blog series, which offer introductory material relevant to this content.

Analysis: DeepSeek's R1

In this final installment, we focus on actionable facts from the DeepSeek R1 paper and empirical evidence gathered by WWT from testing the R1 models in our AI Proving Ground. We encourage you to read DeepSeek's paper. It will provide an unfiltered look into the model's training and development, plus it is well-written and not overly technical.

You've probably seen plenty of breathless reporting about DeepSeek R1's impacts on the sector's stock prices and on the industry. Here, we'll tell the DeepSeek R1 story by first reviewing some of DeepSeek's publication history. We'll then look at the behavior of one of the R1 models.

DeepSeek's ongoing research

DeepSeek has published several papers, some of which are listed below. You'll note that the R1 model is not the only model developed by DeepSeek, and you can see that R1 is the culmination of a series of incremental improvements over time. In fact, the R1 paper explicitly details that R1 is based on their V3 model, and the literature shows the longer lineage that culminates in the V3 and R1 models. In other words, R1 did not come "out of nowhere."

  1. arXiv:2501.12948  [pdfother]  
    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
    Submitted 22 January, 2025; originally announced January 2025.
     
  2. arXiv:2412.19437  [pdfother
    DeepSeek-V3 Technical Report
    Submitted 26 December, 2024; originally announced December 2024.
     
  3. arXiv:2412.10302  [pdfother
    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
    Submitted 13 December, 2024; originally announced December 2024.
     
  4. arXiv:2408.08152  [pdfother
    DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
    Submitted 15 August, 2024; originally announced August 2024.
     
  5. arXiv:2406.11931  [pdfother
    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
    Submitted 17 June, 2024; originally announced June 2024.
     
  6. arXiv:2405.14333  [pdfother
    DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data
    Submitted 23 May, 2024; originally announced May 2024.
     
  7. arXiv:2405.04434  [pdfother
    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
    Submitted 19 June, 2024; v1 submitted 7 May, 2024; originally announced May 2024.
     
  8. arXiv:2403.05525  [pdfother]
    DeepSeek-VL: Towards Real-World Vision-Language Understanding
    Submitted 11 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.
     
  9. arXiv:2402.03300  [pdfother
    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
    Submitted 27 April, 2024; v1 submitted 5 February, 2024; originally announced February 2024.
     
  10. arXiv:2401.14196  [pdfother
    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
    Submitted 26 January, 2024; v1 submitted 25 January, 2024; originally announced January 2024.
     
  11. arXiv:2401.06066  [pdfother
    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
    Submitted 11 January, 2024; originally announced January 2024.
     
  12. arXiv:2401.02954  [pdfother
    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
    Submitted 5 January, 2024; originally announced January 2024.
     
  13. arXiv:1801.03406  [pdfother
    DeepSeek: Content Based Image Search & Retrieval
    Submitted 11 January, 2018; v1 submitted 9 January, 2018; originally announced January 2018.

The impact of chain of thought (CoT) reasoning

As you read in Part 2 of our blog series, the researchers were able to endow previously available open-source models with R1's reasoning capabilities. DeepSeek released these models into the public domain. These open models include a Llama-based model with R1 reasoning, providing us with a unique opportunity to quantify the cost of CoT reasoning.

Here at WWT, our data scientist, Pradeep Singh Gaur, leveraged the resources of our AI Proving Ground and Advanced Technology Center (ATC) to quantify the increased demand for inference computing when using models equipped with R1's reasoning capability. These results are indicative of the expected increase in inference compute from the use of any model with CoT prompting and reasoning capacity.

A graph with blue and orange lines that shows the visual quantification of the increased demand on inference compute when using models equipped with R1's reasoning capability.
Visual quantification of the increased demand on inference compute when using models equipped with R1's reasoning capability.

What you see in the figure above is a series of prompts comparing the number of tokens emitted by the non-reasoning (orange) and reasoning (blue) versions of a given model. This analysis is made possible because we can compare the output from DeepSeek's R1 open-source models to the output from the same non-R1 open-source model it is based on. Such a side-by-side comparison of a reasoning model against the exact base model to which reasoning has been added is not possible for available closed-source reasoning models such as o1 (which is more distant from 4o).

When performing this comparison, the prompt given to the reasoning model is slightly different in that it is amended to ask the model for its thinking. This naturally results in more output tokens. Of course, it is fair game to count those tokens because the point of CoT is to improve the quality of the final response by requesting that the model "reason." 

After cleaning the results by removing outliers, we find that reasoning requires 300 percent more tokens at inference. Even if we remove the thinking part of the reasoning model's response, the "answer" portion of the response is 70 percent longer.

Given the small sample size, these results are indicative — not rigorous; but it is clear that CoT prompting of a reasoning model significantly increases inference costs. Moreover, our tests indicate that the increase could range well above 3x for some prompts. Several large outliers were removed.

This is not a great surprise to anyone who has read the R1 paper or followed CoT research. In fact, the graph below, pulled from the paper, depicts the increase in token generation from the model as the RL training pipeline that instilled CoT progressed. However, our side-by-side comparison above quantifies the increased cost of inference relative to the equivalent non-reasoning model.

A graph showing a number of steps

AI-generated content may be incorrect.
Visual representation of increased token generation for R1 during training. Source.

Other academic research on the topic further supports these conclusions. For example,  this paper contains the following table and figure, concluding: "The results clearly demonstrate that DeepSeek R1, while capable of solving complex mathematical problems that eluded other models in the previous constrained experiment, does so at the cost of significantly higher token usage."

A table with numbers and a number

AI-generated content may be incorrect.
Histogram of average tokens per response by model.
Histogram of average tokens per response by model. Source.

Finally, in another paper with prominent authors such as Li Fei-Fei and Emmanuel Candès, we find further empirical evidence for the additional costs implicit in improving success metrics via CoT reasoning and other approaches that leverage test-time compute:

A graph of numbers and a line of numbers

AI-generated content may be incorrect.
Source.

Mitigating inference costs

Processing context tokens that are initially provided in the prompt is less costly than generating that same set of tokens during inference. Why? Because there is no need to run the autoregressive compute across the increasingly long context leading to the full final input set.

This fact yields an important insight relative to controlling inference costs in CoT prompting: To reduce inference costs, you should provide the reasoning for a good answer in your prompt rather than asking the model to reason. Note, however, that for a given model, experimentation is required to empirically verify that the reasoning you provide (versus the reasoning the model would generate) results in similar improvements in success metrics. After all, as mentioned in the previous installments, research has shown that while CoT nevertheless improves outcomes, it does not always contain valid reasoning.

Teaching others to reason

The R1 paper does not focus on the fact that R1's reasoning capabilities were successfully transferred from it (the teacher) to other unrelated LLMs (the students), but this is a very important element in the paper. The authors call their methodology "distillation," and they used it on several student models. Namely, various sizes of the Qwen 2.5 and Llama 3 model series. The resulting models were subsequently released into the public domain (see Hugging Face) and one of these was used in our analysis above.

Distillation is not an entirely new concept. However, the distillation used in the R1 paper is somewhat novel. It is not the "Knowledge Distillation" (KD) of Hinton et al from 2015. Hinton's KD requires access to relatively raw outputs of the model— the logits described in Part 2 of our blog series.

Instead, DeepSeek's distillation involves prompting their R1-Zero model to derive a training set without access to logits, which is then used to train student models via standard supervised fine-tuning (SFT). In other words, the R1 reasoning training set is derived by normal run-of-the-mill prompting of R1-Zero.

New attack vectors

Hidden in the foregoing discussion about distillation is a significant insight into the challenges of securing intellectual property in AI systems. Reasoning and other capabilities of a given model may have been learned at great expense, including the expense of the human capital spent developing the learning pipeline itself, or because the training dataset from which the capabilities arose is highly proprietary.

This paper from March 2024 is titled "Logits of API-Protected LLMs Leak Proprietary Information," and states that "it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries."

The paper's analysis is based on accessing OpenAI's APIs. So, exfiltration is a serious concern, as demonstrated by this story relative to OpenAI's restriction of their APIs. Even ChatGPT 4o's response, when asked about the topic, indicates concern: "OpenAI's APIs do not offer direct access to the full logits (raw model outputs) for all possible tokens. This limitation is in place to prevent potential misuse, such as model distillation or replication by third parties… OpenAI and Microsoft are investigating potential data theft by a Chinese AI start-up, DeepSeek, which allegedly accessed OpenAI's technology via its API. This incident underscores the importance of restricting access to certain model outputs to protect intellectual propertyaccess to full logits remains restricted to safeguard the integrity and security of its models." (emphasis added)

Now, DeepSeek has come up with a way to distill model capabilities without access to raw logits, and they used it as a core method in their development of R1 via V3 and R1-Zero. One potential conclusion is: A model's capabilities can potentially be exfiltrated even when the only access you have to that model is the ability to prompt it.

Research on this topic is ongoing, and may conclude that capabilities must already be latent in student models for this kind of distillation to work. However, that is not the only security concern relative to CoT models: As noted in the previous section, providing these models with `<thinking>…</thinking>`(CoT) context may save inference compute by giving the model the CoT tokens it needs rather than having the model generate them.

Could these tokens, however, be an attack vector on CoT LLMs that aids threat actors when trying to coax these models out of their safety and alignment training? This too, is an active area of research.

Conclusion

DeepSeek's research has been an influential contribution to the use of reinforcement learning to elicit latent capabilities in large pre-trained models; and they have also proven that small SFT training sets can be very effective for transferring those capabilities, even to unrelated models. The behavior of models, both during training and inference, continues to exhibit both surprises and opportunities for researchers and industry alike. DeepSeek's contributions go well beyond the release of their R1 model, and it's available on the front end on the internet. Undoubtedly, more innovations will follow from across the machine learning sphere.