Description
Artificial intelligence (AI) is a technology that enables machines to simulate human intelligence. It can perform tasks such as learning, problem solving, and decision making.
Overview
We introduce our
first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL)
without supervised fine-tuning (SFT) as a preliminary step, demonstrates
remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally
emerges with numerous powerful and intriguing reasoning behaviors. However, it
encounters challenges such as poor readability and language mixing. To address
these issues and further enhance reasoning performance, we introduce
DeepSeek-R1, which incorporates multi-stage training and cold-start data before
RL.
DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

Benchmark performance of DeepSeek-R1 (accuracy / percentile, %):

Benchmark                       DeepSeek-R1   OpenAI-o1-1217   DeepSeek-R1-32B   OpenAI-o1-mini   DeepSeek-V3
AIME 2024 (Pass@1)                     79.8             79.2              72.6             63.6          39.2
Codeforces (Percentile)                96.3             96.6              90.6             93.4          58.7
GPQA Diamond (Pass@1)                  71.5             75.7              62.1             60.0          59.1
MATH-500 (Pass@1)                      97.3             96.4              94.3             90.0          90.2
MMLU (Pass@1)                          90.8             91.8              87.4             85.2          88.5
SWE-bench Verified (Resolved)          49.2             48.9              36.8             41.6          42.0
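The Pass@1 figures above are estimated by sampling several responses per problem and averaging their correctness, rather than grading a single greedy answer. The following is a minimal sketch of that estimator, assuming binary correctness labels from an external grader are already available; the pass_at_1 helper and the toy labels are illustrative, not code from the paper.

import numpy as np

def pass_at_1(correct: np.ndarray) -> float:
    """Estimate pass@1 as the mean correctness over all sampled responses.

    correct has shape (num_problems, num_samples); correct[i, j] is 1 if the
    j-th sampled response to problem i is judged correct, 0 otherwise.
    Averaging over samples gives a lower-variance estimate than scoring a
    single sampled answer per problem.
    """
    per_problem = correct.mean(axis=1)  # expected accuracy on each problem
    return float(per_problem.mean())    # average over the whole benchmark

# Toy example: 3 problems, 4 sampled responses each (hypothetical labels).
labels = np.array([
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
])
print(f"pass@1 = {pass_at_1(labels):.3f}")  # -> pass@1 = 0.667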
Contents
1 Introduction
1.1 Contributions
1.2 Summary of Evaluation Results
2 Approach
2.1 Overview
2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model
2.2.1 Reinforcement Learning Algorithm
2.2.2 Reward Modeling
2.2.3 Training Template
2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero
2.3 DeepSeek-R1: Reinforcement Learning with Cold Start
2.3.1 Cold Start
2.3.2 Reasoning-oriented Reinforcement Learning
2.3.3 Rejection Sampling and Supervised Fine-Tuning
2.3.4 Reinforcement Learning for all Scenarios
2.4 Distillation: Empower Small Models with Reasoning Capability
3 Experiment
3.1 DeepSeek-R1 Evaluation
3.2 Distilled Model Evaluation
4 Discussion
4.1 Distillation vs. Reinforcement Learning
4.2 Unsuccessful Attempts
5 Conclusion, Limitations, and Future Work