Thyag's Blog

chain-of-thought in large language models - what has changed over the years?

Abstract

Chain of Thought (CoT) has been one of the most important breakthroughs in Large Language Models (LLMs), particularly because it significantly improves reasoning capabilities. CoT is a simple prompting technique in which the model is given a few examples that include both the questions and the step-by-step reasoning used to solve them.

Though CoT prompting demonstrably improved the reasoning abilities of early LLMs, our replication study with GPT-4 variants and Claude Sonnet-3.7 suggests this advantage has largely disappeared.

On GSM8K and CSQA datasets, we find minimal performance differences between baseline and CoT prompting, with all models achieving 93-96% accuracy regardless of prompting strategy.

All code and logged results are provided in the GitHub repository.

1. Introduction

Large Language Models (LLMs) excel at a wide range of language understanding tasks such as question answering, analysis, translation, and information extraction. However, logical reasoning and arithmetic problems require multi-step reasoning, which has historically been challenging for these models. To address this, the Chain of Thought paper [1] proposed a few-shot prompting method that significantly improves performance.

The key ideas of the original paper can be summarized as follows:

  1. CoT prompting generates a sequence of intermediate reasoning steps before producing the final answer.
  2. By providing few-shot examples of reasoning steps alongside solutions, models achieve significantly higher accuracy.
  3. These prompts follow the structure: <input, chain of thought, output>, which serves as a reference for solving new problems.

We designed a set of experiments to replicate the original results using newer models. Since the original CoT paper was released a few years ago, model capabilities have improved considerably [2]. In our replication, we used GPT-4.1, GPT-4o, and Sonnet-3.7 to study CoT performance on GSM8K and CSQA. We use the InstructGPT, PaLM, and LaMDA 137B accuracy scores from the original paper as our baseline to understand how model capabilities have evolved.

We find that the newer models achieve near-perfect accuracy. Specifically, GPT-4.1 achieves 93% accuracy on CSQA and 94% on GSM8K, with GPT-4o showing similar results, while Sonnet-3.7 achieves the top score of 96% accuracy on GSM8K.

2. Experiment Setup

This section outlines our experiment setup, which we aligned as closely as possible with the original paper, and how we performed our experiments.

2.1. Baseline

The baseline uses in-context examples consisting of input–output pairs without intermediate reasoning. The examples are presented in a question–answer format, and at test time, the model is expected to directly produce the answer. We provide 4–6 examples in the prompt. For example:

Q: A coin is heads up. Ka flips the coin. Sherrie flips the coin. Is the coin still heads up?
A: The answer is yes.

Q: A coin is heads up. Jamey flips the coin. Teressa flips the coin. Is the coin still heads up?
A: The answer is yes.
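
To make this concrete, here is a minimal Python sketch of how such a baseline prompt can be assembled; the helper names and the held-out test question are our own illustration, not the exact harness we ran.

```python
# Minimal sketch (not the authors' exact harness): assemble a baseline
# few-shot prompt from (question, answer) exemplars and append the test question.
BASELINE_EXEMPLARS = [
    ("A coin is heads up. Ka flips the coin. Sherrie flips the coin. "
     "Is the coin still heads up?", "yes"),
    ("A coin is heads up. Jamey flips the coin. Teressa flips the coin. "
     "Is the coin still heads up?", "yes"),
]

def build_baseline_prompt(test_question: str) -> str:
    """Format exemplars as Q/A pairs with no intermediate reasoning."""
    blocks = [f"Q: {q}\nA: The answer is {a}." for q, a in BASELINE_EXEMPLARS]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical test question, used only to show the prompt shape.
print(build_baseline_prompt(
    "A coin is heads up. Maybelle flips the coin. "
    "Shalonda does not flip the coin. Is the coin still heads up?"))
```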

2.2. Chain of Thought Prompting

In this setup, the examples include both the reasoning steps and the final answers. We provide 6–8 such examples in the prompt, followed by the test-time question. For example:

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.
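
The only change relative to the baseline is that each exemplar carries a rationale before the final answer. A minimal sketch of how we think of the two formats, with illustrative helper names:

```python
# Sketch: CoT exemplars store (question, rationale, answer). Dropping the
# rationale recovers the baseline exemplar, which is the only difference
# between the two prompting setups.
COT_EXEMPLARS = [
    ("There are 15 trees in the grove. Grove workers will plant trees in the "
     "grove today. After they are done, there will be 21 trees. How many trees "
     "did the grove workers plant today?",
     "There are 15 trees originally. Then there were 21 trees after some more "
     "were planted. So there must have been 21 - 15 = 6.",
     "6"),
    ("If there are 3 cars in the parking lot and 2 more cars arrive, how many "
     "cars are in the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5.",
     "5"),
]

def build_cot_prompt(test_question: str) -> str:
    """Format exemplars with the rationale placed before 'The answer is ...'."""
    blocks = [f"Q: {q}\nA: {rationale} The answer is {a}."
              for q, rationale, a in COT_EXEMPLARS]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)
```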

2.3. Datasets and Models

Below is the summary of datasets and models used in the original paper.


Table 1: Datasets used for model evaluation


Task Type | Datasets
Arithmetic Reasoning | GSM8K, SVAMP, ASDiv, AQuA, MAWPS
Commonsense Reasoning | CSQA, StrategyQA, Date, Sports, SayCan
Symbolic Reasoning | Last letter concatenation, Coin flip

Table 2: Model families evaluated in the original CoT paper


Model Family | Variants
GPT-3 | text-ada-001 (350M), text-babbage-001 (1.3B), text-curie-001 (6.7B), text-davinci-002 (175B)
LaMDA | 422M, 2B, 8B, 68B, 137B
PaLM | 8B, 62B, 540B
UL2 | 20B
Codex | code-davinci-002

We ran our evaluations of gpt-4.1-2025-04-14 and gpt-4o-2024-08-06 on GSM8K and CSQA, and additionally assessed claude-3-7-sonnet-20250219 on GSM8K.
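
As a rough sketch of the evaluation loop (our assumption of how such a harness looks, not the exact code we used), each prompt is sent to the OpenAI or Anthropic chat API and the text after "The answer is" is extracted for comparison against the gold label:

```python
# Sketch of the evaluation plumbing; query helpers and the answer-extraction
# regex are illustrative assumptions, not the authors' exact implementation.
import re
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def query_openai(prompt: str, model: str = "gpt-4.1-2025-04-14") -> str:
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0)
    return resp.choices[0].message.content

def query_claude(prompt: str, model: str = "claude-3-7-sonnet-20250219") -> str:
    resp = anthropic_client.messages.create(
        model=model, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}])
    return resp.content[0].text

def extract_answer(completion: str) -> str:
    """Take the last 'The answer is ...' span; fall back to the raw text."""
    matches = re.findall(r"[Tt]he answer is\s*([^\n.]+)", completion)
    return matches[-1].strip() if matches else completion.strip()
```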

2.4. Ablation Studies

We found the ablation studies in the original paper particularly insightful. The authors demonstrated that the performance gains from CoT prompting could not be replicated through other prompting methods.

Building on this idea, we extended the study by evaluating newer models under the same ablation setup. There are three ablation experiments performed in the paper.

  1. Variable Compute: Reasoning steps are replaced by a sequence of dots (…) whose length equals the number of characters in the equation needed to solve the problem.
  2. Equation Only: The model is prompted to output only the equations necessary to solve the problem.
  3. Chain of Thought after Answer: The model first writes its answer and is then asked to give the reasoning behind the solution.

For our ablation studies, we tested both gpt-4.1-2025-04-14 and claude-3-7-sonnet-20250219.
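
The sketch below illustrates how exemplars for the three ablations can be constructed, assuming each exemplar stores its question, rationale, solving equation, and answer; the function names are our own and not taken from the original paper's code.

```python
# Illustrative exemplar builders for the three ablations (our assumption of
# the setup, not the authors' released prompts).
def variable_compute_exemplar(question: str, equation: str, answer: str) -> str:
    # Replace the rationale with dots matching the equation's character count.
    dots = "." * len(equation)
    return f"Q: {question}\nA: {dots} The answer is {answer}."

def equation_only_exemplar(question: str, equation: str, answer: str) -> str:
    # Keep only the equation needed to solve the problem.
    return f"Q: {question}\nA: {equation} The answer is {answer}."

def cot_after_answer_exemplar(question: str, rationale: str, answer: str) -> str:
    # Give the answer first, then the reasoning.
    return f"Q: {question}\nA: The answer is {answer}. {rationale}"

print(variable_compute_exemplar(
    "If there are 3 cars in the parking lot and 2 more cars arrive, "
    "how many cars are in the parking lot?", "3 + 2 = 5", "5"))
```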

2.5. Out of Distribution (OOD)

In this experiment we used two tasks:

  1. Last letter concatenation: In this task we asked the model to concatenate the last letters of the words in 3- or 4-word names (e.g., “Amy Alex Brown” → “yxn”) after showing it few-shot examples with 2-word names.

  2. Coin flip: In this task, the model answers whether a coin remains heads up after people either flip or do not flip it. We increase the difficulty by asking questions involving four people flipping, while showing only two-person examples (e.g., “A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Steve does not flip the coin. Is the coin still heads up?” → “no”).
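
Ground truth for both toy tasks can be computed programmatically, which is how outputs for tasks like these are typically checked; the helpers below are our own illustrative sketch, not the original paper's code.

```python
# Ground-truth helpers for the two OOD toy tasks (illustrative).
def last_letter_concat(name: str) -> str:
    """'Amy Alex Brown' -> 'yxn' (last letter of each word)."""
    return "".join(word[-1] for word in name.split())

def coin_flip_answer(flips: list[bool]) -> str:
    """Coin starts heads up; each True entry is one flip. An odd number of flips means tails."""
    return "no" if sum(flips) % 2 == 1 else "yes"

assert last_letter_concat("Amy Alex Brown") == "yxn"
assert coin_flip_answer([True, False, False]) == "no"  # Phoebe flips, the others don't
```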

3. Results

Across GSM8K and CSQA, all models scored 93–96% regardless of prompting strategy (Table 3). Unlike the original paper which reported large jumps from baseline to CoT, we find almost no difference with modern models.

Below are the detailed plots and tables of our evaluation experiments:



Figure 1: The plot shows the accuracy of InstructGPT, PaLM-540B, GPT-4.1, and GPT-4o on CSQA.



Figure 2: The plot shows the accuracy of InstructGPT, PaLM 540B, GPT-4.1, GPT-4o, and Sonnet-3.7 on GSM8K.



Figure 3: The plot shows accuracy for LaMDA 137B, GPT-4.1 and Sonnet-3.7 on GSM8K in the three respective ablation studies.


More detailed numbers are provided in the tables below:


Table 3: Accuracy on CSQA and GSM8K with CoT vs. Baseline prompting


Dataset | Prompting Type | InstructGPT | PaLM 540B | GPT-4.1 | GPT-4o | Sonnet-3.7
CSQA | Baseline | 79.5 | 78.1 | 93.16 | 92.72 | --
CSQA | Chain of Thought | 73.5 | 79.9 | 93.77 | 93.07 | --
GSM8K | Baseline | 15.6 | 17.9 | 94.81 | 94.13 | 96.16
GSM8K | Chain of Thought | 46.9 | 56.9 | 94.88 | 94.43 | 96.09


Table 4: Ablation accuracy (%) on GSM8K for LaMDA 137B (from the original paper), GPT-4.1, and Sonnet-3.7.


Ablation Type | LaMDA 137B | GPT-4.1 | Sonnet-3.7
Variable Compute | 6.4 | 47.1 | 68.2
Equation Only | 5.4 | 79.0 | 96.5
Reasoning Post Answer | 6.1 | 48.8 | 64.0

We also performed OOD studies on a small sample with GPT-4.1 and observed near-perfect accuracy on these toy tasks.


Table 5: OOD accuracy of GPT-4.1


Problem Type | CoT | Baseline | Sample Size
Last Letter Concatenation | 100% | 98% | 200
Coin Flip | 100% | 100% | 50

Compared to the original CoT paper, which evaluated pre-trained models, we observed a drastic improvement with newer models. While the original work highlighted a sharp performance increase from baseline to CoT, we see new models already decomposing problems into reasoning steps even under baseline prompting. This is likely due to post-training optimizations for reasoning.

4. Discussion

During our experiments, several interesting observations emerged:

  1. Bias in GSM8K – Many questions assume a simple, grade-school-level approach. For example, salary and money problems were often solved by GPT-4.1 using compound interest, while the gold answers used simple interest. Although not incorrect, these solutions diverged from the dataset solutions.
  2. Implicit vs. Explicit Assumptions – In problems involving time, age, or money, the model often defaulted to implicit assumptions influenced by prior examples. However, some questions required explicit assumptions to match the gold answer, leading to mismatches.
  3. Arithmetic Accuracy – Surprisingly, arithmetic mistakes were rare. Instead, issues mainly involved assumptions, semantic errors, and reasoning consistency.

Performance across smaller variants revealed additional weaknesses:


Table 6: Pain points across model variants


Model Name | Pain Points
GPT-4.1 | Mistakes in assumptions; implicit vs. explicit values
GPT-4.1-Mini | Fails to track events properly
GPT-4.1-Nano | Misses sequential steps; misinterprets variable relationships; fails to follow its own logic
GPT-4o | Occasionally fails in multi-step reasoning; arithmetic slips
GPT-4o-Mini | Hallucinates; oversimplifies assumptions; incoherent intermediate steps

Additionally, in CSQA we found issues such as vague questions, duplicate options, and multiple answer choices with equivalent meanings. Our initial accuracy was ~86%. We used GPT-5 as a judge to re-evaluate our answers, which increased accuracy to 93%. Previous benchmarks reported GPT-4 at 83% and ChatGPT (June 2023) at 76% [3, 4].
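
A minimal sketch of this re-grading step is shown below; the judge prompt, grading criterion, and model identifier are illustrative assumptions rather than our exact setup.

```python
# LLM-as-judge re-grading sketch (judge prompt and model id are assumptions).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a multiple-choice commonsense question.
Question: {question}
Options: {options}
Gold answer: {gold}
Model answer: {prediction}
Reply with exactly "correct" if the model answer is semantically equivalent
to the gold answer (or to an option with the same meaning), else "incorrect"."""

def judge(question: str, options: str, gold: str, prediction: str,
          model: str = "gpt-5") -> bool:  # assumed judge model identifier
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, options=options, gold=gold, prediction=prediction)}])
    return resp.choices[0].message.content.strip().lower().startswith("correct")
```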

Reproducing old experiments on new models is valuable because it shows which benchmarks are still relevant and which are ‘solved’. During our experiments we saw many discrepancies in CSQA datapoints and found that newer models no longer report accuracy on this benchmark.

Our ablation experiments followed an interesting trajectory:

  1. In the initial phase, we observed a Variable Compute accuracy of 27% with GPT-4.1. After obtaining the original prompt from the authors and adjusting our setup to include few-shot examples with dots replacing the reasoning steps, performance jumped to 63%. The cause of this drastic improvement remains unclear.

  2. Sonnet-3.7’s high score on Equation Only is explained by its behavior: despite identical experiment settings for the GPT variants and Sonnet-3.7, Sonnet-3.7 produced structured reasoning alongside the equations, which was absent in GPT-4.1.

  3. Another key observation is that newer models show little difference between baseline and CoT prompting. This is likely because they are post-trained to handle reasoning explicitly, unlike the pre-trained-only models in the original CoT study.

  4. At first, CoT was more of a workaround applied during evaluation and in production to maximize model performance, but it has now become a built-in default behavior.

A detailed note on the early stage of the experiment is available in this document.

5. Future Research

In the course of our experiments, we identified two questions that remain unresolved and present intriguing avenues for further study.

  1. Question 1: Why do newer, larger models (GPT-4.1) still depend on the “show, don’t tell” aspect of CoT? In our dots ablation, we still had to provide examples of the expected output, disproving our assumption that GPT-4.1 would one-shot these questions.

  2. Question 2: Since this paper is a few years old relative to LLM progress, we were surprised the prompts showed similar results rather than major improvements. For instance, the dots results are close to the paper’s. We had expected 70–90% gains, if not 100%.

6. TL;DR

  1. The new models nearly solve GSM8K and CSQA, with accuracy in the 93–96% range.
  2. Prompting style doesn’t matter — baseline and Chain of Thought (CoT) achieve almost identical results.
  3. Reasoning is now a built-in default behavior where newer models seem to have internalized step-by-step thinking without needing explicit CoT cues.
  4. The ablations come as a surprise, specifically the decline in performance when reasoning steps are replaced with dots, which points to further room for model improvement.

7. Acknowledgements

  1. Thanks to @twofifteenam for guidance throughout the project and API sponsorship.
  2. Thanks to the authors of the paper for helping us navigate through the ablation part of the experiment.
  3. Everyone who provided feedback on early results and insights.
  4. ChatGPT for grammar editing.

8. References

  1. Wei, Jason, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 35 (2022): 24824–24837. Paper Link.
  2. Kaplan, Jared, et al. Scaling laws for neural language models. arXiv:2001.08361 (2020). Paper Link.
  3. Dhingra, Sifatkaur, et al. Mind meets machine: Unravelling GPT-4’s cognitive psychology. BenchCouncil Transactions on Benchmarks, Standards and Evaluations 3.3 (2023): 100139. Paper Link.
  4. Do, Quyet V., et al. What Really is Commonsense Knowledge? arXiv:2411.03964 (2024). Paper Link.