The team had spent three weeks assembling the evaluation. The task was contract review — identifying unusual liability provisions, flagging clauses that deviated from standard form, mapping cross-references that modified earlier obligations in ways that were easy to miss on a first reading. The evaluation corpus was 180 contracts, deliberately weighted toward the complex end: multi-party instruments, agreements with unusual indemnity structures, documents containing explicit cross-default provisions referencing external schedules. On that corpus, the case for extended thinking was straightforward. With standard inference, the model missed cross-references in 31 per cent of cases. With extended thinking enabled, that figure fell to 6 per cent. The decision to deploy was made in the same meeting.
Six weeks after launch, we were sitting with the operations lead. Usage had plateaued at 63 per cent of the anticipated volume and had not grown since week three. The team was using the model for the difficult cases — the complex amendments, the multi-party disputes, the contracts with unusual cross-jurisdictional provisions — and was largely ignoring it for everything else. We asked why. "It's slow," she said. "For the standard ones, it just sits there."
The standard contracts — straightforward service agreements, single-jurisdiction software licences, NDAs following the firm's own template — made up 61 per cent of the incoming volume. Mean latency for the full system was 22 seconds. For those contracts, the model was spending 22 seconds reasoning through questions whose answers appeared in the first three clauses. We had built the evaluation on the cases that justified the capability, then deployed the capability everywhere. The evaluation was right. The architecture was not.
#02What the evaluation found, and what it could not see
The reasoning models — Claude 3.7 Sonnet and the Claude 4 family with extended thinking, o1, o3-mini, DeepSeek-R1 — are architecturally different from standard inference in ways that matter. The model generates a chain of internal reasoning before producing its final output. That chain uses tokens, takes time, and costs money at a rate separate from output tokens. The thinking budget can be configured, but it cannot be set to zero without disabling the mode: once thinking is on, the model reasons in proportion to the difficulty it assigns to the task.
The quality improvement on the right tasks is real. We have seen extended thinking reduce error rates on multi-step contract analysis from 31 to 6 per cent. On complex code generation tasks where correctness is mechanically verifiable, we have measured accuracy lifts of 15 to 40 percentage points against the same model family without thinking enabled. These are not evaluation artefacts produced by a curated demo set — they are what happens when a model works through a chain of dependent inferences rather than pattern-matching to a plausible answer. The capability is genuine.
The problem is that evaluations of reasoning models are almost always assembled from cases where reasoning visibly helps. A team building a contract review system reaches for the difficult instruments first, because those expose the model's limits and justify the evaluation effort. The evaluation corpus ends up weighted toward the 30 to 40 per cent of real production queries where extended thinking makes a material difference, and the deployment decision is made on that evidence. The remaining 60 to 70 per cent — the routine cases, the simple lookups, the single-step extractions — are not well represented, because nobody reached for them when the evaluation was assembled.
In nine production deployments we have managed or reviewed that involved reasoning-capable models, the proportion of queries exhibiting the multi-step dependency structure that extended thinking addresses averaged 34 per cent. The remainder — summaries, classifications, entity extractions, lookups against retrieved context, structured field population — derived no measurable quality benefit from the scratchpad. They paid its latency and cost penalty without receiving its gains. The evaluation ran on the hardest cases. Production runs on everything. Nobody had counted what everything looked like before the architecture was chosen.
“The evaluation ran on the hardest cases. Production runs on everything. Nobody had counted what everything looked like before the architecture was chosen.”
#03The structural feature that separates the two query classes
There is a single structural feature that reliably predicts whether extended thinking will improve an answer: the correct response requires drawing an inference that depends on a prior inference. 'What is the counterparty's maximum aggregate liability under this contract, accounting for the carve-out in Schedule 3 and the override in clause 11.4?' requires finding clause 11.4, reading Schedule 3, determining whether Schedule 3's carve-out applies to the clause 11.4 cap, and only then computing the effective figure. Without extended thinking, a frontier model will produce an answer that correctly identifies the stated liability cap and misses the interaction. With the scratchpad, the model tracks its intermediate conclusions. The difference is not about knowledge. It is about whether the model holds what it learned in step two when it reaches step four.
Tasks without step dependency do not benefit. Does this document contain a confidentiality clause? Does the counterparty's address appear in the recitals? What is the governing law? These are single-step retrieval or classification tasks. The model does not need to work through a chain of reasoning — it needs to find something. In our benchmarks, on labelled production samples from two separate deployments, extended thinking produced no accuracy improvement over standard inference on these task classes. It produced latency.
We now apply one question before any task class is routed to a thinking model: does the correct answer require holding an intermediate conclusion and applying it in a subsequent step? If yes, extended thinking is on the table. If no, it is not. The test is not complicated. Most teams have simply not run it before the architecture was committed.
#04Twenty-four seconds is a category, not a metric
The latency of extended thinking is not a performance problem amenable to standard optimisation work. At typical production latencies — 12 to 48 seconds for complex queries against Claude 4 and o3-class models under normal inference load — the interaction category changes. The user is not experiencing a slow system. They are experiencing a system that behaves like a batch process while presenting itself as an interactive tool, and the mismatch is what damages usage.
In the contract review deployment, we instrumented user interactions after week four. Forty-one per cent of sessions involving thinking-model responses included a navigation event within eight seconds of query submission — the user had left the interface while the model was still reasoning. The response appeared. The user returned and read it. But the pattern over six weeks had trained the team to treat the tool as asynchronous whether it was designed that way or not. By week six, reviewers were batching their difficult queries, switching away to other work, then returning to collect results. The tool was working. The interaction pattern was not the one the design intended, and nobody had agreed to the design change that produced it.
Reasoning models fit asynchronous and batch architectures naturally. A due diligence run processed overnight. A batch of supplier contracts reviewed for unusual indemnity provisions before the morning meeting. A code generation task submitted at midday and collected at three. In those contexts, 30 seconds is invisible. In a synchronous workflow where a solicitor is reading the model's analysis before deciding what to examine in detail, 30 seconds is a workflow redesign that nobody agreed to. The question of whether the workflow is synchronous or asynchronous is the first question we ask now, before accuracy, before cost, before architecture. If the workflow is synchronous, the latency floor of extended thinking must be part of the decision, not a consequence of it.
#05The routing layer we require before any thinking-model deployment
For every production system where extended thinking is available as an inference option, we now require a routing classifier before the inference step. The classifier does one thing: assess whether the incoming query has the step-dependency structure that makes extended thinking worth its cost and latency, and route to the appropriate inference path.
The classifier does not need to be a large model. We have had consistent results with an 8B-parameter instruction-tuned model running on hardware substantially cheaper than the frontier inference endpoint it gates. The classifier scores queries on two axes: the presence of step dependency, and verifiability — whether the model or the user can check the result against something external. High scores on both axes route to extended thinking. Low scores route to standard inference. A middle band — roughly 12 to 15 per cent of traffic in our current deployments — goes through a brief chain-of-thought pass on the standard model before the routing decision is finalised.
In the contract review system, retrofitting the classifier after week six took three weeks of engineering. Sixty-one per cent of queries now route to standard inference; 39 per cent to extended thinking. Mean latency across the full query population dropped from 22 seconds to 8.1 seconds. Cost per 1,000 queries fell by 56 per cent. On the complex cases — the multi-step liability analyses, the cross-reference chains, the cross-default provisions — quality is indistinguishable from the all-thinking baseline we launched with. On the simple cases, quality is indistinguishable from a well-prompted standard model. The routing layer is the single most impactful change we have made to the system since launch. It is also work we should have done before we launched.
#06Three requests we declined this year
We have declined to use extended thinking as the primary inference path in three synchronous workflows in the first half of 2026. Not on accuracy grounds — the accuracy was better in all three. On workflow grounds.
The first was a compliance checking tool for a retail banking operation. Compliance officers were reviewing client communications for potential conduct violations — a task that genuinely involves multi-step reasoning on the complex cases. The problem was volume: twelve compliance officers, 900 to 1,200 queries per day, working in real time. At a mean thinking latency of 22 seconds and that query volume, each officer would accumulate 25 to 33 minutes of daily waiting time before reading a single word of model output. The operation could not justify the throughput loss. We built a routing system: a complexity classifier sends 22 per cent of queries — the ones with genuine multi-step dependency — to extended thinking, and routes the remainder to a fast standard model. Mean latency across the full distribution is under 2.8 seconds. Extended thinking is active where it matters and invisible where it does not.
The second was a customer-facing insurance claims assistant. Extended thinking was proposed because some claims involve complex multi-party liability questions where the model genuinely reasons better. A customer-facing interface with 18-second latencies is not a customer-facing interface. We declined for the customer path and built a separate internal workflow for claims handlers working through complex liability analysis — where a longer wait is acceptable because it replaces substantial manual work.
The third was the simplest to decline. The team proposed extended thinking for structured data extraction: pulling specific fields from invoices and purchase orders and populating a procurement system. There were no step dependencies. The invoices varied in format but not in reasoning complexity. Extended thinking added, on average, 19 seconds of latency and zero accuracy improvement over a standard structured-output extraction prompt. We ran the comparison on a labelled sample of 400 invoices and showed them the numbers. The team was surprised — they had expected the more powerful mode to produce better results because it was more powerful. It did not. It was slower.
#07What this is not
This is not an argument against reasoning models. The thinking capability is the most significant quality improvement in a single model generation we have seen in three years, and on the right tasks it is decisive. We have the contract review system in production today, with extended thinking active for 39 per cent of its queries and performing better than any comparable system we have managed. In complex code review, in multi-instrument legal analysis, in technical due diligence across interconnected documentation where the correct answer requires holding and cross-referencing a large number of intermediate facts, the quality gap between thinking and non-thinking models is real and worth paying for.
The argument is narrower. Treating extended thinking as a blanket inference upgrade — enabling it for all queries because the evaluation showed it helped on the hard cases — produces a system that pays frontier reasoning costs for queries that derive no benefit from the scratchpad, at latencies that change how users engage with the tool within weeks of launch, often without the engineering team noticing until an operations lead runs a time study and finds the usage pattern has quietly become something nobody designed for.
The reasoning models are available now, in forms that produce genuine improvements on genuine enterprise problems. The routing question — which queries need the scratchpad, which do not, and what the latency floor of extended thinking means for the synchronous workflow you are building — must come before the deployment. The evaluation that justifies the tool is almost always run on the hardest cases. The routing layer is how you avoid spending the rest of the deployment paying for answers those cases do not represent.
