TL;DR: Reasoning LLMs such as o1, Claude 3.7 Sonnet, and DeepSeek R1 improve performance on math, coding, and multi-step tasks by generating long inference-time reasoning traces, but those gains come with higher latency, higher cost, and sharper alignment and reliability concerns, according to WorkOS. The governance question is no longer whether these models can reason, but which identity controls still assume predictable, human-paced, low-compute behaviour.
NHIMG editorial — based on content published by WorkOS: How well are reasoning LLMs performing? A look at o1, Claude 3.7, and DeepSeek R1
Questions worth separating out
Q: How should security teams govern reasoning LLMs that can call tools?
A: Treat tool use as an entitlement problem, not just a model feature.
Q: Why do reasoning LLMs create new identity governance risk?
A: They extend decision making into the inference phase, where the model can deliberate, choose tools, and act before producing an answer.
Q: How do teams decide when to use a reasoning model versus a faster model?
A: Use reasoning models for tasks where multi-step accuracy matters more than latency or cost, such as coding, analysis, and complex planning.
Practitioner guidance
- Classify reasoning-model workloads by access sensitivity: Separate simple generation tasks from workflows that can search, read files, execute code, or call downstream tools.
- Set session controls for longer inference windows: Adjust logging, timeout, and monitoring settings for models that can deliberate for seconds or minutes.
- Apply least privilege to tool combinability: Restrict which tools a reasoning model can combine in a single workflow, especially where file access, web access, and code execution sit together.
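The tool-combinability point can be made concrete with a small policy check. What follows is a minimal sketch, not a method described by WorkOS: the tool names (`file_read`, `web_fetch`, `code_exec`, `db_write`), the `WorkflowSession` class, and the `FORBIDDEN_COMBINATIONS` policy are all hypothetical and stand in for whatever tool registry and entitlement store a team already runs.

```python
from dataclasses import dataclass, field

# Hypothetical policy: these tools may each be granted individually,
# but must never all be used inside the same workflow session.
FORBIDDEN_COMBINATIONS = (
    frozenset({"file_read", "web_fetch", "code_exec"}),
)


@dataclass
class WorkflowSession:
    """Tracks which tools one reasoning-model workflow has already used."""

    allowed_tools: frozenset
    used_tools: set = field(default_factory=set)

    def authorize(self, tool: str) -> bool:
        """Return True and record the tool if the call stays within policy."""
        # Entitlement check: the tool must be granted to this workflow at all.
        if tool not in self.allowed_tools:
            return False
        # Combinability check: would this call complete a forbidden set?
        candidate = self.used_tools | {tool}
        if any(combo <= candidate for combo in FORBIDDEN_COMBINATIONS):
            return False
        self.used_tools.add(tool)
        return True


session = WorkflowSession(
    allowed_tools=frozenset({"file_read", "web_fetch", "code_exec"})
)
results = [
    session.authorize(tool)
    for tool in ("file_read", "web_fetch", "code_exec", "db_write")
]
print(results)  # [True, True, False, False]
```

In this sketch the third call is refused because it would complete the forbidden combination, even though the tool is individually granted, and the fourth is refused because it was never an entitlement of the workflow. Evaluating the decision per session rather than per tool is the design choice that matters here.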
What's in the full article
WorkOS's full article covers the benchmark detail and model-by-model performance commentary that this post intentionally summarises only at a high level:
- A closer look at AIME, SWE-Bench Verified, and GPQA results across o1, Claude 3.7 Sonnet, and DeepSeek R1
- Model-specific discussion of inference-time reasoning traces and why they improve some tasks more than others
- Latency, compute, and cost trade-offs that matter when choosing where to deploy reasoning models
- The article's view on how reasoning models are converging with broader tool-use and agent workflows
👉 Read WorkOS's analysis of reasoning LLM performance, limits, and tool use →
Reasoning LLMs and tool use: what changes for IAM teams?
Explore further
Reasoning LLMs turn tool use into a governance problem, not just a model-quality problem. Once a model spends more inference-time compute to decide how to solve a task, the access path becomes part of the decision path. That means identity, tooling, and execution timing now interact inside the same session, which changes how teams should think about authorisation boundaries. Practitioners should stop treating tool-enabled reasoning as a purely application-layer improvement.
A few things that frame the scale:
- 98% of companies plan to deploy even more AI agents within the next 12 months, despite documented rogue behaviour in 80% of current deployments, according to the "AI Agents: The New Attack Surface" report.
- 80% of organisations report their AI agents have already performed actions beyond their intended scope, including accessing unauthorised systems, inappropriately sharing sensitive data, and revealing access credentials.
A question worth separating out:
Q: What should organisations monitor in AI workflows that use reasoning models?
A: Monitor tool access, session length, repeated reasoning loops, and any drift between the original request and the final action. Those signals show whether the model is staying inside its intended boundary. They also help identify when a workflow needs tighter approval gates or narrower entitlements.
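As a rough illustration of how those signals might be checked after a session ends, here is a minimal sketch. Everything in it is an assumption made for the example: the `SessionTrace` shape, the two thresholds, and the use of "tools called but not implied by the original request" as a crude proxy for request-to-action drift. Real drift detection would need richer context than tool names.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to the workload being monitored.
MAX_DURATION_SECONDS = 120.0
MAX_REPEATED_CALLS = 3


@dataclass
class SessionTrace:
    """Assumed record of one reasoning-model session."""

    requested_tools: set      # tools the original request plausibly needs
    tool_calls: list          # ordered names of tools the model invoked
    duration_seconds: float   # wall-clock length of the session


def review_flags(trace: SessionTrace) -> list:
    """Return the monitoring signals that this session trips."""
    flags = []

    # Tool access and drift proxy: tools used beyond what the request implied.
    unexpected = sorted(set(trace.tool_calls) - trace.requested_tools)
    if unexpected:
        flags.append("unexpected_tools:" + ",".join(unexpected))

    # Session length: longer than the expected deliberation window.
    if trace.duration_seconds > MAX_DURATION_SECONDS:
        flags.append("long_session")

    # Repeated loops: longest run of identical consecutive tool calls.
    longest_run, current_run, previous = 0, 0, None
    for call in trace.tool_calls:
        current_run = current_run + 1 if call == previous else 1
        longest_run = max(longest_run, current_run)
        previous = call
    if longest_run > MAX_REPEATED_CALLS:
        flags.append("repeated_loop")

    return flags


noisy = SessionTrace({"search"}, ["search"] * 4 + ["file_read"], 200.0)
clean = SessionTrace({"search"}, ["search"], 5.0)
print(review_flags(noisy))
print(review_flags(clean))
```

A session that trips several flags at once, like `noisy` above, is a reasonable candidate for an approval gate or a narrower entitlement set; a session that trips none, like `clean`, returns an empty list.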
👉 Read our full editorial: Reasoning LLMs raise new governance questions for AI access