# Fast LLM inference techniques detailed for Anthropic and OpenAI models

_Friday, June 26, 2026 at 9:55 PM EDT · AI · Latest · Tier 2 — Notable_

![Fast LLM inference techniques detailed for Anthropic and OpenAI models — Primary](https://www.seangoedecke.com/og-image.jpg)

Anthropic and OpenAI recently announced fast mode options that let users interact with their top coding models at higher speeds. The two versions differ in speed, model quality and underlying techniques.

Anthropic's fast mode delivers up to 2.5 times as many tokens per second as standard Opus 4.6, reaching around 170 tokens per second. OpenAI's fast mode exceeds 1,000 tokens per second, up from 65 tokens per second on the standard GPT-5.3-Codex model, a gain of more than 15 times. That makes OpenAI's fast mode about six times as fast as Anthropic's.
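The ratios follow from simple division. The Python sketch below reproduces them from the rounded figures quoted in this brief. The function names and the 2,000-token response length are illustrative assumptions, not values from the source.

```python
# Figures quoted in the article, in tokens per second (rounded by the source).
ANTHROPIC_FAST_TPS = 170
ANTHROPIC_SPEEDUP = 2.5      # "up to" 2.5x, so the implied baseline is a lower bound
OPENAI_FAST_TPS = 1000       # the article says "exceeds 1,000"
OPENAI_STANDARD_TPS = 65


def implied_baseline(fast_tps: float, speedup: float) -> float:
    """Standard-mode speed implied by a fast-mode speed and its speedup factor."""
    return fast_tps / speedup


def speed_ratio(faster_tps: float, slower_tps: float) -> float:
    """How many times as fast the first rate is compared with the second."""
    return faster_tps / slower_tps


def seconds_to_generate(tokens: int, tps: float) -> float:
    """Wall-clock time to stream `tokens` output tokens at a steady rate."""
    return tokens / tps


if __name__ == "__main__":
    print(f"Implied Opus 4.6 standard speed: {implied_baseline(ANTHROPIC_FAST_TPS, ANTHROPIC_SPEEDUP):.0f} tok/s")
    print(f"OpenAI fast vs. OpenAI standard: {speed_ratio(OPENAI_FAST_TPS, OPENAI_STANDARD_TPS):.1f}x")
    print(f"OpenAI fast vs. Anthropic fast:  {speed_ratio(OPENAI_FAST_TPS, ANTHROPIC_FAST_TPS):.1f}x")
    # Hypothetical 2,000-token response, chosen only for illustration.
    for name, tps in [("Anthropic fast", ANTHROPIC_FAST_TPS), ("OpenAI fast", OPENAI_FAST_TPS)]:
        print(f"{name}: {seconds_to_generate(2000, tps):.1f} s for 2,000 tokens")
```

Because the inputs are rounded and carry qualifiers such as "up to" and "exceeds", the outputs are approximations.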

Anthropic serves its full Opus 4.6 model through what the source describes as low-batch-size inference. OpenAI instead routes fast mode requests to a new GPT-5.3-Codex-Spark model, which is less capable and can make tool-call errors that the standard model avoids. The source attributes OpenAI's gains to giant Cerebras chips that hold 44 gigabytes of on-chip SRAM, which lets more of the inference run from fast on-chip memory.
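To give a sense of scale for the 44-gigabyte figure, the Python sketch below estimates how many parameters would fit if the weights alone had to sit in that much memory. Everything beyond the 44 GB number is an assumption made for illustration: a single chip, 1 GB taken as 10^9 bytes, no space reserved for the KV cache or activations, and the listed numeric precisions. It does not describe how GPT-5.3-Codex-Spark is actually sized or deployed.

```python
SRAM_GB = 44  # on-chip SRAM per Cerebras chip, as quoted in the article

# Bytes per parameter for some common weight formats (assumed, not from the source).
BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8/fp8": 1.0,
    "int4": 0.5,
}


def max_params_billions(memory_gb: float, bytes_per_param: float) -> float:
    """Upper bound on parameter count (in billions) if weights alone fill the memory.

    Treats 1 GB as 1e9 bytes and ignores KV cache, activations, and runtime overhead,
    so a real deployment would fit fewer parameters per chip than this.
    """
    return memory_gb * 1e9 / bytes_per_param / 1e9


if __name__ == "__main__":
    for fmt, nbytes in BYTES_PER_PARAM.items():
        print(f"{fmt:>10}: up to ~{max_params_billions(SRAM_GB, nbytes):.0f}B parameters")
```

Under these assumptions the bound ranges from tens of billions of parameters at 16-bit precision to somewhat more at lower precision, which is consistent with the source's point that a smaller model is needed to fit the hardware.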

OpenAI announced its Cerebras partnership in January. The source notes that Anthropic's method trades higher cost and lower throughput for reduced waiting time per request, while OpenAI's approach requires a distilled model to fit the hardware constraints.
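The batch-size trade-off can be illustrated with a toy model in Python. It assumes each decoding step has a fixed cost (for example, reading the model weights once) plus a small cost per sequence in the batch. The function names and the millisecond constants are hypothetical, and this is not a description of Anthropic's actual serving stack.

```python
def step_time_ms(batch_size: int, fixed_ms: float = 10.0, per_seq_ms: float = 0.5) -> float:
    """Time for one decoding step that emits one token for every sequence in the batch.

    fixed_ms and per_seq_ms are made-up constants for illustration only.
    """
    return fixed_ms + per_seq_ms * batch_size


def per_user_tps(batch_size: int) -> float:
    """Tokens per second seen by a single user: one token per step."""
    return 1000.0 / step_time_ms(batch_size)


def aggregate_tps(batch_size: int) -> float:
    """Tokens per second produced by the hardware across all users in the batch."""
    return batch_size * per_user_tps(batch_size)


if __name__ == "__main__":
    print(f"{'batch':>5} {'per-user tok/s':>15} {'aggregate tok/s':>16}")
    for b in (1, 4, 16, 64):
        print(f"{b:>5} {per_user_tps(b):>15.1f} {aggregate_tps(b):>16.1f}")
```

In this toy model, shrinking the batch raises the speed each user sees but lowers the total tokens the same hardware produces per second, so each token costs more to serve. That is the shape of the trade-off the source attributes to Anthropic's approach.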

## Sources

- [seangoedecke.com](https://www.seangoedecke.com/fast-llm-inference/)

---
Canonical: https://techandbusiness.org/newswire/WMYow9Ig064KslncDOfOUW
Retrieved: 2026-06-27T06:02:04.987Z
Publisher: Tech & Business (techandbusiness.org)
