Fast LLM inference techniques detailed for Anthropic and OpenAI models
Anthropic and OpenAI recently announced fast mode options that let users interact with their top coding models at higher speeds. The two versions differ in speed, model quality and underlying techniques.
Anthropic's fast mode delivers up to 2.5 times more tokens per second than its standard Opus 4.6 model, reaching around 170 tokens per second. OpenAI's fast mode exceeds 1,000 tokens per second, up from 65 tokens per second on the standard GPT-5.3-Codex model. OpenAI's version runs about six times faster than Anthropic's offering.
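The ratios implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above. The standard Opus 4.6 rate is not reported, so the value computed here is derived from the "up to 2.5 times" figure and should be read as an approximation. The 2,000-token response length is an arbitrary illustration.

```python
# Figures as reported in the article (approximate).
anthropic_fast_tps = 170     # tokens/s, Anthropic fast mode ("around 170")
anthropic_speedup = 2.5      # "up to" multiplier over standard Opus 4.6
openai_fast_tps = 1000       # tokens/s, OpenAI fast mode (reported as a lower bound)
openai_standard_tps = 65     # tokens/s, standard GPT-5.3-Codex

# Derived values (not reported directly in the article).
implied_anthropic_standard_tps = anthropic_fast_tps / anthropic_speedup
openai_speedup = openai_fast_tps / openai_standard_tps
openai_vs_anthropic = openai_fast_tps / anthropic_fast_tps


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens at a steady rate."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    print(f"Implied standard Opus 4.6 rate: {implied_anthropic_standard_tps:.0f} tokens/s")
    print(f"OpenAI fast vs. standard:       {openai_speedup:.1f}x")
    print(f"OpenAI fast vs. Anthropic fast: {openai_vs_anthropic:.1f}x")

    response_tokens = 2000  # arbitrary example length, not from the article
    for label, tps in [
        ("Anthropic fast", anthropic_fast_tps),
        ("OpenAI standard", openai_standard_tps),
        ("OpenAI fast", openai_fast_tps),
    ]:
        print(f"{label:16s} {seconds_for(response_tokens, tps):6.1f} s")
```

On these numbers, OpenAI's fast mode is roughly 15 times faster than its own standard model, a much larger jump than Anthropic's 2.5 times, and the "about six times" comparison between the two fast modes comes out at about 5.9.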
Anthropic serves its full Opus 4.6 model through what the source describes as low-batch-size inference. OpenAI instead routes fast mode requests to a new GPT-5.3-Codex-Spark model that is less capable and can produce errors on tool calls that the standard model avoids. The source attributes OpenAI's gains to giant Cerebras chips that hold 44 giga
OpenAI announced its Cerebras partnership in January. The source notes that Anthropic's method accepts higher cost and lower total throughput in exchange for reduced waiting time per request, while OpenAI's approach requires a distilled model to fit the hardware constraints.
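The batch-size tradeoff can be illustrated with a toy model. This is a sketch under invented assumptions, not a description of how Anthropic actually serves its models. It assumes each decoding step has a fixed cost shared by every request in the batch plus a small cost per request. The class name `DecodeModel` and both millisecond constants are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DecodeModel:
    """Toy cost model for one decoding step. All constants are hypothetical."""

    fixed_ms: float = 5.0     # assumed per-step cost shared by the whole batch
    per_seq_ms: float = 0.5   # assumed extra per-step cost for each request

    def step_ms(self, batch_size: int) -> float:
        # Every request in the batch waits for the whole step to finish.
        return self.fixed_ms + self.per_seq_ms * batch_size

    def per_user_tps(self, batch_size: int) -> float:
        # Each request receives one token per step.
        return 1000.0 / self.step_ms(batch_size)

    def total_tps(self, batch_size: int) -> float:
        # The hardware emits one token per request per step.
        return batch_size * self.per_user_tps(batch_size)


if __name__ == "__main__":
    model = DecodeModel()
    for batch_size in (1, 4, 32):
        print(
            f"batch={batch_size:3d}  "
            f"per-user={model.per_user_tps(batch_size):7.1f} tok/s  "
            f"total={model.total_tps(batch_size):7.1f} tok/s"
        )
```

In this model a small batch gives each user more tokens per second, but the same hardware produces far fewer tokens overall, so each token costs more to serve. That matches the direction of the tradeoff the source describes, though the real magnitudes depend on hardware and model details the article does not give.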
Sources
Published by Tech & Business, a media brand covering technology and business.
This story was sourced from seangoedecke.com and reviewed by the T&B editorial agent team.