Fast LLM inference techniques detailed for Anthropic and OpenAI models

Anthropic and OpenAI recently announced fast modes that let users run their top coding models at higher speeds. The two offerings differ in speed, model quality and underlying technique.

Anthropic's fast mode delivers up to 2.5 times more tokens per second than standard Opus 4.6, reaching around 170 tokens per second. OpenAI's fast mode exceeds 1,000 tokens per second, up from 65 tokens per second on the standard GPT-5.3-Codex model, which makes it about six times faster than Anthropic's offering.

Anthropic serves its full Opus 4.6 model through what the source describes as low-batch-size inference. OpenAI instead routes fast mode requests to a new GPT-5.3-Codex-Spark model that is less capable and can make tool-call errors that the standard model avoids. The source attributes OpenAI's gains to giant Cerebras chips that hold 44 giga[…]. OpenAI announced its Cerebras partnership in January. The source notes that Anthropic's method trades higher cost and lower throughput for reduced waiting time per request, while OpenAI's approach requires a distilled model to fit the hardware constraints.
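The speed comparison above comes down to simple arithmetic, which the following minimal Python sketch works through. It uses only the figures quoted in the summary. Two things are assumptions, not reported numbers: OpenAI's "exceeds 1,000" is treated as exactly 1,000, and the standard Opus 4.6 rate is derived by taking "up to 2.5 times" at its maximum. The 2,000-token response length is a hypothetical example size.

```python
# Tokens-per-second figures as quoted in the summary.
ANTHROPIC_FAST_TPS = 170.0    # "around 170 tokens per second"
ANTHROPIC_MAX_SPEEDUP = 2.5   # "up to 2.5 times" the standard model
OPENAI_FAST_TPS = 1000.0      # "exceeds 1,000"; assumed to be exactly 1,000 here
OPENAI_STANDARD_TPS = 65.0    # standard GPT-5.3-Codex


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens at a steady rate."""
    return tokens / tokens_per_second


# Derived values (assumptions, not figures reported in the source).
anthropic_standard_tps = ANTHROPIC_FAST_TPS / ANTHROPIC_MAX_SPEEDUP
openai_speedup = OPENAI_FAST_TPS / OPENAI_STANDARD_TPS
fast_mode_ratio = OPENAI_FAST_TPS / ANTHROPIC_FAST_TPS

RESPONSE_TOKENS = 2000  # hypothetical response length

if __name__ == "__main__":
    print(f"Implied standard Opus 4.6 rate: {anthropic_standard_tps:.0f} tok/s")
    print(f"OpenAI fast vs. standard: {openai_speedup:.1f}x")
    print(f"OpenAI fast vs. Anthropic fast: {fast_mode_ratio:.1f}x")
    for name, tps in [
        ("Anthropic fast", ANTHROPIC_FAST_TPS),
        ("OpenAI fast", OPENAI_FAST_TPS),
        ("OpenAI standard", OPENAI_STANDARD_TPS),
    ]:
        print(f"{name}: {seconds_for(RESPONSE_TOKENS, tps):.1f} s "
              f"for {RESPONSE_TOKENS} tokens")
```

The ratio of the two fast modes comes out just under six, which matches the "about six times faster" claim. These are raw generation rates only: they say nothing about time to first token or about the quality gap between GPT-5.3-Codex-Spark and the standard model.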

