Anthropic's new prompt caching saves developers a fortune

Anthropic has introduced prompt caching on its API, which remembers context between API calls and allows developers to avoid resending repeated prompts.

The prompt caching feature is available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for the largest Claude model, Opus, coming soon.

Prompt caching, described in this 2023 article, lets users retain frequently used context across their sessions. Because the model remembers these prompts, users can add extra background information without increasing costs. This is useful when someone wants to send a large amount of context in a prompt and then refer back to it across multiple conversations with the model. It also gives developers and other users more control over tuning model responses.
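In practice, a developer marks a large context block as cacheable in an API request. Below is a minimal sketch of what such a request might look like in Python, assuming the anthropic-beta opt-in header and the cache_control field documented at the feature's launch; the API key, file name, and question are placeholders.

    import requests

    API_KEY = "YOUR_ANTHROPIC_API_KEY"  # placeholder

    # A large, frequently reused context (hypothetical file).
    with open("knowledge_base.md") as f:
        big_context = f.read()

    payload = {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 1024,
        "system": [
            {"type": "text",
             "text": "Answer questions using the attached knowledge base."},
            {"type": "text",
             "text": big_context,
             # Mark this block as cacheable; later calls that reuse the same
             # prefix are billed at the cheaper cache-read rate.
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            {"role": "user", "content": "Summarize the onboarding section."}
        ],
    }

    response = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={
            "x-api-key": API_KEY,
            "anthropic-version": "2023-06-01",
            # Opt-in header while prompt caching is in public beta.
            "anthropic-beta": "prompt-caching-2024-07-31",
            "content-type": "application/json",
        },
        json=payload,
    )
    print(response.json())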

According to Anthropic, early adopters have seen “substantial speed and cost improvements with prompt caching for a variety of use cases — from including a full knowledge base to 100-shot examples to including each turn of a conversation in their prompt.”

According to the company, potential use cases include reducing cost and latency for conversational agents with long instructions and uploaded documents, faster code autocompletion, providing multi-step instructions to agentic search tools, and embedding entire documents in a prompt.

Cached prompt pricing

One advantage of prompt caching is a lower price per token. According to Anthropic, using cached prompts is “significantly cheaper” than the base input token price.

For Claude 3.5 Sonnet, writing a prompt to the cache costs $3.75 per million tokens (MTok), while reading a cached prompt costs $0.30 per MTok. The base input price for Claude 3.5 Sonnet is $3/MTok, so by paying a little more up front, you pay a tenth of the base rate every subsequent time you use the cached prompt.

Claude 3 Haiku users pay $0.30/MTok to write to the cache and $0.03/MTok when using saved prompts.

Although prompt caching is not yet available for Claude 3 Opus, Anthropic has already published its prices: writing to the cache will cost $18.75/MTok, and reading a cached prompt will cost $1.50/MTok.
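To make the arithmetic concrete, here is a short, illustrative Python sketch comparing the cost of resending a large context on every call with the cost of caching it, using the per-MTok prices above. The Haiku and Opus base input rates ($0.25 and $15 per MTok) come from Anthropic's published pricing rather than this article, and the token counts are hypothetical.

    # $ per million tokens: (base input, cache write, cache read)
    PRICES = {
        "claude-3.5-sonnet": (3.00, 3.75, 0.30),
        "claude-3-haiku": (0.25, 0.30, 0.03),
        "claude-3-opus": (15.00, 18.75, 1.50),
    }

    context_tokens = 50_000  # hypothetical shared context
    reuse_calls = 20         # hypothetical number of follow-up calls

    for model, (base, write, read) in PRICES.items():
        mtok = context_tokens / 1_000_000
        uncached = base * mtok * (1 + reuse_calls)         # resend the context every call
        cached = write * mtok + read * mtok * reuse_calls  # write once, then cheap reads
        print(f"{model}: ${uncached:.2f} without caching vs ${cached:.2f} with caching")

Under this hypothetical scenario, the Claude 3.5 Sonnet figures work out to roughly $3.15 without caching versus about $0.49 with it.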

However, as AI influencer Simon Willison noted on X, Anthropic's cache has a lifespan of only 5 minutes and is refreshed with every use.

Of course, this isn't the first time Anthropic has tried to compete with other AI platforms on pricing. Before the release of the Claude 3 family of models, Anthropic lowered its token prices.

It is now engaged in something of a 'race to the bottom' against rivals, including Google and OpenAI, when it comes to offering low-cost options for third-party developers building on their platforms.

Highly requested feature

Other platforms offer versions of prompt caching. Lamina, an LLM inference system, uses KV caching to reduce GPU costs. A cursory glance at the OpenAI developer forums or GitHub turns up plenty of questions about how to cache prompts.

Prompt caching is not the same as large language model memory. OpenAI's GPT-4o, for example, offers a memory feature in which the model remembers preferences or details, but it does not store the actual prompts and responses the way prompt caching does.