Token Cost Control in AIAgent

Background

When building AI products, as the number of users grows, every LLM call and the prompt tokens used incur costs. So how can we minimize unnecessary expenses without affecting the results?

LLM Pricing Structure

Example: OpenAI

It can be observed that the latest OpenAI API token pricing structure is:

Input tokens: Priced at 10x Cached input tokens, and 1/10 of the Output tokens price.

Cached input tokens: The cheapest, priced at 1/10 of Input tokens and 1/100 of Output tokens.

Output tokens: The most expensive, priced at 10x Input tokens and 100x Cached input tokens.

For older LLM versions such as the gpt-4o series, the cached input token fees are waived.

Recommended Strategies

1. Encourage users to input concise, high-certainty keywords

Note here we say "encourage users," not "restrict users."
For a complete enterprise AI Agent, you can’t limit the length of user input because the user's message must be fully preserved and sent to the LLM.

Encouraging concise inputs => effectively reduces the context length in tokens.

Encouraging high-certainty rather than vague language => improves model response accuracy and reduces hallucinations.

Encouraging keyword-based inputs => improves retrieval accuracy and further reduces context token length.

2. Limit LLM output length

Since output tokens are the most expensive, explicitly instruct the LLM in the prompt to produce fixed-length responses or use summarizing and concise phrasing.

=> This effectively reduces the number of output tokens.

3. Reduce LLM calls: Memory design is essential

While OpenAI automatically calculates cached input tokens to reduce costs per call, the uncertainty is high. Thus, saving conversation history is important.

Memory here involves short-term and long-term recall, which this article won’t cover. We focus only on memory design to reduce token usage.

For repeated user queries in a short timeframe, avoid calling the LLM each time. Search the conversation records instead => reduces LLM calls.

How to leverage cached input design to minimize token fees?

Cached input principle: Avoid changing anything unnecessary; ideally, don’t alter a single token.

1. Categorize memory conversation records; group similar queries together => improves retrieval accuracy and reduces LLM calls.

2. Summarize memory conversation records => facilitates their use as part of input for future conversations, shortening context length.

4. Use prompt compression algorithms

Structured prompts tend to be lengthy, which increases token costs.

A recommended approach is the prompt compression algorithm LLMLingua-2.

No detailed explanation here, but interested readers may explore it further.

5. Prompt writing: Make full use of cached input

To illustrate, the prompt is written in Chinese here; preferably, write prompts in English.

Every LLM call uses a prompt. The prompt content adjusts or supplements according to user input, RAG retrieval results, etc.

To fully leverage cached input and reduce costs, reorder the prompt so that unchanged content is cached.

Example:

prompt = “
	You are a business case analyst skilled in assessing risks based on financial statements.
	The following background info is known:
	{Background_msg}

	And the financial statements:
	{Financial_msg}
	
	According to the user question:
	{question}

	Provide the answer.
	Return format:
	return:
	{
		result:""
	}
”
At first glance, this prompt seems fine, but according to cached input rules, only the first two lines get cached.

Improved prompt reordered for caching:

cached_prompt = “
	You are a business case analyst skilled in assessing risks based on financial statements.
	Based on Background info and Financial statements, answer the user question.
	Return format:
	return:
	{
		result:""
	}

	Background: {Background_msg}
	Financials: {Financial_msg}
	User question: {question}
”
According to cached input rules, this improved cached_prompt caches the first seven lines, lowering costs to about 1/10 of input price.

6. Is calling the LLM always necessary?

Example:

When storing semi-structured or unstructured documents into a vector database, does calling an LLM to convert raw data into structured JSON improve accuracy?

Proposed alternatives:

1. Regular expression filtering,

2. Rule-based parsing,

3. XGBoost classification,

4. Keyword searching,

to convert semi/unstructured documents into structured data.

Refer to this article for more: “RAG Structural Chunks Design”

https://strictfrog.com/2026/01/31/RAG%E4%B9%8BStructural-Chunks%E8%AE%BE%E8%AE%A1%E4%B8%8E%E6%80%9D%E8%80%83/

7. Can we switch LLM models without compromising output quality?

It’s recommended to switch between models within the GPT-5 series after extensive testing.

Generally, the latest model is the best, but different models may excel at different tasks. This requires AI engineers to validate and judge.

8. Is calling the LLM API always necessary?

For example: Calling the Whisper API for speech-to-text costs money.

Running Whisper locally is free.

TTS also has many free alternatives.

9. Restrict user input length: Max_tokens

For a specific AI Agent, decide:

1. Whether to set Max_tokens,

2. What the maximum Max_tokens should be.

My view:

For enterprise AI Agents used internally, do not set a limit as internal users typically won’t abuse tokens.

For external users, definitely set token limits and monitor usage.

10. Prefer writing prompts in English

On average,
One English word = 0.75 tokens,
One Chinese word = 1.0 token

11. manage the length of propmt

“Context Management and Filtering in Prompts” https://strictfrog.com/en/2026-03-20-context-management-and-filtering-in-prompts/

Summary

A penny per call is small, but with 10,000 users calling 10 times, costs become significant.

关于作者

我是Louis,一名长期从事iOS与AI相关工程实践的工程师,也是一个正在探索产品与商业可能性的准创始人.

这里的文章,更多是我在项目中用过,踩过坑,反复验证过的东西,而不是为了流量而写的“快内容”.

☕ 打赏

如果这篇文章对你有帮助,欢迎请我喝一杯咖啡☕️

PayPal
https://www.paypal.me/luochuan188

PayPay

You can support my work via PayPay by searching my PayPay ID:

PayPay ID: luochuan

微信支付

支付宝

你的支持会让我有更多时间,把真实项目中的经验持续整理和分享出来.

不打赏也完全没关系,感谢你读到这里.

联系与合作

如果你:

· 正在做iOS App / AI / 自动化相关的项目

· 对技术选型、架构设计、产品落地有困惑

· 或希望进行技术交流、合作探讨

欢迎通过以下邮箱联系我:

luochuanad@gmail.com

Cost Management