
Anatomy of Claude API Costs

Claude pricing on Amazon Bedrock is token-based, but understanding the details makes the difference between a $500/month bill and a $5,000/month bill for the same workload. This page breaks down exactly how costs are calculated and where the optimization opportunities are.

Input Tokens vs Output Tokens

Claude charges separately for input tokens (what you send) and output tokens (what Claude generates). Output tokens are significantly more expensive because they require sequential computation.

Model               Input (per 1M tokens)   Output (per 1M tokens)   Output/Input Ratio
Claude Haiku 4.5    $0.80                   $4.00                    5x
Claude Sonnet 4.6   $3.00                   $15.00                   5x
Claude Opus 4.6     $15.00                  $75.00                   5x

The ratio is consistent: output tokens cost 5x as much as input tokens across all models, so each token Claude generates costs five times as much as each token you send. This asymmetry is the foundation of most cost optimization strategies.

How to Calculate Cost Per Request

The formula is straightforward:

Cost = (input_tokens / 1,000,000 * input_price) + (output_tokens / 1,000,000 * output_price)

Example: A Sonnet 4.6 request with 2,000 input tokens and 500 output tokens:

Cost = (2,000 / 1,000,000 * $3.00) + (500 / 1,000,000 * $15.00)
= $0.006 + $0.0075
= $0.0135 per request

At 10,000 requests per day, that is $135/day or ~$4,050/month.
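The daily and monthly projection can be sanity-checked with a few lines of Python; `project_monthly_cost` is an illustrative helper, not part of any SDK:

```python
# Illustrative helper (not part of any SDK) to sanity-check the projection above.
def project_monthly_cost(cost_per_request: float, requests_per_day: int, days: int = 30) -> float:
    """Project monthly spend from per-request cost and daily request volume."""
    return round(cost_per_request * requests_per_day * days, 2)

# 2,000 in / 500 out on Sonnet 4.6 costs $0.0135 per request
print(project_monthly_cost(0.0135, 10_000))  # 4050.0
```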

def calculate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str = "sonnet",
) -> dict:
    """Calculate the cost of a Claude API request."""
    pricing = {
        "haiku": {"input": 0.80, "output": 4.00},
        "sonnet": {"input": 3.00, "output": 15.00},
        "opus": {"input": 15.00, "output": 75.00},
    }

    prices = pricing[model]
    input_cost = (input_tokens / 1_000_000) * prices["input"]
    output_cost = (output_tokens / 1_000_000) * prices["output"]

    return {
        "input_cost": round(input_cost, 6),
        "output_cost": round(output_cost, 6),
        "total_cost": round(input_cost + output_cost, 6),
        "output_share": f"{output_cost / (input_cost + output_cost) * 100:.0f}%",
    }


# Example usage
result = calculate_cost(input_tokens=2000, output_tokens=500, model="sonnet")
print(result)
# {'input_cost': 0.006, 'output_cost': 0.0075, 'total_cost': 0.0135, 'output_share': '56%'}

Cost Comparison Across Models

The same task can cost dramatically different amounts depending on model choice:

Scenario                                  Haiku 4.5   Sonnet 4.6   Opus 4.6
Simple classification (100 in / 10 out)   $0.00012    $0.00045     $0.00225
Chat response (1K in / 300 out)           $0.002      $0.0075      $0.0375
Document summary (10K in / 500 out)       $0.01       $0.0375      $0.1875
Long analysis (50K in / 2K out)           $0.048      $0.18        $0.90

def compare_model_costs(input_tokens: int, output_tokens: int):
    """Compare costs across all Claude models for the same request."""
    models = ["haiku", "sonnet", "opus"]
    print(f"{'Model':<10} {'Input Cost':>12} {'Output Cost':>12} {'Total':>12}")
    print("-" * 48)
    for model in models:
        result = calculate_cost(input_tokens, output_tokens, model)
        print(f"{model:<10} ${result['input_cost']:>10.6f} ${result['output_cost']:>10.6f} ${result['total_cost']:>10.6f}")


compare_model_costs(input_tokens=5000, output_tokens=1000)

Controlling Output Tokens

Since output tokens cost 5x more, controlling output length is the highest-leverage cost optimization:

max_tokens Parameter

Set max_tokens to the minimum necessary. Do not use the default maximum when you only need a short answer.

import json

# Bad: allows up to 4096 output tokens for a yes/no question
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [{"role": "user", "content": "Is this text positive? 'I love this product'"}],
    "max_tokens": 4096,
})

# Good: constrain output for predictable cost
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [
        {"role": "user", "content": "Is this text positive? Answer only 'yes' or 'no'."},
    ],
    "max_tokens": 10,
    "stop_sequences": ["\n"],
})
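The difference in worst-case spend is easy to quantify. This sketch uses the Sonnet output price from the pricing table above; `worst_case_output_cost` is a hypothetical helper, not an API:

```python
# Hypothetical helper: worst-case output cost is max_tokens times the output price.
SONNET_OUTPUT_PRICE = 15.00  # $ per 1M output tokens, from the pricing table above

def worst_case_output_cost(max_tokens: int, price_per_million: float = SONNET_OUTPUT_PRICE) -> float:
    """Upper bound on output spend for a single request."""
    return round((max_tokens / 1_000_000) * price_per_million, 6)

print(worst_case_output_cost(4096))  # 0.06144 (ceiling left by the loose limit)
print(worst_case_output_cost(10))    # 0.00015 (ceiling with a tight limit)
```

A 400x difference in the per-request cost ceiling for the same yes/no question.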

Stop Sequences

Stop sequences terminate generation when Claude outputs specific text, preventing unnecessary continuation.

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [
        {"role": "user", "content": "Extract the email address from this text: 'Contact us at hello@acme.com for info'"},
    ],
    "max_tokens": 100,
    "stop_sequences": ["\n"],  # Stop after the first line; "." or " " would truncate the address itself
})
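Choosing stop sequences carefully matters: a sequence that appears inside the expected answer cuts the answer short. The real truncation happens server-side, but this hypothetical helper simulates the behavior client-side to show why "." would be a poor choice for email extraction:

```python
# Hypothetical client-side simulation of stop-sequence truncation (the real
# truncation happens server-side): generation halts at the earliest match.
def apply_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

answer = "hello@acme.com"
print(apply_stop_sequences(answer, ["\n"]))       # hello@acme.com
print(apply_stop_sequences(answer, ["\n", "."]))  # hello@acme ("." inside the address truncates it)
```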

Structured Output

Requesting JSON output produces more concise responses than free-form text:

system = """You are a data extraction tool. Always respond with valid JSON only, no explanations."""

messages = [
    {"role": "user", "content": "Extract name and role: 'Jane Smith is the CTO of Acme Corp'"},
    {"role": "assistant", "content": "{"},  # Prefill to force JSON
]
# Output: {"name": "Jane Smith", "role": "CTO", "company": "Acme Corp"}
# Much shorter than: "Based on the text provided, I can identify that..."
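The savings can be sketched with a rough four-characters-per-token heuristic (an approximation, not Claude's actual tokenizer; both answer strings below are illustrative):

```python
# Rough comparison using a ~4-characters-per-token heuristic (an approximation,
# not Claude's actual tokenizer) to show why terse JSON cuts output spend.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

json_answer = '{"name": "Jane Smith", "role": "CTO", "company": "Acme Corp"}'
prose_answer = ("Based on the text provided, I can identify that the person's "
                "name is Jane Smith and she serves as the CTO of Acme Corp.")

print(estimate_tokens(json_answer), estimate_tokens(prose_answer))
```

Fewer output tokens for the same extracted facts, at the 5x output price.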

Batch Inference Pricing

Bedrock offers batch inference for non-time-sensitive workloads at a 50% discount:

Model               Batch Input (per 1M)   Batch Output (per 1M)   Savings
Claude Haiku 4.5    $0.40                  $2.00                   50%
Claude Sonnet 4.6   $1.50                  $7.50                   50%
Claude Opus 4.6     $7.50                  $37.50                  50%
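The batch math is a flat halving of the on-demand formula. This sketch assumes the 50% discount from the table; `batch_cost` is an illustrative helper, not a Bedrock API:

```python
# Sketch of batch pricing, assuming the flat 50% discount shown in the table.
ON_DEMAND_PRICING = {
    "haiku": {"input": 0.80, "output": 4.00},
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus": {"input": 15.00, "output": 75.00},
}
BATCH_DISCOUNT = 0.50

def batch_cost(input_tokens: int, output_tokens: int, model: str = "sonnet") -> float:
    """Per-request cost under batch inference pricing."""
    prices = ON_DEMAND_PRICING[model]
    on_demand = (
        (input_tokens / 1_000_000) * prices["input"]
        + (output_tokens / 1_000_000) * prices["output"]
    )
    return round(on_demand * (1 - BATCH_DISCOUNT), 6)

# The $0.0135 Sonnet request from earlier drops to $0.00675 in a batch job
print(batch_cost(2_000, 500))  # 0.00675
```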

Use batch inference for:

  • Nightly processing of accumulated documents
  • Bulk classification or tagging jobs
  • Dataset labeling and enrichment
  • Any workload where results are not needed immediately

import boto3

# Batch inference uses S3 input/output
bedrock_batch = boto3.client("bedrock", region_name="us-east-1")

response = bedrock_batch.create_model_invocation_job(
    jobName="nightly-classification",
    modelId="us.anthropic.claude-haiku-4-5-20250315",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    inputDataConfig={
        "s3InputDataConfig": {
            "s3Uri": "s3://my-bucket/batch-input/",
            "s3InputFormat": "JSONL",
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": "s3://my-bucket/batch-output/",
        }
    },
)
job_arn = response["jobArn"]

On-Demand vs Provisioned Throughput

For predictable high-volume workloads, Bedrock offers provisioned throughput:

Feature              On-Demand                   Provisioned Throughput
Pricing              Per-token                   Hourly commitment (1-6 month terms)
Throttling           Subject to account limits   Guaranteed model units
Best For             Variable traffic            Steady, high-volume workloads
Minimum Commitment   None                        1 month
Cost Savings         N/A                         20-40% at scale

Provisioned throughput makes sense when your monthly spend exceeds ~$10,000 and your traffic patterns are predictable. Below that, on-demand is simpler and more flexible.
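The breakeven logic can be sketched as a simple comparison. Model-unit hourly rates vary by model and commitment term, so the $20/hour figure below is a placeholder, not a real Bedrock price:

```python
# Hypothetical breakeven check. Model-unit hourly rates vary by model and
# commitment term, so the hourly_rate passed in below is a placeholder.
HOURS_PER_MONTH = 730

def provisioned_breakeven(on_demand_monthly: float, hourly_rate: float, units: int = 1) -> dict:
    """Compare projected on-demand spend with a provisioned commitment."""
    provisioned_monthly = hourly_rate * HOURS_PER_MONTH * units
    return {
        "on_demand": round(on_demand_monthly, 2),
        "provisioned": round(provisioned_monthly, 2),
        "cheaper": "provisioned" if provisioned_monthly < on_demand_monthly else "on_demand",
    }

# At a placeholder $20/hour for one model unit, breakeven sits near $14,600/month
print(provisioned_breakeven(on_demand_monthly=18_000, hourly_rate=20.0))
```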

Monthly Cost Estimator

Our cost estimator calculates projected monthly spend across on-demand and batch inference, factoring in model selection, request volume, and token distribution.


Flashcards

Q: Why do Claude output tokens cost more than input tokens?

A: Output tokens require sequential generation (one at a time), while input tokens are processed in parallel. The computation cost per output token is higher, which is why output tokens cost 5x as much as input tokens across all Claude models.

Key Insight

Output tokens drive most of your Claude costs. At the 5x price ratio, generating 500 output tokens costs as much as sending 2,500 input tokens. Every optimization should prioritize reducing output length: tighter max_tokens, structured output formats, stop sequences, and concise prompt instructions.