
Anatomy of Claude API Costs

Claude pricing on Amazon Bedrock is token-based, but understanding the details makes the difference between a $500/month bill and a $5,000/month bill for the same workload. This page breaks down exactly how costs are calculated and where the optimization opportunities are.

Input Tokens vs Output Tokens

Claude charges separately for input tokens (what you send) and output tokens (what Claude generates). Output tokens are significantly more expensive because they require sequential computation.

Model               Input (per 1M tokens)   Output (per 1M tokens)   Output/Input Ratio
Claude Haiku 4.5    $0.80                   $4.00                    5x
Claude Sonnet 4.6   $3.00                   $15.00                   5x
Claude Opus 4.6     $15.00                  $75.00                   5x

The ratio is consistent: output tokens cost 5x as much as input tokens across all models, so each token Claude generates costs five times as much as each token you send. This asymmetry is the foundation of most cost optimization strategies.

How to Calculate Cost Per Request

The formula is straightforward:

Cost = (input_tokens / 1,000,000 * input_price) + (output_tokens / 1,000,000 * output_price)

Example: A Sonnet 4.6 request with 2,000 input tokens and 500 output tokens:

Cost = (2,000 / 1,000,000 * $3.00) + (500 / 1,000,000 * $15.00)
= $0.006 + $0.0075
= $0.0135 per request

At 10,000 requests per day, that is $135/day or ~$4,050/month.
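The daily and monthly projection can be sanity-checked with a few lines of Python; `project_monthly_cost` is an illustrative helper, not part of any SDK:

```python
# Illustrative helper (not part of any SDK) to sanity-check the projection above.
def project_monthly_cost(cost_per_request: float, requests_per_day: int, days: int = 30) -> float:
    """Project monthly spend from per-request cost and daily request volume."""
    return round(cost_per_request * requests_per_day * days, 2)

# 2,000 in / 500 out on Sonnet 4.6 costs $0.0135 per request
print(project_monthly_cost(0.0135, 10_000))  # 4050.0
```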

def calculate_cost(
    input_tokens: int,
    output_tokens: int,
    model: str = "sonnet",
) -> dict:
    """Calculate the cost of a Claude API request."""
    pricing = {
        "haiku": {"input": 0.80, "output": 4.00},
        "sonnet": {"input": 3.00, "output": 15.00},
        "opus": {"input": 15.00, "output": 75.00},
    }

    prices = pricing[model]
    input_cost = (input_tokens / 1_000_000) * prices["input"]
    output_cost = (output_tokens / 1_000_000) * prices["output"]

    return {
        "input_cost": round(input_cost, 6),
        "output_cost": round(output_cost, 6),
        "total_cost": round(input_cost + output_cost, 6),
        "output_share": f"{output_cost / (input_cost + output_cost) * 100:.0f}%",
    }


# Example usage
result = calculate_cost(input_tokens=2000, output_tokens=500, model="sonnet")
print(result)
# {'input_cost': 0.006, 'output_cost': 0.0075, 'total_cost': 0.0135, 'output_share': '56%'}

Cost Comparison Across Models

The same task can cost dramatically different amounts depending on model choice:

Scenario                                  Haiku 4.5   Sonnet 4.6   Opus 4.6
Simple classification (100 in / 10 out)   $0.00012    $0.00045     $0.00225
Chat response (1K in / 300 out)           $0.002      $0.0075      $0.0375
Document summary (10K in / 500 out)       $0.01       $0.0375      $0.1875
Long analysis (50K in / 2K out)           $0.048      $0.18        $0.90

def compare_model_costs(input_tokens: int, output_tokens: int):
    """Compare costs across all Claude models for the same request."""
    models = ["haiku", "sonnet", "opus"]
    print(f"{'Model':<10} {'Input Cost':>12} {'Output Cost':>12} {'Total':>12}")
    print("-" * 48)
    for model in models:
        result = calculate_cost(input_tokens, output_tokens, model)
        print(f"{model:<10} ${result['input_cost']:>10.6f} ${result['output_cost']:>10.6f} ${result['total_cost']:>10.6f}")


compare_model_costs(input_tokens=5000, output_tokens=1000)

Controlling Output Tokens

Since output tokens cost 5x more, controlling output length is the highest-leverage cost optimization:

max_tokens Parameter

Set max_tokens to the minimum necessary. Do not use the default maximum when you only need a short answer.

import json

# Bad: allows up to 4096 output tokens for a yes/no question
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [{"role": "user", "content": "Is this text positive? 'I love this product'"}],
    "max_tokens": 4096,
})

# Good: constrain output for predictable cost
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [
        {"role": "user", "content": "Is this text positive? Answer only 'yes' or 'no'."},
    ],
    "max_tokens": 10,
    "stop_sequences": ["\n"],
})
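The difference in worst-case spend is easy to quantify. This sketch uses the Sonnet output price from the pricing table above; `worst_case_output_cost` is a hypothetical helper, not an API:

```python
# Hypothetical helper: worst-case output cost is max_tokens times the output price.
SONNET_OUTPUT_PRICE = 15.00  # $ per 1M output tokens, from the pricing table above

def worst_case_output_cost(max_tokens: int, price_per_million: float = SONNET_OUTPUT_PRICE) -> float:
    """Upper bound on output spend for a single request."""
    return round((max_tokens / 1_000_000) * price_per_million, 6)

print(worst_case_output_cost(4096))  # 0.06144 (ceiling left by the loose limit)
print(worst_case_output_cost(10))    # 0.00015 (ceiling with a tight limit)
```

A 400x difference in the per-request cost ceiling for the same yes/no question.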

Stop Sequences

Stop sequences terminate generation when Claude outputs specific text, preventing unnecessary continuation.

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "messages": [
        {"role": "user", "content": "Extract the email address from this text: 'Contact us at hello@acme.com for info'"},
    ],
    "max_tokens": 100,
    "stop_sequences": ["\n"],  # Stop after the first line; "." or " " would truncate the address itself
})
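Choosing stop sequences carefully matters: a sequence that appears inside the expected answer cuts the answer short. The real truncation happens server-side, but this hypothetical helper simulates the behavior client-side to show why "." would be a poor choice for email extraction:

```python
# Hypothetical client-side simulation of stop-sequence truncation (the real
# truncation happens server-side): generation halts at the earliest match.
def apply_stop_sequences(text: str, stop_sequences: list[str]) -> str:
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

answer = "hello@acme.com"
print(apply_stop_sequences(answer, ["\n"]))       # hello@acme.com
print(apply_stop_sequences(answer, ["\n", "."]))  # hello@acme ("." inside the address truncates it)
```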

Structured Output

Requesting JSON output produces more concise responses than free-form text:

system = """You are a data extraction tool. Always respond with valid JSON only, no explanations."""

messages = [
    {"role": "user", "content": "Extract name and role: 'Jane Smith is the CTO of Acme Corp'"},
    {"role": "assistant", "content": "{"},  # Prefill to force JSON
]
# Output: {"name": "Jane Smith", "role": "CTO", "company": "Acme Corp"}
# Much shorter than: "Based on the text provided, I can identify that..."
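The savings can be sketched with a rough four-characters-per-token heuristic (an approximation, not Claude's actual tokenizer; both answer strings below are illustrative):

```python
# Rough comparison using a ~4-characters-per-token heuristic (an approximation,
# not Claude's actual tokenizer) to show why terse JSON cuts output spend.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

json_answer = '{"name": "Jane Smith", "role": "CTO", "company": "Acme Corp"}'
prose_answer = ("Based on the text provided, I can identify that the person's "
                "name is Jane Smith and she serves as the CTO of Acme Corp.")

print(estimate_tokens(json_answer), estimate_tokens(prose_answer))
```

Fewer output tokens for the same extracted facts, at the 5x output price.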

Batch Inference Pricing

Bedrock offers batch inference for non-time-sensitive workloads at a 50% discount:

Model               Batch Input (per 1M)   Batch Output (per 1M)   Savings
Claude Haiku 4.5    $0.40                  $2.00                   50%
Claude Sonnet 4.6   $1.50                  $7.50                   50%
Claude Opus 4.6     $7.50                  $37.50                  50%
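The batch math is a flat halving of the on-demand formula. This sketch assumes the 50% discount from the table; `batch_cost` is an illustrative helper, not a Bedrock API:

```python
# Sketch of batch pricing, assuming the flat 50% discount shown in the table.
ON_DEMAND_PRICING = {
    "haiku": {"input": 0.80, "output": 4.00},
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus": {"input": 15.00, "output": 75.00},
}
BATCH_DISCOUNT = 0.50

def batch_cost(input_tokens: int, output_tokens: int, model: str = "sonnet") -> float:
    """Per-request cost under batch inference pricing."""
    prices = ON_DEMAND_PRICING[model]
    on_demand = (
        (input_tokens / 1_000_000) * prices["input"]
        + (output_tokens / 1_000_000) * prices["output"]
    )
    return round(on_demand * (1 - BATCH_DISCOUNT), 6)

# The $0.0135 Sonnet request from earlier drops to $0.00675 in a batch job
print(batch_cost(2_000, 500))  # 0.00675
```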

Use batch inference for:

  • Nightly processing of accumulated documents
  • Bulk classification or tagging jobs
  • Dataset labeling and enrichment
  • Any workload where results are not needed immediately

import boto3

# Batch inference uses S3 input/output
bedrock_batch = boto3.client("bedrock", region_name="us-east-1")

response = bedrock_batch.create_model_invocation_job(
    jobName="nightly-classification",
    modelId="us.anthropic.claude-haiku-4-5-20250315",
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",
    inputDataConfig={
        "s3InputDataConfig": {
            "s3Uri": "s3://my-bucket/batch-input/",
            "s3InputFormat": "JSONL",
        }
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": "s3://my-bucket/batch-output/",
        }
    },
)
job_arn = response["jobArn"]

On-Demand vs Provisioned Throughput

For predictable high-volume workloads, Bedrock offers provisioned throughput:

Feature              On-Demand                   Provisioned Throughput
Pricing              Per-token                   Hourly commitment (1-6 month terms)
Throttling           Subject to account limits   Guaranteed model units
Best For             Variable traffic            Steady, high-volume workloads
Minimum Commitment   None                        1 month
Cost Savings         N/A                         20-40% at scale

Provisioned throughput makes sense when your monthly spend exceeds ~$10,000 and your traffic patterns are predictable. Below that, on-demand is simpler and more flexible.
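The breakeven logic can be sketched as a simple comparison. Model-unit hourly rates vary by model and commitment term, so the $20/hour figure below is a placeholder, not a real Bedrock price:

```python
# Hypothetical breakeven check. Model-unit hourly rates vary by model and
# commitment term, so the hourly_rate passed in below is a placeholder.
HOURS_PER_MONTH = 730

def provisioned_breakeven(on_demand_monthly: float, hourly_rate: float, units: int = 1) -> dict:
    """Compare projected on-demand spend with a provisioned commitment."""
    provisioned_monthly = hourly_rate * HOURS_PER_MONTH * units
    return {
        "on_demand": round(on_demand_monthly, 2),
        "provisioned": round(provisioned_monthly, 2),
        "cheaper": "provisioned" if provisioned_monthly < on_demand_monthly else "on_demand",
    }

# At a placeholder $20/hour for one model unit, breakeven sits near $14,600/month
print(provisioned_breakeven(on_demand_monthly=18_000, hourly_rate=20.0))
```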

Monthly Cost Estimator

Our cost estimator calculates projected monthly spend across on-demand and batch inference, factoring in model selection, request volume, and token distribution.


Flashcards

Q: Why do Claude output tokens cost more than input tokens?

A: Output tokens require sequential generation (one at a time), while input tokens are processed in parallel. The computation cost per output token is higher, which is why output tokens cost 5x as much as input tokens across all Claude models.

Key Insight

Output tokens drive most of your Claude costs. At the 5x price ratio, generating 500 output tokens costs as much as sending 2,500 input tokens. Every optimization should prioritize reducing output length: tighter max_tokens, structured output formats, stop sequences, and concise prompt instructions.