Cheap bulk LLM automation

Recommended: DeepSeek for the volume calls, Claude for the eval and golden examples

When you need to score reviews, classify support tickets, or summarize emails at scale, Claude API costs add up. DeepSeek matches frontier reasoning on most batch tasks at roughly a tenth of the price. Use Claude to write the prompt and the eval set; switch to DeepSeek for production volume; sample-check with Claude monthly.

Code · Intermediate · From $5/mo
The stack
Claude
Prompt design + golden eval set

Use Claude to draft the prompt, hand-label 30 to 50 golden examples, and define what 'correct' looks like for your task. This is the work where model quality matters most.

$20/mo Pro · API $3/M tokens · Alts: ChatGPT
DeepSeek
Production volume calls

DeepSeek-V3 / R1 hit ~$0.14/M input + $0.28/M output. On most classification, summarization, and structured-extraction tasks, output quality is within 2 to 5% of Claude on the eval set. The 90%+ cost savings dominate at volume.

Free chat · API ~$0.14/M input, $0.28/M output · Alts: Mistral, Qwen
Real monthly cost
| Tier | Volume | Claude | DeepSeek | Total |
| --- | --- | --- | --- | --- |
| Small | 10k calls/mo | $3 (eval-only) | $2 | $5/mo |
| Medium | 1M calls/mo | $20 (Pro for eval) | $120 | $140/mo |
| Heavy | 10M calls/mo | $80 (eval + sampling) | $1,120 | $1,200/mo |
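The tier totals can be sanity-checked with a back-of-envelope cost model. The per-call token counts below are assumptions (the table doesn't state them); swap in your own traffic profile:

```python
def monthly_cost(calls_per_month: int,
                 in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Estimated monthly API spend in dollars."""
    per_call = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return calls_per_month * per_call

# DeepSeek list prices from above: ~$0.14/M input, $0.28/M output.
# 1M calls at ~700 input + 80 output tokens/call (assumed) lands near
# the $120/mo medium tier.
estimate = monthly_cost(1_000_000, 700, 80, 0.14, 0.28)
```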
Workflow
  1. Draft the prompt with Claude

    Use Claude's chat interface for prompt iteration; the free tier is fine for this step. Model quality matters here, so draft with the stronger model.

    Prompt · Production prompt + eval set scaffold
    I'm building a production LLM task. Help me draft the prompt and an eval set.
    
    Task description:
    """
    {{describe the task: input, expected output format, what 'correct' means}}
    """
    
    Output, in this order:
    
    1. **System prompt** — the production prompt that will run on every input. Strict output format. No conversational hedging.
    2. **30-row eval set** — table of {input, expected output, why}. Cover the easy cases, the edge cases, and 5 deliberately adversarial inputs.
    3. **Eval rubric** — exactly how I'll score outputs against the expected outputs. Define partial credit if useful.
    4. **Smoke test plan** — the 3 minimum-viable runs I should do before sending real volume to a cheaper model.
    
    The point of all this: I'm going to run the prompt on 100k inputs/month via DeepSeek's API. I want a prompt that survives the swap from Claude (where I'm authoring it) to DeepSeek (where it'll run).
  2. Smoke-test against the eval set on both

    Run the eval set through both Claude and DeepSeek. If DeepSeek scores within 5% of Claude on your rubric, ship DeepSeek. If not, narrow the prompt or the task scope before re-testing.
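The ship/no-ship rule above can be sketched as a pure function. It assumes you have already scored each eval row 0 or 1 against the rubric, and reads "within 5%" as an absolute gap in mean score:

```python
def ship_cheaper_model(claude_scores: list[float],
                       deepseek_scores: list[float],
                       tolerance: float = 0.05) -> bool:
    """True if DeepSeek's mean eval score is within `tolerance`
    (absolute) of Claude's, per the rule above."""
    claude = sum(claude_scores) / len(claude_scores)
    deepseek = sum(deepseek_scores) / len(deepseek_scores)
    return (claude - deepseek) <= tolerance

# Illustrative: on a 30-row eval, Claude passes 28/30 (0.933) and
# DeepSeek 27/30 (0.900). Gap is 0.033 < 0.05, so ship DeepSeek.
```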

  3. Run production on DeepSeek

    Standard OpenAI-compatible API. Most SDKs work by changing the base URL.
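A minimal sketch of that base-URL swap, using the OpenAI-style chat-completions request shape. The endpoint and model names are assumptions to verify against each provider's current docs:

```python
# Provider registry: switching providers is a base_url + model-name change.
# Endpoint and model names below are assumptions; check current docs.
PROVIDERS = {
    "deepseek": {"base_url": "https://api.deepseek.com", "model": "deepseek-chat"},
    "openai":   {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"},
}

def chat_kwargs(provider: str, system_prompt: str, user_input: str) -> dict:
    """Build kwargs for an OpenAI-style chat.completions.create call."""
    cfg = PROVIDERS[provider]
    return {
        "model": cfg["model"],
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }

# Usage with the OpenAI SDK (requires `pip install openai` and an API key):
#   from openai import OpenAI
#   client = OpenAI(base_url=PROVIDERS["deepseek"]["base_url"],
#                   api_key=os.environ["DEEPSEEK_API_KEY"])
#   resp = client.chat.completions.create(**chat_kwargs("deepseek", SYSTEM, ticket))
```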

  4. Monthly sampling with Claude

    Sample 1 to 2% of DeepSeek outputs and re-score with Claude. Catches drift and prompt rot before they become a customer-facing problem.
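The 1 to 2% sample is worth making deterministic so an audit can be re-run. A minimal sketch, assuming outputs are keyed by ID and the month string is used as a seed:

```python
import random

def monthly_sample(output_ids: list[str], rate: float = 0.02,
                   seed: str = "2024-06") -> list[str]:
    """Pick a reproducible ~rate fraction of outputs to re-score with
    Claude. Seeding with the month string makes the audit repeatable."""
    rng = random.Random(seed)
    k = max(1, round(len(output_ids) * rate))
    return rng.sample(output_ids, k)

sample = monthly_sample([f"ticket-{i}" for i in range(10_000)])
# 2% of 10k outputs -> 200 to re-score.
```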

What it produced
Customer-support ticket classifier

Classifying ~120k tickets/month into 14 categories. Claude API: ~$2,800/mo at this volume. DeepSeek: ~$210/mo. Eval-set quality: within 3.4% on F1 (Claude 91.2 vs DeepSeek 87.8). Saved ~$31k/year; spent the savings on better hand-labeling.
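The F1 comparison quoted here can be reproduced from raw predictions without a dependency. This sketch assumes macro-averaged F1 over the 14 categories (the write-up doesn't specify macro vs. micro):

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Macro-averaged F1 over the label set."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Run it on the same eval rows for both models' predictions and compare the two numbers, exactly as in the smoke-test step.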

Common pitfalls
Skipping the eval set

The whole bet of this stack is that DeepSeek matches Claude on YOUR task. Without an eval set, you're guessing. The eval set is the cheapest insurance possible.

Switching back to Claude on the first DeepSeek miss

Production LLMs all hallucinate sometimes. Don't switch on n=1 — the right move is to widen the eval set and confirm a real quality gap before paying 10x more.
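One way to decide whether the widened eval set shows a real quality gap is to put a confidence interval on each model's pass rate. A sketch using the Wilson score interval, assuming pass/fail scoring:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate -- a quick check on
    whether an observed gap on the eval set is signal or noise."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# If Claude's and DeepSeek's intervals overlap heavily on the widened
# eval set, you haven't shown a real gap -- keep the cheap model.
```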

Treating DeepSeek as a closed black box

It's offered as a hosted API, but the model weights are open. If data residency or supplier risk is a concern, you can self-host the same model on your own infra. Most teams won't, but knowing the option exists changes the conversation.

Curated by @alex-w
Updated weekly