
Open-weight LLM stack

Recommended: Llama for the everyday, Mistral for European data, Claude as the eval anchor

Closed-weight APIs are the right default for most teams. They're not the right default for a regulated industry, a privacy-led product, or anyone whose customers ask 'where does this data go?'. Llama and Mistral are open-weight families that ship near-frontier quality with the receipts. Use Claude on a small eval set to confirm the open-weight model holds up on YOUR task.

Code · Advanced · From $15/mo
The stack
Meta AI
Default open-weight model

Llama 3.x and 4 cover most general tasks. Hosted on Groq for fast inference, on Together for breadth, or self-hosted on your own GPUs via Ollama / vLLM. Open weights mean you keep the option to leave any provider.

Free chat · Open weights · API via Groq/Together metered · Alts: Mistral, Qwen
Mistral
EU data residency + code

Mistral Large for general, Codestral for code. EU-headquartered, EU data centers available. The right pick when the legal team is in the room.

Free chat · €15/mo Pro · API metered · Alts: Meta AI
Claude
Eval anchor + spot-check

Use Claude on a 30 to 50-row eval to confirm your open-weight pick holds quality on your specific task. Re-run quarterly or whenever you swap providers.

$20/mo Pro · API $3/M tokens · Alts: ChatGPT
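The "option to leave any provider" that open weights buy you is concrete: Groq, Together, and a self-hosted Ollama box all expose OpenAI-compatible chat endpoints, so switching hosts is a one-line `base_url` change. A minimal sketch; the URLs and model identifiers below are assumptions current at time of writing, so check each host's docs before relying on them.

```python
# Hypothetical host table: every entry speaks the OpenAI-compatible chat API,
# so the calling code is identical and only base_url + model name change.
# URLs and model names are illustrative assumptions -- verify against each
# host's current documentation.
HOSTS = {
    "groq":     {"base_url": "https://api.groq.com/openai/v1",
                 "model": "llama-3.3-70b-versatile"},
    "together": {"base_url": "https://api.together.xyz/v1",
                 "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo"},
    "ollama":   {"base_url": "http://localhost:11434/v1",   # self-hosted
                 "model": "llama3.3"},
}

def chat(host: str, prompt: str, api_key: str) -> str:
    """Send one prompt to the chosen host; same code path for all three."""
    from openai import OpenAI  # pip install openai
    cfg = HOSTS[host]
    client = OpenAI(base_url=cfg["base_url"], api_key=api_key)
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Swapping `"groq"` for `"ollama"` moves the same workload onto your own GPU box with no other code changes, which is the exit option the stack is built around.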
Real monthly cost
small
$15/mo
Hosted inference, low volume
  • Meta AI: $10 (Groq / Together API)
  • Mistral: free (Le Chat)
  • Claude: $5 (eval)
medium
$170/mo
Hosted inference, real volume
  • Meta AI: $100 (Groq scale tier)
  • Mistral: $50 (API)
  • Claude: $20 (eval Pro)
heavy
$1,180/mo
Self-hosted on rented GPUs
  • Meta AI: $1,100 (rented GPU box for Llama)
  • Mistral: $0 (also on the same box)
  • Claude: $80 (regular eval + drift checks)
Workflow
  1. Pick the model for the use case (Meta AI)

    Llama 3.x for general tasks; Llama 4 for heavy reasoning; Mistral Large for EU plus nuanced French/Spanish; Codestral when the task is specifically code generation.

  2. Pick the host (Meta AI)

    Don't self-host until you have to. Groq is fastest for Llama; Together is broadest. Mistral La Plateforme for hosted Mistral. Self-host only when data-residency or cost-at-scale forces you.

  3. Build a 50-row eval with Claude

    Use Claude (or GPT) to draft inputs + expected outputs. Hand-edit to remove ambiguity. This eval is the only way to know your open-weight pick is truly good enough.

    Prompt · Eval scaffold for an open-weight LLM swap
    I'm evaluating whether to use {{Llama 3.3 70B / Mistral Large / etc.}} in production for {{task description}}. Help me build the eval set.
    
    Task:
    """
    {{task: input shape, expected output shape, definition of correct}}
    """
    
    Output:
    1. **Eval rows** (50) — table of {input, expected output, why this case matters}. Cover the easy cases, the long-tail edge cases, and 5 deliberately adversarial inputs.
    2. **Scoring rubric** — exactly how I score actual outputs against expected. Define partial credit if useful.
    3. **Pass bar** — what % score against the rubric should I require before swapping production traffic to the open-weight model?
    4. **What I should re-run quarterly** — the 5 to 10 most-important rows that catch regression.
    
    Be ruthless about the adversarial cases. The whole point of evals is the model failing on things YOUR users will throw at it.
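Once the prompt above has produced your eval rows, you need a deterministic way to score them. A minimal sketch of the rubric idea, assuming each row carries an `expected` and a `got` string and that exact match earns full credit with a normalized match as partial credit; your domain's real rubric will be richer.

```python
# Hypothetical rubric scorer for eval rows shaped like
# {"input": ..., "expected": ..., "got": ...}. The scoring rules here
# (exact match = 1.0, case/whitespace-insensitive match = 0.5) are
# placeholder assumptions -- swap in the rubric the prompt produced.

def score_row(expected: str, got: str) -> float:
    if got == expected:
        return 1.0
    if got.strip().lower() == expected.strip().lower():
        return 0.5  # partial credit: right content, wrong formatting
    return 0.0

def score_eval(rows: list[dict]) -> float:
    """Mean score across the eval set, on a 0.0 to 1.0 scale."""
    return sum(score_row(r["expected"], r["got"]) for r in rows) / len(rows)
```

Keeping the scorer this dumb and deterministic is deliberate: the eval only catches regressions if the same outputs always produce the same score.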
  4. Smoke-test on the eval (Mistral)

    Run the eval on Claude AND your open-weight pick. If the open-weight model scores within 5% (or your domain's tolerance), ship it. If not, narrow scope or pick a different open-weight model.
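The step-4 gate can be written down as a one-line check: ship only if the open-weight score lands within your tolerance of the Claude anchor. A sketch using the 5% default from the text; tighten the tolerance for high-stakes domains.

```python
# Ship/no-ship gate for the open-weight swap. `anchor_score` is Claude's
# eval score, `candidate_score` is the open-weight model's, both 0.0-1.0.
# The 0.05 default mirrors the "within 5%" rule of thumb above.

def within_tolerance(anchor_score: float, candidate_score: float,
                     tolerance: float = 0.05) -> bool:
    """True if the candidate is no more than `tolerance` below the anchor."""
    return candidate_score >= anchor_score - tolerance
```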

  5. Production + monthly drift check (Claude)

    Ship to prod. Re-run the eval monthly to catch silent quality drift when the host updates the model.

What it produced
Healthcare-adjacent SaaS, EU customers

Could not use closed-weight US-hosted APIs for the customer-data path. Built on Mistral La Plateforme (EU region) for the prod calls, kept Claude for the eval rig. Internal compliance team signed off in week 2 because the audit trail (model card + EU hosting) was clean.

Common pitfalls
Self-hosting before you have to

Renting a GPU box for one app is rarely a good trade. Hosted Llama (Groq, Together) covers most cases at lower cost. Self-host only when audit / latency / cost-at-scale forces it.

Treating 'open weights' as a feature without a use case

Open weights only matter if you'd actually self-host or audit. If you'll never inspect them, the closed-weight APIs are usually still the right call. Be honest about which camp you're in.

Forgetting the eval drift

Hosted open-weight models update quietly. The monthly eval is the cheapest way to catch a regression before a customer does.

Curated by @alex-w
Updated weekly