Pricing Report · May 2026

Baseten vs OpenRouter vs DeepInfra
+ Together AI, Fireworks, NVIDIA & HF

Cost per million tokens · 7 platforms compared
Overview

Key finding: DeepInfra is the cheapest per-token option for most open models. NVIDIA NIM prices competitively on the models it serves ($1.39/$2.78 for V4 Pro versus $1.74/$3.48 on Baseten and DeepInfra). Together AI and Fireworks AI offer full HIPAA compliance. OpenRouter adds a 5.5% platform fee on top of provider pricing. Hugging Face Inference Providers routes to the same underlying providers at $0 markup.

NVIDIA NIM: Offers a free trial tier (1,000-5,000 credits) and then routes to partner providers for production — often at competitive undercut pricing. Check build.nvidia.com for current rates per model.

Hugging Face: Three of its inference paths are relevant here: (1) Inference Providers: routes to 20+ providers (DeepInfra, Together, Fireworks, etc.) at $0 markup, with $0.10/mo free credits for free users / $2/mo for PRO. (2) Inference Endpoints: dedicated GPU instances from $0.50/hr (T4) to $36/hr (H100). (3) ZeroGPU: free H200 for PRO users via Spaces. Not directly comparable to per-token APIs; best for custom models or when you want no vendor lock-in.

DeepSeek V4 — Pro & Flash
1.6T MoE (49B active) · 1M context

DeepSeek V4 Pro

List prices per 1M tokens
Platform Input Output Cache Verdict
Baseten $1.74 $3.48 $0.145 = DeepInfra
DeepInfra $1.74 $3.48 $0.145 = Baseten
Together AI $2.10 $4.40 $0.20 highest
Fireworks AI $1.74 $3.48 $0.14 = DI/Baseten
NVIDIA NIM $1.39 $2.78 n/a cheaper than Baseten/DeepInfra/Fireworks
OpenRouter (DeepSeek) $0.435 $0.87 $0.0036 cheapest

Winner: OpenRouter, by a massive margin. OpenRouter routes to DeepSeek's own API at $0.435/$0.87, 4× cheaper than Baseten, DeepInfra, and Fireworks ($1.74/$3.48) and about 3.2× cheaper than NVIDIA NIM. NVIDIA NIM comes in second at $1.39/$2.78 (20% cheaper than Baseten/DeepInfra/Fireworks). Together AI is the most expensive at $2.10/$4.40.
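
To turn the list prices above into a monthly bill, the sketch below multiplies each platform's input and output rates by a token volume. The workload size (500M input, 100M output tokens) is a hypothetical chosen for illustration, and the calculation ignores cache discounts and platform fees.

```python
# List prices from the V4 Pro table, in USD per 1M tokens: (input, output).
V4_PRO_PRICES = {
    "Baseten": (1.74, 3.48),
    "DeepInfra": (1.74, 3.48),
    "Together AI": (2.10, 4.40),
    "Fireworks AI": (1.74, 3.48),
    "NVIDIA NIM": (1.39, 2.78),
    "OpenRouter (DeepSeek)": (0.435, 0.87),
}


def workload_cost(prices, input_tokens, output_tokens):
    """Return the USD cost of a workload, ignoring cache discounts and fees."""
    input_price, output_price = prices
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 500M input tokens, 100M output tokens.
costs = {
    name: workload_cost(prices, 500_000_000, 100_000_000)
    for name, prices in V4_PRO_PRICES.items()
}

# Print platforms from cheapest to most expensive.
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:24s} ${cost:,.2f}")
```

Because every platform in this table charges exactly twice as much for output as for input (apart from rounding on Together AI's rate), the ranking does not change with the input/output mix; only the absolute totals do.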

284B MoE (13B active) · 1M context

DeepSeek V4 Flash

Platform Input Output Cache Verdict
Baseten Not listed — deploy as dedicated GPU
DeepInfra $0.14 $0.28 $0.028 tied lowest
Together AI Not listed
Fireworks AI Not listed
NVIDIA NIM $0.14 $0.28 n/a same as DI/OR
OpenRouter (DeepSeek) $0.14 $0.28 n/a same

Tie: DeepInfra / NVIDIA / OpenRouter — all three at $0.14/$0.28. NVIDIA NIM also routes here via partner DeepInfra. Baseten and Together/Fireworks don't offer V4 Flash as serverless.

DeepSeek V3.1
671B MoE (37B active) · 128K context

DeepSeek V3.1 Terminus

Platform Input Output Cache Verdict
Baseten $0.50 $1.50 $0.25
DeepInfra $0.21 $0.79 $0.13 cheapest
Together AI $0.60 $1.70 n/a most expensive
Fireworks AI $0.56 $1.68 $0.28 near Baseten
OpenRouter (DeepSeek) $0.56 $1.68 $0.07 ≈ Fireworks

Winner: DeepInfra, by a huge margin. DeepInfra's $0.21/$0.79 is about 2.4× cheaper on input than the next cheapest (Baseten at $0.50) and roughly half the output cost of any other provider. Together AI is the most expensive at $0.60/$1.70.

Kimi K2.6
Moonshot AI · 262K context

Kimi K2.6

Platform Input Output Cache Verdict
Baseten $1.00 $3.90 $0.20 expensive
DeepInfra $0.75 $3.50 $0.15 tied cheapest
Together AI $1.20 $4.50 $0.20 most expensive
Fireworks AI $0.95 $4.00 $0.48 mid
NVIDIA NIM $0.95 $4.00 n/a = Fireworks
OpenRouter (Moonshot) $0.21 $4.00 n/a best input
OpenRouter (DeepInfra) $0.75 $3.50 $0.15 tied cheapest

Winner: DeepInfra / OpenRouter (tie) — DeepInfra direct at $0.75/$3.50 is the lowest consistent pricing. OpenRouter via Moonshot direct has unbeatable input ($0.21) but $4.00 output. NVIDIA NIM routes through Fireworks at $0.95/$4.00. Together is most expensive. Fireworks is mid-range.
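
Which of the two tied options is cheaper depends on the input/output mix. As a rough sketch, the break-even ratio can be solved from the two price pairs in the table. It assumes list prices only, with no cache discounts and no OpenRouter platform fee, and the function name is illustrative.

```python
def breakeven_input_output_ratio(a, b):
    """Input/output token ratio at which providers a and b cost the same.

    a and b are (input_price, output_price) in USD per 1M tokens, where a has
    the lower input price and the higher output price. Above the returned
    ratio, a is cheaper; below it, b is cheaper.
    """
    a_in, a_out = a
    b_in, b_out = b
    # a_in*i + a_out*o == b_in*i + b_out*o  =>  i/o == (a_out - b_out) / (b_in - a_in)
    return (a_out - b_out) / (b_in - a_in)


moonshot = (0.21, 4.00)   # OpenRouter (Moonshot), from the table
deepinfra = (0.75, 3.50)  # DeepInfra direct, from the table

ratio = breakeven_input_output_ratio(moonshot, deepinfra)
print(f"Moonshot route is cheaper above {ratio:.2f} input tokens per output token")
```

Under these assumptions the crossover sits just below one input token per output token, so prompt-heavy workloads favor the Moonshot route and generation-heavy workloads favor DeepInfra.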

Kimi K2.5
Moonshot AI · 262K context · Multimodal

Kimi K2.5

Platform Input Output Cache Verdict
Baseten $0.60 $3.00 $0.12 mid
DeepInfra $0.45 $2.25 $0.07 cheapest direct
Together AI $0.50 $2.80 n/a close
Fireworks AI $0.60 $3.00 $0.10 = Baseten
OpenRouter (DeepInfra) $0.44 $2.00 $0.22 routed

Winner: DeepInfra. Among direct providers, $0.45/$2.25 beats everyone. Together AI is competitive at $0.50/$2.80 but behind on output. Fireworks and Baseten match at $0.60/$3.00. OpenRouter's routed DeepInfra lists slightly lower prices ($0.44/$2.00), but that is before the 5.5% platform fee.
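
To see how the 5.5% platform fee changes this comparison, the sketch below adds the fee to the routed list prices. It assumes the fee scales proportionally with token spend, which is a simplification; how the fee is actually billed may differ.

```python
OPENROUTER_FEE = 0.055  # platform fee stated in this report


def with_fee(price, fee=OPENROUTER_FEE):
    """Price per 1M tokens after adding a percentage fee on top."""
    return price * (1 + fee)


deepinfra_direct = (0.45, 2.25)   # input, output (USD per 1M tokens)
openrouter_routed = (0.44, 2.00)  # list price before the fee

routed_in = with_fee(openrouter_routed[0])
routed_out = with_fee(openrouter_routed[1])

print(f"OpenRouter routed, fee included: ${routed_in:.3f} / ${routed_out:.3f}")
print(f"DeepInfra direct:                ${deepinfra_direct[0]:.3f} / ${deepinfra_direct[1]:.3f}")
```

With the fee included, the routed option costs slightly more than DeepInfra direct on input and still less on output, so the better choice depends on how output-heavy the workload is.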

Llama 3.3 70B Instruct
Meta · 70B · 131K context

Llama 3.3 70B

Platform Input Output Cache Verdict
Baseten $0.10 $0.50 n/a mid
DeepInfra $0.10 $0.32 n/a cheapest paid
Together AI $0.88 $0.88 n/a expensive
Fireworks AI $0.90 $0.90 $0.45 most expensive
OpenRouter (Venice) $0.00 $0.00 n/a free!
OpenRouter (paid) $0.10 $0.32 n/a = DeepInfra

Winner: OpenRouter (free). Llama 3.3 is free on Venice. Among paid options, DeepInfra and OpenRouter paid are the cheapest ($0.10/$0.32). Notably, Together AI and Fireworks AI charge a flat $0.88 and $0.90 per million tokens, nearly 9× DeepInfra's input price and about 2.8× its output price.

Important: Together & Fireworks pricing pattern

Llama 3.3 70B Pricing Discrepancy

Together AI and Fireworks AI both charge a flat $0.88–$0.90 per million tokens (same rate for input and output) for Llama 3.3 70B. That compares unfavorably to DeepInfra's $0.10/$0.32 split. The same pattern holds across other open models: Together and Fireworks tend to price higher on smaller/commodity models. Their competitive advantage isn't price on small models. It shows on large MoE models like DeepSeek V4 Pro, where their optimized inference stacks close the gap (Fireworks matches DeepInfra at $1.74/$3.48; Together remains higher at $2.10/$4.40).
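
How much the flat rate hurts depends on the share of input tokens in the workload. The sketch below computes a blended price per 1M tokens for a few input shares; the shares themselves are illustrative assumptions, not measurements.

```python
deepinfra = (0.10, 0.32)  # input, output, USD per 1M tokens
together = (0.88, 0.88)   # flat rate for input and output


def blended_price(prices, input_share):
    """Average USD per 1M tokens when input_share of all tokens are input."""
    input_price, output_price = prices
    return input_share * input_price + (1 - input_share) * output_price


# Illustrative input shares: balanced, prompt-heavy, and very prompt-heavy.
for share in (0.5, 0.8, 0.95):
    di = blended_price(deepinfra, share)
    tg = blended_price(together, share)
    print(f"input share {share:.0%}: DeepInfra ${di:.3f}, "
          f"Together ${tg:.2f}, ratio {tg / di:.1f}x")
```

The gap widens as workloads become more input-heavy, because the flat rate is furthest from DeepInfra's price on the input side.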

Gemma 4 31B Instruct
Google DeepMind · Dense 31B · 256K context · Multimodal

Gemma 4 31B

Platform Input Output Cache Verdict
Baseten Not listed as Model API
DeepInfra Not listed
Together AI $0.20 $0.50 n/a pricier paid option
Fireworks AI Not listed as serverless
NVIDIA NIM $0.14 $0.40 n/a cheapest paid
OpenRouter $0.00 $0.00 n/a free!

Winner: OpenRouter (free). Gemma 4 31B is free on OpenRouter. Among paid options, NVIDIA NIM is the cheapest at $0.14/$0.40, beating Together AI's $0.20/$0.50 by 30% on input and 20% on output. Baseten, DeepInfra, and Fireworks don't offer it serverless.

NVIDIA Nemotron 3 Super
NVIDIA · 120B MoE (12B active) · 262K context

Nemotron 3 Super 120B A12B

Platform Input Output Cache Verdict
Baseten $0.30 $0.75 $0.06 mid
DeepInfra $0.10 $0.50 n/a cheapest paid
Together AI Not listed on serverless (dedicated only)
Fireworks AI On-demand deployment only (no serverless)
OpenRouter $0.00 $0.00 n/a free!

Winner: OpenRouter (free) — Nemotron 3 Super is free on OpenRouter. Among paid providers, DeepInfra is the cheapest at $0.10/$0.50 — 3× cheaper on input than Baseten's $0.30/$0.75. Together AI and Fireworks only offer it via dedicated/on-demand deployments (per GPU-hour), not serverless per-token.

Business Model Differences
OpenRouter

Marketplace + Routing Layer

OpenRouter is an aggregator — it connects to 60+ provider APIs (DeepSeek, Together, Fireworks, DeepInfra, etc.) and routes your request to the cheapest/fastest/most reliable provider. It adds a 5.5% platform fee on top of provider pricing.

Strengths: 400+ models, auto-fallback, free tier, multi-provider competition, prompt caching across providers, no markup on provider base pricing beyond the 5.5% platform fee.

Weaknesses: No dedicated GPU access, limited control over infrastructure, provider-dependent uptime.

Baseten

GPU Cloud + Inference Stack

Baseten is a GPU deployment platform that also offers optimized Model APIs. You can either use their pre-built endpoints (pay per token) or deploy your own model on dedicated GPUs (pay per minute/hour).

Strengths: Dedicated GPUs (T4→B200), no idle charges, autoscaling, SOC 2 + HIPAA, custom deployments via Truss, fast cold starts.

Weaknesses: Smaller model library (mostly open-source), per-token pricing is often higher than OpenRouter's best providers, no free tier for LLMs.

Together AI

Serverless + Dedicated Inference

Together AI is a full-stack inference platform with serverless, dedicated GPU deployments, fine-tuning, and GPU clusters. It has raised significant funding and has strong enterprise adoption.

Strengths: SOC 2 Type II + HIPAA + BAA, competitive on large MoE models, dedicated H100 from $3.99/hr, fine-tuning platform, batch inference at 50% off.

Weaknesses: Expensive on smaller/commodity models (Llama 3.3 at $0.88/M), no free tier, smaller model selection than OpenRouter.

Fireworks AI

Fast Inference Stack

Fireworks AI focuses on low-latency inference with optimized engines (Firework Engine). It is also SOC 2 Type II + HIPAA + BAA compliant and has raised over $100M.

Strengths: SOC 2 Type II + HIPAA + BAA, optimized inference (often fastest time-to-first-token), on-demand GPU deployments (H100/H200/B200/B300), fine-tuning platform.

Weaknesses: Expensive on small models (flat $0.90/M for Llama 3.3), smaller library than Together or OpenRouter.

NVIDIA NIM

GPU-Optimized Inference via Build.nvidia.com

NVIDIA NIM (via build.nvidia.com) provides inference microservices running on NVIDIA's optimized stack. It offers a free tier with 1,000-5,000 trial credits, then routes to partner providers for production. The catalog includes Nemotron models and NVIDIA-optimized versions of all major open models.

Strengths: Often undercuts other providers on popular models ($1.39/$2.78 V4 Pro vs $1.74/$3.48 on Baseten), free trial credits, NVIDIA hardware optimization (DGX, Blackwell), Nemotron family runs best on its own hardware, strong for self-hosted enterprise deployments via NVIDIA AI Enterprise license.

Weaknesses: Not every model available serverless, limited per-token pricing transparency (many models are "downloadable only"), free tier is just for trial, production requires partner routing or enterprise license, no HIPAA BAA on the trial tier.

Hugging Face

Four Inference Models in One Platform

Hugging Face is unique — it operates four distinct inference paths:

1. Inference Providers (routed, per-token): Routes to 20+ provider APIs (DeepInfra, Together, Fireworks, Groq, Cerebras, Novita, etc.) with $0 markup — you pay the same rate as going direct. Includes $0.10/mo free credits for free users, $2/mo for PRO. Covers 200+ models.

2. Inference Endpoints (dedicated GPU): Deploy any model from the Hub on dedicated GPU instances. Pricing is per GPU-hour, not per token. Starts at $0.50/hr (T4) through $4.50/hr (H100) via AWS. Scales to zero. Best for custom/private models or high-volume workloads.

3. HF-Inference (HF's own serverless infra): Runs on HF's own hardware. Billed by compute time (GPU seconds), not per token. 15,000+ models supported but mostly CPU-friendly tasks — embeddings, text classification, sentence similarity, small LLMs (BERT, GPT-2). Only 1 trending chat model. A 10-second FLUX.1-dev image generation costs ~$0.0012.

4. ZeroGPU (free): Free H200 GPU access for PRO users via Spaces. Limited to side projects.
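
Because Inference Endpoints bill per GPU-hour rather than per token, comparing them with per-token APIs requires a throughput assumption. The sketch below computes the sustained token volume at which a dedicated instance costs the same as per-token billing. The pairing of rates is illustrative only: $4.50/hr is the H100 rate quoted above, and $0.32 per 1M tokens is DeepInfra's Llama 3.3 70B output price from this report, used as a stand-in.

```python
def breakeven_tokens_per_hour(gpu_hourly_usd, per_million_usd):
    """Tokens per hour at which a dedicated GPU costs the same as per-token billing."""
    return gpu_hourly_usd / per_million_usd * 1_000_000


# Illustrative pairing of a GPU-hour rate with a per-token rate (see lead-in).
tokens_per_hour = breakeven_tokens_per_hour(4.50, 0.32)

print(f"break-even: {tokens_per_hour:,.0f} tokens/hour "
      f"(about {tokens_per_hour / 3600:,.0f} tokens/second sustained)")
```

Below that sustained volume, per-token billing is cheaper; above it, the dedicated instance wins. Whether a single instance can actually sustain that throughput for a given model is a separate question this report does not address.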

Strengths: Largest model library (1M+ models on Hub), $0 markup on routed providers, flexible deployment options, strong open-source community, fine-tuning platform, Spaces hosting.

Weaknesses: No per-token pricing for HF-native models (hf-inference is compute-time), routed providers add latency vs direct, no auto-fallback or multi-provider optimization like OpenRouter, $0.10/mo free credits is negligible.

When to Use Which
Decision matrix

OpenRouter when…

→ You want access to 400+ models including Claude, GPT, Gemini
→ You want provider redundancy with auto-fallback
→ You want a free tier for experimentation
→ Your workload has good cache hit rates (effective pricing often 30-50% below list)
→ You want to compare providers before choosing one
→ You don't need dedicated GPU or HIPAA compliance
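
The cache-hit point above can be made concrete with a blended input price. This is a minimal sketch that assumes cached input tokens are billed at the cache rate and the rest at the list input rate; it covers input tokens only, and the hit rates are hypothetical.

```python
def effective_input_price(input_price, cache_price, cache_hit_rate):
    """Blend cached and uncached input prices (USD per 1M tokens)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return cache_hit_rate * cache_price + (1 - cache_hit_rate) * input_price


# OpenRouter (DeepSeek) V3.1 prices from this report: $0.56 input, $0.07 cached.
LIST_INPUT, CACHED_INPUT = 0.56, 0.07

for hit_rate in (0.0, 0.4, 0.6):  # hypothetical cache hit rates
    price = effective_input_price(LIST_INPUT, CACHED_INPUT, hit_rate)
    discount = 1 - price / LIST_INPUT
    print(f"hit rate {hit_rate:.0%}: ${price:.3f} per 1M input tokens "
          f"({discount:.0%} below list)")
```

Output tokens are not cached, so the discount on a full bill is smaller than the input-only figure whenever the workload generates a meaningful amount of output.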

DeepInfra when…

→ You want the lowest per-token price for popular open models (often cheapest)
→ You want dedicated GPU at wholesale rates (H100 $1.79/hr, B200 $2.79/hr)
→ You need SOC 2 / ISO 27001 compliance
→ You know which model you want and don't need routing/fallback
→ You want to avoid the 5.5% OpenRouter platform fee
→ Your usage is high enough to benefit from direct pricing

Baseten when…

→ You need HIPAA compliance (SOC 2 + HIPAA; Together AI and Fireworks AI are also HIPAA-compliant)
→ You need to deploy custom models via Truss framework
→ You want autoscaling to zero with no idle charges
→ You need B200 or specific hardware for custom workloads
→ Your input/output ratio is input-heavy and benefits from cache

Hugging Face when…

→ You want to route to 20+ providers with $0 markup, paying the same rates as going direct
→ You need to deploy a custom model from the Hub on dedicated GPU (Inference Endpoints)
→ You want the largest model library (1M+ models) to experiment with
→ You want a single billing relationship for consolidated spending across providers
→ You're a PRO user and can use free ZeroGPU for side projects
→ You want to fine-tune and deploy in one platform (AutoTrain + Endpoints)

Summary table

Head to Head

Model Cheapest Input Cheapest Output Best Platform
DeepSeek V4 Pro OpenRouter ×4 OpenRouter ×4 OpenRouter
DeepSeek V4 Flash DeepInfra / NVIDIA / OR DeepInfra / NVIDIA / OR DeepInfra/NVIDIA/OR
DeepSeek V3.1 DeepInfra ×2.4 DeepInfra ×1.9 DeepInfra
Kimi K2.6 OpenRouter ×3.6 (Moonshot) DeepInfra / OR DeepInfra/OR
Kimi K2.5 DeepInfra ~25% DeepInfra ~25% DeepInfra
Llama 3.3 70B OpenRouter free OpenRouter free OpenRouter free
Gemma 4 31B OpenRouter free OpenRouter free OpenRouter free
Nemotron 3 Super OpenRouter free OpenRouter free OpenRouter free
All (via HF) Same as provider Same as provider Hugging Face ($0 markup)

Overall winner: DeepInfra for paid per-token pricing. NVIDIA NIM earns an honorable mention — it undercuts most providers on V4 Pro ($1.39/$2.78 vs $1.74/$3.48) and Gemma 4 ($0.14/$0.40 vs $0.20/$0.50). OpenRouter's free tier dominates on Gemma 4, Nemotron, and Llama 3.3. Hugging Face Inference Providers is the best choice if you want zero markup routing to existing provider APIs with consolidated billing.

For HIPAA-required workloads: Together AI is the strongest pick — competitive pricing on large models, full compliance, and the widest model selection among HIPAA-compliant providers.