Baseten vs OpenRouter vs DeepInfra — Model Pricing Report

Overview

Key finding: DeepInfra is the cheapest for most open models per-token. NVIDIA NIM offers very competitive pricing on models it serves ($1.39/$2.78 for V4 Pro, beating Baseten/DeepInfra's $1.74/$3.48). Together AI and Fireworks AI offer full HIPAA compliance. OpenRouter adds 5.5% on top. Hugging Face Inference Providers routes to the same underlying providers at $0 markup.

NVIDIA NIM: Offers a free trial tier (1,000-5,000 credits) and then routes to partner providers for production — often at competitive undercut pricing. Check build.nvidia.com for current rates per model.

Hugging Face: Three models: (1) Inference Providers — routes to 20+ providers (DeepInfra, Together, Fireworks, etc.) at $0 markup, with $0.10/mo free credits for free users / $2/mo for PRO. (2) Inference Endpoints — dedicated GPU instances from $0.50/hr (T4) to $36/hr (H100). (3) ZeroGPU — free H200 for PRO users via Spaces. Not directly comparable to per-token APIs — best for custom models or when you want no vendor lock-in.

DeepSeek V4 — Pro & Flash

1.6T MoE (49B active) · 1M context

DeepSeek V4 Pro

List prices per 1M tokens

Platform	Input	Output	Cache	Verdict
Baseten	$1.74	$3.48	$0.145	highest
DeepInfra	$1.74	$3.48	$0.145	same
Together AI	$2.10	$4.40	$0.20	highest
Fireworks AI	$1.74	$3.48	$0.14	= DI/Baseten
NVIDIA NIM	$1.39	$2.78	—	cheaper than 3
OpenRouter (DeepSeek)	$0.435	$0.87	$0.0036	× cheapest

Winner: OpenRouter — by a massive margin. OpenRouter routes to DeepSeek's own API at $0.435/$0.87, 4× cheaper than everyone. NVIDIA NIM comes in 2nd at $1.39/$2.78 (20% cheaper than Baseten/DeepInfra/Fireworks). Together AI is the most expensive at $2.10/$4.40.

284B MoE (13B active) · 1M context

DeepSeek V4 Flash

Platform	Input	Output	Cache	Verdict
Baseten	Not listed — deploy as dedicated GPU
DeepInfra	$0.14	$0.28	$0.028	lowest
Together AI	Not listed
Fireworks AI	Not listed
NVIDIA NIM	$0.14	$0.28	—	same as DI/OR
OpenRouter (DeepSeek)	$0.14	$0.28	—	same

Tie: DeepInfra / NVIDIA / OpenRouter — all three at $0.14/$0.28. NVIDIA NIM also routes here via partner DeepInfra. Baseten and Together/Fireworks don't offer V4 Flash as serverless.

DeepSeek V3.1

671B MoE (37B active) · 128K context

DeepSeek V3.1 Terminus

Platform	Input	Output	Cache	Verdict
Baseten	$0.50	$1.50	$0.25	—
DeepInfra	$0.21	$0.79	$0.13	× cheapest
Together AI	$0.60	$1.70	—	most expensive
Fireworks AI	$0.56	$1.68	$0.28	near Baseten
OpenRouter (DeepSeek)	$0.56	$1.68	$0.07	≈ Fireworks

Winner: DeepInfra — by a huge margin. DeepInfra's $0.21/$0.79 is 2.5× cheaper on input than the next cheapest (Baseten at $0.50), and half the output cost of anyone else. Together AI is the most expensive at $0.60/$1.70.

Kimi K2.6

Moonshot AI · 262K context

Kimi K2.6

Platform	Input	Output	Cache	Verdict
Baseten	$1.00	$3.90	$0.20	expensive
DeepInfra	$0.75	$3.50	$0.15	tied cheapest
Together AI	$1.20	$4.50	$0.20	most expensive
Fireworks AI	$0.95	$4.00	$0.48	mid
NVIDIA NIM	$0.95	$4.00	—	= Fireworks
OpenRouter (Moonshot)	$0.21	$4.00	—	best input
OpenRouter (DeepInfra)	$0.75	$3.50	$0.15	tied cheapest

Winner: DeepInfra / OpenRouter (tie) — DeepInfra direct at $0.75/$3.50 is the lowest consistent pricing. OpenRouter via Moonshot direct has unbeatable input ($0.21) but $4.00 output. NVIDIA NIM routes through Fireworks at $0.95/$4.00. Together is most expensive. Fireworks is mid-range.

Kimi K2.5

Moonshot AI · 262K context · Multimodal

Kimi K2.5

Platform	Input	Output	Cache	Verdict
Baseten	$0.60	$3.00	$0.12	mid
DeepInfra	$0.45	$2.25	$0.07	cheapest
Together AI	$0.50	$2.80	—	close
Fireworks AI	$0.60	$3.00	$0.10	= Baseten
OpenRouter (DeepInfra)	$0.44	$2.00	$0.22	routed

Winner: DeepInfra — $0.45/$2.25 beats everyone. Together AI is competitive at $0.50/$2.80 but behind on output. Fireworks and Baseten match at $0.60/$3.00. OpenRouter's routed DeepInfra is slightly cheaper on output ($2.00) minus 5.5% fee.

Llama 3.3 70B Instruct

Meta · 70B · 131K context

Llama 3.3 70B

Platform	Input	Output	Cache	Verdict
Baseten	$0.10	$0.50	—	highest output
DeepInfra	$0.10	$0.32	—	cheapest paid
Together AI	$0.88	$0.88	—	most expensive
Fireworks AI	$0.90	$0.90	$0.45	most expensive
OpenRouter (Venice)	$0.00	$0.00	—	free!
OpenRouter (paid)	$0.10	$0.32	—	= DeepInfra

Winner: OpenRouter (free) — Llama 3.3 is free on Venice. Among paid options, DeepInfra and OpenRouter paid are the cheapest ($0.10/$0.32). Notably, Together AI and Fireworks AI are both nearly 9× more expensive on output than DeepInfra at $0.88/$0.90 per token flat rate.

Important: Together & Fireworks pricing pattern

Llama 3.3 70B Pricing Discrepancy

Together AI and Fireworks AI both charge a flat $0.88–$0.90 per million tokens (same rate for input and output) for Llama 3.3 70B. That compares unfavorably to DeepInfra's $0.10/$0.32 split. The same pattern holds across other open models — Together and Fireworks tend to price higher on smaller/commodity models. Their competitive advantage isn't price on small models — it's on large MoE models like DeepSeek V4 Pro where their optimized inference stacks close the gap.

Gemma 4 31B Instruct

Google DeepMind · Dense 31B · 256K context · Multimodal

Gemma 4 31B

Platform	Input	Output	Cache	Verdict
Baseten	Not listed as Model API
DeepInfra	Not listed
Together AI	$0.20	$0.50	—	only pay API
Fireworks AI	Not listed as serverless
NVIDIA NIM	$0.14	$0.40	—	cheapest paid
OpenRouter	$0.00	$0.00	—	free!

Winner: OpenRouter (free) — Gemma 4 31B is free on OpenRouter. Among paid options, NVIDIA NIM is the cheapest at $0.14/$0.40, beating Together AI's $0.20/$0.50 by 30%. Baseten, DeepInfra, and Fireworks don't offer it serverless.

NVIDIA Nemotron 3 Super

NVIDIA · 120B MoE (12B active) · 262K context

Nemotron 3 Super 120B A12B

Platform	Input	Output	Cache	Verdict
Baseten	$0.30	$0.75	$0.06	mid
DeepInfra	$0.10	$0.50	—	× cheapest
Together AI	Not listed on serverless (dedicated only)
Fireworks AI	On-demand deployment only (no serverless)
OpenRouter	$0.00	$0.00	—	free!

Winner: OpenRouter (free) — Nemotron 3 Super is free on OpenRouter. Among paid providers, DeepInfra is the cheapest at $0.10/$0.50 — 3× cheaper on input than Baseten's $0.30/$0.75. Together AI and Fireworks only offer it via dedicated/on-demand deployments (per GPU-hour), not serverless per-token.

Business Model Differences

Structural difference

OpenRouter = Router · Baseten = Infrastructure

These are fundamentally different products wearing similar-looking pricing pages:

OpenRouter

Marketplace + Routing Layer

OpenRouter is an aggregator — it connects to 60+ provider APIs (DeepSeek, Together, Fireworks, DeepInfra, etc.) and routes your request to the cheapest/fastest/most reliable provider. It adds a 5.5% platform fee on top of provider pricing.

Strengths: 400+ models, auto-fallback, free tier, multi-provider competition, prompt caching across providers, no markup on base pricing.

Weaknesses: No dedicated GPU access, limited control over infrastructure, provider-dependent uptime.

Baseten

GPU Cloud + Inference Stack

Baseten is a GPU deployment platform that also offers optimized Model APIs. You can either use their pre-built endpoints (pay per token) or deploy your own model on dedicated GPUs (pay per minute/hour).

Strengths: Dedicated GPUs (T4→B200), no idle charges, autoscaling, SOC 2 + HIPAA, custom deployments via Truss, fast cold starts.

Weaknesses: Smaller model library (mostly open-source), per-token pricing is often higher than OpenRouter's best providers, no free tier for LLMs.

Together AI

Serverless + Dedicated Inference

Together AI is a full-stack inference platform with serverless, dedicated GPU deployments, fine-tuning, and GPU clusters. Raised significant funding and has strong enterprise adoption.

Strengths: SOC 2 Type II + HIPAA + BAA, competitive on large MoE models, dedicated H100 from $3.99/hr, fine-tuning platform, batch inference at 50% off.

Weaknesses: Expensive on smaller/commodity models (Llama 3.3 at $0.88/M), no free tier, smaller model selection than OpenRouter.

Fireworks AI

Fast Inference Stack

Fireworks AI focuses on low-latency inference with optimized engines (Firework Engine). Also SOC 2 Type II + HIPAA + BAA compliant. Raised over $100M.

Strengths: SOC 2 Type II + HIPAA + BAA, optimized inference (often fastest time-to-first-token), on-demand GPU deployments (H100/H200/B200/B300), fine-tuning platform.

Weaknesses: Expensive on small models (flat $0.90/M for Llama 3.3), smaller library than Together or OpenRouter.

NVIDIA NIM

GPU-Optimized Inference via Build.nvidia.com

NVIDIA NIM (via build.nvidia.com) provides inference microservices running on NVIDIA's optimized stack. Free tier with 1,000-5,000 credits for trial, then routes to partner providers for production. Includes Nemotron models and NVIDIA-optimized versions of all major open models.

Strengths: Often undercuts other providers on popular models ($1.39/$2.78 V4 Pro vs $1.74/$3.48 on Baseten), free trial credits, NVIDIA hardware optimization (DGX, Blackwell), Nemotron family runs best on its own hardware, strong for self-hosted enterprise deployments via NVIDIA AI Enterprise license.

Weaknesses: Not every model available serverless, limited per-token pricing transparency (many models are "downloadable only"), free tier is just for trial, production requires partner routing or enterprise license, no HIPAA BAA on the trial tier.

Hugging Face

Four Inference Models in One Platform

Hugging Face is unique — it operates four distinct inference paths:

1. Inference Providers (routed, per-token): Routes to 20+ provider APIs (DeepInfra, Together, Fireworks, Groq, Cerebras, Novita, etc.) with $0 markup — you pay the same rate as going direct. Includes $0.10/mo free credits for free users, $2/mo for PRO. Covers 200+ models.

2. Inference Endpoints (dedicated GPU): Deploy any model from the Hub on dedicated GPU instances. Pricing is per GPU-hour, not per token. Starts at $0.50/hr (T4) through $4.50/hr (H100) via AWS. Scales to zero. Best for custom/private models or high-volume workloads.

3. HF-Inference (HF's own serverless infra): Runs on HF's own hardware. Billed by compute time (GPU seconds), not per token. 15,000+ models supported but mostly CPU-friendly tasks — embeddings, text classification, sentence similarity, small LLMs (BERT, GPT-2). Only 1 trending chat model. A 10-second FLUX.1-dev image generation costs ~$0.0012.

4. ZeroGPU (free): Free H200 GPU access for PRO users via Spaces. Limited to side projects.

Strengths: Largest model library (1M+ models on Hub), $0 markup on routed providers, flexible deployment options, strong open-source community, fine-tuning platform, Spaces hosting.

Weaknesses: No per-token pricing for HF-native models (hf-inference is compute-time), routed providers add latency vs direct, no auto-fallback or multi-provider optimization like OpenRouter, $0.10/mo free credits is negligible.

When to Use Which

Decision matrix

OpenRouter when…

→ You want access to 400+ models including Claude, GPT, Gemini
→ You want provider redundancy with auto-fallback
→ You want a free tier for experimentation
→ Your workload has good cache hit rates (effective pricing often 30-50% below list)
→ You want to compare providers before choosing one
→ You don't need dedicated GPU or HIPAA compliance

Decision matrix

DeepInfra when…

→ You want the lowest per-token price for popular open models (often cheapest)
→ You want dedicated GPU at wholesale rates (H100 $1.79/hr, B200 $2.79/hr)
→ You need SOC 2 / ISO 27001 compliance
→ You know which model you want and don't need routing/fallback
→ You want to avoid the 5.5% OpenRouter platform fee
→ Your usage is high enough to benefit from direct pricing

Decision matrix

Baseten when…

→ You need HIPAA compliance (only provider with both SOC 2 + HIPAA)
→ You need to deploy custom models via Truss framework
→ You want autoscaling to zero with no idle charges
→ You need B200 or specific hardware for custom workloads
→ Your input/output ratio is input-heavy and benefits from cache

Decision matrix

Hugging Face when…

→ You want to route to 20+ providers with $0 markup, same rates if one provider works best
→ You need to deploy a custom model from the Hub on dedicated GPU (Inference Endpoints)
→ You want the largest model library (1M+ models) to experiment with
→ You want a single billing relationship for consolidated spending across providers
→ You're a PRO user and can use free ZeroGPU for side projects
→ You want to fine-tune and deploy in one platform (AutoTrain + Endpoints)

Hugging Face — Pricing Summary

Three models + HF-native inference

HF Inference: Providers, Endpoints, and HF-Native

Hugging Face operates four distinct inference paths. The first three route through partners:

Inference Providers (routed, per-token): Routes to 20+ providers (DeepInfra, Together, Fireworks, Groq, Cerebras, Novita, etc.) at exactly the same rates they charge direct. $0 markup. $0.10/mo free credits (free) / $2/mo (PRO). Covers 200+ large models. Best for consolidated billing.
Inference Endpoints (dedicated GPU): Deploy any Hub model on dedicated GPU. $0.50/hr (T4) → $4.50/hr (H100) via AWS. Per GPU-hour. Scales to zero. Good for custom/private models.
ZeroGPU: Free H200 for PRO users via Spaces. Side projects only.
HF-Inference (HF's own infra): The legacy serverless API running on HF's own infrastructure. Billed by compute time, not per-token. Best for embeddings, text classification, sentence similarity, and small LLMs (BERT, GPT-2, etc.). Only 1 trending chat model. 15,000+ models supported but mostly CPU-friendly tasks, not large LLM chat.

HF-Native pricing: Not comparable to per-token APIs — you're paying for GPU seconds, not tokens. A 10-second FLUX.1-dev generation costs ~$0.0012. For chat LLMs, you're better off using the routed Inference Providers (which give you per-token pricing from DeepInfra/Groq/etc.).

Bottom line: If you want HF's own models (BLOOM, StarCoder, etc.), use Inference Endpoints. For per-token pricing on popular models, the routed Inference Providers are the same rates as going direct with $0 markup. The legacy hf-inference is only worth it for small/embedding models.

huggingface.co/docs/inference-providers/en/pricing ↗

Compliance & Certifications

HIPAA · SOC 2 · ISO 27001

Who Can Handle Regulated Data?

Provider	SOC 2	HIPAA	ISO 27001	BAA Available
Baseten	✅ Type II	✅ Yes	—	✅ Yes
DeepInfra	✅ Type I	⚠️ Claims compliance	✅ Yes	Unclear
OpenRouter	✅ Type II	❌ No BAA	⏳ In progress	❌ Not published
Together AI	✅ Type II	✅ Yes	—	✅ Yes
Fireworks AI	✅ Type II	✅ Yes	—	✅ Yes
Groq	✅ Type II	✅ Yes	—	✅ Yes
Hugging Face	—	⚠️ Limited	—	Check

For HIPAA workloads: Baseten, Together AI, Fireworks AI, and Groq all have SOC 2 Type II + HIPAA with BAAs. DeepInfra claims HIPAA compliance alongside SOC 2/ISO 27001 but BAA availability is unclear. OpenRouter has SOC 2 but no HIPAA BAA — not suitable for PHI.

Note on DeepInfra: Their privacy policy states they "comply with SOC 2 and ISO 27001 standards and include technical and organizational measures for GDPR and HIPAA compliance." However, they appear to be at SOC 2 Type I (not Type II), and a BAA for HIPAA isn't prominently listed. If HIPAA is critical, verify directly before committing.

Summary table

Head to Head

Model	Cheapest Input	Cheapest Output	Best Platform
DeepSeek V4 Pro	OpenRouter ×4	OpenRouter ×4	OpenRouter
DeepSeek V4 Flash	DeepInfra / OR	DeepInfra / OR	DeepInfra/OR
DeepSeek V3.1	DeepInfra ×2.5	DeepInfra ×2	DeepInfra
Kimi K2.6	OpenRouter ×5 (Moonshot)	DeepInfra / OR	DeepInfra/OR
Kimi K2.5	DeepInfra ~25%	DeepInfra ~25%	DeepInfra
Llama 3.3 70B	OpenRouter free	DeepInfra / OR	OpenRouter free
Gemma 4 31B	OpenRouter free	OpenRouter free	OpenRouter free
Nemotron 3 Super	OpenRouter free	OpenRouter free	OpenRouter free
All (via HF)	Same as provider	Same as provider	Hugging Face $0 markup

Overall winner: DeepInfra for paid per-token pricing. NVIDIA NIM earns an honorable mention — it undercuts most providers on V4 Pro ($1.39/$2.78 vs $1.74/$3.48) and Gemma 4 ($0.14/$0.40 vs $0.20/$0.50). OpenRouter's free tier dominates on Gemma 4, Nemotron, and Llama 3.3. Hugging Face Inference Providers is the best choice if you want zero markup routing to existing provider APIs with consolidated billing.

For HIPAA-required workloads: Together AI is the strongest pick — competitive pricing on large models, full compliance, and the widest model selection among HIPAA-compliant providers.

7 Platforms · 8 Models

Baseten vs OpenRouter vs DeepInfra
+ Together AI, Fireworks, NVIDIA & HF

DeepSeek V4 Pro

DeepSeek V4 Flash

DeepSeek V3.1 Terminus

Kimi K2.6

Kimi K2.5

Llama 3.3 70B

Llama 3.3 70B Pricing Discrepancy

Gemma 4 31B

Nemotron 3 Super 120B A12B

OpenRouter = Router · Baseten = Infrastructure

Marketplace + Routing Layer

GPU Cloud + Inference Stack

Serverless + Dedicated Inference

Fast Inference Stack

GPU-Optimized Inference via Build.nvidia.com

Four Inference Models in One Platform

OpenRouter when…

DeepInfra when…

Baseten when…

Hugging Face when…

HF Inference: Providers, Endpoints, and HF-Native

Who Can Handle Regulated Data?

Head to Head

7 Platforms · 8 Models

Baseten vs OpenRouter vs DeepInfra+ Together AI, Fireworks, NVIDIA & HF

DeepSeek V4 Pro

DeepSeek V4 Flash

DeepSeek V3.1 Terminus

Kimi K2.6

Kimi K2.5

Llama 3.3 70B

Llama 3.3 70B Pricing Discrepancy

Gemma 4 31B

Nemotron 3 Super 120B A12B

OpenRouter = Router · Baseten = Infrastructure

Marketplace + Routing Layer

GPU Cloud + Inference Stack

Serverless + Dedicated Inference

Fast Inference Stack

GPU-Optimized Inference via Build.nvidia.com

Four Inference Models in One Platform

OpenRouter when…

DeepInfra when…

Baseten when…

Hugging Face when…

HF Inference: Providers, Endpoints, and HF-Native

Who Can Handle Regulated Data?

Head to Head

Baseten vs OpenRouter vs DeepInfra
+ Together AI, Fireworks, NVIDIA & HF