Baseten vs OpenRouter vs DeepInfra
+ Together AI, Fireworks, NVIDIA & HF
Key finding: DeepInfra is the cheapest per-token option for most open models. NVIDIA NIM is very competitive on the models it serves ($1.39/$2.78 for V4 Pro, beating the $1.74/$3.48 charged by Baseten and DeepInfra). Together AI and Fireworks AI offer full HIPAA compliance. OpenRouter adds a 5.5% platform fee on top of provider pricing. Hugging Face Inference Providers routes to the same underlying providers at $0 markup. All prices below are in USD per million tokens, quoted as input/output.
NVIDIA NIM: Offers a free trial tier (1,000-5,000 credits) and then routes to partner providers for production, often undercutting other providers on price. Check build.nvidia.com for current rates per model.
Hugging Face: Offers several inference paths: (1) Inference Providers, which routes to 20+ providers (DeepInfra, Together, Fireworks, etc.) at $0 markup, with $0.10/mo in free credits for free users and $2/mo for PRO. (2) Inference Endpoints, which are dedicated GPU instances from $0.50/hr (T4) to $36/hr (H100). (3) ZeroGPU, which gives PRO users free H200 access via Spaces. These are not directly comparable to per-token APIs; they are best for custom models or when you want no vendor lock-in.
DeepSeek V4 Pro
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | $1.74 | $3.48 | $0.145 | = DeepInfra |
| DeepInfra | $1.74 | $3.48 | $0.145 | = Baseten |
| Together AI | $2.10 | $4.40 | $0.20 | highest |
| Fireworks AI | $1.74 | $3.48 | $0.14 | = DI/Baseten |
| NVIDIA NIM | $1.39 | $2.78 | — | ~20% below Baseten/DI/Fireworks |
| OpenRouter (DeepSeek) | $0.435 | $0.87 | $0.0036 | cheapest |
Winner: OpenRouter, by a massive margin. OpenRouter routes to DeepSeek's own API at $0.435/$0.87, which is 4× cheaper than Baseten, DeepInfra, and Fireworks and about 3.2× cheaper than NVIDIA NIM. NVIDIA NIM comes in second at $1.39/$2.78 (20% cheaper than Baseten/DeepInfra/Fireworks). Together AI is the most expensive at $2.10/$4.40.
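The multiples quoted for V4 Pro can be checked directly from the table. The following is a minimal sketch, assuming the table's figures are USD per million tokens; the dictionary and helper names are illustrative and not part of any provider's API.

```python
# Input/output prices in USD per million tokens, copied from the V4 Pro table.
V4_PRO = {
    "Baseten": (1.74, 3.48),
    "DeepInfra": (1.74, 3.48),
    "Together AI": (2.10, 4.40),
    "Fireworks AI": (1.74, 3.48),
    "NVIDIA NIM": (1.39, 2.78),
    "OpenRouter (DeepSeek)": (0.435, 0.87),
}

def input_ratio(prices, expensive, cheap):
    """How many times more `expensive` charges for input than `cheap`."""
    return prices[expensive][0] / prices[cheap][0]

def input_discount(prices, cheap, expensive):
    """Fractional input-price saving of `cheap` relative to `expensive`."""
    return 1 - prices[cheap][0] / prices[expensive][0]

# Provider with the lowest input price.
cheapest = min(V4_PRO, key=lambda name: V4_PRO[name][0])
ratio_vs_baseten = input_ratio(V4_PRO, "Baseten", "OpenRouter (DeepSeek)")
ratio_vs_nim = input_ratio(V4_PRO, "NVIDIA NIM", "OpenRouter (DeepSeek)")
nim_discount = input_discount(V4_PRO, "NVIDIA NIM", "Baseten")

print(cheapest)
print(f"Baseten / OpenRouter: {ratio_vs_baseten:.2f}x")
print(f"NVIDIA NIM / OpenRouter: {ratio_vs_nim:.2f}x")
print(f"NVIDIA NIM saving vs Baseten: {nim_discount:.1%}")
```

The same two helpers work for any of the other tables if the dictionary is swapped out.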
DeepSeek V4 Flash
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | Not listed (deploy as dedicated GPU) | | | |
| DeepInfra | $0.14 | $0.28 | $0.028 | lowest |
| Together AI | Not listed | | | |
| Fireworks AI | Not listed | | | |
| NVIDIA NIM | $0.14 | $0.28 | — | same as DI/OR |
| OpenRouter (DeepSeek) | $0.14 | $0.28 | — | same |
Tie: DeepInfra / NVIDIA / OpenRouter — all three at $0.14/$0.28. NVIDIA NIM also routes here via partner DeepInfra. Baseten and Together/Fireworks don't offer V4 Flash as serverless.
DeepSeek V3.1 Terminus
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | $0.50 | $1.50 | $0.25 | — |
| DeepInfra | $0.21 | $0.79 | $0.13 | cheapest |
| Together AI | $0.60 | $1.70 | — | most expensive |
| Fireworks AI | $0.56 | $1.68 | $0.28 | near Baseten |
| OpenRouter (DeepSeek) | $0.56 | $1.68 | $0.07 | ≈ Fireworks |
Winner: DeepInfra, by a huge margin. DeepInfra's $0.21/$0.79 is about 2.4× cheaper on input than the next cheapest (Baseten at $0.50) and roughly half the output cost of anyone else. Together AI is the most expensive at $0.60/$1.70.
Kimi K2.6
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | $1.00 | $3.90 | $0.20 | expensive |
| DeepInfra | $0.75 | $3.50 | $0.15 | tied cheapest |
| Together AI | $1.20 | $4.50 | $0.20 | most expensive |
| Fireworks AI | $0.95 | $4.00 | $0.48 | mid |
| NVIDIA NIM | $0.95 | $4.00 | — | = Fireworks |
| OpenRouter (Moonshot) | $0.21 | $4.00 | — | best input |
| OpenRouter (DeepInfra) | $0.75 | $3.50 | $0.15 | tied cheapest |
Winner: DeepInfra / OpenRouter (tie). DeepInfra direct at $0.75/$3.50 is the lowest consistent pricing. OpenRouter via Moonshot has by far the lowest input price ($0.21) but charges $4.00 for output. NVIDIA NIM routes through Fireworks at $0.95/$4.00. Together is the most expensive, and Fireworks is mid-range.
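Which of the two "tied" options is actually cheaper depends on the input/output mix of the workload. The sketch below solves for the break-even ratio using the table's prices; it ignores caching and the OpenRouter platform fee, and the function names and example workloads are hypothetical.

```python
# USD per million tokens (input, output), from the Kimi K2.6 table.
MOONSHOT_VIA_OPENROUTER = (0.21, 4.00)
DEEPINFRA = (0.75, 3.50)

def cost(prices, input_mtok, output_mtok):
    """Total USD for a workload measured in millions of tokens."""
    in_price, out_price = prices
    return in_price * input_mtok + out_price * output_mtok

def break_even_input_per_output(a, b):
    """Input tokens per output token at which price lists a and b cost the same.

    Solves a_in * r + a_out == b_in * r + b_out for r.
    """
    return (a[1] - b[1]) / (b[0] - a[0])

ratio = break_even_input_per_output(MOONSHOT_VIA_OPENROUTER, DEEPINFRA)
print(f"break-even at {ratio:.2f} input tokens per output token")

# Output-heavy workload (1M input, 2M output): DeepInfra is cheaper.
print(cost(MOONSHOT_VIA_OPENROUTER, 1, 2), cost(DEEPINFRA, 1, 2))
# Input-heavy workload (10M input, 1M output): Moonshot via OpenRouter is cheaper.
print(cost(MOONSHOT_VIA_OPENROUTER, 10, 1), cost(DEEPINFRA, 10, 1))
```

Under these assumptions, the Moonshot route wins whenever a workload sends more than roughly 0.93 input tokens per output token.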
Kimi K2.5
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | $0.60 | $3.00 | $0.12 | mid |
| DeepInfra | $0.45 | $2.25 | $0.07 | cheapest |
| Together AI | $0.50 | $2.80 | — | close |
| Fireworks AI | $0.60 | $3.00 | $0.10 | = Baseten |
| OpenRouter (DeepInfra) | $0.44 | $2.00 | $0.22 | routed |
Winner: DeepInfra. Its $0.45/$2.25 beats every other direct provider. Together AI is competitive at $0.50/$2.80 but behind on output. Fireworks and Baseten match at $0.60/$3.00. OpenRouter's routed DeepInfra listing ($0.44/$2.00) is lower on output, but that is before the 5.5% platform fee, which brings it to roughly $0.46/$2.11.
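To compare a routed OpenRouter listing against a direct provider, the platform fee has to be added to the routed price rather than subtracted from it. A minimal sketch, assuming the 5.5% fee applies proportionally to token spend (how the fee is actually billed is not covered here); the names are illustrative.

```python
OPENROUTER_FEE = 0.055  # 5.5% platform fee quoted in this article

def with_fee(price, fee=OPENROUTER_FEE):
    """Price after adding a proportional platform fee (an assumption)."""
    return price * (1 + fee)

# USD per million tokens for Kimi K2.5, from the table.
deepinfra_direct = {"input": 0.45, "output": 2.25}
openrouter_routed = {"input": 0.44, "output": 2.00}

effective = {kind: round(with_fee(p), 4) for kind, p in openrouter_routed.items()}

for kind in ("input", "output"):
    if effective[kind] < deepinfra_direct[kind]:
        cheaper = "OpenRouter routed"
    else:
        cheaper = "DeepInfra direct"
    print(f"{kind}: routed {effective[kind]} vs direct {deepinfra_direct[kind]} -> {cheaper}")
```

With the fee included, the routed listing is slightly more expensive on input and still cheaper on output.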
Llama 3.3 70B
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | $0.10 | $0.50 | — | higher output than DI |
| DeepInfra | $0.10 | $0.32 | — | cheapest paid |
| Together AI | $0.88 | $0.88 | — | expensive |
| Fireworks AI | $0.90 | $0.90 | $0.45 | most expensive |
| OpenRouter (Venice) | $0.00 | $0.00 | — | free! |
| OpenRouter (paid) | $0.10 | $0.32 | — | = DeepInfra |
Winner: OpenRouter (free). Llama 3.3 is free on Venice. Among paid options, DeepInfra and OpenRouter paid are the cheapest ($0.10/$0.32). Notably, Together AI and Fireworks AI charge a flat $0.88 and $0.90 per million tokens, which is nearly 9× DeepInfra's input price and about 2.8× its output price.
Llama 3.3 70B Pricing Discrepancy
Together AI and Fireworks AI both charge a flat $0.88–$0.90 per million tokens (same rate for input and output) for Llama 3.3 70B. That compares unfavorably to DeepInfra's $0.10/$0.32 split. The same pattern holds across other open models — Together and Fireworks tend to price higher on smaller/commodity models. Their competitive advantage isn't price on small models — it's on large MoE models like DeepSeek V4 Pro where their optimized inference stacks close the gap.
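The practical gap between flat and split pricing depends on the workload mix. The sketch below prices one hypothetical workload (3M input tokens, 1M output tokens) under each Llama 3.3 70B price list; the workload size and helper name are assumptions chosen for illustration.

```python
def workload_cost(input_mtok, output_mtok, input_price, output_price):
    """Total USD for a workload given in millions of tokens."""
    return input_mtok * input_price + output_mtok * output_price

# (input, output) in USD per million tokens, from the Llama 3.3 70B table.
PRICES = {
    "DeepInfra": (0.10, 0.32),
    "Together AI": (0.88, 0.88),
    "Fireworks AI": (0.90, 0.90),
}

# Hypothetical workload: 3M input tokens and 1M output tokens.
costs = {name: workload_cost(3, 1, *p) for name, p in PRICES.items()}
baseline = costs["DeepInfra"]

for name, usd in costs.items():
    print(f"{name}: ${usd:.2f} ({usd / baseline:.1f}x DeepInfra)")
```

The more input-heavy the workload, the closer the overall multiple moves toward the input-price ratio; output-heavy workloads move toward the smaller output-price ratio.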
Gemma 4 31B
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | Not listed as Model API | | | |
| DeepInfra | Not listed | | | |
| Together AI | $0.20 | $0.50 | — | pricier paid API |
| Fireworks AI | Not listed as serverless | | | |
| NVIDIA NIM | $0.14 | $0.40 | — | cheapest paid |
| OpenRouter | $0.00 | $0.00 | — | free! |
Winner: OpenRouter (free). Gemma 4 31B is free on OpenRouter. Among paid options, NVIDIA NIM is the cheapest at $0.14/$0.40, beating Together AI's $0.20/$0.50 by 30% on input and 20% on output. Baseten, DeepInfra, and Fireworks don't offer it serverless.
Nemotron 3 Super 120B A12B
| Platform | Input | Output | Cache | Verdict |
|---|---|---|---|---|
| Baseten | $0.30 | $0.75 | $0.06 | mid |
| DeepInfra | $0.10 | $0.50 | — | cheapest paid |
| Together AI | Not listed on serverless (dedicated only) | | | |
| Fireworks AI | On-demand deployment only (no serverless) | | | |
| OpenRouter | $0.00 | $0.00 | — | free! |
Winner: OpenRouter (free) — Nemotron 3 Super is free on OpenRouter. Among paid providers, DeepInfra is the cheapest at $0.10/$0.50 — 3× cheaper on input than Baseten's $0.30/$0.75. Together AI and Fireworks only offer it via dedicated/on-demand deployments (per GPU-hour), not serverless per-token.
OpenRouter = Router · Baseten = Infrastructure
These are fundamentally different products wearing similar-looking pricing pages:
Marketplace + Routing Layer
OpenRouter is an aggregator — it connects to 60+ provider APIs (DeepSeek, Together, Fireworks, DeepInfra, etc.) and routes your request to the cheapest/fastest/most reliable provider. It adds a 5.5% platform fee on top of provider pricing.
Strengths: 400+ models, auto-fallback, free tier, multi-provider competition, prompt caching across providers, no markup on base provider pricing (the 5.5% platform fee is charged on top).
Weaknesses: No dedicated GPU access, limited control over infrastructure, provider-dependent uptime.
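The routing-with-fallback idea can be sketched in a few lines. This is not OpenRouter's actual algorithm: it simply tries providers from cheapest to most expensive and moves on when one fails. `Provider`, `call_provider`, and `route` are hypothetical names, the call is simulated rather than sent over HTTP, and the prices are the DeepSeek V3.1 Terminus figures from the table.

```python
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    healthy: bool = True

class ProviderDown(Exception):
    """Raised by the simulated call when a provider is unavailable."""

def call_provider(provider, prompt):
    """Stand-in for a real HTTP request (hypothetical)."""
    if not provider.healthy:
        raise ProviderDown(provider.name)
    return f"[{provider.name}] reply to: {prompt}"

def route(providers, prompt):
    """Try providers from cheapest to priciest, falling back on failure."""
    ordered = sorted(providers, key=lambda p: p.input_price + p.output_price)
    for provider in ordered:
        try:
            return call_provider(provider, prompt)
        except ProviderDown:
            continue  # fall through to the next-cheapest provider
    raise RuntimeError("all providers failed")

providers = [
    Provider("Together AI", 0.60, 1.70),
    Provider("Fireworks AI", 0.56, 1.68),
    Provider("DeepInfra", 0.21, 0.79, healthy=False),  # simulate an outage
]
print(route(providers, "hello"))
```

Sorting by the sum of input and output price is a simplification; a real router would also weigh latency, uptime, and the expected token mix.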
GPU Cloud + Inference Stack
Baseten is a GPU deployment platform that also offers optimized Model APIs. You can either use their pre-built endpoints (pay per token) or deploy your own model on dedicated GPUs (pay per minute/hour).
Strengths: Dedicated GPUs (T4→B200), no idle charges, autoscaling, SOC 2 + HIPAA, custom deployments via Truss, fast cold starts.
Weaknesses: Smaller model library (mostly open-source), per-token pricing is often higher than OpenRouter's best providers, no free tier for LLMs.
Serverless + Dedicated Inference
Together AI is a full-stack inference platform with serverless, dedicated GPU deployments, fine-tuning, and GPU clusters. It has raised significant funding and has strong enterprise adoption.
Strengths: SOC 2 Type II + HIPAA + BAA, competitive on large MoE models, dedicated H100 from $3.99/hr, fine-tuning platform, batch inference at 50% off.
Weaknesses: Expensive on smaller/commodity models (Llama 3.3 at $0.88/M), no free tier, smaller model selection than OpenRouter.
Fast Inference Stack
Fireworks AI focuses on low-latency inference with optimized engines (Firework Engine). It is also SOC 2 Type II + HIPAA + BAA compliant and has raised over $100M.
Strengths: SOC 2 Type II + HIPAA + BAA, optimized inference (often fastest time-to-first-token), on-demand GPU deployments (H100/H200/B200/B300), fine-tuning platform.
Weaknesses: Expensive on small models (flat $0.90/M for Llama 3.3), smaller library than Together or OpenRouter.
GPU-Optimized Inference via Build.nvidia.com
NVIDIA NIM (via build.nvidia.com) provides inference microservices running on NVIDIA's optimized stack. It offers a free tier with 1,000-5,000 trial credits, then routes to partner providers for production. The catalog includes Nemotron models and NVIDIA-optimized versions of major open models.
Strengths: Often undercuts other providers on popular models ($1.39/$2.78 V4 Pro vs $1.74/$3.48 on Baseten), free trial credits, NVIDIA hardware optimization (DGX, Blackwell), Nemotron family runs best on its own hardware, strong for self-hosted enterprise deployments via NVIDIA AI Enterprise license.
Weaknesses: Not every model available serverless, limited per-token pricing transparency (many models are "downloadable only"), free tier is just for trial, production requires partner routing or enterprise license, no HIPAA BAA on the trial tier.
Four Inference Models in One Platform
Hugging Face is unique — it operates four distinct inference paths:
1. Inference Providers (routed, per-token): Routes to 20+ provider APIs (DeepInfra, Together, Fireworks, Groq, Cerebras, Novita, etc.) with $0 markup — you pay the same rate as going direct. Includes $0.10/mo free credits for free users, $2/mo for PRO. Covers 200+ models.
2. Inference Endpoints (dedicated GPU): Deploy any model from the Hub on dedicated GPU instances. Pricing is per GPU-hour, not per token. Starts at $0.50/hr (T4) through $4.50/hr (H100) via AWS. Scales to zero. Best for custom/private models or high-volume workloads.
3. HF-Inference (HF's own serverless infra): Runs on HF's own hardware. Billed by compute time (GPU seconds), not per token. 15,000+ models supported but mostly CPU-friendly tasks — embeddings, text classification, sentence similarity, small LLMs (BERT, GPT-2). Only 1 trending chat model. A 10-second FLUX.1-dev image generation costs ~$0.0012.
4. ZeroGPU (free): Free H200 GPU access for PRO users via Spaces. Limited to side projects.
Strengths: Largest model library (1M+ models on Hub), $0 markup on routed providers, flexible deployment options, strong open-source community, fine-tuning platform, Spaces hosting.
Weaknesses: No per-token pricing for HF-native models (hf-inference is compute-time), routed providers add latency vs direct, no auto-fallback or multi-provider optimization like OpenRouter, $0.10/mo free credits is negligible.
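Because Inference Endpoints bill per GPU-hour while routed providers bill per token, a break-even throughput is the natural way to compare them. A minimal sketch, assuming the $4.50/hr H100 rate above and, purely for illustration, a per-token price of $0.32 per million tokens (DeepInfra's Llama 3.3 70B output rate); whether a single H100 can sustain the resulting throughput is not something this calculation answers.

```python
def break_even_mtok_per_hour(gpu_usd_per_hour, usd_per_mtok):
    """Millions of tokens per hour at which a dedicated GPU costs the same
    as paying per token."""
    return gpu_usd_per_hour / usd_per_mtok

H100_USD_PER_HOUR = 4.50   # Inference Endpoints H100 rate quoted above
PER_TOKEN_USD_PER_MTOK = 0.32  # illustrative per-token price (assumption)

mtok_per_hour = break_even_mtok_per_hour(H100_USD_PER_HOUR, PER_TOKEN_USD_PER_MTOK)
tokens_per_second = mtok_per_hour * 1_000_000 / 3600

print(f"break-even: {mtok_per_hour:.2f}M tokens/hour")
print(f"            {tokens_per_second:.0f} tokens/second sustained")
```

Below that sustained rate the per-token API is cheaper; above it, the dedicated instance starts to pay off.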
OpenRouter when…
→ You want access to 400+ models including Claude, GPT, Gemini
→ You want provider redundancy with auto-fallback
→ You want a free tier for experimentation
→ Your workload has good cache hit rates (effective pricing often 30-50% below list)
→ You want to compare providers before choosing one
→ You don't need dedicated GPU or HIPAA compliance
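The effect of cache hit rate on input cost can be estimated as a weighted average of the cached and list prices. The sketch below uses the Kimi K2.6 figures for OpenRouter (DeepInfra) from the table ($0.75 list, $0.15 cached); it covers input tokens only, and the function name is illustrative.

```python
def effective_input_price(list_price, cached_price, hit_rate):
    """Blended input price when `hit_rate` of input tokens are cache hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1 - hit_rate) * list_price

LIST_PRICE = 0.75    # USD per million input tokens
CACHED_PRICE = 0.15  # USD per million cached input tokens

for hit_rate in (0.0, 0.25, 0.5, 0.75):
    price = effective_input_price(LIST_PRICE, CACHED_PRICE, hit_rate)
    saving = 1 - price / LIST_PRICE
    print(f"hit rate {hit_rate:.0%}: ${price:.3f}/M input ({saving:.0%} below list)")
```

With these prices, a 50% hit rate puts the input cost 40% below list, which sits inside the 30-50% range mentioned above.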
DeepInfra when…
→ You want the lowest per-token price for popular open models (often cheapest)
→ You want dedicated GPU at wholesale rates (H100 $1.79/hr, B200 $2.79/hr)
→ You need SOC 2 / ISO 27001 compliance
→ You know which model you want and don't need routing/fallback
→ You want to avoid the 5.5% OpenRouter platform fee
→ Your usage is high enough to benefit from direct pricing
Baseten when…
→ You need HIPAA compliance (SOC 2 Type II + HIPAA with a BAA; Together AI and Fireworks AI offer the same)
→ You need to deploy custom models via Truss framework
→ You want autoscaling to zero with no idle charges
→ You need B200 or specific hardware for custom workloads
→ Your input/output ratio is input-heavy and benefits from cache
Hugging Face when…
→ You want to route to 20+ providers at $0 markup, paying the same rates as going direct
→ You need to deploy a custom model from the Hub on dedicated GPU (Inference Endpoints)
→ You want the largest model library (1M+ models) to experiment with
→ You want a single billing relationship for consolidated spending across providers
→ You're a PRO user and can use free ZeroGPU for side projects
→ You want to fine-tune and deploy in one platform (AutoTrain + Endpoints)
HF Inference: Providers, Endpoints, and HF-Native
Hugging Face operates four distinct inference paths. Only the first routes through partner providers:
- Inference Providers (routed, per-token): Routes to 20+ providers (DeepInfra, Together, Fireworks, Groq, Cerebras, Novita, etc.) at exactly the same rates they charge direct. $0 markup. $0.10/mo free credits (free) / $2/mo (PRO). Covers 200+ large models. Best for consolidated billing.
- Inference Endpoints (dedicated GPU): Deploy any Hub model on dedicated GPU. $0.50/hr (T4) → $4.50/hr (H100) via AWS. Per GPU-hour. Scales to zero. Good for custom/private models.
- ZeroGPU: Free H200 for PRO users via Spaces. Side projects only.
- HF-Inference (HF's own infra): The legacy serverless API running on HF's own infrastructure. Billed by compute time, not per-token. Best for embeddings, text classification, sentence similarity, and small LLMs (BERT, GPT-2, etc.). Only 1 trending chat model. 15,000+ models supported but mostly CPU-friendly tasks, not large LLM chat.
HF-Native pricing: Not comparable to per-token APIs — you're paying for GPU seconds, not tokens. A 10-second FLUX.1-dev generation costs ~$0.0012. For chat LLMs, you're better off using the routed Inference Providers (which give you per-token pricing from DeepInfra/Groq/etc.).
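The FLUX.1-dev example implies a per-second compute rate that can be reused for rough estimates. The sketch below derives that rate from the two numbers given (10 seconds, about $0.0012); the result is an implied figure, not a published price, and the helper name is hypothetical.

```python
def job_cost(seconds, usd_per_second):
    """Cost of a compute-time-billed job of the given duration."""
    return seconds * usd_per_second

# Implied rate from the example: a 10-second generation costs about $0.0012.
implied_usd_per_second = 0.0012 / 10
implied_usd_per_hour = implied_usd_per_second * 3600

print(f"implied rate: ${implied_usd_per_second:.5f}/s (${implied_usd_per_hour:.3f}/hr)")
print(f"1,000 ten-second generations: ${job_cost(10 * 1000, implied_usd_per_second):.2f}")
```

Because billing follows wall-clock compute time, slower models or longer generations scale the cost linearly regardless of how many tokens or pixels they produce.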
Bottom line: If you want HF's own models (BLOOM, StarCoder, etc.), use Inference Endpoints. For per-token pricing on popular models, the routed Inference Providers are the same rates as going direct with $0 markup. The legacy hf-inference is only worth it for small/embedding models.
Who Can Handle Regulated Data?
| Provider | SOC 2 | HIPAA | ISO 27001 | BAA Available |
|---|---|---|---|---|
| Baseten | ✅ Type II | ✅ Yes | — | ✅ Yes |
| DeepInfra | ✅ Type I | ⚠️ Claims compliance | ✅ Yes | Unclear |
| OpenRouter | ✅ Type II | ❌ No BAA | ⏳ In progress | ❌ Not published |
| Together AI | ✅ Type II | ✅ Yes | — | ✅ Yes |
| Fireworks AI | ✅ Type II | ✅ Yes | — | ✅ Yes |
| Groq | ✅ Type II | ✅ Yes | — | ✅ Yes |
| Hugging Face | — | ⚠️ Limited | — | Check |
For HIPAA workloads: Baseten, Together AI, Fireworks AI, and Groq all have SOC 2 Type II + HIPAA with BAAs. DeepInfra claims HIPAA compliance alongside SOC 2/ISO 27001 but BAA availability is unclear. OpenRouter has SOC 2 but no HIPAA BAA — not suitable for PHI.
Note on DeepInfra: Their privacy policy states they "comply with SOC 2 and ISO 27001 standards and include technical and organizational measures for GDPR and HIPAA compliance." However, they appear to be at SOC 2 Type I (not Type II), and a BAA for HIPAA isn't prominently listed. If HIPAA is critical, verify directly before committing.
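The compliance table can be encoded as data and filtered, which makes the shortlist for PHI workloads explicit. A minimal sketch; the field values mirror the table, while the dictionary layout and function names are illustrative.

```python
# Compliance status per provider, transcribed from the table above.
PROVIDERS = {
    "Baseten":      {"soc2": "Type II", "hipaa": "yes",     "baa": "yes"},
    "DeepInfra":    {"soc2": "Type I",  "hipaa": "claimed", "baa": "unclear"},
    "OpenRouter":   {"soc2": "Type II", "hipaa": "no",      "baa": "no"},
    "Together AI":  {"soc2": "Type II", "hipaa": "yes",     "baa": "yes"},
    "Fireworks AI": {"soc2": "Type II", "hipaa": "yes",     "baa": "yes"},
    "Groq":         {"soc2": "Type II", "hipaa": "yes",     "baa": "yes"},
    "Hugging Face": {"soc2": None,      "hipaa": "limited", "baa": "check"},
}

def phi_ready(providers):
    """Providers listing SOC 2 Type II, HIPAA, and an available BAA."""
    return sorted(
        name for name, c in providers.items()
        if c["soc2"] == "Type II" and c["hipaa"] == "yes" and c["baa"] == "yes"
    )

def needs_verification(providers):
    """Providers whose BAA status must be confirmed before use."""
    return sorted(
        name for name, c in providers.items() if c["baa"] in ("unclear", "check")
    )

print("PHI-ready:", phi_ready(PROVIDERS))
print("Verify first:", needs_verification(PROVIDERS))
```

A filter like this is only as current as the table behind it, so compliance status should still be confirmed with each vendor before signing.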
Head to Head
| Model | Cheapest Input | Cheapest Output | Best Platform |
|---|---|---|---|
| DeepSeek V4 Pro | OpenRouter ×4 | OpenRouter ×4 | OpenRouter |
| DeepSeek V4 Flash | DeepInfra / OR | DeepInfra / OR | DeepInfra/OR |
| DeepSeek V3.1 | DeepInfra ×2.4 | DeepInfra ×1.9 | DeepInfra |
| Kimi K2.6 | OpenRouter ×3.6 (Moonshot) | DeepInfra / OR | DeepInfra/OR |
| Kimi K2.5 | DeepInfra ~25% | DeepInfra ~25% | DeepInfra |
| Llama 3.3 70B | OpenRouter free | OpenRouter free | OpenRouter free |
| Gemma 4 31B | OpenRouter free | OpenRouter free | OpenRouter free |
| Nemotron 3 Super | OpenRouter free | OpenRouter free | OpenRouter free |
| All (via HF) | Same as provider | Same as provider | Hugging Face $0 markup |
Overall winner: DeepInfra for paid per-token pricing. NVIDIA NIM earns an honorable mention — it undercuts most providers on V4 Pro ($1.39/$2.78 vs $1.74/$3.48) and Gemma 4 ($0.14/$0.40 vs $0.20/$0.50). OpenRouter's free tier dominates on Gemma 4, Nemotron, and Llama 3.3. Hugging Face Inference Providers is the best choice if you want zero markup routing to existing provider APIs with consolidated billing.
For HIPAA-required workloads: Together AI is the strongest pick — competitive pricing on large models, full compliance, and the widest model selection among HIPAA-compliant providers.