Best Voice Cloning APIs for Developers in 2026
A developer-focused comparison of voice cloning APIs. Pricing per million characters, latency benchmarks, model quality, and self-host options — everything you need to choose an API for production.
Last verified: April 24, 2026
All ratings are based on our testing methodology.
| Tool | Overall | Price | Languages |
|---|---|---|---|
| Fish Audio OSS | 8.8 | $0/month | 30 |
| Cartesia | 8 | $0/month | 15 |
| ElevenLabs | 9.2 | $0/month | 29 |
| PlayHT | 8.5 | $0/month | 20 |
| Resemble AI | 8 | $0.006 per second | 24 |
Our Verdict
Fish Audio is the default in 2026: roughly 6× cheaper per million characters than ElevenLabs and ranked #1 on TTS-Arena. Pick Cartesia only if you need sub-100ms latency for live agents. Pick ElevenLabs only if you need their voice library or studio tooling. PlayHT and Resemble fill narrow niches.
What changed in 2026
Two things flipped the API stack this year:
1. Fish Audio S2 launched (March 2026) and went open-source under Apache 2.0. It now ranks #1 on TTS-Arena, posts the lowest WER on Seed-TTS Eval, and beat ElevenLabs V3 60/40 in Fish Audio's own published blind A/B test.
2. The pricing gap widened. Fish Audio's API runs around $15 per million UTF-8 bytes. ElevenLabs sits near $165 per million characters at retail. That's roughly 6× cheaper for production-grade quality.
If you started a voice product in 2024 and defaulted to ElevenLabs, your API line item is the easiest thing on your P&L to cut in half this quarter.
Quick verdict by use case
Cheapest production API → Fish Audio. Roughly 6× cheaper than ElevenLabs at retail, with a quality lead on most public benchmarks.
Lowest latency for live agents → Cartesia. Sub-100ms first-byte. Worth the price premium only when the agent is on a live phone call.
Largest voice library + studio features → ElevenLabs. Voice Lab, dubbing, SFX. Pay for the polish.
Self-hosted on your own GPU → Fish Audio S2. The only top-tier model with weights you can actually download and run.
Pricing per million characters
| API | Price per 1M chars | Free tier | Notes |
|---|---|---|---|
| Fish Audio | ~$15 | 8,000 credits/mo (~7 min) | Cheapest production-grade option |
| Cartesia | ~$50 | 50K chars/mo | Latency premium |
| PlayHT | ~$80 | 12,500 chars/mo | Unlimited tier available |
| Resemble AI | ~$120 | Pay-per-use | Enterprise pricing |
| ElevenLabs | ~$165 | 10K chars/mo | Retail; volume discounts apply |
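Fish Audio is quoted per UTF-8 byte while the other prices are per character, so the effective gap depends on the script you synthesize. The sketch below is a minimal estimator built on the approximate list prices in the table above. The `PRICES` dictionary, the function names and the sample strings are illustrative assumptions, not part of any vendor SDK.

```python
# Approximate list prices from the table above, in USD per 1M billing units.
# Assumption: Fish Audio bills per UTF-8 byte; the others bill per character.
PRICES = {
    "Fish Audio": (15.0, "bytes"),
    "Cartesia": (50.0, "chars"),
    "PlayHT": (80.0, "chars"),
    "Resemble AI": (120.0, "chars"),
    "ElevenLabs": (165.0, "chars"),
}


def billing_units(text: str, unit: str) -> int:
    """Count billable units: UTF-8 bytes or Unicode code points."""
    if unit == "bytes":
        return len(text.encode("utf-8"))
    return len(text)


def estimate_cost(text: str, api: str) -> float:
    """Estimated USD cost of synthesizing `text` once with `api`."""
    price_per_million, unit = PRICES[api]
    return billing_units(text, unit) * price_per_million / 1_000_000


def cost_ratio(text: str, expensive: str, cheap: str) -> float:
    """How many times more `expensive` costs than `cheap` for the same text."""
    return estimate_cost(text, expensive) / estimate_cost(text, cheap)


if __name__ == "__main__":
    samples = {
        "English": "Thanks for calling. How can I help you today?",
        "Japanese": "こんにちは",
    }
    for label, sample in samples.items():
        ratio = cost_ratio(sample, "ElevenLabs", "Fish Audio")
        print(f"{label}: ElevenLabs / Fish Audio = {ratio:.1f}x")
```

With the table's figures, pure-ASCII English comes out at 11× and text made up of three-byte characters at about 3.7×, so the multiple you actually see depends on your language mix and on any volume discounts.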
Quality benchmarks (Q1 2026)
| API | TTS-Arena rank | Audio Turing Test | Seed-TTS WER | Notes |
|---|---|---|---|---|
| Fish Audio S2 | #1 | 0.515 | Lowest among evaluated | Closest to human |
| ElevenLabs V3 | Top 5 | — | — | Strong, but lost 60/40 to S2 in Fish Audio's own blind A/B |
| Cartesia Sonic | Top 10 | — | — | Optimized for speed, not raw quality |
| MiniMax-Speech | — | 0.387 | — | Strong multilingual |
| Seed-TTS | — | 0.417 | — | ByteDance research model |
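The WER column is a word error rate: the synthesized audio is transcribed by a speech recognizer and the transcript is compared with the input text, so lower is better. As a rough sketch of the metric itself (not the benchmark's actual pipeline, whose text normalization and recognizer are not described here), WER is the word-level edit distance divided by the number of reference words. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # ref word missing from hyp
            insertion = curr[j - 1] + 1  # extra word in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quick red fox"))
```

Running the same check on your own prompts, with whichever recognizer you trust, is a cheap way to confirm that a leaderboard ranking holds for your domain vocabulary.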
Latency (first audio byte, streaming)
| API | First-byte latency | Suitable for |
|---|---|---|
| Cartesia | <100ms | Phone agents, live conversation |
| Fish Audio | 200-400ms | Chatbots, in-app voice, most realtime |
| PlayHT | 250-500ms | Streaming content |
| Resemble AI | ~300ms | Enterprise pipelines |
| ElevenLabs | 300-500ms | Streaming, but better at offline gen |
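First-byte figures like these shift with region, network path and payload, so it is worth measuring them from your own infrastructure before committing. The sketch below times how long a streaming response takes to yield its first non-empty chunk. `open_stream` is any zero-argument callable that issues the request and returns an iterator of byte chunks (for example, a wrapper around `response.iter_content()` from `requests`); `fake_stream` is a hypothetical stand-in so the example runs without a network call.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_chunk(
    open_stream: Callable[[], Iterable[bytes]],
) -> Optional[float]:
    """Seconds from issuing the request to the first non-empty chunk.

    The clock starts before `open_stream()` is called so connection and
    request time are included. Returns None if the stream ends without
    producing any audio bytes.
    """
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    return None


def fake_stream(delay_s: float, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)  # simulated synthesis and network delay
    for _ in range(n_chunks):
        yield b"\x00" * 1024


if __name__ == "__main__":
    latency = time_to_first_chunk(lambda: fake_stream(0.05))
    if latency is not None:
        print(f"first chunk after {latency * 1000:.0f} ms")
```

Take the median over a few dozen calls rather than a single reading, since cold connections and TLS handshakes can dominate the first request.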
Self-hosting (the new option)
Fish Audio S2 is the first top-tier voice model you can actually run yourself:
- Weights on Hugging Face under Apache 2.0
- SGLang inference engine included
- Runs on a single consumer GPU (RTX 4090 class)
- ~50 languages, trained on 10M+ hours
Choosing the right API
Building an AI voice agent? Default to Fish Audio. Switch to Cartesia only if telephony latency is killing you.
Generating podcasts, audiobooks, or video VO? Fish Audio. Latency is irrelevant; quality and cost are everything.
Need a specific celebrity-style voice or studio dubbing tools? ElevenLabs.
Privacy-critical or air-gapped deployment? Self-host Fish Audio S2.
On-prem with deepfake detection mandates? Resemble AI.
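The rules of thumb above can be written down as a small decision function, which is handy when the choice is made per workload rather than once per company. This is only a sketch of this article's recommendations: the `Workload` fields and the order of the checks are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a voice workload."""
    on_prem_deepfake_detection: bool = False  # on-prem with detection mandates
    air_gapped: bool = False                  # privacy-critical, no outbound calls
    needs_studio_tools: bool = False          # dubbing or a large voice library
    live_telephony: bool = False              # sub-100ms first byte required


def recommend_api(w: Workload) -> str:
    """Map a workload to the article's recommendation, hardest constraint first."""
    if w.on_prem_deepfake_detection:
        return "Resemble AI"
    if w.air_gapped:
        return "Fish Audio S2 (self-hosted)"
    if w.needs_studio_tools:
        return "ElevenLabs"
    if w.live_telephony:
        return "Cartesia"
    return "Fish Audio"


if __name__ == "__main__":
    print(recommend_api(Workload(live_telephony=True)))
    print(recommend_api(Workload()))
```

Deployment constraints are checked before feature and latency needs because they rule options out entirely; reorder the checks if your priorities differ.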
The honest take
ElevenLabs was the right default through most of 2025. That stopped being true in March 2026. Fish Audio S2 caught up on quality, lapped everyone on price, and is the only one of the five you can actually own. For new projects in 2026, start there. For existing projects, the migration usually pays for itself in the first month's API bill.
Frequently Asked Questions
What is the cheapest voice cloning API in 2026?
Fish Audio. The API runs roughly $15 per 1 million UTF-8 bytes (≈ 180,000 English words, or about 12 hours of speech). ElevenLabs sits closer to $165 per 1 million characters at retail rates — about 6× more. Cartesia and PlayHT land in between.
Which voice cloning API has the lowest latency?
Cartesia's Sonic model still wins on raw latency at sub-100ms — built for live phone agents. Fish Audio streams in roughly 200-400ms first-byte, fast enough for chatbots and most realtime use cases. ElevenLabs and PlayHT sit in the 250-500ms range.
Which voice cloning API has the highest quality?
Fish Audio S2 ranks #1 on TTS-Arena and posts the lowest WER on Seed-TTS Eval. In Fish Audio's published blind A/B against ElevenLabs V3, S2 Pro won 60/40. Audio Turing Test puts S2 at 0.515 versus Seed-TTS at 0.417 and MiniMax-Speech at 0.387 — closer to indistinguishable from human speech than any other model tested.
Can I self-host a production voice cloning API?
Yes — Fish Audio open-sourced S2 in March 2026 under Apache 2.0, including model weights, fine-tune code, and the SGLang inference engine. It runs on a single consumer GPU. No other top-tier model offers this. ElevenLabs, Cartesia, and PlayHT are closed source.
Which API should I use for an AI voice agent?
Cartesia if you absolutely need sub-100ms first-byte latency for telephony. Fish Audio if you can tolerate roughly 200-400ms and want to cut your per-call cost by about 6×; most chat and support agent workloads run fine on Fish Audio.