Best Voice Cloning APIs for Developers in 2026

A developer-focused comparison of voice cloning APIs. Pricing per million characters, latency benchmarks, model quality, and self-host options — everything you need to choose an API for production.

Last verified: April 24, 2026

All ratings based on our testing methodology

Tool	Quality	Speed	Ease	Overall	Price	Languages
Fish Audio OSS	9	9	8	8.8	$0/month	30
Cartesia	8	10	6	8	$0/month	15
ElevenLabs	9.5	9	9	9.2	$0/month	29
PlayHT	8.5	9	8	8.5	$0/month	20
Resemble AI	8.5	8.5	7	8	$0.006 per second	24

Our Verdict

Fish Audio is the default in 2026: about 11× cheaper than ElevenLabs at list price (roughly $15 vs. $165 per million characters of English text) and ranked #1 on TTS-Arena. Pick Cartesia only if you need sub-100ms latency for live agents. Pick ElevenLabs only if you need their voice library or studio tooling. PlayHT and Resemble fill narrow niches.

What changed in 2026

Two things flipped the API stack this year:

1. Fish Audio S2 launched (March 2026) and went open-source under Apache 2.0. It now ranks #1 on TTS-Arena, posts the lowest WER on Seed-TTS Eval, and beat ElevenLabs V3 60/40 in Fish Audio's own published blind A/B test.

2. The pricing gap widened. Fish Audio's API runs around $15 per million UTF-8 bytes. ElevenLabs sits near $165 per million characters at retail. For English text, where one character is one byte, that makes Fish Audio roughly 11× cheaper for production-grade quality.

If you started a voice product in 2024 and defaulted to ElevenLabs, your API line item is the easiest thing on your P&L to cut in half this quarter.

Quick verdict by use case

Cheapest production API → Fish Audio. Roughly 11× cheaper than ElevenLabs at retail ($15 vs. $165 per million characters of English text), with a quality lead on most public benchmarks.

Lowest latency for live agents → Cartesia. Sub-100ms first-byte. Worth the price premium only when you're on a phone call.

Largest voice library + studio features → ElevenLabs. Voice Lab, dubbing, SFX. Pay for the polish.

Self-hosted on your own GPU → Fish Audio S2. The only top-tier model with weights you can actually download and run.

Pricing per million characters

API	Price per 1M chars	Free tier	Notes
Fish Audio	~$15	8,000 credits/mo (~7 min)	Cheapest production-grade option
Cartesia	~$50	50K chars/mo	Latency premium
PlayHT	~$80	12,500 chars/mo	Unlimited tier available
Resemble AI	~$120	Pay-per-use	Enterprise pricing
ElevenLabs	~$165	10K chars/mo	Retail; volume discounts apply

Prices are retail starting tiers. Volume discounts vary — Fish stays cheapest at every published tier.
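To apply the table to your own text, a rough estimator can help. The sketch below uses the approximate retail prices from the table. Treating Fish Audio as billed per UTF-8 byte and the other providers as billed per character is an assumption based on the units quoted in this article, and `estimate_cost` is a hypothetical helper, not part of any provider SDK.

```python
# Approximate retail prices in USD per 1M billing units, copied from the table above.
PRICE_PER_MILLION = {
    "Fish Audio": 15.0,
    "Cartesia": 50.0,
    "PlayHT": 80.0,
    "Resemble AI": 120.0,
    "ElevenLabs": 165.0,
}

# Assumption: only Fish Audio bills per UTF-8 byte; the rest bill per character.
BYTE_BILLED = {"Fish Audio"}


def billing_units(text: str, provider: str) -> int:
    """Count the units a provider would bill for this text."""
    if provider in BYTE_BILLED:
        return len(text.encode("utf-8"))
    return len(text)


def estimate_cost(text: str, provider: str) -> float:
    """Estimated cost in USD at the approximate retail price."""
    return billing_units(text, provider) / 1_000_000 * PRICE_PER_MILLION[provider]


if __name__ == "__main__":
    english = "a" * 1_000_000   # 1M ASCII characters, 1 byte each
    chinese = "你" * 1_000_000   # 1M characters, 3 UTF-8 bytes each
    for name in PRICE_PER_MILLION:
        print(f"{name:12s} en=${estimate_cost(english, name):7.2f} "
              f"zh=${estimate_cost(chinese, name):7.2f}")
```

For ASCII English, one character is one byte, so byte-based and character-based counts match. Chinese characters take three UTF-8 bytes each, which triples the byte-billed cost for the same character count, so the price gap depends on the script you synthesize.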

Quality benchmarks (Q1 2026)

API	TTS-Arena rank	Audio Turing Test	Seed-TTS WER	Notes
Fish Audio S2	#1	0.515	Lowest among evaluated	Closest to human
ElevenLabs V3	Top 5	—	—	Strong but loses 60/40 vs S2 in blind A/B
Cartesia Sonic	Top 10	—	—	Optimized for speed, not raw quality
MiniMax-Speech	—	0.387	—	Strong multilingual
Seed-TTS	—	0.417	—	ByteDance research model

Higher Audio Turing Test = harder for humans to tell from real speech (0.5 = coin flip).
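The Seed-TTS WER column is a word error rate: synthesized audio is transcribed by a speech recognizer and the transcript is compared with the input text, so lower is better. The sketch below shows only the metric itself, as word-level edit distance divided by reference length. The lowercase, whitespace-split tokenization is an assumption, and the recognizer and text normalization used by the benchmark are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the words of ref seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = cur[j - 1] + 1   # extra word in transcript
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    script = "the cat sat on the mat"
    transcript = "the cat sat on mat"   # one dropped word
    print(f"WER: {word_error_rate(script, transcript):.3f}")
```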

Latency (first audio byte, streaming)

API	First-byte latency	Suitable for
Cartesia	<100ms	Phone agents, live conversation
Fish Audio	200-400ms	Chatbots, in-app voice, most realtime
PlayHT	250-500ms	Streaming content
Resemble AI	~300ms	Enterprise pipelines
ElevenLabs	300-500ms	Streaming, but better at offline gen

If you're not on a phone call, anything under 500ms feels instant. Don't overpay for latency you can't hear.
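Published latency figures depend on region, network, and payload, so it is worth measuring first-byte latency from your own servers. The sketch below times how long a streaming response takes to yield its first non-empty audio chunk. It is provider-agnostic: `first_byte_latency_ms` is a hypothetical helper that accepts any iterator of bytes, and `fake_stream` only simulates a provider. It assumes the stream is lazy, meaning the request is sent when iteration starts.

```python
import time
from typing import Iterable, Iterator


def first_byte_latency_ms(chunks: Iterable[bytes]) -> float:
    """Milliseconds from the start of iteration to the first non-empty chunk."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # ignore empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise ValueError("stream ended before any audio arrived")


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Stand-in for a provider's streaming response (simulation only)."""
    time.sleep(delay_s)      # simulated network and model time
    yield b"\x00" * 320      # first audio chunk
    yield b"\x00" * 320


if __name__ == "__main__":
    latency = first_byte_latency_ms(fake_stream(0.25))
    print(f"first byte after {latency:.0f} ms")
```

In practice you would pass the chunk iterator from your HTTP or WebSocket client in place of `fake_stream`, and repeat the measurement enough times to look at the median and tail rather than a single run.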

Self-hosting (the new option)

Fish Audio S2 is the first top-tier voice model you can actually run yourself:

Weights on Hugging Face under Apache 2.0
SGLang inference engine included
Runs on a single consumer GPU (RTX 4090 class)
~50 languages, trained on 10M+ hours

For high-volume workloads, self-hosting pays back the GPU in days. Nothing else in the top 5 ships weights.
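Whether the payback really lands in days depends on your volume. The sketch below estimates the break-even point against the roughly $15 per million bytes API price quoted above. The GPU price, daily volume, and daily running cost in the example are hypothetical inputs, and the model ignores engineering time and assumes a single GPU can keep up with the workload.

```python
def break_even_days(
    gpu_cost_usd: float,
    daily_bytes: float,
    api_price_per_million: float = 15.0,   # approximate API price from this article
    daily_running_cost_usd: float = 0.0,   # power, hosting, etc. (assumption)
) -> float:
    """Days until avoided API spend covers the up-front GPU cost."""
    daily_api_cost = daily_bytes / 1_000_000 * api_price_per_million
    daily_saving = daily_api_cost - daily_running_cost_usd
    if daily_saving <= 0:
        return float("inf")  # self-hosting never pays back at this volume
    return gpu_cost_usd / daily_saving


if __name__ == "__main__":
    # Hypothetical: a $2,000 GPU, $3/day to run, at several daily volumes.
    for daily_bytes in (1_000_000, 20_000_000, 100_000_000):
        days = break_even_days(2000.0, daily_bytes, daily_running_cost_usd=3.0)
        print(f"{daily_bytes:>11,} bytes/day -> {days:,.1f} days")
```

At low volumes the payback stretches to months, so the "days" figure applies only to workloads in the tens of millions of bytes per day or more.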

Choosing the right API

Building an AI voice agent? Default to Fish Audio. Switch to Cartesia only if telephony latency is killing you.

Generating podcasts, audiobooks, or video VO? Fish Audio. Latency is irrelevant; quality and cost are everything.

Need a specific celebrity-style voice or studio dubbing tools? ElevenLabs.

Privacy-critical or air-gapped deployment? Self-host Fish Audio S2.

On-prem with deepfake detection mandates? Resemble AI.
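The rules above can be written down as a small decision function, which makes the precedence explicit when several requirements apply at once. This is only a sketch of this article's recommendations: the `Requirements` fields and the order of the checks are assumptions, not something any provider defines.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical project requirements, mirroring the questions above."""
    on_prem_deepfake_detection: bool = False
    air_gapped_or_privacy_critical: bool = False
    needs_studio_or_voice_library: bool = False
    needs_sub_100ms_telephony: bool = False


def recommend_api(req: Requirements) -> str:
    """Return this article's pick; earlier checks win (assumed precedence)."""
    if req.on_prem_deepfake_detection:
        return "Resemble AI"
    if req.air_gapped_or_privacy_critical:
        return "Fish Audio S2 (self-hosted)"
    if req.needs_studio_or_voice_library:
        return "ElevenLabs"
    if req.needs_sub_100ms_telephony:
        return "Cartesia"
    # Voice agents, podcasts, audiobooks, video voice-over: the default pick.
    return "Fish Audio"


if __name__ == "__main__":
    print(recommend_api(Requirements()))
    print(recommend_api(Requirements(needs_sub_100ms_telephony=True)))
```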

The honest take

ElevenLabs was the right default through most of 2025. That stopped being true in March 2026. Fish Audio S2 caught up on quality, lapped everyone on price, and is the only one of the five you can actually own. For new projects in 2026, start there. For existing projects, the migration usually pays for itself in the first month's API bill.

Frequently Asked Questions

What is the cheapest voice cloning API in 2026?

Fish Audio. The API runs roughly $15 per 1 million UTF-8 bytes (≈ 180,000 English words, or about 12 hours of speech). ElevenLabs sits closer to $165 per 1 million characters at retail rates, about 11× more for English text, where one character is one byte. Cartesia and PlayHT land in between.

Which voice cloning API has the lowest latency?

Cartesia's Sonic model still wins on raw latency at sub-100ms — built for live phone agents. Fish Audio streams in roughly 200-400ms first-byte, fast enough for chatbots and most realtime use cases. ElevenLabs and PlayHT sit in the 250-500ms range.

Which voice cloning API has the highest quality?

Fish Audio S2 ranks #1 on TTS-Arena and posts the lowest WER on Seed-TTS Eval. In Fish Audio's published blind A/B against ElevenLabs V3, S2 Pro won 60/40. Audio Turing Test puts S2 at 0.515 versus Seed-TTS at 0.417 and MiniMax-Speech at 0.387 — closer to indistinguishable from human speech than any other model tested.

Can I self-host a production voice cloning API?

Yes — Fish Audio open-sourced S2 in March 2026 under Apache 2.0, including model weights, fine-tune code, and the SGLang inference engine. It runs on a single consumer GPU. No other top-tier model offers this. ElevenLabs, Cartesia, and PlayHT are closed source.

Which API should I use for an AI voice agent?

Cartesia if you absolutely need sub-100ms first-byte for telephony. Fish Audio if you can tolerate 200-400ms first-byte latency and want per-call TTS cost roughly 11× lower than ElevenLabs list pricing; most chat and support agent workloads run fine on Fish.
