Best Voice Cloning APIs for Developers in 2026

A developer-focused comparison of voice cloning APIs. Pricing per million characters, latency benchmarks, model quality, and self-host options — everything you need to choose an API for production.

Last verified: April 24, 2026

All ratings are based on our testing methodology.

| Tool | Quality | Speed | Ease | Overall | Price | Languages |
| --- | --- | --- | --- | --- | --- | --- |
| Fish Audio OSS | 9 | 9 | 8 | 8.8 | $0/month | 30 |
| Cartesia | 8 | 10 | 6 | 8 | $0/month | 15 |
| ElevenLabs | 9.5 | 9 | 9 | 9.2 | $0/month | 29 |
| PlayHT | 8.5 | 9 | 8 | 8.5 | $0/month | 20 |
| Resemble AI | 8.5 | 8.5 | 7 | 8 | $0.006 per second | 24 |

Our Verdict

Fish Audio is the default in 2026: roughly 11× cheaper per million characters than ElevenLabs at retail list prices (about $15 vs. $165) and ranked #1 on TTS-Arena. Pick Cartesia only if you need sub-100ms latency for live agents. Pick ElevenLabs only if you need their voice library or studio tooling. PlayHT and Resemble fill narrow niches.

What changed in 2026

Two things flipped the API stack this year:

1. Fish Audio S2 launched (March 2026) and went open-source under Apache 2.0. It now ranks #1 on TTS-Arena, posts the lowest WER on Seed-TTS Eval, and beat ElevenLabs V3 60/40 in Fish Audio's own published blind A/B test.

2. The pricing gap widened. Fish Audio's API runs around $15 per million UTF-8 bytes, which is one byte per character for plain English text. ElevenLabs sits near $165 per million characters at retail. At those list prices, Fish Audio is roughly 11× cheaper for production-grade quality.

If you started a voice product in 2024 and defaulted to ElevenLabs, your API line item is the easiest thing on your P&L to cut in half this quarter.

Quick verdict by use case

Cheapest production API → Fish Audio. Roughly 11× cheaper than ElevenLabs at retail list prices (about $15 vs. $165 per million characters of English text), with a quality lead on most public benchmarks.

Lowest latency for live agents → Cartesia. Sub-100ms first-byte. Worth the price premium only when you're on a phone call.

Largest voice library + studio features → ElevenLabs. Voice Lab, dubbing, SFX. Pay for the polish.

Self-hosted on your own GPU → Fish Audio S2. The only top-tier model with weights you can actually download and run.

Pricing per million characters

| API | Price per 1M chars | Free tier | Notes |
| --- | --- | --- | --- |
| Fish Audio | ~$15 | 8,000 credits/mo (~7 min) | Cheapest production-grade option |
| Cartesia | ~$50 | 50K chars/mo | Latency premium |
| PlayHT | ~$80 | 12,500 chars/mo | Unlimited tier available |
| Resemble AI | ~$120 | Pay-per-use | Enterprise pricing |
| ElevenLabs | ~$165 | 10K chars/mo | Retail; volume discounts apply |

Prices are retail starting tiers. Volume discounts vary, but Fish stays cheapest at every published tier.
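
Because Fish Audio bills per UTF-8 byte while the other list prices are per character, the effective gap depends on the language being synthesized. The following is a minimal sketch of that arithmetic using the approximate retail figures from the table above. The function names are illustrative, the per-character billing of the other four APIs is assumed from the table header, and real invoices depend on tier and volume discounts.

```python
# Approximate retail prices from the table above (USD per 1M billed units).
PRICE_PER_MILLION = {
    "Fish Audio": 15.0,
    "Cartesia": 50.0,
    "PlayHT": 80.0,
    "Resemble AI": 120.0,
    "ElevenLabs": 165.0,
}
# Fish Audio is quoted per UTF-8 byte; the rest are assumed to bill per character.
BILLED_BY_BYTES = {"Fish Audio"}


def billed_units(text: str, api: str) -> int:
    """Return the number of units the API would bill for this text."""
    if api in BILLED_BY_BYTES:
        return len(text.encode("utf-8"))
    return len(text)


def estimated_cost(text: str, api: str) -> float:
    """Estimated USD cost of synthesizing `text` once at retail price."""
    return billed_units(text, api) * PRICE_PER_MILLION[api] / 1_000_000


english = "Hello, world. " * 1000        # ASCII: 1 byte per character
japanese = "こんにちは、世界。" * 1000     # these characters take 3 bytes each in UTF-8

for label, sample in (("English", english), ("Japanese", japanese)):
    fish = estimated_cost(sample, "Fish Audio")
    eleven = estimated_cost(sample, "ElevenLabs")
    print(f"{label}: ElevenLabs / Fish Audio cost ratio = {eleven / fish:.1f}x")
```

For ASCII English the ratio is simply the price ratio, 165 / 15 = 11. For the Japanese sample every character costs three bytes, so the gap narrows to about 3.7×. Budget with your real traffic mix rather than the headline number.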

Quality benchmarks (Q1 2026)

| API | TTS-Arena rank | Audio Turing Test | Seed-TTS WER | Notes |
| --- | --- | --- | --- | --- |
| Fish Audio S2 | #1 | 0.515 | Lowest among evaluated | Closest to human |
| ElevenLabs V3 | Top 5 | - | - | Strong but loses 60/40 vs S2 in blind A/B |
| Cartesia Sonic | Top 10 | - | - | Optimized for speed, not raw quality |
| MiniMax-Speech | - | 0.387 | - | Strong multilingual |
| Seed-TTS | - | 0.417 | - | ByteDance research model |

Higher Audio Turing Test = harder for humans to tell from real speech (0.5 = coin flip).
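
For readers unfamiliar with the WER column: word error rate for a TTS system is typically measured by synthesizing a known text, transcribing the audio with a speech recognizer, and counting word-level edits between the transcript and the original text. The sketch below shows only that final counting step, with simple lowercasing and whitespace splitting as an assumed normalization. It does not reproduce the Seed-TTS Eval pipeline, its ASR model, or its text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word plus one dropped word out of four reference words.
print(word_error_rate("the quick brown fox", "the quack brown"))  # 0.5
```

A lower WER means the synthesized speech was transcribed back closer to the input text, so it measures intelligibility rather than naturalness.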

Latency (first audio byte, streaming)

| API | First-byte latency | Suitable for |
| --- | --- | --- |
| Cartesia | <100ms | Phone agents, live conversation |
| Fish Audio | 200-400ms | Chatbots, in-app voice, most realtime |
| PlayHT | 250-500ms | Streaming content |
| Resemble AI | ~300ms | Enterprise pipelines |
| ElevenLabs | 300-500ms | Streaming, but better at offline gen |

If you're not on a phone call, anything under 500ms feels instant. Don't overpay for latency you can't hear.
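
Published latency figures vary with region, network, and payload, so it is worth measuring first-byte latency from your own infrastructure. Here is a minimal, vendor-neutral sketch: `first_byte_latency` is a hypothetical helper that times any callable returning an iterable of audio chunks, and `fake_tts_stream` is a stand-in for a real streaming client, not any provider's SDK.

```python
import time
from typing import Callable, Iterable, Iterator


def first_byte_latency(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from issuing the request until the first non-empty chunk arrives.

    `open_stream` must start the request when called, so that connection
    setup and server-side processing are both included in the measurement.
    """
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio arrived")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a real streaming client: waits, then yields audio chunks."""
    time.sleep(delay_s)
    yield b"\x00" * 320
    yield b"\x00" * 320


latency = first_byte_latency(fake_tts_stream)
print(f"first-byte latency: {latency * 1000:.0f} ms")
```

To benchmark a real provider, replace `fake_tts_stream` with a function that opens that provider's streaming response and yields its chunks. Run it many times and compare medians and tail percentiles rather than a single sample.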

Self-hosting (the new option)

Fish Audio S2 is the first top-tier voice model you can actually run yourself:

  • Weights on Hugging Face under Apache 2.0
  • SGLang inference engine included
  • Runs on a single consumer GPU (RTX 4090 class)
  • ~50 languages, trained on 10M+ hours

For high-volume workloads, self-hosting can pay back the GPU within days, since every million bytes synthesized locally avoids roughly $15 in API fees. Nothing else in the top 5 ships weights.
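
How quickly the GPU pays for itself depends entirely on volume. The sketch below makes that break-even explicit. The $2,000 GPU price and the monthly volumes are illustrative assumptions, not figures from any vendor, and the calculation assumes a single card can actually sustain the throughput you need.

```python
def breakeven_days(gpu_cost_usd: float,
                   monthly_bytes: float,
                   api_price_per_million: float = 15.0,
                   monthly_hosting_usd: float = 0.0) -> float:
    """Days until avoided API fees equal the up-front GPU cost.

    `monthly_hosting_usd` covers power, colocation, and ops (assumed 0 here).
    Returns infinity when self-hosting saves nothing per month.
    """
    monthly_api_bill = monthly_bytes / 1_000_000 * api_price_per_million
    monthly_savings = monthly_api_bill - monthly_hosting_usd
    if monthly_savings <= 0:
        return float("inf")
    return gpu_cost_usd / (monthly_savings / 30)


# Assumed $2,000 GPU, with the ~$15 per 1M bytes API price from the pricing table.
print(breakeven_days(2000, 1_000_000_000))  # 1B bytes/month   -> 4.0 days
print(breakeven_days(2000, 100_000_000))    # 100M bytes/month -> 40.0 days
```

Under these assumptions, payback lands within days only around a billion bytes per month. At a tenth of that volume it takes closer to six weeks, before counting power and engineering time.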

Choosing the right API

Building an AI voice agent? Default to Fish Audio. Switch to Cartesia only if telephony latency is killing you.

Generating podcasts, audiobooks, or video VO? Fish Audio. Latency is irrelevant; quality and cost are everything.

Need a specific celebrity-style voice or studio dubbing tools? ElevenLabs.

Privacy-critical or air-gapped deployment? Self-host Fish Audio S2.

On-prem with deepfake detection mandates? Resemble AI.
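
The rules above can be written down as a small decision function, which is handy if you want the choice documented in code review. This is a sketch only: the `Requirements` fields and the order in which the checks run are assumptions made for illustration, since the article does not rank the rules against each other.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    needs_sub_100ms_latency: bool = False          # e.g. live telephony
    needs_voice_library_or_dubbing: bool = False   # studio tooling
    needs_on_prem_deepfake_detection: bool = False
    must_self_host: bool = False                   # privacy-critical or air-gapped


def recommend_api(req: Requirements) -> str:
    """Apply the guide's rules in an assumed priority order; Fish Audio is the fallback."""
    if req.needs_on_prem_deepfake_detection:
        return "Resemble AI"
    if req.must_self_host:
        return "Fish Audio S2 (self-hosted)"
    if req.needs_sub_100ms_latency:
        return "Cartesia"
    if req.needs_voice_library_or_dubbing:
        return "ElevenLabs"
    return "Fish Audio"


print(recommend_api(Requirements()))                              # Fish Audio
print(recommend_api(Requirements(needs_sub_100ms_latency=True)))  # Cartesia
```

Hard constraints (compliance, air-gapping) are checked before preferences (latency, tooling), because a provider that fails a hard constraint is not an option regardless of price.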

The honest take

ElevenLabs was the right default through most of 2025. That stopped being true in March 2026. Fish Audio S2 caught up on quality, lapped everyone on price, and is the only one of the five you can actually own. For new projects in 2026, start there. For existing projects, the migration usually pays for itself in the first month's API bill.

Frequently Asked Questions

What is the cheapest voice cloning API in 2026?

Fish Audio. The API runs roughly $15 per 1 million UTF-8 bytes (≈ 180,000 English words, or about 12 hours of speech). ElevenLabs sits closer to $165 per 1 million characters at retail rates, which is about 11× more for plain English text, where one character is one byte. Cartesia and PlayHT land in between.

Which voice cloning API has the lowest latency?

Cartesia's Sonic model still wins on raw latency at sub-100ms — built for live phone agents. Fish Audio streams in roughly 200-400ms first-byte, fast enough for chatbots and most realtime use cases. ElevenLabs and PlayHT sit in the 250-500ms range.

Which voice cloning API has the highest quality?

Fish Audio S2 ranks #1 on TTS-Arena and posts the lowest WER on Seed-TTS Eval. In Fish Audio's published blind A/B against ElevenLabs V3, S2 Pro won 60/40. Audio Turing Test puts S2 at 0.515 versus Seed-TTS at 0.417 and MiniMax-Speech at 0.387 — closer to indistinguishable from human speech than any other model tested.

Can I self-host a production voice cloning API?

Yes — Fish Audio open-sourced S2 in March 2026 under Apache 2.0, including model weights, fine-tune code, and the SGLang inference engine. It runs on a single consumer GPU. No other top-tier model offers this. ElevenLabs, Cartesia, and PlayHT are closed source.

Which API should I use for an AI voice agent?

Cartesia if you absolutely need sub-100ms first-byte for telephony. Fish Audio if you can tolerate roughly 200-400ms and want to cut your per-call synthesis cost by about 11× relative to ElevenLabs list prices. Most chat and support agent workloads run fine on Fish.
