F5-TTS took 7+ minutes to generate 4 seconds of speech on our production droplet. edge-tts produced voiceover for all 6 scenes in 6.7 seconds. The constraint was never quality; it was compute profile versus deployment target.
We made a technology switch mid-build. This is the decision record.
Our production environment is a 2vCPU/4GB DigitalOcean droplet. No GPU. No specialized ML hardware. A daily content pipeline needs to complete in under 10 minutes to be useful.
F5-TTS is a state-of-the-art open-source TTS system. On our droplet, it took more than 7 minutes to generate 4 seconds of speech. Transformer-based inference without GPU acceleration does not scale to a 2vCPU environment. The model was not the problem. The deployment target was the problem.
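The gap is easiest to see as a real-time factor. A back-of-envelope sketch, approximating "more than 7 minutes" as 420 seconds:

```python
# Real-time factor (RTF): synthesis wall-clock time divided by audio
# duration; RTF > 1 means slower than real time. "7+ minutes" is
# approximated here as 420 seconds.
def real_time_factor(wall_clock_s: float, audio_s: float) -> float:
    return wall_clock_s / audio_s

f5_rtf = real_time_factor(420.0, 4.0)  # F5-TTS on the 2vCPU droplet
print(f"F5-TTS runs roughly {f5_rtf:.0f}x slower than real time")
```

At ~105x slower than real time, even a one-minute voiceover would cost close to two hours of droplet compute.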
Before committing to any ML-backed tool in production:

1. Run the tool on the actual target hardware.
2. Measure wall-clock time for a representative workload.
3. Calculate whether the tool can complete within the pipeline time budget.
4. Only then commit to the integration.
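A minimal sketch of that checklist as a harness; `synthesize` and `fits_budget` are hypothetical names standing in for whatever tool is under evaluation:

```python
import time

PIPELINE_BUDGET_S = 600.0  # the 10-minute pipeline budget from above

def fits_budget(synthesize, workload, budget_s=PIPELINE_BUDGET_S):
    """Time a representative workload and compare wall-clock time
    against the pipeline budget."""
    start = time.perf_counter()
    for item in workload:
        synthesize(item)
    elapsed = time.perf_counter() - start
    return elapsed <= budget_s, elapsed
```

The point is where it runs, not what it does: execute it on the droplet itself, not a dev laptop, since the F5-TTS result above only shows up on the target hardware.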
edge-tts is a Python library that calls Microsoft Neural TTS via a free cloud-backed API. On our droplet, it generated 6 scenes of voiceover in 6.7 seconds. The output quality is Microsoft Neural: professional, natural, consistent. The trade-off: we depend on a free external API with no SLA.
Every ML tool evaluation now includes a deployment target test. If a tool requires GPU acceleration and the deployment target has no GPU, the tool is not suitable for that context — regardless of quality or cost.