Built a full video-to-social content engine: n8n orchestration, Remotion rendering, edge-tts narration, DigitalOcean deployment. 6 scenes rendered in 6.7 seconds. 4 platforms publishing daily. Lessons from 3 failed TTS engines.
We needed a content engine that could produce, render, and publish video content daily — without a production team. Three weeks, one droplet, and several failed approaches later, here is what we shipped.
The pipeline has four layers:

- Orchestration: n8n drives the daily workflow end to end.
- Rendering: Remotion turns scene data into video.
- Narration: edge-tts generates the voiceover audio.
- Deployment: everything runs on a DigitalOcean droplet.
We tried F5-TTS first. It installed and imported cleanly, but generating 4 seconds of speech took more than 7 minutes on our 2 vCPU / 4 GB droplet. Transformer-based TTS without GPU acceleration is effectively unusable for a daily content pipeline. The lesson: benchmark on target hardware before committing.
Every scene's duration must match the actual voiceover length plus a small buffer; fixed durations cut narration off mid-word. We measure each voiceover with ffprobe and set the scene duration to the VO length plus 1 to 1.5 seconds.
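A minimal sketch of that measurement step, assuming a Node/TypeScript pipeline and a 30 fps Remotion composition. The helper names here are illustrative, not from our actual codebase:

```ts
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Ask ffprobe for the container duration of a rendered voiceover file.
async function getAudioDurationSeconds(path: string): Promise<number> {
  const { stdout } = await execFileAsync("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    path,
  ]);
  return parseFloat(stdout.trim());
}

// Scene length = VO length + buffer, converted to Remotion frames.
async function sceneDurationInFrames(voPath: string, fps = 30): Promise<number> {
  const voSeconds = await getAudioDurationSeconds(voPath);
  const bufferSeconds = 1.0; // 1–1.5s of headroom so narration never cuts off
  return Math.ceil((voSeconds + bufferSeconds) * fps);
}
```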
Instagram Reels has a 90-second maximum and requires a 9:16 aspect ratio. Threads allows up to 5 minutes; LinkedIn allows up to 10. We built isPlatformCompatible() as a gate that runs before every publish attempt.
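A sketch of what that gate can look like. The function name and the limits come from the text above; the platform keys and the video metadata shape are assumptions:

```ts
type Platform = "instagram" | "threads" | "linkedin";

interface VideoMeta {
  durationSeconds: number;
  width: number;
  height: number;
}

// Per-platform duration caps (Reels 90s, Threads 5min, LinkedIn 10min).
const MAX_DURATION_SECONDS: Record<Platform, number> = {
  instagram: 90,
  threads: 5 * 60,
  linkedin: 10 * 60,
};

function isPlatformCompatible(platform: Platform, video: VideoMeta): boolean {
  if (video.durationSeconds > MAX_DURATION_SECONDS[platform]) return false;
  // Reels additionally requires 9:16 portrait (e.g. 1080x1920).
  if (platform === "instagram" && video.width * 16 !== video.height * 9) {
    return false;
  }
  return true;
}
```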
The engine runs daily on a DigitalOcean droplet via PM2. Content is deduped by SHA-1 content hash — not just by media URL — to prevent text posts from republishing the same content across cycles.
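A minimal sketch of that hash-based dedupe, assuming each post carries a text body; the in-memory Set stands in for whatever persisted store the real pipeline keeps across PM2 restarts:

```ts
import { createHash } from "node:crypto";

// Stand-in for a persisted store of already-published hashes.
const publishedHashes = new Set<string>();

// Hash the post body itself, not just the media URL, so a text post
// that recycles the same copy under a new media URL is still caught.
function contentHash(text: string): string {
  return createHash("sha1").update(text).digest("hex");
}

function shouldPublish(text: string): boolean {
  const hash = contentHash(text);
  if (publishedHashes.has(hash)) return false;
  publishedHashes.add(hash);
  return true;
}
```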