March 4, 2026·7 min read

I Built a Full Video Pipeline with AI in 4 Minutes

From script to published video using Seedance, TTS, and automated subtitle burning — a step-by-step breakdown of the pipeline that changed how I create content.

The 4-Minute Claim

Let me be precise. The pipeline itself takes 4 minutes to set up from scratch — once. After that, producing each video takes 10-15 minutes of actual human work (writing the script) and 3-5 minutes of automated processing.

What used to require a camera, a ring light, three re-takes, and 45 minutes of editing now takes a script and one terminal window.

Here's the exact pipeline I validated, with every command.

The Stack

**Seedance** (via 火山方舟/Volcengine): AI image-to-video generation. Each generated clip runs about 5 seconds.

**TTS**: Text-to-speech for voiceover. ElevenLabs for English, Volcengine TTS for Chinese.

**Pillow + ffmpeg**: Subtitle burning. No libass dependency issues.

**ffmpeg**: Video normalization and final composition.

Total cost for a 30-second video: approximately $0.30-0.80 depending on clip count.

Step 1: Write the Script (the only human step)

The script drives everything. I write it in chunks — each chunk becomes one video clip.

The formula that works:

```
[Hook: 3 seconds]
One sentence that makes someone stop scrolling.
Must be counterintuitive or specific.

[Problem: 5 seconds]
The thing your audience recognizes as real.

[Solution: 15 seconds]
What you actually did. Specifics over generalities.

[Result: 4 seconds]
The outcome. Numbers or visible evidence.

[CTA: 3 seconds]
One action. Not three.
```
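The timing budget in that formula is plain arithmetic, so it is easy to check before generating anything. A minimal sketch, assuming each generated clip is about 5 seconds long; the `SECTIONS` dictionary and the `clips_needed` helper are illustrative names, not part of any tool mentioned here.

```python
import math

# Section lengths in seconds, copied from the formula above.
SECTIONS = {"hook": 3, "problem": 5, "solution": 15, "result": 4, "cta": 3}

CLIP_SECONDS = 5  # assumed length of one generated clip


def clips_needed(sections, clip_seconds=CLIP_SECONDS):
    """Return (total_seconds, clip_count) for a script outline."""
    total = sum(sections.values())
    return total, math.ceil(total / clip_seconds)


total, clips = clips_needed(SECTIONS)
print(f"{total} seconds of script -> {clips} clips")
```

With the default budget this prints `30 seconds of script -> 6 clips`.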

Example script (28 seconds, 5 clips):

```
"99% of AI content creators are talking to themselves. Here's why.

They optimize for creation. Not distribution.

I spent this week building a full video pipeline (script to published)
using Seedance, TTS, and automated subtitles.

Zero camera. Zero editing software. One API key.

Want the exact pipeline? Comment 'video' below."
```

Step 2: Generate Video Clips with Seedance

Seedance is a video generation model available through Volcengine's ARK API. You give it an image and a motion prompt, and it returns a 5-second video clip.

```bash
export ARK_API_KEY="your-key"
export ARK_ENDPOINT_ID="your-endpoint-id"

bash seedance-i2v.sh cover-image.png "professional workspace, slow zoom in, soft natural light, cinematic" clip1.mp4
```

The `seedance-i2v.sh` script:

1. Submits an async task to the ARK API
2. Polls every 5 seconds until the task succeeds
3. Downloads the video to your local path
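The submit, poll, download pattern is worth seeing in isolation. Here is a minimal sketch of the polling step only, in Python rather than shell. It is an illustration, not the contents of `seedance-i2v.sh`: the `get_status` callable stands in for whatever request fetches the task, and the `status` and `video_url` field names are placeholders rather than the real ARK response schema.

```python
import time


def poll_until_done(get_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call get_status() every `interval` seconds until it reports a final state.

    get_status is a caller-supplied function returning a dict such as
    {"status": "running"} or {"status": "succeeded", "video_url": "..."}.
    These field names are placeholders, not the real ARK response schema.
    """
    waited = 0.0
    while True:
        task = get_status()
        if task["status"] == "succeeded":
            return task
        if task["status"] == "failed":
            raise RuntimeError(f"generation failed: {task}")
        if waited >= timeout:
            raise TimeoutError(f"no result after {waited:.0f} seconds")
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the loop testable without real waiting, and the timeout guards against a task that never leaves the running state.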

For a 30-second video, I generate 5-6 clips. Total wait time: ~3 minutes.

Prompt writing rules I learned the hard way:

- Always include a motion verb (zoom, pan, float, drift)
- Specify lighting explicitly ("soft natural light" beats "bright")
- Keep it under 50 words
- The image you provide sets the "world"; the prompt sets the movement
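The first three rules are mechanical enough to check before spending an API call. A rough sketch, assuming simple keyword matching is good enough; the word lists and the `check_prompt` name are illustrative, and exact-word matching will miss variants such as "zooming".

```python
import re

MOTION_VERBS = {"zoom", "pan", "float", "drift"}  # verbs named in the rules above
LIGHTING_WORDS = {"light", "lighting", "lit", "glow", "shadow", "backlit"}  # illustrative list


def check_prompt(prompt, max_words=50):
    """Return a list of rule violations; an empty list means the prompt passes."""
    words = re.findall(r"[a-z']+", prompt.lower())
    problems = []
    if not MOTION_VERBS & set(words):
        problems.append("no motion verb")
    if not LIGHTING_WORDS & set(words):
        problems.append("no lighting description")
    if len(words) >= max_words:
        problems.append(f"{len(words)} words, limit is under {max_words}")
    return problems


print(check_prompt("professional workspace, slow zoom in, soft natural light, cinematic"))
```

The example prompt from Step 2 passes all three checks, so this prints an empty list.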

Step 3: Normalize and Concatenate

Seedance clips may have slightly different resolutions or framerates. Before concatenating, normalize everything:

```bash
bash normalize-clips.sh clip1.mp4 clip2.mp4 clip3.mp4 clip4.mp4 clip5.mp4
```

This outputs `clip1-norm.mp4` through `clip5-norm.mp4` and a `concat.mp4` — all clips joined, no audio, uniform 1920×1080 at 30fps.

For vertical video (TikTok/Reels/Xiaohongshu), change the scale:

```bash
SCALE=1080:1920 bash normalize-clips.sh clip1.mp4 clip2.mp4 clip3.mp4
```
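What a normalization pass like this typically does can be sketched as ffmpeg command lines built in Python. The filter chain, codec choice, and helper names below are assumptions for illustration; `normalize-clips.sh` may do it differently. The only things taken from the text are the `SCALE` variable, the 30fps target, and the silent output.

```python
import os


def normalize_cmd(src, dst, scale="1920:1080", fps=30):
    """Build an ffmpeg command that fits one clip into a fixed frame size and rate."""
    w, h = scale.split(":")
    vf = (
        f"scale={w}:{h}:force_original_aspect_ratio=decrease,"  # fit inside the frame
        f"pad={w}:{h}:(ow-iw)/2:(oh-ih)/2,"                     # center, pad the rest
        f"fps={fps},setsar=1"                                   # uniform rate, square pixels
    )
    return ["ffmpeg", "-y", "-i", src, "-vf", vf, "-an",
            "-c:v", "libx264", "-pix_fmt", "yuv420p", dst]


def concat_list(clips):
    """Contents of the text file the concat demuxer reads: one file line per clip."""
    return "".join(f"file '{c}'\n" for c in clips)


def concat_cmd(list_file, dst="concat.mp4"):
    """Join already-normalized clips with the concat demuxer, without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", dst]


scale = os.environ.get("SCALE", "1920:1080")  # same variable as the vertical example
print(" ".join(normalize_cmd("clip1.mp4", "clip1-norm.mp4", scale)))
```

Each list can be passed straight to `subprocess.run`. Stream copy in the concat step only works because every clip has already been re-encoded to the same size, rate, and pixel format.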

Step 4: Generate Voiceover

I use TTS to generate the voiceover from the same script. Key parameters:

- Format: MP3, 44100Hz, stereo
- Speed: ~140 characters/minute for technical content
- No music bed (you can add one in ffmpeg later if needed)

If the TTS output isn't in the right format:

```bash
ffmpeg -i voice-raw.mp3 -ar 44100 -ac 2 -b:a 128k voice.mp3
```
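Whether a file needs that conversion can be checked with ffprobe first. A small sketch, assuming ffprobe's JSON output for the first audio stream; the helper names are illustrative and the sample JSON is hand-written, not captured from a real file.

```python
import json


def probe_cmd(path):
    """ffprobe command that prints the first audio stream's rate and channel count as JSON."""
    return ["ffprobe", "-v", "error", "-select_streams", "a:0",
            "-show_entries", "stream=sample_rate,channels", "-of", "json", path]


def needs_conversion(probe_output, rate=44100, channels=2):
    """True if the probed stream differs from the target format."""
    stream = json.loads(probe_output)["streams"][0]
    # int() covers sample_rate arriving as a string.
    return int(stream["sample_rate"]) != rate or int(stream["channels"]) != channels


# Hand-written sample: a mono 24 kHz file, which would need conversion.
sample = '{"streams": [{"sample_rate": "24000", "channels": 1}]}'
print(needs_conversion(sample))
```

In a real run, `probe_output` would be the stdout of `subprocess.run(probe_cmd("voice-raw.mp3"), capture_output=True, text=True)`.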

Step 5: Write and Burn Subtitles

I write subtitles manually as SRT. It takes 3-4 minutes and gives me precise control over timing. For high-volume production, Whisper can auto-generate them from the voiceover.

```srt
1
00:00:00,000 --> 00:00:03,500
99%的内容创作者在对着空气说话

2
00:00:03,500 --> 00:00:08,000
他们优化的是创作，不是分发
```
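Reading a file like this takes only a few lines. A minimal parser sketch, assuming well-formed SRT with blank lines between cues; the function names are illustrative and this is not taken from `burn-subs.py`.

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_seconds(stamp):
    """Convert an SRT timestamp such as 00:00:03,500 to seconds."""
    h, m, s, ms = (int(x) for x in TIME.match(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000


def parse_srt(text):
    """Return a list of (start, end, text) tuples, times in seconds."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Skip the numeric index line if present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        if len(lines) < 2 or "-->" not in lines[0]:
            continue
        start, end = lines[0].split("-->")
        cues.append((to_seconds(start), to_seconds(end), "\n".join(lines[1:])))
    return cues
```

Calling `parse_srt(open("subs.srt", encoding="utf-8").read())` gives a list that is easy to look up by timestamp when drawing frames.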

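The tricky part: burning Chinese subtitles with ffmpeg directly depends on how ffmpeg was built and which fonts it can find. The `subtitles` filter needs libass, and `drawtext` needs a font file that contains Chinese glyphs. My solution: use Pillow to render subtitles frame-by-frame.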

```bash
python3 burn-subs.py --video concat.mp4 --srt subs.srt --output final-with-subs.mp4
```

Takes about 1× realtime (30 seconds to process a 30-second video).
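The core of a frame-by-frame approach is two small pieces: deciding which subtitle is active for a given frame, and drawing it. A hedged sketch of both, not the contents of `burn-subs.py`. It assumes cues are `(start, end, text)` tuples in seconds, that frames arrive as Pillow images, and that `font_path` points to a font with Chinese glyphs. How frames get in and out of ffmpeg is left out.

```python
def cue_at(cues, t):
    """Return the subtitle text active at time t (seconds), or None."""
    for start, end, text in cues:
        if start <= t < end:
            return text
    return None


def draw_subtitle(frame, text, font_path, size=48, margin=80):
    """Draw centered, outlined text near the bottom of a PIL image, in place."""
    # Imported here so cue_at stays usable without Pillow installed.
    from PIL import ImageDraw, ImageFont

    font = ImageFont.truetype(font_path, size)  # font_path must be a CJK-capable font
    draw = ImageDraw.Draw(frame)
    left, top, right, bottom = draw.textbbox((0, 0), text, font=font, stroke_width=3)
    x = (frame.width - (right - left)) // 2 - left
    y = frame.height - margin - (bottom - top) - top
    draw.text((x, y), text, font=font, fill="white", stroke_width=3, stroke_fill="black")
    return frame


cues = [(0.0, 3.5, "first line"), (3.5, 8.0, "second line")]
fps = 30
# Frame 104 is still inside the first cue; frame 105 lands exactly on 3.5 s.
print(cue_at(cues, 104 / fps), "|", cue_at(cues, 105 / fps))
```

The half-open interval (`start <= t < end`) means back-to-back cues never overlap on a frame. The black stroke keeps white text readable over bright footage.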

Step 6: Merge Audio and Video

```bash
ffmpeg -i final-with-subs.mp4 -i voice.mp3 -c:v copy -c:a aac -shortest output-final.mp4
```

Done. The output file is ready to upload.

The Insight Behind the Pipeline

What surprised me wasn't that this worked — it's that it works well enough that I'd post the output without embarrassment.

The bottleneck in content creation isn't talent or ideas. It's production friction. Every step that requires a camera, software, or manual editing is a step where most creators give up.

Removing those steps doesn't make the content better. It makes the creation sustainable.

The pipeline I've described is about distribution velocity — the ability to convert a thought into a published video in under an hour, consistently, without burning out.

For newsletter writers especially, this is the missing piece. You already have the insights (your newsletters). The pipeline turns them into video without requiring you to become a video producer.

What's Next

I'm integrating this pipeline into OnePost's workflow — so newsletter content becomes not just social posts, but videos automatically. The [generate endpoint](/app) already produces 7 platform variants from a newsletter. Video is the eighth.

The full pipeline scripts are open — drop a comment if you want the repo.
