
I Built an AI Video Ad Pipeline That Orchestrates 6 AI Services

Aadil Ghani
Software Engineer
10 min read

Most "AI video" tools give you a text box and a prayer. Type a prompt. Wait. Hope what comes back looks like something you'd actually run as an ad.

I wanted something different.

When a small business owner types "create a video ad for my plumbing business," I wanted the system to think like a creative director. Plan scenes. Cast characters. Direct cameras. Write voiceover copy. Compose music. Deliver a broadcast-ready video ad with burned-in captions. Zero manual editing.

So I built the Scene Composer. Here's how it works.


Video Ads Are the Hardest Creative to Automate

Text generation is solved. Image generation is table stakes. But a 30-second video ad with coherent scene transitions, synced voiceover, background music, and animated captions? That takes coordinating multiple AI models, keeping them temporally aligned, and delivering output that looks intentional. Not generated.

The hard part isn't any single AI call. It's the orchestration.

A video ad needs at least six operations that depend on each other: script and scene planning, reference image generation per scene, image-to-video conversion, text-to-speech with word-level timestamps, background music composition, and final assembly with caption burning. Some of these can run in parallel. Some absolutely cannot. And any one of them can fail.

I needed an architecture that could handle all of this reliably, resume from any failure point, and still finish in under two minutes.


Architecture: Effect-TS as the Backbone

I didn't reach for a workflow engine or a queue system. I reached for Effect-TS.

If you haven't used Effect, think of it as TypeScript's answer to the question: "What if errors, concurrency, retries, timeouts, and dependency injection were all first-class language features instead of afterthoughts?"

The Scene Composer is structured as five directories, each with a strict boundary:

  • domain/ Pure TypeScript. Zero side effects. Types, validation, constants, helpers. You can test every function here without mocking a single dependency.
  • services/ Six stateless Effect services, one per external integration. Each implements a typed interface and is provided via Effect's dependency injection layer.
  • orchestration/ The conductors. These modules coordinate services, manage concurrency, and handle per-scene error recovery.
  • stages/ Sequential pipeline phases that persist state to the database between each step.
  • clients/ Thin API wrappers. Nothing smart happens here.

The service layer composes into a single dependency:

const SceneComposerLive = Layer.mergeAll(
  SceneCompositionServiceLive,    // Google Gemini
  ImageGenerationServiceLive,     // Flux via fal.ai
  VideoGenerationServiceLive,     // Veo 3.1 / Sora-2
  VideoAssemblerServiceLive,      // FFmpeg assembly
  AudioServiceLive,               // ElevenLabs TTS + music
  CaptionBurningServiceLive,      // Remotion Lambda
)

Six services. One composable layer. Every function in the pipeline declares exactly which services it needs in its type signature.


The Pipeline: 4 Stages, 6 AI Services, ~90 Seconds

Stage 1: Scene Composition (Google Gemini)

The pipeline starts with a multimodal prompt to Gemini. I send the business context, product details, any uploaded images, and the user's creative brief. Gemini returns structured output. Not prose. A typed SceneCompositionOutput:

  • A GlobalStyle object defining the visual direction, protagonist characteristics, voice configuration, color palette, and mood
  • An array of SceneDescription objects, each with a narrative, camera direction, emotion, duration (4/6/8 seconds), and a detailed first-frame image prompt
  • A complete voiceover script, pre-calibrated to match the total video duration

That last part matters. I measured words-per-second rates for each voice in the library (ranging from 1.95 to 2.22 WPS depending on voice style). The scene composition prompt includes these rates so Gemini writes scripts that actually fit the runtime. If the script runs long, the system auto-adjusts scene durations before proceeding.
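The calibration idea can be sketched in a few lines of plain TypeScript. This is illustrative, not the real implementation: the function names and the 4/6/8 bump strategy are assumptions, but the arithmetic (words ÷ WPS, then stretching scenes until the runtime fits) is the mechanism described above.

```typescript
type Scene = { duration: 4 | 6 | 8 }

// Measured words-per-second rates per voice style (values from the article).
const VOICE_WPS: Record<string, number> = {
  deliberate: 1.95,
  energetic: 2.22,
}

function estimateSpeechSeconds(script: string, wps: number): number {
  const words = script.trim().split(/\s+/).filter(Boolean).length
  return words / wps
}

// If the script runs longer than the planned scenes, bump the shortest
// scenes to the next allowed duration (4 -> 6 -> 8) until the audio fits.
function fitScenesToScript(scenes: Scene[], speechSeconds: number): Scene[] {
  const steps: Array<4 | 6 | 8> = [4, 6, 8]
  const adjusted = scenes.map((s) => ({ ...s }))
  let total = adjusted.reduce((sum, s) => sum + s.duration, 0)
  while (total < speechSeconds) {
    const scene = adjusted
      .filter((s) => s.duration < 8)
      .sort((a, b) => a.duration - b.duration)[0]
    if (!scene) break // every scene is already at max length
    scene.duration = steps[steps.indexOf(scene.duration) + 1]
    total = adjusted.reduce((sum, s) => sum + s.duration, 0)
  }
  return adjusted
}
```

In practice the WPS table lives in the composition prompt so Gemini self-corrects; the stretch step is the fallback when it still overshoots.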

Stage 2: Reference Images (Flux via fal.ai)

Each scene gets a reference image. A high-quality still frame that serves as the visual anchor for video generation.

These generate in parallel (up to 8 concurrent) using Flux through fal.ai. Each prompt is built from the scene's firstFramePrompt plus anti-collage constraints to avoid the "AI grid" look that ruins coherence.
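In Effect, bounded parallelism is a one-line concurrency option; here is the same idea sketched with plain Promises so the mechanics are visible. The helper name is mine, not the pipeline's: up to `limit` tasks run at once, and results come back in input order.

```typescript
// Bounded-parallelism sketch: workers repeatedly claim the next
// unprocessed index until the input is exhausted.
async function mapWithConcurrency<A, B>(
  items: A[],
  limit: number,
  task: (item: A, index: number) => Promise<B>,
): Promise<B[]> {
  const results: B[] = new Array(items.length)
  let next = 0
  const worker = async () => {
    while (next < items.length) {
      const i = next++ // synchronous claim; safe in single-threaded JS
      results[i] = await task(items[i], i)
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  )
  return results
}
```

With `limit` set to 8, a ten-scene ad generates its reference images in roughly two batches instead of ten sequential calls.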

Here's where the character library comes in. I maintain 13 pre-generated diverse character personas, each photographed from three angles: headshot, bodyshot, and seated. When the scene calls for a protagonist and the user hasn't uploaded their own reference, Gemini selects from this shuffled library and includes the character's reference images in the generation prompt. The character metadata (age, gender presentation, ethnicity, vibe) helps the AI make contextually appropriate casting decisions.

This gives you something that typically requires a photo shoot: visual consistency across scenes with a recognizable protagonist.
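The casting step itself is done by Gemini, but the shape of the library and the fallback logic look roughly like this sketch. Every name here is illustrative; the real decision point is the same: a user-uploaded reference always wins, otherwise a persona comes from the shuffled library with all three reference angles attached.

```typescript
type CharacterPersona = {
  id: string
  age: string
  genderPresentation: string
  ethnicity: string
  vibe: string
  referenceImages: { headshot: string; bodyshot: string; seated: string }
}

// Shuffle so the model doesn't always see (and pick) the same first entries.
function shuffled<T>(items: T[]): T[] {
  const copy = [...items]
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1))
    ;[copy[i], copy[j]] = [copy[j], copy[i]]
  }
  return copy
}

function pickProtagonist(
  library: CharacterPersona[],
  userReferenceUrl?: string,
): { referenceUrls: string[]; persona?: CharacterPersona } {
  // A user-uploaded reference always takes precedence over the library.
  if (userReferenceUrl) return { referenceUrls: [userReferenceUrl] }
  const persona = shuffled(library)[0]
  return { referenceUrls: Object.values(persona.referenceImages), persona }
}
```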

Stage 3: Video Generation (Veo 3.1 / Sora-2 via fal.ai)

Each reference image converts to a video clip using image-to-video generation. The engine selection is deliberate:

  • Multi-scene videos use Google Veo 3.1. Better at maintaining visual consistency across clips.
  • Single-scene videos use OpenAI Sora-2. Stronger at complex motion and cinematic quality for longer standalone shots.

Both models run through fal.ai as a proxy layer, giving me unified queue management and progress tracking. Concurrency stays at 8 parallel generations with exponential backoff retries. Each scene independently tracks its status (pending → generating_image → generating_video → completed, or failed) so a single scene failure doesn't kill the run.
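Effect ships retry schedules (e.g. exponential ones) for exactly this; as a plain-TS sketch of the backoff behavior, with illustrative names and default values:

```typescript
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms))

// Retry with exponential backoff: delays double after each failure
// (base, 2x base, 4x base, ...), and the last error is rethrown if
// every attempt fails.
async function withBackoff<T>(
  task: () => Promise<T>,
  { attempts = 4, baseDelayMs = 500 } = {},
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await task()
    } catch (error) {
      lastError = error
      // No sleep after the final attempt.
      if (attempt < attempts - 1) await sleep(baseDelayMs * 2 ** attempt)
    }
  }
  throw lastError
}
```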

Stage 4: Assembly (ElevenLabs + FFmpeg + Remotion Lambda)

This is where everything comes together.

Three operations run in parallel:

  1. Voiceover generation. ElevenLabs converts the full script to speech with word-level timestamps. Not sentence-level. Individual word start/end times.
  2. Background music. ElevenLabs composes an instrumental track matched to the scene's emotion, mood, and ambience.
  3. Video concatenation. An FFmpeg-based service stitches scene clips together with the audio tracks, mixing voiceover and music at appropriate levels.
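The mixing step boils down to an FFmpeg invocation that ducks the music under the voiceover. This sketch of an argument builder is illustrative (the actual assembly service's command and levels are not shown in this post), but the `volume` + `amix` filter chain is standard FFmpeg:

```typescript
// Build ffmpeg args: input 0 = concatenated video, 1 = voiceover, 2 = music.
// The music is attenuated, mixed with the voiceover, and muxed back onto
// the video stream without re-encoding the video.
function buildMixArgs(opts: {
  videoPath: string
  voiceoverPath: string
  musicPath: string
  musicVolume?: number // e.g. 0.2 = music at 20% under the voiceover
  outPath: string
}): string[] {
  const vol = opts.musicVolume ?? 0.2
  const filter = [
    `[2:a]volume=${vol}[bg]`, // duck the background music
    `[1:a][bg]amix=inputs=2:duration=first[aout]`, // mix voice + music
  ].join(';')
  return [
    '-i', opts.videoPath,
    '-i', opts.voiceoverPath,
    '-i', opts.musicPath,
    '-filter_complex', filter,
    '-map', '0:v', '-map', '[aout]',
    '-c:v', 'copy', '-c:a', 'aac',
    opts.outPath,
  ]
}
```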

Then the final piece: caption burning.

Word-level timestamps from the voiceover feed into a Remotion composition running on AWS Lambda. I built three caption styles:

  • Hormozi. Bold Anton font, uppercase, green highlight, two words per line, scale animation. The internet marketing standard.
  • Framed. Inter font, black pill highlight, three words per line. Clean and readable.
  • Simple. Inter font, text shadow, six words per line, bottom-positioned with spring enter/exit animation.

Remotion renders at 1080x1920, 30fps, H.264. Directly on Lambda, no GPU instances needed. The rendered video uploads to Supabase storage, and the pipeline updates both the generated_videos and ads records atomically.
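The bridge from word-level timestamps to caption styles is a grouping step: each style defines its words-per-line (two for Hormozi, three for Framed, six for Simple), and each group becomes one on-screen "page" timed from its first word's start to its last word's end. A sketch with illustrative field names (not the real Remotion props):

```typescript
type TimedWord = { word: string; startMs: number; endMs: number }
type CaptionPage = { text: string; startMs: number; endMs: number }

// Chunk the transcript into caption pages of `wordsPerLine` words.
function groupWords(words: TimedWord[], wordsPerLine: number): CaptionPage[] {
  const pages: CaptionPage[] = []
  for (let i = 0; i < words.length; i += wordsPerLine) {
    const chunk = words.slice(i, i + wordsPerLine)
    pages.push({
      text: chunk.map((w) => w.word).join(' '),
      startMs: chunk[0].startMs, // page appears with its first word
      endMs: chunk[chunk.length - 1].endMs, // and leaves with its last
    })
  }
  return pages
}
```

This is why sentence-level timestamps aren't enough: without per-word times, a two-words-per-line style like Hormozi has nothing to animate against.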


The Part Nobody Talks About: Resumability

The pipeline takes about 90 seconds when everything works. But AI APIs fail. Network requests time out. Lambda functions run out of memory.

This is where Effect-TS earns its keep.

Every stage writes its progress to the database. The generated_videos table tracks the current phase (scenes → images → videos → assembly → complete), and each video_scene record tracks its individual status.

When a generation resumes, the system runs a completion analysis:

  • All scenes have images but some lack videos? Skip to Stage 3.
  • All scenes have videos but assembly failed? Skip to Stage 4.
  • Caption data exists but captioned video doesn't? Run caption burning only.
  • Some scenes stuck for more than 15 minutes? Reset those scenes and retry, respecting a maximum retry count.
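The decision logic above reduces to a small pure function over persisted state. This sketch uses illustrative types and collapses the analysis to "which stage do we re-enter" (the real version also handles stuck-scene resets and retry counts):

```typescript
type SceneRecord = { imageUrl?: string; videoUrl?: string }
type ResumePoint = 'images' | 'videos' | 'assembly' | 'captions' | 'done'

// Given what survived in the database, find the minimum stage to re-run.
function analyzeCompletion(
  scenes: SceneRecord[],
  run: { assembledVideoUrl?: string; captionedVideoUrl?: string },
): ResumePoint {
  if (run.captionedVideoUrl) return 'done'
  if (run.assembledVideoUrl) return 'captions' // only burning remains
  if (scenes.every((s) => s.videoUrl)) return 'assembly'
  if (scenes.every((s) => s.imageUrl)) return 'videos'
  return 'images'
}
```

Because every check reads persisted URLs rather than in-memory state, a resumed run on a fresh process reaches the same conclusion as the one that crashed.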

The resume function figures out exactly where things broke and runs the minimum work needed to finish. 20-minute timeout on fresh runs. 25-minute timeout on resumes since they might need to re-run expensive stages.

export const composeScenesEffect = (input: ComposeScenesInput) =>
  Effect.scoped(
    Effect.gen(function* () {
      const {
        generatedVideoId, globalStyle,
        protagonistReferenceUrl, savedScenes,
      } = yield* initializeGenerationEffect(input)
 
      const scenesWithImages =
        yield* runImageStageEffect({ ... })
      yield* runVideoStageEffect({ ... })
      const { videoUrl } =
        yield* runAssemblyStageEffect({ ... })
 
      return { videoUrl }
    }),
  ).pipe(
    Effect.timeout(Duration.millis(COMPOSE_PIPELINE_TIMEOUT_MS)),
    Effect.tapError((error) =>
      updateGeneratedVideoErrorByAdIdEffect(
        input.adId, extractErrorDetails(error)
      )
    ),
    Effect.onInterrupt(() =>
      updateGeneratedVideoErrorByAdIdEffect(
        input.adId, 'pipeline-interrupted'
      )
    ),
  )

Four lines of pipeline. Complete timeout handling, error persistence, and interruption cleanup. All declarative, all typed.


Selective Scene Regeneration

After the initial generation, users often want to tweak one or two scenes without re-running the entire pipeline. "Make scene 3 more dramatic" or "change the protagonist in scene 1."

The edit module handles this through a clone-and-replace strategy:

  1. Clone the generated_videos record and all of its video_scenes rows into a new run
  2. Send the existing scenes plus user feedback back to Gemini for targeted recomposition
  3. Re-run only the modified scenes through image → video → assembly
  4. Rebuild the voiceover script to merge unchanged and updated sections seamlessly

You get a new complete video that reflects targeted edits without starting from scratch. Each edit creates a clean generation run. No mutation of previous outputs.
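The clone-and-replace bookkeeping can be sketched as a pure function (illustrative shapes, not the real schema): copy every scene into the new run, reset only the edited ones, and keep the rest byte-for-byte.

```typescript
type SceneRow = { index: number; status: string; videoUrl?: string }

// Scenes the user edited are reset to 'pending' with their old output
// dropped; untouched scenes carry their completed videos into the new run.
function cloneForEdit(scenes: SceneRow[], editedIndices: number[]): SceneRow[] {
  const edited = new Set(editedIndices)
  return scenes.map((scene) =>
    edited.has(scene.index)
      ? { index: scene.index, status: 'pending' } // regenerate this one
      : { ...scene },                             // reuse as-is
  )
}
```

Since the previous run is never mutated, a bad edit is trivially discardable: the earlier generation is still sitting there, complete.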


Error Handling: Tagged Errors, Not String Messages

Every error in the pipeline extends Data.TaggedError, giving each failure type a discriminant tag that the type system can reason about:

class VideoAssemblyError extends Data.TaggedError(
  'VideoAssemblyError'
)<{
  message: string
  details?: unknown
}> {}

Error recovery is pattern-matched, not stringly-typed. I can catch a VideoAssemblyError and retry assembly without accidentally swallowing an ImageGenerationError. The orchestration layer uses Effect.catchTag to implement per-error-type recovery strategies.

When a scene fails, the error handler updates that scene's status in the database and continues processing other scenes. The pipeline produces partial results. Four successful scenes out of five is better than zero.
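Stripped of Effect's machinery, the tagged-error pattern is just a discriminated union the compiler can exhaustively check. A plain-TS sketch (Data.TaggedError and Effect.catchTag give you this with full inference; the recovery strings here are illustrative):

```typescript
type PipelineError =
  | { _tag: 'ImageGenerationError'; sceneIndex: number }
  | { _tag: 'VideoAssemblyError'; message: string }

// Pattern-match on the tag: each failure type gets its own recovery,
// and adding a new error variant makes this switch fail to compile
// until it's handled.
function recover(error: PipelineError): string {
  switch (error._tag) {
    case 'VideoAssemblyError':
      return 'retry-assembly' // safe to retry in isolation
    case 'ImageGenerationError':
      return `retry-scene-${error.sceneIndex}` // only redo the failed scene
  }
}
```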


What I'd Do Differently

Not everything is perfect. A few honest notes:

The Monogo service is a single point of failure. FFmpeg-based video assembly runs on a separate service rather than serverlessly. If it goes down, assembly fails for everyone. I'm exploring moving this to Lambda or a container-based approach.

Duration calibration is still approximate. Despite measuring WPS rates per voice, there's inherent variance in TTS output. The system handles mismatches with audio padding and truncation, but occasionally there's a noticeable gap between the last spoken word and the end of the video. Still learning the best way to handle this edge case.

Veo and Sora are both accessed through fal.ai. This gives me a unified interface but adds a proxy layer. For latency-sensitive production workloads, going direct might save 2-3 seconds per generation. Something I'm testing.


The Stack

  • Orchestration: Effect-TS
  • Scene AI: Google Gemini (Vercel AI SDK)
  • Image generation: Flux via fal.ai
  • Video generation: Google Veo 3.1, OpenAI Sora-2 via fal.ai
  • Voice & music: ElevenLabs
  • Video assembly: FFmpeg (Monogo service)
  • Caption rendering: Remotion Lambda on AWS
  • Database: Drizzle ORM + PostgreSQL
  • Storage: Supabase
  • Framework: Next.js 16 (App Router)

Why This Matters

A plumber in Munich doesn't have a creative agency. A real estate agent in Lagos doesn't have a video production budget. A bakery owner in Mexico City doesn't have three weeks to wait for an ad.

What they do have is a phone, a product, and three minutes to spare.

The Scene Composer turns that into a broadcast-ready video ad with professional voiceover, scene-appropriate music, and animated captions. The kind of content that used to cost $5,000 and take two weeks.

I didn't build this because AI video generation is technically interesting (though it is). I built it because the alternative, paying an agency or learning After Effects, means most small businesses never advertise with video at all.

That's the real problem worth solving.


If you're interested in how I approach reliable distributed systems, check out how we built a push notification system that actually doesn't lose messages — similar patterns of orchestration and failure recovery, applied to real-time delivery.


Built at Glorya. We're hiring engineers who think a bit more, preferably out of the 📦.

AI · Effect-TS · Video Generation · System Design