
Adding AI Lipsync and a Generative Marketing Studio to a UGC Platform

By Zaman Ishtiyaq
Jun 26, 2026 · 11 min read
Tags: AI, UGC, Lipsync, ElevenLabs, Video Generation, Marketing, ReelRocket



Quick answer: You can add AI lipsync to a UGC platform by combining ElevenLabs TTS with avatar-based animation instead of user photo uploads. This cuts privacy concerns, reduces friction, and keeps output quality consistent. Pair that with client-side video export and you have a system that scales without ballooning server costs.


I've been heads-down on [ReelRocket](https://ugccreatorsfrontend.vercel.app) for the past few weeks, and two features have been competing for my attention at the same time: a lipsync studio and a generative marketing suite. One is shipping, one is still in research mode. This post covers both because the decisions behind each one are connected.


What just shipped: product review video tiers


Before I get into lipsync, I want to briefly cover what just landed in PR #3: tiered product review videos.


The feature itself isn't complicated. Users pick Basic, Medium, or Pro when they request a product review video, and the backend adjusts the prompt depth, video length, and quality settings accordingly. But the reasoning behind it took longer to land on than the code did.
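
A minimal sketch of how a tier could map to generation settings. The field names and every value here are illustrative placeholders, not ReelRocket's actual configuration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewTier:
    prompt_depth: str  # how detailed the generation prompt is
    max_seconds: int   # target video length
    quality: str       # rendering quality preset


# Hypothetical values for illustration only.
TIERS = {
    "basic": ReviewTier(prompt_depth="summary", max_seconds=30, quality="standard"),
    "medium": ReviewTier(prompt_depth="detailed", max_seconds=90, quality="high"),
    "pro": ReviewTier(prompt_depth="in-depth", max_seconds=180, quality="studio"),
}


def resolve_tier(name: str) -> ReviewTier:
    """Look up tier settings, rejecting unknown tier names early."""
    key = name.strip().lower()
    if key not in TIERS:
        raise ValueError(f"unknown tier: {name!r}")
    return TIERS[key]
```

Keeping the mapping in one table means the request handler only has to pass the tier name through, and adding a fourth tier later is a one-line change.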


The problem with a single output level is that it serves no one perfectly. A solo creator testing a $12 face wash doesn't need a three-minute, studio-quality breakdown. A brand running a paid campaign does. Tiering lets both users get what they actually need, and it gives the platform a natural pricing surface without artificially limiting core functionality.


It also changes how people interact with the tool. When you give someone a "Pro" option, they take the output more seriously. That shapes the use case they bring to the platform.


Why I chose avatar selection over image upload for lipsync


Most lipsync demos you see online follow the same pattern: user uploads a photo, system animates it to match audio. It looks impressive in a demo. In production, it creates problems.


The first problem is quality variance. A headshot taken in bad lighting, at an odd angle, or with a cluttered background produces a noticeably worse lipsync result. Users blame the tool, not their photo. That's a support and retention problem.


The second problem is privacy. Asking users to upload their face for processing through a third-party API pipeline raises questions most platforms don't want to answer in their terms of service.


Avatar selection sidesteps both. Users pick from a set of pre-built AI avatars. The quality baseline is fixed because you control the input. Nobody uploads their face, so the privacy question largely goes away. And the UX matches what's already on the hooks and demo pages in ReelRocket, so new users don't have to learn a new interaction model.


The consistency argument is underrated here. When every lipsync output starts from the same high-quality avatar assets, you can tune the system once and it works for everyone. With user photo uploads, you're essentially running a different pipeline for every input.


How the lipsync pipeline actually works


The architecture follows an approach I first saw in the open-generative-ai reference implementation, adapted to fit ReelRocket's existing conventions.


Step 1: Script input and TTS via ElevenLabs


The user types or pastes a script. ElevenLabs converts it to audio. This part is straightforward: POST the script, get audio back, store it in Cloudflare R2.
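
As a rough sketch, the TTS call could look like the following. It assumes the public ElevenLabs text-to-speech endpoint (`POST /v1/text-to-speech/{voice_id}` with an `xi-api-key` header), so check the current API reference before relying on it. The helper names and the default model are my own choices for illustration:

```python
import json
import urllib.request

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def build_tts_request(script: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build (but do not send) the TTS request for a script."""
    if not script.strip():
        raise ValueError("script must not be empty")
    body = json.dumps({"text": script, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        ELEVENLABS_TTS_URL.format(voice_id=voice_id),
        data=body,
        method="POST",
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
    )


def synthesize(script: str, voice_id: str, api_key: str) -> bytes:
    """Send the request and return the raw audio bytes (network call)."""
    request = build_tts_request(script, voice_id, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

The bytes returned by `synthesize` are what would then be written to R2. Splitting request construction from sending keeps the first half testable without a network connection or an API key.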


The storage pattern here is identical to what handles hooks and demo videos elsewhere in the platform. That consistency matters more than it sounds. When you're debugging at 1am, knowing that all generated assets live in the same R2 bucket structure with the same naming convention saves real time.
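
One way to enforce a shared naming convention is a single key-building helper that every feature calls. The layout below (`feature/user/date/asset`) is a hypothetical convention for illustration, not the actual ReelRocket bucket structure:

```python
from datetime import datetime, timezone


def asset_key(feature: str, user_id: str, asset_id: str, ext: str,
              created: datetime) -> str:
    """Build an object key so every generated asset follows one layout."""
    # Normalise to UTC so keys do not depend on the server's timezone.
    day = created.astimezone(timezone.utc).strftime("%Y/%m/%d")
    return f"{feature}/{user_id}/{day}/{asset_id}.{ext.lstrip('.')}"
```

For example, `asset_key("lipsync", "u42", "abc", "mp3", datetime(2026, 6, 26, tzinfo=timezone.utc))` gives `lipsync/u42/2026/06/26/abc.mp3`, and a hook or demo video would differ only in the first path segment.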


Step 2: Avatar selection and lipsync animation


The selected avatar asset and the generated audio file go into the lipsync pipeline. The avatar is animated to match the audio timing. The result is stored back in R2.
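
The flow across steps 1 and 2 can be sketched as a small orchestration function. The TTS, animation, and storage calls are passed in as plain callables because the post does not name the lipsync provider. Every name here is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class LipsyncResult:
    audio_key: str  # where the generated audio was stored
    video_key: str  # where the animated video was stored


def run_lipsync(script: str, avatar_key: str,
                tts: Callable[[str], bytes],
                animate: Callable[[str, bytes], bytes],
                store: Callable[[str, bytes], str]) -> LipsyncResult:
    """Script -> audio -> animated avatar video, storing both outputs."""
    audio = tts(script)
    audio_key = store("audio.mp3", audio)
    video = animate(avatar_key, audio)
    video_key = store("video.mp4", video)
    return LipsyncResult(audio_key=audio_key, video_key=video_key)
```

Injecting the three dependencies means the lipsync model can be swapped without touching the orchestration, and the whole flow can be exercised with in-memory fakes.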


Step 3: History tab


One thing I added early that I'm glad about: a history tab showing past lipsync generations. Users iterate. They tweak the script, regenerate, compare. Without history, every generation is a dead end. With it, the tool feels like a workspace.
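
A history tab needs little more than a record per generation and a newest-first query. This is an in-memory sketch with hypothetical field names. A real version would sit on top of whatever database the platform already uses:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Generation:
    script: str
    avatar_id: str
    video_key: str
    created_at: datetime


class History:
    def __init__(self) -> None:
        self._items: List[Generation] = []

    def add(self, script: str, avatar_id: str, video_key: str,
            created_at: Optional[datetime] = None) -> Generation:
        """Record one generation, defaulting the timestamp to now (UTC)."""
        item = Generation(script, avatar_id, video_key,
                          created_at or datetime.now(timezone.utc))
        self._items.append(item)
        return item

    def recent(self, limit: int = 20) -> List[Generation]:
        """Return the newest generations first, capped at `limit`."""
        ordered = sorted(self._items, key=lambda g: g.created_at, reverse=True)
        return ordered[:limit]
```

Storing the script alongside the output is what makes the tweak, regenerate, compare loop possible, because each entry shows what produced it.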


What I'm researching: the generative AI marketing studio


This is still in the investigation phase, but the model landscape has moved fast and it's worth documenting.


The starting point is the open-generative-ai architecture. It's a clean reference for how to structure a generative video studio, and it gave me a framework for thinking about what a marketing studio would look like inside ReelRocket.


ByteDance models worth watching


Through testing on Kie.ai's model marketplace, I've been evaluating three ByteDance models that are directly relevant to UGC:


Seedance 2.0 and 2.5 handle general video generation. The output quality is noticeably better than earlier open-source alternatives for product-focused content. Seedance 2.5 handles motion consistency well, which matters for demo-style videos where a product needs to stay in frame.


OmniHuman 1.5 is the more interesting one for this use case. It generates talking-head video from a single image plus an audio file, which is the same workflow as the lipsync feature but with a different model underneath. This opens the door to brand avatar creation from a product mascot or spokesperson photo without requiring video footage as input.


Volcengine Lip Sync API is ByteDance's own lipsync offering, available through the same Kie.ai marketplace. I haven't benchmarked it against the current ElevenLabs-driven pipeline yet, but it's on the list.


The reason these models matter together is that they suggest a coherent marketing studio workflow: generate a script (existing), generate audio (ElevenLabs), generate a talking-head video (OmniHuman), optionally add B-roll or product footage (Seedance). That's a full UGC ad in a few API calls.
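
That workflow can be written down as an ordered plan in which each step declares what it depends on. This is a sketch of the idea only. The step names and the `needs` field are assumptions, and no real provider API is called:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    provider: str
    needs: Tuple[str, ...] = ()  # names of earlier steps this one consumes


def plan_ad(with_broll: bool = False) -> List[Step]:
    """Return the ordered steps for one UGC ad."""
    steps = [
        Step("script", "existing script generator"),
        Step("audio", "ElevenLabs", needs=("script",)),
        # The avatar or spokesperson image is user input, not a step.
        Step("talking_head", "OmniHuman", needs=("audio",)),
    ]
    if with_broll:
        steps.append(Step("broll", "Seedance", needs=("script",)))
    return steps


def check_order(steps: List[Step]) -> None:
    """Raise if any step runs before the steps it depends on."""
    done = set()
    for step in steps:
        missing = [n for n in step.needs if n not in done]
        if missing:
            raise ValueError(f"{step.name} is missing inputs: {missing}")
        done.add(step.name)
```

Making the dependencies explicit is mainly useful once steps become optional or swappable, for example trying a different lipsync model in place of the talking-head step.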


The client-side video export decision


This one came down to cost and scale.


The existing demo video feature runs ffmpeg on the server. That's fine for low volume. It doesn't scale cleanly when you add lipsync generations, which can be heavier to process. Running ffmpeg on every export means server compute costs that grow directly with usage.


The alternative: move final video export to the client side. The user's browser does the processing. This is now live on the hooks page, the demo page, and the automations section.


The tradeoff is that slower machines will have slower exports. That's real. But for the target user (a creator or small brand running TikTok campaigns), the machine doing the export is almost always a reasonably modern laptop, and the files aren't large.


The bigger win is that the server cost for export drops to near zero. That changes what's economically viable to offer at lower tiers.
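
The scaling argument is simple arithmetic: server-side export cost is linear in the number of exports. A back-of-the-envelope model makes that concrete. The per-second price and processing time in the usage note are made-up inputs, not measured figures from ReelRocket:

```python
def server_export_cost(exports: int, cpu_seconds_per_export: float,
                       cost_per_cpu_second: float) -> float:
    """Estimated server compute cost for running ffmpeg on every export."""
    return exports * cpu_seconds_per_export * cost_per_cpu_second
```

With hypothetical inputs of 20 CPU-seconds per export at 0.00002 per CPU-second, 1,000 exports cost about 0.40 and 10,000 exports cost about 4.00. Ten times the usage means ten times the bill, whereas with client-side export that line item stays close to zero however many exports run.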


What I've learned building this


UX consistency compounds. Every time I've mirrored an existing ReelRocket pattern (avatar selection, R2 storage, history tabs), onboarding the feature has been faster and bugs have been easier to trace. New patterns introduce new failure modes.


Model marketplaces change the economics. Two years ago, adding talking-head video to a platform typically meant building or fine-tuning your own model. Now Kie.ai and similar marketplaces put Seedance and OmniHuman behind a straightforward API. The barrier is integration work, not ML work.


Tiers force clarity. Shipping the Basic/Medium/Pro review tiers made me think harder about what actually differentiates each output level. That thinking made the prompts better and the feature more useful.


The lipsync feature is shipping soon. The marketing studio is still in research, but the model landscape is mature enough that it's more an integration project than a research project at this point.