Hello,
I am looking for assistance with a video generation feature that involves dynamic text-to-voice audio and corresponding visual assets. The goal is to ensure that the length of the audio and the display time of visual assets (images or text) are synchronized. Here are the details of my use case:
- Dynamic Text-to-Voice Audio:
  - The audio is generated via a text-to-voice system.
  - The length of this audio can vary depending on different business scenarios.
  - The audio can be either a single continuous file or split into multiple parts.
- Segmented Audio and Visuals:
  - The audio consists of several distinct parts.
  - Each part of the audio needs to have a corresponding visual element (such as background images or text).
  - These visual elements should match the length of each specific audio segment.
- Synchronization Needs:
  - Is it possible to adjust the display time of text and image assets based on the length of the corresponding audio segments?
  - If the audio is split into multiple files, how can we ensure the visual assets are properly synchronized with the appropriate audio segments?
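To make the requirement concrete, here is a minimal Python sketch of the timing I have in mind. It assumes the text-to-voice output is WAV (so the standard-library `wave` module can read each segment's length) and that the segments play back to back with no gaps. `Cue`, `wav_duration`, and `build_timeline` are hypothetical names used only for illustration; they are not part of any video generation API.

```python
import wave
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Cue:
    """Display window for one visual asset, in seconds from the start of the video."""
    asset: str
    start: float
    end: float


def wav_duration(path: str) -> float:
    """Length of one audio segment in seconds (assumption: segments are WAV files)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_timeline(durations: Sequence[float], assets: Sequence[str]) -> List[Cue]:
    """Give each visual asset a display window equal to its audio segment's length.

    Assumption: segments are played consecutively, so each window starts
    where the previous one ended.
    """
    if len(durations) != len(assets):
        raise ValueError("each audio segment needs exactly one visual asset")
    cues: List[Cue] = []
    cursor = 0.0  # running start time of the next segment
    for duration, asset in zip(durations, assets):
        cues.append(Cue(asset=asset, start=cursor, end=cursor + duration))
        cursor += duration
    return cues


# Example with made-up durations; in practice they could come from
# wav_duration() applied to each generated audio file.
timeline = build_timeline([2.5, 4.0, 1.5], ["intro.png", "chart.png", "closing text"])
for cue in timeline:
    print(f"{cue.asset}: {cue.start:.1f}s to {cue.end:.1f}s")
```

With a single continuous audio file, the same idea would apply if the per-part lengths or timestamps are available from the text-to-voice system; whether the video generation feature accepts per-asset start and end times like these is exactly what I am trying to find out.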
Your guidance on how to achieve this synchronization between audio and visual assets in the video generation process would be greatly appreciated.
Thank you!