Combine Two Videos in Speaker Mode

I’d like to combine two source videos and output based on who is talking.

Similar to speaker view in Zoom calls.

I’m able to provide timestamps of when to switch between users.

So we’d have input files like:
google-drive/host.mp4
google-drive/guest.mp4

And then timestamps like this:
00:00 - Host
00:30 - Guest
00:35 - Host
00:40 - Host
01:30 - Guest
02:30 - Host
… etc

There may be hundreds of changes in a roughly 30 minute interview.

How can I write JSON for Shotstack to produce this output?

Something like this, maybe?

Thanks,
Marcus

I’ve put together this very simple example to show how you can alternate from one clip to another while keeping the timing in sync. If you swap the source videos in the JSON below with your speaker videos and start adjusting the timing (start, length and trim), you should be able to achieve what you are trying to do.

In this example I am just using counter videos so you can see the seconds counting up and confirm that the timing proceeds continuously through the video.

Here is the JSON:

{
    "timeline": {
        "tracks": [
            {
                "clips": [
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-1.mp4",
                            "trim": 0
                        },
                        "start": 0,
                        "length": 5
                    },
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-2.mp4",
                            "trim": 5
                        },
                        "start": 5,
                        "length": 5
                    },
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-1.mp4",
                            "trim": 10
                        },
                        "start": 10,
                        "length": 5
                    },
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-2.mp4",
                            "trim": 15
                        },
                        "start": 15,
                        "length": 5
                    },
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-1.mp4",
                            "trim": 20
                        },
                        "start": 20,
                        "length": 5
                    },
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-2.mp4",
                            "trim": 25
                        },
                        "start": 25,
                        "length": 5
                    }
                ]
            }
        ]
    },
    "output": {
        "format": "mp4",
        "resolution": "sd"
    }
}

In this snippet the two videos are: https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-1.mp4 and https://shotstack-assets.s3.ap-southeast-2.amazonaws.com/motion-graphics/countup-2.mp4.


The trick will be to work out the start, the length (how long the clip plays for) and the trim (how far into the source video to begin, i.e. how much to chop off the start) for each clip.
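Since there may be hundreds of switches in your interview, it is probably easier to generate the clips array from the timestamp list with a small script than to write the JSON by hand. Below is a minimal sketch in Python under a few assumptions that aren’t from this thread: the timestamps are MM:SS strings as in your example, both recordings start at the same moment (so trim can simply equal start, just like the countup JSON above), and the source URLs are placeholders for your real host and guest files.

import json

# Placeholder URLs -- swap in your real, publicly reachable files.
SOURCES = {
    "Host": "https://example.com/host.mp4",
    "Guest": "https://example.com/guest.mp4",
}

# Speaker switches as (MM:SS, speaker) pairs, per the question.
switches = [
    ("00:00", "Host"),
    ("00:30", "Guest"),
    ("00:35", "Host"),
    ("00:40", "Host"),   # back-to-back same-speaker entries are fine:
    ("01:30", "Guest"),  # trim tracks start, so playback stays seamless
    ("02:30", "Host"),
]
TOTAL_SECONDS = 180  # overall interview length, supplied by you

def to_seconds(mmss):
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

clips = []
for i, (stamp, speaker) in enumerate(switches):
    start = to_seconds(stamp)
    # Each clip runs until the next switch (or the end of the video).
    end = to_seconds(switches[i + 1][0]) if i + 1 < len(switches) else TOTAL_SECONDS
    if end <= start:
        continue  # guard against duplicate or out-of-order stamps
    clips.append({
        "asset": {"type": "video", "src": SOURCES[speaker], "trim": start},
        "start": start,
        "length": end - start,
    })

payload = {
    "timeline": {"tracks": [{"clips": clips}]},
    "output": {"format": "mp4", "resolution": "sd"},
}
print(json.dumps(payload, indent=4))

The printed payload has exactly the same shape as the JSON above, so it can be submitted to the render endpoint as-is.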

This should all translate across to one of our SDKs as well, but hopefully this explains the general concept.

Nice, I’ll give it a try! Thank you 🙂

It works! Thank you 🙂


How can I merge the audio from the two tracks and play the merged audio on the video that we created above?

The audio needs to be just merged, it doesn’t need to use the timestamps above.

Is there a way to do it alongside the JSON above? That would be a lot easier than two separate requests.

Thanks!

What is the audio source? Is it in the videos? If it is in one of the videos, or if you have a separate mp3 file, you can mute the video clips by setting their volume to 0. Then create a new track with an audio clip. The audio clip should accept an mp4 file as well as an mp3 file; set its volume to 1 and play it for the entire length of the video.

That should give you a single, uninterrupted audio clip and the video clips will be muted so their audio is not heard.

I hope I explained that OK. Let me know if you need a JSON example.

Yes, it’s the audio from the videos.

Is there a way to use the JSON to merge the audio in this one request?

Or would we need to do one request with volume=0, another request to merge the audio, and another request to add the audio onto the video?

I think you can do this all in one render, and I’d want to avoid doing multiple renders to mix it together: you’ll get billed more, it will take longer, and video quality will deteriorate with each render.

This is what I am trying to explain in JSON:

{
    "timeline": {
        "tracks": [
            {
                "clips": [
                    {
                        "asset": {
                            "type": "video",
                            "src": "https://github.com/shotstack/test-media/raw/main/captioning/scott-ko.mp4",
                            "volume": 0
                        },
                        "start": 0,
                        "length": 10
                    },
                    {
                        "asset": {
                            "type": "audio",
                            "src": "https://github.com/shotstack/test-media/raw/main/captioning/scott-ko.mp4",
                            "volume": 1
                        },
                        "start": 0,
                        "length": 10
                    }
                ]
            }
        ]
    },
    "output": {
        "format": "mp4",
        "resolution": "sd"
    }
}
  • Set all your video clips to volume: 0
  • Create an audio clip using one of the video clips that has the audio; it doesn’t matter that it is a video file
  • Set the volume of the audio clip to 1 and its duration to the full duration of the overall video

Render that and see if you get the audio playing by itself and the videos muted.
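If you are generating the JSON with a script like the sketch further up, the audio track is only a few extra lines. This continues that earlier sketch and reuses its clips, SOURCES and TOTAL_SECONDS names, which were assumptions of mine rather than anything from this thread:

# Mute every generated video clip, then lay the full audio under
# them on a second track.
for clip in clips:
    clip["asset"]["volume"] = 0  # video clips become picture-only

audio_track = {
    "clips": [{
        "asset": {
            "type": "audio",
            "src": SOURCES["Host"],  # whichever file carries the audio
            "volume": 1,
        },
        "start": 0,
        "length": TOTAL_SECONDS,  # play for the entire video
    }]
}

payload = {
    "timeline": {"tracks": [{"clips": clips}, audio_track]},
    "output": {"format": "mp4", "resolution": "sd"},
}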

Nice, thanks! That makes sense.
