Semantic Chunking of Videos for AI
Recent multi-modal models like OpenAI's gpt-4o and Google's Gemini 1.5 models can comprehend video. When feeding video into these models, we can push frames at a set frequency (for example, one frame every second), but this method can be wildly inefficient and expensive.
Fortunately, there is a better method called "semantic chunking." Semantic chunking is a common method in text-based Retrieval-Augmented Generation (RAG), and with image embedding models we can apply the same logic to video. By measuring the similarity between the embeddings of consecutive frames, we can split a video wherever the semantic content of its frames changes.
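To make the idea concrete, here is a minimal sketch of similarity-based chunking in plain Python. It is not the library's implementation: the function names and the toy 2-D "embeddings" are made up for illustration, and a real pipeline would use embeddings produced by an image model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def chunk_consecutive(embeddings, threshold):
    """Group frame indices into chunks, starting a new chunk whenever the
    similarity between a frame and the previous frame drops below `threshold`."""
    if not embeddings:
        return []
    chunks = [[0]]
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([i])  # scene change: open a new chunk
        else:
            chunks[-1].append(i)  # same scene: extend the current chunk
    return chunks


# Toy 2-D "embeddings" (illustrative only): three similar frames, then a scene change.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = chunk_consecutive(toy_embeddings, threshold=0.6)
print(toy_chunks)
```

Each chunk is a list of frame indices, so the first three toy frames end up together and the last two form a second chunk.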
In this article, we'll take two test videos and chunk them into semantic blocks.
Getting Started
Let's start by loading a test video and splitting it into frames.
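One way to do this is with OpenCV. The sketch below is an assumption about how the loading step could look, not necessarily the code used for this article; `read_frames` and `load_video_frames` are hypothetical helper names, and `cv2` is imported lazily so the frame-reading loop can be exercised without OpenCV installed.

```python
def read_frames(capture, max_frames=None):
    """Collect frames from any object with an OpenCV-style read() method,
    which returns a (success, frame) pair and success=False once exhausted."""
    frames = []
    while max_frames is None or len(frames) < max_frames:
        success, frame = capture.read()
        if not success:
            break
        frames.append(frame)
    return frames


def load_video_frames(path):
    """Open a video with OpenCV and return its frames as RGB arrays."""
    import cv2  # lazy import: only needed when reading a real video

    capture = cv2.VideoCapture(path)
    try:
        frames = read_frames(capture)
    finally:
        capture.release()
    # OpenCV decodes frames as BGR; most image encoders expect RGB.
    return [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]
```

Calling `load_video_frames("my_video.mp4")` (a placeholder path) would return a list of NumPy arrays, one per frame, which can then be converted to whatever image type the encoder expects.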
Now that we have the frames loaded, we can go ahead and use the Chunker functionality to create splits based on frame similarity.
First, let's initialise our ViT Encoder.
Now let's initialise our chunker.
Note: currently, we can only use semantic_chunkers.chunkers.ConsecutiveChunker for image content.
100%|██████████| 4/4 [00:04<00:00, 1.20s/it]
100%|██████████| 249/249 [00:00<00:00, 74737.49it/s]
2 chunks identified
The video has two main camera angles, which are represented here by two semantic chunks. Each row represents one chunk, and the columns show frame samples from within that chunk.
Chunk #1 - scene 1, high angle shot of Big Buck Bunny looking up at a butterfly
Chunk #2 - scene 2, straight-up angle shot of Big Buck Bunny, with a distinct yellow background
Using ViT features from the frames, we were able to distinguish these two scenes.
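To build a grid like the one described above, with one row per chunk and a few sample frames per row, we need to pick evenly spaced frames from each chunk. Here is a small sketch of that selection step; `sample_evenly` is a hypothetical helper, and the frame indices in the toy chunks are made up rather than taken from the video.

```python
def sample_evenly(items, k):
    """Return up to k items spread evenly across the sequence, including both ends."""
    if k <= 0 or not items:
        return []
    if k == 1:
        return [items[0]]
    if len(items) <= k:
        return list(items)
    step = (len(items) - 1) / (k - 1)
    return [items[round(i * step)] for i in range(k)]


# Made-up chunks of frame indices (illustrative only).
toy_frame_chunks = [list(range(0, 40)), list(range(40, 100))]

# One row per chunk, five sampled frames per row.
rows = [sample_evenly(chunk, 5) for chunk in toy_frame_chunks]
for row in rows:
    print(row)
```

The same helper works on lists of actual frames instead of indices, so each row can be passed straight to a plotting routine.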
What about non-animated footage?
Depending on the complexity of the footage you're trying to semantically chunk, you might need to adjust the threshold parameter for semantic_chunker.
Let's use a public domain video from the automotive domain to demonstrate.
How to pick the right threshold?
It's an art as much as it is a science.
A lower threshold value means that the chunker is more lenient about accepting frames into a chunk, with a threshold of 0 placing all frames in a single chunk.
Conversely, the higher the threshold value, the stricter the chunker becomes, with a threshold of 1 putting each frame (besides 100% identical ones) into its own chunk.
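A quick way to build intuition for this is to sweep the threshold over a fixed set of similarity scores and count the resulting chunks. The sketch below assumes non-negative similarity scores between neighbouring frames and splits wherever a score falls below the threshold; the scores are made up for illustration, and `count_chunks` is a hypothetical helper rather than part of any library.

```python
def count_chunks(similarities, threshold):
    """Count chunks given similarity scores between consecutive frames.
    A new chunk starts wherever a score falls below the threshold.
    Assumes at least one frame, so there is always at least one chunk."""
    return 1 + sum(1 for score in similarities if score < threshold)


# Made-up similarity scores between neighbouring frames (8 frames, 7 scores).
toy_similarities = [0.97, 0.95, 0.58, 0.92, 0.99, 0.40, 0.88]

for threshold in (0.0, 0.5, 0.65, 0.9, 1.0):
    n_chunks = count_chunks(toy_similarities, threshold)
    print(f"threshold={threshold:.2f} -> {n_chunks} chunks")
```

With these toy scores, the chunk count grows from a single chunk at threshold 0 to one chunk per frame at threshold 1, which is the behaviour described above.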
For this video, we empirically found a value of 0.65 to work best.
100%|██████████| 18/18 [00:24<00:00, 1.34s/it]
100%|██████████| 1138/1138 [00:00<00:00, 246163.90it/s]