Semantic Chunking of Videos for AI

Open In Colab

Recent multi-modal models like OpenAI's gpt-4o and Google's Gemini 1.5 models can comprehend video. When feeding video into these new models, we can push frames at a set frequency (for example, one frame every second) — but this method can be wildly inefficient and expensive.

Fortunately, there is a better method called "semantic chunking." Semantic chunking is a common method used in text-based Retrieval-Augmented Generation (RAG), but using image embedding models, we can apply the same logic to video. By using the similarity between these frames, we can effectively split videos based on the constituent frames' semantic meaning.

In this article, we'll explore how to use two test videos and chunk them into semantic blocks.

Getting Started

Let's start by loading a test video and splitting it into frames

python

!pip install -qU \
    "semantic-chunkers[stats]" \
    "semantic-router[vision]==0.0.39" \
    opencv-python

python

import cv2

vidcap = cv2.VideoCapture("https://www.w3schools.com/html/mov_bbb.mp4")

frames = []
success, image = vidcap.read()
while success:
    frames.append(image)
    success, image = vidcap.read()
len(frames)

text

python

from PIL import Image

image_frames = list(map(Image.fromarray, frames))
len(image_frames)

text

Now that we have the frames loaded, we can go ahead and use the Chunker functionality to create splits based on frame similarity

First, lets initialise our ViT Encoder

python

import torch
from semantic_router.encoders import VitEncoder

device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available() else "cpu"
)
print(f"Using '{device}'")

encoder = VitEncoder(device=device)

text

    Using 'mps'

Now lets initialise our Splitter.

Note: currently, we can only use semantic_chunkers.chunkers.ConsecutiveChunker for image content

python

from semantic_chunkers import ConsecutiveChunker

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.6)

chunks = chunker(docs=[image_frames])
print(f"{len(chunks[0])} chunks identified")

100%|██████████| 4/4 [00:04<00:00, 1.20s/it]

100%|██████████| 249/249 [00:00<00:00, 74737.49it/s]

2 chunks identified

python

import matplotlib.pyplot as plt

f, axarr = plt.subplots(len(chunks[0]), 3, figsize=(20, 5))

for i, chunk in enumerate(chunks[0]):
    axarr[i, 0].imshow(chunk.splits[0])
    num_docs = len(chunk.splits)
    mid = num_docs // 2
    axarr[i, 1].imshow(chunk.splits[mid])
    axarr[i, 2].imshow(chunk.splits[num_docs - 1])

First video chunking results

The video has two main camera angles, which is represented here by two Semantic Chunks. (each row represents 1 chunk, columns represent frame samples within the chunk)

Chunk #1 - scene 1, high angle shot of Big Buck Bunny looking up at a butterfly

Chunk #2 - scene 2, straight-up angle shot of Big Buck Bunny, with a distinct yellow background

Using ViT features from frames, we were able to distinguish these two scenes

What about non-animated footage?

Depending on the complexity of the footage you're trying to semantically chunk, you might need to adjust the threshold parameter for semantic_chunker

Let's use a public domain video from the automotive domain to demonstrate

python

vidcap = cv2.VideoCapture(
    "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4"
)

frames = []
success, image = vidcap.read()
while success:
    frames.append(image)
    success, image = vidcap.read()
image_frames = list(map(Image.fromarray, frames))
len(image_frames)

text

How to pick the right `threshold`?

It's an art as much as it is a science.

A lower threshold value means that the chunker is more lenient to accepting frames within a chunk, with threshold 0 meaning all frames are just 1 chunk.

Conversely, the higher the threshold value, the stricter the chunker becomes, with threshold 1 putting each frame (besides 100% identical ones) into the same chunk.

For this video, we empirically found a value of 0.65 to work the best.

python

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.65)

chunks = chunker(docs=[image_frames])

100%|██████████| 18/18 [00:24<00:00, 1.34s/it]

100%|██████████| 1138/1138 [00:00<00:00, 246163.90it/s]

python

import matplotlib.pyplot as plt

f, axarr = plt.subplots(len(chunks[0]), 3, figsize=(20, 60))

for i, chunk in enumerate(chunks[0]):
    axarr[i, 0].imshow(chunk.splits[0])
    num_docs = len(chunk.splits)
    mid = num_docs // 2
    axarr[i, 1].imshow(chunk.splits[mid])
    axarr[i, 2].imshow(chunk.splits[num_docs - 1])

Second video chunking results

Semantic Chunking of Videos for AI

Getting Started

What about non-animated footage?

How to pick the right threshold?

How to pick the right `threshold`?