Group 02 · Capstone · Spring 2026

Multimodal Video Indexing — natural-language search across the archive.

A search interface that lets analysts and archivists query video collections in plain language — "the moment the speaker walks off-stage," "every shot with two people and a whiteboard" — by indexing each scene with joint vision-and-language embeddings, not just human-typed metadata.

Domain Search · multimedia

Stack CLIP / SigLIP · Whisper transcripts · FAISS index · React player · FastAPI

Demoed · Spring 2026

I · What we built

What problem this solves.

Most video archives are searchable only as well as someone bothered to tag them. The metadata is shallow, frequently wrong, and almost never describes what is actually inside the frame. Researchers, journalists, and instructors who need to find a specific moment end up scrubbing timelines by hand — a workflow that does not survive contact with a collection longer than a few hours.

Group 02 framed the problem as a retrieval one, not a transcription one. The goal was not to summarise a video. It was to make the archive answer ordinary questions: who is on screen, what is happening, and where in the runtime, with citations a user can jump to.

II · How it works

The system, end to end.

The system splits each video into scenes by shot-boundary detection, then computes two embeddings per scene: a visual embedding from a CLIP-family vision-language model, and a textual embedding over the Whisper-extracted transcript window. Each embedding goes into its own FAISS index, one per modality, keyed by (video_id, scene_id) with timecode metadata.
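A minimal sketch of the (video_id, scene_id) keying described above, assuming the common pattern of a side table next to the vector index: a plain FAISS index returns integer row positions in insertion order, so something has to map a row back to its scene and timecodes. The names `SceneRecord` and `SceneTable` and their fields are illustrative, not taken from the team's repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneRecord:
    """Metadata for one indexed scene (field names are illustrative)."""
    video_id: str
    scene_id: int
    start_s: float  # scene start, seconds from the beginning of the video
    end_s: float    # scene end, seconds from the beginning of the video


class SceneTable:
    """Maps a vector's row position back to its scene, in insertion order."""

    def __init__(self):
        self._rows = []    # row position -> SceneRecord
        self._by_key = {}  # (video_id, scene_id) -> row position

    def add(self, record):
        """Register a scene and return the row its vector should occupy."""
        key = (record.video_id, record.scene_id)
        if key in self._by_key:
            raise ValueError(f"duplicate scene key: {key}")
        row = len(self._rows)
        self._rows.append(record)
        self._by_key[key] = row
        return row

    def by_row(self, row):
        """Look up the scene for a row position returned by a vector search."""
        return self._rows[row]

    def row_of(self, video_id, scene_id):
        """Look up the row position for a (video_id, scene_id) key."""
        return self._by_key[(video_id, scene_id)]
```

If the same table is shared by the visual and transcript indices and vectors are added in the same order, one row number identifies the same scene in both modalities.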

Queries are dual-encoded the same way and scored against both indices; results are fused with a weighted Reciprocal Rank Fusion that favours visual evidence when the query is concrete and transcript evidence when it is referential. The UI returns a ranked list of scene thumbnails that play in place, with a synchronised transcript pane that scrolls to the matched line.
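The fusion step can be sketched as a weighted form of Reciprocal Rank Fusion. This is an illustration under assumptions, not the team's code: each modality contributes weight / (k + rank) per scene, `alpha` is the weight on the visual list, and `k = 60` is the constant commonly used with RRF. How the project chooses α for a given query is not shown here.

```python
def weighted_rrf(visual_ranked, transcript_ranked, alpha=0.5, k=60):
    """Fuse two ranked lists of scene keys into one ranking.

    alpha weights the visual list, (1 - alpha) the transcript list.
    k damps the advantage of the very top ranks (assumed default: 60).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scores = {}
    for weight, ranked in ((alpha, visual_ranked), (1.0 - alpha, transcript_ranked)):
        for rank, key in enumerate(ranked, start=1):
            scores[key] = scores.get(key, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by key so the output is deterministic.
    return sorted(scores, key=lambda key: (-scores[key], key))


# Scene keys are (video_id, scene_id) pairs, best match first in each list.
visual = [("v1", 3), ("v1", 7), ("v2", 1)]
transcript = [("v2", 1), ("v1", 3), ("v3", 4)]
fused = weighted_rrf(visual, transcript, alpha=0.5)
```

A scene that appears in only one list still receives a score, so a strong visual match is not dropped just because nothing relevant was said in it.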

Pipeline · Multimodal Video Indexing

Ingest · Video in: raw archive
Transform · Scene segmentation: PySceneDetect shot boundaries
Model · Embedding: CLIP / SigLIP + Whisper
Storage · Scene index: FAISS HNSW, one per modality
Ingest · User query: natural language
Transform · Fused ranking: RRF + per-query α blend
Surface · Player & UX: transcript synced to playhead

III · The stack

What it's built on.

Layer	Tool / library
Scene segmentation	PySceneDetect for shot-boundary detection
Embedding & indexing	CLIP / SigLIP for visual embeddings (per keyframe)
Fused ranking	Reciprocal Rank Fusion across both modalities
Player & UX	React-based scene-grid result view with hover preview

Scene segmentation

PySceneDetect for shot-boundary detection
Per-scene keyframe extraction at the midpoint
Timecodes preserved through the entire pipeline
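As a rough sketch of midpoint keyframe selection, assume the shot-boundary detector's output has already been reduced to plain (start_frame, end_frame) pairs with an exclusive end. The function name and the tuple layout are illustrative; the seconds value is carried alongside the frame number so the timecode survives later stages.

```python
def midpoint_keyframes(scenes, fps):
    """Return (scene_id, frame_number, seconds) for each scene's midpoint.

    scenes: list of (start_frame, end_frame) pairs, end exclusive.
    fps: frames per second of the source video.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    keyframes = []
    for scene_id, (start, end) in enumerate(scenes):
        if end <= start:
            raise ValueError(f"scene {scene_id} is empty: ({start}, {end})")
        mid = start + (end - start) // 2  # integer frame nearest the middle
        keyframes.append((scene_id, mid, mid / fps))
    return keyframes


# Two scenes in a 25 fps video: frames 0-99 and 100-249.
keyframes = midpoint_keyframes([(0, 100), (100, 250)], fps=25)
```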

Embedding & indexing

CLIP / SigLIP for visual embeddings (per keyframe)
Whisper transcript windows embedded with a text encoder
FAISS HNSW indices, one per modality
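One way to build the transcript window for a scene, sketched under assumptions: Whisper's transcription result includes a list of segments with "start", "end" and "text" fields, and the window is taken to be every segment that overlaps the scene's time span. The `pad` parameter is a hypothetical knob for pulling in a little context on either side; the source does not say how the team sized its windows.

```python
def transcript_window(segments, scene_start, scene_end, pad=0.0):
    """Join the text of every segment overlapping the (optionally padded) scene span.

    segments: dicts with "start", "end" (seconds) and "text", in time order.
    """
    lo, hi = scene_start - pad, scene_end + pad
    parts = [
        seg["text"].strip()
        for seg in segments
        if seg["start"] < hi and seg["end"] > lo  # strict overlap test
    ]
    return " ".join(parts)


segments = [
    {"start": 0.0, "end": 4.0, "text": " Welcome everyone."},
    {"start": 4.0, "end": 9.5, "text": " Today we cover retrieval."},
    {"start": 9.5, "end": 15.0, "text": " Let's begin."},
]
window = transcript_window(segments, scene_start=5.0, scene_end=9.0)
```

The resulting string is what would be passed to the text encoder; a scene with no overlapping speech yields an empty string, which the caller may want to skip rather than embed.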

Fused ranking

Reciprocal Rank Fusion across both modalities
Per-query α blending heuristic, based on whether the query is concrete or referential
Cross-encoder re-ranker as a fallback over the top 50 results
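The re-ranking step might look like the sketch below. It only shows the re-ordering: the first `n` fused results are re-scored and sorted, and the rest keep their fused order. The cross-encoder is represented by a plain callable `score_fn(query, text)`; which model the team used and when the fallback triggers are not stated in the source, so both are left open, and the word-overlap scorer is a toy stand-in.

```python
def rerank_top_n(query, fused, texts, score_fn, n=50):
    """Re-order the first n fused results by score_fn; keep the tail as is.

    fused: scene keys in fused-rank order.
    texts: mapping from scene key to the text the scorer should read.
    score_fn: callable (query, text) -> relevance score, higher is better.
    """
    head, tail = fused[:n], fused[n:]
    scored = [
        (score_fn(query, texts.get(key, "")), pos, key)
        for pos, key in enumerate(head)
    ]
    # Higher score first; the original fused position breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [key for _, _, key in scored] + tail


def word_overlap(query, text):
    """Toy stand-in for a cross-encoder: count shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


texts = {
    "a": "two people talking",
    "b": "speaker walks off stage",
    "c": "speaker at whiteboard",
    "d": "speaker walks away",
}
reranked = rerank_top_n("speaker walks off stage", ["a", "b", "c", "d"], texts, word_overlap, n=3)
```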

Player & UX

React-based scene-grid result view with hover preview
Transcript pane synchronised to playhead
Shareable deep-link URLs with embedded timecodes
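A deep link with embedded timecodes can be sketched with the standard library alone. The query parameter names (`v`, `start`, `end`) and the example base URL are assumptions made for illustration; the project's actual URL scheme is not given in the source.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_deep_link(base_url, video_id, start_s, end_s):
    """Encode a (video, start, end) citation as URL query parameters."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    query = urlencode({
        "v": video_id,
        "start": f"{start_s:.3f}",  # millisecond precision
        "end": f"{end_s:.3f}",
    })
    return f"{base_url}?{query}"


def parse_deep_link(url):
    """Recover the (video_id, start_s, end_s) tuple from a deep link."""
    params = parse_qs(urlsplit(url).query)
    return params["v"][0], float(params["start"][0]), float(params["end"][0])


url = build_deep_link("https://example.org/watch", "lecture 01", 12.5, 40.0)
citation = parse_deep_link(url)
```

Round-tripping through `parse_deep_link` gives the player the same tuple the search result carried, which is what lets a shared link open on the matched scene.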

IV · Deliverables

What the team shipped.

Source repository GitHub · code, tests, README

Demo video Capstone day · screen recording, 4–6 min

Write-up PDF Final brief · methods, evaluation, reflection

Slide deck Capstone presentation · 10 slides

V · What sets it apart

What sets this capstone apart.

Takeaway 01 · Scene-level, not video-level

Index the moment, not the file.

Most video search systems retrieve files. This one retrieves scenes. The index granularity is what changes the user experience: the result is a clip you can play, not a video you still have to scrub.

Takeaway 02 · Two modalities, one rank

Vision and transcript, fused.

Visual embeddings catch what was on screen. Transcript embeddings catch what was said. A fusion ranker decides which signal dominates per query — without forcing the user to pick a search mode.

Takeaway 03 · Citations as timecodes

Every result is a jump cut.

Every match resolves to a (video, start, end) tuple. The player honours the tuple on click. Verification is two seconds, not two minutes — and that is what makes the system usable beyond the demo.

VII · Instructor note

How this project landed.

The early framing of this project drifted toward summarisation: the team wanted a model to describe each video in prose. The first review pushed back: summarisation is a generation problem, but the user need is retrieval. The reframe to scene-level retrieval was the project's pivotal moment.

Once that was settled, the technical choices made themselves. Two modalities, one fused rank, one timecoded result. The team defended the fusion heuristic against several reasonable alternatives in critique and shipped a working player that holds up on real material.

Related work in the cohort.

Stock Investment AI

GenAI Claim Verification