- PySceneDetect for shot-boundary detection
- Per-scene keyframe extraction at the midpoint
- Timecodes preserved through the entire pipeline
Multimodal Video Indexing — natural-language search across the archive.
A search interface that lets analysts and archivists query video collections in plain language — "the moment the speaker walks off-stage," "every shot with two people and a whiteboard" — by indexing each scene with joint vision-and-language embeddings, not just human-typed metadata.
What problem this solves.
Most video archives are searchable only as well as someone bothered to tag them. The metadata is shallow, frequently wrong, and almost never describes what is actually inside the frame. Researchers, journalists, and instructors who need to find a specific moment end up scrubbing timelines by hand — a workflow that does not survive contact with a collection longer than a few hours.
Group 02 framed the problem as a retrieval one, not a transcription one. The goal was not to summarise a video. It was to make the archive answer ordinary questions: who is on screen, what is happening, where in the runtime, with citations a user can jump to.
The system, end to end.
The system splits each video into scenes by shot-boundary detection, then computes two embeddings per scene: a visual embedding from a CLIP-family vision-language model, and a textual embedding over the Whisper-extracted transcript window. Each embedding goes into its own FAISS index, one per modality, keyed by (video_id, scene_id) with timecode metadata.
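To make the indexing layout concrete, here is a minimal sketch of one index per modality sharing a (video_id, scene_id) key with timecode metadata. It is an illustration under stated assumptions, not the team's code: `SceneMeta` and `SceneIndex` are hypothetical names, the brute-force cosine search stands in for FAISS, and the three-dimensional vectors stand in for real CLIP and transcript embeddings.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class SceneMeta:
    """Timecode metadata stored alongside each indexed scene (hypothetical schema)."""
    video_id: str
    scene_id: int
    start_s: float  # scene start, in seconds from the beginning of the video
    end_s: float    # scene end, in seconds


class SceneIndex:
    """Brute-force cosine index for one modality; a stand-in for a FAISS index."""

    def __init__(self):
        self._vectors = []  # row i holds the unit-length embedding for self._meta[i]
        self._meta = []

    def add(self, meta, vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot index a zero vector")
        self._vectors.append([x / norm for x in vector])
        self._meta.append(meta)

    def search(self, query, top_k=5):
        """Return up to top_k (SceneMeta, cosine similarity) pairs, best first."""
        norm = math.sqrt(sum(x * x for x in query))
        if norm == 0:
            raise ValueError("cannot search with a zero vector")
        q = [x / norm for x in query]
        scored = [
            (meta, sum(a * b for a, b in zip(vec, q)))
            for meta, vec in zip(self._meta, self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# One index per modality, both addressed by the same (video_id, scene_id) key.
visual_index = SceneIndex()
transcript_index = SceneIndex()

scene = SceneMeta("talk_01", 3, start_s=42.0, end_s=57.5)
visual_index.add(scene, [0.9, 0.1, 0.0])      # toy stand-in for an image embedding
transcript_index.add(scene, [0.2, 0.7, 0.1])  # toy stand-in for a transcript embedding

hits = visual_index.search([1.0, 0.0, 0.0], top_k=1)
```

Because every hit carries its `SceneMeta`, a result from either modality resolves to the same scene key and the same start and end times.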
Queries are dual-encoded the same way and scored against both indices; results are fused with a weighted Reciprocal Rank Fusion that favours visual evidence when the query is concrete and transcript evidence when it is referential. The UI returns a ranked list of scene thumbnails that play in place, with a synchronised transcript pane that scrolls to the matched line.
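The fusion step can be sketched as a weighted form of Reciprocal Rank Fusion. This is one plausible reading, not the team's formula: the function name `weighted_rrf`, the single `alpha` weight, and the constant `k=60` (a commonly used RRF default) are all assumptions, and the heuristic that would pick `alpha` per query is not shown.

```python
def weighted_rrf(visual_ranking, transcript_ranking, alpha=0.5, k=60):
    """Fuse two ranked lists of scene keys with weighted Reciprocal Rank Fusion.

    alpha weights the visual list and (1 - alpha) the transcript list.
    k=60 is a commonly used RRF constant, not a value taken from the project.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scores = {}
    for weight, ranking in ((alpha, visual_ranking), (1.0 - alpha, transcript_ranking)):
        for rank, key in enumerate(ranking, start=1):
            scores[key] = scores.get(key, 0.0) + weight / (k + rank)
    # Highest fused score first; ties are broken by key for a deterministic order.
    return sorted(scores, key=lambda key: (-scores[key], key))


visual = [("talk_01", 3), ("talk_01", 7), ("talk_02", 1)]
transcript = [("talk_02", 1), ("talk_01", 3), ("talk_03", 4)]

concrete = weighted_rrf(visual, transcript, alpha=0.8)     # lean on visual evidence
referential = weighted_rrf(visual, transcript, alpha=0.2)  # lean on the transcript
```

With these toy rankings, the scene that tops the visual list wins when `alpha` is high, and the scene that tops the transcript list wins when `alpha` is low. A scene that appears in both lists still outranks one that appears in only one.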
What it's built on.
| Layer | Components |
|---|---|
| Scene segmentation | PySceneDetect for shot-boundary detection; per-scene keyframe extraction at the midpoint; timecodes preserved through the entire pipeline |
| Embedding & indexing | CLIP / SigLIP for visual embeddings (per keyframe); Whisper transcript windows embedded with text encoder; FAISS HNSW indices, one per modality |
| Fused ranking | Reciprocal Rank Fusion across both modalities; per-query α blending heuristic on concrete-vs-referential; re-ranker fall-back for top-50 with cross-encoder |
| Player & UX | React-based scene-grid result view with hover preview; transcript pane synchronised to playhead; shareable deep-link URLs with embedded timecodes |
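The re-ranking pass listed above could slot in after fusion roughly as follows. This is a sketch only: `rerank_top_n` is a hypothetical helper, `score_fn` stands in for a cross-encoder, and the conditions under which the pipeline falls back to re-ranking are not stated in the write-up, so the sketch simply re-orders the top of the list.

```python
def rerank_top_n(fused, query, score_fn, n=50):
    """Re-order only the first n fused results by score_fn(query, key).

    score_fn is a stand-in for a cross-encoder. Results beyond n keep their
    fused order and stay below the re-ranked head.
    """
    head, tail = fused[:n], fused[n:]
    # sorted() is stable, so equal scores preserve the fused order.
    reranked = sorted(head, key=lambda key: score_fn(query, key), reverse=True)
    return reranked + tail


# Toy relevance scores; a real scorer would read the query and the scene's content.
toy_scores = {("talk_01", 3): 0.2, ("talk_02", 1): 0.9, ("talk_01", 7): 0.5}


def toy_scorer(query, key):
    return toy_scores.get(key, 0.0)


fused = [("talk_01", 3), ("talk_02", 1), ("talk_01", 7), ("talk_03", 4)]
result = rerank_top_n(fused, "speaker walks off-stage", toy_scorer, n=3)
```

Capping the pass at a fixed `n` keeps the expensive scorer off the long tail of results while still letting it correct the order a user actually sees.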
What the team shipped.
What sets this capstone apart.
Index the moment, not the file.
Most video search systems retrieve files. This one retrieves scenes. Index granularity is what changes the user experience: the result is a clip you can play, not a video you still have to scrub.
Vision and transcript, fused.
Visual embeddings catch what was on screen. Transcript embeddings catch what was said. A fusion ranker decides which signal dominates per query — without forcing the user to pick a search mode.
Every result is a jump cut.
Every match resolves to a (video, start, end) tuple. The player honours the tuple on click. Verification is two seconds, not two minutes — and that is what makes the system usable beyond the demo.
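A (video, start, end) tuple maps naturally onto a shareable link. The sketch below shows one way to encode and decode it; the base URL and the parameter names `v`, `start`, and `end` are hypothetical, since the write-up does not describe the actual URL scheme.

```python
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://example.org/watch"  # hypothetical player route


def build_deep_link(video_id, start_s, end_s):
    """Encode a (video, start, end) result as a URL with millisecond precision."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    query = urlencode({"v": video_id, "start": f"{start_s:.3f}", "end": f"{end_s:.3f}"})
    return f"{BASE_URL}?{query}"


def parse_deep_link(url):
    """Recover the (video_id, start_s, end_s) tuple the player should honour."""
    params = parse_qs(urlparse(url).query)
    return params["v"][0], float(params["start"][0]), float(params["end"][0])


link = build_deep_link("talk_01", 42.0, 57.5)
video_id, start_s, end_s = parse_deep_link(link)
```

Keeping the times in seconds inside the URL means the player can seek directly without a second lookup against the index.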
How this project landed.
The early framing of this project drifted toward summarisation — the team wanted a model to describe each video in prose. The first review pushed back: summarisation is a generation problem, but the user need is retrieval. The reframe to scene-level retrieval was the project's pivotal week.
Once that was settled, the technical choices made themselves. Two modalities, one fused rank, one timecoded result. The team defended the fusion heuristic against several reasonable alternatives in critique and shipped a working player that holds up on real material.