Group 02 · Capstone · Spring 2026

Multimodal Video Indexing — natural-language search across the archive.

A search interface that lets analysts and archivists query video collections in plain language — "the moment the speaker walks off-stage," "every shot with two people and a whiteboard" — by indexing each scene with joint vision-and-language embeddings, not just human-typed metadata.

Domain Search · multimedia
Stack CLIP / SigLIP · Whisper transcripts · FAISS index · React player · FastAPI
Demoed · Spring 2026
I · What we built

What problem this solves.

Most video archives are searchable only as well as someone bothered to tag them. The metadata is shallow, frequently wrong, and almost never describes what is actually inside the frame. Researchers, journalists, and instructors who need to find a specific moment end up scrubbing timelines by hand — a workflow that does not survive contact with a collection longer than a few hours.

Group 02 framed the problem as a retrieval one, not a transcription one. The goal was not to summarise a video. It was to make the archive answer ordinary questions: who is on screen, what is happening, and where in the runtime, with citations a user can jump to.

II · How it works

The system, end to end.

The system splits each video into scenes by shot-boundary detection, then computes two embeddings per scene: a visual embedding of the scene's keyframe from a CLIP-family vision-language model, and a textual embedding over the Whisper-extracted transcript window. Each embedding goes into its own FAISS index, one per modality, keyed by (video_id, scene_id) with timecode metadata.
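As a rough illustration of the per-scene record described above, the sketch below assumes scene boundaries and transcript segments are already available as plain second offsets. The names `Segment`, `Scene`, `transcript_window`, and `build_records` are hypothetical and not taken from the team's repository, and the embedding and FAISS steps are deliberately left out so the example has no dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript segment, offsets in seconds (hypothetical shape)."""
    start: float
    end: float
    text: str


@dataclass(frozen=True)
class Scene:
    """One detected scene; scene_id is its position within the video."""
    video_id: str
    scene_id: int
    start: float
    end: float


def transcript_window(scene: Scene, segments: list[Segment]) -> str:
    """Join the text of every segment that overlaps the scene's [start, end) span."""
    overlapping = [s.text.strip() for s in segments
                   if s.start < scene.end and s.end > scene.start]
    return " ".join(overlapping)


def build_records(video_id, boundaries, segments):
    """Map (video_id, scene_id) to timecode metadata plus the transcript window.

    `boundaries` is a list of (start, end) pairs in seconds, one per scene.
    """
    records = {}
    for scene_id, (start, end) in enumerate(boundaries):
        scene = Scene(video_id, scene_id, start, end)
        records[(video_id, scene_id)] = {
            "start": start,
            "end": end,
            # Midpoint keyframe, as in the stack description; assumed here
            # to be the simple arithmetic midpoint of the scene.
            "keyframe_time": (start + end) / 2,
            "transcript": transcript_window(scene, segments),
        }
    return records


# Made-up example data, for illustration only.
segments = [
    Segment(0.0, 4.0, "Welcome everyone."),
    Segment(4.0, 9.5, "Let's look at the whiteboard."),
    Segment(9.5, 12.0, "Thank you."),
]
records = build_records("talk01", [(0.0, 5.0), (5.0, 12.0)], segments)
```

In this sketch a segment that straddles a shot boundary lands in both neighbouring scenes, which trades a little duplication for not losing spoken context at the cut. Whether the team's windows overlap this way is not stated in the write-up.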

Queries are dual-encoded the same way and scored against both indices; results are fused with a weighted Reciprocal Rank Fusion that favours visual evidence when the query is concrete and transcript evidence when it is referential. The UI returns a ranked list of scene thumbnails that play in place, with a synchronised transcript pane that scrolls to the matched line.
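One way to read the weighted fusion above: each scene scores α / (k + visual rank) + (1 − α) / (k + transcript rank), with a scene that is missing from one list contributing nothing for that modality. The sketch below is a minimal version under those assumptions. The function name `weighted_rrf`, the default k = 60 (a commonly used RRF constant), and the example rankings are all illustrative; the write-up does not state the team's k or how α is derived per query, so α is simply passed in.

```python
def weighted_rrf(visual_ranking, transcript_ranking, alpha=0.5, k=60):
    """Fuse two ranked lists of scene keys with weighted Reciprocal Rank Fusion.

    alpha weights the visual list and (1 - alpha) the transcript list.
    Ranks start at 1; a key absent from a list adds nothing for that list.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    scores = {}
    for weight, ranking in ((alpha, visual_ranking),
                            (1.0 - alpha, transcript_ranking)):
        for rank, key in enumerate(ranking, start=1):
            scores[key] = scores.get(key, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by key so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up rankings of (video_id, scene_id) keys, best match first.
visual = [("v1", 3), ("v1", 7), ("v2", 1)]
transcript = [("v2", 1), ("v1", 3), ("v3", 4)]

concrete = weighted_rrf(visual, transcript, alpha=0.8)     # lean on visual evidence
referential = weighted_rrf(visual, transcript, alpha=0.2)  # lean on transcript evidence
```

With these inputs the top result changes from ("v1", 3) at α = 0.8 to ("v2", 1) at α = 0.2, which is the behaviour the per-query blend is meant to produce.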

Pipeline · Multimodal Video Indexing
  • Ingest · Video in: raw archive
  • Transform · Scene segmentation: PySceneDetect shot boundaries
  • Model · Embedding: CLIP / SigLIP + Whisper
  • Storage · Scene index: FAISS HNSW, one per modality
  • Ingest · User query: natural language
  • Transform · Fused ranking: RRF + per-query α blend
  • Surface · Player & UX: transcript synced to playhead
III · The stack

What it's built on.

Layer · tool / library
Scene segmentation
  • PySceneDetect for shot-boundary detection
  • Per-scene keyframe extraction at the midpoint
  • Timecodes preserved through the entire pipeline
Embedding & indexing
  • CLIP / SigLIP for visual embeddings (per keyframe)
  • Whisper transcript windows embedded with text encoder
  • FAISS HNSW indices, one per modality
Fused ranking
  • Reciprocal Rank Fusion across both modalities
  • Per-query α blending heuristic on concrete-vs-referential
  • Re-ranker fall-back for top-50 with cross-encoder
Player & UX
  • React-based scene-grid result view with hover preview
  • Transcript pane synchronised to playhead
  • Shareable deep-link URLs with embedded timecodes
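The deep-link item above can be sketched with the standard library alone. The URL layout here (query parameters `v`, `start`, and `end`, and the `https://example.org/watch` base) is a hypothetical scheme chosen for illustration; the team's actual link format is not given in the write-up.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_deep_link(base_url, video_id, start, end):
    """Return a shareable URL carrying the scene's timecodes as query parameters."""
    query = urlencode({
        "v": video_id,
        "start": f"{start:.3f}",  # seconds, millisecond precision
        "end": f"{end:.3f}",
    })
    return f"{base_url}?{query}"


def parse_deep_link(url):
    """Recover (video_id, start, end) from a URL produced by build_deep_link."""
    params = parse_qs(urlparse(url).query)
    return params["v"][0], float(params["start"][0]), float(params["end"][0])


# Hypothetical base URL and scene, for illustration only.
link = build_deep_link("https://example.org/watch", "talk01", 5.0, 12.0)
video_id, start, end = parse_deep_link(link)
```

Round-tripping through `parse_deep_link` gives back the same (video, start, end) tuple, so a shared link and a clicked search result can drive the player through the same code path.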
IV · Deliverables

What the team shipped.

  • Source repository: GitHub · code, tests, README
  • Demo video: Capstone day · screen recording, 4–6 min
  • Write-up PDF: Final brief · methods, evaluation, reflection
  • Slide deck: Capstone presentation · 10 slides
V · What sets it apart

What sets this capstone apart.

Takeaway 01 · Scene-level, not video-level

Index the moment, not the file.

Most video search systems retrieve files. This one retrieves scenes. Index granularity is what changes the user experience: the result is a clip you can play, not a video you still have to scrub.

Takeaway 02 · Two modalities, one rank

Vision and transcript, fused.

Visual embeddings catch what was on screen. Transcript embeddings catch what was said. A fusion ranker decides which signal dominates per query — without forcing the user to pick a search mode.

Takeaway 03 · Citations as timecodes

Every result is a jump cut.

Every match resolves to a (video, start, end) tuple. The player honours the tuple on click. Verification is two seconds, not two minutes — and that is what makes the system usable beyond the demo.

VII · Instructor note

How this project landed.

The early framing of this project drifted toward summarisation — the team wanted a model to describe each video in prose. The first review pushed back: summarisation is a generation problem, but the user need is retrieval. The reframe to scene-level retrieval was the project's pivotal week.

Once that was settled, the technical choices made themselves. Two modalities, one fused rank, one timecoded result. The team defended the fusion heuristic against several reasonable alternatives in critique and shipped a working player that holds up on real material.