Group 06 · Capstone · Spring 2026

GenAI Claim Verification — retrieval, evidence, calibrated confidence.

A factual-claim verifier that takes a sentence, retrieves supporting and contradicting passages from a curated corpus, and returns a labelled verdict — supported, contradicted, insufficient — with the evidence attached and a confidence score that is honest about what the retrieval actually found.

Domain Civic · verification
Stack Dense + BM25 hybrid · Curated corpus · Cross-encoder NLI · FastAPI · Next.js
Demoed · Spring 2026
I What we built

What problem this solves.

Most consumer-facing fact-checking either reads like a confident verdict from a black box, or buries its evidence behind a long-form article. Neither model serves the person who has a single sentence and a five-second tolerance for ambiguity.

Group 06 built for that user: someone who needs to know whether a claim is supported by sources they would trust, with the evidence inline and a verdict that says "insufficient" out loud when the corpus does not actually settle the question.

II How it works

The system, end to end.

The pipeline is hybrid retrieval feeding NLI inference. A claim is scored against a curated corpus by two retrievers in parallel: a dense retriever that embeds the claim, and a BM25 lexical retriever. The two ranked lists are fused with Reciprocal Rank Fusion. The top fused passages are then paired with the claim, one pair at a time, and passed to a cross-encoder NLI model that labels each passage as entailing, contradicting, or neutral with respect to the claim.
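
The fusion step can be illustrated with a minimal sketch of Reciprocal Rank Fusion, which the stack listing names as the fusion method. The function name, the passage ids, the smoothing constant k=60 (a commonly used default), and the top_n cutoff are all assumptions made for illustration, not the team's actual code or settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of passage ids into one ranked list.

    Each passage earns 1 / (k + rank) from every list it appears in.
    k and top_n are illustrative assumptions, not values from the project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by passage id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [passage_id for passage_id, _ in fused[:top_n]]


# Hypothetical result lists from the two retrievers, best hit first.
dense_hits = ["p3", "p1", "p7"]
bm25_hits = ["p1", "p9", "p3"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits], top_n=3)
print(fused)
```

Passages that both retrievers rank highly rise to the top, while a passage found by only one retriever can still make the cut. RRF uses only ranks, so the dense and BM25 scores never need to be put on a common scale.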

Aggregation is deliberately conservative. A verdict of "supported" requires multiple entailing passages with no high-confidence contradictions; a "contradicted" verdict requires the inverse. Anything in between resolves to "insufficient evidence" — which the team treated as a first-class output, not a failure mode.
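
One way to read the aggregation rule above is sketched below. The PassageLabel type, the min_agreeing count of 2 (standing in for "multiple"), and the 0.8 cutoff for "high-confidence" are hypothetical; the source does not state the team's actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class PassageLabel:
    passage_id: str
    label: str          # "entailment", "contradiction", or "neutral"
    probability: float  # NLI model probability for that label


def aggregate(labels, min_agreeing=2, high_confidence=0.8):
    """Return 'supported', 'contradicted', or 'insufficient'.

    min_agreeing and high_confidence are assumed values for illustration.
    """
    entailing = [x for x in labels if x.label == "entailment"]
    contradicting = [x for x in labels if x.label == "contradiction"]
    strong_entail = any(x.probability >= high_confidence for x in entailing)
    strong_contra = any(x.probability >= high_confidence for x in contradicting)

    supported = len(entailing) >= min_agreeing and not strong_contra
    contradicted = len(contradicting) >= min_agreeing and not strong_entail

    # Only a one-sided result yields a verdict; everything else,
    # including a two-sided tie, falls through to the default.
    if supported and not contradicted:
        return "supported"
    if contradicted and not supported:
        return "contradicted"
    return "insufficient"


verdict = aggregate([
    PassageLabel("p1", "entailment", 0.91),
    PassageLabel("p3", "entailment", 0.72),
    PassageLabel("p9", "contradiction", 0.85),
])
print(verdict)
```

In the usage example, two passages entail the claim but a single high-confidence contradiction blocks the "supported" verdict, and one contradiction is not enough for "contradicted", so the result is "insufficient".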

Pipeline · GenAI Claim Verification
Ingest · Claim in · user submission
Transform · Retrieval · BM25 + dense, RRF fused
Storage · Curated corpus · domain-whitelisted
Model · NLI inference · cross-encoder, per pair
Transform · Aggregation · support / contradict / insufficient
Surface · Verdict + cites · evidence inline
Transform · Corpus curation · domain whitelist + freshness
III The stack

What it's built on.

Layer · tool / library
Retrieval
  • Dense retriever (sentence-transformer) over the curated corpus
  • BM25 lexical retriever as a parallel signal
  • Reciprocal Rank Fusion at the top-k boundary
NLI inference
  • Cross-encoder NLI model scores (claim, passage) pairs
  • Per-passage entailment / contradiction / neutral labels
  • Calibration check against a held-out evaluation set
Aggregation
  • Verdict logic: supported / contradicted / insufficient
  • Conservative thresholds; "insufficient" is the default
  • Confidence reported as a calibrated probability, not a percentage
Corpus curation
  • Domain whitelist enforced at ingest time
  • Per-source freshness metadata surfaced in the UI
  • Out-of-scope claim detector blocks unsupportable queries upstream
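
The calibration check listed under NLI inference could take several forms. The sketch below uses expected calibration error (ECE) over equal-width confidence bins, which is one common choice; the source does not say which metric the team used, and the function name, bin count, and sample data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: predicted probabilities in [0, 1] for the chosen verdict.
    correct: booleans, True where the verdict matched the held-out label.
    n_bins is an illustrative assumption.
    """
    total = len(confidences)
    if total == 0 or total != len(correct):
        raise ValueError("need equal-length, non-empty inputs")

    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece


# Hypothetical held-out results: four verdicts at 0.95, four at 0.55.
confidences = [0.95] * 4 + [0.55] * 4
correct = [True, True, True, False, True, True, False, False]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))
```

A value near zero means that verdicts reported at a given confidence are right about that often on the held-out set. In the sample data the 0.95 bin is right only 75% of the time, which is the kind of overconfidence such a check is meant to expose.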
IV Deliverables

What the team shipped.

Source repository GitHub · code, tests, README
Demo video Capstone day · screen recording, 4–6 min
Write-up PDF Final brief · methods, evaluation, reflection
Slide deck Capstone presentation · 10 slides
V What sets it apart

What sets this capstone apart.

Takeaway 01 · Insufficient is a verdict

"Don't know" ships as a first-class answer.

The system is allowed — and required — to return "insufficient evidence" when the retrieval doesn't settle the question. That refusal posture is what separates this from the confident-but-wrong fact-checkers it competes against.

Takeaway 02 · Evidence inline, every time

Show the passages, not the score.

Every verdict ships with the passages that produced it. The user reads the verdict, then reads the source. The score is incidental; the evidence is the product.

Takeaway 03 · Hybrid retrieval is the floor

Dense and lexical, fused.

Pure dense retrieval misses claims that turn on proper nouns and dates. Pure BM25 misses paraphrase. The system runs both and fuses — because either alone leaves a verdict on the floor that the other would have caught.

VI Instructor note

How this project landed.

Verification projects fail two ways. They overreach by claiming to settle every claim, or they underreach by punting on hard cases without saying so. The early proposal for this team leaned toward the first failure mode — a single confident verdict with a percentage attached.

The reframe was small but load-bearing: make "insufficient" a first-class verdict, and the rest of the system follows. The capstone shipped a working pipeline, a calibrated aggregator, and a verdict surface that an honest reader could actually trust.