Group 06 · Capstone · Spring 2026

GenAI Claim Verification — retrieval, evidence, calibrated confidence.

A factual-claim verifier that takes a sentence, retrieves supporting and contradicting passages from a curated corpus, and returns a labelled verdict — supported, contradicted, insufficient — with the evidence attached and a confidence score that is honest about what the retrieval actually found.

Domain Civic · verification

Stack Dense + BM25 hybrid · Curated corpus · Cross-encoder NLI · FastAPI · Next.js

Demoed · Spring 2026

I · What we built

What problem this solves.

Most consumer-facing fact-checking either reads like a confident verdict from a black box, or buries its evidence behind a long-form article. Neither model serves the person who has a single sentence and a five-second tolerance for ambiguity.

Group 06 built for that user: someone who needs to know whether a claim is supported by sources they would trust, with the evidence shown inline and a verdict that says "insufficient" out loud when the corpus does not actually settle the question.

II · How it works

The system, end to end.

The pipeline runs hybrid retrieval, then NLI inference. A claim is scored against a curated corpus by two retrievers in parallel: a dense retriever that embeds the claim, and a BM25 lexical retriever that matches its terms. The two ranked lists are fused with Reciprocal Rank Fusion (RRF). The top fused passages are then paired one at a time with the claim and passed to a cross-encoder NLI model, which labels each passage as entailing, contradicting, or neutral with respect to the claim.
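The fusion step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the team's code: the function name `rrf_fuse`, the passage IDs, and the constant `k=60` (a commonly used default) are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=5):
    """Fuse ranked lists of passage IDs with Reciprocal Rank Fusion.

    Each passage earns 1 / (k + rank) from every list it appears in;
    k=60 is an assumed default, not a value taken from the project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    fused = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return fused[:top_k]


# Hypothetical ranked outputs from the two retrievers.
dense_hits = ["p3", "p1", "p7"]
bm25_hits = ["p1", "p9", "p3"]

fused = rrf_fuse([dense_hits, bm25_hits], top_k=3)
print(fused)  # ['p1', 'p3', 'p9']
```

RRF uses only rank positions, so the dense similarity scores and the BM25 scores never need to be put on a common scale. A passage ranked well by both retrievers (here `p1` and `p3`) rises above one that only a single retriever found.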

Aggregation is deliberately conservative. A verdict of "supported" requires multiple entailing passages and no high-confidence contradictions; a "contradicted" verdict requires the inverse: multiple contradicting passages and no high-confidence entailments. Anything in between resolves to "insufficient evidence", which the team treated as a first-class output, not a failure mode.
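One way to read that rule as code is sketched below. The thresholds (`min_agree=2` for "multiple", `high_conf=0.8` for "high-confidence") and the names `PassageScore` and `aggregate` are assumptions for illustration; the page does not state the values the team used.

```python
from dataclasses import dataclass


@dataclass
class PassageScore:
    passage_id: str
    label: str         # "entailment", "contradiction", or "neutral"
    confidence: float  # model confidence in that label, 0..1


def aggregate(scores, min_agree=2, high_conf=0.8):
    """Conservative verdict: anything not clearly settled is 'insufficient'."""
    entail = [s for s in scores if s.label == "entailment"]
    contra = [s for s in scores if s.label == "contradiction"]

    # "Supported": several entailing passages, no high-confidence contradiction.
    supported = len(entail) >= min_agree and not any(
        s.confidence >= high_conf for s in contra
    )
    # "Contradicted": the mirror image of the rule above.
    contradicted = len(contra) >= min_agree and not any(
        s.confidence >= high_conf for s in entail
    )

    if supported and not contradicted:
        return "supported"
    if contradicted and not supported:
        return "contradicted"
    # Default, including the case where weak evidence points both ways.
    return "insufficient"


# Two entailing passages, but one strong contradiction blocks "supported".
mixed = [
    PassageScore("p1", "entailment", 0.90),
    PassageScore("p3", "entailment", 0.85),
    PassageScore("p9", "contradiction", 0.95),
]
print(aggregate(mixed))  # insufficient
```

The function never has to prove that evidence is insufficient. "Insufficient" is simply what is left when neither of the two strict conditions holds on its own.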

Pipeline · GenAI Claim Verification

Ingest · Claim in · user submission
Transform · Retrieval · BM25 + dense, RRF fused
Storage · Curated corpus · domain-whitelisted
Model · NLI inference · cross-encoder, per pair
Transform · Aggregation · support / contradict / insufficient
Surface · Verdict + cites · evidence inline
Transform · Corpus curation · domain whitelist + freshness

Legend: Ingest · Transform · Model · Storage · Surface · Feedback

III · The stack

What it's built on.

Layer	Tool / library
Retrieval	Dense retriever (sentence-transformer) over the curated corpus
NLI inference	Cross-encoder NLI model scores (claim, passage) pairs
Aggregation	Verdict logic: supported / contradicted / insufficient
Corpus curation	Domain whitelist enforced at ingest time

Retrieval

Dense retriever (sentence-transformer) over the curated corpus
BM25 lexical retriever as a parallel signal
Reciprocal Rank Fusion at the top-k boundary

NLI inference

Cross-encoder NLI model scores (claim, passage) pairs
Per-passage entailment / contradiction / neutral labels
Calibration check against a held-out evaluation set
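A calibration check of this kind is often summarised as expected calibration error (ECE). The sketch below assumes ECE as the metric and ten equal-width bins; the page does not say which metric the team used, and the function name and sample values are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct: booleans, True where the prediction matched the held-out label.
    """
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical held-out results: the model says 0.9 but is right 3 times in 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(ece, 3))  # 0.15
```

A value near zero means the reported confidence tracks how often the model is actually right. A large value is the "confident but wrong" pattern the project set out to avoid.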

Aggregation

Verdict logic: supported / contradicted / insufficient
Conservative thresholds; "insufficient" is the default
Confidence reported as a calibrated probability, not a percentage

Corpus curation

Domain whitelist enforced at ingest time
Per-source freshness metadata surfaced in the UI
Out-of-scope claim detector blocks unsupportable queries upstream
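The ingest-time whitelist and the freshness metadata could look roughly like the sketch below. The domains, the document shape (`url`, `published`), and the `age_days` field are placeholders chosen for the example, not details from the project.

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical whitelist; the real list of trusted domains is not published here.
ALLOWED_DOMAINS = {"example.gov", "example.org"}


def is_whitelisted(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def ingest(doc, today, allowed=ALLOWED_DOMAINS):
    """Return the document with freshness metadata, or None if it is rejected."""
    if not is_whitelisted(doc["url"], allowed):
        return None
    age_days = (today - doc["published"]).days
    return {**doc, "age_days": age_days}


doc = {"url": "https://data.example.gov/report", "published": date(2026, 1, 1)}
record = ingest(doc, today=date(2026, 1, 31))
print(record["age_days"])  # 30
```

Matching on the parsed hostname, rather than on a substring of the URL, keeps a lookalike such as `example.gov.evil.com` out of the corpus.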

IV · Deliverables

What the team shipped.

Source repository GitHub · code, tests, README

Demo video Capstone day · screen recording, 4–6 min

Write-up PDF Final brief · methods, evaluation, reflection

Slide deck Capstone presentation · 10 slides

V · What sets it apart

What sets this capstone apart.

Takeaway 01 · Insufficient is a verdict

"Don't know" ships as a first-class answer.

The system is allowed — and required — to return "insufficient evidence" when the retrieval doesn't settle the question. That refusal posture is what separates this from the confident-but-wrong fact-checkers it competes against.

Takeaway 02 · Evidence inline, every time

Show the passages, not the score.

Every verdict ships with the passages that produced it. The user reads the verdict, then reads the source. The score is incidental; the evidence is the product.

Takeaway 03 · Hybrid retrieval is the floor

Dense and lexical, fused.

Pure dense retrieval misses claims that turn on proper nouns and dates. Pure BM25 misses paraphrase. The system runs both and fuses — because either alone leaves a verdict on the floor that the other would have caught.

VII · Instructor note

How this project landed.

Verification projects fail two ways. They overreach by claiming to settle every claim, or they underreach by punting on hard cases without saying so. The early proposal for this team leaned toward the first failure mode — a single confident verdict with a percentage attached.

The reframe was small but load-bearing: make "insufficient" a first-class verdict, and the rest of the system follows. The capstone shipped a working pipeline, a calibrated aggregator, and a verdict surface that an honest reader could actually trust.

Related work in the cohort.

Stock Investment AI

RAG Rules · Ultimate Frisbee