Evaluating Rigorously

01

The Semantic Equivalence Problem

§ 1 / 12

When we evaluate AI systems for software engineering, we compare model predictions against reference answers. But an exact textual match is not necessary for correctness. A summary stating that a Java method "reads the contents of this source as a string" is semantically equivalent to one stating that it "gets the textual information from this source and represents it as a string", yet token-overlap metrics will score the pair as a poor match.

This module explores why overlap-based metrics can fail and what alternatives exist.

The Core Problem

Consider these two summaries of the same code:

Prediction: "Reads the contents of this source as a string."
Ground Truth: "Get the textual information from this source and represent it as a string."

Only 6 of the ground truth's 13 tokens appear in the prediction (the, this, source, as, a, string), and most of them are function words, yet the summaries are semantically equivalent. An n-gram overlap metric such as BLEU would score this 0.21, suggesting a near-total failure when in fact the prediction is correct. This is the fundamental challenge of evaluation: surface-level similarity is not the same as semantic correctness.

At a glance: 12 metrics covered, 3 loss functions, SIDE as the final framework, 7 live demos.

02

Classification Metrics: Foundation

§ 2 / 12

Before turning to generative models, note that many SE tasks are classification problems. They are scored with the standard classification metrics below, which also underpin several of the later evaluation approaches.

Precision

Of all items predicted positive, how many truly are? TP / (TP + FP)

Recall

Of all actual positives, how many did we find? TP / (TP + FN)

F1 Score

Harmonic mean of precision and recall. 2 · P · R / (P + R)

Accuracy

Fraction of all predictions that are correct. (TP + TN) / (TP + TN + FP + FN). Can be misleading with imbalanced data.
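
The four definitions above can be computed directly from the confusion counts. The following is a minimal sketch, assuming binary labels; the function name `classification_metrics` and the toy data are illustrative and not part of the lecture. The data are deliberately imbalanced to show how accuracy can look strong while recall is poor.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F1 and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical, imbalanced toy data: 1 = buggy file, 0 = clean file.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy is 0.9 even though half of the buggy files are missed (recall 0.5), which is why precision, recall, and F1 are usually reported alongside it.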

03

BLEU: The Workhorse Metric

§ 3 / 12

BLEU (Bilingual Evaluation Understudy) measures how much the generated text's n-grams overlap with the reference. It is simple, fast, and widely used — but it has significant limitations for code evaluation. BLEU computes the precision of n-grams (sequences of 1, 2, 3, or 4 consecutive tokens) in the candidate that appear in the reference, then applies a brevity penalty to discourage short outputs.

BLEU = BP × exp(Σ w_n · log p_n)

where:
  p_n = modified n-gram precision
  w_n = 1/N (uniform weight, typically N=4)
  BP = min(1, e^(1 − r/c)) = brevity penalty
  r = reference length, c = candidate length

Worked Example

Reference: public int getMax(int[] arr)

Candidate: public int findMax(int[] array)

1-grams (whitespace-separated tokens): {public, int} match → precision = 2/4 = 0.50
2-grams: {public int} matches → precision = 1/3 = 0.33
Brevity penalty: both have 4 tokens → BP = 1.00
BLEU-2 = 1.00 × exp(0.5 × ln(0.50) + 0.5 × ln(0.33)) = 0.41

Renaming the method and its parameter alone dropped the score to 0.41, even though the code structure is identical.
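
The arithmetic in this worked example can be reproduced with a few lines of code. This is a minimal single-reference sketch, assuming whitespace tokenization and no smoothing; the helper names (`ngrams`, `modified_precision`, `bleu`) are illustrative, and production work would normally rely on an established BLEU implementation instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a reference n-gram is matched at most as often as it occurs."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


def bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    # Uniform weights w_n = 1 / max_n.
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "public int getMax(int[] arr)".split()
candidate = "public int findMax(int[] array)".split()
score = bleu(candidate, reference, max_n=2)
print(round(score, 2))  # 0.41
```

The early return for a zero precision is also why unsmoothed sentence-level BLEU so often collapses to 0 on short sequences.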

04

BLEU Limitations and Failure Modes

§ 4 / 12

BLEU is widely used but has well-documented failure modes, especially for code. Understanding these limitations is essential for interpreting BLEU scores correctly and knowing when to use alternatives.

Variable Renaming

Renaming sum to total drops BLEU significantly despite functional equivalence. BLEU treats every token equally — it cannot distinguish variable names from keywords. A single rename can drop BLEU by 10-30%.

Reordering

Swapping the order of independent statements breaks higher-order n-gram matches even though execution order may not matter. BLEU-4 can halve despite identical semantics.

No Semantic Understanding

x = a + b and x = b + a are mathematically identical. BLEU scores them differently because bigrams differ. Commutativity is invisible to BLEU.

Sentence-Level Unreliability

BLEU was designed for corpus-level evaluation (averaging over thousands of examples). Sentence-level BLEU is unreliable and often produces scores of 0 for short sequences. Never draw conclusions from a single BLEU score.

05

ROUGE, METEOR, and Recall-Oriented Metrics

§ 5 / 12

BLEU measures precision: how much of the candidate appears in the reference. But what about recall? Different tasks need different emphasis. ROUGE is recall-oriented: it asks what fraction of the reference's n-grams appear in the candidate. ROUGE-L instead uses the Longest Common Subsequence (LCS), rewarding in-order overlap without requiring contiguity.

ROUGE-1

Unigram recall: what fraction of reference unigrams appear in the candidate?

ROUGE-2

Bigram recall: captures some word-order information.

ROUGE-L

Longest common subsequence: rewards in-order overlap without requiring contiguity.

METEOR

Combines unigram matches (exact, stemmed, and synonym) with a word-order penalty. Scored as a recall-weighted F-mean.

ROUGE-L Worked Example

Reference: get the textual information from this source and represent it as a string (13 tokens)

Candidate: reads the contents of this source as a string (9 tokens)

LCS: "the", "this", "source", "as", "a", "string" = 6 tokens
ROUGE-L Recall: 6 / 13 = 0.462
ROUGE-L Precision: 6 / 9 = 0.667
ROUGE-L F1: 0.545

This is more generous than BLEU (0.21) because LCS recognizes the shared sequential structure despite different word choices.
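
The LCS computation behind this example is a standard dynamic program. The sketch below assumes whitespace tokenization, non-empty inputs, and a balanced F1 (no recall weighting); the function names are illustrative. Because F1 is symmetric, swapping which sentence plays the reference swaps recall and precision but leaves F1 unchanged.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (recall, precision, f1) based on the LCS of the two token lists."""
    lcs = lcs_length(reference, candidate)
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return recall, precision, f1


sentence_a = "reads the contents of this source as a string".split()
sentence_b = "get the textual information from this source and represent it as a string".split()

print(lcs_length(sentence_a, sentence_b))
print(rouge_l(sentence_a, sentence_b))  # sentence_a as reference
print(rouge_l(sentence_b, sentence_a))  # sentence_b as reference
```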

06

Embedding-Based Metrics and Vector Similarity

§ 6 / 12

Instead of comparing surface tokens, we can encode each text into a vector and measure geometric closeness. Embedding-based metrics transform the evaluation problem from string matching to distance in a learned semantic space.

1

Tokenization

Code is split into sub-word tokens using BPE. For example, getMaxValue might become ["get", "Max", "Value"], depending on the learned vocabulary.

2

Token Embedding

Each token ID maps to a learned 768-dim vector via a lookup table.

3

Transformer Layers

12 self-attention layers refine each token's vector using context from all other tokens.

4

Pooling

The [CLS] token's final hidden state (or mean of all tokens) serves as the fixed-length representation of the entire snippet.

cos(A, B) = (A · B) / (||A|| × ||B||)

Range: −1 (opposite) to +1 (identical direction)
0 = orthogonal (unrelated)

Key insight: Cosine similarity measures DIRECTION, not LENGTH.
Two vectors pointing the same way score 1.0 regardless of length.
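
The pooling step and the cosine formula can be illustrated without any model. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real embeddings (an assumption for readability; a real encoder would produce the vectors), and the helper names are illustrative.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-length snippet vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


# Hypothetical toy "embeddings"; real encoders use hundreds of dimensions.
snippet_a = mean_pool([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]])  # two token vectors
snippet_b = [4.0, 0.0, 4.0]   # same direction as snippet_a, twice the length
snippet_c = [0.0, 5.0, 0.0]   # orthogonal to snippet_a

print(cosine_similarity(snippet_a, snippet_b))
print(cosine_similarity(snippet_a, snippet_c))
```

The first pair scores (up to floating-point error) 1.0 despite the different lengths, and the second scores 0.0, matching the range described above.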

07

CodeBLEU: Structure and Semantic Awareness

§ 7 / 12

CodeBLEU extends BLEU by recognizing that code should be evaluated not only as text, but also as structure and behavior. It combines four components, each weighted 0.25 by default: n-gram match, weighted n-gram match (which gives language keywords more weight), AST match, and data-flow match.

AST Match

Compares syntactic structure. Parses both snippets into Abstract Syntax Trees, normalizes variable names, and measures the fraction of reference subtrees that appear in the candidate.

Example: Two for-loops with different loop variables score 1.0 on AST match because their structure is identical after normalization.

Data-flow Match

Compares value dependencies. Tracks how values are defined, used, and propagated. Two snippets with identical data-flow graphs score highly even if variable names differ.

Example: int x = a + b; int y = x * 2; and int temp = a + b; int result = temp * 2; have identical data-flow despite renaming.
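
The idea behind the AST match can be sketched with Python's built-in `ast` module. This is an illustration of the principle only, not CodeBLEU's actual implementation: it assumes Python snippets rather than the Java examples above, and the class and function names (`RenameIdentifiers`, `normalized_subtrees`, `ast_match`) are hypothetical.

```python
import ast
from collections import Counter


class RenameIdentifiers(ast.NodeTransformer):
    """Replace every variable name with a placeholder (v0, v1, ...) in order of first use."""

    def __init__(self):
        self.placeholders = {}

    def visit_Name(self, node):
        new_id = self.placeholders.setdefault(node.id, f"v{len(self.placeholders)}")
        return ast.Name(id=new_id, ctx=node.ctx)


def normalized_subtrees(code):
    """Serialize every subtree of the name-normalized AST."""
    tree = RenameIdentifiers().visit(ast.parse(code))
    return [ast.dump(node) for node in ast.walk(tree)]


def ast_match(reference, candidate):
    """Fraction of reference subtrees that also occur in the candidate."""
    ref = normalized_subtrees(reference)
    available = Counter(normalized_subtrees(candidate))
    matched = 0
    for subtree in ref:
        if available[subtree] > 0:
            available[subtree] -= 1
            matched += 1
    return matched / len(ref)


loop_a = "for i in range(n):\n    total += i\n"
loop_b = "for k in range(m):\n    acc += k\n"
print(ast_match(loop_a, loop_b))
```

The two loops differ in every identifier, yet the match is complete because their trees are identical after normalization; a structurally different snippet (for example a while-loop) scores below 1.0.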

08

pass@k: Functional Correctness and Test-Based Evaluation

§ 8 / 12

For code generation, we don't need EVERY output to be correct — just ONE. pass@k measures the probability that at least one of k generated solutions passes all test cases. It is the gold standard for code generation evaluation.

pass@k = 1 − C(n−c, k) / C(n, k)

where:
  n = total samples generated
  c = number that pass all tests
  k = number of attempts allowed

With n=100, c=23: pass@1 = 0.23, pass@10 ≈ 0.94, pass@100 = 1.00. This captures a fundamental property of code generation: users often generate multiple candidates and pick the best one. A model producing one correct solution out of ten is still useful.
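
The formula translates directly into code using exact binomial coefficients. This is a minimal sketch; the function name `pass_at_k` and the argument checks are illustrative choices, not taken from any particular benchmark harness.

```python
from math import comb


def pass_at_k(n, c, k):
    """pass@k estimate from n samples of which c pass: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c: every size-k subset then
    # contains at least one passing sample, so pass@k is exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


for k in (1, 10, 100):
    print(k, round(pass_at_k(n=100, c=23, k=k), 3))
```

Using integer binomials avoids the bias of the naive estimate 1 − (1 − c/n)^k, which treats the k attempts as independent draws with replacement.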

Benchmark	Coverage	Key property
HumanEval	164 hand-crafted Python problems	Standard: n=200, report pass@1, pass@10, pass@100
MBPP	974 crowd-sourced Python tasks	Simpler than HumanEval, broader coverage
LiveCodeBench	Continuously updated from competitive programming	Reduces contamination risk: problems can be restricted to those published after a model's training cutoff

09

Contrastive Learning: Shaping the Embedding Space

§ 9 / 12

If lexical metrics are not enough, how else might we learn what "similar meaning" looks like? The answer: shape the embedding space itself through contrastive learning. The idea is simple: train a model so that similar pairs stay close in embedding space and dissimilar pairs are pushed far apart.

Contrastive Loss

Pull similar pairs together, push dissimilar pairs apart up to a margin.

Triplet Loss

Anchor should be closer to positive than to negative by margin m.

N-pair Loss

Generalize to multiple negatives — distinguish the correct match from many wrong candidates.
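
The three losses can be written out for single examples to make the differences concrete. The sketch below uses common textbook formulations (squared-margin contrastive loss with Euclidean distance, hinge-style triplet loss, and a dot-product N-pair loss); these are assumptions about the exact variants, since the lecture names the losses without giving formulas, and the toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(a, b, similar, margin=1.0):
    """Similar pairs pay their squared distance; dissimilar pairs pay only inside the margin."""
    d = euclidean(a, b)
    return d ** 2 if similar else max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def n_pair_loss(anchor, positive, negatives):
    """log(1 + sum_i exp(a.n_i - a.p)): small when the positive outscores every negative."""
    pos_score = dot(anchor, positive)
    return math.log(1.0 + sum(math.exp(dot(anchor, n) - pos_score) for n in negatives))


# Hypothetical 2-dimensional embeddings.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.0, 1.0]]

print(contrastive_loss(anchor, positive, similar=True))
print(triplet_loss(anchor, positive, negatives[0]))
print(n_pair_loss(anchor, positive, negatives))
```

In training, these per-example values would be averaged over a batch and minimized with respect to the encoder's parameters.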

10

SIDE: Summary-to-Code Semantic Alignment

§ 10 / 12

Traditional metrics compare prediction vs. reference summary. But a summary can sound fluent and still be wrong with respect to the code. A stronger metric should measure alignment with code semantics. SIDE (Summary alIgnment to coDe sEmantics) learns a metric using contrastive learning on ~180K method-summary pairs, measuring whether the summary aligns with the meaning of the code itself, not just with a reference sentence.

Traditional Approach

Prediction vs. Reference Summary. A fluent-sounding but wrong summary might score well on overlap metrics.

SIDE Approach

Summary vs. Code Semantics. Encodes both code and summary through MPNet, compares semantic alignment directly.

Good Summary

"Create a connection to the consumer." aligns with the method's actual behavior. SIDE score: 0.81

Bad Summary

"Connect to the server and return the status." misrepresents the code. SIDE score: 0.23

Training Data

Contrastively trained on CodeXGLUE's 180K method-summary pairs to recognize semantic alignment.

Result

SIDE distinguishes good summaries from bad ones more reliably than surface-overlap metrics, bringing scores closer to human judgment.

11

Common Pitfalls and How to Avoid Them

§ 11 / 12

Even with the right metrics, evaluation can go wrong. Contamination — when test data leaks into pre-training corpora — makes evaluation results meaningless. A model trained on web data may have memorized solutions from HumanEval or MBPP. High scores then measure memorization, not generalization. Detection: check for long n-gram overlap (8-13 tokens) between test and training. Defense: use time-stamped benchmarks (LiveCodeBench).
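
The n-gram overlap check described above can be sketched as a set intersection. This is a simplified illustration, assuming whitespace tokenization and a training corpus small enough to hold in memory; the function names and example strings are hypothetical, and real contamination audits typically need normalization and indexing at scale.

```python
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_text, training_text, n=8):
    """Fraction of the test item's n-grams that also occur verbatim in the training text."""
    test_grams = ngram_set(test_text.split(), n)
    if not test_grams:
        return 0.0  # test item is shorter than n tokens
    train_grams = ngram_set(training_text.split(), n)
    return len(test_grams & train_grams) / len(test_grams)


# Hypothetical benchmark prompt and two hypothetical training documents.
test_item = "return the largest value in a non empty list of integers without using max"
leaked_doc = "forum post : " + test_item + " thanks"
clean_doc = "sort the list of integers in ascending order and return the first element of it"

print(contamination_rate(test_item, leaked_doc))
print(contamination_rate(test_item, clean_doc))
```

A rate near 1.0 indicates that the test item appears almost verbatim in the training text; what threshold should trigger exclusion is a judgment call for the evaluator.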

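Statistical rigor: a 2% BLEU improvement means nothing if it falls within the noise margin. Always report confidence intervals using bootstrap resampling. If Model A's BLEU is 32.4 [30.8, 34.1] and Model B's is 34.1 [32.3, 35.9], the confidence intervals overlap, so these numbers alone do not establish a statistically significant improvement. To test the difference directly, bootstrap the paired per-example score differences and check whether that interval excludes zero.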
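
Bootstrap resampling is straightforward to sketch. The code below assumes the metric is an average of per-example scores; for a corpus-level metric such as BLEU, each resample would instead require recomputing the corpus score. The function names, the number of resamples, and the synthetic scores are all illustrative assumptions.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the test set with replacement and record the mean each time.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


def paired_bootstrap_ci(scores_a, scores_b, **kwargs):
    """Interval for the mean difference (B - A) on the same test examples."""
    differences = [b - a for a, b in zip(scores_a, scores_b)]
    return bootstrap_ci(differences, **kwargs)


# Synthetic per-example scores for two hypothetical models on the same 200 examples.
rng = random.Random(42)
scores_a = [rng.random() for _ in range(200)]
scores_b = [min(1.0, s + 0.05) for s in scores_a]

print(bootstrap_ci(scores_a))
print(bootstrap_ci(scores_b))
print(paired_bootstrap_ci(scores_a, scores_b))
```

In this synthetic setup the two per-model intervals overlap heavily, while the paired interval for the difference lies entirely above zero, which illustrates why the paired comparison is the more informative test.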

Cherry-Picking

Showing only the best outputs creates misleading impressions. Fix: Report aggregate metrics over the full test set.

Wrong Baseline

Comparing to weak or outdated models inflates relative improvement. Fix: Compare against current state-of-the-art.

Overfitting to Benchmarks

Optimizing for metric scores (Goodhart's Law) rather than quality. Fix: Validate with held-out data and human evaluation.

Metric Gaming

Generating outputs of specific length or padding with safe tokens to exploit brevity penalties. Fix: Use multiple metrics; check quality manually.

12

Metric Selection Guide by Task

§ 12 / 12

Different tasks demand different metrics. Using the wrong metric can lead to misleading conclusions about model quality. Here is a decision framework for common SE tasks.

Task	Primary metric	Secondary metrics	When to use human eval
Code Generation	pass@k	CodeBLEU + BLEU	Always for final paper conclusions
Code Summarization	ROUGE + METEOR	BLEU + SIDE	To assess summary quality and alignment
Code Translation	CodeBLEU	Exact Match + AST similarity	When structure preservation matters
Bug Fixing	Exact Match + Test Pass	BLEU as secondary	Always — correctness is binary
Code Review	Human Evaluation	SIDE + Embedding Sim	Primary evaluation method
Documentation Generation	METEOR	ROUGE + Embedding Sim	For fluency and completeness
