The Semantic Equivalence Problem
When we evaluate AI systems for software engineering, we compare model predictions against reference answers. But an exact textual match is not always necessary for correctness. A summary stating that a Java method "reads the contents of this source as a string" is semantically equivalent to one stating that it "gets the textual information from this source and represents it as a string", yet token-overlap metrics will score the pair poorly because the wording differs.
This module explores why overlap-based metrics can fail and what alternatives exist.
The Core Problem
Consider these two summaries of the same code:
- Prediction: "Reads the contents of this source as a string."
- Ground Truth: "Get the textual information from this source and represent it as a string."
The two share only a handful of words, mostly function words such as "the" and "a" plus "source" and "string", yet the summaries are semantically equivalent. An n-gram overlap metric such as BLEU would score this 0.21, suggesting a poor prediction when in fact the prediction is correct. This is a fundamental challenge of evaluation: surface-level similarity is not the same as semantic correctness.
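To make the overlap concrete, here is a minimal Python sketch of the clipped n-gram precision that BLEU-style metrics build on. It is not a full BLEU implementation: it omits the brevity penalty, the geometric mean across n-gram orders, and smoothing. The helper names (tokenize, ngrams, clipped_precision) are illustrative, and the tokenizer is an assumption (lowercased words with punctuation dropped), so exact counts need not match figures computed with a different tokenizer or with function words removed.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation.

    This simple tokenizer is an assumption made for illustration.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n):
    """Fraction of the prediction's n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    pred_counts = Counter(ngrams(tokenize(prediction), n))
    ref_counts = Counter(ngrams(tokenize(reference), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in pred_counts.items())
    return matched / total


prediction = "Reads the contents of this source as a string."
reference = ("Get the textual information from this source "
             "and represent it as a string.")

for n in range(1, 5):
    score = clipped_precision(prediction, reference, n)
    print(f"{n}-gram precision: {score:.3f}")
```

Under these assumptions the precision drops quickly as n grows: the matched single words are mostly function words, only a few two-word sequences match, and no four-word sequence matches at all. Because BLEU combines the precisions across n-gram orders, a paraphrase like this one receives a low score even though its meaning is the same.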