Numbers Don’t Lie, But BERTScore Might

Author: Denis Avetisyan


A new study reveals that standard NLP evaluation metrics struggle to accurately assess semantic similarity in financial text where numerical values play a critical role.

The FinNuE dataset is constructed through a dedicated pipeline designed to facilitate nuanced financial news understanding.

Researchers demonstrate that BERTScore is unreliable for evaluating financial natural language generation systems due to its inability to properly interpret numerical semantics.

While semantic similarity metrics like BERTScore are increasingly used to evaluate financial natural language processing systems, they often overlook critical nuances in numerical precision. This limitation is addressed in ‘FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance’, which demonstrates that BERTScore struggles to differentiate financially meaningful variations in numerical data—a 2% gain versus a 20% loss, for example. The authors introduce FinNuE, a novel diagnostic dataset, to highlight these shortcomings and reveal the fundamental limitations of embedding-based metrics in financial contexts. Will future evaluation frameworks prioritize numerically-aware approaches to more accurately assess the performance of financial NLP models?


The Illusion of Semantic Equivalence

Evaluating Natural Language Generation (NLG) fundamentally requires accurate measurement of semantic equivalence, a task that remains surprisingly difficult even with sophisticated automated metrics. Current embedding-based metrics such as BERTScore perform well in many contexts but falter when applied to specialized domains like finance. They struggle with the nuanced interpretation of numerical information: comparative analyses show BERTScore achieving near-random accuracy (approximately 49%) when comparing sentences that differ only in their numerical values. This deficiency hinders reliable assessment of NLG in finance, where even minor numerical discrepancies carry substantial meaning. Truly robust evaluation demands metrics sensitive to both linguistic and quantitative precision.
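
As a rough illustration of this failure mode, the sketch below (not taken from the paper; it assumes the open-source bert-score package and made-up sentences) compares a reference against candidates that differ only in the reported figure and its direction:

```python
# A minimal sketch of the failure mode: sentences that differ only in a
# financially critical number still receive near-identical BERTScore values.
# Requires the `bert-score` package; sentences are illustrative, not FinNuE data.
from bert_score import score

reference = ["The company's shares gained 2% after the earnings call."]
candidates = [
    "The company's shares gained 2% after the earnings call.",   # identical
    "The company's shares gained 3% after the earnings call.",   # small change
    "The company's shares lost 20% after the earnings call.",    # opposite outcome
]

for cand in candidates:
    # P, R, F1 are tensors with one entry per candidate/reference pair
    P, R, F1 = score([cand], reference, lang="en", verbose=False)
    print(f"{F1.item():.4f}  {cand}")

# In practice the F1 scores cluster tightly together, even though the last
# candidate describes a drastically different financial event.
```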

Diagnosing Numerical Blindness

The FinNuE dataset addresses a critical gap in evaluating natural language processing (NLP) metrics within finance. It was designed to probe numerical sensitivity, a characteristic often overlooked. FinNuE consists of sentence pairs identical except for variations in numerical values, allowing researchers to isolate the impact of these changes on metric performance. This enables systematic assessment of whether a metric correctly identifies semantic differences stemming solely from numerical alterations. Using FinNuE, evaluation can move beyond broad similarity scores to determine if a metric accurately reflects how numerical alterations affect meaning, crucial for financial NLP tasks like risk assessment or fraud detection.
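
To make the construction concrete, here is a hypothetical sketch of how such minimally differing sentence pairs might be generated; the regex and scaling factors are assumptions for illustration, not the authors' actual pipeline:

```python
# A hypothetical sketch of building FinNuE-style diagnostic pairs:
# locate the first number in a sentence and emit controlled perturbations.
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def perturb_numbers(sentence: str, factors=(1.1, 2.0, 10.0)):
    """Yield (original, variant, factor) tuples where the first number is scaled."""
    match = NUMBER.search(sentence)
    if match is None:
        return
    value = float(match.group())
    for factor in factors:
        variant = sentence[:match.start()] + f"{value * factor:g}" + sentence[match.end():]
        yield sentence, variant, factor

for orig, variant, factor in perturb_numbers("Net revenue rose 2.5% in Q3."):
    print(f"x{factor:<4} {variant}")
```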

Anchoring Accuracy: A Comparative Study

Anchor-based evaluation offers a robust methodology for assessing the numerical sensitivity of language models. These protocols, including triplet and listwise evaluation, compare a base sentence against numerically modified variants and test whether a metric ranks the numerically closer variant as more similar to the base. Triplet evaluation with random data augmentation yielded 91.86% accuracy for FinBERT and 92.14% for bert-base. Rule-based augmentation revealed a different profile: FinBERT achieved 84.31%, exceeding bert-base's 83.09%. These results indicate that FinBERT shows improved, though not flawless, numerical sensitivity relative to bert-base, suggesting a heightened capacity to process numerical relationships in text.
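
The following sketch shows the shape of such a triplet protocol, using BERTScore F1 as the metric under test; the sentences and the "close versus distant" perturbations are assumptions for illustration, not the FinNuE triplets themselves:

```python
# A minimal sketch of a triplet protocol: the metric passes a triplet if it
# rates the numerically close variant as more similar to the anchor than the
# numerically distant one. Requires the `bert-score` package.
from bert_score import score

def bertscore_f1(a: str, b: str) -> float:
    _, _, f1 = score([a], [b], lang="en", verbose=False)
    return f1.item()

# Each triplet: (anchor, numerically close variant, numerically distant variant)
triplets = [
    ("Revenue grew 2% year over year.",
     "Revenue grew 2.1% year over year.",
     "Revenue grew 20% year over year."),
    ("The fund lost 5% in March.",
     "The fund lost 5.2% in March.",
     "The fund lost 50% in March."),
]

correct = sum(
    bertscore_f1(anchor, close) > bertscore_f1(anchor, distant)
    for anchor, close, distant in triplets
)
print(f"triplet accuracy: {correct / len(triplets):.2%}")
```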

The Limits of Tokenization

BERTScore, despite advancements in semantic similarity assessment, exhibits limitations in contexts demanding precise numerical understanding. Its reliance on subword tokenization fragments numerical values, so the magnitude of a numerical difference between two sentences is largely invisible to the metric and similarity scores can be misleading. The algorithm’s greedy token alignment introduces further problems, producing spurious matches in which superficial token overlap does not indicate semantic equivalence. Listwise evaluation with bert-base confirmed these limitations, yielding a Kendall’s τb correlation of 0.54 under random augmentation and 0.342 under rule-based augmentation. These findings explain why BERTScore struggles to differentiate subtle yet critical numerical variations common in financial and scientific domains.
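
A quick way to see the fragmentation is to inspect the tokenizer directly; the sketch below assumes the transformers package and the public bert-base-uncased vocabulary, and the exact subword pieces depend on that vocabulary:

```python
# A small sketch showing how WordPiece tokenization fragments numerical values;
# multi-digit figures are rarely kept as single tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for sentence in [
    "Revenue rose 2.4% to $1,250 million.",
    "Revenue rose 24% to $12,500 million.",
]:
    print(tok.tokenize(sentence))

# Because each figure is split into several subword pieces, token-level greedy
# matching can align fragments across the two sentences and report high overlap,
# masking the tenfold difference in magnitude.
```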

Towards Principled Financial Evaluation

Evaluation of financial statement analysis models requires careful consideration of numerical sensitivity. Models such as FinBERT, pretrained on extensive financial text, can sharpen evaluation metrics relative to general-purpose language models. Yet cross-pair evaluation shows that bert-base and FinBERT achieve similarly low accuracy (48.15% and 48.6%, respectively) under rule-based augmentation, suggesting a fundamental difficulty in discerning subtle numerical changes within financial reports even with specialized language models. Combining such models with rigorous anchor-based protocols offers a promising pathway to more reliable assessment. Future research should focus on metrics that explicitly account for numerical magnitude and context within financial domains.
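
One direction such numerically aware metrics could take is sketched below: blend an embedding-based similarity with an explicit penalty on relative numeric disagreement. The weighting, penalty form, and helper names are hypothetical assumptions, not a method proposed in the paper:

```python
# A hypothetical numerically aware similarity: embedding similarity (BERTScore F1)
# mixed with a penalty on relative numeric disagreement between the two texts.
import re
from bert_score import score

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numeric_penalty(a: str, b: str) -> float:
    """Mean relative difference between aligned numbers (0 = identical)."""
    nums_a = [float(m) for m in NUMBER.findall(a)]
    nums_b = [float(m) for m in NUMBER.findall(b)]
    if not nums_a or len(nums_a) != len(nums_b):
        return 1.0  # treat missing or mismatched numbers as maximally different
    diffs = [abs(x - y) / max(abs(x), abs(y), 1e-9) for x, y in zip(nums_a, nums_b)]
    return min(1.0, sum(diffs) / len(diffs))

def numeric_aware_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Convex mix of BERTScore F1 and a numeric-agreement term (alpha is a free choice)."""
    _, _, f1 = score([a], [b], lang="en", verbose=False)
    return (1 - alpha) * f1.item() + alpha * (1.0 - numeric_penalty(a, b))

print(numeric_aware_similarity("Profit rose 2% this quarter.",
                               "Profit rose 20% this quarter."))
```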

The pursuit of effective evaluation in financial Natural Language Processing demands rigorous scrutiny, as demonstrated by the analysis of BERTScore’s limitations. The study reveals a susceptibility to numerical perturbations, undermining its reliability for gauging semantic similarity in contexts where precision is paramount. This echoes Carl Friedrich Gauss’s sentiment: “Few things are more deceptive than obviousness.” The ‘obvious’ utility of BERTScore, it turns out, masks a critical flaw – its inability to discern nuanced numerical relationships. The work advocates for more robust metrics, aligning with the principle that true understanding arises not from superficial assessment, but from a careful dissection of underlying complexities and potential vulnerabilities.

What Remains?

The persistence of automated evaluation metrics, even when demonstrably inadequate, reveals a fundamental human inclination: to quantify the unquantifiable. This work does not offer a superior metric; it merely exposes the failure of a convenient illusion. The temptation to assess financial natural language generation via proxy – semantic similarity judged by models ignorant of numerical consequence – will undoubtedly linger. A system that needs instructions has already failed; to judge its output by similarly flawed means is simply compounding the error.

Future work need not chase ever-more-complex algorithms for semantic comparison. Rather, the focus should shift toward defining what constitutes ‘correctness’ in financial text, independent of superficial similarity. The question is not whether a machine can mimic understanding, but whether it can reliably represent information critical to decision-making. Dataset construction, then, must prioritize the impact of textual changes, not merely their linguistic distance.

Clarity is courtesy. The enduring challenge lies not in building cleverer models, but in acknowledging the limits of current evaluation. A truly robust assessment will require a return to first principles: what does it mean for financial text to be true, and how can that truth be determined with a minimum of artifice? The pursuit of elegance demands ruthless reduction, not endless complication.


Original article: https://arxiv.org/pdf/2511.09997.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
