Author: Denis Avetisyan
New research reveals that even small changes to numerical values within factual claims can significantly impact the accuracy of large language models’ veracity predictions.
This study introduces NumPert, a method for systematically perturbing numerical values to probe the robustness of open-weight language models on fact-checking tasks, and shows that prompting with perturbed examples can recover much of the lost accuracy.
Despite advances in knowledge-intensive tasks, large language models remain surprisingly fragile when reasoning about numerical information. This vulnerability is explored in ‘NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction’, a systematic evaluation of model robustness to subtle alterations in numerical claims used for fact verification. Our findings reveal substantial performance drops—up to 62% in some cases—demonstrating that even state-of-the-art systems are easily misled by minor numerical perturbations, though targeted prompting with perturbed examples can offer significant recovery. Given these limitations, what novel strategies are needed to build truly robust and reliable language models for numerical fact-checking?
The Fragility of Quantitative Reasoning
Automated fact verification systems increasingly depend on the capabilities of large language models, yet these models exhibit a notable weakness when confronted with claims centered around precise numerical values. While proficient at processing general knowledge and linguistic patterns, language models often struggle with the quantitative reasoning required to assess the accuracy of statements involving numbers, dates, or statistical data. This difficulty stems from their training primarily focusing on textual relationships rather than rigorous mathematical or logical deduction. Consequently, verifying claims like “The population of Tokyo is 14 million” or “The highest recorded temperature was 56.7°C” presents a significant challenge, as models may prioritize superficial textual similarity over actual numerical correctness, leading to inaccurate veracity assessments and highlighting the need for specialized techniques to enhance quantitative reasoning abilities.
Automated fact verification systems frequently encounter difficulties when assessing claims that involve specific quantities, a challenge stemming from the inherent limitations of traditional reasoning methods. These systems often struggle to maintain accuracy while processing numerical information, leading to unreliable predictions about a claim’s veracity. The core issue lies in the difficulty of accurately interpreting and manipulating quantities within the complex framework of natural language; subtle changes in numerical values or units can dramatically alter the meaning and truthfulness of a statement. Consequently, even sophisticated models can be easily misled by claims containing numbers, highlighting a critical vulnerability in current fact-checking technology and emphasizing the need for improved quantitative reasoning capabilities.
The efficacy of automated fact verification systems is being rigorously tested through specialized datasets like QuanTemp, which centers on claims involving numerical values and quantities. Current evaluations reveal a significant vulnerability in even the most advanced language models; when presented with slightly altered numerical claims – masked perturbations designed to test reasoning ability – state-of-the-art models achieve an accuracy of less than 26% in a zero-shot setting. This low score highlights a critical limitation in their capacity to reliably assess the veracity of statements dependent on precise quantities, and underscores the need for improved techniques in numerical reasoning for robust fact verification.
Probing for Weakness: The Art of Numerical Perturbation
Numerical perturbation assesses language model robustness by introducing controlled modifications to numerical values present in input claims. This method systematically alters quantities – such as dates, measurements, or counts – and then observes the resulting changes in the model’s output or predictions. By quantifying the impact of these subtle numerical shifts, researchers can gain insight into a model’s sensitivity and identify potential vulnerabilities to adversarial manipulation or inaccuracies in data interpretation. The technique allows for a granular evaluation of how well a model generalizes beyond the exact numerical values it was trained on, providing a more comprehensive measure of its overall reliability than simple accuracy metrics.
Numerical perturbation techniques systematically alter numerical values within a claim to evaluate a language model’s robustness. Num Perturbation involves replacing a numerical value with a slightly modified one, such as adding or subtracting a small constant. Approx Perturbation replaces the original number with its rounded equivalent, testing sensitivity to precision. Range Perturbation replaces the number with a randomly selected value within a defined range around the original value, simulating real-world data variance. These methods allow researchers to quantify how changes in numerical inputs affect model predictions and identify potential vulnerabilities related to numerical reasoning.
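As a rough illustration, a minimal Python sketch of these three value-altering perturbations might look as follows; the function names, default step size, and sampling window are assumptions chosen for demonstration rather than the paper's exact configuration.

```python
import random

# Illustrative sketches of the three value-altering perturbations.
# The step size and the +/-10% window are assumed defaults, not the
# settings used in the paper.

def num_perturbation(value: float, delta: float = 1.0) -> float:
    """Shift the value by a small constant, added or subtracted at random."""
    return value + random.choice([-delta, delta])

def approx_perturbation(value: float) -> float:
    """Replace the value with its rounded equivalent."""
    return float(round(value))

def range_perturbation(value: float, spread: float = 0.1) -> float:
    """Draw a replacement uniformly from a window around the original value."""
    return random.uniform(value * (1 - spread), value * (1 + spread))

original = 56.7  # e.g., the temperature figure from the claim above
print(num_perturbation(original), approx_perturbation(original), range_perturbation(original))
```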
Beyond simple numerical alterations, several perturbation techniques further probe language model sensitivity. Mask Perturbation replaces numerical values with a mask token, effectively removing the information. Neg-Num Perturbation alters the sign of numerical values, changing positive numbers to negative and vice versa. Rand-Repl Perturbation randomly replaces numerical values with other numbers drawn from a defined distribution. Empirical results indicate that models exhibit heightened vulnerability to both Mask Perturbation and Neg-Num Perturbation, suggesting these alterations disproportionately impact the model’s ability to correctly interpret and reason about the provided claims. These techniques provide a granular assessment of how models process numerical data and identify specific weaknesses in their reasoning capabilities.
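A matching sketch for the remaining three perturbations can be written in the same style; the mask string and the replacement range below are placeholder assumptions.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder; the actual mask string used in the paper may differ

def mask_perturbation(value: float) -> str:
    """Drop the numerical information entirely by substituting a mask token."""
    return MASK_TOKEN

def neg_num_perturbation(value: float) -> float:
    """Flip the sign of the value, turning positives negative and vice versa."""
    return -value

def rand_repl_perturbation(value: float, low: float = 0.0, high: float = 1000.0) -> float:
    """Replace the value with a number drawn from an assumed uniform range."""
    return random.uniform(low, high)

print(mask_perturbation(14.0), neg_num_perturbation(14.0), rand_repl_perturbation(14.0))
```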
Accurate identification and normalization of numerical values within input claims is a critical preprocessing step for numerical perturbation techniques. The Word2Number library is utilized to convert numerical words (e.g., “two”, “hundred”) into their corresponding numerical representation ($2$, $100$). SpaCy’s Named Entity Recognition (NER) capabilities are then employed to locate and classify numerical entities, distinguishing them from other text. This combined approach ensures that perturbations are applied to the intended numerical values and prevents misinterpretations or errors during the analysis of model responses. Failure to accurately identify numerical entities prior to perturbation can lead to inaccurate robustness assessments and misleading conclusions regarding model performance.
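A minimal sketch of this preprocessing step is shown below, assuming the spaCy en_core_web_sm model is installed and that CARDINAL, QUANTITY, PERCENT, and MONEY spans are the entity types a perturbation pipeline would target.

```python
import spacy
from word2number import w2n

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_numbers(claim: str):
    """Locate numeric entities with spaCy NER and normalize spelled-out numbers."""
    doc = nlp(claim)
    numbers = []
    for ent in doc.ents:
        if ent.label_ in {"CARDINAL", "QUANTITY", "PERCENT", "MONEY"}:
            try:
                numbers.append((ent.text, w2n.word_to_num(ent.text)))
            except ValueError:
                # Spans such as "56.7°C" are not handled by word2number;
                # a fuller pipeline would fall back to regex parsing here.
                pass
    return numbers

print(extract_numbers("The population of Tokyo is fourteen million."))
```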
Mitigating the Flaws: Prompting for Resilience
Prompting strategies are fundamental to eliciting desired responses from Large Language Models (LLMs). Zero-shot prompting involves presenting a claim to the model without any prior examples, relying on its pre-existing knowledge to determine veracity. In contrast, two-shot prompting provides the model with two example claim-veracity pairs before presenting the test claim, aiming to guide its reasoning process through demonstration. These techniques differ in their reliance on in-context learning; two-shot prompting seeks to establish a pattern for the model to follow, while zero-shot relies entirely on the model’s inherent capabilities. The choice between these strategies, and variations thereof, can significantly impact the model’s performance on tasks requiring logical inference and factual verification.
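To make the contrast concrete, the templates below sketch how the two strategies might be assembled; the instruction wording, the True/False/Conflicting label set, and the in-context example claims are illustrative assumptions rather than the paper's exact prompts.

```python
# Hypothetical prompt templates contrasting zero-shot and two-shot prompting.

ZERO_SHOT = (
    "Decide whether the following claim is True, False, or Conflicting.\n"
    "Claim: {claim}\n"
    "Answer:"
)

TWO_SHOT = (
    "Decide whether each claim is True, False, or Conflicting.\n"
    "Claim: The Eiffel Tower is 330 metres tall.\nAnswer: True\n"
    "Claim: The Eiffel Tower is 530 metres tall.\nAnswer: False\n"
    "Claim: {claim}\n"
    "Answer:"
)

prompt = TWO_SHOT.format(claim="The highest recorded temperature was 56.7°C.")
```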
Perturbation-Aware Prompting is a technique designed to improve the resilience of large language models (LLMs) when faced with slightly altered input claims. The method augments standard prompts with illustrative examples of perturbed claims paired with their correct veracity labels, demonstrating how small numerical or factual alterations should be evaluated. By explicitly showcasing perturbed claims and their corresponding correct assessments within the prompt itself, the LLM is better equipped to generalize and maintain accurate predictions even when presented with novel, yet similar, variations of the original claim. This approach effectively ‘teaches’ the model to reason about the role a quantity plays in the claim rather than being thrown off by superficial changes in its surface form.
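A perturbation-aware variant of the earlier template might interleave an original claim with perturbed versions and their correct labels; the claims, labels, and mask token here are invented purely for illustration and are not taken from the paper's prompt set.

```python
PERTURBATION_AWARE = (
    "Decide whether each claim is True, False, or Conflicting.\n"
    "Claim: Mount Everest is 8,849 metres tall.\nAnswer: True\n"
    # Neg-Num perturbation of the same claim, labelled with its new verdict.
    "Claim: Mount Everest is -8,849 metres tall.\nAnswer: False\n"
    # Mask perturbation: the quantity is removed, so the claim can no longer be
    # checked against the evidence (label chosen here for illustration only).
    "Claim: Mount Everest is [MASK] metres tall.\nAnswer: Conflicting\n"
    "Claim: {claim}\n"
    "Answer:"
)

prompt = PERTURBATION_AWARE.format(claim="The population of Tokyo is 14 million.")
```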
Evaluation of large language models, including DeepSeek-R1, Qwen3, Llama 3, Mistral, GPT-4o, and Gemini, includes assessing their robustness to numerical perturbations in input claims. Specifically, model performance is measured by the ability to predict veracity consistently even when presented with altered numerical values. In zero-shot testing, Llama 3.3-70B achieves 63% accuracy on claims containing negative-number perturbations, while DeepSeek-R1 exhibits an invalid-output rate of 6.98% under the same conditions, indicating a lower tolerance for such alterations without prior examples.
Analysis of reasoning tokens generated by Large Language Models indicates a potential issue termed “Overthinking,” characterized by the production of excessively verbose reasoning chains. This extended reasoning does not necessarily correlate with improved accuracy and can, in fact, degrade performance. Specifically, evaluation of misclassified instances from the Gemini 2.5FT model revealed an average 15% increase in the length of reasoning token sequences compared to instances where the model correctly predicted veracity. This suggests that longer, more elaborate reasoning chains are associated with incorrect predictions, highlighting a potential area for optimization in model prompting and training to encourage concise and focused reasoning.
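A back-of-the-envelope way to look for this pattern is to compare reasoning-trace lengths between correct and incorrect predictions; in the sketch below the record fields are assumed names, and whitespace token counts stand in for the model's actual reasoning-token counts.

```python
from statistics import mean

def reasoning_length(trace: str) -> int:
    # Crude proxy: whitespace tokens rather than the model's own token count.
    return len(trace.split())

def compare_reasoning_lengths(records):
    """records: iterable of dicts with assumed 'reasoning', 'predicted', 'gold' keys."""
    correct = [reasoning_length(r["reasoning"]) for r in records if r["predicted"] == r["gold"]]
    wrong = [reasoning_length(r["reasoning"]) for r in records if r["predicted"] != r["gold"]]
    return mean(correct or [0]), mean(wrong or [0])
```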
Towards Robust Fact-Checking: Implications and Future Directions
A fact-checking system’s reliability hinges on its capacity to remain accurate even when presented with subtly altered numerical data; this vulnerability to numerical perturbations represents a significant challenge for Language Models. Seemingly insignificant changes – a digit swapped, a decimal point shifted – can lead to demonstrably incorrect conclusions, particularly in tasks demanding precise quantitative reasoning. This is because many models treat numbers as tokens, and small alterations can disrupt the learned relationships between quantities and their associated facts. Consequently, building robust systems requires not only the ability to retrieve information but also to verify its numerical consistency, ensuring that even perturbed inputs yield dependable results. Addressing this weakness is paramount for deploying trustworthy fact-checking tools, especially as reliance on these systems grows within information ecosystems.
A thorough understanding of how different numerical perturbations impact language model performance is enabling the creation of more focused and effective evaluation methods. Researchers are moving beyond simple accuracy metrics to assess a model’s sensitivity to subtle changes in numerical data – such as altering digits, adding noise, or changing units of measurement. By systematically applying these perturbations, vulnerabilities in a model’s reasoning abilities become apparent, revealing whether it relies on superficial patterns or genuine understanding of quantities. This targeted approach allows developers to pinpoint specific areas for improvement, crafting datasets and training strategies that bolster robustness and ensure reliable fact-checking, particularly when dealing with complex numerical information and long-context reasoning tasks. Ultimately, it shifts the focus from simply detecting errors to proactively identifying and mitigating the conditions that cause them.
Recent research highlights the efficacy of numerical perturbation as a method for stress-testing and refining language models used in long-context fact-checking. By subtly altering numerical values within prompts, researchers can expose vulnerabilities in a model’s reasoning process, revealing where inaccuracies might arise. This technique led to the development of Perturbation-Aware Prompting (PAP), a strategy that significantly boosts accuracy; notably, PAP achieved up to 99% accuracy when tested on the Qwen3-32BT and DeepSeek-R1 language models. This suggests that intentionally challenging models with perturbed data not only identifies weaknesses but also paves the way for more reliable and robust fact-checking systems capable of handling complex numerical information.
The findings of this research illuminate a pathway towards more resilient language models capable of handling complex numerical reasoning tasks. Current architectures often struggle with even minor variations in numerical data, hindering their reliability in fact-checking and other critical applications. Consequently, future development will likely focus on innovative prompting strategies that guide models towards more stable and accurate calculations, alongside the exploration of novel model architectures specifically designed to mitigate the impact of numerical perturbations. These advancements promise to move beyond superficial pattern recognition, enabling language models to genuinely understand and process quantitative information, ultimately bolstering their trustworthiness in scenarios demanding precision and logical consistency. The goal is not simply to achieve high accuracy on benchmark datasets, but to build systems that maintain that accuracy even when confronted with real-world data complexities and potential adversarial manipulations.
The study illuminates a fragility within large language models—a susceptibility to deceptively minor alterations in numerical data. This echoes a fundamental principle of structural honesty; a system’s true strength isn’t measured by complexity, but by its resilience to disruption. As Tim Berners-Lee observed, “The Web is more a social creation than a technical one.” Similarly, this work suggests that evaluating a model’s ‘veracity prediction’ requires acknowledging the inherently social context of information—how easily facts can be subtly manipulated, and the resulting impact on trust. The pursuit of robustness, therefore, isn’t merely a technical challenge, but a crucial step towards a more reliable informational landscape.
What Remains?
The exercise, distilled, reveals not a failure of scale, but a curious fragility. The models, adept at mimicking understanding, stumble over alterations so slight a human would scarcely notice. This is not a matter of teaching them more facts, but of refining the very question. The vulnerability exposed by numerical perturbation isn’t a bug to be patched, but a symptom of a deeper reliance on superficial statistical relationships rather than grounded semantic comprehension. Subsequent work must address the core issue: how to move beyond pattern recognition and towards genuine, robust reasoning.
Future investigations should move beyond simple perturbation. The current paradigm tests for presence of factual recall. More telling will be explorations of absence – how models handle conflicting numerical information, or respond to deliberate ambiguity. Open-weight models, while offering increased transparency, do not inherently resolve this issue; the architecture itself demands scrutiny. The focus should shift from prompting strategies as a corrective, to an understanding of why those strategies work – what minimal interventions expose the underlying reasoning failures.
Ultimately, the pursuit of veracity prediction feels less like a technical challenge and more like an exercise in applied epistemology. The question is not merely whether a model can state a fact, but whether it understands what constitutes evidence, and how to evaluate its reliability. The remaining work, then, is not to add complexity, but to strip away the illusion of understanding, leaving only the essential core of reliable reasoning.
Original article: https://arxiv.org/pdf/2511.09971.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/