Author: Denis Avetisyan
A new attack method subtly alters images of mathematical formulas to mislead even advanced AI systems like ChatGPT during text recognition.

This research demonstrates a black-box adversarial attack leveraging image skeletonization to reduce the search space and effectively deceive vision transformers processing LaTeX mathematical expressions.
Despite the increasing capabilities of large vision-language models, their robustness to subtle visual perturbations remains a critical concern, particularly when processing complex content like mathematical formulas. This work, ‘Skeletonization-Based Adversarial Perturbations on Large Vision Language Model’s Mathematical Text Recognition’, introduces a novel attack method leveraging image skeletonization to efficiently reduce the search space for adversarial examples. We demonstrate that this approach effectively deceives models like ChatGPT in recognizing LaTeX-formatted mathematical expressions, revealing vulnerabilities in their visual reasoning. Could targeted skeletonization-based attacks represent a broader threat to the reliability of vision-language models in critical applications requiring precise text interpretation?
The Allure of Vision-Language Models and the Cracks Beneath
The evolution of foundation models, initially dominant in natural language processing, has extended dramatically to encompass visual understanding, creating systems capable of interpreting complex imagery. This convergence of language and vision unlocks powerful applications, particularly in the field of mathematical expression recognition. These models can now process images containing expressions such as \frac{a}{b} + c, as well as far more complex equations, accurately identifying and translating them into a machine-readable format. Unlike traditional optical character recognition (OCR) systems limited to isolated characters, these vision-language models leverage contextual understanding, enabling them to decipher handwritten notes, diagrams, and even equations presented in non-standard formats. This capability promises to revolutionize fields like education, scientific research, and accessibility by automating the conversion of visual mathematical content into digital, editable forms.
Vision-Language Models, despite demonstrating impressive abilities in tasks like Mathematical Expression Recognition, exhibit a surprising vulnerability to Adversarial Attacks. These attacks involve the addition of carefully calculated, often imperceptible, perturbations to input images – subtle alterations undetectable to the human eye. While seemingly innocuous, these modifications can dramatically alter the model’s interpretation, causing it to misidentify mathematical formulas or produce incorrect results. The core of this susceptibility lies in the high-dimensional nature of image data and the model’s reliance on specific feature patterns; adversarial perturbations effectively exploit these patterns, nudging the model towards an incorrect classification with minimal change to the input. This presents a considerable challenge, particularly in applications where reliability is paramount, as even slight image distortions – perhaps introduced through image compression or minor physical disturbances – could lead to critical errors in automated reasoning or analysis of \lim_{x \to \infty} f(x).
The reliability of vision-language models extends to critical applications like automated theorem proving, scientific data analysis, and even medical diagnosis, where accurate mathematical expression recognition is paramount. However, these models exhibit a surprising fragility; meticulously designed, imperceptible image perturbations – known as adversarial attacks – can induce misinterpretations of mathematical formulas. For instance, a subtle alteration to a \frac{1}{2} could be misinterpreted as \frac{1}{7}, leading to drastically incorrect calculations or conclusions. This vulnerability isn’t simply a theoretical concern; it represents a genuine security risk in systems where model outputs directly inform decision-making, highlighting the need for robust defenses against such adversarial manipulations to ensure trustworthy performance in real-world deployments.

Deconstructing the Attack: From Pixels to LaTeX Errors – A Closer Look
Adversarial attacks leverage the susceptibility of deep learning models to carefully crafted input perturbations. These attacks function by introducing minimal, often imperceptible, noise to input images – changes that would not typically affect human perception – yet can cause the model to misclassify the image. This vulnerability arises from the high dimensionality of the input space and the non-linear nature of deep neural networks, where small changes in input can propagate and amplify through multiple layers. The resulting misclassification is not due to a lack of model training data, but rather a fundamental limitation in how these models generalize from training data to unseen examples, particularly those specifically designed to exploit these vulnerabilities. The magnitude of the required perturbation is often quantified using metrics like the L2 or L∞ norm, measuring the distance between the original and adversarial input.
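As a concrete illustration of how perturbation size is measured, the short Python sketch below computes the L2 and L∞ norms of the difference between a clean and a perturbed image. The arrays are placeholders standing in for real formula images, not data from the paper.

```python
import numpy as np

# Placeholder arrays standing in for a rendered formula image and its
# adversarially perturbed copy, both scaled to [0, 1].
clean = np.random.rand(64, 256)
perturbed = clean.copy()
perturbed[10, 42] += 0.05            # a single, small pixel change

delta = perturbed - clean
l2_norm = np.linalg.norm(delta.ravel(), ord=2)   # overall energy of the perturbation
linf_norm = np.abs(delta).max()                  # largest change to any single pixel

print(f"L2 = {l2_norm:.4f}, Linf = {linf_norm:.4f}")
```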
The One Pixel Attack method highlights the sensitivity of Deep Learning models to even minute input changes. This attack strategically alters a single pixel within an input image, and through iterative refinement, can induce misclassification with high probability. While seemingly counterintuitive, the impact stems from the high-dimensional nature of image data and the model’s reliance on specific feature activations. The altered pixel value propagates through the network’s layers, creating a sufficient change in the final output to cause an incorrect prediction; this demonstrates that robust image recognition requires more than simply achieving high overall accuracy and necessitates resistance to targeted, minimal perturbations. The effectiveness of this approach suggests vulnerabilities in the decision boundaries of the model, even when presented with seemingly valid inputs.
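A minimal sketch of the idea behind a one-pixel attack, assuming a black-box `predict` function that maps an image array to a label (a hypothetical stand-in, not the paper's setup): randomly alter a single pixel and keep only a change that flips the prediction. The original One Pixel Attack uses differential evolution rather than this simplified random-trial loop.

```python
import numpy as np

def one_pixel_attack(image, predict, true_label, trials=500, seed=0):
    """Try random single-pixel changes until the black-box model's label flips.

    `predict` is a hypothetical classifier interface; this random-trial loop
    is only an illustration of the single-pixel idea.
    """
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    for _ in range(trials):
        candidate = image.copy()
        y, x = rng.integers(0, height), rng.integers(0, width)
        candidate[y, x] = rng.random()        # overwrite one pixel with a random intensity
        if predict(candidate) != true_label:
            return candidate                  # misclassification achieved
    return None                               # no successful single-pixel change found
```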
Analysis of adversarial attacks on Optical Mathematical Recognition (OMR) systems indicates that input perturbations designed to cause misclassification frequently result in errors during LaTeX code generation. These errors range from incorrect symbol substitutions and misplaced delimiters to the complete failure to produce valid LaTeX, impacting the accurate representation of mathematical expressions. For example, a correctly rendered equation \int_{a}^{b} x^2 dx might be incorrectly generated as \int a b x^2 dx or produce a compilation error. This disruption occurs because the perturbations, while visually subtle to humans, alter the model’s internal feature representations, leading to incorrect interpretations of mathematical structures and, consequently, flawed LaTeX output.
To identify vulnerabilities in the Optical Mathematical Recognition (OMR) models Mathpix and pix2tex, we utilized Random Search as an optimization algorithm to refine adversarial perturbations. This involved generating numerous random perturbations to input images and evaluating the resulting errors in LaTeX code generation. Our experiments demonstrated that Random Search consistently outperformed both Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and Tree-structured Parzen Estimator (TPE) optimization algorithms in terms of success rate – that is, the frequency with which minimal perturbations resulted in incorrect LaTeX output. This suggests that, for this specific task of identifying weaknesses in OMR models, a simple, stochastic search strategy is more effective than evolutionary or Bayesian optimization alternatives.
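A rough sketch of how such a random-search attack loop might look in Python, under the assumption that `ocr_to_latex` wraps one of the OMR models (e.g., pix2tex) and `similarity` is the LaTeX comparison metric described in the next section; both names are placeholders, not the authors' code.

```python
import numpy as np

def random_search_attack(image, ocr_to_latex, similarity, clean_latex,
                         budget=200, eps=0.1, pixels=5, seed=0):
    """Randomly perturb a handful of pixels and keep the perturbation that
    drives the generated LaTeX furthest from the clean output.

    `ocr_to_latex` and `similarity` are assumed interfaces for the OMR model
    and the TF-IDF cosine metric; this is an illustrative loop, not the
    paper's implementation.
    """
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    best_image, best_score = image, 1.0
    for _ in range(budget):
        candidate = image.copy()
        ys = rng.integers(0, height, size=pixels)
        xs = rng.integers(0, width, size=pixels)
        noise = rng.uniform(-eps, eps, size=pixels)
        candidate[ys, xs] = np.clip(candidate[ys, xs] + noise, 0, 1)
        score = similarity(clean_latex, ocr_to_latex(candidate))
        if score < best_score:                 # lower similarity = stronger attack
            best_image, best_score = candidate, score
    return best_image, best_score
```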
Measuring the Damage: A Metric for Quantifying LaTeX Similarity
A new metric for quantifying the impact of adversarial attacks on mathematical expression recognition systems has been developed. This approach assesses attack severity by comparing the LaTeX code generated from both clean and perturbed images. The core principle involves treating the LaTeX sequences as textual data, enabling the application of natural language processing techniques to measure the dissimilarity between the original and attacked expressions. This method provides a quantitative evaluation of how significantly an adversarial perturbation alters the recognized mathematical content, independent of traditional accuracy metrics.
The evaluation of LaTeX sequence similarity employs Term Frequency-Inverse Document Frequency (TF-IDF) to determine the weight of each term within the generated LaTeX code. TF-IDF assigns higher values to terms that appear frequently within a specific LaTeX sequence but are less common across the entire dataset of LaTeX sequences, effectively highlighting the unique and important components of each expression. Following TF-IDF weighting, Cosine Similarity is calculated to measure the angle between the TF-IDF vectors representing the clean and adversarial LaTeX sequences; a smaller angle, indicated by a Cosine Similarity score closer to 1, signifies greater resemblance, while values decreasing towards 0 indicate increasing dissimilarity between the two LaTeX representations.
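A minimal sketch of this metric using scikit-learn, assuming a simplified token pattern that keeps backslash commands such as \frac as single tokens; the exact tokenization used by the authors may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def latex_similarity(clean_latex, attacked_latex):
    """Cosine similarity between TF-IDF vectors of two LaTeX strings.

    The token pattern keeps backslash commands (e.g. \frac) as single
    tokens; this tokenization is a simplifying assumption.
    """
    vectorizer = TfidfVectorizer(token_pattern=r"\\[A-Za-z]+|[A-Za-z0-9]+|[{}^_+\-=()]")
    vectors = vectorizer.fit_transform([clean_latex, attacked_latex])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

print(latex_similarity(r"\frac{a}{b} + c", r"\frac{a}{b} + c"))   # 1.0 for identical output
print(latex_similarity(r"\frac{1}{2}", r"\frac{1}{7}"))           # < 1 when the attack changed the code
```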
The severity of adversarial attacks on mathematical expression recognition systems is quantified through Cosine Similarity scores calculated between the TF-IDF vectors of LaTeX code generated from original and attacked images. The score represents the degree of resemblance between the two LaTeX sequences; lower scores indicate greater discrepancies introduced by the attack. Based on empirical analysis, a Cosine Similarity value below 1, as detailed in Table I, reliably identifies successful attacks – those where the model’s output has been significantly altered and no longer accurately represents the original mathematical expression. This threshold provides a consistent and objective metric for evaluating model robustness against various attack strategies.
The proposed LaTeX similarity metric enables a systematic evaluation of the robustness of mathematical expression recognition models, specifically Mathpix and pix2tex, when subjected to various adversarial attack strategies. By quantifying the difference in LaTeX code generated from clean and perturbed images, researchers can determine the degree to which an attack successfully alters the model’s output. This assessment is performed by calculating a Cosine Similarity score; scores below a defined threshold (< 1) indicate a compromised model, allowing for comparative analysis of model vulnerabilities across different attack vectors and facilitating targeted improvements to model resilience. The methodology provides a quantifiable measure, moving beyond subjective visual inspection, to assess the effectiveness of defenses against adversarial manipulation of mathematical expressions.
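Building on the `latex_similarity` helper sketched above, attack success can then be flagged with a simple threshold check. The strict < 1 cutoff mirrors the criterion described here, though in practice a small tolerance may be needed to absorb floating-point noise.

```python
def attack_succeeded(clean_latex, attacked_latex, threshold=1.0):
    """Flag an attack as successful when the generated LaTeX diverges from
    the clean output, i.e. the cosine similarity drops below the threshold."""
    return latex_similarity(clean_latex, attacked_latex) < threshold
```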
Towards Robust Recognition: Exploring Defense Strategies and the Inevitable Arms Race
Adversarial training emerges as a powerful technique for fortifying mathematical expression recognition models against deliberately crafted, misleading inputs. This method intentionally exposes the model to adversarial examples – subtly altered images designed to cause misclassification – during the training process. By learning to correctly identify expressions even when subjected to these perturbations, the model develops enhanced robustness. The core principle involves augmenting the training dataset with these adversarial samples, effectively teaching the model to disregard minor, malicious distortions. This proactive approach significantly improves performance under attack, allowing the system to maintain accurate recognition even when confronted with inputs specifically engineered to deceive it, thereby bolstering the reliability of automated mathematical workflows and assistive technologies.
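As a rough sketch of what such adversarial training can look like in practice, the PyTorch-style step below mixes clean images with FGSM-perturbed copies of the same batch. This is a generic recipe under assumed interfaces for the model and data, not the paper's exact training procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, eps=0.03):
    """One training step on a batch augmented with FGSM adversarial examples.

    `model` is assumed to be any image classifier returning logits; the
    augmentation shown is a generic recipe, not the paper's exact setup.
    """
    # Craft FGSM examples: step each pixel in the direction that increases the loss.
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv_images = (images + eps * images.grad.sign()).clamp(0, 1).detach()

    # Train on the combined clean + adversarial batch.
    optimizer.zero_grad()
    batch = torch.cat([images.detach(), adv_images])
    targets = torch.cat([labels, labels])
    train_loss = F.cross_entropy(model(batch), targets)
    train_loss.backward()
    optimizer.step()
    return train_loss.item()
```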
Research indicates that simplifying the input image through pre-processing techniques can significantly bolster a model’s resilience against adversarial attacks. Specifically, skeletonization – reducing an expression to its essential structural lines – and character bounding box detection, which isolates individual glyphs, serve to constrain the areas where malicious perturbations can effectively alter the image. By narrowing the search space for these subtle, yet impactful, modifications, these methods make it substantially more difficult for an attacker to craft an input that successfully fools the recognition system. This reduction in the effective ‘attack surface’ improves the model’s ability to correctly interpret mathematical expressions, even when presented with intentionally deceptive inputs.
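A minimal sketch of this pre-processing using scikit-image, assuming a grayscale formula image with dark ink on a light background; the skeleton and per-glyph bounding boxes together define the reduced region that an attack (or a defender's analysis) needs to consider.

```python
from skimage.morphology import skeletonize
from skimage.measure import label, regionprops

def candidate_attack_regions(formula_image, threshold=0.5):
    """Restrict analysis to the formula's skeleton and per-glyph bounding boxes.

    `formula_image` is assumed to be a grayscale array in [0, 1] with dark
    ink on a light background; thresholding details are illustrative.
    """
    ink = formula_image < threshold                 # binarize: True where there is ink
    skeleton = skeletonize(ink)                     # reduce strokes to one-pixel-wide lines
    boxes = [region.bbox for region in regionprops(label(ink))]
    # Each bbox is (min_row, min_col, max_row, max_col) around one connected glyph.
    return skeleton, boxes
```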
A combined defense strategy, leveraging both adversarial training and pre-processing techniques like skeletonization and character bounding box detection, demonstrably reduces the efficacy of adversarial attacks on mathematical expression recognition models. Specifically, the Fast Gradient Sign Method (FGSM) experiences a substantially lowered success rate when applied to models fortified with these combined defenses. Quantitative results, detailed in Table I, illustrate the impact of these ‘narrowing’ methods, showing significant improvements not only in overall recognition accuracy but also in Peak Signal-to-Noise Ratio (PSNR). This indicates that the defended models are not simply avoiding misclassification, but are also producing outputs that are closer to the original, unperturbed mathematical expressions, thus increasing the reliability and trustworthiness of the recognition process.
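PSNR, the image-quality measure referenced here, can be computed directly from the clean and perturbed images. The helper below assumes pixel values in [0, max_value]; higher scores indicate a less visible perturbation.

```python
import numpy as np

def psnr(clean, perturbed, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between a clean and a perturbed image."""
    clean = np.asarray(clean, dtype=np.float64)
    perturbed = np.asarray(perturbed, dtype=np.float64)
    mse = np.mean((clean - perturbed) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```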
A crucial step in verifying the efficacy of any defense strategy against adversarial attacks lies in subjecting it to realistic, evolving threats. Consequently, research now extends beyond traditional attack models to incorporate sophisticated language models like ChatGPT. These models, capable of generating nuanced and contextually relevant perturbations, present a more challenging and representative attack surface than previously considered. Evaluating defenses against such intelligent adversaries is paramount; a system robust against simple perturbations may still be vulnerable to attacks crafted by a language model capable of understanding the underlying mathematical structure and exploiting subtle weaknesses. This approach ensures that developed defenses aren’t merely effective in controlled settings, but possess genuine real-world applicability and can withstand the complexities of future, more advanced attacks.
The pursuit of robustness in large vision language models feels perpetually Sisyphean. This paper’s exploration of skeletonization-based adversarial perturbations, which reduce the search space for attacks, merely confirms what seasoned engineers already know: every defense becomes tomorrow’s bypass. It’s a clever technique, certainly, leveraging the models’ reliance on visual cues, but it’s hardly a surprise. As Yann LeCun once observed, “The best machine learning systems are the ones that fail gracefully.” This work demonstrates precisely how not to build one of those. The efficacy of this black-box attack against LaTeX OCR highlights a fundamental truth: production will always find a way to break elegant theories, and mathematical notation, it seems, is particularly vulnerable.
What’s Next?
The demonstrated efficacy of skeletonization as a reduction technique for adversarial attacks on vision-language models feels less like a breakthrough and more like a shifting of the problem. It efficiently narrows the search space for perturbations, certainly, but production systems will invariably discover ways to normalize, or simply ignore, these simplified representations. The models won’t ‘see’ the attack; they’ll see noise and learn to filter it, adding yet another layer of complexity to the already fragile ecosystem. This isn’t defense; it’s an escalating arms race where each ‘solution’ introduces new vectors for failure.
Future work will undoubtedly focus on more robust, representation-agnostic attacks. Perhaps the real challenge isn’t deceiving the model, but understanding why these models are so easily deceived by seemingly minor alterations. The current reliance on cosine similarity as a metric feels… optimistic. It captures superficial resemblance but fails to account for the semantic integrity of the mathematical expressions. A more principled approach, grounded in formal logic or symbolic reasoning, might yield more lasting results, though it will almost certainly be slower and require actual mathematics.
The inevitable conclusion? This research, like all research, will become a dependency. A footnote in a future paper detailing how to circumvent skeletonization-based defenses. CI is the temple, and it prays nightly that nothing breaks. Documentation, as always, remains a myth invented by managers.
Original article: https://arxiv.org/pdf/2601.04752.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/