Cleaning Up Code Training Data for Smarter AI

Author: Denis Avetisyan


A new framework dynamically filters out noisy labels during training, boosting the performance and reliability of language models used for code-related tasks.

This study investigates loss distribution and noise behavior during a training process that incorporates MANTRA, revealing a workflow designed to optimize performance and stability.

MANTRA provides a multi-stage adaptive noise treatment method to improve data quality and model robustness in code summarization and commit intent classification.

Despite the growing reliance on large-scale datasets for training deep learning models in software engineering, the inherent noise and mislabeling within these repositories often degrades performance and robustness. This work introduces MANTRA: a Framework for Multi-stage Adaptive Noise TReAtment During Training, a novel approach that dynamically filters noisy data directly within the fine-tuning process for code-pretrained language models. Experiments across code summarization and commit intent classification demonstrate that MANTRA consistently improves model accuracy, even for models particularly sensitive to label noise. Could this adaptive filtering technique unlock more reliable and efficient fine-tuning strategies for a wider range of data-intensive software engineering applications?


Foundation: Elevating Software Engineering with Language Models

Large Language Models are rapidly transforming software engineering by automating tasks previously requiring significant human effort. These models demonstrate an impressive capacity for both understanding existing code and generating new code snippets, effectively bridging the gap between natural language instructions and executable programs. This automation extends beyond simple code completion; LLMs can assist with bug detection, code translation between programming languages, and even the generation of documentation. The increasing sophistication of these models promises to accelerate software development cycles, reduce development costs, and empower developers to focus on higher-level problem-solving, fundamentally changing the landscape of how software is created and maintained.

The burgeoning field of large language models for code is exemplified by a new generation of powerful tools, including CodeT5+, CodeLlama-7B-HF, StarCoder2-7B, Qwen2.5-Coder-7B, and CodeBERT. These models aren’t simply text processors adapted to code; they represent a significant leap in automated software engineering capabilities. CodeT5+ excels in both code generation and understanding through a unified text-to-text framework, while CodeLlama-7B-HF, specifically designed for code, demonstrates strong performance in code completion and debugging. StarCoder2-7B pushes the boundaries with its extensive training data and multi-lingual capabilities, and Qwen2.5-Coder-7B focuses on efficient performance and accessibility. Finally, CodeBERT, a pioneer in the field, laid the groundwork by pre-training on a vast corpus of code and natural language, enabling nuanced understanding of code semantics. Collectively, these models showcase the transformative potential of LLMs to assist developers, automate repetitive tasks, and ultimately accelerate the pace of software innovation.

The remarkable capabilities of code-focused large language models stem from extensive pre-training on vast datasets of publicly available code. This process, akin to providing a comprehensive education in multiple programming languages and software development practices, allows these models to learn the underlying structure, syntax, and semantics of code. By analyzing billions of lines of code from sources like GitHub, the models develop a statistical understanding of code patterns, enabling them to predict the next token in a sequence, complete code snippets, translate between programming languages, and even identify potential bugs. The sheer scale of these datasets – encompassing diverse coding styles, algorithms, and problem-solving approaches – is crucial; it allows the models to generalize beyond specific examples and perform complex code-related tasks with increasing accuracy and sophistication. Without this foundational pre-training, the models would lack the necessary knowledge to effectively reason about and generate code.

Loss distributions for CodeBERT and CodeT5+ at 5% noise show consistent performance across epochs 1, 2, 3, and 10.

The Challenge of Imperfect Data in Code

The accuracy of machine learning models trained on code datasets is frequently compromised by inaccuracies present in the labels or annotations within those datasets. These inaccuracies, often termed “noise,” can stem from human error during data labeling, automated labeling processes with inherent limitations, or the evolving nature of software development practices. Specifically, incorrect labels can misrepresent the ground truth, leading the model to learn spurious correlations and generalize poorly to unseen data. The prevalence of noisy data is particularly notable in tasks involving subjective interpretation, such as code summarization or intent classification, where multiple valid labels may exist, or the correct label is context-dependent. Consequently, model performance, measured by metrics such as precision, recall, and F1-score, is directly impacted, resulting in decreased reliability and potential for flawed predictions.

Noise in code datasets directly impacts the performance of tasks such as Commit Intent Classification and Code Summarization. In Commit Intent Classification, inaccurate labels associating commits with incorrect intent categories lead to misclassification and reduced accuracy in understanding project evolution. Similarly, in Code Summarization, noisy annotations linking code snippets to imprecise or irrelevant summaries result in generated summaries that are unhelpful or misleading. Consequently, models trained on such data exhibit diminished predictive power and produce unreliable outputs, hindering the effectiveness of software engineering applications that rely on these tasks.

Mitigating the impact of noisy labels during model training involves several techniques, including robust loss functions such as symmetric cross-entropy, which down-weights potentially mislabeled examples, and label smoothing, which reduces the confidence in any single label. Data augmentation strategies can also be employed to increase the diversity of the training data and reduce the influence of individual noisy instances. Furthermore, techniques like co-training and self-training leverage unlabeled data to improve model robustness, while methods for explicitly identifying and filtering noisy labels – based on prediction disagreement or confidence scores – are increasingly utilized. The selection of the appropriate technique depends on the nature and extent of the noise within the dataset and the specific characteristics of the learning task.
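To make the loss-function idea concrete, the sketch below shows one common formulation of symmetric cross-entropy in PyTorch. The weighting coefficients and the label clamp are illustrative defaults, not values taken from the study, and this is a general-purpose sketch rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0):
    """Standard cross-entropy plus a reverse term that bounds the gradient
    contribution of likely-mislabeled samples. alpha and beta are
    illustrative defaults, not values from the study."""
    ce = F.cross_entropy(logits, targets)

    # Reverse cross-entropy: swap the roles of predictions and one-hot labels,
    # clamping the labels so log(0) stays finite and the term remains bounded.
    probs = F.softmax(logits, dim=1).clamp(min=1e-7, max=1.0)
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).float()
    one_hot = one_hot.clamp(min=1e-4, max=1.0)
    rce = -(probs * torch.log(one_hot)).sum(dim=1).mean()

    return alpha * ce + beta * rce
```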

Loss density distributions reveal how model performance evolves with training epochs and varying levels of noise during code commit intent classification.

MANTRA: A Robust Framework for Noise Mitigation

MANTRA is a multi-stage adaptive noise treatment framework developed to improve the resilience of Large Language Models (LLMs) when applied to code-related tasks. This framework addresses the issue of noisy training data – code samples containing errors or inconsistencies – which can significantly degrade model performance. MANTRA’s adaptive approach dynamically identifies and mitigates the impact of these noisy samples throughout the training process, rather than relying on static filtering methods. The multi-stage design allows for iterative refinement of noise detection and treatment, progressively enhancing the model’s ability to generalize from imperfect data and maintain accuracy in the presence of code-related noise.

MANTRA employs Gaussian Mixture Models (GMMs) to model the distribution of training data and identify samples that deviate significantly from the expected distribution, flagging them as potentially noisy. This probabilistic approach allows for nuanced identification beyond simple thresholding. Coupled with GMMs, Adaptive Dropout dynamically adjusts the dropout rate for individual samples during training; samples identified as noisy by the GMM receive higher dropout rates, effectively reducing their influence on model parameters. This combination allows MANTRA to selectively down-weight or exclude problematic samples, mitigating the negative impact of noise on model performance without requiring manual labeling or pre-filtering of the training dataset.
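The sketch below illustrates the general loss-based idea: fit a two-component Gaussian Mixture Model to per-sample training losses, treat the posterior of the low-loss component as a "clean" probability, and map that probability to a per-sample dropout rate. The function names, the two-component choice, and the rate bounds are assumptions for illustration, not MANTRA's exact procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_probabilities(per_sample_losses):
    """Fit a two-component GMM to per-sample losses and return each sample's
    posterior probability of belonging to the low-loss ("clean") component."""
    losses = np.asarray(per_sample_losses, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, reg_covar=5e-4).fit(losses)
    clean_component = int(np.argmin(gmm.means_.flatten()))  # low-loss mode
    return gmm.predict_proba(losses)[:, clean_component]

def adaptive_dropout_rate(p_clean, base_rate=0.1, max_rate=0.5):
    """Hypothetical mapping from 'cleanliness' to a per-sample dropout rate:
    samples flagged as noisy (low p_clean) receive stronger dropout,
    reducing their influence on the model's parameters."""
    return max_rate - (max_rate - base_rate) * p_clean

# Example: per-sample losses collected at the end of an epoch.
losses = [0.21, 0.18, 2.90, 0.25, 3.40]
rates = adaptive_dropout_rate(clean_probabilities(losses))
```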

MANTRA utilizes Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, significantly reducing computational costs and improving training efficiency. LoRA achieves this by freezing the pre-trained language model’s parameters and introducing trainable low-rank decomposition matrices. These matrices are applied to specific layers, allowing the model to adapt to the task-specific data with a substantially smaller number of trainable parameters – typically less than 5% of the original model size. This approach minimizes GPU memory requirements and accelerates the fine-tuning process without significant performance degradation, making it feasible to train robust models even with limited computational resources.
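For readers unfamiliar with LoRA, the following minimal PyTorch sketch shows the core mechanism described above: the pretrained weight stays frozen and only a low-rank update is trained. The rank and scaling values are illustrative and are not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained linear layer with a trainable low-rank update.
    Only matrices A and B are trained, keeping the number of trainable
    parameters a small fraction of the original layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        # Output = frozen base projection + scaled low-rank update (B @ A) applied to x.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```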

Applying MANTRA consistently improves F1 scores across all noise levels and epochs for the tested LLMs compared to standard LoRA fine-tuning.

Validating MANTRA’s Performance

Evaluations reveal that MANTRA consistently enhances performance across crucial software engineering applications. In scenarios with perfectly clean data (that is, 0% noise), the system achieves a BLEU-4 score of 18.30, a modest improvement over the 18.24 score obtained without MANTRA's intervention. This indicates that even with high-quality inputs, MANTRA refines the output of large language models, leading to more accurate and contextually relevant results. While the difference at 0% noise is marginal, it establishes a baseline for the substantial gains demonstrated on more realistic, imperfect datasets, showing MANTRA's capacity to consistently elevate performance regardless of input quality.

Evaluations demonstrate MANTRA's resilience against data imperfections, specifically showcasing a marked improvement in performance when subjected to a 15% noise level. The system achieves a BLEU-4 score of 18.14 under these conditions, exceeding the 17.69 score obtained without MANTRA's intervention. Furthermore, MANTRA exhibits a substantial gain in the F1 score, reaching 69.2% compared to 59.6% without the system, an increase of nearly ten percentage points. These results highlight MANTRA's ability to not only maintain, but actively enhance, the reliability of large language models even when processing flawed or ambiguous data, thereby bolstering their utility in real-world software engineering applications.

Evaluations using the Qwen2.5-Coder-7B large language model under conditions of 15% label noise reveal a substantial performance increase facilitated by MANTRA. Specifically, the model, when paired with MANTRA, achieves an F1 score of 52.3%, representing a significant improvement over the 44.8% F1 score attained without MANTRA’s intervention. This demonstrates MANTRA’s capacity to bolster the reliability of LLMs even when confronted with imperfect or noisy training data, a common challenge in real-world software engineering applications. The gain in F1 score highlights MANTRA’s effectiveness in precisely identifying and correcting errors, ultimately leading to more accurate and dependable code-related outputs.

A key strength of MANTRA lies in its resilience to inaccuracies within training data; the system demonstrably limits performance decline to under 3% across critical software engineering tasks – code summarization and commit intent classification – even when subjected to 15% label noise. This robustness is particularly valuable in real-world scenarios where datasets are rarely perfectly curated, and errors or inconsistencies are commonplace. By mitigating the negative impacts of noisy labels, MANTRA ensures consistently reliable outputs and allows large language models to maintain a high degree of accuracy without requiring extensive and costly data cleaning processes, ultimately streamlining software development workflows and improving overall efficiency.

The promise of large language models (LLMs) in software engineering hinges on their ability to reliably process and interpret code, but real-world data is rarely clean. MANTRA directly tackles this challenge by mitigating the impact of noisy labels – inaccuracies and inconsistencies commonly found in code datasets. This innovative approach demonstrably improves performance across critical tasks like code summarization and commit intent classification, even when data contains significant errors. By effectively filtering out misleading information, MANTRA enables LLMs to generalize more robustly and deliver more accurate results, ultimately paving the way for greater automation and increased efficiency throughout the software development lifecycle. The system’s resilience to noise unlocks a level of dependability previously unattainable, allowing developers to confidently integrate LLMs into practical workflows and realize their full potential.

Across three large language models, training loss increased with noise levels of 5%, 10%, and 15%, as indicated by the divergence between clean-sample (solid line) and noisy-sample (dashed line) loss curves, demonstrating a degradation in commit intent classification performance.

The pursuit of robust models, as exemplified by MANTRA, inherently demands a holistic understanding of the training data. A system built upon flawed foundations, even with sophisticated filtering, will always exhibit limitations. This echoes Donald Knuth’s observation: “Premature optimization is the root of all evil.” MANTRA doesn’t simply attempt to fix noisy labels; it adapts to their presence throughout training, recognizing that a static approach to data quality is often insufficient. The framework’s multi-stage adaptive filtering acknowledges the interconnectedness of data and model behavior – if the system survives on duct tape, it’s probably overengineered – and strives for an elegant, dynamic solution rather than a brittle, corrective one. It is a clear demonstration that structure dictates behavior.

Future Directions

The introduction of MANTRA represents a step toward acknowledging that data, like any city’s infrastructure, is rarely pristine. Attempts to ‘correct’ labels often resemble wholesale redevelopment – disruptive and frequently ignoring the existing, functional components. A more nuanced approach, treating noise as a feature of the landscape rather than a defect, proves more resilient. However, the current framework operates within the confines of supervised learning, implicitly assuming a ‘ground truth’ against which to calibrate adaptation. The interesting question isn’t merely filtering noise, but whether a system can learn to function effectively – even thrive – in its presence, building internal models of data reliability.

Future iterations should investigate the interplay between MANTRA’s dynamic filtering and the inherent biases within large language models. A truly robust system wouldn’t simply remove problematic examples, but actively learn from them, distinguishing genuine errors from stylistic variations or edge cases. Moreover, extending the framework beyond code-related tasks requires careful consideration; the heuristics that define ‘noise’ are domain-specific, demanding adaptive strategies that move beyond pre-defined thresholds.

Ultimately, the goal is not perfect data, but a system capable of graceful degradation. Like a well-designed city, it should evolve – adding layers of redundancy and self-correction – rather than collapsing under the weight of imperfection. The challenge lies in moving from reactive filtering to proactive resilience, building models that anticipate and accommodate the inevitable entropy of real-world data.


Original article: https://arxiv.org/pdf/2512.04319.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
