Spotting the Scam: A New Dataset to Combat Crypto Rug Pulls

Author: Denis Avetisyan


Researchers have released a rigorously curated dataset designed to help detect fraudulent token projects before investors lose their funds.

A data pipeline aggregates information from blockchain platforms, open-source intelligence, and security reports, temporally aligning and verifying it to create a leakage-resistant dataset specifically designed for the early detection of rug-pull schemes.

TM-RugPull provides a comprehensive, leakage-resistant collection of 1,028 token projects for improved blockchain fraud detection using multimodal analysis and on-chain data.

Despite the increasing prevalence of fraudulent “rug pull” schemes within the tokenized ecosystem, robust research into early detection is hampered by a lack of suitable datasets. To address this critical gap, we introduce ‘TM-RUGPULL: A Temporally Sound, Multimodal Dataset for Early Detection of RUG Pulls Across the Tokenized Ecosystem’, a rigorously curated collection of 1,028 token projects designed to mitigate temporal data leakage and provide comprehensive multimodal signals. This dataset, spanning DeFi, meme coins, NFTs, and celebrity tokens, enables causally valid analysis and establishes a new benchmark for reproducible fraud detection research. Will this improved resource unlock more effective, proactive defenses against these increasingly sophisticated financial threats?


Deceptive Designs: Unmasking the Threat of Rug Pulls

The burgeoning decentralized finance (DeFi) ecosystem, celebrated for its potential to revolutionize financial systems, unfortunately attracts malicious actors engaging in “rug pulls.” These exit scams, increasingly prevalent, involve developers abandoning a project and absconding with investor funds – often leaving little recourse for those affected. While the promise of DeFi lies in its transparency and accessibility, the speed and pseudonymous nature of blockchain transactions create a fertile ground for fraudulent schemes. Millions of dollars have already been lost to rug pulls, eroding investor confidence and hindering the broader adoption of decentralized finance. The allure of quick profits and minimal regulatory oversight within the DeFi space presents a significant challenge, demanding proactive measures to protect participants from these damaging scams.

The inherent speed and decentralized nature of blockchain transactions present a significant challenge to conventional fraud detection systems. Traditional methods, often reliant on centralized databases and manual review, simply cannot keep pace with the velocity of activity within the decentralized finance (DeFi) space. These systems struggle to analyze the complex web of smart contracts, token swaps, and liquidity pools that characterize DeFi, leaving investors particularly vulnerable to rapidly executed exit scams. Because fraudulent activity can occur and funds can be drained within minutes, the time-consuming processes of legacy fraud detection prove ineffective, creating a critical gap in investor protection and highlighting the urgent need for novel, blockchain-native security solutions.

Identifying deceptive practices in decentralized finance necessitates a detailed analysis of on-chain activity, as traditional fraud detection struggles with the speed and intricacies of blockchain transactions. To facilitate the development of effective detection models, researchers have created the TM-RugPull dataset, a comprehensive resource encompassing data from 1,028 token projects. This dataset is carefully constructed to minimize data leakage – a common pitfall in machine learning in which information that would not be available at prediction time leaks into the training features – ensuring that any patterns identified genuinely reflect malicious intent rather than accidental artifacts of data collection. By providing a robust and reliable foundation for model training, the TM-RugPull dataset empowers developers to build more accurate and trustworthy systems for safeguarding investors in the rapidly evolving DeFi landscape.

Analysis of on-chain data reveals that scam tokens are distinguished by significantly higher token concentration and holder variance within the top 1%, offering reliable indicators for early detection of rug pulls.

Constructing a Foundation: Introducing the TM-RugPull Dataset

The TM-RugPull dataset represents a substantial improvement in resources for analyzing token project risk by offering a collection of 1,028 projects designed to minimize data leakage. Previous datasets often suffered from information prematurely available to predictive models, leading to inflated performance metrics not representative of real-world efficacy. This dataset addresses this limitation through careful construction and validation, providing a more realistic benchmark for evaluating rug pull detection algorithms and fostering research into genuinely predictive features. The size of the dataset enables statistically significant analysis and robust model training, improving the reliability of research findings in the decentralized finance (DeFi) space.

The TM-RugPull dataset encompasses a wide range of token project categories, including Decentralized Finance (DeFi), meme tokens, Non-Fungible Tokens (NFTs), and celebrity tokens, demonstrating the expansive nature of the current token landscape. Analysis of the dataset reveals that a substantial portion – over 40% – of identified scam instances originate from token projects outside of the DeFi sector. This indicates that rug pull schemes are not limited to complex DeFi protocols and are prevalent across various token types, necessitating broad-spectrum detection methodologies.

The TM-RugPull dataset employs a dual-source data collection methodology, integrating on-chain and off-chain data to provide a comprehensive assessment of token project behavior. On-chain signals consist of transaction data, including token transfers, liquidity pool interactions, and smart contract deployments, which are directly extracted from the blockchain. Complementing this, off-chain data incorporates information sourced from social media, project websites, and team member profiles. This combined approach allows for the identification of anomalous patterns, discrepancies between stated project goals and actual activity, and potential indicators of malicious intent that may not be apparent when analyzing either data source in isolation.

The TM-RugPull dataset captures the broad spectrum of rug-pull threats by encompassing diverse token types, from DeFi protocols to meme coins, NFT games, and celebrity tokens, demonstrating that these threats extend beyond typical DeFi contexts.

Preserving Temporal Integrity: Rigorous Feature Engineering

The TM-RugPull dataset mitigates temporal data leakage through a practice termed ‘temporal hygiene’, strictly limiting feature extraction to data available before the identified attack event. This methodology ensures that predictive models are trained solely on historical information, preventing the inclusion of post-attack data that could artificially inflate performance and invalidate real-world applicability. All features used for classification are calculated using data points exclusively preceding the project’s midpoint, which defines the boundary between pre-attack and post-attack periods. This approach avoids the risk of models learning patterns based on information that would not have been available to an investor at the time of the potential rug pull.

The TM-RugPull dataset establishes a defined ‘Project Midpoint’ to strictly separate pre-attack and post-attack data, preventing temporal data leakage. This boundary is crucial for ensuring that feature engineering utilizes only information available before a potential rug pull event. Specifically, all feature calculations – including those for indicators like Token Concentration Ratio and Holder Variance – are constrained to data points preceding the Project Midpoint. This methodology isolates the predictive power of pre-attack characteristics, avoiding the introduction of bias from subsequent events and ensuring the validity of model predictions based on historical data alone.
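The cutoff rule described above can be illustrated with a minimal Python sketch. The article does not specify how the Project Midpoint is computed, so the `halfway_cutoff` helper (halfway between first and last observed activity) is purely an assumption, as are the `Event` record type and all field names; the only idea taken from the text is that feature extraction sees nothing at or after the boundary.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One timestamped observation (hypothetical record type)."""
    timestamp: datetime
    kind: str  # e.g. "transfer", "liquidity_add", "social_post"


def halfway_cutoff(first_seen: datetime, last_seen: datetime) -> datetime:
    """ASSUMPTION: place the cutoff halfway through the observed lifetime.

    The dataset's actual midpoint definition may differ.
    """
    return first_seen + (last_seen - first_seen) / 2


def events_before(events: list[Event], cutoff: datetime) -> list[Event]:
    """Keep only events strictly earlier than the cutoff."""
    return [e for e in events if e.timestamp < cutoff]


def midpoint_ratio(first_seen: datetime, cutoff: datetime, last_seen: datetime) -> float:
    """Position of the cutoff within the project lifetime, from 0.0 to 1.0."""
    return (cutoff - first_seen) / (last_seen - first_seen)


# Toy timeline: two events before the cutoff, one exactly on it, one after.
first_seen = datetime(2024, 1, 1)
last_seen = datetime(2024, 1, 11)
events = [
    Event(datetime(2024, 1, 2), "transfer"),
    Event(datetime(2024, 1, 5), "liquidity_add"),
    Event(datetime(2024, 1, 6), "social_post"),
    Event(datetime(2024, 1, 9), "liquidity_remove"),
]

cutoff = halfway_cutoff(first_seen, last_seen)
usable = events_before(events, cutoff)

print([e.kind for e in usable])
print(midpoint_ratio(first_seen, cutoff, last_seen))
```

The strict `<` comparison is a deliberate choice in this sketch: an event stamped exactly at the boundary is treated as unavailable, which errs on the side of excluding data rather than risking leakage.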

The identification of potential rug pulls leverages quantifiable on-chain metrics, specifically the Token Concentration Ratio and Holder Variance. The Token Concentration Ratio measures the percentage of tokens held by the top 1% of addresses, while Holder Variance quantifies how unevenly token holdings are spread across addresses. A Mann-Whitney U test shows a statistically significant difference (p < 0.001) in token concentration between projects identified as scams and those considered legitimate. Scam projects tend to concentrate tokens among a small number of holders to a degree not observed in legitimate projects.
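As a rough illustration of these two metrics and of the rank-based comparison, the following Python sketch uses only the standard library. It is not the authors' implementation: the function names are hypothetical, normalizing balances to shares of supply before taking the variance is an assumption, and the example balances are made up.

```python
import math
from statistics import pvariance


def token_concentration_ratio(balances, top_fraction=0.01):
    """Share of total supply held by the top `top_fraction` of addresses."""
    holders = sorted((b for b in balances if b > 0), reverse=True)
    if not holders:
        raise ValueError("no positive balances")
    # Always count at least one address so small projects still get a value.
    k = max(1, math.ceil(len(holders) * top_fraction))
    return sum(holders[:k]) / sum(holders)


def holder_variance(balances):
    """Population variance of each holder's share of supply.

    ASSUMPTION: balances are normalized to shares so that projects with
    different total supplies are comparable.
    """
    holders = [b for b in balances if b > 0]
    total = sum(holders)
    return pvariance([b / total for b in holders])


def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys: pairs with x > y count 1, ties count 0.5."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs
        for y in ys
    )


# Made-up project: one address holds 901 tokens, 99 addresses hold 1 each.
skewed = [901] + [1] * 99
even = [10] * 100

print(token_concentration_ratio(skewed))
print(token_concentration_ratio(even))
print(holder_variance(skewed) > holder_variance(even))
```

The U statistic alone does not give a p-value. In practice a library routine such as `scipy.stats.mannwhitneyu` would be used to obtain one; the hand-rolled version here only shows what the test counts.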

Analysis of midpoint ratios across TM-RugPull projects reveals a consistent temporal boundary application, as evidenced by the prominent peak near 0.5.

Empowering the Ecosystem: Accessibility and Impact

The creation of the TM-RugPull dataset involved a rigorous process of curation and verification undertaken by experienced blockchain security analysts. These specialists meticulously examined a substantial volume of data, focusing on identifying and confirming instances of potential ‘rug pulls’ – a deceptive practice within decentralized finance where developers abandon a project and abscond with investor funds. This detailed scrutiny extended beyond automated checks, incorporating manual review to validate the authenticity and accuracy of each flagged transaction and project characteristic. The result is a dataset distinguished by its high fidelity, offering researchers a reliable foundation for developing and testing advanced fraud detection algorithms and ultimately bolstering investor confidence in the rapidly evolving DeFi landscape.

The TM-RugPull dataset is freely available through the Hugging Face Datasets platform, a strategic decision intended to maximize its impact on the field of fraud detection. This open access approach dismantles traditional barriers to research, allowing data scientists, developers, and security analysts worldwide to readily utilize and contribute to the dataset’s ongoing refinement. By providing a centralized and easily accessible resource, the platform facilitates collaborative efforts, accelerates the development of innovative detection tools, and ultimately empowers a more proactive defense against malicious actors within the decentralized finance ecosystem. The ease of integration with popular machine learning frameworks further streamlines the research process, promising a quicker path toward robust and reliable rug pull detection systems.

Open access to the TM-RugPull dataset directly facilitates the development of innovative fraud detection tools within the decentralized finance (DeFi) landscape. By providing a readily available, verified resource, researchers and developers can efficiently prototype, train, and validate algorithms designed to identify and mitigate the risks associated with rug pulls. This accelerates the creation of more robust security measures, ultimately bolstering investor confidence and fostering a more secure and sustainable DeFi ecosystem. The ability to easily access and analyze this data empowers the community to proactively address emerging threats and build defenses against malicious actors, contributing to a safer environment for all participants.

A comprehensive benchmark dataset yields better-calibrated scam probability predictions and clearer distinction between legitimate and fraudulent projects than a dataset limited to DeFi, demonstrating the benefits of cross-domain and multimodal data.
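The article does not say which calibration measure underlies this comparison. One common way to score how well predicted scam probabilities match outcomes is the Brier score (mean squared error between probability and the 0/1 label, lower is better); the sketch below assumes that metric purely for illustration, with made-up predictions.

```python
def brier_score(probabilities, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Made-up predictions for four projects (1 = scam, 0 = legitimate).
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]   # confident and correct
uninformed = [0.5, 0.5, 0.5, 0.5]  # always predicts 50%

print(brier_score(confident, labels))
print(brier_score(uninformed, labels))
```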

The creation of TM-RugPull underscores a fundamental principle of system design: structure dictates behavior. This dataset isn’t merely a collection of data points; it’s a meticulously constructed framework intended to reveal patterns indicative of fraudulent activity. By prioritizing temporal hygiene and multimodal analysis, the researchers demonstrate an awareness that isolated observations are insufficient. As Barbara Liskov aptly stated, “Programs must be correct, but correctness is not enough; they must also be understandable.” TM-RugPull facilitates understanding by providing a rigorously curated resource, allowing for more transparent and reliable investigation into the complex dynamics of blockchain token projects and addressing the limitations of prior, less structured datasets.

What’s Next?

The introduction of TM-RugPull does not, of course, solve the problem of blockchain fraud. It merely shifts the burden of ingenuity. Existing approaches often treat symptoms – price volatility, liquidity pool drains – rather than the underlying pathology. This dataset aims to provide a more temporally sound foundation for identifying those pathologies, but a clean dataset is a plea for better theory, not a substitute for one. If the models built upon it merely replicate existing biases, the exercise will have been…efficiently disappointing.

A persistent challenge lies in the inherent asymmetry of information. Detecting a rug pull requires predicting malicious intent, and intent rarely broadcasts itself directly. Future work should therefore prioritize not simply identifying what happened, but modelling the preconditions – the network effects, the social engineering, the sheer opportunism – that make such events possible. The architecture of a system dictates its failure modes; understanding those modes demands a holistic view, not just a clever algorithm.

Ultimately, the value of any such dataset resides in its capacity to be superseded. A truly successful fraud detection system will render datasets like TM-RugPull obsolete, forcing researchers to confront more subtle, and therefore more dangerous, forms of deception. If this work accelerates that process, even by a small margin, it will have served its purpose. The art of system design, after all, is knowing what vulnerabilities to accept.


Original article: https://arxiv.org/pdf/2602.21529.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-02-26 22:02