From Text to Knowledge: A Smarter Way to Classify Data

Author: Denis Avetisyan


A new approach combines the power of language models with graph networks to achieve accurate text classification, even when labeled data is limited.

The pipeline transforms textual documents into a graph structure: it first extracts representations and defines relationships between nodes, then leverages a large language model to label a selection of these nodes, and finally propagates those labels across the entire graph using a graph neural network.

This review details a method for efficient text classification leveraging lightweight language models for initial annotation and graph neural networks for label propagation in data-scarce environments.

While Large Language Models excel at zero-shot text classification, their substantial computational demands hinder scalable deployment, particularly in resource-constrained environments. This work introduces “Text2Graph: Combining Lightweight LLMs and GNNs for Efficient Text Classification in Label-Scarce Scenarios,” an open-source framework that synergistically combines initial LLM-based annotation with Graph Neural Network label propagation. Our results demonstrate that this hybrid approach achieves competitive performance with significantly reduced energy consumption and carbon emissions compared to LLM-only methods. Could this represent a pathway towards more sustainable and scalable text annotation workflows in high-performance computing and beyond?


The Limits of Pattern Matching: Why LLMs Fall Short

Conventional text classification techniques, especially those powered by large language models, often falter when confronted with tasks demanding intricate reasoning and the synthesis of diverse knowledge. These models, while adept at identifying statistical patterns within text, frequently treat words as isolated entities or sequential tokens, overlooking the underlying relationships and contextual dependencies crucial for genuine understanding. Consequently, they can struggle with nuances like coreference resolution, logical inference, or the integration of background knowledge – capabilities that require a more holistic representation of information. The inherent limitations in capturing and manipulating relational information hinder their ability to move beyond superficial pattern matching and achieve robust, human-like reasoning in complex text classification scenarios.

Large language models, despite their impressive abilities in identifying patterns within data, operate as essentially sophisticated “black boxes.” Though capable of generating human-like text and achieving high accuracy in many tasks, these models often lack the capacity to explicitly represent and reason with knowledge. This reliance on statistical correlations, rather than structured understanding, leads to computational expense – requiring massive datasets and processing power – and can result in brittle performance when faced with novel or ambiguous inputs. The absence of explicit knowledge representation also hinders explainability; it’s difficult to discern why a model arrived at a particular conclusion, limiting trust and hindering debugging efforts. Consequently, researchers are exploring alternative approaches that prioritize knowledge encapsulation and symbolic reasoning to overcome these limitations.

Recognizing the constraints of traditional text classification, current research is increasingly focused on graph-based approaches to unlock more robust reasoning capabilities. These methods move beyond sequential text processing by representing documents and their constituent parts – concepts, entities, and relationships – as nodes and edges in a graph structure. This allows the model to explicitly capture the semantic connections within the text, fostering a deeper understanding than pattern recognition alone. By leveraging graph neural networks, these systems can propagate information across the graph, inferring relationships and drawing conclusions that would be difficult for sequential models. This paradigm shift promises improved performance in tasks requiring complex reasoning, knowledge integration, and a nuanced comprehension of textual data, offering a pathway toward more intelligent and reliable text classification systems.

Performance comparisons reveal consistent advantages of the proposed method across all datasets.

From Text to Relationships: The Text2Graph Pipeline

The Text2Graph pipeline begins by utilizing Large Language Models (LLMs) to generate contextualized embeddings of input text. These LLMs, pre-trained on extensive corpora, encode semantic information, capturing relationships between words and phrases beyond simple lexical matching. This initial representation transforms raw text into a numerical format suitable for downstream processing. Specifically, the LLM outputs vector representations for individual text segments – sentences or clauses – which serve as nodes in the subsequent graph construction phase. The quality of these embeddings directly impacts the accuracy of relational understanding, as they define the initial semantic basis for identifying connections between different parts of the text.
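As a concrete illustration of this first step, the sketch below encodes a handful of documents into dense vectors with a sentence-transformer. The specific model name and example inputs are assumptions for demonstration, not details taken from the paper.

```python
# Minimal sketch: turn raw text segments into dense vectors that will later
# become node features. The encoder choice is illustrative, not the paper's.
from sentence_transformers import SentenceTransformer

documents = [
    "Stocks rallied after the central bank held rates steady.",
    "The midfielder scored twice in the cup final.",
    "Researchers report a new battery chemistry for electric cars.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice
embeddings = encoder.encode(documents, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384) for this particular encoder
```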

Sentence embeddings are vector representations of sentences, generated using models trained to capture semantic meaning. These embeddings map each sentence to a point in a high-dimensional space, where the distance between vectors reflects the semantic similarity of the corresponding sentences; shorter distances indicate greater similarity. Techniques such as cosine similarity are commonly employed to quantify this proximity. The resulting vectors enable efficient computation of relationships between text segments, facilitating downstream tasks like identifying related information or clustering similar sentences. The quality of the embedding model significantly impacts the accuracy of these similarity measurements and, consequently, the performance of the entire Text2Graph pipeline.
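A minimal example of measuring that proximity, reusing the embeddings from the sketch above: once the vectors are normalized, cosine similarity reduces to a plain dot product.

```python
# Pairwise cosine similarity between normalized embeddings.
import numpy as np

similarity = embeddings @ embeddings.T      # pairwise cosine similarities
np.fill_diagonal(similarity, -np.inf)       # ignore self-similarity
nearest = similarity.argmax(axis=1)         # most similar neighbour per segment

for i, j in enumerate(nearest):
    print(f"segment {i} is closest to segment {j} (cos = {similarity[i, j]:.2f})")
```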

The construction of the Text Attributed Graph relies on identifying relationships between text segments through algorithms such as k-Nearest Neighbors (k-NN) and Minimum Spanning Tree (MST). k-NN identifies the $k$ most semantically similar text segments based on vector embeddings, establishing initial connections. Subsequently, MST algorithms are applied to create a connected graph with the minimum total edge weight, representing the most significant relationships between segments. The resulting Text Attributed Graph consists of nodes representing text segments and edges representing the identified relationships, with node attributes derived from the text content and edge weights reflecting the strength of the semantic similarity.
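The sketch below combines the two steps, building a k-NN graph for local similarity and an MST for global connectivity over the same embeddings; the value of $k$, the distance metric, and the way the two edge sets are merged are illustrative assumptions rather than the paper's settings.

```python
# Sketch of Text Attributed Graph construction: k-NN edges plus an MST
# to guarantee the graph stays connected. Parameters are illustrative.
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

k = 2
knn = kneighbors_graph(embeddings, n_neighbors=k, mode="connectivity")

# MST over pairwise cosine distances keeps everything in one component.
distances = squareform(pdist(embeddings, metric="cosine"))
mst = minimum_spanning_tree(distances)

# Union of both edge sets, symmetrized into an undirected adjacency matrix.
adjacency = ((knn + knn.T + mst + mst.T) > 0).astype(float)
print(adjacency.toarray())
```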

Graph Neural Networks (GNNs) are employed on the Text Attributed Graph to facilitate text classification tasks. Specifically, the Graph Convolutional Network (GCN) is utilized as a representative GNN architecture, enabling the aggregation of feature information from neighboring nodes within the graph. This process allows the model to capture relational dependencies between text segments, improving classification performance compared to traditional methods that operate on isolated text units. The efficiency gains stem from the GNN’s ability to leverage graph structure, reducing the computational complexity associated with analyzing contextual relationships and enabling scalable text analysis.
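A minimal two-layer GCN over such a graph might look as follows, written here with PyTorch Geometric; the layer sizes and toy inputs are illustrative rather than the paper's configuration.

```python
# Two-layer GCN producing per-node class logits for text segments.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TextGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))          # aggregate neighbours
        h = F.dropout(h, p=0.5, training=self.training)
        return self.conv2(h, edge_index)               # class logits per node

# x: node features (one row per text segment); edge_index: graph edges.
x = torch.randn(3, 384)
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
logits = TextGCN(384, 64, 4)(x, edge_index)
print(logits.shape)  # torch.Size([3, 4])
```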

Validation Across Diverse Domains: A Robust Framework

The Text2Graph pipeline’s adaptability has been validated through evaluation on the AG News, Reuters, Ohsumed, and IMDB datasets. The AG News dataset consists of news articles categorized into four classes: World, Sports, Business, and Sci/Tech. The Reuters dataset comprises news articles with categories focusing on topical events. Ohsumed is a collection of abstracts from biomedical literature categorized by MeSH descriptors. Finally, the IMDB dataset contains movie reviews labeled with positive or negative sentiment. Performance across these diverse datasets, spanning news classification, topic categorization, and sentiment analysis, demonstrates the framework’s capacity to generalize beyond a single domain and handle varying text characteristics and classification tasks.

The Text2Graph pipeline facilitates zero-shot text classification by leveraging graph structures to represent semantic relationships between text and potential labels, eliminating the requirement for labeled training data specific to each classification task. This is achieved by constructing a graph where nodes represent both text instances and predefined class labels, with edge weights indicating semantic similarity. Classification is then performed by identifying the label node most strongly connected to the input text node. This approach minimizes the need for extensive, task-specific fine-tuning of model parameters, offering a significant reduction in computational cost and data labeling effort while maintaining competitive performance on diverse text classification benchmarks.
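In sketch form, and reusing the encoder and document embeddings from the earlier examples, the zero-shot step amounts to embedding the label names and assigning each document to the label it is most strongly connected to. The label set shown is AG News and is used purely for illustration.

```python
# Zero-shot assignment: connect each document to its most similar label node.
labels = ["World", "Sports", "Business", "Sci/Tech"]
label_vecs = encoder.encode(labels, normalize_embeddings=True)

scores = embeddings @ label_vecs.T       # document-to-label similarity
predictions = scores.argmax(axis=1)      # strongest connection wins

for doc, idx in zip(documents, predictions):
    print(f"{labels[idx]:>9}: {doc}")
```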

The Text2Graph framework leverages DistilBERT, a distilled version of BERT, trained via a Teacher-Student framework to minimize computational expense. In this approach, a larger, more complex “teacher” model – typically BERT – first generates soft labels, providing richer information than traditional hard labels. DistilBERT, acting as the “student,” is then trained to mimic the output distribution of the teacher model, effectively transferring knowledge while reducing the number of parameters by approximately 40%. This knowledge transfer allows DistilBERT to achieve performance levels approaching that of the full BERT model, but with a significantly lower computational footprint and faster inference speed, making it suitable for resource-constrained environments.
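The distillation objective behind this Teacher-Student setup is typically a weighted combination of a softened KL term and the usual cross-entropy on hard labels. The sketch below shows that standard formulation; the temperature and weighting are chosen arbitrarily for illustration rather than taken from DistilBERT's actual training recipe.

```python
# Standard knowledge-distillation loss: student matches the teacher's
# softened distribution (KL term) while also fitting the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, hard_labels)
    return alpha * kd + (1 - alpha) * ce

student_logits = torch.randn(8, 4)        # batch of 8, 4 classes
teacher_logits = torch.randn(8, 4)
hard_labels = torch.randint(0, 4, (8,))
print(distillation_loss(student_logits, teacher_logits, hard_labels))
```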

Evaluation across the AG News, Reuters, Ohsumed, and IMDB datasets indicates the Text2Graph framework achieves F1-Macro scores comparable to those obtained using full Large Language Model (LLM) labeling. Specifically, performance metrics demonstrate parity with LLM results while utilizing significantly fewer computational resources. This reduced resource consumption is attributed to the graph-based approach, which allows for efficient knowledge representation and inference without the parameter size and processing demands associated with complete LLM deployments. The framework’s ability to maintain competitive accuracy with lower overhead presents a viable alternative for text classification tasks where resource constraints are a concern.

The Cost of Intelligence: Towards Sustainable AI Practices

The proliferation of large language models, while revolutionizing artificial intelligence, carries a substantial environmental cost. Training these complex systems demands immense computational resources, translating directly into significant energy consumption and a corresponding increase in carbon emissions. The process isn’t limited to initial training; ongoing deployment and inference also contribute to this footprint. Each query processed, each response generated, requires power, and the cumulative effect across millions of users is considerable. Researchers are increasingly focused on quantifying this impact, revealing that the carbon footprint of training a single, large model can be comparable to several transatlantic flights, highlighting the urgent need for more sustainable AI practices and energy-efficient algorithms.

Quantifying the environmental cost of artificial intelligence development requires dedicated tools, and CodeCarbon emerges as a pivotal resource in this evolving landscape. This open-source library meticulously tracks the energy consumption and associated carbon emissions throughout the entire lifecycle of an AI experiment – from initial training to final deployment. By integrating seamlessly into standard machine learning workflows, CodeCarbon provides granular metrics, detailing energy usage per epoch, hardware specifications, and regional carbon intensity factors. These insights enable researchers and developers to pinpoint energy bottlenecks, compare the environmental impact of different model architectures, and ultimately, make informed decisions to minimize their carbon footprint. The tool not only offers precise calculations, but also facilitates reporting and benchmarking, fostering greater transparency and accountability within the AI community and paving the way for more sustainable practices.
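Instrumenting an experiment with CodeCarbon amounts to wrapping the workload in an EmissionsTracker; the sketch below uses a placeholder workload and an illustrative project name.

```python
# Minimal CodeCarbon usage: track the emissions of one block of work.
from codecarbon import EmissionsTracker

def dummy_workload():
    # Placeholder for the actual annotation or training step.
    return sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker(project_name="text2graph-demo")  # name is illustrative
tracker.start()
try:
    dummy_workload()
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2-equivalent for the block

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```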

The Text2Graph pipeline presents a novel approach to artificial intelligence model development, prioritizing energy efficiency alongside performance. By strategically reducing model size and refining training procedures, it minimizes the computational resources required without compromising accuracy or functionality. This optimization yields a demonstrably higher Performance/Energy Ratio when contrasted with conventional methods; studies indicate a substantial decrease in energy consumption for comparable tasks. The pipeline effectively navigates the trade-off between model complexity and resource utilization, offering a pathway toward more sustainable AI practices and reduced environmental impact without necessitating a sacrifice in output quality. This focus on efficiency is crucial as the demand for increasingly sophisticated AI models continues to grow.

Investigations reveal that artificial intelligence models leveraging graph-based methodologies demonstrate a substantially reduced environmental footprint compared to conventional approaches. This efficiency stems from the inherent data structures within graph neural networks, which require fewer parameters and computational operations to achieve comparable performance. Consequently, training and deploying these models translates to significantly lower energy consumption and, crucially, diminished carbon dioxide emissions. Studies indicate a marked decrease in CO$_2$ output, a critical step towards aligning artificial intelligence development with principles of sustainability and mitigating the broader environmental impact of increasingly complex algorithms.

Runs demonstrate a clear trade-off between average equivalent CO2 emissions and total energy consumption.

The pursuit of elegant solutions in text classification, as demonstrated by this Text2Graph approach, feels predictably optimistic. The paper attempts to sidestep the annotation bottleneck by marrying Large Language Models with Graph Neural Networks – a clever trick, if it holds. But one anticipates production systems will reveal edge cases that shatter the initial efficiency gains. As Claude Shannon famously observed, “Communication is the transmission of information, not the transmission of truth.” This research transmits a promising idea about reducing resource consumption, yet the ‘truth’ of its real-world performance will inevitably be messier. It’s a beautiful theory, and it will become tomorrow’s tech debt.

What’s Next?

The pursuit of ‘efficient’ classification, as demonstrated by this work, inevitably circles back to the age-old problem of data. Substituting the cost of annotation with model complexity merely shifts the burden rather than eliminating it. This approach – leveraging Large Language Models for a bootstrap, then handing off to Graph Neural Networks – feels suspiciously like a refined version of knowledge distillation, a trick the field thought it had largely moved beyond. One anticipates diminishing returns as models scale, and the ‘label-scarce’ scenario will simply evolve into a ‘compute-scarce’ one.

The real question isn’t whether one can classify text with fewer labels, but whether the resulting classifications matter. Production, as always, will reveal the brittleness of these elegant architectures when confronted with the delightful messiness of real-world data. Expect to see the error cases cluster around edge cases that are predictably uninteresting to researchers but devastating to end-users.

Future work will undoubtedly explore increasingly convoluted methods for ‘active learning’ and ‘self-supervision’. The cycle continues. Everything new is old again, just renamed and still broken. Perhaps the most efficient classification algorithm remains a well-trained human, but good luck scaling that.


Original article: https://arxiv.org/pdf/2512.10061.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
