Decoding Sustainability: Can AI Unlock EU Taxonomy Data?

Author: Denis Avetisyan


A new study investigates how artificial intelligence can automatically analyze corporate sustainability reports to identify key performance indicators for EU regulatory compliance.

The analysis presents a direct comparison between actual values and model forecasts for Key Performance Indicators aligned with the EU Taxonomy, illustrating the model’s capacity to reflect real-world performance metrics within the framework of sustainable finance.

Researchers demonstrate successful qualitative activity identification using Large Language Models, but face hurdles in accurately predicting quantitative KPI values from sustainability disclosures.

Despite increasing demands for corporate sustainability disclosure, automated analysis of complex reporting frameworks like the EU Taxonomy remains a substantial challenge. This is addressed in ‘Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs’, which introduces a novel dataset and systematically evaluates Large Language Models (LLMs) for extracting and predicting Key Performance Indicators (KPIs). The findings reveal a marked disparity between LLM performance on qualitative activity identification and quantitative KPI prediction, with models failing to accurately forecast financial metrics despite moderate success in understanding report content. Could strategically combining LLMs with human expertise unlock truly efficient and reliable sustainability reporting workflows?


Decoding the Sustainability Data Maze

The European Union Taxonomy, designed to establish a unified language for sustainable investment, faces a significant practical hurdle: the sheer volume and complexity of corporate reporting. While the initiative seeks to bring clarity to environmental claims, current reliance on manual data extraction and analysis is both financially burdensome and susceptible to human error. Companies disclose sustainability data across diverse formats – annual reports, sustainability reports, and various online platforms – requiring extensive, time-consuming review. This manual process not only increases operational costs for investors and regulatory bodies but also introduces the risk of inaccurate assessments, potentially undermining the credibility of sustainable finance and hindering the flow of capital towards genuinely impactful projects. The limitations of manual analysis emphasize the urgent need for automated tools capable of efficiently processing and validating the Key Performance Indicators (KPIs) central to the EU Taxonomy’s framework.

The proliferation of Environmental, Social, and Governance (ESG) reporting frameworks has created a substantial challenge for investors and analysts attempting to discern genuine sustainability performance. Existing methods for extracting Key Performance Indicators (KPIs) from corporate disclosures – often lengthy, unstructured documents – rely heavily on manual review or basic text-based searches. This approach proves inefficient, incredibly time-consuming, and susceptible to human error, particularly when navigating the diverse reporting standards and varying levels of transparency across different companies and industries. Consequently, validating the accuracy and reliability of reported KPIs remains a significant hurdle, hindering effective investment decisions and potentially undermining efforts to direct capital toward truly sustainable initiatives. Automated solutions capable of intelligently parsing complex reports and verifying data against established benchmarks are increasingly vital to overcome these limitations and unlock the full potential of sustainable finance.

The reliability of sustainability claims is increasingly pivotal in directing capital towards genuinely impactful initiatives. Investors are no longer solely focused on financial returns; they actively seek evidence of positive environmental and social contributions, demanding transparency and verifiable data. Without accurate assessment, however, “greenwashing” – the practice of misleadingly portraying products or practices as environmentally friendly – can erode trust and misallocate resources, hindering progress towards crucial sustainability goals. Robust evaluation frameworks are therefore essential not only for attracting investment but also for ensuring that financial flows genuinely support projects and companies committed to measurable, positive environmental impact and long-term value creation. This scrutiny drives innovation in reporting and verification, ultimately fostering a more sustainable and accountable financial ecosystem.

Analysis of EU Taxonomy KPIs reveals that aligned KPIs consistently center around zero, indicating minimal impact, whereas eligible KPIs demonstrate a broader range of potential contributions.

Automated Extraction: A First Attempt at Systemization

A system was developed to automate the analysis of sustainability reports using Large Language Models (LLMs). This system is designed to identify and extract information pertaining to specific economic activities as reported by companies. The process involves processing unstructured text from these reports and categorizing the described activities, enabling quantitative and qualitative data to be surfaced for further analysis. The system’s architecture leverages the LLM’s natural language understanding capabilities to parse complex reporting language and pinpoint details relevant to defined economic criteria, ultimately streamlining the process of sustainability data collection and assessment.

The system utilizes multi-label text classification to categorize reported economic activities based on the EU Taxonomy. This approach allows for the assignment of multiple relevant taxonomy labels to each activity described within sustainability reports. Evaluation of the classification performance, using a dataset incorporating concise company metadata, resulted in an F1-score of 0.311. This metric indicates a limited ability to both precisely identify relevant activities and avoid false positives during categorization, suggesting potential for improvement in the model’s accuracy and reliability.
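To make the reported metric concrete, the sketch below computes a micro-averaged F1-score over multi-label predictions, which is one standard way such a score is derived; the activity labels and predictions are invented toy data, not drawn from the paper’s dataset.

```python
# Sketch: micro-averaged F1 for multi-label activity classification.
# The taxonomy labels and predictions below are invented toy examples.

def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over multi-label predictions (sets of labels)."""
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but not present
        fn += len(true - pred)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

truth = [{"solar_energy", "rail_transport"}, {"hydropower"}]
preds = [{"solar_energy"}, {"hydropower", "data_centres"}]
print(round(micro_f1(truth, preds), 3))  # → 0.667
```

Aggregating true/false positives globally before computing precision and recall (micro-averaging) weights frequent activities more heavily than per-label macro-averaging would.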

The system incorporates quantitative KPI regression to estimate numerical values for key financial indicators, specifically Turnover, Capital Expenditure (CapEx), and Operational Expenditure (OpEx). However, evaluation of this regression model resulted in a negative R² value of -0.2106. This indicates that the model’s predictions are demonstrably less accurate than simply using the mean value of each KPI as a baseline prediction; a negative R² signifies that the model explains less variance in the data than a constant model, suggesting the chosen features do not reliably predict these quantitative values.
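The meaning of a negative R² can be seen directly from its definition: R² compares a model’s squared error against that of a constant mean prediction. The toy KPI values below are invented purely to illustrate the arithmetic.

```python
# Sketch: why a negative R² means "worse than predicting the mean".
# The KPI values are invented toy numbers for illustration only.

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # baseline error
    return 1 - ss_res / ss_tot

actual = [10.0, 20.0, 30.0, 40.0]      # e.g. taxonomy-aligned turnover shares
mean_baseline = [25.0] * 4             # constant prediction: the mean
bad_model = [40.0, 10.0, 45.0, 5.0]    # predictions worse than the mean

print(r_squared(actual, mean_baseline))   # → 0.0 (explains no variance)
print(r_squared(actual, bad_model) < 0)   # → True (worse than the baseline)
```

An R² of −0.2106 therefore means the model’s squared error exceeds that of the trivial mean predictor by about 21%.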

A pair plot reveals the distributions of six EU Taxonomy KPI percentages and their relationships, using kernel density estimates along the diagonal and scatter plots for pairwise correlations.

Refining the Process: An Agentic Approach

The Agentic Workflow employs an iterative process for KPI extraction and classification, moving beyond single-pass Large Language Model (LLM) outputs. This workflow consists of multiple sequential steps where initial LLM outputs are refined through subsequent processing stages. Each iteration leverages the results of the previous step, allowing the system to correct errors and improve the accuracy of KPI identification and categorization. This multi-step approach is designed to address the inherent limitations of relying on a single LLM prediction, enhancing overall reliability and precision by systematically improving results with each cycle.

Single-pass Large Language Model (LLM) outputs, while efficient, are inherently limited by the model’s initial interpretation of the input and potential for error propagation. The Agentic Workflow mitigates these limitations through iterative refinement; initial KPI extraction and classification results are subjected to subsequent validation and correction steps within the workflow. This multi-step process allows the system to self-correct, reducing the impact of initial errors and improving overall precision. By decoupling KPI identification from immediate classification and introducing intermediate review stages, the Agentic Workflow avoids the constraints of a single, potentially flawed, LLM pass.
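The extract-validate-refine loop described above can be sketched as follows. The `extract_kpis` and `validate` functions here are hypothetical stand-ins (a trivial parser and a range check), not the paper’s actual agents; in the real workflow both steps would be LLM calls or rule-based checks over report text.

```python
# Sketch of an iterative extract-validate-refine loop. The extraction
# and validation functions are hypothetical stand-ins, not the paper's
# actual agents.

def extract_kpis(text, feedback=None):
    """Stand-in for an LLM extraction call: pulls numeric tokens and
    applies validator feedback on a subsequent pass."""
    values = [float(tok.rstrip("%")) for tok in text.split()
              if tok.rstrip("%").replace(".", "", 1).isdigit()]
    if feedback == "percentages must lie in [0, 100]":
        values = [v for v in values if 0 <= v <= 100]
    return values

def validate(values):
    """Stand-in validator: flags out-of-range percentage values."""
    if any(v < 0 or v > 100 for v in values):
        return "percentages must lie in [0, 100]"
    return None

def agentic_extract(text, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        values = extract_kpis(text, feedback)
        feedback = validate(values)
        if feedback is None:   # validator accepts: stop iterating
            return values
    return values

report = "Aligned turnover 12.5% CapEx 340% OpEx 8.0%"
print(agentic_extract(report))  # → [12.5, 8.0]
```

The key design point is that validator feedback is fed back into the next extraction pass, so an initially flawed output (here, the implausible 340%) can be caught and corrected rather than propagated.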

Evaluation of the Agentic Workflow demonstrated an F1-score of 0.3285 on quantitative KPI extraction and classification tasks. While this represents a marginal performance increase compared to the highest-performing single-step model, which achieved an F1-score of 0.311, the results indicate a substantial deficiency in the model’s ability to perform accurate quantitative reasoning. This limited improvement suggests that iterative refinement, while beneficial, is insufficient to fully address the challenges inherent in extracting and interpreting numerical data from unstructured text.

Zero-Shot Learning was implemented to improve the performance of the Quantitative KPI Regression model; however, evaluation revealed substantial miscalibration. Specifically, the model demonstrated an Expected Calibration Error (ECE) of 0.684. This metric indicates a significant discrepancy between the predicted probabilities and the observed frequencies of correct predictions, meaning the model’s confidence levels do not reliably reflect its actual accuracy. While Zero-Shot Learning offered some performance gains, the high ECE suggests that the model’s output probabilities should not be directly interpreted as reliable estimates of correctness without further calibration techniques.
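Expected Calibration Error can be illustrated with a short computation: predictions are grouped into equal-width confidence bins, and ECE is the sample-weighted gap between average confidence and actual accuracy per bin. The confidence/correctness pairs below are invented toy data, not the paper’s results.

```python
# Sketch: Expected Calibration Error (ECE) with equal-width confidence
# bins. The confidence/correctness pairs are invented toy data.

def expected_calibration_error(confidences, correct, n_bins=5):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # include the right edge in the last bin so 1.0 is counted
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Overconfident model: 90% stated confidence, only 25% actually correct
conf = [0.9, 0.9, 0.9, 0.9]
hit = [1, 0, 0, 0]
print(expected_calibration_error(conf, hit))  # ≈ 0.65
```

An ECE of 0.684 on a [0, 1] scale indicates confidence scores that are almost uninformative about correctness, which is why the text warns against reading them as probability estimates without recalibration.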

The agentic workflows differ architecturally, with a sequential pipeline in the first approach and a parallel structure in the second to improve prediction refinement.

Establishing Ground Truth: Validation and Scalability

To rigorously evaluate the automated analysis system, its performance is measured against a newly constructed ‘Structured Benchmark Dataset’ of company reports containing verified economic activities and key performance indicators (KPIs). This curated dataset provides standardized ground truth, enabling objective measurement of the system’s accuracy and efficiency in extracting and interpreting quantitative data from complex corporate reporting, rather than relying on subjective interpretation. By comparing automated output against verified data, researchers can precisely identify where improvements are needed, establishing a transparent and verifiable standard for assessing sustainability claims and fostering trust in reported metrics.

Recent advancements in large language models (LLMs) have not translated to reliable quantitative analysis, specifically in the crucial task of key performance indicator (KPI) regression. Rigorous testing reveals a comprehensive failure of these models when asked to extract and interpret numerical data from company reports in a ‘zero-shot’ setting – meaning without prior training on similar tasks. This limitation signifies a fundamental challenge; while LLMs excel at processing language, they struggle with the precise numerical reasoning required to validate sustainability claims and accurately assess economic activities. The inability to perform this quantitative assessment hinders the potential for automated analysis of sustainability data, and underscores the need for further development in LLM capabilities to unlock their full potential in driving investment towards a sustainable economy.

The potential for scalable sustainability hinges on the efficient verification of environmental, social, and governance (ESG) claims, and this approach offers a pathway to unlock substantial investment in genuinely responsible projects. Currently, the laborious process of manually assessing company reports creates a bottleneck, hindering capital flow towards environmentally beneficial initiatives. However, realizing this potential demands a critical advancement: improved quantitative reasoning within automated analysis systems. While automation can streamline assessment, the current limitations in accurately extracting and interpreting key performance indicators (KPIs) pose a significant challenge. Overcoming this hurdle is not merely about speed, but about ensuring investors can confidently identify and support projects that demonstrably contribute to a sustainable economy, thereby accelerating the transition towards a more responsible and equitable future.

The dataset’s companies are distributed across various sectors, reflecting a diverse range of industries.

The pursuit of automated KPI extraction from sustainability reports, as detailed in this study, inherently involves a process of controlled demolition of established reporting structures. It’s a deliberate attempt to dissect complex documents and rebuild their information into a quantifiable format. This resonates with David Hilbert’s assertion: “We must be able to answer the question: What are the ultimate foundations of mathematics?” A similar spirit drives the research. The work isn’t merely about finding KPIs; it’s about rigorously testing the underlying assumptions within corporate disclosures and revealing the structural weaknesses in how sustainability data is presented. The challenges encountered in quantitative prediction aren’t failures, but rather opportunities to refine the system and expose its inherent limitations, much like a stress test reveals the fault lines in any complex mechanism.

Beyond the Numbers

The apparent success in identifying qualitative activities aligned with the EU Taxonomy raises a pointed question: is the current focus on quantifiable Key Performance Indicators (KPIs) a necessary constraint, or a self-imposed limitation? The struggle to reliably predict those KPIs from textual reports suggests the information is often present, but obscured by the very act of standardization. Perhaps the ‘signal’ isn’t in the numbers themselves, but in the narrative around them – the justifications, the caveats, the unspoken assumptions baked into sustainability disclosures.

Future work needn’t replicate the pursuit of perfect prediction. Instead, attention should turn towards identifying why these models fail – what subtle linguistic patterns betray non-compliance, or indicate ‘greenwashing’? The inconsistencies revealed by imperfect prediction are, after all, potentially more valuable than accurate data points. A system that flags ambiguity, demands justification, or highlights discrepancies would be a far more potent regulatory tool than one that simply automates reporting.

The ultimate test isn’t whether an LLM can extract KPIs, but whether it can expose the underlying logic – or illogic – of corporate sustainability claims. The next iteration should treat sustainability reports not as sources of truth, but as complex systems to be reverse-engineered, probing for the hidden rules governing ‘impact’ and ‘alignment’.


Original article: https://arxiv.org/pdf/2512.24289.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
