Truth in ESG: Evaluating AI’s Grip on Sustainability Reports

Author: Denis Avetisyan


As companies increasingly rely on artificial intelligence to analyze lengthy Environmental, Social, and Governance reports, ensuring the factual accuracy of those analyses is paramount.

The construction of an ESG benchmark necessitates a defined workflow, acknowledging that any structured system, despite initial calibration, will inevitably exhibit entropy over time.

This work introduces ESG-Bench, a new benchmark designed to assess and mitigate hallucinations in long-context AI systems processing ESG data, demonstrating improved performance with Chain-of-Thought prompting and groundedness-based supervision.

Increasingly stringent corporate responsibility standards demand comprehensive ESG reporting, yet the length and complexity of these disclosures pose significant challenges for automated analysis and reliable interpretation. To address this, we introduce ‘ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation’, a novel benchmark dataset and evaluation framework designed to assess and mitigate factual inaccuracies, or hallucinations, in large language models applied to lengthy ESG documents. Our experiments demonstrate that employing Chain-of-Thought prompting, coupled with groundedness-based supervision, substantially reduces hallucinations and improves the reliability of LLM-driven ESG analysis. Will these techniques unlock more trustworthy and scalable approaches to evaluating corporate sustainability and ethical performance?


The Erosion of Trust: Hallucinations in ESG Data

The escalating demand for Environmental, Social, and Governance (ESG) data analysis is increasingly met with the implementation of Large Language Models (LLMs), yet this reliance introduces substantial risks to data integrity. While LLMs offer speed and scalability in processing vast datasets – crucial for evaluating sustainability performance – their inherent limitations pose a challenge to accurate reporting. These models, designed to identify patterns and generate text, are not infallible and can misinterpret complex ESG metrics or draw incorrect conclusions from nuanced reporting. Consequently, stakeholders relying on LLM-derived ESG insights face the potential for flawed investment decisions, inaccurate risk assessments, and ultimately, a misallocation of resources intended to drive positive social and environmental impact. This dependence necessitates careful validation and oversight to ensure the reliability of LLM outputs and maintain trust in ESG data.

Large Language Models, despite their analytical capabilities, are susceptible to “hallucinations”, the generation of fabricated or unsupported information, posing a considerable risk to the reliability of Environmental, Social, and Governance (ESG) data. Recent evaluations of GPT-4o, a leading LLM, reveal that while generally proficient, inconsistencies remain in its responses; assessments indicate an agreement rate of only 81.5% when compared to human annotations. This means over 18% of the time, the model deviates from established facts, potentially leading to flawed ESG reporting and misinformed investment decisions. The propensity for these models to invent information, rather than simply retrieving it, creates a critical challenge for organizations increasingly reliant on automated analysis for sustainability-related data, and underscores the need for robust verification processes.
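
To make the agreement figure concrete, here is a minimal sketch of how such a rate is typically computed from paired model and human labels. The label names are illustrative, not from the paper:

```python
# Sketch: simple agreement rate between model outputs and human
# annotations. Label values here are illustrative toy data.
model_labels = ["supported", "hallucinated", "supported", "supported"]
human_labels = ["supported", "hallucinated", "supported", "hallucinated"]

matches = sum(m == h for m, h in zip(model_labels, human_labels))
agreement = matches / len(human_labels)
print(f"Agreement rate: {agreement:.1%}")  # 75.0% for this toy data
```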

The rising prevalence of fabricated information within Environmental, Social, and Governance (ESG) data is becoming increasingly problematic in light of evolving regulations designed to standardize and improve sustainability reporting. New directives, such as the Corporate Sustainability Reporting Directive (CSRD) and the Sustainable Finance Disclosure Regulation (SFDR), demand greater transparency and reliability in ESG disclosures. These regulations not only increase the legal and financial risks associated with inaccurate reporting, including potential fines and reputational damage, but also amplify the consequences of LLM hallucinations. As companies become legally obligated to provide detailed and verifiable ESG data, the potential for fabricated information to trigger regulatory scrutiny and impede access to capital is substantial, creating a pressing need for robust validation mechanisms and careful oversight of AI-driven ESG analysis.

Evaluation on ESG-Bench reveals that GPT-4o’s generated answers closely align with human responses and its own groundedness assessments, as demonstrated by the strong agreement shown in the confusion matrices for both answer comparison and groundedness judgment correlation.

Defining Fidelity and Veracity: The Foundations of Trust

Faithfulness, as a metric for evaluating Large Language Models (LLMs), quantifies the degree to which an LLM’s generated response is grounded in and directly attributable to the provided source document. This assessment specifically focuses on preventing the introduction of unsupported claims or inferences; a faithful response will only contain information explicitly or implicitly present within the source material. Evaluations of faithfulness involve verifying that each statement made by the LLM can be traced back to a specific segment of the source document, and that no novel information is added. Distinguishing between faithful and unfaithful responses is critical for applications requiring verifiable outputs, as unfaithful responses introduce inaccuracies and undermine trust in the LLM’s reasoning process.
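
A claim-level faithfulness check along these lines might be sketched as follows; `llm` stands in for any chat-completion call, and the prompt wording is illustrative rather than the paper's:

```python
# Sketch: claim-by-claim faithfulness verification against a source
# document. `llm` is a placeholder for any text-completion callable.
def is_faithful(claim: str, source: str, llm) -> bool:
    prompt = (
        "Answer YES or NO. Is the following claim fully supported by the "
        f"source text, with no added information?\n\nSource:\n{source}\n\n"
        f"Claim:\n{claim}"
    )
    return llm(prompt).strip().upper().startswith("YES")

def faithfulness_score(claims: list[str], source: str, llm) -> float:
    # Fraction of generated claims traceable to the source document.
    verdicts = [is_faithful(c, source, llm) for c in claims]
    return sum(verdicts) / len(verdicts) if verdicts else 1.0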

Factuality, as a metric for evaluating Large Language Model (LLM) outputs, necessitates verifying statements against a body of established, independently verifiable knowledge. This assessment is distinct from faithfulness, which focuses on source document support; factuality checks are performed irrespective of the provided context. The purpose of this evaluation is to identify inaccuracies, biases, or outdated information potentially present in the LLM’s response. This is critical because LLMs are trained on massive datasets that may contain flawed or incomplete data, and factuality assessment serves as a crucial safeguard against the propagation of misinformation, even if the LLM accurately reflects the information within its training data.
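
Structurally, a factuality check differs from the faithfulness check above only in what the judge verifies against: established knowledge rather than a provided document. A hedged sketch, with illustrative prompt wording:

```python
# Sketch: factuality check with no source context; the judge is asked to
# verify against world knowledge. Prompt wording is illustrative.
def is_factual(statement: str, llm) -> bool:
    prompt = (
        "Answer YES or NO. Is the following statement factually accurate "
        f"according to established, verifiable knowledge?\n\n{statement}"
    )
    return llm(prompt).strip().upper().startswith("YES")
```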

Hallucination in Large Language Models (LLMs) manifests in distinct forms requiring tailored evaluation methods. Additive hallucination occurs when a model generates information not present in the source document, necessitating techniques like fact verification against external knowledge bases and source attribution analysis to identify unsupported claims. Conversely, omissive hallucination represents a failure to respond to a query despite relevant information being available in the source material; evaluation here focuses on assessing the model’s ability to correctly identify and retrieve pertinent data, often employing question-answering metrics and coverage analysis. Accurate diagnosis of these distinct hallucination types is crucial for developing targeted mitigation strategies and improving the reliability of LLM outputs.
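
The two failure modes can be separated mechanically once groundedness and passage availability are known. The labels and decision logic below are an illustrative sketch, not ESG-Bench's exact scoring:

```python
# Sketch: distinguishing additive from omissive hallucination.
# `relevant_passage` is None when the source cannot answer the question;
# `supported` is the faithfulness verdict for the model's answer.
def classify_hallucination(answer: str, relevant_passage: str | None,
                           supported: bool) -> str:
    if relevant_passage is None:
        # Nothing in the source: any substantive answer is fabricated.
        return "additive" if answer.strip() else "correct_abstention"
    if not answer.strip():
        # Source contained the answer but the model declined to respond.
        return "omissive"
    return "faithful" if supported else "additive"
```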

Introducing ESG-Bench: A Rigorous Framework for Assessment

ESG-Bench is a newly developed benchmark dataset and evaluation framework specifically designed to assess the performance of Large Language Models (LLMs) in the domain of Environmental, Social, and Governance (ESG) question answering, with a particular focus on identifying and measuring instances of hallucination. This framework facilitates rigorous evaluation by providing a standardized dataset and metrics for gauging LLM accuracy and reliability when responding to ESG-related queries. The intention is to move beyond simple accuracy measurements and explicitly test an LLM’s ability to provide truthful and verifiable answers, critical given the increasing reliance on these models for ESG analysis and reporting.
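
A benchmark item in this setting needs to carry not just a question and gold answer, but enough context to test abstention. The schema below is an assumption for illustration, not the released dataset format; exact match is a stand-in for ESG-Bench's LLM-based assessment:

```python
# Sketch: a plausible shape for an ESG-Bench-style evaluation record.
# Field names are assumptions, not the benchmark's published schema.
from dataclasses import dataclass

@dataclass
class ESGBenchItem:
    report_id: str         # source ESG report the question is drawn from
    context: str           # long-context excerpt given to the model
    question: str          # ESG question (environmental/social/governance)
    reference_answer: str  # gold answer; empty if unanswerable
    answerable: bool       # distinguishes "without answer" cases

def evaluate(model, items: list[ESGBenchItem]) -> float:
    # Accuracy including correct abstention on unanswerable items.
    correct = 0
    for item in items:
        pred = model(item.context, item.question)
        if item.answerable:
            correct += pred.strip() == item.reference_answer.strip()
        else:
            correct += pred.strip() == ""  # reward declining to answer
    return correct / len(items)
```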

ESG-Bench’s data foundation is built upon publicly accessible Environmental, Social, and Governance (ESG) disclosures. These disclosures are systematically collected from sources such as ResponsibilityReports.com, a repository of corporate responsibility reports. To ensure data standardization and comparability, the framework incorporates established reporting guidelines, specifically the Global Reporting Initiative (GRI) Standards. Utilization of GRI Standards facilitates consistent measurement and reporting of ESG performance across different organizations, enabling a more reliable and objective evaluation of LLM responses concerning ESG topics. This approach allows ESG-Bench to assess not only factual accuracy but also adherence to recognized ESG reporting conventions.

ESG-Bench employs advanced Large Language Models, specifically GPT-4o, for both the creation of questions and the assessment of responses, building upon established methodologies from benchmarks like BioASQ and HaluEval to ensure a robust evaluation process. Notably, GPT-4o exhibits a high degree of self-awareness regarding the factual basis of its answers; its self-evaluated groundedness demonstrates 80.4% agreement with human annotations, indicating reliable internal validation. Further supporting its assessment capabilities, the model maintains strong internal consistency, with 83.7% alignment observed between its question answering decisions and corresponding self-evaluations.
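
A self-groundedness probe of this kind can be sketched as a second pass over the model's own answer; the prompt text is illustrative, not the paper's:

```python
# Sketch: asking the answering model to grade its own groundedness.
# Prompt wording is illustrative, not ESG-Bench's exact instruction.
def self_groundedness(context: str, question: str, answer: str, llm) -> bool:
    prompt = (
        "You previously answered a question about the document below. "
        "Answer YES or NO: is your answer fully grounded in the document?\n\n"
        f"Document:\n{context}\n\nQuestion: {question}\nAnswer: {answer}"
    )
    return llm(prompt).strip().upper().startswith("YES")
```

Comparing these self-judgments against human labels is what yields agreement figures such as the 80.4% reported above.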

The analysis of question-answer pairs reveals a distribution skewed towards environmental, social, and governance (ESG) categories, with a corresponding distribution of labels indicating the types of questions being asked.

Implications for Responsible AI and the Future of ESG Analysis

Current large language models (LLMs) applied to Environmental, Social, and Governance (ESG) data are susceptible to generating inaccurate or misleading information – a phenomenon known as hallucination. Recent evaluations, notably using the ESG-Bench dataset, highlight a critical need to refine techniques for both detecting and mitigating these fabrications. A four-step Chain-of-Thought (CoT) fine-tuning approach has emerged as particularly effective, consistently exceeding the performance of alternative models across diverse benchmarks including ESG-Bench itself, the HaluEval hallucination benchmark, and the BioASQ biomedical question answering platform. This demonstrates that strategically guiding the model’s reasoning process, rather than simply increasing data volume, significantly improves the reliability of LLM outputs when dealing with complex ESG data, fostering more trustworthy AI-driven insights for investors and sustainability initiatives.
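
The article does not spell out the four steps of the fine-tuning recipe, but a Chain-of-Thought prompt in its spirit might decompose the task as below; the step wording is a plausible reconstruction, not the paper's published method:

```python
# Sketch: a four-step CoT prompt template for grounded ESG QA.
# The steps are an illustrative decomposition, not the paper's recipe.
COT_TEMPLATE = """Answer the ESG question using only the report excerpt.
Work through these steps before answering:
1. Locate the passages relevant to the question.
2. Extract the specific facts or figures they contain.
3. Check whether a complete answer is actually supported; if not, say so.
4. State the final answer, citing the supporting passage.

Report excerpt:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    return COT_TEMPLATE.format(context=context, question=question)
```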

The successful integration of artificial intelligence into Environmental, Social, and Governance (ESG) analytics hinges on establishing unwavering trust in the data and insights generated. Limitations in current large language models – specifically the propensity for ‘hallucinations’ or generating factually incorrect information – directly impede this trust, potentially leading to misinformed investment strategies and ineffective sustainability initiatives. Accurate ESG assessments are crucial for capital allocation, risk management, and demonstrating genuine corporate responsibility; therefore, mitigating these AI-driven inaccuracies isn’t merely a technical challenge, but a prerequisite for unlocking the full potential of AI to drive positive environmental and social impact. Robust and reliable AI tools in this space are fundamental to enabling investors and organizations to make data-driven decisions aligned with sustainability goals and fostering a more transparent and accountable ESG landscape.

Continued advancement in responsible AI for Environmental, Social, and Governance (ESG) analysis necessitates a concentrated effort on refining evaluation methodologies, integrating specialized domain expertise, and bolstering the reliability of large language model outputs. Current metrics often fail to fully capture the nuances of factual accuracy within the complex landscape of ESG data; therefore, future studies should prioritize the development of more robust assessment tools. The recently developed model exhibits a particularly strong ability to avoid generating incorrect responses – as evidenced by its high F1 Score for ‘Without Answer’ classifications – which signifies a significant step towards minimizing false positives while maintaining comprehensive recall, a crucial characteristic for trustworthy AI-driven sustainability insights. Further research building upon this foundation will be essential to unlock the full potential of LLMs in fostering informed investment strategies and impactful environmental and social governance practices.
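
Treating correct abstention as the positive class makes the 'Without Answer' F1 computation concrete; the function below is a standard per-class F1, with illustrative boolean labels:

```python
# Sketch: F1 for the "Without Answer" class, with True meaning the model
# (pred) or annotator (gold) marked the question as unanswerable.
def without_answer_f1(preds: list[bool], golds: list[bool]) -> float:
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```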

Environmental, Social, and Governance (ESG) represent the core pillars for evaluating an organization’s sustainability and ethical impact.

The pursuit of reliable knowledge from complex systems, as demonstrated by ESG-Bench, echoes a fundamental truth about architecture itself. Systems, whether software or corporate sustainability reports, are not static entities but evolve over time, inevitably accumulating imperfections. This benchmark, focused on mitigating hallucinations in Large Language Models when processing long-form ESG data, acknowledges this inherent decay. As Edsger W. Dijkstra observed, “It’s not enough to have good code; you must also have good design.” The design of ESG-Bench, utilizing Chain-of-Thought prompting and groundedness-based supervision, represents an attempt to build a more resilient system, one that can age gracefully despite the complexities of long-context understanding and the ever-present risk of factual inaccuracies. The study’s core idea – improving factual accuracy – is a testament to the need for meticulous construction, recognizing that even the most sophisticated systems require constant vigilance and refinement.

The Long View

ESG-Bench, as a diagnostic for factual decay in long-context LLM analysis, identifies not a solution, but a symptom. The benchmark itself is merely a snapshot – a precise measurement of error at a specific moment in the relentless march toward entropy. Each hallucination revealed isn’t a failure of the model, but a moment of truth in the timeline, exposing the inherent fragility of knowledge representation. The observed improvements through Chain-of-Thought prompting and groundedness supervision offer temporary stasis, delaying, not defeating, the inevitable accumulation of informational debt.

Future efforts will likely focus on increasingly sophisticated methods of ‘truth maintenance’ – attempts to build systems that actively resist the corrosive effects of time. However, a more fruitful line of inquiry may lie in accepting that imperfect recall is not a bug, but a feature. Perhaps the goal isn’t to eliminate hallucinations entirely, but to develop models that are aware of their own fallibility, capable of expressing uncertainty, and transparent about the provenance of their claims.

The true challenge isn’t building a perfect memory, but designing systems that age gracefully. Technical debt, in this context, is the past’s mortgage paid by the present, and a truly robust system will account for the inevitable foreclosures.


Original article: https://arxiv.org/pdf/2603.13154.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-17 06:42