Author: Denis Avetisyan
A new framework moves beyond traditional fairness metrics to assess the risks of harmful outputs from large language models, focusing on how those risks are distributed across different groups.

SHARP introduces a distributional approach to evaluating social harm by analyzing risk profiles and tail events in language model outputs.
While large language models are increasingly deployed in high-stakes applications, their evaluations often obscure critical details of potential social harms through reliance on scalar metrics. This limitation motivates ‘SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models’, which introduces a framework for multidimensional, distribution-aware evaluation, modeling harm as a multivariate random variable and quantifying tail risks using metrics such as Conditional Value at Risk (CVaR95). Application of SHARP to eleven frontier LLMs reveals substantial heterogeneity in failure structures, demonstrating that models with similar average risk can exhibit drastically different worst-case behaviors across dimensions of bias, fairness, ethics, and epistemic reliability. Will this shift toward tail-sensitive risk profiling prove essential for the responsible development and governance of increasingly powerful language technologies?
Dissecting Harm: A Multidimensional Approach to LLM Evaluation
The increasing prevalence of large language models presents a growing potential for societal harm, prompting an urgent need for robust evaluation frameworks. These models, capable of generating human-quality text, are susceptible to propagating biases present in their training data, potentially leading to discriminatory or unfair outcomes across various applications. Furthermore, their capacity to generate misleading or factually incorrect information raises concerns about the erosion of trust and the spread of misinformation. A comprehensive evaluation is therefore critical not simply to identify these harms, but to develop strategies for mitigation and ensure responsible development and deployment of these powerful technologies, safeguarding against unintended negative consequences for individuals and society as a whole.
Existing evaluations of large language models frequently struggle to identify the precise nature of potential harms, creating a significant obstacle to effective mitigation strategies. Broadly flagging problematic outputs – such as toxicity or bias – provides limited actionable insight; a model exhibiting biased behavior, for instance, may do so through disparate impact, stereotype reinforcement, or the propagation of harmful misinformation, each requiring a distinct corrective approach. Without this granular understanding of how harm manifests, developers are left addressing symptoms rather than root causes, leading to interventions that are either ineffective or introduce unintended consequences. Consequently, a more refined methodology is needed – one that dissects harmful outputs into their constituent elements to facilitate targeted and meaningful improvements in model safety and alignment.
Truly understanding the potential risks posed by large language models demands a shift from simplistic, single-metric evaluations to a detailed analysis across several crucial dimensions of harm. Bias, for example, isn’t a monolithic issue but manifests in various forms, from prejudiced outputs reinforcing societal stereotypes to disparate performance across demographic groups. Similarly, fairness extends beyond equal treatment to consider equitable outcomes, while ethical concerns encompass issues like privacy violations and the potential for malicious use. Crucially, epistemic reliability, the model’s ability to generate truthful and well-supported information, represents a separate but equally vital dimension, as confidently delivered misinformation can be profoundly damaging. Only by meticulously defining and measuring harms across these interconnected facets – bias, fairness, ethics, and epistemic reliability – can researchers and developers effectively pinpoint vulnerabilities and build safeguards against the potential for societal detriment.
Reliance on singular metrics to assess the safety of large language models presents a significant risk of overlooking critical vulnerabilities. A model might perform well on a general benchmark, yet still exhibit dangerous biases or generate demonstrably false information in specific contexts. This is because a single score collapses complex behaviors into a simplistic representation, obscuring the nuanced interplay of different harm dimensions. Consequently, a truly robust evaluation necessitates a multidimensional approach, examining factors like fairness, ethical considerations, and epistemic reliability independently – and in combination – to reveal a more complete and accurate picture of a model’s potential for societal impact. Such detailed scrutiny is vital for identifying and mitigating risks that would otherwise remain hidden beneath a deceptively positive overall score.

Introducing SHARP: A Rigorous Framework for Evaluating LLM Harm
The SHARP framework establishes a fixed evaluation protocol designed to consistently assess social harms generated by Large Language Models (LLMs). This protocol mandates specific prompting strategies, harm categories, and evaluation criteria, eliminating ambiguity and enabling reproducible safety testing. By fixing these elements, SHARP moves beyond ad-hoc evaluations and facilitates comparative analysis of different LLMs or iterative improvements to a single model. The standardized approach allows for consistent monitoring of safety performance across model versions and provides a reliable basis for benchmarking and reporting on social harm mitigation efforts. This fixed protocol is crucial for ensuring that safety evaluations are not influenced by arbitrary choices in prompting or harm assessment, yielding more objective and comparable results.
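To make the notion of a fixed protocol concrete, the sketch below shows the kind of configuration such an evaluation might pin down; the field names and default values are illustrative assumptions, not the framework's published settings.

```python
# A minimal sketch of a fixed evaluation protocol, assuming hypothetical
# field names and defaults; the paper does not publish this exact schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class SharpProtocol:
    # Harm dimensions scored for every model (names are illustrative).
    harm_dimensions: tuple = ("bias", "fairness", "ethics", "epistemic_reliability")
    # Number of sampled completions per prompt for distribution-aware scoring.
    samples_per_prompt: int = 25
    # Tail-risk level used for CVaR (the 95th percentile in the paper).
    cvar_level: float = 0.95
    # Fixed decoding settings so results stay comparable across models.
    temperature: float = 0.7
    max_tokens: int = 512
```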
Cumulative Log-Risk serves as the central metric within SHARP by transforming diverse safety signals – such as toxicity, bias, and factual inaccuracy – into a standardized, additive scale. This reparameterization enables the decomposition of overall risk into contributions from individual harm types, facilitating targeted mitigation strategies. Specifically, safety signals are converted to log-probabilities, summed, and then exponentiated, resulting in a single, interpretable risk score. This additive property allows developers to isolate the impact of specific interventions, enabling efficient refinement of model safety profiles and focused resource allocation for harm reduction. The use of a logarithmic scale also addresses the often-skewed distribution of safety signals, providing a more robust and sensitive measure of overall risk.
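A short sketch makes the additive decomposition concrete; the exact transformation SHARP applies may differ, but the log-scale sum below shows how per-dimension contributions combine into one score.

```python
import numpy as np

def cumulative_log_risk(harm_probs):
    """Additive log-scale risk score (a sketch of the idea, not the paper's
    exact transformation). harm_probs: per-dimension harm probabilities."""
    logs = np.log(np.clip(harm_probs, 1e-12, 1.0))   # avoid log(0)
    total_log_risk = logs.sum()                      # additive across harm types
    return total_log_risk, np.exp(total_log_risk)    # log score and product form

# Each dimension's contribution is simply its log term, so an intervention's
# effect can be attributed to the harm type it actually reduced.
log_score, risk = cumulative_log_risk([0.02, 0.10, 0.005, 0.30])
```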
SHARP employs Conditional Value at Risk at the 95th percentile (CVaR95) as a key metric for quantifying and prioritizing severe harms generated by large language models. CVaR95, also known as Expected Shortfall, calculates the average loss given that the loss exceeds the 95th percentile of the predicted harm distribution. This focuses evaluation efforts on ‘tail risk’ – the potential for infrequent but high-impact harmful outputs. Unlike simply measuring average harm, CVaR95 provides a more sensitive indicator of worst-case scenarios, allowing for targeted mitigation strategies focused on reducing the severity of the most dangerous possible responses. The calculation CVaR_{95}(X) = \mathbb{E}[X \mid X > VaR_{95}(X)] effectively weights the most extreme harm values, ensuring that safety interventions prioritize minimizing these high-impact risks.
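An empirical estimate of CVaR95 takes only a few lines; the estimator below is a sketch rather than the paper's exact procedure.

```python
import numpy as np

def cvar(samples, alpha=0.95):
    """Empirical CVaR (expected shortfall): the mean of losses at or above
    the alpha-quantile (VaR). Sketch only; estimator details may differ."""
    x = np.asarray(samples, dtype=float)
    var = np.quantile(x, alpha)      # VaR_95 when alpha = 0.95
    return x[x >= var].mean()        # average over the worst-case tail

# Example: average harm can look low while the tail remains severe.
rng = np.random.default_rng(0)
risks = np.concatenate([rng.beta(1, 20, 950), rng.beta(5, 2, 50)])
print(risks.mean(), cvar(risks, 0.95))
```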
Traditional LLM safety evaluations often rely on single-draw metrics, assessing harm based on a limited number of generated samples. SHARP diverges from this approach by adopting a distribution-aware evaluation, which considers the full probability distribution of model outputs. This is crucial because LLMs are stochastic; given the same prompt, they can produce a wide range of responses, some harmless and others potentially harmful. By analyzing this distribution, SHARP captures the variability inherent in LLM generation and provides a more realistic estimate of potential harm, rather than relying on a single, potentially unrepresentative sample. This distribution-aware approach allows for a more nuanced understanding of risk and facilitates targeted interventions to mitigate the most likely and severe harms across a broader range of possible outputs.
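In code, distribution-aware evaluation amounts to scoring many sampled completions per prompt instead of one; the `generate` and `score_harm` stubs below are hypothetical stand-ins for a model API and a harm scorer.

```python
import random

def generate(prompt, temperature=0.7):
    """Hypothetical stand-in for a stochastic model call."""
    return prompt + " ... sampled completion"

def score_harm(completion):
    """Hypothetical stand-in for a harm scorer returning a value in [0, 1]."""
    return random.random()

def harm_distribution(prompt, n_samples=25):
    """Empirical distribution of harm scores for a single prompt."""
    return [score_harm(generate(prompt)) for _ in range(n_samples)]

# Tail metrics such as CVaR_95 are then computed over this distribution,
# e.g. cvar(harm_distribution("Summarize the policy."), alpha=0.95)
# using the cvar sketch above.
```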

Validating SHARP: Establishing Statistical Confidence in Harm Assessment
Inter-judge reliability was quantified using Mean Absolute Deviation (MAD) to assess the consistency of harm evaluations. MAD calculates the average of the absolute differences between the ratings assigned to the same instance by different judges; lower MAD values indicate higher agreement. Evaluations demonstrated acceptable consistency, confirming that observed differences in harm assessments were not attributable to random or subjective variations in the judging process. This metric provides a statistically sound basis for aggregating individual judge scores and establishing confidence in the overall harm ranking.
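As a rough illustration, inter-judge MAD can be computed as the average absolute pairwise difference between judges' scores for the same item; whether the evaluation uses pairwise differences or deviations from a central rating is an assumption here.

```python
import numpy as np

def mean_absolute_deviation(ratings):
    """Inter-judge MAD for one item: mean absolute pairwise difference
    between judges' scores (one plausible reading of the metric)."""
    r = np.asarray(ratings, dtype=float)
    diffs = np.abs(r[:, None] - r[None, :])   # pairwise |r_i - r_j|
    n = len(r)
    return diffs.sum() / (n * (n - 1))        # mean over ordered pairs, i != j

# Three judges rating the same output; lower values mean closer agreement.
print(mean_absolute_deviation([0.20, 0.30, 0.25]))
```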
Judge aggregation utilized the Log-Sum-Exp method to combine individual harm assessments. This technique computes a smooth (soft) maximum of the observed scores, effectively prioritizing more severe evaluations without letting a single outlier dictate the result. The Log-Sum-Exp function, \log\left(\sum_{i=1}^{N} e^{x_i}\right), where x_i are the individual judge scores, is dominated by the highest values while still reflecting the rest of the panel. This approach ensures robustness against individual judge biases or errors and provides a more conservative estimate of overall harm potential than simple averaging.
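A small numerical sketch with SciPy's `logsumexp` shows how the aggregate tracks the most severe judge more closely than a plain mean does; whether the framework also normalizes by the number of judges is an assumption, so both values are printed.

```python
import numpy as np
from scipy.special import logsumexp   # numerically stable log(sum(exp(x)))

scores = np.array([0.1, 0.2, 2.5])              # hypothetical per-judge harm scores
raw_lse = logsumexp(scores)                     # log(sum_i exp(x_i))
normalized_lse = raw_lse - np.log(len(scores))  # optional 1/N normalization (assumed)

# The LSE aggregate sits much closer to the worst judgment than the mean does.
print(raw_lse, normalized_lse, scores.mean())
```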
Statistical significance of harm evaluations across different models was assessed using non-parametric tests due to the non-normal distribution of the data. Both the Wilcoxon Signed-Rank Test and the Friedman Test were employed; the Friedman Test, specifically, demonstrated a statistically significant difference between models with a p-value of less than 0.01. This p-value indicates that observed differences in harm scores are unlikely to be due to random chance, supporting the conclusion that models exhibit varying levels of risk when prompted with similar inputs.
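Both tests are available in SciPy; the sketch below assumes a prompts-by-models matrix of harm scores and stands in for, rather than reproduces, the study's analysis.

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

rng = np.random.default_rng(0)
scores = rng.random((100, 11))   # 100 prompts x 11 models (placeholder values)

# Friedman test: do the eleven models differ overall across matched prompts?
stat_f, p_friedman = friedmanchisquare(*scores.T)

# Wilcoxon signed-rank test: pairwise comparison of two specific models.
stat_w, p_pairwise = wilcoxon(scores[:, 0], scores[:, 1])
print(p_friedman, p_pairwise)
```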
Two-Way Fixed Effects Decomposition was employed to determine the relative contributions of large language model (LLM) and prompt identity to overall harm risk scores. This statistical method isolates the variance attributable to each factor by accounting for fixed effects – specifically, unique characteristics of each LLM and each prompt used in the evaluation. Analysis indicates that prompt identity explains a greater proportion of the total variance in harm assessments than does model identity, suggesting that the specific phrasing and instructions provided to the LLM are more influential in determining risk levels than the underlying model architecture itself. This decomposition provides a granular understanding of risk factors, facilitating targeted mitigation strategies focused on prompt engineering and refinement.
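One way to carry out such a decomposition is an ordinary least squares fit with categorical fixed effects for model and prompt, followed by an ANOVA table; the column names and placeholder scores below are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
llms = [f"model_{i}" for i in range(11)]
prompts = [f"prompt_{j}" for j in range(50)]

# One row per (model, prompt) evaluation; 'risk' stands in for cumulative log-risk.
df = pd.DataFrame(
    [{"llm": m, "prompt": p, "risk": rng.random()} for m in llms for p in prompts]
)

fit = smf.ols("risk ~ C(llm) + C(prompt)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))   # sum-of-squares attributable to each factor
```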

Quantifying Systemic Risk: The Union-of-Failures Perspective
The evaluation of large language model (LLM) safety often requires considering multiple potential harms simultaneously. Researchers addressed this complexity by modeling the overall risk as the probability of any harm dimension being triggered, a concept known as the Union-of-Failures. This approach moves beyond assessing individual harms in isolation, instead providing a comprehensive view of potential risks by acknowledging that even a single activated harm represents a failure. By treating each harm dimension as an independent potential failure point, the analysis effectively calculates the likelihood of at least one undesirable outcome occurring, offering a more realistic and nuanced understanding of the LLM’s overall safety profile. This holistic perspective is crucial for identifying vulnerabilities and prioritizing mitigation strategies, as it highlights the interconnectedness of various harm types and their collective contribution to overall risk.
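Under the independence assumption described above, the union probability has a simple closed form, sketched below.

```python
import numpy as np

def union_of_failures(per_dimension_probs):
    """Probability that at least one harm dimension is triggered, assuming
    independent failure points per dimension."""
    p = np.asarray(per_dimension_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)    # P(any harm) = 1 - P(no harm at all)

# Four individually small risks still imply roughly a 17% chance of some harm.
print(union_of_failures([0.05, 0.04, 0.06, 0.03]))
```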
Traditional risk assessments of large language models often focus on identifying specific harm types – such as toxicity, bias, or privacy violations – in isolation. However, this approach overlooks the crucial reality that overall risk arises not from a single failure, but from the potential for any of these harms to manifest. A more nuanced understanding, therefore, requires considering the combined probability of at least one harm occurring. This ‘Union-of-Failures’ perspective shifts the focus from individual vulnerabilities to a holistic view of system safety, acknowledging that even a low probability of multiple independent harms can collectively represent a significant overall risk. By evaluating the likelihood of any harm dimension being activated, this framework provides a more comprehensive and realistic assessment of an LLM’s potential for negative impact.
A key finding of the study demonstrates a substantial capacity to differentiate between large language models based on their risk profiles. Analysis reveals that Conditional Value at Risk (CVaR) successfully separates 80% of model pairs, indicating a measurable difference in potential harm generation. Of the 55 pairs examined, 44 exhibited statistically significant differences in CVaR, as confirmed by a Kendall’s W coefficient of 0.1809. This result suggests that CVaR serves as a reliable metric for quantifying and comparing the overall risk associated with different LLMs, providing a data-driven approach to evaluating their safety characteristics and informing responsible development practices.
The evaluation of large language model (LLM) safety requires a multifaceted approach, and the SHARP framework, when integrated with the Union-of-Failures perspective, delivers a notably robust and comprehensive solution. This combination moves beyond assessing individual harm types – such as bias, toxicity, or privacy violations – to quantify the overall probability of any harmful dimension being activated. By considering the interconnectedness of these potential failures, the framework provides a holistic risk assessment, offering a more nuanced understanding of an LLM’s safety profile. This allows for a more reliable comparative analysis of different models, as demonstrated by the study’s findings regarding Conditional Value at Risk (CVaR) separation, ultimately facilitating the development and deployment of safer and more responsible AI systems.
The SHARP framework, as detailed in the study, rightly pivots from simplistic, mean-centered evaluations toward a more nuanced understanding of distributional risk – recognizing that harm isn’t evenly spread. This echoes Donald Davies’ observation that, “Simplicity is the ultimate sophistication.” The pursuit of elegant evaluation, much like elegant system design, demands stripping away unnecessary complexity to reveal fundamental truths. SHARP’s focus on ‘tail events’, the rare but potentially devastating harms, demonstrates a commitment to understanding the system’s full behavior, not just its average performance. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.
Where Do We Go From Here?
The introduction of SHARP, with its emphasis on distributional risk and tail events, feels less like a solution and more like a necessary recalibration. For too long, evaluation has fixated on the average case, assuming a bell curve where harm is evenly distributed. This is, predictably, naive. The real damage resides in the extremes, in the unlikely but devastating outputs that slip past conventional metrics. The framework’s strength lies in acknowledging this asymmetry, though quantifying the truly unforeseen remains, as always, the central challenge.
A crucial next step involves expanding the dimensionality of ‘harm’ itself. The current taxonomy, while a reasonable starting point, will inevitably prove incomplete. Social nuance is rarely captured by neatly defined categories. Overly elaborate schemes risk becoming brittle, but a purely data-driven approach threatens to miss subtle yet significant injustices. The art, it seems, will be in finding the minimal sufficient structure – a principle often overlooked in the rush to complexity.
Ultimately, SHARP underscores a broader truth: robust systems are not built by eliminating all risk, but by understanding its shape. If a design feels clever, it’s probably fragile. The pursuit of ever-higher average scores will always be a distraction. Real progress requires a relentless focus on the failures, the edge cases, and the vulnerabilities that inevitably lurk beneath the surface. A simple idea, perhaps, but one persistently ignored.
Original article: https://arxiv.org/pdf/2601.21235.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/