Beyond Checklists: Rethinking How We Measure AI Safety
![A framework for improving AI safety benchmarking, derived from an analysis of 210 existing benchmarks, identifies nine key concerns and ten recommendations centered on extending safety evaluations beyond familiar scenarios, employing probabilistic risk quantification (expressed as [latex]P(risk)[/latex]), and ensuring measurements correlate with demonstrable real-world safety outcomes.](https://arxiv.org/html/2601.23112v1/x1.png)
Current AI safety benchmarks often fall short of accurately assessing real-world risks, prompting a need for more rigorous and nuanced evaluation methods.
A new review systematically deconstructs the mathematical foundations of today’s most powerful AI systems, offering a unified perspective on their inner workings.

A new benchmark assesses the ability of artificial intelligence to synthesize information from text, images, and videos for complex financial analysis.
Across fields as diverse as physics, biology, and social science, researchers are repeatedly stumbling upon the same mathematical tools to describe moments of dramatic transition.
New research reveals how both direct and indirect impacts from tropical cyclones can significantly reduce the accuracy of medium-range weather predictions.

Researchers have developed an advanced artificial intelligence system capable of forecasting solar flares with improved accuracy and interpretability.

A new analysis reveals that commonly used datasets for evaluating secret detection models are riddled with duplicated data, leading to inflated performance scores and a false sense of security.
![A causal Wiener filter, implemented via a spectral transformation with parameters [latex]\alpha = 0[/latex], [latex]\beta = 0.9[/latex], and [latex]\omega_0 = 5[/latex] rad/s, effectively estimates a scale-free signal with power spectral density [latex]S_{xx} = A\gamma^{2}/((|\omega|-\omega_{c})^{2}+\gamma^{2})[/latex], where [latex]\gamma = 2\pi[/latex] rad/s, [latex]A = 0.9[/latex], and [latex]\omega_c = 10 \cdot 2\pi[/latex] rad/s, from noisy measurements with a noise power spectral density of [latex]S_{nn} = 5/\omega^{1.8} + 0.01[/latex], achieving performance comparable to a non-causal Wiener filter, with relative error power spectral densities demonstrably reduced through Welch's method averaging across approximately 250 logarithmically spaced bins.](https://arxiv.org/html/2601.22294v1/x1.png)
A novel method overcomes limitations in forecasting signals obscured by complex, scale-free noise, opening doors for more accurate predictions in diverse applications.
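The spectral densities quoted in the caption are enough to sketch the non-causal Wiener gain the paper uses as its comparison baseline. The sketch below evaluates [latex]H(\omega) = S_{xx}/(S_{xx}+S_{nn})[/latex] on roughly 250 logarithmically spaced frequency bins using only the caption's stated parameters; the causal spectral-factorization step of the paper's own method is not reproduced here, and the frequency range is an assumption chosen for illustration.

```python
import numpy as np

# Parameters as stated in the figure caption (all in rad/s where applicable).
A, gamma = 0.9, 2 * np.pi        # signal amplitude and Lorentzian width
omega_c = 10 * 2 * np.pi         # signal center frequency (10 Hz)

def S_xx(omega):
    """Scale-free signal PSD: A*gamma^2 / ((|omega| - omega_c)^2 + gamma^2)."""
    return A * gamma**2 / ((np.abs(omega) - omega_c) ** 2 + gamma**2)

def S_nn(omega):
    """Noise PSD: power-law 5/|omega|^1.8 plus a white-noise floor of 0.01."""
    return 5.0 / np.abs(omega) ** 1.8 + 0.01

# Non-causal Wiener gain H = S_xx / (S_xx + S_nn), evaluated on ~250
# log-spaced bins (range 1..1000 rad/s is an illustrative assumption).
omega = np.logspace(0, 3, 250)
H = S_xx(omega) / (S_xx(omega) + S_nn(omega))

peak_hz = omega[H.argmax()] / (2 * np.pi)
print(f"peak gain {H.max():.3f} near {peak_hz:.1f} Hz")
```

As expected, the gain approaches one near the signal's center frequency, where the Lorentzian signal PSD dominates the decaying power-law noise, and rolls off elsewhere.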
New research reveals that large language models are surprisingly more likely than humans to perpetuate harmful misconceptions about autism spectrum disorder.

New research evaluates how well artificial intelligence can anticipate security vulnerabilities based on bug reports.