Finding Equations in the Noise: A New Approach to Symbolic Regression

Author: Denis Avetisyan


A novel Bayesian framework leverages probabilistic modeling to discover underlying equations from data, even in the presence of significant noise.

The study demonstrates that Sequential Monte Carlo Symbolic Regression (SMC-SR) effectively modulates the distribution of Normalized Maximum Likelihood (NML) through likelihood tempering, ultimately shaping the model posterior as a refined histogram informed by the training data.

This work introduces a Sequential Monte Carlo method for Bayesian Symbolic Regression that enhances robustness, enables uncertainty quantification, and improves performance compared to traditional genetic programming techniques.

Despite its potential for scientific discovery, symbolic regression, the task of uncovering governing equations from data, is notoriously sensitive to noise. This limitation motivates the work presented in ‘Bayesian Symbolic Regression via Posterior Sampling’, which introduces a Sequential Monte Carlo (SMC) framework to approximate the posterior distribution over symbolic expressions. By efficiently exploring the search space and quantifying uncertainty, the proposed method demonstrably improves robustness and generalization compared to traditional genetic programming approaches. Could this Bayesian approach unlock more reliable and interpretable models for complex systems across diverse scientific and engineering domains?


The Challenge of Complex Systems: A Matter of Mathematical Rigor

Conventional modeling approaches frequently encounter limitations when applied to systems characterized by a large number of interacting components and non-linear relationships. These high-dimensional, non-linear systems – prevalent in fields like climate science, financial markets, and biological networks – often defy simple, linear extrapolations. The intricate web of feedback loops and emergent behaviors can render predictions based on simplified models fundamentally inaccurate. For example, a small initial change in a highly non-linear system can trigger disproportionately large and unpredictable outcomes, a phenomenon known as the butterfly effect. Consequently, traditional methods may fail to capture the full range of possible behaviors, leading to flawed forecasts and potentially misguided decision-making. This challenge necessitates the development of more sophisticated techniques capable of handling complexity and uncertainty, such as agent-based modeling or machine learning algorithms, to achieve reliable insights into these intricate systems.

Representing uncertainty and effectively leveraging prior knowledge pose significant hurdles in complex systems modeling. Traditional approaches often assume precise values for parameters, failing to account for inherent variability and the limitations of data. This can lead to overconfident predictions and a poor understanding of potential outcomes. Incorporating prior knowledge – existing expertise or data from related systems – is crucial for guiding models, especially when data is scarce, but doing so requires careful calibration to avoid biasing results. Bayesian methods offer a powerful framework for addressing these challenges, allowing models to quantify uncertainty and update beliefs as new evidence emerges, yet they can be computationally demanding and require careful specification of prior distributions. The ongoing development of techniques that seamlessly integrate uncertainty quantification and prior knowledge remains a central focus for improving the reliability and predictive power of complex systems models.

Model complexity for the Feynman dataset I-32-17 is determined by both the model’s functional form and the number of parameters used.

Bayesian Statistics: A Foundation of Probabilistic Truth

Bayesian statistics fundamentally differs from frequentist approaches by treating parameters as random variables with probability distributions. This allows for the expression of prior beliefs about a parameter – quantified as a prior distribution, denoted as $P(\theta)$ – which are then updated in light of observed data $D$ using Bayes’ Theorem to yield the posterior distribution, $P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}$. The term $P(D|\theta)$ represents the likelihood of observing the data given a specific parameter value, while $P(D)$ acts as a normalizing constant ensuring the posterior distribution integrates to one. This iterative updating process allows for continuous refinement of beliefs as more data becomes available, providing a coherent framework for incorporating both existing knowledge and new evidence into statistical inference.
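
As a concrete illustration of this updating step, the minimal sketch below (not taken from the paper) performs a grid-based Bayesian update for a single parameter $\theta$, assuming a standard normal prior, Gaussian noise with a known standard deviation, and a handful of hypothetical observations.

```python
import numpy as np

# Minimal grid-based Bayesian update for a single parameter theta.
# Assumptions (illustrative only): prior N(0, 1), Gaussian noise with sd 0.5.
theta_grid = np.linspace(-5.0, 5.0, 1001)
prior = np.exp(-0.5 * theta_grid**2)
prior /= prior.sum()

data = np.array([1.2, 0.8, 1.1])   # hypothetical observations
sigma = 0.5                        # assumed known noise level

# Likelihood P(D | theta) evaluated pointwise on the grid.
likelihood = np.prod(
    np.exp(-0.5 * ((data[:, None] - theta_grid[None, :]) / sigma) ** 2),
    axis=0,
)

# Bayes' theorem: posterior is proportional to likelihood times prior.
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior mean:", float((theta_grid * posterior).sum()))
```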

Bayesian model complexity is managed through the posterior distribution, which inherently balances model fit and simplicity. Unlike frequentist methods that often rely on techniques like cross-validation or regularization to prevent overfitting, Bayesian inference automatically penalizes complex models by assigning them lower posterior probability unless their added complexity is justified by a corresponding improvement in explaining the observed data. This is achieved because the posterior is proportional to the likelihood multiplied by the prior; assigning broader or lower-valued priors to model parameters discourages excessively large values and complex relationships. Consequently, the resulting posterior distribution favors models that generalize well to unseen data by avoiding overly specific fits to the training set, effectively mitigating the risk of overfitting without requiring explicit regularization terms or separate validation procedures.
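
To make this automatic complexity penalty tangible, the rough sketch below uses a crude Monte Carlo estimate of the marginal likelihood $P(D)$ for polynomial models of increasing size, fitted to data that is actually linear; the priors, noise level, and data are all assumptions made purely for illustration, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 0.5 * x + 0.1 * rng.standard_normal(x.size)   # truly linear data

def log_evidence(n_params, n_draws=50000, noise_sd=0.1):
    """Crude Monte Carlo estimate of log P(D | model): average the likelihood
    over draws from a broad N(0, 1) prior on the polynomial coefficients."""
    theta = rng.normal(0.0, 1.0, size=(n_draws, n_params))
    X = np.vander(x, n_params, increasing=True)          # columns 1, x, x^2, ...
    resid = y[None, :] - theta @ X.T
    log_lik = np.sum(-0.5 * (resid / noise_sd) ** 2, axis=1)
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))

for p in (2, 3, 5):
    print(f"polynomial with {p} coefficients: log evidence ~ {log_evidence(p):.1f}")
```

In this toy setting the two-coefficient model typically receives the highest evidence: averaging the likelihood over the prior dilutes the evidence of the larger models, since most of their prior mass corresponds to fits the data does not support.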

Calculating the posterior distribution – the updated belief after observing data – is often analytically intractable, particularly in high-dimensional spaces where the number of parameters exceeds the amount of available data. Consequently, approximation techniques are essential for performing Bayesian inference. Common methods include Markov Chain Monte Carlo (MCMC) algorithms, such as Gibbs sampling and Metropolis-Hastings, which generate samples from the posterior distribution to estimate its properties. Variational inference offers an alternative by approximating the posterior with a simpler, tractable distribution, minimizing the Kullback-Leibler divergence between the approximation and the true posterior. Furthermore, techniques like Hamiltonian Monte Carlo address challenges in high dimensions by incorporating gradient information to improve sampling efficiency, while methods like expectation propagation and Laplace approximation provide deterministic alternatives to sampling-based approaches.
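
As one example of the sampling-based route, here is a bare-bones random-walk Metropolis sampler for a scalar parameter; the Gaussian toy target, step size, and chain length are arbitrary choices for illustration and are not drawn from the paper.

```python
import numpy as np

def metropolis_hastings(log_post, init, n_steps=5000, step=0.5, seed=0):
    """Random-walk Metropolis sampler for a scalar parameter."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_steps)
    theta, lp = init, log_post(init)
    for i in range(n_steps):
        proposal = theta + step * rng.standard_normal()
        lp_prop = log_post(proposal)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples[i] = theta
    return samples

# Toy target: a posterior proportional to N(2, 1).
chain = metropolis_hastings(lambda t: -0.5 * (t - 2.0) ** 2, init=0.0)
print("posterior mean estimate:", chain[1000:].mean())   # discard burn-in
```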

Model selection methodology significantly impacts both the normalized root mean squared error (NRMSE) on test data and the accuracy of ground-truth identification across Feynman datasets.

Symbolic Regression and Evolutionary Algorithms: A Search for Mathematical Elegance

Symbolic regression is a type of regression analysis that searches for mathematical expressions to best model the relationship between input variables and output variables. Unlike traditional regression methods which assume a specific functional form – such as linear or polynomial – symbolic regression automatically discovers both the structure and parameters of the model. The resulting model is expressed as an equation – for example, $y = ax^2 + bx + c$ – and is therefore inherently interpretable by humans. This contrasts with “black box” models like neural networks, where the relationship between inputs and outputs is difficult to discern. The process typically involves an optimization algorithm that evaluates the fitness of candidate expressions against the observed data, aiming to minimize the error between predicted and actual values.
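
The toy sketch below conveys that workflow in miniature: propose a few candidate structures, fit the free parameters of each to the data, and rank the candidates by error. The data-generating equation and the candidate set are invented for illustration and have nothing to do with the paper's benchmarks.

```python
import numpy as np

# Hypothetical data generated from y = 2*x**2 + 1 plus a little noise.
rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 50)
y = 2 * x**2 + 1 + 0.1 * rng.standard_normal(x.size)

# Candidate structures y ~ a*g(x) + b, each defined by its basis function g(x).
candidates = {
    "a*x + b": x,
    "a*x**2 + b": x**2,
    "a*sin(x) + b": np.sin(x),
}

for name, g in candidates.items():
    # Least-squares fit of (a, b), then mean squared error of the fitted model.
    A = np.column_stack([g, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = np.mean((A @ coef - y) ** 2)
    print(f"{name:14s} a={coef[0]:+.2f} b={coef[1]:+.2f} mse={mse:.4f}")
```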

Genetic Programming (GP) is an evolutionary algorithm used to discover mathematical expressions that model data. It operates by maintaining a population of candidate expressions, typically represented as tree structures where terminal nodes represent variables and constants, and internal nodes represent mathematical operators like +, -, *, and /. These expressions are evaluated based on a fitness function – a measure of how well they fit the observed data – and then undergo selection, crossover, and mutation to create a new generation of expressions. Selection favors higher-fitness expressions, crossover combines parts of two expressions, and mutation introduces random changes. This iterative process continues until a satisfactory expression is found or a predefined termination criterion is met, effectively searching the exponentially large space of possible mathematical formulations.
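
A deliberately minimal genetic-programming loop over expression trees is sketched below to make these steps concrete. It uses a tiny operator set, truncation selection, and naive mutation and crossover; it is an illustrative toy, not the GP implementation benchmarked in the paper.

```python
import random
import numpy as np

random.seed(0)
OPS = {"+": np.add, "-": np.subtract, "*": np.multiply}
TERMINALS = ["x", 1.0, 2.0]

def random_tree(depth=3):
    # Grow a random expression tree of bounded depth.
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    # Recursively evaluate a tree on the input array x.
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else np.full_like(x, tree)

def fitness(tree, x, y):
    # Negative mean squared error, so that larger is better.
    return -float(np.mean((evaluate(tree, x) - y) ** 2))

def mutate(tree):
    # Replace a randomly chosen subtree with a freshly grown one.
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree(depth=2)
    op, left, right = tree
    if random.random() < 0.5:
        return (op, mutate(left), right)
    return (op, left, mutate(right))

def crossover(a, b):
    # Graft a random subtree of b into a (simplistic one-way crossover).
    donor = b if not isinstance(b, tuple) else random.choice(b[1:])
    if not isinstance(a, tuple):
        return donor
    op, left, right = a
    return (op, donor, right) if random.random() < 0.5 else (op, left, donor)

# Toy target: y = x*x + 2 on a small grid.
x = np.linspace(-2.0, 2.0, 40)
y = x * x + 2.0

population = [random_tree() for _ in range(200)]
for generation in range(30):
    ranked = sorted(population, key=lambda t: fitness(t, x, y), reverse=True)
    parents = ranked[:50]                    # truncation selection
    population = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(150)
    ]

best = max(population, key=lambda t: fitness(t, x, y))
print("best expression:", best, " fitness:", fitness(best, x, y))
```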

Standard Genetic Programming (GP) implementations often exhibit high computational cost due to the extensive evaluation of numerous candidate expressions within the population across multiple generations. This expense scales rapidly with increasing problem complexity and data dimensionality. Furthermore, GP is prone to generating models that overfit the training data, resulting in complex expressions with many terms and operators that perform poorly on unseen data – a phenomenon known as bloat. Such models, while achieving high accuracy on the training set, lack the generalization capability necessary for predictive performance and offer limited insight into the underlying relationships within the data. Techniques like parsimony pressure and regularization are frequently employed to mitigate these issues, but do not eliminate them entirely.

SMC-SR adaptively adjusts its exploration-exploitation balance (ϕ) across diverse Feynman datasets, maintaining population diversity as demonstrated for dataset I-32-17.

Bayesian Symbolic Regression: A Principled Approach to Model Discovery

Bayesian Symbolic Regression represents a significant advancement over traditional methods by integrating the principles of Bayesian inference into the process of discovering mathematical expressions from data. Unlike conventional symbolic regression, which often yields a single, potentially overfitted model, this approach treats the entire space of possible equations as a probability distribution. This allows the algorithm to quantify uncertainty and explore multiple plausible solutions, enhancing the robustness and generalization capabilities of the resulting model. By assigning probabilities to different expressions based on their ability to fit the observed data, Bayesian Symbolic Regression effectively avoids getting trapped in local optima and provides a more reliable estimate of the underlying relationship, ultimately leading to more accurate predictions and a better understanding of the system being modeled. The probabilistic framework inherently handles noisy data and limited samples more gracefully, delivering more stable and interpretable results.

Successfully navigating the vast landscape of possible mathematical expressions in symbolic regression requires a robust method for estimating the probability of each candidate. Sequential Monte Carlo (SMC) addresses this challenge by representing the posterior probability distribution (the probability of an expression given the data) as a collection of weighted particles. Each particle embodies a potential expression, and its weight reflects its goodness-of-fit to the observed data. Through iterative sampling and weighting, SMC progressively refines this particle representation, focusing computational effort on promising expressions while effectively exploring the search space. This probabilistic approach not only identifies a single best-fit expression but also provides a measure of uncertainty, allowing for a more nuanced understanding of the relationship between variables and offering improved generalization compared to deterministic methods. The technique effectively tackles the inherent complexities of the problem by approximating an otherwise intractable probability distribution, offering a powerful tool for scientific modeling and data analysis.
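
The particle mechanics are easiest to see on a much simpler target than a space of expressions. The sketch below runs likelihood-tempered SMC for a single scalar parameter, with importance weights, an effective-sample-size trigger for resampling, and a crude jitter step standing in for a proper rejuvenation move; it mirrors the general scheme rather than the paper's expression sampler, and every number in it is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inference problem: recover theta from noisy observations y = theta + noise.
data = rng.normal(1.5, 0.5, size=30)

def log_lik(theta):
    # Gaussian log-likelihood of all observations, evaluated per particle.
    return np.sum(-0.5 * ((data[None, :] - theta[:, None]) / 0.5) ** 2, axis=1)

n = 2000
particles = rng.normal(0.0, 3.0, size=n)       # draws from a broad prior
log_w = np.zeros(n)

# Likelihood tempering: raise the likelihood to powers phi increasing from 0 to 1.
phis = np.linspace(0.0, 1.0, 11)
for phi_prev, phi in zip(phis[:-1], phis[1:]):
    log_w += (phi - phi_prev) * log_lik(particles)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Resample whenever the effective sample size drops below half the particles.
    if 1.0 / np.sum(w**2) < n / 2:
        idx = rng.choice(n, size=n, p=w)
        particles, log_w = particles[idx], np.zeros(n)
        # Crude rejuvenation: jitter the resampled particles to restore diversity.
        particles = particles + rng.normal(0.0, 0.1, size=n)

w = np.exp(log_w - log_w.max())
w /= w.sum()
print("tempered-SMC posterior mean estimate:", float(np.sum(w * particles)))
```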

Evaluations demonstrate that the newly developed Sequential Monte Carlo Symbolic Regression (SMC-SR) framework consistently surpasses the performance of traditional Genetic Programming (GP) approaches in the challenging task of equation discovery. Across a comprehensive benchmark of 12 Feynman datasets – representing a diverse range of physical phenomena – SMC-SR achieves notably lower Normalized Root Mean Squared Error (NRMSE-test) values. This consistent improvement suggests that the probabilistic nature of SMC, combined with symbolic regression, yields more robust and generalizable models. The framework’s ability to effectively explore the space of possible equations, while quantifying uncertainty, leads to solutions that better approximate the underlying relationships within the data, offering a significant advancement in the field of automated scientific discovery.

SMC-SR adaptively adjusts its exploration-exploitation balance (ϕ) across diverse Feynman datasets, maintaining population diversity as demonstrated for dataset I-32-17.

Toward Robustness and Generalization: Implications for Real-World Data

Traditional symbolic regression methods often struggle when confronted with the inherent noise present in real-world datasets, leading to unstable or inaccurate model discovery. Bayesian Symbolic Regression, however, addresses this limitation through its probabilistic framework, effectively quantifying uncertainty and mitigating the impact of noisy observations. This robustness stems from the algorithm’s ability to integrate prior knowledge and marginalize over a distribution of possible models, rather than relying on a single, potentially overfitted solution. Consequently, the framework demonstrates a significantly improved capacity to recover underlying relationships from data containing measurement errors or extraneous variables, making it particularly well-suited for applications in fields like physics, engineering, and finance where data quality is rarely perfect and reliable model generalization is paramount.

The Bayesian Symbolic Regression framework showcases a remarkable ability to autonomously rediscover established physical laws when applied to the Feynman Datasets. These datasets, consisting of observational data generated from various physical systems, served as a challenging benchmark for the method’s capacity to perform scientific discovery. The framework successfully identified equations governing phenomena such as harmonic oscillation, free fall, and Kepler’s third law, effectively mirroring results obtained through traditional scientific investigation. This outcome highlights the potential of the approach to not merely fit data, but to infer underlying principles, offering a compelling pathway for automated hypothesis generation and validation in scientific domains. The recovered equations, expressed in symbolic form (such as $y = ax^2 + bx + c$ for parabolic motion), demonstrate the system’s capacity to express complex relationships from raw data, reinforcing its utility as a tool for data-driven scientific exploration.

Evaluations reveal the Sequential Monte Carlo Symbolic Regression (SMC-SR) framework demonstrates a significantly reduced tendency to overfit compared to genetic programming symbolic regression (GPSR) methods, overfitting on only five of twelve tested datasets. This improved generalization isn’t achieved through sheer computational force; SMC-SR accomplishes this with remarkable efficiency, evaluating roughly half the number of unique models considered by traditional genetic programming approaches. This suggests that the framework’s sampling strategy effectively navigates the model space, focusing computational resources on promising regions and avoiding exhaustive – and often fruitless – exploration, ultimately leading to more robust and parsimonious solutions when applied to complex data.

Results on Feynman datasets demonstrate that the best models, assessed by both training and test NRMSE, consistently outperform the noise level of the training data, though some GP-agg bars are truncated in the test NRMSE plot.

The pursuit of robust model identification, as detailed in the paper’s Sequential Monte Carlo framework, echoes a fundamental tenet of mathematical rigor. The study champions an approach where algorithms aren’t merely assessed by empirical performance, but by their ability to navigate uncertainty and avoid the pitfalls of overfitting, a testament to provable correctness. This resonates deeply with John McCarthy’s observation: “The best way to program is to start with a very clear idea of what you want to accomplish and then write code that implements that idea.” The SMC-SR method, by prioritizing exploration of the search space and quantifying uncertainty, embodies this clarity – a deliberate construction of a solution, rather than a hopeful approximation.

What Lies Ahead?

The presented work, while demonstrating a marked improvement in robustness and uncertainty quantification within symbolic regression, merely addresses symptoms of a deeper malady. The continued reliance on stochastic search, even within a formally Bayesian framework, remains a compromise. One suspects the true elegance of model discovery lies not in cleverly navigating the space of possible equations, but in a systematic, deductive approach: a means to prove the simplest correct equation, rather than approximate it through sampling. The efficiency gains achieved through Sequential Monte Carlo are valuable, yet represent a pragmatic concession, not a philosophical victory.

Future efforts should not focus solely on refining the search process. A fruitful avenue lies in incorporating prior knowledge more effectively: not as mere constraints on equation form, but as axioms guiding the deduction. The current paradigm implicitly assumes a blank slate, a curiously unmotivated stance given the underlying physical laws governing most observed data. To truly move beyond heuristic approximation, the field must grapple with the challenge of representing and leveraging such knowledge in a mathematically rigorous manner.

Ultimately, the question remains: can symbolic regression be elevated from an art of intelligent guesswork to a science of logical inference? The answer, one suspects, will necessitate a reevaluation of fundamental assumptions and a willingness to embrace mathematical purity, even at the expense of immediate practical gains. The pursuit of elegance, after all, is rarely a shortcut.


Original article: https://arxiv.org/pdf/2512.10849.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
