
The Many-Body Systems Manifesto

Scientific AI is advancing. Machine learning models now predict molecular properties, relax atomic structures, and approximate energies at scales that were once unreachable. These advances are real. They demonstrate that learning-based methods can accelerate scientific workflows and reduce experimental cost.

This progress, however, is uneven. Most visible success occurs in regimes where physical interactions are weak, behavior is smooth, and approximations behave predictably. In these domains, errors remain small and interpolation works.

Benchmarks reinforce this pattern. They reward performance where models already succeed and avoid regimes where accuracy becomes unstable. What is easy to compute and easy to learn becomes the definition of progress. What is difficult remains largely invisible.

But the hardest physical regimes, which also carry the highest scientific and economic value, remain underexplored.


The most valuable scientific problems involve collective electronic behavior, near-degenerate states, and interactions that cannot be decomposed into independent parts. These are many-body systems.

Catalysts, batteries, functional materials, metal-containing drugs, and complex interfaces all fall into this category. Their properties emerge from coordinated electronic motion rather than from the behavior of isolated particles. Small changes in structure or environment can produce large changes in behavior.

In these regimes, model accuracy collapses. These failures are often attributed to noise, insufficient data, or architectural choices, but the deeper cause usually lies in the reference data the models were trained on.


Density Functional Theory reshaped computational science by making quantum calculations tractable at scale. It remains one of the most successful approximations in physics and chemistry.

Over time, DFT became the default source of scientific ground truth. Most datasets, benchmarks, and evaluation pipelines are derived from it. What began as an approximation gradually came to be treated as reality.

Machine learning systems trained on this data learned efficiently. They reproduced DFT behavior across large chemical spaces. In doing so, they also inherited its limitations.

Learning systems do not understand where approximations break down. They generalize patterns without context. Known failure regimes became embedded in the data and propagated at scale.

This is not a critique of DFT. The approximation enabled progress, and the same approximation now sets the boundary of that progress. Moving forward requires new reference points.


Scientific AI does not suffer from a lack of data volume. It suffers from a lack of coverage.

Most existing datasets oversample regimes where approximations perform well and avoid regimes where they fail. Strong correlation, excited states, and complex coordination environments are rare by construction. They are expensive and historically difficult to compute.

Learning systems cannot infer physics they have never encountered.

What is required is a shift in how datasets are structured. Accuracy must be highest where uncertainty matters most.

The missing layer is reference-grade, post-approximation data that captures true electronic behavior in the hardest cases. This layer defines the limits of what scientific AI can learn.


Many-Body exists to build this missing data layer.

Our focus is electronic-scale ground truth in regimes dominated by many-body effects. We generate high-fidelity, post-DFT datasets designed specifically for machine learning. The goal is not to replace models, but to give them reliable reference physics where approximations fail.

Recent advances make this possible. GPU-accelerated quantum chemistry and optimized infrastructure have reduced the cost of high-accuracy calculations. At the same time, the field is reaching diminishing returns from scale alone. Data quality has become decisive.

Reference-grade scientific data compounds in value. It can be reused across customers, domains, and generations of models.

Many-Body is built on the belief that scientific progress accelerates when ground truth improves. Better data enables better models, better decisions, and better discoveries.