Post 1: From DFT to Data-Driven — Why I'm Adding ML to My Computational Toolkit

I've spent years doing first-principles calculations (WIEN2k, VASP, ABINIT, SIESTA, Quantum ESPRESSO, and a handful of other codes) on 3d transition metal chalcogenides (MS, MSe, MTe compounds, where M is a transition metal). Each calculation gives me a deep, accurate picture of one compound: its band structure, density of states, magnetic moment, and magnetic ordering. But it's slow. A converged DFT+U calculation with proper magnetic configuration testing can take days, even on a good cluster.

Over the years, I've accumulated results for dozens of these compounds, scattered across papers, notebooks, and output files. And every time, I notice the same thing: the trends across compounds are often more interesting than any single result. Why does one MTe compound open a gap while a structurally similar MSe doesn't? Why does the magnetic ordering flip from antiferromagnetic to ferromagnetic when the M–X–M angle crosses some threshold? I know the qualitative physics — Goodenough-Kanamori-Anderson rules, crystal field theory, Zaanen-Sawatzky-Allen classification — but I've never had the systematic, quantitative cross-compound view.

This is where I think machine learning, and specifically physics-informed ML, can help.

What I mean by "physics-informed"

I'm not interested in black-box prediction. A model that says "band gap = 0.8 eV" with no explanation is useless to me as a researcher — I need to understand why. So my approach is to:

Encode the descriptors I already trust from DFT and theory (bond angles, crystal field splitting, d-electron count, electronegativity differences) as explicit features
Use these features to fit interpretable models — regression, decision rules, symbolic regression
Validate against my own DFT results and Materials Project data
End up with formulae or rules I can actually check against the physics
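
To make the "formulae I can check" idea concrete, here is a minimal sketch of the simplest possible case: fitting a power law y = A·x^(-n) by least squares in log-log space and reading off the exponent. The data are synthetic and noise-free, generated from an assumed exponent of 5 (the textbook point-charge crystal field scaling) purely to show the mechanics. The function name `fit_power_law`, the bond-length grid, and the prefactor are all illustrative, not results.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = A * x**(-n) as a straight line in log-log space.

    Returns (A, n). All x and y values must be positive.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    m = len(lx)
    mean_x = sum(lx) / m
    mean_y = sum(ly) / m
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic test data: made-up bond lengths (Å) and an arbitrary prefactor.
bond_lengths = [2.2, 2.3, 2.4, 2.5, 2.6, 2.7]
splittings = [100.0 * r ** -5 for r in bond_lengths]

prefactor, exponent = fit_power_law(bond_lengths, splittings)
print(f"A = {prefactor:.3f}, n = {exponent:.3f}")
```

The point of a fit like this is that the output is a number with a physical meaning: if real data gave an exponent far from the expected one, that would be a question for the physics, not just a worse score.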

The goal isn't to replace DFT. It's to build a layer on top of it: a way to rapidly screen candidate compounds, generate hypotheses, and identify which structures are worth the computational cost of a full calculation.

Why now

Two things converged. First, the Materials Project now has structured data for hundreds of transition metal chalcogenides — band gaps, magnetic moments, formation energies, all queryable via API. That's a dataset I couldn't have built by hand in a reasonable time. Second, tools like PySR (symbolic regression) and SHAP (model interpretation) make it possible to go from "the model predicts X" to "the model predicts X because of descriptor Y" — which is the only kind of result I actually trust.
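
For a sense of what "queryable via API" means, the sketch below shows one way such a query might look with the `mp-api` Python client. It assumes the package is installed, that you have a valid API key, and that the summary endpoint accepts the `chemsys` and `fields` arguments used here. The helper names `chalcogenide_chemsys` and `fetch_summaries` are my own placeholders, and the field list is just a plausible starting set.

```python
CHALCOGENS = ("S", "Se", "Te")


def chalcogenide_chemsys(metals, chalcogens=CHALCOGENS):
    """Build binary chemical-system strings such as 'Fe-S' for every metal/chalcogen pair."""
    return [f"{m}-{x}" for m in metals for x in chalcogens]


def fetch_summaries(api_key, metals):
    """Query Materials Project summary data for binary M-X chemical systems.

    Requires the `mp-api` package and network access. The import is kept
    local so the helper above works without either.
    """
    from mp_api.client import MPRester

    with MPRester(api_key) as mpr:
        return mpr.materials.summary.search(
            chemsys=chalcogenide_chemsys(metals),
            fields=[
                "material_id",
                "formula_pretty",
                "band_gap",
                "total_magnetization",
                "formation_energy_per_atom",
            ],
        )


# The pure helper can be checked offline:
print(chalcogenide_chemsys(["Mn", "Fe"]))
# → ['Mn-S', 'Mn-Se', 'Mn-Te', 'Fe-S', 'Fe-Se', 'Fe-Te']
```

Restricting `fields` keeps the response small, which matters once the query spans the whole 3d row.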

Application: octahedral crystal field splitting

As a first concrete descriptor — one every compound in the eventual dataset will carry — consider the crystal field splitting Δ in an octahedral environment. It scales roughly as Δ ∝ 1/(M–X bond length)⁵. Combined with the d-electron count, Δ determines the high-spin/low-spin state (via comparison with the pairing energy) and gives a rough first read on whether the compound trends toward a wide gap, a moderate gap, or a metallic state. Try it below — drag the sliders.

Octahedral crystal field splitting (interactive widget)

Adjust the M–X bond length and d-electron count to see how Δ and the t₂g / eg occupation shift, as a first read on the band gap regime.

[Widget controls: M–X bond length (Å), default 2.45; d-electron count, default 5. Readouts: Δ (crystal field, eV), spin state, estimated gap regime.]

Illustrative scaling only — Δ ∝ 1/(M–X)⁵, high-spin/low-spin split estimated from Δ vs typical pairing energy. Real values come from DFT.
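
For readers without the interactive version, the same logic can be sketched in a few lines of Python. This is a toy under stated assumptions, not the widget's actual code: the reference splitting `DELTA_REF`, the reference bond length `R_REF`, and the pairing energy `PAIRING_ENERGY` are hypothetical placeholders, and all function names are mine. It covers Δ, the spin state, and the t₂g / eg occupation, and does not attempt the gap-regime estimate.

```python
# All three constants are hypothetical placeholders, not fitted or computed values.
R_REF = 2.45           # Å, reference M–X bond length
DELTA_REF = 1.0        # eV, assumed splitting at R_REF
PAIRING_ENERGY = 1.5   # eV, assumed typical pairing energy


def crystal_field_splitting(bond_length, r_ref=R_REF, delta_ref=DELTA_REF):
    """Return Δ in eV from the 1/r^5 scaling, anchored at (r_ref, delta_ref)."""
    if bond_length <= 0:
        raise ValueError("bond_length must be positive")
    return delta_ref * (r_ref / bond_length) ** 5


def occupation(d_count, low_spin):
    """Return (t2g, eg) electron counts for an octahedral d^n ion."""
    if not 0 <= d_count <= 10:
        raise ValueError("d_count must be between 0 and 10")
    if low_spin:
        t2g = min(d_count, 6)        # fill t2g completely before touching eg
    elif d_count <= 5:
        t2g = min(d_count, 3)        # singly occupy t2g first, then eg
    else:
        t2g = min(d_count - 2, 6)    # pairing starts in t2g until it is full
    return t2g, d_count - t2g


def unpaired_electrons(t2g, eg):
    """Count unpaired electrons from the (t2g, eg) occupation."""
    return (t2g if t2g <= 3 else 6 - t2g) + (eg if eg <= 2 else 4 - eg)


def spin_state(d_count, delta, pairing=PAIRING_ENERGY):
    """Label the spin state; only d4 to d7 have distinct high- and low-spin fillings."""
    if d_count < 4 or d_count > 7:
        return "unique"
    return "low-spin" if delta > pairing else "high-spin"


for r in (2.20, 2.45, 2.70):
    delta = crystal_field_splitting(r)
    state = spin_state(5, delta)
    t2g, eg = occupation(5, low_spin=(state == "low-spin"))
    n_unpaired = unpaired_electrons(t2g, eg)
    print(f"r = {r:.2f} Å  Δ = {delta:.2f} eV  {state}  "
          f"t2g^{t2g} eg^{eg}  unpaired = {n_unpaired}")
```

Because of the fifth power, a change of roughly 10% in bond length moves Δ by a factor of about 1.6 to 1.7, which is why the spin state can flip over a fairly narrow range of bond lengths.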

This is the kind of descriptor that will feed the dataset built in the next post: pulled from my own DFT outputs where available, extracted from papers via LLM parsing, and cross-checked against Materials Project entries.

What this series will document

This is a research notebook, written as I go. I expect false starts, dead ends, and revisions. The plan, roughly:

Build a unified dataset combining my own DFT results, literature data extracted from papers, and Materials Project entries
Engineer physics-informed descriptors — the same quantities I'd reason about manually, just made explicit and computable
Look for structure → electronic property relationships first (band gaps, metal-insulator transitions)
Then structure → magnetic property relationships (ordering type, moments, Néel temperatures)
Then try to capture the coupling between electronic and magnetic behavior
Use symbolic regression to search for formula forms, not just fit coefficients
End with a "rule bank" — a set of interpretable, physics-grounded predictive rules I can apply to compounds I haven't calculated yet
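
To give a rough idea of what a "rule bank" could look like as a data structure, here is a small sketch: each rule is a named predicate over a dictionary of descriptors plus the qualitative prediction it supports. Everything here is hypothetical, including the `Rule` class, the descriptor key `mxm_angle_deg`, and the angle cutoffs, which are arbitrary placeholders standing in for thresholds that would have to come from the data. The two example rules simply restate the familiar Goodenough-Kanamori-Anderson tendencies for near-180° and near-90° superexchange.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping, Tuple

Descriptors = Mapping[str, float]


@dataclass(frozen=True)
class Rule:
    """One interpretable rule: a name, a condition on descriptors, and a prediction."""
    name: str
    applies: Callable[[Descriptors], bool]
    prediction: str


def apply_rules(rules: List[Rule], descriptors: Descriptors) -> List[Tuple[str, str]]:
    """Return (rule name, prediction) for every rule whose condition holds."""
    return [(r.name, r.prediction) for r in rules if r.applies(descriptors)]


# Placeholder cutoffs (150° and 100°) are illustrative only.
RULES = [
    Rule(
        name="GKA near-180 superexchange",
        applies=lambda d: d["mxm_angle_deg"] >= 150.0,
        prediction="antiferromagnetic coupling favoured",
    ),
    Rule(
        name="GKA near-90 superexchange",
        applies=lambda d: d["mxm_angle_deg"] <= 100.0,
        prediction="ferromagnetic coupling favoured",
    ),
]

print(apply_rules(RULES, {"mxm_angle_deg": 175.0}))
```

Keeping each rule as a separate named object means every prediction can be traced back to the condition that triggered it, and a compound that triggers no rule returns an empty list rather than a forced guess.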

If it works, the payoff is concrete: faster screening of candidate transition metal chalcogenide compounds, better-targeted DFT calculations, and maybe a few rules worth publishing alongside the usual first-principles results.

Next post: building the dataset itself — what data I'm pulling from my own papers, what I'm getting from Materials Project, and the inevitable mess of reconciling the two.