Post 2: Building My First Compound Dataset: From Papers to Pandas

Once I'd decided to add a data-driven layer to my DFT work, the first practical question was: what does my dataset actually look like? Not in the abstract — row by row, what goes in each cell.

I settled on three sources, each with very different reliability profiles:

1. My own DFT calculations: the most trustworthy, but covering only the handful of compounds I've personally converged with WIEN2k, VASP, or SIESTA.
2. Literature papers: broad coverage, but inconsistent reporting. Some papers give Δ directly, others give crystal field parameters in cm⁻¹, and others bury the magnetic moment in a table caption.
3. Materials Project: structured, queryable, consistent units, but computed with a fixed methodology (usually GGA or GGA+U) that may not match what's appropriate for a given compound's correlation strength.

The reconciliation problem

The moment I tried to merge these, the mess started. The same compound — say FeS in a rock-salt-like environment — might show up with a band gap of 0.0 eV in Materials Project (predicted metallic under GGA), 0.3 eV in one paper using GGA+U, and a qualitative "small gap semiconductor" in another paper with no number at all. None of these are "wrong" — they're different methods answering slightly different questions.
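To make the conflict concrete, here is a minimal sketch of those three FeS reports as rows of a pandas table. The column names and the reference strings are assumptions for illustration (the placeholders are not real citations or MP entry IDs), and the U value is left out because the example above does not state one.

```python
import pandas as pd

# Three reports for the same compound, taken from the FeS example in the text.
# Column names are illustrative; reference strings are placeholders only.
reports = pd.DataFrame(
    [
        {"formula": "FeS", "source": "materials_project", "method": "GGA",
         "band_gap_eV": 0.0, "gap_note": "predicted metallic",
         "reference": "<MP entry ID>"},
        {"formula": "FeS", "source": "literature", "method": "GGA+U",
         "band_gap_eV": 0.3, "gap_note": None,
         "reference": "<paper A>"},
        {"formula": "FeS", "source": "literature", "method": None,
         "band_gap_eV": None, "gap_note": "small gap semiconductor",
         "reference": "<paper B>"},
    ]
)

# A missing number is stored as NaN rather than 0.0, so "no value reported"
# is never confused with "metallic".
n_numeric = reports["band_gap_eV"].notna().sum()
spread = reports["band_gap_eV"].max() - reports["band_gap_eV"].min()

print(reports)
print(f"{n_numeric} numeric reports, spread = {spread:.1f} eV")
```

Keeping the qualitative report as its own row with a free-text note means it still counts as evidence without forcing a made-up number into the gap column.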

So every row in the dataset needs not just values but provenance: which method, which U value if applicable, which paper or MP entry ID. Without that, any correlation I find later is uninterpretable: I won't be able to tell whether a trend reflects physics or just which method happened to be used for which compounds.
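One possible check that provenance makes available, sketched under assumptions: once every row carries a method label, a cross-tabulation of compound family against method shows whether the two are entangled. The column names `family` and `method` and the toy rows below are invented for illustration and contain no physical values.

```python
import pandas as pd

# Toy provenance table: labels only, no physical values.
# Family and method assignments are invented for illustration.
rows = pd.DataFrame(
    {
        "family": ["sulfide", "sulfide", "sulfide", "oxide", "oxide", "oxide"],
        "method": ["GGA", "GGA", "GGA", "GGA+U", "GGA+U", "GGA"],
    }
)

# Count how many rows each (family, method) combination contributes.
table = pd.crosstab(rows["family"], rows["method"])
print(table)

# A family covered by only one method cannot separate chemistry from methodology.
single_method = table.index[(table > 0).sum(axis=1) == 1].tolist()
print("Families seen with a single method:", single_method)
```

In this toy table the sulfides appear only under GGA, so any sulfide-versus-oxide trend could equally be a GGA-versus-GGA+U trend.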

Application: dataset row builder

Below is a small interactive mock-up of what a single dataset row looks like once assembled — pick a compound family and a source, and see how the same nominal property can carry different values depending on where it came from.

[Interactive widget: Dataset row preview. Pick a compound and a data source to see how the reported band gap and magnetic moment shift, and what provenance gets attached to the row. Fields shown: band gap (eV), magnetic moment (μB), method / U (eV), provenance.]

Illustrative values — the real dataset stores one row per (compound, source) pair rather than overwriting, so conflicting reports stay visible.
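The append-instead-of-overwrite rule can be sketched in pandas as follows. This is a minimal illustration, not the actual ingestion code: the two frames stand in for hypothetical ingestion streams, the column names are assumptions, and the values mirror the FeS example earlier in the post.

```python
import pandas as pd

# Two hypothetical ingestion streams reporting on the same compound.
mp_rows = pd.DataFrame(
    [{"formula": "FeS", "source": "materials_project",
      "method": "GGA", "band_gap_eV": 0.0}]
)
paper_rows = pd.DataFrame(
    [{"formula": "FeS", "source": "literature",
      "method": "GGA+U", "band_gap_eV": 0.3}]
)

# Append, never overwrite: concat keeps every report as its own row.
dataset = pd.concat([mp_rows, paper_rows], ignore_index=True)

# Guard against accidental duplicates of the same (compound, source) pair.
assert not dataset.duplicated(subset=["formula", "source"]).any()

# Flag compounds whose sources disagree, without resolving the conflict.
n_values = dataset.groupby("formula")["band_gap_eV"].nunique()
conflicted = n_values[n_values > 1].index.tolist()
print(conflicted)
```

If more than one paper can report on the same compound, the uniqueness key would need the reference as well as the source type; the sketch uses the (compound, source) pair only because that is how the rule is stated above.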

The schema I landed on

Each row carries: compound formula, space group, lattice parameters, M–X bond length(s), M–X–M angle(s), d-electron count, then the property block — band gap, magnetic moment, ordering type, Néel/Curie temperature where available — and finally the provenance block: source type, method, U value, reference. Multiple rows per compound are not just allowed, they're expected. Reconciliation happens later, at the feature-engineering stage, not at ingestion.
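As a rough sketch of that schema, the three blocks could be written as a Python dataclass. The field names, units, the `CompoundRow` class name, and the three allowed source labels are all assumptions chosen for illustration; the post only fixes which quantities a row carries, not how they are named.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical labels for the three sources described in this post.
SOURCE_TYPES = {"own_dft", "literature", "materials_project"}


@dataclass
class CompoundRow:
    # Identity and provenance that every row must state
    formula: str
    source_type: str
    # Structure block
    space_group: Optional[str] = None
    lattice_params_angstrom: Optional[tuple] = None  # (a, b, c); angles omitted here
    mx_bond_lengths_angstrom: tuple = ()
    mxm_angles_deg: tuple = ()
    d_electron_count: Optional[int] = None
    # Property block
    band_gap_eV: Optional[float] = None
    magnetic_moment_muB: Optional[float] = None
    ordering_type: Optional[str] = None
    transition_temp_K: Optional[float] = None  # Néel or Curie temperature
    # Rest of the provenance block
    method: Optional[str] = None
    u_value_eV: Optional[float] = None
    reference: Optional[str] = None

    def __post_init__(self):
        # Reject rows whose source cannot be traced to a known stream.
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"unknown source_type: {self.source_type!r}")


# One literature report; unreported fields stay None instead of being guessed.
row = CompoundRow(
    formula="FeS",
    source_type="literature",
    band_gap_eV=0.3,
    method="GGA+U",
    reference="<paper A>",  # placeholder, not a real citation
)
print(asdict(row))
```

Defaulting everything except the formula and source type to `None` reflects the inconsistent reporting in the literature: a row can be ingested with whatever a source provides, and `asdict` turns it into a plain dictionary that can be passed to a pandas DataFrame constructor.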

With the schema fixed, the next step is the pipeline architecture: how papers get parsed by an LLM into this schema, how the Materials Project API query is structured, and how the two streams merge into one table without silently dropping the provenance information that makes the whole thing trustworthy.