Post 3: My Materials Project + Paper-Mining Pipeline: Architecture Notes

With the dataset schema fixed — one row per (compound, source) pair, structural descriptors plus property block plus provenance — the next question is mechanical: how does data actually get into that table?

Two streams feed in, and they need very different handling.

Stream 1 — Materials Project

This is the easy half. The mp-api client lets me query by chemical system (e.g. Fe-S, Mn-Se, Co-Te) and pull back structure, band gap, total magnetization, formation energy, and the calculation's exchange-correlation functional in one go. Each entry already has a stable mp-id, which becomes the provenance field directly. The main work here is just mapping MP's field names onto my schema and computing the derived structural descriptors (bond lengths, angles, d-electron count) from the returned structure object — MP doesn't give those directly.
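A minimal sketch of that mapping step, under a few assumptions. The schema column names (`formula`, `provenance`, `band_gap_eV`, and so on) are placeholders, not the real column names. The query goes through `MPRester.materials.summary.search` from mp-api, with the import kept inside the function so the rest runs offline. The functional is left empty because this sketch does not settle which endpoint it comes from. The d-electron count uses the usual group-number-minus-oxidation-state rule and assumes the oxidation state is supplied by hand.

```python
# Summary fields requested from Materials Project.
MP_FIELDS = [
    "material_id", "formula_pretty", "structure",
    "band_gap", "total_magnetization", "formation_energy_per_atom",
]

# Periodic-table group numbers for the 3d metals used in this sketch.
GROUP_NUMBER = {"Ti": 4, "V": 5, "Cr": 6, "Mn": 7,
                "Fe": 8, "Co": 9, "Ni": 10, "Cu": 11}


def fetch_summaries(chemsys, api_key):
    """Query one chemical system, e.g. "Fe-S". Needs network access and an API key."""
    from mp_api.client import MPRester  # imported lazily so the rest runs offline

    with MPRester(api_key) as mpr:
        return mpr.materials.summary.search(chemsys=chemsys, fields=MP_FIELDS)


def d_electron_count(metal, oxidation_state):
    """d-electron count of a transition-metal ion: group number minus oxidation state."""
    return GROUP_NUMBER[metal] - oxidation_state


def to_schema_row(doc, metal, oxidation_state):
    """Map one MP summary record (as a plain dict) onto hypothetical schema columns."""
    return {
        "formula": doc["formula_pretty"],
        "source": "materials_project",
        "provenance": str(doc["material_id"]),  # the mp-id doubles as provenance
        "band_gap_eV": doc.get("band_gap"),
        "total_magnetization_muB": doc.get("total_magnetization"),
        "formation_energy_eV_per_atom": doc.get("formation_energy_per_atom"),
        "functional": None,  # filled in separately; not covered by this sketch
        "d_electrons": d_electron_count(metal, oxidation_state),
    }


# Hand-written record standing in for an API response (values are illustrative only).
example = {"material_id": "mp-0000", "formula_pretty": "FeS", "band_gap": 0.0,
           "total_magnetization": 2.0, "formation_energy_per_atom": -0.5}
row = to_schema_row(example, metal="Fe", oxidation_state=2)
print(row["provenance"], row["d_electrons"])
```

Bond lengths and angles would come from the returned structure object in the same pass; they are left out here to keep the sketch free of extra dependencies.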

Network restrictions early on blocked me from installing mp-api locally, so this stream currently runs on Colab/Kaggle, where the package is either already available or installs cleanly.

Stream 2 — Paper mining

This is where an LLM does the heavy lifting. Papers report data in wildly inconsistent formats: inline text ("the calculated gap of 0.31 eV"), tables with non-standard column headers, figure captions, sometimes only a plot with no numeric value at all. A regex-based extractor would need a different rule for every paper.

Instead, each PDF gets converted to text (with a pypdf fallback for awkward layouts), then passed to an LLM with a prompt that asks specifically for: compound formula, space group if stated, lattice parameters, band gap with units, magnetic moment, ordering type, Néel/Curie temperature, and the method/functional/U value used — returned as structured rows matching the schema. Anything not explicitly stated is left blank rather than guessed.
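The extract-then-parse step can be sketched roughly as below. The field names, the prompt wording, and the idea that the model answers with a JSON list are all assumptions for illustration; the LLM call itself is omitted because it depends on the provider. `pdf_to_text` uses pypdf's `PdfReader` and `extract_text`, and `parse_rows` is where "blank rather than guessed" is enforced: any key the model did not return comes back as `None`.

```python
import json

# Hypothetical column names; the real schema may differ.
FIELDS = [
    "formula", "space_group", "lattice_parameters", "band_gap", "band_gap_units",
    "magnetic_moment", "ordering_type", "transition_temperature_K",
    "method", "functional", "hubbard_u_eV",
]


def pdf_to_text(path):
    """Extract text page by page with pypdf."""
    from pypdf import PdfReader  # imported lazily so the rest runs without pypdf

    reader = PdfReader(path)
    # extract_text() can yield nothing for image-only pages, hence the `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def build_prompt(paper_text):
    """Assemble the extraction prompt (wording is illustrative)."""
    return (
        "Extract one JSON object per compound reported in the paper below. "
        f"Use exactly these keys: {', '.join(FIELDS)}. "
        "If a value is not explicitly stated, use null. Do not guess.\n"
        "Return a JSON list and nothing else.\n\n" + paper_text
    )


def parse_rows(llm_response, source_id):
    """Turn the model's JSON list into schema rows tagged with their source."""
    rows = []
    for item in json.loads(llm_response):
        # Missing keys become None; keys outside the schema are dropped.
        row = {key: item.get(key) for key in FIELDS}
        row["source"] = source_id
        rows.append(row)
    return rows


# Hand-written stand-in for a model response (illustrative values only).
fake_response = '[{"formula": "MnSe", "band_gap": 0.31, "band_gap_units": "eV"}]'
rows = parse_rows(fake_response, source_id="paper-001")
print(rows[0]["band_gap"], rows[0]["space_group"])
```

Keeping the parser strict about keys means a chatty or partially wrong response cannot add columns to the table; it can only leave cells empty.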

Pipeline flow

The diagram below shows how the two streams converge.

Why no overwriting

The temptation, when merging two sources for the same compound, is to pick "the better one" and discard the other. I'm deliberately not doing that. A conflict between MP's GGA gap and a paper's GGA+U gap for the same compound is itself a data point — it tells me something about how sensitive that compound's electronic structure is to correlation effects, which is exactly the kind of signal the later feature engineering wants to capture, not throw away.
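As a small sketch of what "no overwriting" means in practice: the merge is a plain concatenation, and the disagreement between sources is computed as its own quantity. The column names (`formula`, `source`, `band_gap_eV`) and the gap values are placeholders, and using max minus min as the conflict measure is just one simple choice.

```python
from collections import defaultdict


def merge_streams(mp_rows, paper_rows):
    """Concatenate both streams: one row per (compound, source), nothing overwritten."""
    return list(mp_rows) + list(paper_rows)


def gap_spread(rows):
    """Per compound, the max minus min of reported band gaps across sources."""
    gaps = defaultdict(list)
    for row in rows:
        if row.get("band_gap_eV") is not None:  # blank cells do not count as 0
            gaps[row["formula"]].append(row["band_gap_eV"])
    return {formula: max(values) - min(values) for formula, values in gaps.items()}


# Illustrative rows only; the numbers are made up for the example.
mp_rows = [
    {"formula": "FeS", "source": "materials_project", "functional": "GGA", "band_gap_eV": 0.0},
    {"formula": "CoTe", "source": "materials_project", "functional": "GGA", "band_gap_eV": 0.0},
]
paper_rows = [
    {"formula": "FeS", "source": "paper-001", "functional": "GGA+U", "band_gap_eV": 1.5},
]

table = merge_streams(mp_rows, paper_rows)
spread = gap_spread(table)
print(len(table), spread)
```

A compound with a single source gets a spread of zero, so downstream code may want to track the number of sources alongside the spread to tell "sources agree" apart from "only one source".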

With both streams landing in a unified table, the next post covers feature engineering — turning the raw structural fields (lattice parameters, bond lengths, angles) into the physics-informed descriptors that the structure-property models will actually use: crystal field splitting, GKA-rule indicators, Mott criterion values, and the rest.