2026-06-11 · pipelines
Extracting structured claims from 28,000 papers on a desk, not a datacenter
The architecture of a research extraction pipeline that has triaged 333,409 papers and fully extracted 28,812 of them: acquisition, schema-constrained extraction, entity resolution, and the boring optimizations that mattered more than model choice.
My extraction pipeline has now sourced and triaged 333,409 research papers across seven corpora, and fully extracted 28,812 of them into structured claims, treatments, methods, and findings. It runs on hardware that fits on a desk, with cheap batch APIs picking up what local models cannot. Nothing about it is exotic. That is the point of this essay: at corpus scale, the wins come from plumbing, not from model selection.
Why not just rent a datacenter
The first version of the cost math was sobering. Extracting a single corpus with a frontier model at interactive prices would have cost more than the hardware I already owned, and I wanted to run many corpora, repeatedly, as schemas improved. Two decisions changed the economics completely.
First, local inference for the high-volume, low-difficulty work. Page classification, relevance triage, and first-pass extraction do not need a frontier model. They need a competent model running at zero marginal cost so you can afford to be wrong and rerun.
Second, batch APIs for the work that genuinely needs a stronger model. Batch pricing is roughly half of interactive pricing, and extraction is the perfect batch workload: thousands of independent documents, no user waiting. One corpus of 25,037 papers cost about twenty-five dollars to triage and extract this way, which works out to roughly a tenth of a cent per candidate paper. The lesson generalizes: if your pipeline has a human waiting on the other end, you pay interactive prices; if it does not, you should almost never pay them.
The stages
Every corpus runs the same five stages, and each stage quarantines its failures instead of passing them along.
Acquisition. OpenAlex and PubMed queries plus citation chaining. Everything gets an identifier on the way in, because every later stage will need to prove where a claim came from.
Triage. A cheap model classifies relevance before anything expensive happens. On the largest corpus this cut 25,037 candidates to 13,222 relevant papers, which means about 47 percent of the candidates never reached an expensive stage. Triage is the highest-leverage stage in the pipeline: every dollar spent here saves several downstream.
Text extraction. PDFs are the swamp. Some parse cleanly, some are scans, some are forty-year-old typewriter pages. The pipeline records extraction quality per document and downgrades its confidence accordingly, because pretending all text is equally trustworthy is how silent garbage enters a dataset.
Claim extraction. The model reads each paper against a strict schema: claims with statistics, populations, methods, evidence levels, quotes. The schema is enforced outside the model. Outputs that fail validation are rejected and retried, and fields the model cannot fill stay empty. An empty field is honest. A guessed field is a landmine.
Entity resolution and publication. Authors, treatments, and concepts get normalized so that the same thing mentioned two ways becomes one thing. Published claims carry their provenance: source identifier, extraction timestamp, verification status.
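The claim extraction stage above, with its outside-the-model validation, retries, and quarantine, could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the field names, the `EVIDENCE_LEVELS` values, the verbatim-quote check, and the `call_model` callable are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical evidence levels; the real schema's values are not given in the essay.
EVIDENCE_LEVELS = {"rct", "observational", "case_report", "review"}
REQUIRED_FIELDS = ("text", "quote")
OPTIONAL_FIELDS = ("statistic", "population", "method", "evidence_level")


@dataclass
class Claim:
    text: str
    quote: str
    # Optional fields default to None: an empty field, never a guessed one.
    statistic: Optional[str] = None
    population: Optional[str] = None
    method: Optional[str] = None
    evidence_level: Optional[str] = None


def validate(raw: str, paper_text: str) -> Claim:
    """Parse one model output; raise ValueError if it breaks the schema."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    for name in REQUIRED_FIELDS:
        if not isinstance(data.get(name), str) or not data[name].strip():
            raise ValueError(f"missing required field: {name}")
    # Assumed check: a quote that does not appear in the paper is treated as a guess.
    if data["quote"] not in paper_text:
        raise ValueError("quote not found in source text")
    for name in OPTIONAL_FIELDS:
        value = data.get(name)
        if value is not None and not isinstance(value, str):
            raise ValueError(f"{name} must be a string or null")
    level = data.get("evidence_level")
    if level is not None and level not in EVIDENCE_LEVELS:
        raise ValueError(f"unknown evidence level: {level}")
    return Claim(**data)


def extract_with_retries(
    paper_id: str,
    paper_text: str,
    call_model: Callable[[str], str],  # hypothetical: returns raw model output
    quarantine: List[dict],
    max_attempts: int = 3,
) -> Optional[Claim]:
    """Retry until the output validates; quarantine the paper if it never does."""
    errors = []
    for _ in range(max_attempts):
        try:
            return validate(call_model(paper_text), paper_text)
        except ValueError as exc:
            errors.append(str(exc))
    # Failures are set aside with their reasons instead of flowing downstream.
    quarantine.append({"paper_id": paper_id, "errors": errors})
    return None
```

The design choice worth noting is that the validator never repairs anything. It either accepts the output as it stands or rejects it, so nothing the model did not actually produce can end up in a published record.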
What actually moved throughput
I expected model upgrades to matter most. They mattered least. The changes that actually moved the needle:
Caching by content hash. Papers reappear across corpora and reruns. Hashing the content and skipping anything already processed turned reruns from days into hours.
Batching with bounded concurrency. A few parallel workers with backpressure beat both the single-threaded version and the unbounded version that fell over on rate limits.
Chunking tuned to the document, not the model. Respecting section boundaries in papers produced better extractions than any prompt improvement on badly chunked text. I have since relearned this lesson on legal documents, where it matters even more.
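The first two items in this list, content-hash caching and bounded concurrency, compose naturally, and a rough sketch might look like the following. It assumes an in-memory dict as the cache and a hypothetical async `extract` callable. A semaphore only caps the number of in-flight calls, so a fuller version would likely add a bounded queue and rate-limit retries.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict, List


def content_key(text: str) -> str:
    """Stable cache key: SHA-256 of the whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


async def process_corpus(
    texts: List[str],
    extract: Callable[[str], Awaitable[str]],  # hypothetical extraction call
    cache: Dict[str, str],                     # assumed: persisted between runs
    max_workers: int = 4,
) -> List[str]:
    """Extract each distinct document once, with at most max_workers in flight."""
    semaphore = asyncio.Semaphore(max_workers)

    async def run_one(key: str, text: str) -> None:
        async with semaphore:  # caps concurrent calls to the model
            cache[key] = await extract(text)

    # Skip anything already cached, and deduplicate within this run.
    pending: Dict[str, str] = {}
    for text in texts:
        key = content_key(text)
        if key not in cache and key not in pending:
            pending[key] = text

    await asyncio.gather(*(run_one(k, t) for k, t in pending.items()))
    return [cache[content_key(t)] for t in texts]
```

Calling `asyncio.run(process_corpus(texts, extract, cache))` a second time with the same cache makes no calls to `extract` at all, which is the property that makes reruns cheap.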
Where clever lost to conservative
Entity resolution was the stage where I tried to be smart and paid for it. The clever version used a model to decide whether two author names referred to the same person. It merged people who should not be merged, which is the worst failure mode in a research tool, because it silently corrupts everything built on top.
The version that works is boring: deterministic normalization, conservative merge rules, and a refusal to merge on ambiguity. Two records that might be the same person stay separate until evidence says otherwise. The clever version optimized for completeness. The boring version optimizes for never being confidently wrong, and that is the right trade in any system where trust compounds.
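To make the boring version concrete, here is a minimal sketch of deterministic normalization with a conservative merge rule. The specific rule (merge only when an identifier such as an ORCID matches and the normalized names agree) and the `AuthorRecord` shape are assumptions for illustration; the essay does not spell out the actual rules.

```python
import unicodedata
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    MERGE = "merge"
    SEPARATE = "separate"
    REVIEW = "review"  # ambiguous: records stay separate until evidence arrives


@dataclass(frozen=True)
class AuthorRecord:
    name: str
    orcid: Optional[str] = None  # assumed identifier field


def normalize_name(name: str) -> str:
    """Deterministic normalization: drop accents, case, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in stripped.casefold())
    return " ".join(cleaned.split())


def decide(a: AuthorRecord, b: AuthorRecord) -> Decision:
    """Merge only on positive evidence; anything ambiguous is left unmerged."""
    same_name = normalize_name(a.name) == normalize_name(b.name)
    if a.orcid and b.orcid:
        if a.orcid != b.orcid:
            return Decision.SEPARATE  # conflicting identifiers: never merge
        # Same identifier but different names suggests bad data, so hold it.
        return Decision.MERGE if same_name else Decision.REVIEW
    # A matching name alone is a hint, not evidence.
    return Decision.REVIEW if same_name else Decision.SEPARATE
```

Only `Decision.MERGE` changes the dataset. `REVIEW` behaves exactly like `SEPARATE` at publication time and simply records that a pair is worth a second look. The cost is some duplicate records, but two people are never merged on a name match alone.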
The numbers
As of this writing: 333,409 papers sourced and triaged across seven research corpora, and 28,812 fully extracted into structured records. Every published citation is machine-verified against OpenAlex and Crossref before it ships, a gate I built after discovering that my own pipeline was fabricating citations. Total extraction spend is measured in tens of dollars per corpus, not thousands, because triage is aggressive, the cheap model does the bulk work, and nothing expensive runs twice.
If you are building anything similar, this is the order of operations I would hand you: get provenance on every document first, build the quarantine paths second, tune triage third, and only then start arguing about which model to use. The model is the most replaceable part of the whole machine.