Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval.
Abstract
Public datasets were designed for a world of indexable web search: crawlable, HTML-rendered, and optimized for human analysts retrieving results via traditional search engines. LLM-mediated retrieval introduces a distinct consumption pattern (agent-driven, tool-call native, citation-dependent, and structurally sensitive to machine-readable metadata) that existing design conventions do not address. We propose agent-native dataset design as a distinct category of publishing practice, complementary to but separate from training-corpus datasets (Common Crawl, LAION, The Pile) and traditional research datasets (BLS OES, ICPSR, UCI ML Repository).
We articulate six design principles for retrieval-optimized data: machine-readability first, license clarity, multi-surface distribution, schema completeness, entity grounding, and citation affordances. We instantiate each principle in Orbyt Intelligence, a free (CC BY 4.0) U.S. compensation dataset covering 3,445 roles across 81 metropolitan areas.
We conduct an empirical retrieval evaluation across five LLM vendors (Anthropic, OpenAI, Google, Perplexity, xAI) in ten configurations, issuing 50 stratified compensation queries (500 responses) plus a 100-query Orbyt-targeted follow-up (1,000 responses).
The headline empirical finding is that schema completeness and CC BY 4.0 licensing are necessary but not sufficient for retrieval discovery: they qualify a dataset for citation candidacy but do not, on their own, overcome the link-graph and click-signal advantage that two entrenched incumbents (Glassdoor for factual queries, BuiltIn for broader compensation queries) have accumulated over decade-plus histories.
We support this with: a retrieval-mode effect (URL citation rate rising from 0% without retrieval to 70-100% with it; paired t(49) ≤ -8.9, p < 10⁻¹¹); an approximately 4× cross-vendor citation-volume spread at constant model family; source concentration (the top five sources account for 68% of named-source mentions); and four pre-registered hypothesis tests on the targeted corpus at α = 0.0125 (Bonferroni-corrected), in which H1 and H2 (predicted Orbyt-over-baseline uplift) were rejected in the opposite direction in six of eight retrieval cells, while H3 and H4 (within-Orbyt schema effects) were supported wherever retrieval discovers Orbyt at non-floor rates.
We conclude with a ten-item retrofit checklist that addresses the necessary half of the condition at approximately one person-week of effort and zero monetary cost, while explicitly disclaiming that the sufficient half (citation uplift over established sources) requires longitudinal accrual the checklist cannot accelerate.
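The pre-registered test procedure described above (paired t-tests at a Bonferroni-corrected threshold of α = 0.05/4 = 0.0125, with a directional prediction that can be rejected in the opposite direction) can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the paper's actual analysis code or corpus; variable names and the synthetic effect size are assumptions.

```python
# Hypothetical sketch: one pre-registered paired test at a
# Bonferroni-corrected threshold (alpha = 0.05 / 4 = 0.0125).
# Synthetic data only; not the paper's corpus or pipeline.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_queries = 50
alpha = 0.05 / 4  # Bonferroni correction over four hypotheses

# Per-query citation rates: an established baseline source vs. a
# candidate dataset that (here, by construction) underperforms it.
baseline = rng.uniform(0.4, 0.9, n_queries)
candidate = baseline + rng.normal(-0.2, 0.05, n_queries)

t_stat, p_two_sided = stats.ttest_rel(candidate, baseline)

# One-sided p-values for the predicted direction (candidate > baseline)
# and for the opposite direction (candidate < baseline).
p_predicted = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
p_opposite = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

if p_predicted < alpha:
    verdict = "supported"
elif p_opposite < alpha:
    verdict = "rejected in the opposite direction"
else:
    verdict = "inconclusive"

print(f"t({n_queries - 1}) = {t_stat:.2f}, verdict: {verdict}")
```

With the synthetic deficit built in above, the test lands in the "rejected in the opposite direction" cell, the same pattern the paper reports for H1 and H2 at six of eight retrieval cells.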
Key findings
Five empirical results, ranked by load-bearing weight on the paper’s thesis.
Reproducibility package
Every artifact required to reproduce the evaluation lives at a stable URL with a permissive license. Researchers can re-run the entire pipeline against new vendor configurations, longer corpora, or different models.
Cite this paper
The paper is licensed CC BY 4.0 (attribution required). Both citations resolve to the same Zenodo record; choose whichever your reference manager expects.
Plain text (APA-style)
Bartak, J. (2026). Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval. Zenodo. https://doi.org/10.5281/zenodo.19754393
BibTeX
@misc{bartak2026agentnative,
  author    = {Bartak, Justin},
  title     = {Agent-Native Dataset Design: Schema, Licensing,
               and Distribution Patterns for {LLM} Retrieval},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19754393},
  url       = {https://doi.org/10.5281/zenodo.19754393}
}

Where this paper lives
The same artifact is mirrored across multiple publication surfaces. The permanent DOI on Zenodo is the canonical citation; this page is the canonical owned URL, with full Schema.org structured data.
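A JSON-LD block of the kind the page's Schema.org markup describes might look like the sketch below: a ScholarlyArticle carrying the title, author, license, and the DOI that links the owned URL back to the canonical Zenodo record. The exact property choices are illustrative assumptions, not the page's actual markup.

```python
# Illustrative Schema.org JSON-LD for the paper page (property
# selection is a sketch, not the page's real structured data).
import json

article = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "headline": ("Agent-Native Dataset Design: Schema, Licensing, "
                 "and Distribution Patterns for LLM Retrieval"),
    "author": {"@type": "Person", "name": "Justin Bartak"},
    "datePublished": "2026",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # sameAs ties the owned URL to the canonical Zenodo DOI record.
    "sameAs": "https://doi.org/10.5281/zenodo.19754393",
    "identifier": {
        "@type": "PropertyValue",
        "propertyID": "DOI",
        "value": "10.5281/zenodo.19754393",
    },
}

jsonld = json.dumps(article, indent=2)
print(jsonld)
```

Embedding such a block in a `<script type="application/ld+json">` tag is the conventional way to expose it to crawlers and retrieval tools.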
License
Released under Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, remix, transform, and build on the paper for any purpose, including commercially, with attribution.
The companion dataset (Orbyt Intelligence) is also CC BY 4.0. Both share the same license terms: quote, redistribute, and build on them, with attribution to Bartak, J. (2026) for the paper and to Orbyt Intelligence for the dataset.
Related on this site
Read the full paper.
The full preprint runs ~32 pages including the §B.6 targeted-corpus appendix with the H1-H4 paired tests, per-baseline breakdown, and figure-accuracy spot-check.