Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval.
Abstract
Public datasets were designed for a world of indexable web search: crawlable, HTML-rendered, and optimized for human analysts retrieving results via traditional search engines. LLM-mediated retrieval introduces a distinct consumption pattern (agent-driven, tool-call native, citation-dependent, and structurally sensitive to machine-readable metadata) that existing design conventions do not address. We propose agent-native dataset design as a distinct category of publishing practice, complementary to but separate from training-corpus datasets (Common Crawl, LAION, The Pile) and traditional research datasets (BLS OES, ICPSR, UCI ML Repository).
We articulate six design principles for retrieval-optimized data: machine-readability first, license clarity, multi-surface distribution, schema completeness, entity grounding, and citation affordances. We instantiate each principle in Orbyt Intelligence, a free (CC BY 4.0) U.S. compensation dataset covering 3,445 roles across 81 metropolitan areas.
We conduct an empirical retrieval evaluation across five LLM vendors (Anthropic, OpenAI, Google, Perplexity, xAI) in ten configurations, issuing 50 stratified compensation queries (500 responses) plus a 100-query Orbyt-targeted follow-up (1,000 responses).
The headline empirical finding is that schema completeness and CC BY 4.0 licensing are necessary but not sufficient for retrieval discovery: they qualify a dataset for citation candidacy but do not, on their own, overcome the link-graph and click-signal advantage that two entrenched incumbents (Glassdoor for factual queries, BuiltIn for broader compensation queries) have accumulated over decade-plus histories.
We support this with: a retrieval-mode effect (URL citation rate rising from 0% without retrieval to 70-100% with it; paired t(49) ≤ -8.9, p < 10⁻¹¹); an approximately 4× cross-vendor citation-volume spread at constant model family; source concentration (the top five sources account for 68% of named-source mentions); and four pre-registered hypothesis tests on the targeted corpus at α = 0.0125 (Bonferroni-corrected), in which H1 and H2 (predicted Orbyt-over-baseline uplift) were rejected in the opposite direction in six of eight retrieval cells, while H3 and H4 (within-Orbyt schema effects) were supported wherever retrieval discovers Orbyt at non-floor rates.
We conclude with a ten-item retrofit checklist that addresses the necessary half of the condition at approximately one person-week of effort and zero monetary cost, while explicitly disclaiming that the sufficient half (citation uplift over established sources) requires longitudinal accrual the checklist cannot accelerate.
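The pre-registered test procedure described above (paired t-tests at a Bonferroni-corrected threshold of α = 0.05/4 = 0.0125, with a directional prediction that can be rejected in the opposite direction) can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the paper's actual analysis code or corpus; variable names and the synthetic effect size are assumptions.

```python
# Hypothetical sketch: one pre-registered paired test at a
# Bonferroni-corrected threshold (alpha = 0.05 / 4 = 0.0125).
# Synthetic data only; not the paper's corpus or pipeline.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_queries = 50
alpha = 0.05 / 4  # Bonferroni correction over four hypotheses

# Per-query citation rates: an established baseline source vs. a
# candidate dataset that (here, by construction) underperforms it.
baseline = rng.uniform(0.4, 0.9, n_queries)
candidate = baseline + rng.normal(-0.2, 0.05, n_queries)

t_stat, p_two_sided = stats.ttest_rel(candidate, baseline)

# One-sided p-values for the predicted direction (candidate > baseline)
# and for the opposite direction (candidate < baseline).
p_predicted = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
p_opposite = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

if p_predicted < alpha:
    verdict = "supported"
elif p_opposite < alpha:
    verdict = "rejected in the opposite direction"
else:
    verdict = "inconclusive"

print(f"t({n_queries - 1}) = {t_stat:.2f}, verdict: {verdict}")
```

With the synthetic deficit built in above, the test lands in the "rejected in the opposite direction" cell, the same pattern the paper reports for H1 and H2 at six of eight retrieval cells.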
Key findings
Five empirical results, ranked by load-bearing weight on the paper’s thesis.
Reproducibility package
Every artifact required to reproduce the evaluation lives at a stable URL with a permissive license. Researchers can re-run the entire pipeline against new vendor configurations, longer corpora, or different models.
Cite this paper
The paper is licensed CC BY 4.0 (attribution required). Both citations resolve to the same Zenodo record; choose whichever your reference manager expects.
Plain text (APA-style)
Bartak, J. (2026). Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval. Zenodo. https://doi.org/10.5281/zenodo.19754393
BibTeX
@misc{bartak2026agentnative,
  author    = {Bartak, Justin},
  title     = {Agent-Native Dataset Design: Schema, Licensing,
               and Distribution Patterns for {LLM} Retrieval},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19754393},
  url       = {https://doi.org/10.5281/zenodo.19754393}
}

Where this paper lives
The same artifact is mirrored across multiple publication surfaces. The permanent DOI on Zenodo is the canonical citation; this page is the canonical owned URL, with full Schema.org structured data.
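A JSON-LD block of the kind the page's Schema.org markup describes might look like the sketch below: a ScholarlyArticle carrying the title, author, license, and the DOI that links the owned URL back to the canonical Zenodo record. The exact property choices are illustrative assumptions, not the page's actual markup.

```python
# Illustrative Schema.org JSON-LD for the paper page (property
# selection is a sketch, not the page's real structured data).
import json

article = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "headline": ("Agent-Native Dataset Design: Schema, Licensing, "
                 "and Distribution Patterns for LLM Retrieval"),
    "author": {"@type": "Person", "name": "Justin Bartak"},
    "datePublished": "2026",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # sameAs ties the owned URL to the canonical Zenodo DOI record.
    "sameAs": "https://doi.org/10.5281/zenodo.19754393",
    "identifier": {
        "@type": "PropertyValue",
        "propertyID": "DOI",
        "value": "10.5281/zenodo.19754393",
    },
}

jsonld = json.dumps(article, indent=2)
print(jsonld)
```

Embedding such a block in a `<script type="application/ld+json">` tag is the conventional way to expose it to crawlers and retrieval tools.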
License
Released under Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, remix, transform, and build on the paper for any purpose, including commercially, with attribution.
The companion dataset (Orbyt Intelligence) is also CC BY 4.0. Both share the same license terms: quote, redistribute, and build on them, with attribution to Bartak, J. (2026) for the paper and to Orbyt Intelligence for the dataset.
Related on this site
Read the full paper.
The full preprint runs ~32 pages including the §B.6 targeted-corpus appendix with the H1-H4 paired tests, per-baseline breakdown, and figure-accuracy spot-check.