Preprint · CC BY 4.0 · 2026-04-25

Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval.

By Justin Bartak · ORCID 0009-0005-2615-3624

DOI: 10.5281/zenodo.19754393

Download PDF (320 KB) · View on Zenodo · Reproducibility package
5 LLM vendors evaluated
10 cells in evaluation matrix
1,500 total responses scored
4 pre-registered hypotheses
α = 0.0125 (Bonferroni-corrected)

Abstract

Public datasets were designed for a world of indexable web search: crawlable, HTML-rendered, and optimized for human analyst retrieval via traditional search engines. LLM-mediated retrieval introduces a distinct consumption pattern (agent-driven, tool-call native, citation-dependent, and structurally sensitive to machine-readable metadata) that existing design conventions do not address. We propose agent-native dataset design as a distinct category of publishing practice, complementary to but separate from training-corpus datasets (Common Crawl, LAION, The Pile) and traditional research datasets (BLS OES, ICPSR, UCI ML Repository).

We articulate six design principles for retrieval-optimized data (machine-readability first, license clarity, multi-surface distribution, schema completeness, entity grounding, and citation affordances) and instantiate each principle in Orbyt Intelligence, a free (CC BY 4.0) U.S. compensation dataset covering 3,445 roles across 81 metropolitan areas.

We conduct an empirical retrieval evaluation across five LLM vendors (Anthropic, OpenAI, Google, Perplexity, xAI) in ten configurations, issuing 50 stratified compensation queries (500 responses) plus a 100-query Orbyt-targeted follow-up (1,000 responses).
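The response counts follow directly from the design. The short sketch below reproduces the arithmetic, assuming each query is issued exactly once per configuration (an assumption that is consistent with the reported totals but is not stated explicitly here).

```python
# Response counts implied by the evaluation design.
# Assumption: every query is issued once in every configuration.
CONFIGURATIONS = 10       # cells in the evaluation matrix
STRATIFIED_QUERIES = 50   # stratified compensation queries
TARGETED_QUERIES = 100    # Orbyt-targeted follow-up queries

stratified_responses = CONFIGURATIONS * STRATIFIED_QUERIES
targeted_responses = CONFIGURATIONS * TARGETED_QUERIES
total_responses = stratified_responses + targeted_responses

print(stratified_responses, targeted_responses, total_responses)  # 500 1000 1500
```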

The headline empirical finding is that schema completeness and CC BY 4.0 licensing are necessary but not sufficient for retrieval discovery: they qualify a dataset for citation candidacy but do not, alone, overcome the link-graph and click-signal advantage that two incumbents on which citations concentrate (Glassdoor for factual queries, BuiltIn for broader compensation queries) have accumulated over histories of a decade or more.

We support this with four lines of evidence: a retrieval-mode effect (the URL citation rate rises from 0% without retrieval to 70-100% with retrieval; paired t(49) ≤ -8.9, p < 10⁻¹¹); an approximately 4× cross-vendor spread in citation volume at constant model family; source concentration (the top five sources account for 68% of named-source mentions); and four pre-registered hypothesis tests on the targeted corpus at α = 0.0125 (Bonferroni-corrected). H1 and H2, which predicted an Orbyt-over-baseline uplift, are rejected in the opposite direction at six of eight retrieval cells. H3 and H4, which concern within-Orbyt schema effects, are supported where retrieval discovers Orbyt at non-floor rates.
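To make the statistics concrete, the following is a minimal sketch of a paired t statistic and a Bonferroni-corrected threshold. It is not the paper's hypothesis-test runner. The function names and the per-query citation indicators are hypothetical illustrations, and the family-wise level of 0.05 is an assumption chosen because 0.05 / 4 gives the reported α = 0.0125.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for a paired t-test on the differences before - after.

    Raises ZeroDivisionError if every difference is identical
    (the standard error is then zero).
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests


# Hypothetical indicators for 50 queries: 1 if the response cited a URL.
no_retrieval = [0] * 50               # 0% citation rate
with_retrieval = [1] * 40 + [0] * 10  # 80% citation rate (illustrative)

t_stat, df = paired_t(no_retrieval, with_retrieval)
alpha = bonferroni_alpha(0.05, 4)     # assumed family-wise level of 0.05

print(f"t({df}) = {t_stat:.1f}, per-test alpha = {alpha}")
```

With 50 paired queries the test has 49 degrees of freedom, which matches the t(49) notation above. The sign is negative because the differences are taken as no-retrieval minus retrieval. Converting t to a p-value needs the t distribution's CDF, which the standard library does not provide, so that step is left out.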

We conclude with a ten-item retrofit checklist that addresses the necessary half of the condition at approximately one person-week of effort and zero monetary cost, while explicitly disclaiming that the sufficient half (citation uplift over established sources) requires longitudinal accrual the checklist cannot accelerate.

Key findings

Five empirical results, ranked by load-bearing weight on the paper’s thesis.

1. Retrieval mode is the dominant driver of citation behavior.
Within the same model family, switching from default mode to web-search-enabled mode raises the URL citation rate from 0% to 70-100%, with paired t(49) ≤ -8.9, p < 10⁻¹¹. Schema and licensing are necessary, but invisible without retrieval.
2. Cross-vendor citation volume varies by approximately 4×.
At constant model family and retrieval mode, citation density ranges from ~3 sources per query (Anthropic Haiku) to ~15 (Perplexity Sonar). Vendors apply different per-query citation budgets that materially shape source visibility.
3. Source citation is heavily concentrated.
Across 1,500 responses, the top five sources account for 68% of all named-source mentions. Glassdoor and Levels.fyi alone produce ~25%. New datasets enter a winner-take-most retrieval market.
4. H1 and H2 are rejected in the opposite direction.
Predicted Orbyt-over-baseline citation uplift was rejected at six of eight retrieval cells. The rejection is concentrated: BuiltIn drives most of the H1 incumbency (Orbyt actually beats Huntr and Jobscan in Cell 5); Glassdoor drives all of H2 on factual queries.
5. H3 and H4 are conditionally supported.
Where retrieval discovers Orbyt at non-floor rates (GPT-4o + web_search), the cited URLs are heavily skewed toward schema-rich routes (/salaries, /orbyt-intelligence) over prose routes (/blog, /guides). The schema effect manifests conditionally on indexation.
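Findings 3 and 5 both come down to counting over cited sources. The sketch below shows one way such counts could be computed. It is not the paper's scoring code: the function names, the placeholder source labels, and the example URLs are hypothetical. Only the route prefixes (/salaries, /orbyt-intelligence, /blog, /guides) come from the findings above.

```python
from collections import Counter
from urllib.parse import urlparse

# Route prefixes named in finding 5.
SCHEMA_RICH_PREFIXES = ("/salaries", "/orbyt-intelligence")
PROSE_PREFIXES = ("/blog", "/guides")


def top_k_share(mentions, k=5):
    """Fraction of all named-source mentions captured by the k most-cited sources."""
    counts = Counter(mentions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


def route_kind(url):
    """Classify a cited URL as 'schema-rich', 'prose', or 'other' by path prefix."""
    path = urlparse(url).path
    if path.startswith(SCHEMA_RICH_PREFIXES):
        return "schema-rich"
    if path.startswith(PROSE_PREFIXES):
        return "prose"
    return "other"


# Hypothetical mentions with placeholder labels (not real measurements).
mentions = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 3 + ["e"] * 2 + ["f"] * 2 + ["g"]
share = top_k_share(mentions, k=5)

# Hypothetical cited URLs; the paths after the route prefix are made up.
cited = [
    "https://www.orbytjobs.ai/salaries/example-role",
    "https://www.orbytjobs.ai/orbyt-intelligence/dataset",
    "https://www.orbytjobs.ai/blog/example-post",
]
kinds = Counter(route_kind(u) for u in cited)

print(f"top-5 share: {share:.2f}", dict(kinds))
```

Prefix matching is deliberately simple here: a path such as /salaries-archive would also count as schema-rich, so a real scorer would likely match on whole path segments.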

Reproducibility package

Every artifact required to reproduce the evaluation lives at a stable URL with a permissive license. Researchers can re-run the entire pipeline against new vendor configurations, longer corpora, or different models.

GitHub repository
docs/research/agent-native-datasets/
All drafts, eval scripts, per-cell response files, scoring outputs, hypothesis-test runner, and figure-accuracy spot-check.
Zenodo record (paper)
10.5281/zenodo.19754393
PDF + permanent DOI + DataCite metadata. CC BY 4.0.
Zenodo record (dataset)
10.5281/zenodo.19653006
Orbyt Intelligence dataset (2026.1) the paper documents. CC BY 4.0.
Live dataset landing page
/orbyt-intelligence/dataset
Schema.org/Dataset + DataDownload + DataFeed. Google Dataset Search qualifying.
Dataset methodology
/orbyt-intelligence/methodology
How the underlying salary data is sourced, cleaned, and updated.
Author ORCID
0009-0005-2615-3624
Justin Bartak (Purecraft LLC). Linked Wikidata entity Q139551829.
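
The landing-page entry above lists Schema.org/Dataset and DataDownload markup. As a rough illustration of what that kind of markup contains, the sketch below builds a minimal JSON-LD document in Python. It is not the markup actually served by the landing page: the helper name and the download URL are hypothetical placeholders, and DataFeed is omitted. The property names (name, description, identifier, license, version, distribution, encodingFormat, contentUrl) are standard Schema.org vocabulary.

```python
import json


def dataset_jsonld(name, description, doi, version, license_url, download_url, encoding_format):
    """Build a minimal Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "version": version,
        "license": license_url,
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


doc = dataset_jsonld(
    name="Orbyt Intelligence",
    description="U.S. compensation dataset covering 3,445 roles across 81 metropolitan areas.",
    doi="10.5281/zenodo.19653006",
    version="2026.1",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/orbyt-intelligence.csv",  # placeholder, not the real file
    encoding_format="text/csv",                                  # assumed format
)

# The serialized string is what would sit inside a
# <script type="application/ld+json"> element on the page.
print(json.dumps(doc, indent=2))
```

Putting the license and DOI in the machine-readable record, rather than only in prose, is what lets a crawler or agent read them without parsing the page text.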

Cite this paper

The paper is licensed CC BY 4.0, so reuse requires only attribution. Both citations resolve to the same Zenodo record; choose whichever your reference manager expects.

Plain text (APA-style)

Bartak, J. (2026). Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval. Zenodo. https://doi.org/10.5281/zenodo.19754393

BibTeX

@misc{bartak2026agentnative,
  author    = {Bartak, Justin},
  title     = {Agent-Native Dataset Design: Schema, Licensing,
              and Distribution Patterns for {LLM} Retrieval},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19754393},
  url       = {https://doi.org/10.5281/zenodo.19754393}
}

Where this paper lives

The same artifact is mirrored across multiple publication surfaces. The permanent DOI on Zenodo is the canonical citation; this page is the canonical owned URL, with full Schema.org structured data.

Zenodo (primary, DOI)
https://doi.org/10.5281/zenodo.19754393
live
ORCID Works
https://orcid.org/0009-0005-2615-3624
live
Orbyt Research (this page)
https://www.orbytjobs.ai/research/agent-native-datasets
live
GitHub reproducibility package
https://github.com/justinbartak/app/tree/main/docs/research/agent-native-datasets
live
SSRN
https://papers.ssrn.com (in editorial review)
in review
HuggingFace Papers
Deferred to v2.0 (requires arXiv endorsement)
deferred

License

Released under Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, remix, transform, and build on the paper for any purpose, including commercially, with attribution.

The companion dataset (Orbyt Intelligence) is also CC BY 4.0. Both share the same license terms: you may quote, redistribute, and build on them with attribution, citing Bartak, J. (2026) for the paper and Orbyt Intelligence for the dataset.

Related on this site

Dataset (Orbyt Intelligence)
The case-study dataset the paper documents. Full Schema.org/Dataset markup.
Dataset methodology
How the underlying compensation data is sourced and computed.
Data Catalog
All endpoints, fields, and identifiers in the public API.
About the author
Justin Bartak — founder profile, prior work, and contact.

Read the full paper.

The full preprint runs ~32 pages including the §B.6 targeted-corpus appendix with the H1-H4 paired tests, per-baseline breakdown, and figure-accuracy spot-check.

© 2026 Purecraft LLC  All rights reserved.
