Preprint · CC BY 4.0 · 2026-04-25

Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval.

By Justin Bartak · ORCID 0009-0005-2615-3624

DOI: 10.5281/zenodo.19754393

5 LLM vendors evaluated
10 cells in evaluation matrix
1,500 total responses scored
4 pre-registered hypotheses
α = 0.0125 (Bonferroni-corrected)

Abstract

Public datasets were designed for a world of indexable web search: crawlable, HTML-rendered, and optimized for human analyst retrieval via traditional search engines. LLM-mediated retrieval introduces a distinct consumption pattern (agent-driven, tool-call native, citation-dependent, and structurally sensitive to machine-readable metadata) that existing design conventions do not address. We propose agent-native dataset design as a distinct category of publishing practice, complementary to but separate from training-corpus datasets (Common Crawl, LAION, The Pile) and traditional research datasets (BLS OES, ICPSR, UCI ML Repository).

We articulate six design principles for retrieval-optimized data (machine-readability first, license clarity, multi-surface distribution, schema completeness, entity grounding, and citation affordances) and instantiate each principle in Orbyt Intelligence, a free (CC BY 4.0) U.S. compensation dataset covering 3,445 roles across 81 metropolitan areas.

We conduct an empirical retrieval evaluation across five LLM vendors (Anthropic, OpenAI, Google, Perplexity, xAI) in ten configurations, issuing 50 stratified compensation queries (500 responses) plus a 100-query Orbyt-targeted follow-up (1,000 responses).
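The response totals follow from the design, assuming each query is issued once to each of the ten configurations (which is what the stated counts imply). A quick arithmetic check, with variable names chosen here purely for illustration:

```python
# Response-count check for the evaluation design described above.
# Names are illustrative; the counts themselves come from the abstract.
CONFIGURATIONS = 10        # cells in the evaluation matrix
STRATIFIED_QUERIES = 50    # main stratified corpus
TARGETED_QUERIES = 100     # Orbyt-targeted follow-up

# Assumption: one scored response per query per configuration.
main_responses = CONFIGURATIONS * STRATIFIED_QUERIES
targeted_responses = CONFIGURATIONS * TARGETED_QUERIES
total_responses = main_responses + targeted_responses

print(main_responses, targeted_responses, total_responses)  # 500 1000 1500
```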

The headline empirical finding is that schema completeness and CC BY 4.0 licensing are necessary but not sufficient for retrieval discovery: they qualify a dataset for citation candidacy but do not, on their own, overcome the link-graph and click-signal advantage that two specific incumbents where citations concentrate (Glassdoor for factual queries, BuiltIn for broader compensation queries) have accumulated over decade-plus histories.

We support this with: a retrieval-mode effect (0% URL citation rate without retrieval, rising to 70-100% with retrieval; paired t(49) ≤ -8.9, p < 10⁻¹¹); an approximately 4× cross-vendor citation-volume spread at constant model family; source concentration (top five = 68% of named-source mentions); and four pre-registered hypothesis tests on the targeted corpus at α = 0.0125 (Bonferroni-corrected). H1 and H2 (predicted Orbyt-over-baseline uplift) were rejected in the opposite direction at six of eight retrieval cells; H3 and H4 (within-Orbyt schema effects) were supported where retrieval discovers Orbyt at non-floor rates.
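For readers less familiar with the two statistics quoted above, here is a minimal sketch of a paired t statistic over 50 queries (hence 49 degrees of freedom) and of the Bonferroni-adjusted threshold. It is not the paper's hypothesis-test runner. The family-wise level of 0.05 is an assumption inferred from 0.0125 × 4, the per-query indicators are synthetic, and the function name `paired_t` is hypothetical.

```python
import math
from statistics import mean, stdev

# Assumption: family-wise alpha of 0.05, inferred from 0.0125 x 4 hypotheses.
FAMILYWISE_ALPHA = 0.05
N_HYPOTHESES = 4
per_test_alpha = FAMILYWISE_ALPHA / N_HYPOTHESES  # Bonferroni threshold


def paired_t(x, y):
    """Return (t, degrees of freedom) for paired samples, using x - y."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Synthetic per-query indicators (1 = response cited a URL), NOT the paper's data.
no_retrieval = [0] * 50                 # 0% citation rate
with_retrieval = [1] * 40 + [0] * 10    # 80% citation rate

t, df = paired_t(no_retrieval, with_retrieval)
print(per_test_alpha, df, round(t, 2))
```

With these synthetic inputs the statistic comes out at about -14 on 49 degrees of freedom; the sign is negative because the difference is taken as no-retrieval minus retrieval. The sketch stops at the statistic because the standard library does not provide a t-distribution for the p-value.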

We conclude with a ten-item retrofit checklist that addresses the necessary half of the condition at approximately one person-week of effort and zero monetary cost, while explicitly disclaiming that the sufficient half (citation uplift over established sources) requires longitudinal accrual the checklist cannot accelerate.

Key findings

Five empirical results, ranked by load-bearing weight on the paper’s thesis.

1. Retrieval mode is the dominant driver of citation behavior.
Across the same model family, switching from default-mode to web-search-enabled increases the URL citation rate from 0% to 70-100%, with paired t(49) ≤ -8.9, p < 10⁻¹¹. Schema and licensing are necessary, but invisible without retrieval.
2. Cross-vendor citation volume varies by approximately 4×.
At constant model family and retrieval mode, citation density ranges from ~3 sources per query (Anthropic Haiku) to ~15 (Perplexity Sonar). Vendors apply different per-query citation budgets that materially shape source visibility.
3. Source citation is heavily concentrated.
Across 1,500 responses, the top five sources account for 68% of all named-source mentions. Glassdoor and Levels.fyi alone produce ~25%. New datasets enter a winner-take-most retrieval market.
4. H1 and H2 are rejected in the opposite direction.
Predicted Orbyt-over-baseline citation uplift was rejected at six of eight retrieval cells. The rejection is concentrated: BuiltIn drives most of the H1 incumbency (Orbyt actually beats Huntr and Jobscan in Cell 5); Glassdoor drives all of H2 on factual queries.
5. H3 and H4 are conditionally supported.
Where retrieval discovers Orbyt at non-floor rates (GPT-4o + web_search), the cited URLs are heavily skewed toward schema-rich routes (/salaries, /orbyt-intelligence) over prose routes (/blog, /guides). The schema effect manifests conditionally on indexation.
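Two of the measurements above are simple to state in code: the share of mentions captured by the top-k sources (finding 3) and the split of cited URLs into schema-rich versus prose routes (finding 5). The following is a sketch under stated assumptions, not the paper's scoring script. The function names are hypothetical, the mention list is synthetic, and only the four route prefixes and the `/orbyt-intelligence/dataset` path come from this page.

```python
from collections import Counter
from urllib.parse import urlparse

# Route prefixes named in finding 5; any other URL layout detail is assumed.
SCHEMA_RICH_PREFIXES = ("/salaries", "/orbyt-intelligence")
PROSE_PREFIXES = ("/blog", "/guides")


def top_k_share(mentions, k=5):
    """Fraction of all named-source mentions captured by the k most-cited sources."""
    if not mentions:
        raise ValueError("mentions must be non-empty")
    counts = Counter(mentions)
    top_total = sum(count for _, count in counts.most_common(k))
    return top_total / len(mentions)


def _under(path, prefixes):
    # Match the prefix itself or anything nested beneath it.
    return any(path == p or path.startswith(p + "/") for p in prefixes)


def route_class(url):
    """Label a cited URL as 'schema-rich', 'prose', or 'other' by its path."""
    path = urlparse(url).path
    if _under(path, SCHEMA_RICH_PREFIXES):
        return "schema-rich"
    if _under(path, PROSE_PREFIXES):
        return "prose"
    return "other"


# Synthetic mentions, for illustration only (not the paper's data).
mentions = ["Glassdoor"] * 4 + ["Levels.fyi"] * 3 + ["BuiltIn"] * 2 + ["Payscale"]
print(top_k_share(mentions, k=2))  # 0.7
print(route_class("https://www.orbytjobs.ai/orbyt-intelligence/dataset"))  # schema-rich
print(route_class("https://www.orbytjobs.ai/blog/example-post"))  # prose
```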

Reproducibility package

Every artifact required to reproduce the evaluation lives at a stable URL with a permissive license. Researchers can re-run the entire pipeline against new vendor configurations, longer corpora, or different models.

GitHub repository
docs/research/agent-native-datasets/
All drafts, eval scripts, per-cell response files, scoring outputs, hypothesis-test runner, and figure-accuracy spot-check.
Zenodo record (paper)
10.5281/zenodo.19754393
PDF + permanent DOI + DataCite metadata. CC BY 4.0.
Zenodo record (dataset)
10.5281/zenodo.19653006
Orbyt Intelligence dataset (2026.1) the paper documents. CC BY 4.0.
Live dataset landing page
/orbyt-intelligence/dataset
Schema.org/Dataset + DataDownload + DataFeed. Google Dataset Search qualifying.
Dataset methodology
/orbyt-intelligence/methodology
How the underlying salary data is sourced, cleaned, and updated.
Author ORCID
0009-0005-2615-3624
Justin Bartak (Purecraft LLC). Linked Wikidata entity Q139551829.
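To make the landing-page entry above concrete, this is roughly what a Schema.org/Dataset record with a DataDownload distribution looks like when serialized as JSON-LD. It is an illustrative sketch, not the markup the live page actually serves: the property selection, the `encodingFormat`, and the `contentUrl` placeholder are assumptions, while the name, coverage, version, DOI, license, and ORCID are taken from this page.

```python
import json

# Illustrative JSON-LD for a Schema.org Dataset; NOT the page's actual markup.
dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Orbyt Intelligence",
    "description": (
        "U.S. compensation dataset covering 3,445 roles "
        "across 81 metropolitan areas."
    ),
    "version": "2026.1",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "identifier": "https://doi.org/10.5281/zenodo.19653006",
    "creator": {
        "@type": "Person",
        "name": "Justin Bartak",
        "identifier": "https://orcid.org/0009-0005-2615-3624",
    },
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",  # assumed format
            "contentUrl": "https://example.org/orbyt-intelligence.csv",  # placeholder URL
        }
    ],
}

# Serialize for embedding in a <script type="application/ld+json"> element.
print(json.dumps(dataset_jsonld, indent=2))
```

An explicit license URL and a persistent identifier are the two fields here that correspond most directly to the license-clarity and citation-affordance principles in the abstract.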

Cite this paper

Permissive (CC BY 4.0) attribution. Both citations resolve to the same Zenodo record; choose whichever your reference manager expects.

Plain text (APA-style)

Bartak, J. (2026). Agent-Native Dataset Design: Schema, Licensing, and Distribution Patterns for LLM Retrieval. Zenodo. https://doi.org/10.5281/zenodo.19754393

BibTeX

@misc{bartak2026agentnative,
  author    = {Bartak, Justin},
  title     = {Agent-Native Dataset Design: Schema, Licensing,
              and Distribution Patterns for {LLM} Retrieval},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19754393},
  url       = {https://doi.org/10.5281/zenodo.19754393}
}

Where this paper lives

The same artifact is mirrored across multiple publication surfaces. The permanent DOI on Zenodo is the canonical citation; this page is the canonical owned URL, with full Schema.org structured data.

Zenodo (primary, DOI): https://doi.org/10.5281/zenodo.19754393 (live)
ORCID Works: https://orcid.org/0009-0005-2615-3624 (live)
Orbyt Research (this page): https://www.orbytjobs.ai/research/agent-native-datasets (live)
GitHub reproducibility package: https://github.com/justinbartak/app/tree/main/docs/research/agent-native-datasets (live)
SSRN: https://papers.ssrn.com (in editorial review)
HuggingFace Papers: deferred to v2.0 (requires arXiv endorsement)

License

Released under Creative Commons Attribution 4.0 International (CC BY 4.0). You may copy, redistribute, remix, transform, and build on the paper for any purpose, including commercially, with attribution.

The companion dataset (Orbyt Intelligence) is also CC BY 4.0. Both share the same license terms: quote, redistribute, and build on them, with attribution to Bartak, J. (2026) for the paper and to Orbyt Intelligence for the dataset.

Related on this site

Dataset (Orbyt Intelligence)
The case-study dataset the paper documents. Full Schema.org/Dataset markup.
Dataset methodology
How the underlying compensation data is sourced and computed.
Data Catalog
All endpoints, fields, and identifiers in the public API.
About the author
Justin Bartak: founder profile, prior work, and contact.

Read the full paper.

The full preprint runs ~32 pages including the §B.6 targeted-corpus appendix with the H1-H4 paired tests, per-baseline breakdown, and figure-accuracy spot-check.

© 2026 Purecraft LLC  All rights reserved.
