Are your published datasets discoverable to AI research assistants?
Confirms data publication pages emit Dataset structured data so research tools can find them.
What this signal tests
We check whether your dataset publication pages emit Dataset structured data. The required fields are a name and a description between 50 and 5000 characters. The valuable optional fields are creator (the organisation or person who produced the data), license (a URL to the licence terms), distribution (download URLs and formats), and identifier (a DOI or other persistent ID).
Why it matters for your visibility in AI
Datasets are core fuel for AI research assistants. ChatGPT, Perplexity, and dedicated research tools like Elicit and Consensus extract candidate datasets from Dataset-marked pages when users ask data-driven questions. Google's Dataset Search, one of the most widely used dataset discovery tools on the web, builds its index from pages that carry this markup. Sites that publish data without it are effectively invisible to that ecosystem.
This is a low-weight signal for most sites because most sites do not publish datasets. For sites that do, including government open-data portals, research institutions, journalism organisations, and companies that publish industry data, the leverage is large: a single Dataset markup block can move a dataset from undiscoverable to indexable by the research tools that read this markup. Adding it costs little; skipping it means near-total invisibility in research-driven AI flows.
Pass criteria at a glance
| Criterion | Passes when |
|---|---|
| Dataset with required fields | The page emits Dataset structured data with a name and a description of 50 to 5000 characters. |
How we test it
We crawl pages that match a dataset publication pattern (a page describing data with a download link or API access) and look for Dataset structured data. We confirm that name is present and that description is between 50 and 5000 characters. We then check for creator, license (as a URL), distribution (as an array of DataDownload objects, each with contentUrl and encodingFormat), and identifier. If a required field is missing, the signal fails.
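The "look for Dataset structured data" step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the scanner's actual code: the names JsonLdExtractor and find_dataset_nodes are hypothetical, and it assumes the markup is published as JSON-LD in script tags (Microdata and RDFa are not handled).

```python
import json
from html.parser import HTMLParser
from typing import Any


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._chunks: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._chunks))
            self._inside = False


def find_dataset_nodes(html: str) -> list[dict[str, Any]]:
    """Return every JSON-LD node whose @type includes "Dataset"."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    found: list[dict[str, Any]] = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of aborting the page
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            # Also look one level into an @graph container, if present.
            graph = item.get("@graph")
            nodes = [item] + (graph if isinstance(graph, list) else [])
            for node in nodes:
                if not isinstance(node, dict):
                    continue
                types = node.get("@type")
                types = types if isinstance(types, list) else [types]
                if "Dataset" in types:
                    found.append(node)
    return found
```

Calling find_dataset_nodes on a page's HTML returns a list of dictionaries, one per Dataset node, which can then be checked field by field.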
Technical detection method
@type Dataset with name + description length in [50, 5000].
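Expressed as code, the rule might look like the sketch below. It assumes the Dataset node has already been parsed into a Python dictionary; the function names (passes_required, missing_recommended) are hypothetical, and accepting a single distribution object as well as an array is an assumption of this example rather than documented scanner behaviour.

```python
from typing import Any
from urllib.parse import urlparse

MIN_DESC, MAX_DESC = 50, 5000  # description length bounds, inclusive


def is_url(value: Any) -> bool:
    """True for strings that parse as absolute http(s) URLs."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def passes_required(node: dict[str, Any]) -> bool:
    """The pass rule: @type Dataset, a name, and a description in range."""
    types = node.get("@type")
    types = types if isinstance(types, list) else [types]
    name = node.get("name")
    desc = node.get("description")
    has_name = isinstance(name, str) and bool(name.strip())
    desc_ok = isinstance(desc, str) and MIN_DESC <= len(desc) <= MAX_DESC
    return "Dataset" in types and has_name and desc_ok


def missing_recommended(node: dict[str, Any]) -> list[str]:
    """List the recommended fields that are absent or malformed."""
    missing = []
    if not node.get("creator"):
        missing.append("creator")
    if not is_url(node.get("license")):
        missing.append("license")
    dist = node.get("distribution")
    # Assumption: treat a single object as a one-item array.
    entries = dist if isinstance(dist, list) else ([dist] if dist else [])
    dist_ok = bool(entries) and all(
        isinstance(d, dict) and d.get("contentUrl") and d.get("encodingFormat")
        for d in entries
    )
    if not dist_ok:
        missing.append("distribution")
    if not node.get("identifier"):
        missing.append("identifier")
    return missing
```

Keeping the required rule separate from the recommended-field report mirrors the distinction above: only the first decides pass or fail.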
If your site fails: how to fix it
- Identify your dataset publication pages. These are pages that describe a specific dataset and offer it for download, API access, or query.
- Add Dataset JSON-LD with name (the dataset title) and a description of 50 to 5000 characters explaining what the data contains, how it was collected, and what it covers.
- Add creator (your organisation, as an Organization object with name and url), license (a URL to the licence terms, such as a Creative Commons URL), and identifier (a DOI if you have one, or a stable internal identifier).
- Add distribution as an array of DataDownload entries, one per format you offer. Each should have contentUrl pointing at the actual downloadable file and encodingFormat as a MIME type (text/csv, application/json, application/pdf).
- Optionally include keywords, temporalCoverage (the time range the data covers), spatialCoverage (the geographic area), and isAccessibleForFree.
- Validate one URL in Google's Rich Results Test and confirm that it detects a Dataset item without errors.
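The steps above can be sketched as a small script that assembles the markup and serialises it into a script tag. Everything specific here is a placeholder assumption: the dataset title, the example.org URLs, the identifier, and the helper name to_script_tag are hypothetical, so substitute your own values.

```python
import json

# Hypothetical dataset; replace every value with your own.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example air quality readings",
    "description": (
        "Hypothetical example: monthly air quality readings collected from "
        "fixed monitoring stations, covering 2020 through 2023."
    ),
    "creator": {
        "@type": "Organization",
        "name": "Example Org",
        "url": "https://example.org",
    },
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "identifier": "example-air-quality-2020-2023",  # use a DOI if you have one
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/data/air-quality.csv",
            "encodingFormat": "text/csv",
        },
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/data/air-quality.json",
            "encodingFormat": "application/json",
        },
    ],
    "temporalCoverage": "2020-01-01/2023-12-31",
    "isAccessibleForFree": True,
}


def to_script_tag(data: dict) -> str:
    """Serialise the markup as a JSON-LD script tag for the page's HTML."""
    # Escape "</" so a stray "</script>" inside a value cannot close the tag.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


script_tag = to_script_tag(dataset)
print(script_tag)
```

Generating the block from one dictionary per dataset keeps the markup in step with the published files whenever a format or URL changes.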
Quick facts
| Field | Value |
|---|---|
| Maturity | ESTABLISHED |
| Weight | low |
| Category | Structured Data |
Primary sources
Related signals
No related signals listed.
Frequently asked questions
What counts as a dataset for this schema?
Schema.org defines Dataset broadly: any collection of data treated as a single unit. This includes CSVs, JSON files, database exports, statistical tables, geospatial data, image collections used for analysis, and API endpoints. It does not include individual articles, blog posts, or product catalogues, which have their own dedicated schemas.
Do I need a DOI to publish a dataset?
No, but DOIs (Digital Object Identifiers) significantly increase dataset discoverability and citability. Services like Zenodo and Figshare assign free DOIs to datasets. If your organisation publishes data regularly, getting DOIs is worth the small effort. Without a DOI, use a stable URL or an internal identifier in the identifier field.
How long should the description be?
Schema.org itself sets no length limit; the 50 to 5000 character range comes from Google's Dataset structured data guidelines, and it is the range this signal checks. Practical descriptions are usually 200-500 characters: enough to explain what the data covers, the time range, the geographic scope, and any major caveats. Descriptions under 50 characters fail; very long ones may be truncated when AI tools present them. Aim for clarity and completeness in a couple of sentences.