PDFs tagged with structure, language, and document title

Linked PDFs need accessibility tagging so AI can extract the text in the right reading order.

What this signal tests

We find PDF files linked from your pages and check four properties of each. Each PDF must be tagged (its catalog has a /StructTreeRoot, or a /MarkInfo dictionary with /Marked true, meaning the document structure is exposed), carry a /Lang entry in its catalog with a valid BCP 47 language code (such as en or de-CH), have a /Title in its /Info dictionary, and set /ViewerPreferences /DisplayDocTitle to true (so viewers show the title instead of the filename). At least 80 percent of sampled PDFs must satisfy all four.

Why it matters for your visibility in AI

Untagged PDFs are walls of positioned text with no structure. Reading order is undefined, tables get merged with paragraphs, and headings look identical to body text. When an AI system extracts text from an untagged PDF, it gets a garbled stream that often interleaves footnotes with body paragraphs and loses table structure entirely. Quotes become unreliable: the AI may attribute a sentence to the wrong section or refuse to cite the document at all.

A properly tagged PDF preserves headings, lists, tables, and images with alt text, so AI extraction returns clean text with its structure intact. For research papers, white papers, regulatory filings, financial reports, and product manuals, this is the difference between being citable and being skipped. /Lang tells an extractor which language the text is in, and /Title makes the document discoverable by name rather than by an inscrutable filename.

Pass criteria at a glance

Criterion	Passes when
PDFs tagged with structure, language, and document title	>=80% of sampled PDFs satisfy all four conditions.
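
As a worked example of the threshold, here is a minimal sketch in Python. The function names are hypothetical, and treating an empty sample as a 0% pass rate is an assumption made for illustration; the scanner may handle that case differently.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of sampled PDFs that satisfied all four conditions."""
    if not results:
        return 0.0  # assumption: an empty sample counts as 0% in this sketch
    return sum(results) / len(results)


def signal_passes(results: list[bool], threshold: float = 0.80) -> bool:
    """True when at least `threshold` of the sampled PDFs passed."""
    return pass_rate(results) >= threshold
```

With ten sampled PDFs, eight passing is enough to clear the bar and seven is not.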

How we test it

We find every .pdf URL linked from sampled pages, fetch each PDF, and parse its trailer and catalog. We check for /MarkInfo with Marked true or a /StructTreeRoot (indicating tagging), /Lang with a valid BCP 47 code, /Title in /Info, and /ViewerPreferences /DisplayDocTitle true. We report the percentage of PDFs passing all four checks.
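
The discovery step could look roughly like the sketch below, which uses only the Python standard library. The class and function names are hypothetical, and matching on a path that ends in .pdf is an assumption about how links are recognised; it is not the scanner's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Look only at the path so query strings and fragments are ignored.
        is_pdf = urlsplit(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def find_pdf_links(html: str, base_url: str) -> list[str]:
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls
```

A link that serves a PDF from a URL without a .pdf extension would be missed by this sketch.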

Technical detection method

Discover .pdf URLs; parse trailer/catalog for /MarkInfo /Marked true OR /StructTreeRoot, /Lang valid BCP 47, /Title in /Info, /ViewerPreferences /DisplayDocTitle true.
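
The four checks can be sketched as follows, assuming the PDF has already been parsed into plain Python dictionaries keyed by PDF names (a hypothetical data model; a real PDF library returns its own object types). The language pattern is a simplified BCP 47 shape check covering language, script, region, and variant subtags. It does not look subtags up in the IANA registry, and it rejects extensions and private-use tags.

```python
import re

# Simplified BCP 47 shape: language[-Script][-REGION][-variant...]
_LANG_TAG = re.compile(
    r"^[A-Za-z]{2,3}"                                # primary language: en, de
    r"(-[A-Za-z]{4})?"                               # optional script: Latn
    r"(-([A-Za-z]{2}|[0-9]{3}))?"                    # optional region: CH, 419
    r"(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*$"   # optional variants
)


def check_pdf(catalog: dict, info: dict) -> dict[str, bool]:
    """Return one boolean per condition for a single parsed PDF."""
    mark_info = catalog.get("/MarkInfo") or {}
    viewer_prefs = catalog.get("/ViewerPreferences") or {}
    lang = catalog.get("/Lang")
    title = info.get("/Title")
    return {
        "tagged": mark_info.get("/Marked") is True or "/StructTreeRoot" in catalog,
        "lang": isinstance(lang, str) and bool(_LANG_TAG.match(lang)),
        "title": isinstance(title, str) and bool(title.strip()),
        "display_doc_title": viewer_prefs.get("/DisplayDocTitle") is True,
    }


def passes_all(catalog: dict, info: dict) -> bool:
    return all(check_pdf(catalog, info).values())
```

Returning one flag per condition, instead of a single boolean, makes it easy to report which of the four a failing PDF is missing.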

If your site fails: how to fix it

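In Adobe Acrobat Pro, run the "Make Accessible" action from the Action Wizard, then run the Accessibility Checker to confirm the result. The action adds tags, sets the language, and prompts for a title if one is missing. A custom Action Wizard action can apply the same steps across a folder.
When exporting from Word, enable "Document structure tags for accessibility" in the PDF export options, and set the document title in the file properties first (File > Info > Properties on Windows, File > Properties on macOS). Google Docs offers no equivalent setting in its PDF download, so check its output with a validator before publishing.
For programmatically generated PDFs (LaTeX, Pandoc, wkhtmltopdf, headless Chrome), use a tagging-capable engine and confirm that tagging is actually switched on. For LaTeX, that means the tagpdf package or the older accessibility package; choosing --pdf-engine=lualatex in Pandoc does not add tags by itself. PDFKit can write explicit structure trees; for other libraries such as pdfmake, check the documentation for tagged-PDF support.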
Set /DisplayDocTitle true in your PDF library's metadata API. In Acrobat it is under File > Properties > Initial View > Show: Document Title. In code, set the ViewerPreferences dictionary entry.
Validate with PAC 2024 (the free PDF Accessibility Checker), or use the WCAG-compatible accessibility report inside Acrobat Pro. Both flag missing tags, language, and title.

Quick facts

Maturity	ESTABLISHED
Weight	medium
Category	Multimodal

Frequently asked questions

Do I really need to tag every PDF on my site?

If a PDF carries content you want AI to ingest (papers, manuals, reports, marketing collateral), yes. For ephemeral PDFs like one-time invoices or auto-generated receipts, no. Focus the effort on PDFs you would want cited, downloaded, or summarised by an AI assistant.

Will tagging change how my PDF looks?

No. Tagging adds structural metadata; the visual rendering is unchanged. The difference shows up only in what screen readers and AI extractors receive when they read the document.

My PDFs are auto-generated from a database. How do I tag those?

Use a tagging-capable PDF library at generation time. ReportLab (Python) supports tagged PDF since 4.0. iText (Java/.NET) supports it in their PDF/UA modules. wkhtmltopdf does not tag well; switch to headless Chrome driven by Puppeteer or Playwright, whose page.pdf() call can produce tagged output. Enable its tagged option explicitly if your version does not default to it (Playwright leaves it off by default).

What's the deal with /DisplayDocTitle? Why does that matter?

Without it, browsers and PDF viewers show the filename in the tab/window title instead of the document title. AI assistants often use the displayed title as the document name when citing. "AnnualReport2025.pdf" is less useful than "Acme Corp Annual Report 2025"; /DisplayDocTitle gets you the latter.
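
The behaviour described above can be modelled in a few lines. This is an illustrative sketch, not how any particular viewer is implemented; the function name and parameters are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlsplit


def displayed_name(url: str, title: Optional[str], display_doc_title: bool) -> str:
    """Name a viewer would show: the title if allowed, else the filename."""
    if display_doc_title and title and title.strip():
        return title.strip()
    # Fall back to the last path segment of the URL, i.e. the filename.
    return PurePosixPath(unquote(urlsplit(url).path)).name
```

A PDF with a /Title but without /DisplayDocTitle still falls through to the filename in this model, which is why the check treats them as two separate conditions.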
