Structured data

Seirdy · 27 February 2022 21:15

I think one of the best places to look for applications of structured data is article extraction tools and services. Trafilatura is the best IMO, but Mozilla’s Readability.js, Chromium’s DOM Distiller, and the Mercury Parser’s generic extractor all use structured data in various forms to locate the article body and associated metadata.

Microformats1 is actually the most widely supported form of structured data for article-like pages, with support across both search engines and article extractors. It makes it easy to grab the article content, author, date published/updated, etc. to help form better snippets. Microdata using schema.org vocabulary is a close second, along with JSON-LD.
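
To illustrate why this markup is easy to consume, here is a minimal sketch that pulls a few classic hAtom properties out of a page using only Python's standard library. The class names (entry-title, entry-content, author, published, updated) come from hAtom; the HAtomParser class, the extract_hatom helper, and the sample page are hypothetical names made up for this example. It reads text content only, does not scope properties to the enclosing hentry, and ignores the abbr-title datetime pattern, so treat it as a starting point rather than a real microformats parser.

```python
from html.parser import HTMLParser

# Class names from the classic hAtom microformat that this sketch looks for.
HATOM_PROPERTIES = {"entry-title", "entry-content", "author", "published", "updated"}
# Void elements never get an end tag, so they must not be pushed on the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class HAtomParser(HTMLParser):
    """Hypothetical helper: collects the text of the first match per property."""

    def __init__(self):
        super().__init__()
        self.properties = {}  # property name -> whitespace-collapsed text
        self._stack = []      # one (tag, [property names]) entry per open element
        self._text = {}       # property name -> text parts still being collected

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Only start collecting a property the first time it is seen.
        found = [c for c in classes
                 if c in HATOM_PROPERTIES
                 and c not in self.properties and c not in self._text]
        for name in found:
            self._text[name] = []
        self._stack.append((tag, found))

    def handle_data(self, data):
        # Text belongs to every property whose element is currently open.
        for parts in self._text.values():
            parts.append(data)

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self._stack):
            return  # stray end tag: ignore it
        # Pop back to the matching start tag, closing unclosed elements on the way.
        while self._stack:
            open_tag, found = self._stack.pop()
            for name in found:
                text = "".join(self._text.pop(name))
                self.properties[name] = " ".join(text.split())
            if open_tag == tag:
                break


def extract_hatom(html):
    parser = HAtomParser()
    parser.feed(html)
    parser.close()
    return parser.properties


SAMPLE_HATOM = """
<article class="hentry">
  <h1 class="entry-title">Example post</h1>
  <p>By <span class="author vcard"><span class="fn">Example Author</span></span>,
     <span class="published">2022-02-27</span></p>
  <div class="entry-content"><img src="cover.png" alt=""><p>First paragraph.</p></div>
</article>
"""

print(extract_hatom(SAMPLE_HATOM))
```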

I agree that schema.org is the most viable vocabulary, whichever of the three semantic-data syntaxes you choose to parse (microdata, JSON-LD, RDFa). The Open Graph vocabulary also seems like a solid bet, as other engines do use it. I’d just be careful not to repeat the mistakes of others when parsing OG markup, as per the ctrl.blog post I linked; its use of a syntax that’s only semi-compliant with a subset of RDFa-Lite can be quite annoying. Dublin Core could be a decent option, as it’s often used for original content in order to play well with software like Zotero. I’ve seen it used in both RDFa and Microdata. A little “Cite” button (like the one in Google Scholar) could be useful, powered by DC metadata.
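
As a rough sketch of the parsing side, the following collects JSON-LD blocks and Open Graph meta tags from a page with Python's standard library. The MetadataParser class, the extract_metadata helper, and the sample page are hypothetical. Accepting og: keys in name= as well as property= is my own assumption about how lenient a parser should be, not something taken from the linked post; microdata and RDFa are not handled here.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Hypothetical helper: gathers JSON-LD objects and og:* meta tags."""

    def __init__(self):
        super().__init__()
        self.json_ld = []      # parsed JSON-LD objects, in document order
        self.open_graph = {}   # e.g. {"og:title": "..."}; first value wins
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        script_type = (attrs.get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta":
            # Assumption: tolerate name= in place of property= for og: keys.
            key = attrs.get("property") or attrs.get("name") or ""
            content = attrs.get("content")
            if key.startswith("og:") and content is not None:
                self.open_graph.setdefault(key, content)

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks instead of failing the whole page
            self.json_ld.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.json_ld, parser.open_graph


SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="Example post">
<meta name="og:type" content="article">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BlogPosting",
 "headline": "Example post", "datePublished": "2022-02-27"}
</script>
</head><body></body></html>
"""

json_ld, open_graph = extract_metadata(SAMPLE_PAGE)
print(json_ld[0]["@type"], open_graph)
```

Keeping the first value for each og: key is a simplification; properties such as og:image can legitimately repeat, so a fuller parser would store lists.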

On the topic of snippets: POSH (plain old semantic HTML) can really go a long way with <abbr> and <dfn>. Searching for abbreviations and definitions on mainstream engines (even without using the “define” operator) seems to reliably pull up snippets using this markup when actual dictionaries come up short.
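
A minimal sketch of how an indexer might harvest this markup, again with only the standard library. DefinitionParser, extract_definitions, and the sample sentence are hypothetical, and using the plain text content of <dfn> as the defined term is a simplification of what the HTML spec calls the defining term.

```python
from html.parser import HTMLParser


class DefinitionParser(HTMLParser):
    """Hypothetical helper: collects <abbr title> expansions and <dfn> terms."""

    def __init__(self):
        super().__init__()
        self.abbreviations = {}  # short form -> expansion taken from title=
        self.defined_terms = []  # text content of each <dfn>, in document order
        self._stack = []         # open <abbr>/<dfn> elements: [tag, title, parts]

    def handle_starttag(self, tag, attrs):
        if tag in ("abbr", "dfn"):
            self._stack.append([tag, dict(attrs).get("title"), []])

    def handle_data(self, data):
        # Text counts toward every open element, so <dfn><abbr>...</abbr></dfn> works.
        for entry in self._stack:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            _, title, parts = self._stack.pop()
            text = " ".join("".join(parts).split())
            if tag == "abbr" and title and text:
                self.abbreviations.setdefault(text, title)
            elif tag == "dfn" and text:
                self.defined_terms.append(text)


def extract_definitions(html):
    parser = DefinitionParser()
    parser.feed(html)
    parser.close()
    return parser.abbreviations, parser.defined_terms


SAMPLE_DEFINITION = (
    '<p><dfn><abbr title="Plain Old Semantic HTML">POSH</abbr></dfn> '
    "is markup that uses elements for their meaning.</p>"
)

abbreviations, defined_terms = extract_definitions(SAMPLE_DEFINITION)
print(abbreviations, defined_terms)
```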