Structured data

What sorts of structured data does Mojeek use, or plan to use?

The main ones are RDFa, Microdata, JSON-LD, Microformats1, Microformats2, the Open Graph Protocol, and Plain Old Semantic HTML (POSH).

I’d be especially excited to see RDFa support as it’s the most advanced one by far: in addition to the common schema.org vocabularies, it can be extended to include arbitrary vocabularies. Vocabularies exist for licensing (Creative Commons), descriptions of software projects (DOAP), and relationships between entities (Friend of a Friend). In fact, the Open Graph Protocol is actually just a sloppily-parsed RDFa vocabulary.

Microdata, the equivalent RDFa-Lite, Open Graph metadata, and JSON-LD are all officially supported across many search engines (inc. the big three, Google/Bing/Yandex). They also generally support legacy Microformats1.

I think that providing some documentation to authors/webmasters on how Mojeek parses pages and uses structured data could be useful, because right now users typically have to look to docs from the established larger players.

More information on adoption of structured data is in the Web Data Commons, using Common Crawl data: Web Data Commons Extraction Report - October 2021 Corpus

Microdata, JSON-LD, and microformats1 (mf1) are the most popular by a wide margin. I personally use Microdata, mf1, and mf2. mf2 is needed for sites to be fully IndieWeb-compatible and work well with Webmention endpoints while mf1 is used by a lot of parsing tools (e.g. most “Reader Mode” implementations).

3 Likes

Thanks @Seirdy for sharing your perspective on structured data. A section on ‘how Mojeek works’ is an excellent suggestion. We don’t write about the technical depths much having limited bandwidth (don’t we all!). Articles of that nature haven’t had a lot of engagement in the past but still. We’ll get back to you on your detailed points.

Thanks @Seirdy, Mojeek backend dev here, cheers for sharing that comprehensive overview of structured data.

We have had internal conversations about including structured data, we currently do not take it into account at the moment.

Open graph and schema.org are the most likely candidates for implementation in the very near future, it’s something we’ve been looking at. I see an undated similarweb article stating that 20% of websites have schema markup.

It’s definitely something to take into consideration, given the clean layout of datapoints that could be acquired. Of course, there’s a certain level of scepticism about what is provided in schema markup and the veracity of the information, but the same goes for anything else on the page.

We’d welcome your thoughts about any specific structured data you think would help in the accuracy of results, perhaps even specific examples where you think it’d help you search. Cheers

2 Likes

I actually think one of the best places to look for structured data applications is article extraction tools/services. Trafilatura is the best IMO, but Mozilla’s Readability.js, Chromium’s DOM Distiller, and the Mercury Parser’s generic extractor all leverage it in various forms to locate the article body and associated metadata.

Microformats1 is actually the most widely supported form of structured data for article-like pages, with support across both search engines and article extractors. It makes it easy to grab the article content, author, date published/updated, etc. to help form better snippets. Microdata using schema.org vocabulary is a close second, along with JSON-LD.

I agree that schema.org vocabulary is the most viable choice of vocabulary, whichever of the three semantic-data syntaxes you choose to parse (microdata, JSON-LD, RDFa). The Open Graph vocabulary also seems like a solid bet, as other engines do use it. I’d just be careful not to repeat the mistakes of others when parsing OG markup, as per the ctrl.blog post I linked; its use of a syntax that’s semi-compliant with a subset of RDFa-Lite can be quite annoying. Dublin Core could be a decent option, as it’s often used for original content in order to play well with software like Zotero. I’ve seen it used in both RDFa and Microdata. A little “Cite” button (like from Google Scholar) could be useful, powered by DC metadata.

On the topic of snippets: POSH can really go a long way with <abbr> and <dfn>. Searching for abbreviations and definitions on mainstream engines (even without using the “define” operator) seems to reliably pull up snippets using this markup when actual dictionaries turn up short.

1 Like

Sorry for the necropost; just wanted to add one more thing: schema.org structured data for a “sitelinks searchbox” could complement Mojeek’s “search choices” initiative by guiding users towards site-specific search. Schema.org structured data and OpenSearch Description files could help Mojeek discover that a site has an embedded search function, and it could display search widgets. Here’s an example from Google:

Guiding users to using “site-specific search” could help weaken the “one engine for everything” mindset that’s holding Mojeek back.

3 Likes