Suggestion: Filtering out Commercial/Corporate sites

One of the complaints about Web 2.0 is how commercial it has become. Both Google and Bing seem to favor commercial sites and big brands.

Suggestion: It might be neat if a Mojeek user could click on a non-default tab and filter out most of the commercial/corporate pages, plus advert-heavy pages, from the SERPs, leaving blogs, forum posts, etc. relevant to the query.

How? With my very limited tech knowledge two ways come to mind.

  1. Filter out web pages that use a lot of modern web-building techniques, like all that JavaScript stuff. Corporate and shopping sites are likely to use those; some person writing a blog is less likely to, and retro HTML sites even less so. (This is not an original idea on my part; I saw it on a small experimental search engine.)

  2. Filter out web pages that are heavy with ad network banners. This would filter out many made-for-AdSense sites.
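The two heuristics above could be sketched roughly like this, assuming the page HTML has already been fetched. This is a minimal illustration using Python's standard-library parser; the ad-network hostnames and the script-count threshold are placeholders, not a real blocklist or a tuned cutoff.

```python
from html.parser import HTMLParser

# Hypothetical ad-network hostnames for illustration only; a real filter
# would need a maintained blocklist (e.g. something EasyList-derived).
AD_HOSTS = ("doubleclick.net", "googlesyndication.com", "adservice.google.com")


class CommercialSignals(HTMLParser):
    """Counts <script> tags and ad-network references in a page."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.ad_refs = 0

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src") or ""
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "iframe") and any(h in src for h in AD_HOSTS):
            self.ad_refs += 1


def looks_commercial(html: str, max_scripts: int = 10) -> bool:
    """Flag a page as commercial-leaning if it is script-heavy (heuristic 1)
    or carries ad-network embeds (heuristic 2). Thresholds are illustrative."""
    p = CommercialSignals()
    p.feed(html)
    return p.script_tags > max_scripts or p.ad_refs > 0
```

A filtered tab could then simply drop any result whose cached HTML trips `looks_commercial`, leaving the script-light, ad-free pages behind.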

Neither of these two methods is perfect, but both move in the direction of less commercial results as an option. This could be really interesting on an index as big as Mojeek’s.

To be clear: the default results in Mojeek would be just as they are, and this would be some sort of option.

My thinking here is “how can Mojeek distinguish itself from Google, Bing and Bing retreads?”

3 Likes

My thinking here is “how can Mojeek distinguish itself from Google, Bing and Bing retreads?”

And it’s very much appreciated, as it’s a question that we think about a lot also. I’ve actually had half of a similar idea before, and it’s definitely a differentiator.

I will see what others think; it’s definitely an interesting way of upweighting/uplifting the voices of people who are creating information for reasons other than commercial ones (I know of at least one microblog I visit often).

2 Likes

I would be more inclined to filter out SEO-optimised sites. To use recipes as an example: if I search for, say, a blueberry muffin recipe, Google will favour the page with 10,000 words about the writer’s grandmother, the history of blueberries, the history of muffins, etc. While I’m scrolling down, I am bombarded with ads. I might mistakenly click on one (jackpot for Google!). Eventually, at the bottom, there is a bad blueberry muffin recipe, written by an SEO person who has never even boiled an egg.

I would prefer a link to a good blueberry recipe. Corporate or not. If Mojeek can solve the “muffin problem” you will succeed.

1 Like

It’s a very good point @stevenally. We are working on ways to give you more control over results. But as it stands, Mojeek is way less susceptible to these SEO “investments”. They are, after all, targeted at playing and/or gaming the Google ranking system. Our ranking algorithms and indexing methods are entirely independent and written from the ground up, so the playing/gaming won’t work here :grinning:

How are we doing on the specific example? muffin recipe - Mojeek Search :cupcake: Google, as you say, is captured, and with answers keeping you on Google: muffin recipe - Google Search. Three SEOed sites dominate there, it seems :cake::cake::cake:

1 Like

Wiby does exactly what you’re asking for, actually. It seems they use an independent index, but it isn’t immediately apparent. It would be neat if Mojeek included a mode that emulates what Wiby does: About Wiby

The difference is that Wiby’s index is likely far smaller than Mojeek’s.

Wiby: In the early days of the web, pages were made primarily by hobbyists, academics, and computer savvy people about subjects they were personally interested in. Later on, the web became saturated with commercial pages that overcrowded everything else. All the personalized websites are hidden among a pile of commercial pages. Google isn’t great at finding them, its focus is on finding answers to technical questions, and it works well; but finding things you didn’t know you wanted to know, which was the real joy of web surfing, no longer happens. In addition, many pages today are created using bloated scripts that add slick cosmetic features in order to mask the lack of content available on them. Those pages contribute to the blandness of today’s web.

1 Like

It’s a good one to emulate for sure. After digging a bit further, I found this HN comment, which links to #RememberWebsites, a short commentary on this very issue; this paragraph in particular is pretty relevant:

Another thing was that there was no dross, because everything had to be written and uploaded by a person. There was no standard format, since there were no real platforms that uniformly stylized anything. There was no sharing or commenting, although later that day you could talk to your friends about the amazing things you read. Most websites were written with html, so they were all unique. They were organic, as everything there had been added because the writer “had to add it” in order to complete the information he had to put out there, and it took a lot of time to make the material and the html page itself, and organize it all so that it worked properly for any guests (in a pretty true sense of the word). And usually the information there was available nowhere else. There were few pictures, mostly words, and almost all of it from the person who created the website himself.

Having the ability to search for information that is more in that retro HTML style alongside the regular SE is a very interesting prospect.

1 Like

Yes, thanks @gnome. Wiby is an interesting index that we know about and which can provide interesting results. On the submit page, Submit to the Wiby Web, they state a preference that “Pages should not use much scripts/css for cosmetic effect.” Mojeek has more of a preference for such pages than the big G and B, and their syndicates, but in our case via automated processes, not human selection bias.

The submission topic is interesting. On the one hand, we’d like to support people who make sensible submission suggestions to us; on the other hand, we know from past experience, when we opened this up, that we would get a lot of low-quality submissions. Call it site submission noise, which would really help no one if we actioned it. We have considered charging a small fee for submissions; not so much for revenue, though that would help, but more to regulate the signal-to-noise ratio. Thoughts welcome.

1 Like

@Seirdy and I use Project Gemini. One way to accomplish this end is to allow searching by gopher:// or gemini://.

2 Likes

Prior art:

  • Marginalia
  • Teclis, by the creator of Kagi
  • Kagi’s “non-commercial” lens, powered mostly by Teclis

Wiby mostly crawls user-submitted pages that pass the listed criteria. Search My Site is entirely powered by user-submitted sites, and indexes all indexable pages within those sites.

I’ve done some brainstorming about good heuristics for finding noncommercial sites, and hope to work with Marginalia and Teclis to find more. Here’s what I’ve got so far:

  • The presence of webrings and blogrolls
  • Parsing microformats2 or a Webmention endpoint (to detect IndieWeb pages)
  • The presence of rel=me links
  • A lack of third-party privacy-hostile advertisements loaded with JavaScript, iframes, and/or embeds
  • A lack of rel=sponsored links
  • Being linked to by other noncommercial websites (Marginalia’s ranking and indexing is informed by Custom Pagerank seeded by sites that are very strongly noncommercial; sites linked to by those sites are likely noncommercial too)
  • Creative Commons licensing (in structured data or meta tags)
  • The “generator” meta tag or other info showing that the site is powered by something like a static site generator, Blogger, WordPress, WriteFreely, Micro.blog, Known, etc. and not e.g. Wix
  • Not having a Google AMP or Yandex Turbo page available
  • Measuring blockage by adblocking implementations (Teclis does this)
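One hedged way to picture how these signals might be combined: extract each as a boolean upstream, then sum per-signal weights into a noncommercial score. The signal names and weights below are purely illustrative assumptions on my part, not anything Mojeek, Marginalia, or Teclis actually uses.

```python
# Illustrative weights for the heuristics listed above; the names and
# values are assumptions, not a tuned or real ranking model.
NONCOMMERCIAL_WEIGHTS = {
    "has_webring_or_blogroll": 2.0,
    "has_microformats2_or_webmention": 2.0,
    "has_rel_me": 1.0,
    "no_third_party_ads": 1.5,
    "no_rel_sponsored": 0.5,
    "linked_from_noncommercial_seed": 3.0,
    "creative_commons_license": 1.0,
    "indie_generator_meta": 1.0,
    "no_amp_or_turbo": 0.5,
    "passes_adblock_check": 1.0,
}


def noncommercial_score(signals: dict) -> float:
    """Sum the weights of every signal detected on a page.

    `signals` maps signal names to booleans, e.g. produced by an
    HTML-parsing stage; unknown or absent signals contribute nothing."""
    return sum(w for name, w in NONCOMMERCIAL_WEIGHTS.items()
               if signals.get(name))
```

A lens or tab could then rank by this score, or threshold it to decide which pages enter a noncommercial index at all.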

The great thing about some of these listed properties is that they likely won’t be abused for SEO. Sites that want to rank well on Google are less likely to reject AMP and more likely to use rel=sponsored where appropriate, since Google could downrank a site if it detects unmarked sponsored content. SEO-spammy sites are unlikely to “fake” their licensing because they actually don’t want others to use their content to drive traffic elsewhere. Et cetera.

Ultimately the best way to build a noncommercial index is to start with some manually-curated websites that seem to be exemplars of the kind of content you want, and to then use a centrality algorithm like Custom Pagerank or harmonic centrality to index based on distance from these and other decidedly “non-commercial” sites.
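That seeded-centrality idea can be sketched as a personalized PageRank in which teleportation jumps only to the hand-curated seed set, rather than uniformly to all pages, so rank flows outward from decidedly non-commercial sites. The code below is a toy power-iteration version for illustration only; production systems handle dangling pages, convergence tests, and scale far more carefully.

```python
def personalized_pagerank(links, seeds, damping=0.85, iters=50):
    """Toy personalized PageRank over a link graph.

    links: dict mapping each site to a list of sites it links to.
    seeds: set of hand-curated exemplar (non-commercial) sites; all
           teleport probability is concentrated on them.
    Returns a dict of site -> rank score."""
    nodes = set(links) | {t for outs in links.values() for t in outs}
    # Start all mass on the seed set; teleport also goes only to seeds.
    rank = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    teleport = dict(rank)
    for _ in range(iters):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for src, outs in links.items():
            if outs:
                share = damping * rank[src] / len(outs)
                for t in outs:
                    nxt[t] += share
            else:
                # Dangling page: return its mass to the seed set.
                for s in seeds:
                    nxt[s] += damping * rank[src] / len(seeds)
        rank = nxt
    return rank
```

With this teleport scheme, a page that no seed (directly or transitively) links to ends up with zero rank, which is exactly the "distance from the curated set" behaviour described above.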

4 Likes

Very interesting links, as usual.

Submission spam: there is no solution; if it can be abused, it will be abused. You have a very able and active crawler, so I’d just rely on that to find new sites. The downside is that discovery by crawling creates a bias towards more popular websites, since pages with more inbound links get found quicker, and there is no longer a DMOZ around to help find obscure but worthy sites.

You could open for submissions maybe only one or two days a year. Like on Mojeek’s birthday. :smile:

Or crawl places like Hacker News, Reddit, and the Indieweb blogs for hyperlinks.

Charging for submissions: I’d say it’s not worth the bother. If you charge, people might accuse you of being a pay-to-play search engine, and it’s not worth the damage to Mojeek’s reputation.

3 Likes