Side-stepping robots.txt

I happened to be reading the FAQs of a page-archiving service and an idea popped into my head: it does not bother with robots.txt because it is "not a free-walking crawler, it saves only one page acting as a direct agent of the human user".

Could this somehow be implemented as an ethical way for Mojeek to circumvent robots.txt, such as with a browser extension that people could use to submit pages?


I kinda like the approach given in the article: honor robots.txt if it specifically forbids Mojeekbot. If it does not, but allows Googlebot, I would go ahead and crawl, but try to hide the identity of the bot, as Neeva does. Otherwise, any new search engine will always be at a disadvantage vs. Google.
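A rough sketch of that policy in Python, using the standard library's `urllib.robotparser`. The robots.txt contents, the URL, and the sentinel-agent trick for telling a wildcard block from a named block are all illustrative assumptions, not Mojeek's actual logic:

```python
from urllib.robotparser import RobotFileParser

URL = "https://example.com/some-page"  # hypothetical page

# Hypothetical robots.txt that singles out Mojeekbot by name.
NAMED_BLOCK = """\
User-agent: Mojeekbot
Disallow: /
"""

# Hypothetical robots.txt with a blanket block that exempts Googlebot.
WILDCARD_BLOCK = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
"""

def parse_robots(text: str) -> RobotFileParser:
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp

def should_crawl(rp: RobotFileParser, url: str) -> bool:
    """Crawl unless Mojeekbot is forbidden by name; if only a blanket
    ('*') rule blocks us but Googlebot is let through, crawl anyway."""
    if rp.can_fetch("Mojeekbot", url):
        return True
    # A user agent that matches no group falls through to the '*' rules,
    # so a made-up sentinel distinguishes a wildcard block from a
    # Mojeekbot-specific one.
    blocked_by_wildcard = not rp.can_fetch("no-such-bot-sentinel", url)
    if blocked_by_wildcard:
        return rp.can_fetch("Googlebot", url)
    return False  # Mojeekbot was singled out: honor the block

print(should_crawl(parse_robots(NAMED_BLOCK), URL))     # False: named block honored
print(should_crawl(parse_robots(WILDCARD_BLOCK), URL))  # True: crawl as Googlebot would
```

Note the ethics of the second branch are exactly what's being debated in this thread: the code crawls a page whose robots.txt blocks everyone except Googlebot.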

And I strongly suspect several other crawling search engines of doing similar.

Good article - this sums it up…

you can’t serve up search results if you can’t crawl the web


The internet should not be allowed to discriminate between search engine crawlers based on who they are

My personal attitude towards this differs from Mojeek's - in my view, if it's publicly available, then it ought to be indexed, period.

robots.txt is a really dumb way of controlling anything - there are better and far more effective ways of controlling access and punishing abusers.

I also get Mojeek's ethics concern, however - they have a reputation to protect - but again, in the end, "you can't serve up search results if you can't crawl the web".

That said, you can't risk getting all your IPs blocked either, hence my idea of having others do it for you when need be - I think it's a potential way to remain ethical in a highly unethical environment.

Blocks from Web Application Firewalls (WAFs), and the webmasters who tune them up to block more, are also a significant challenge - perhaps more so than robots.txt. Some might say that notable robots.txt blocks on us (e.g. Facebook, LinkedIn) are a welcome reduction of noise.