Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives

Not a huge surprise: wholesale scraping with very little return for content creators.

https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/

3 Likes

Companies that start out dodgy, lying, deceiving, and underhanded will not suddenly become respectable once they are established. The cheating is in their DNA. How will they treat their investors and shareholders?

2 Likes

I was just about to post this.

The explicit goal of Perplexity’s service is to loot useful information from website authors and provide the original sources only as footnotes. They’ve always treated website authors’ knowledge as intrinsically valuable but not worthy of respect. This attitude is completely consistent with their business model.

Perplexity’s goal is superficially quite similar to Wikipedia, with the key differences being:

  • No scraping takes place. Human users contribute information from external sources they’ve read and understood[1].
  • No original research is allowed on Wikipedia, either: every word must be verifiable against an external published source[2]. Perplexity, by contrast, synthesises information from its sources (and its training data) and presents novel claims based on what the language model “knows” about a topic in order to answer queries. The goal is answering queries at all costs.

Wikipedia is clearly a more popular way to learn information than Perplexity AI. Somewhat depressingly, Reddit is more popular than Wikipedia, and ChatGPT is more popular than both. But ChatGPT manages to provide a vastly more popular service than Perplexity while (supposedly) respecting robots.txt; it even has its own search mode. Perplexity’s desperation seems particularly malicious in this context.

I am a heavy user of Wikipedia. It’s my #1 most visited site on the internet (maybe tied with Mojeek). I don’t think a better compromise between accuracy/verifiability/convenience exists than Wikipedia at the moment.


  1. Hopefully. ↩︎

  2. This is a goal to aspire to, not a reality. Wikipedia has its problems, too. ↩︎

4 Likes

From an unofficial code to live by to a web awash with modern robber barons. Sad times. I’ve seen a lot of people looking for new and novel ways to fight back, but not much in the way of solid solutions with weight behind them.

This is interesting, for example: The Open-Source Software Saving the Internet From AI Bot Scrapers

1 Like

Indeed; here is one decent compilation of efforts and suggestions.

Perplexity responded saying “An AI assistant works just like a human assistant”, which perhaps expresses their latest goal. Anyway, here’s a link to the HN discussion on their response, which they posted on X, of course.

1 Like

I have nothing of value to contribute to this discussion, but I will say that I’ve been reading a lot of Hacker News threads recently, and I don’t know why I do it. It just makes me mad.

You have a roughly equal split of very technical people on both sides of the fence with regard to AI. Few say it’s completely useless, but a lot of users seem to look down on people who avoid AI services or don’t think fondly of them, as if those people are only holding themselves back out of a twisted sense of pride. On the other hand, people who don’t think fondly of AI services or big tech companies seem to see these “AI-positive” people as completely insane for willingly becoming serfs, dependent on VC-funded companies to build and maintain their own programs.

I think both of these strawmen have a point, but I know which side I’m more sympathetic to.

Anyway, I stopped reading the discussion after encountering this comment:

I use Perplexity regularly for research because it does a good job accessing, preprocessing and citing relevant resources. Which do you think is better: the service respects my desire for it to do a good job and ignore site owners blocking agent access because “don’t like automated agents”, or the service respects said site owners’ - what I consider unreasonable - desires and not do a good job for me?

3 Likes

I didn’t review their code comprehensively, but I guess they’re doing much the same as Google, and basically any detection tends to involve some JavaScript. That at least gets rid of curl/wget-style clients, and they check whether you have cookies from previous visits.

e.g. Puppeteer is a popular way of automating browsers, but the navigator variables it exposes via JS will report en-US as the user’s preferred language regardless of what they actually declare with an Accept-Language HTTP header. Google switched to only honouring requests from JS-enabled clients.
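A minimal server-side sketch of that mismatch check (the endpoint and function names here are hypothetical, not from any real detection system): the server compares the Accept-Language HTTP header against the language that client-side JS reported back, e.g. `navigator.language` posted to a beacon endpoint. Since stock Puppeteer tends to report "en-US" regardless of the headers the operator sets, a disagreement is a weak bot signal.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return language tags from an Accept-Language header, best first."""
    tags = []
    for part in header.split(","):
        # Drop quality values like ";q=0.9" and normalise case.
        piece = part.strip().split(";")[0].strip().lower()
        if piece:
            tags.append(piece)
    return tags


def language_mismatch(accept_language: str, js_reported: str) -> bool:
    """True when the JS-visible language disagrees with every header tag.

    Only the primary subtag is compared ("en-GB" vs "en-US" is not a
    mismatch), since that is the coarse signal described above.
    """
    reported = js_reported.strip().lower()
    return all(
        not reported.startswith(tag.split("-")[0])
        for tag in parse_accept_language(accept_language)
    )
```

On its own this proves nothing, of course; it would only ever be one weak signal folded into a larger score.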

Proof of work is a nice idea, but given the UA mimicry you either discriminate against users on weaker hardware, who would hang for seconds, or scrapers simply mimic the fingerprints of that weaker hardware while running on better hardware. Equality aside, it does increase costs for a scraper, but for the likes of search engines, whose content is unique and built on a lot of backend work, it’s still worth scraping them.

2 Likes

Anubis actually has a non-javascript challenge as of June: Making sure you're not a bot!

I was able to access this post without Javascript, as I browse the web with it disabled by default.

As an aside, the <meta refresh> tag is a cool way to design HTML + CSS only interfaces for visitors without Javascript. If you can’t do progressive enhancement on a page, then you can redirect no-JS users to a non-JS page by wrapping your <meta http-equiv="refresh"> tag in <noscript> tags without affecting user agents that enable and support Javascript.
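A sketch of that fallback (the `/no-js/page.html` path is a hypothetical example): because the redirect sits inside `<noscript>`, JS-capable browsers never act on it, while everyone else is sent to the plain page.

```html
<!-- Inside <head>. JS-enabled user agents ignore this entirely;
     no-JS visitors are redirected immediately (delay of 0 seconds). -->
<noscript>
  <meta http-equiv="refresh" content="0; url=/no-js/page.html">
</noscript>
```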

4 Likes

Unfortunately scrapers with intent will go with the non-JS option if it’s offered. HTTP headers by themselves are not much of a blocker.

Some CDNs have gone to the extreme of testing which TLS ciphers a client offers and what information is exchanged via HTTP/2. E.g. you can visit a page served by Fastly in your browser, but open your console, ‘copy as cURL’, and the request is straight up blocked, because curl doesn’t present the same fingerprints.
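A hypothetical sketch of that CDN-side idea: real systems like JA3/JA4 hash fields from the TLS ClientHello and compare against fingerprints seen from mainstream browsers. The allow-list value and field choices below are made up for illustration, not real fingerprints, and the concatenation format only loosely imitates JA3.

```python
import hashlib

# Allow-list of fingerprint hashes observed from mainstream browsers.
# This value is invented for the example.
KNOWN_BROWSER_FINGERPRINTS = {"3b5074b1b5d032e5620f69f9f700ff0e"}


def ja3_like_hash(tls_version: str, ciphers: list[int],
                  extensions: list[int]) -> str:
    """Concatenate ClientHello fields JA3-style and hash them."""
    raw = ",".join([
        tls_version,
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
    ])
    return hashlib.md5(raw.encode()).hexdigest()


def looks_like_browser(fingerprint: str) -> bool:
    """Block anything whose handshake doesn't match a known browser."""
    return fingerprint in KNOWN_BROWSER_FINGERPRINTS
```

This is why ‘copy as cURL’ fails: curl’s ClientHello hashes to a different value than the browser’s, even though the HTTP headers are copied byte-for-byte.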

Given that people are using ‘undeclared crawlers’, it could clearly be called circumvention.

There have been stories of Google being interested in ‘interesting data’, like CelebrityNetWorth, where Google asked for API access, was refused, and scraped the site anyway.

1 Like

Yeah, it seems even the Javascript challenge isn’t working for Codeberg anymore: Codeberg beset by AI bots that now bypass Anubis defense • The Register

Ironically, they had blocked these bots via IP address ranges already but not for Anubis pages, which is what caused the issue:

We have a list of explicitly blocked IP ranges. However, a configuration oversight on our part only blocked these ranges on the “normal” routes. The “anubis-protected” routes didn’t consider the challenge. It was not a problem while Anubis also protected from the crawlers on the other routes.

I personally don’t get why you couldn’t orchestrate a scraper to enumerate all of the git clone URLs once and then automate cloning later, instead of hitting individual files. But I guess these companies are doing dumb stuff like hitting Wikipedia pages over and over instead of just downloading the archives, because they can. I suppose the discussions in issue trackers are valuable data to these companies as well.
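The enumerate-then-clone approach would look something like this (the host and repo URL are hypothetical examples): build one shallow `git clone` per repository instead of issuing a web request per file, which is both cheaper for the scraper and far gentler on the forge.

```python
import subprocess
from pathlib import Path


def clone_commands(repos: list[str], dest: Path) -> list[list[str]]:
    """Build a shallow-clone command for each repository URL."""
    commands = []
    for url in repos:
        # Derive a directory name from the last path segment.
        name = url.rstrip("/").split("/")[-1].removesuffix(".git")
        commands.append(
            ["git", "clone", "--depth", "1", url, str(dest / name)]
        )
    return commands


def bulk_clone(repos: list[str], dest: Path) -> None:
    """One git request per repo, instead of one HTTP request per file."""
    dest.mkdir(parents=True, exist_ok=True)
    for cmd in clone_commands(repos, dest):
        subprocess.run(cmd, check=True)
```

`--depth 1` skips the full history; dropping it would fetch everything, still in a single transfer per repository.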

Of course Google would.