Not a huge surprise: wholesale scraping with very little return for content creators.
Companies that start out dodgy (lying, deceiving, underhanded) will not suddenly become respectable once they are established. The cheating is in their DNA. How will they treat their investors and shareholders?
I was just about to post this.
The explicit goal of Perplexity’s service is to loot useful information from website authors and provide the original sources only as footnotes. They’ve always treated website authors’ knowledge as intrinsically valuable but not worthy of respect. This attitude is completely consistent with their business model.
Perplexity’s goal is superficially quite similar to Wikipedia’s, with the key differences being:
- No scraping takes place. Human users contribute information from external sources they’ve read and understood[1].
- In fact, no original research is allowed on Wikipedia; everything needs a citation from an external source. Every word must be verifiable by an external published source[2]. Perplexity synthesises information from its sources (and training data) and provides novel information based on what the language model “knows” about a topic in order to answer queries. The goal is learning at all costs.
Wikipedia is clearly a more popular method for learning information than Perplexity AI. Though somewhat depressingly, Reddit is more popular than Wikipedia, and ChatGPT is more popular than both of them. But ChatGPT manages to provide a vastly more popular service than Perplexity while (supposedly) respecting robots.txt. They have their own search mode. Perplexity’s desperation seems particularly malicious in this context.
I am a heavy user of Wikipedia. It’s my #1 most visited site on the internet (maybe tied with Mojeek). I don’t think a better compromise between accuracy/verifiability/convenience exists than Wikipedia at the moment.
[1] Hopefully. ↩︎
[2] This is a goal to aspire to, not a reality. Wikipedia has its problems, too. ↩︎
From an unofficial code to live by to a web awash with modern robber barons. Sad times. I’ve seen a lot of people looking for new and novel ways to fight back, but not much in the way of solid solutions with weight behind them.
This is interesting, for example: The Open-Source Software Saving the Internet From AI Bot Scrapers
Indeed; here is one decent compilation of efforts and suggestions.
Perplexity responded saying “An AI assistant works just like a human assistant”, which perhaps expresses their latest goal. Anyway, here’s a link to the HN discussion on their response, which they posted on X, of course.
I have nothing of value to contribute to this discussion, but I will say that I’ve been reading a lot of Hacker News threads recently, and I don’t know why I do it. It just makes me mad.
There’s a roughly even split of very technical people on both sides of the fence with regard to AI. Few say it’s completely useless, but a lot of users seem to look down on people who don’t use AI services, or don’t think fondly of them, as only holding themselves back out of a twisted sense of pride. On the other hand, people who don’t think fondly of AI services or big tech companies seem to see these “AI-positive” people as completely insane for willingly becoming serfs, dependent on VC-funded companies to build and maintain their own programs.
I think both of these strawmen have a point, but I know which side I’m more sympathetic to.
Anyway, I stopped reading the discussion after encountering this comment:
I use Perplexity regularly for research because it does a good job accessing, preprocessing and citing relevant resources. Which do you think is better: the service respects my desire for it to do a good job and ignore site owners blocking agent access because “don’t like automated agents”, or the service respects said site owners’ - what I consider unreasonable - desires and not do a good job for me?
I didn’t review their code comprehensively, but I guess they’re doing much the same as Google, and basically any detection tends to involve some Javascript. That at least gets rid of curl/wget-style clients, and they’re checking whether you have cookies from previous visits.
E.g. Puppeteer is a popular browser-automation tool, but its navigator variables available via JS will indicate en-US as the user’s preferred language, versus what they actually declare with an Accept-Language HTTP header. Google switched to only honouring requests from JS-enabled clients.
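Roughly what that cross-check could look like, as a minimal sketch (not Google’s actual implementation; it assumes a small client-side probe posts `navigator.language` back to the server, and `languageLooksInconsistent` is a made-up name):

```typescript
// Compare the language the JS environment claims against the Accept-Language
// header the same visitor sent. Headless defaults (e.g. Puppeteer) often
// report "en-US" via navigator.language even when the HTTP request advertised
// something else entirely.
export function languageLooksInconsistent(
  acceptLanguage: string | undefined, // Accept-Language header, as seen server-side
  navigatorLanguage: string,          // navigator.language reported by the client probe
): boolean {
  if (!acceptLanguage) return true; // a real browser virtually always sends this header
  // "de-DE,de;q=0.9" -> "de-de"
  const primary = acceptLanguage.split(",")[0].split(";")[0].trim().toLowerCase();
  // Compare only the language part ("de" vs "en"), ignoring the region.
  return primary.split("-")[0] !== navigatorLanguage.toLowerCase().split("-")[0];
}

// Header says German, JS says en-US -> suspicious.
console.log(languageLooksInconsistent("de-DE,de;q=0.9", "en-US")); // true
console.log(languageLooksInconsistent("en-US,en;q=0.8", "en-US")); // false
```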
Proof of work is a nice idea, but given the UA mimicry you could end up discriminating against poorer hardware, which would hang for seconds, or scrapers could mimic the fingerprints of the poorer hardware with their better hardware. Equality aside, it does increase costs for a scraper, but for the likes of search engines, with such unique content built on a lot of backend work, it’s still worth scraping them.
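For a sense of the mechanics, here is a toy hash-based proof-of-work challenge (a sketch only; Anubis and similar schemes differ in their details, and `solve`/`sha256Hex` are names made up for illustration):

```typescript
import { createHash, randomBytes } from "node:crypto";

function sha256Hex(data: string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Client side: brute-force a nonce so that sha256(challenge + nonce) starts
// with `difficulty` zero hex digits. The expected work grows ~16x per extra
// digit, which is exactly why slower hardware can hang for seconds.
function solve(challenge: string, difficulty: number): number {
  const prefix = "0".repeat(difficulty);
  for (let nonce = 0; ; nonce++) {
    if (sha256Hex(challenge + nonce).startsWith(prefix)) return nonce;
  }
}

// Server side: issuing and verifying costs one hash; solving costs many.
const challenge = randomBytes(16).toString("hex");
const difficulty = 4;
const nonce = solve(challenge, difficulty);
console.log(sha256Hex(challenge + nonce).startsWith("0".repeat(difficulty))); // true
```

The asymmetry (cheap to verify, expensive to solve) is the whole point, but as noted above it taxes legitimate low-end devices and well-funded scrapers alike.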
Anubis actually has a non-javascript challenge as of June: Making sure you're not a bot!
I was able to access this post without Javascript, as I browse the web with it disabled by default.
As an aside, the <meta refresh> tag is a cool way to design HTML + CSS only interfaces for visitors without Javascript. If you can’t do progressive enhancement on a page, then you can redirect no-JS users to a non-JS page by wrapping your <meta http-equiv="refresh"> tag in <noscript> tags without affecting user agents that enable and support Javascript.
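Something like this, for instance (a sketch; the /no-js/ path is just a placeholder):

```html
<!-- In the <head>: browsers with Javascript ignore the <noscript> contents,
     so only no-JS visitors get redirected to the plain HTML + CSS version. -->
<noscript>
  <meta http-equiv="refresh" content="0; url=/no-js/index.html">
</noscript>
```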
Unfortunately, scrapers with intent will go with the non-JS option if it’s offered. HTTP headers by themselves are not much of a blocker.
Some CDNs have gone to the extreme of testing which TLS ciphers a client offers and what information is exchanged via HTTP/2. E.g. you can visit a page served by Fastly in your browser, but open your console and “Copy as cURL” it, and the request is straight up blocked, because curl doesn’t present the same fingerprints.
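The cross-check itself can be fairly simple once the edge has computed a fingerprint for the connection; a sketch of the idea (the hash values and `fingerprintMatchesUserAgent` are placeholders, not real JA3-style fingerprints or any CDN’s API):

```typescript
// Map known browser TLS-fingerprint hashes to the User-Agent families they
// should appear with. A connection whose fingerprint isn't in the table but
// which claims to be Chrome (e.g. a copied-as-cURL request) gets rejected.
const knownBrowserFingerprints: Record<string, RegExp> = {
  "placeholder-chrome-hash": /Chrome\/\d+/,
  "placeholder-firefox-hash": /Firefox\/\d+/,
};

function fingerprintMatchesUserAgent(fingerprint: string, userAgent: string): boolean {
  const expected = knownBrowserFingerprints[fingerprint];
  return expected ? expected.test(userAgent) : false;
}

console.log(fingerprintMatchesUserAgent("placeholder-chrome-hash", "Mozilla/5.0 Chrome/126.0")); // true
console.log(fingerprintMatchesUserAgent("curl-like-hash", "Mozilla/5.0 Chrome/126.0"));          // false -> block
```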
Given that people are using “undeclared crawlers”, it could clearly be called circumventing.
There have been stories of Google being interested in “interesting data”, like celebritynetworth, where Google asked for API access, were refused, but scraped it anyway.
Yeah, it seems even the Javascript challenge isn’t working for Codeberg anymore: Codeberg beset by AI bots that now bypass Anubis defense • The Register
Ironically, they had blocked these bots via IP address ranges already but not for Anubis pages, which is what caused the issue:
We have a list of explicitly blocked IP ranges. However, a configuration oversight on our part only blocked these ranges on the “normal” routes. The “anubis-protected” routes didn’t consider the challenge. It was not a problem while Anubis also protected from the crawlers on the other routes.
I personally don’t get why you couldn’t orchestrate a scraper to take down all of the git URLs and then automate cloning later, instead of hitting individual files, but I guess these companies are doing dumb stuff like hitting Wikipedia pages over and over instead of just downloading the archives, because they can. I suppose the discussions in issue trackers are valuable data to these companies as well.
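The gentler approach described there is not much code; a rough sketch (assuming a Gitea/Forgejo-style `/api/v1/repos/search` endpoint and made-up function names, purely for illustration):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Page through the forge's repository listing and collect clone URLs.
async function listCloneUrls(base: string, page: number): Promise<string[]> {
  const res = await fetch(`${base}/api/v1/repos/search?limit=50&page=${page}`);
  const body = await res.json();
  return (body.data ?? []).map((repo: { clone_url: string }) => repo.clone_url);
}

// One shallow clone per repository replaces thousands of per-file page hits.
async function mirror(base: string): Promise<void> {
  for (let page = 1; ; page++) {
    const urls = await listCloneUrls(base, page);
    if (urls.length === 0) break;
    for (const url of urls) {
      await run("git", ["clone", "--depth=1", url]);
    }
  }
}

mirror("https://example-forge.org").catch(console.error);
```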
Of course Google would.