Reddit will block the Internet Archive

I figured it was going to happen at some point…

The company says that AI companies have scraped data from the Wayback Machine, so it’s going to limit what the Wayback Machine can access.

2 Likes

Indeed, but maybe not a bad thing. Which organisations and people actually find value in searching the history of Reddit ramblings? IA can use its budget to crawl and index more useful information.

First line of the article triggered me: “Reddit says that it has caught AI companies scraping its data.”

The T&Cs, of course, set it up this way [0]. Imagine me claiming that the photos my friends send me, and which I collect, are my data. Anyway, it’s obviously not just Reddit playing this game.

And obviously we do not have such terms for the little bit of anonymised data we keep in order to help in improving the service.


[0] https://redditinc.com/policies/user-agreement - extract as follows:
You retain any ownership rights you have in Your Content, but you grant Reddit the following license to use that Content:

When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. For example, this license includes the right to use Your Content to train AI and machine learning models, as further described in our Public Content Policy. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.

From my experience with Reddit, I think humanity will survive without the Internet Archive archiving everything that’s said on Reddit.

2 Likes

The problem is that some interest groups only exist on Reddit, such as r/knives (pre-API pricing change) or r/MechanicalKeyboards. That information just doesn’t really exist anywhere else because of SEO and balkanization.

I think what is at stake is the openness of the internet. If it is ok for Reddit to block the Internet Archive today, then that will be the precedent for something we don’t agree with tomorrow.

And, for example, if you disagreed with the API pricing changes, it might be that you got (or only could get) your information from Internet Archive or Pushshift. That ends with this precedent.

2 Likes

That’s a wider issue. IA is an enormously useful archive but it’s not the internet.

I thought I’d check their latest strapline, which used to be the widely known and wishful “the front page of the internet”. But that has changed a few times since, and before the IPO. The latest is really quite something: “the heart of the internet”.

It’s sad that, a few years ago, independent forums were largely subsumed by “subreddits”. The culture is so homogeneous. Independent forums are weird and fun! And I say that as someone whose only social media platform of choice for years was Reddit (YouTube being a distant second).

But at least Reddit is searchable by humans. And Old Reddit still has a Tor .onion service[1], which I tend to use on occasion for text-form posts.

I’ve found that, for the topics I search at least, the Reddit discussion is low quality. There’s usually a nugget or two in a thread, but it’s not my first port-of-call. A dedicated blog post usually contains the highest quality information. But there are still cases where Reddit is the only site that contains any information on a topic.

I’ve heard Reddit wants to build their own search engine. I suppose it can’t be any worse than their current site search.


[1] https://old.reddittorjg6rue252oqsxryoxengawnmo46qy4kyii5wtqnwfj4ooad.onion

1 Like

Not quite - from the article:

To continue to build out search, Reddit is “expanding Reddit Answers globally, integrating it more deeply into the core search experience, and making search a central feature across Reddit,” Huffman says.

So moving towards more of an answer engine, for content on Reddit.

2 Likes

I gave Reddit Answers a try, just for the sake of satiating my curiosity.

While it is definitely more of an answer engine than a search engine, it is very link-heavy. And it always has a section at the bottom linking you to particular subreddits.

The reason is obvious; this is only using conversations taking place on Reddit, so Reddit has no reason to de-emphasise links and wants to give you many ideas for where to go next to stay on Reddit.


You appear to get a maximum of 30 questions per week.

In terms of quality, it’s fine, I guess.

3 Likes