Crawling Twitter

mike · 14 July 2023 11:53

Since the beginning of July, Twitter has required users to log in to see profiles. For example, it is no longer possible to see local weather tweets without logging in.

My question is, how has this impacted Mojeek’s web crawler? And if there is a problem, are there any options to restore access or performance?

It seems that, like Facebook and TikTok, Twitter has now cut off a massive part of the Internet from public view, creating another proprietary silo and further fracturing the open Web.

Twitter users get locked out after Elon Musk imposes daily limits on reading tweets | PBS NewsHour

Josh · 17 July 2023 08:57

Looking at this very quickly, it seems like we’re still alright; pages that were crawled today are coming up:

https://www.mojeek.com/search?q=site%3Atwitter.com+status+since%3Aday

It’s possible they are allowing bots, but it would require a bit of a deeper look to ascertain what’s going on.
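
For anyone who wants to repeat this check with a different query, here is a minimal sketch that rebuilds the search URL above. The function name `build_search_url` is made up for illustration, and the `site:` and `since:day` operators are simply copied from the URL in this post.

```python
from urllib.parse import urlencode

# Base search endpoint, taken from the URL quoted in this thread.
MOJEEK_SEARCH = "https://www.mojeek.com/search"


def build_search_url(query: str) -> str:
    """Return a Mojeek search URL for the given query string.

    urlencode percent-encodes ':' as %3A and turns spaces into '+',
    which matches the form of the URL posted above.
    """
    return MOJEEK_SEARCH + "?" + urlencode({"q": query})


# Query copied from the thread: recently crawled Twitter status pages.
print(build_search_url("site:twitter.com status since:day"))
```

Changing the query text (for example, to a specific profile name) should give a comparable URL, though what the index returns is of course not something this sketch can predict.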

mike · 17 July 2023 14:28

Those might be deeplinks coming from sources outside Twitter. And those tweets probably predate July 2023.

Check if you can crawl profiles like https://twitter.com/stripestatus or if there are any tweets in the index from July.

Josh · 18 July 2023 08:15

Looking into it further, we are able to crawl: there is no issue with the page that you suggested, and I’m able to pick up tweets from July 2023.

mike · 18 July 2023 19:03

You’re right. When you change the user agent to a bot, the page loads.

Network conditions: Override the user agent string | Chrome Developers
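
The same test can be scripted instead of done in the browser. This is only a rough sketch using Python's standard library: `BOT_USER_AGENT` is a placeholder string I made up (the thread does not say which bot user agents Twitter accepts), and `build_request` and `fetch_status` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Placeholder value, an assumption for illustration only. Substitute the
# user agent string of the crawler you actually want to imitate.
BOT_USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot)"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return the HTTP status code.

    HTTP error responses (4xx/5xx) are returned as their status code
    rather than raised, so they can be compared across user agents.
    """
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Example call (performs a real network request, so it is left commented out):
# print(fetch_status("https://twitter.com/stripestatus", BOT_USER_AGENT))
```

Running `fetch_status` once with a browser user agent and once with a bot one would show whether the server treats them differently, but a status code alone may not tell you whether the tweets themselves were included in the response.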

Videonas · 24 July 2023 13:16

I noticed that changing the user agent to a bot isn’t a perfect solution. After 3 profiles, accessing a 4th profile results in the message “This page is not available”, so you are limited in the number of profiles you can see this way. A hashtag ‘overview’ page doesn’t load any tweets (even though the page itself loads), and replies to individual tweets don’t load either.
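
A rough way to measure that limit is to walk through a list of profile URLs and count how many load before the message shows up. The sketch below is illustrative only: `count_loaded_before_block` is a hypothetical name, the fetch function is passed in by the caller, and it assumes the message text appears in the fetched page body, which may not hold if the page is rendered by JavaScript.

```python
from typing import Callable, Iterable

# Message quoted in the post above. Assumption: it appears verbatim
# in the fetched page body.
BLOCK_MARKER = "This page is not available"


def count_loaded_before_block(
    urls: Iterable[str], fetch: Callable[[str], str]
) -> int:
    """Return how many pages loaded before the block message appeared.

    `fetch` takes a URL and returns the page body as text. It is
    injected so the counting logic can be tried without network access.
    """
    loaded = 0
    for url in urls:
        body = fetch(url)
        if BLOCK_MARKER in body:
            break  # Stop at the first blocked page.
        loaded += 1
    return loaded
```

With a real fetch function this would report 3 if the behaviour described above holds, though the limit may depend on the IP address, timing, or other factors not covered in this thread.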

Elon Musk was going to limit bots on Twitter. He wanted to decrease the number of bot accounts and prevent Twitter content from being used to train LLMs. I guess that search engine crawlers are also affected by those measures.