8 Sources
[1]
Reddit blocks Internet Archive to end sneaky AI scraping
Reddit is now blocking the Internet Archive (IA) from indexing popular Reddit threads after allegedly catching sneaky AI firms -- restricted from scraping Reddit -- instead simply scraping data from IA's archived content. Where before IA's Wayback Machine dependably archived Reddit pages, profiles, and comments -- as part of its mission to archive the Internet -- moving forward, only screenshots of the Reddit homepage will be archived. As The Verge noted, this means the archive will only be useful as a snapshot of popular posts and news headlines each day, rather than providing a backup documenting deleted posts or a window into various Reddit subcultures or any given user's activity. Reddit has not confirmed which AI firms were scraping its data from the Wayback Machine. The company's spokesperson, Tim Rathschmidt, would only confirm to Ars that Reddit has become "aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine." Rathschmidt suggested there may be steps that IA could take to better defend against the AI scraping of archived Reddit content. That could perhaps lead Reddit to lift the restrictions on its scraping, which The Verge reported will be ramping up across Reddit starting today. But Reddit also is taking this time to address other apparently longstanding privacy concerns, adding that restrictions are appropriate since the Wayback Machine problematically archives content that users have deleted.
[2]
Reddit will block the Internet Archive
Reddit says that it has caught AI companies scraping its data from the Internet Archive's Wayback Machine, so it's going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means IA will only be able to archive insights into which news headlines and posts were most popular on a given day. "Internet Archive provides a service to the open web, but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine," spokesperson Tim Rathschmidt tells The Verge. The Internet Archive's mission is to keep a digital archive of websites on the internet and "other cultural artifacts," and the Wayback Machine is a tool you can use to look at pages as they appeared on certain dates, but Reddit believes not all of its content should be archived that way. "Until they're able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we're limiting some of their access to Reddit data to protect redditors," Rathschmidt says. The limits will start "ramping up" today, and Reddit says it reached out to the Internet Archive "in advance" to "inform them of the limits before they go into effect," according to Rathschmidt. He says Reddit has also "raised concerns" about the ability of people to scrape content from the Internet Archive in the past. Reddit has a recent history of cutting off access to scraper tools as AI companies have begun to use (and abuse) them en masse, but it's willing to provide that data if companies pay. Early last year, Reddit struck a deal with Google for both Google Search and AI training data, and a few months later, it started blocking major search engines from crawling its data unless they pay.
It also said its infamous API changes from 2023, which forced some third-party apps to shut down, leading to protests, were because those APIs were abused to train AI models. Reddit also struck an AI deal with OpenAI, but it sued Anthropic in June, claiming Anthropic was still scraping from Reddit even after Anthropic said it wasn't scraping anymore. The Internet Archive didn't immediately respond to a request for comment.
[3]
Reddit is restricting its availability to the Internet Archive's Wayback Machine
The Internet Archive's Wayback Machine is the latest victim of Reddit's crackdown on data access. The company has begun to place new restrictions on what the archive site will be able to access in a move that will significantly limit the Wayback Machine's ability to preserve information from Reddit. With the change, the Wayback Machine, a project run by the nonprofit Internet Archive, will only be able to crawl Reddit's homepage. It will no longer be able to access comments, subreddit pages, post details, profiles and other data. The move is the latest step Reddit has taken on its quest to limit AI companies' ability to use its data to train large language models without paying licensing fees. It's also a notably different stance than the company took last year, when it explicitly said that it would not limit "good faith actors," including the Internet Archive. It's not clear what exactly has changed since then. Reddit seems to believe that AI companies are circumventing its rules by scraping data via the Wayback Machine. We've reached out to the Internet Archive for comment. Data licensing has become a significant business for Reddit. The company has struck multimillion-dollar deals with OpenAI and Google that allow them to use Reddit posts to help train their AI models. At the same time, Reddit has taken an increasingly hardline stance against companies that attempt to use its data without such arrangements. Earlier this year, the company sued Anthropic, alleging it scraped Reddit for years without permission.
[4]
Reddit Is Blocking the Wayback Machine From Archiving Posts
Reddit is limiting the Wayback Machine from indexing most of its site over concerns of unauthorized AI scraping. Reddit is blocking the Internet Archive's Wayback Machine from indexing most of its site, after discovering that AI companies were scraping its data from the digital time capsule. The move comes as Reddit tightens its grip on user data. The company doesn't mind AI firms training their models on Reddit posts, but they have to pay first. Reddit previously said it wouldn't restrict "good faith actors" like the Internet Archive, but now it believes some are helping AI firms dodge licensing fees. Reddit's sudden change of stance highlights how data licensing has become a major revenue source in the AI era. The Internet Archive is a nonprofit organization dedicated to building a vast digital library of websites and other online content. So far, it has archived billions of web pages, along with millions of books, videos, and software programs. Its signature tool, the Wayback Machine, lets users save snapshots of webpages and revisit them later to see exactly how they looked on a specific date. Reddit says it has evidence that some AI companies are exploiting the Wayback Machine to bypass its policies and scrape user content without permission. "Internet Archive provides a service to the open web, but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine," a Reddit spokesperson told Gizmodo in an emailed statement. "Until they're able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we're limiting some of their access to Reddit data to protect redditors." Reddit told The Verge that the Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles. Instead, it will only be allowed to index Reddit's homepage.
The restrictions begin "ramping up" today, and Reddit says it gave the Internet Archive a heads-up beforehand. The Internet Archive did not immediately respond to a request for comment from Gizmodo. Reddit has been tightening control over access to its data in recent years. While the company is open to licensing its data, it's cracking down on companies that haven't paid up. The company has already struck multimillion-dollar deals with Google and OpenAI. In the Google deal, Reddit partnered with Google for both search indexing and AI training data, then began blocking other search engines from surfacing recent Reddit posts in their search results.
[5]
The internet is about to get a little worse as Reddit moves to block the Internet Archive so AI companies can't scrape its content
Google and OpenAI can scrape Reddit's content, but they paid for it. The internet, which was once a useful thing, is about to become a little less so: A new report from The Verge says Reddit is going to start blocking the Wayback Machine from indexing most of its content. The Wayback Machine, part of the Internet Archive, takes "snapshots" of websites as they exist at various points through their history -- even if those websites don't exist anymore. Want to know what the old BioWare forums looked like before they were closed in 2016? Wayback Machine's got you. It's also incredibly handy for tracking things like Steam page changes and answering questions like, "Hey, did the CIA ever run a Star Wars fan site?" (And yes, it did.) The Internet Archive's ability to do this is dependent on crawling and indexing websites, and that's what Reddit is going to block: In future, the Wayback Machine will only be able to index the reddit.com homepage, meaning individual subreddits and posts will be out of reach -- effectively rendering it useless. Reddit spokesperson Tim Rathschmidt said the block is being imposed because "we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine." The report says limits on the Wayback Machine's ability to scrape Reddit will start "ramping up" today. Rathschmidt said Reddit had been in touch with the Internet Archive in advance, to "inform them of the limits before they go into effect." I'm generally all for anything that makes life more difficult for AI companies, but I can't really hand it to Reddit in this case because the principle in question here appears to be, well, not principle, but money: Reddit made a deal with Google in 2024 to make its content available for AI training. Another deal with OpenAI followed a few months later. Reddit's thing isn't so much about preventing the abuses of AI training, then, as it is charging top dollar for the privilege. 
In that light, this really sucks: The Internet Archive is a non-profit organization, and the Wayback Machine -- in sharp contrast to AI-powered chatbots -- is genuinely useful, even vital given how quickly working links turn into dead ones. The Internet Archive provides a valuable service, accurately and without unprompted racist slurs. Cutting the Wayback crawler off from Reddit, a massive trove of information on just about every subject imaginable, is a loss for us all.
[6]
It's About to Get Harder to Read Old Reddit Threads, and You Can Blame AI
Reddit and the Internet Archive are still in talks about the decision. With more and more AI showing up in Google searches as of late, I've been leaning extra hard on that one magic word that makes the internet work: Reddit. It's got its problems, but appending "Reddit" to a search is still the surest bet I have of getting an honest opinion from a real person, which is more than I can say for some other platforms. Unfortunately, it seems like the "Reddit" trick is about to get a lot less useful, and once again, you can blame AI for it. The problem with any live forum is that information comes and goes as people delete old posts and new updates break older parts of the site. There used to be a way to get around this, but going forward, that loophole's getting closed. Yes, Reddit is about to start blocking the Internet Archive. The site, run by a nonprofit dedicated to preserving the open internet, is host to the Wayback Machine, a popular way to browse internet pages that are no longer active, or have changed significantly since they first went up. Simply enter a URL in the Machine's search box, and you'll be able to browse captures of what that page used to look like, sometimes going as far back as the 1990s. It's a useful way to see how a site has changed, or access information that's supposed to be long gone. In Reddit's case, you could use it to look at, say, a hotel review that's since been deleted. Sure, you might feel a bit awkward about reading a post that's been purposefully taken down, but because deleting all your threads when leaving the service is a common practice, the Wayback Machine is a great way to preserve useful content well into the future, and keep classic memes from becoming lost media. 
Unfortunately, while Reddit says it's not against the Wayback Machine in general, it's about to stop the Internet Archive from indexing anything but the Reddit homepage, which means the only archives it'll be able to keep going forward will be lists of what was popular on Reddit on a certain day. Individual subreddits and posts will be blocked. That's not totally useless, say if you're an internet researcher, but it will make all future Reddit threads way more temporary in nature, and will definitely hurt casual web searches down the line. If I review a hotel now, and then delete my thread, users in a month or two won't be able to easily see it. On the bright side, existing archives shouldn't be affected by this block, at least unless Reddit asks the Internet Archive to take down existing captures. But as time passes, the lack of Reddit archives is only going to become a bigger issue. So why is this happening? Basically, Reddit doesn't like AI companies scraping content from its site, at least without paying for it first. "Internet Archive provides a service to the open web," Reddit spokesperson Tim Rathschmidt told the Verge, "but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine." Essentially, Reddit wants to tightly control which AI companies it works with (it's sued over this before), and has blocked most of them from crawling its site. However, with some then turning to scraping Reddit pages captured by the Internet Archive instead, the company is now going to crack down on those captures as well. Basically, we're paying the price for a few bad apples. Rathschmidt told The Verge that limits on the Internet Archive will start "ramping up" today, although he wasn't entirely clear about how. I've reached out to Reddit for details, but for now, I did double check, and I'm still able to access archives that already exist, so at least Reddit hasn't gone nuclear yet. 
As for any future posts, all might not be lost. The Verge also spoke to Wayback Machine director Mark Graham, who said that the Internet Archive has a "longstanding relationship with Reddit," and that there are "ongoing discussions about this matter."
[7]
Reddit says its blocking the Internet Archive to stop sneaky AI scrapers accessing its content - SiliconANGLE
Reddit Inc. said today it has decided to block the Internet Archive from indexing its popular web forums in order to prevent sneaky artificial intelligence firms from scraping its content for training purposes. Reddit reportedly found evidence that AI companies were scraping its content via the Internet Archive's platform, after it restricted them from doing so using its official website. The decision means that the organization's popular Wayback Machine service will no longer be able to archive Reddit pages, threads, profiles or comments - nothing, except for what's shown on its homepage. According to a report in The Verge, the archive will, going forward, only be able to show what posts and news headlines were popular on any given day. Previously, Wayback Machine was able to archive every single page, documenting everything that was posted onto the "front page of the internet," as Reddit proclaims itself to be. Reddit did not say which AI companies were using the Wayback Machine to get around its prohibitions on them scraping its content. A spokesperson for the company told The Verge that it has "become aware of instances where AI companies violate platform policies... and scrape data from the Wayback Machine." The company seems to think that the Internet Archive should be taking steps to prevent this scraping, so there's hope that the decision won't be a permanent one. However, the report also highlights a concern by Reddit that Wayback Machine has a tendency to archive users' posts and comments that are later deleted, saying that this is problematic for user privacy. "Until they're able to defend their site and comply with platform policies, we're limiting some of their access to Reddit data to protect redditors," the company said. Although Reddit raises the issue of user privacy, it's likely that its primary motivation for blocking the scrapers is money.
AI companies are expressly prohibited from crawling its website, unless they're willing to pay to access that data. Several companies have taken Reddit up on that offer, notably Google LLC and OpenAI. Reddit has never revealed how much its deal with OpenAI is worth, but the agreement with Google is reportedly worth around $60 million. Reddit has also stated previously that it hopes to generate as much as $200 million from such licensing agreements over the next three years. One company that doesn't seem prepared to pay up is Anthropic PBC. In June, Reddit filed a lawsuit against it, saying it was continuing to scrape its content even after it claimed it was no longer doing so. The Internet Archive isn't the first organization to be blocked by Reddit over scraping concerns. In June 2024, the social media firm said it had blocked Microsoft Corp.'s Bing and smaller search engines, such as DuckDuckGo, Mojeek and Qwant, in order to prevent its content being scraped through their archives. It's not immediately clear if the Internet Archive will try and take steps to prevent its archives from being scraped so it can get Reddit's restrictions lifted. In a statement, Wayback Machine Director Mark Graham said his team is engaged in "ongoing discussions about this matter."
[8]
Reddit locks out Wayback machine to stop AI from scraping old posts
Reddit has restricted the Internet Archive's Wayback Machine from extensively capturing its content due to concerns over unauthorized AI data scraping. The platform will now allow only the homepage to be archived, aiming to protect user privacy and control content use. This move highlights the challenges of balancing digital preservation and data security in today's AI-driven world. Reddit has announced that it will restrict the Internet Archive's Wayback Machine to archiving only its homepage, blocking the tool from saving most of its site's content. This change comes as a direct response to increasing concerns about AI companies scraping Reddit data through the Wayback Machine, possibly risking Reddit's content policies and violating user privacy. According to Reddit spokesperson Tim Rathschmidt, the company has seen cases where artificial intelligence firms accessed Reddit's content via the Wayback Machine without adhering to Reddit's terms of service. This includes scraping of posts, comments, and even deleted or removed content. Such unauthorized activities challenge Reddit's ability to manage and protect its content. Rathschmidt emphasized that until the Internet Archive can guarantee compliance with Reddit's policies, this restriction will stay in place to safeguard users' privacy and preserve the integrity of removed content. The Wayback Machine is a widely used tool operated by the Internet Archive, designed to preserve snapshots of websites over time. This archival service enables users to view historical versions of web pages, which is useful for research, fact-checking, and maintaining internet history. With Reddit's new limitation, the Wayback Machine will no longer archive specific Reddit pages like posts or user profiles, only the homepage. This significantly reduces the breadth and depth of Reddit's content saved by the archive, restricting public access to old discussions and deleted data through this service. 
This restriction is part of Reddit's broader effort to control how its data is accessed and used, especially by AI companies. Recently, Reddit has taken many steps to protect its content, including modifying its application programming interfaces (APIs) to limit data scraping, negotiating paid data licenses with firms like Google and OpenAI, and pursuing legal action against companies such as Anthropic for unauthorized data collection. Reddit's goal is to balance user privacy, platform safety, and its business interests by carefully regulating which third parties can access its vast content. Mark Graham, director of the Wayback Machine, confirmed ongoing discussions with Reddit about this issue but no formal announcement has been made. The Internet Archive community and users who rely on its archiving service await further updates to understand the long-term implications for internet preservation. This move by Reddit highlights the complex challenge of protecting user privacy while preserving internet content at the same time, especially as AI technologies rely on large datasets gathered from the web. Q1. What is Reddit? A1. Reddit is an online community where users share posts, comments, and discussions on various topics. Q2. What is the Wayback Machine? A2. The Wayback Machine is a tool that archives and lets people view past versions of websites.
Reddit has begun blocking the Internet Archive's Wayback Machine from indexing most of its content, citing concerns over AI companies scraping data without permission. This move has significant implications for digital preservation and raises questions about data access in the AI era.
In a significant move that has sent ripples through the digital landscape, Reddit has begun blocking the Internet Archive's Wayback Machine from indexing the majority of its content. This decision comes in response to allegations that AI companies were circumventing Reddit's data access restrictions by scraping information from archived pages [1].
The restrictions, which started ramping up recently, will severely limit the Wayback Machine's ability to preserve Reddit's vast trove of information. Moving forward, the Internet Archive will only be able to index Reddit's homepage, effectively reducing its archival capacity to daily snapshots of popular posts and news headlines [2].
Reddit spokesperson Tim Rathschmidt explained the company's position, stating, "Internet Archive provides a service to the open web, but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine" [3]. The company claims this move is necessary to protect user privacy and enforce its platform policies.
This decision marks a significant shift from Reddit's previous stance, where it had explicitly stated it would not limit "good faith actors" like the Internet Archive. The change highlights the growing tension between data preservation and the commercial interests of platforms in the AI era [4].
Reddit's blockade on the Internet Archive is part of a broader strategy to control access to its data. The company has struck multimillion-dollar deals with AI giants like Google and OpenAI, allowing them to use Reddit posts for training their AI models. This move underscores how data licensing has become a significant revenue stream for social media platforms [5].
The decision has sparked concerns among digital preservationists and open internet advocates. Critics argue that this move could have far-reaching consequences for the accessibility of online information and the ability to track changes on one of the internet's most popular platforms. The Internet Archive, a non-profit organization, plays a crucial role in maintaining a historical record of the web, and this limitation could significantly impact its mission [5].
This incident is part of a larger trend of platforms tightening control over their data in response to the growing demand for training data in AI development. It raises important questions about the balance between protecting user data, preserving digital history, and the commercial interests of tech companies in the age of AI [3].
As the situation continues to unfold, it remains to be seen how this will impact the broader ecosystem of web archiving and the future of digital preservation. The incident serves as a stark reminder of the complex challenges facing the open internet in an era increasingly dominated by AI and data-driven technologies.