15 Sources
[1]
Reddit blocks Internet Archive to end sneaky AI scraping
Reddit is now blocking the Internet Archive (IA) from indexing popular Reddit threads after allegedly catching sneaky AI firms -- restricted from scraping Reddit -- instead simply scraping data from IA's archived content. Where before IA's Wayback Machine dependably archived Reddit pages, profiles, and comments -- as part of its mission to archive the Internet -- moving forward, only screenshots of the Reddit homepage will be archived. As The Verge noted, this means the archive will only be useful as a snapshot of popular posts and news headlines each day, rather than providing a backup documenting deleted posts or a window into various Reddit subcultures or any given user's activity. Reddit has not confirmed which AI firms were scraping its data from the Wayback Machine. The company's spokesperson, Tim Rathschmidt, would only confirm to Ars that Reddit has become "aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine." Rathschmidt suggested there may be steps that IA could take to better defend against the AI scraping of archived Reddit content. That could perhaps lead Reddit to lift the restrictions on its scraping, which The Verge reported will be ramping up across Reddit starting today. But Reddit also is taking this time to address other apparently longstanding privacy concerns, adding that restrictions are appropriate since the Wayback Machine problematically archives content that users have deleted.
[2]
Reddit will block the Internet Archive
Reddit says that it has caught AI companies scraping its data from the Internet Archive's Wayback Machine, so it's going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means IA will only be able to archive insights into which news headlines and posts were most popular on a given day. "Internet Archive provides a service to the open web, but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine," spokesperson Tim Rathschmidt tells The Verge. The Internet Archive's mission is to keep a digital archive of websites on the internet and "other cultural artifacts," and the Wayback Machine is a tool you can use to look at pages as they appeared on certain dates, but Reddit believes not all of its content should be archived that way. "Until they're able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we're limiting some of their access to Reddit data to protect redditors," Rathschmidt says. The limits will start "ramping up" today, and Reddit says it reached out to the Internet Archive "in advance" to "inform them of the limits before they go into effect," according to Rathschmidt. He says Reddit has also "raised concerns" about the ability of people to scrape content from the Internet Archive in the past. Reddit has a recent history of cutting off access to scraper tools as AI companies have begun to use (and abuse) them en masse, but it's willing to provide that data if companies pay. Early last year, Reddit struck a deal with Google covering both Google Search and AI training data, and a few months later, it started blocking major search engines from crawling its data unless they pay.
It also said its infamous API changes from 2023, which forced some third-party apps to shut down, leading to protests, were because those APIs were abused to train AI models. Reddit also struck an AI deal with OpenAI, but it sued Anthropic in June, claiming Anthropic was still scraping from Reddit even after Anthropic said it wasn't scraping anymore. The Internet Archive didn't immediately respond to a request for comment.
[3]
Reddit blocks the Internet Archive from crawling its data - here's why
Publishers (and others) are suing AI companies for copyright infringement. Reddit is defending its privacy from AI companies that are taking roundabout approaches to scraping its content. The social media platform, known as a resource where users can post anonymously and find information about virtually any subject, will block the Internet Archive's Wayback Machine from indexing its online data, according to a Monday report from The Verge. The move is in response to the discovery that AI firms, unable to scrape data from Reddit directly due to the platform's prohibitive policies, have instead been retrieving its data from indexed content on the Internet Archive and using it to train models. The Wayback Machine will now only be able to scrape data from Reddit's homepage, according to The Verge, while access to user profiles, comments, and post detail pages will be blocked. Launched in 1996, the Internet Archive is a non-profit that operates an enormous digital database of web content. The archive is maintained in part by the Wayback Machine, a piece of web-crawling software that gathers web pages and preserves them as they appeared when they were collected, like digital flies in amber. This serves as a resource for researchers studying the evolution of online culture and digital forensic evidence for law enforcement, among other uses. Reddit has previously flagged concerns related to the scraping of its content with the Internet Archive, according to The Verge. The non-profit was also reportedly notified before the web-crawling restrictions started going into effect yesterday. The Internet Archive has yet to make an official statement about how it plans to respond to Reddit's new restrictions, and at the time of writing, it has not responded to ZDNET's request for comment. Wayback Machine director Mark Graham, however, has told multiple publications that the Internet Archive will "continue to have ongoing discussions about this matter" with Reddit. 
Reddit's reported decision to block Wayback Machine from scraping the majority of its content arrives during a moment of mounting tension between AI companies and digital publishers, though Reddit is the first tech company to wade into the debate. The company sued Anthropic in June after discovering that the AI company was illegally scraping its data, but it has also previously signed licensing deals with both Google and OpenAI. (Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.) AI developers require access to gargantuan troves of information to train generative AI models, which are designed to identify and replicate subtle mathematical patterns gleaned from those training datasets. Many of those companies have scraped training data from publicly available websites, including social media sites and news outlets, claiming legal immunity under a concept known in copyright law as fair use. (The courts are still untangling the legitimacy of that argument, and will likely be doing so for some time.) Many of the organizations whose content has been copiously scraped -- along with a cohort of authors and other artists -- have responded with lawsuits. Others, meanwhile, have signed content licensing agreements with the likes of OpenAI, Anthropic, and Google, consenting to the use of their organizations' data in exchange for increased visibility in the responses generated by chatbots, or other benefits.
[4]
Reddit Is Blocking Internet Archive to Halt Free Scraping of User Data
Reddit is limiting access to the Internet Archive after finding out that AI companies have used its Wayback Machine to scrape user data for free, The Verge reports. The Internet Archive is a nonprofit digital library that preserves webpages and other content to provide "universal access to all knowledge." Reddit's public content policy doesn't block such good-faith actors from using its data for non-commercial purposes. However, the platform was recently "made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine," a company spokesperson tells The Verge. Reddit didn't name the AI companies using this trick, but said it is adding some restrictions to stop the Wayback Machine from being the enabler. Starting this week, the digital library will no longer be allowed to crawl post detail pages, user comments, and profiles. It will only be able to archive Reddit's homepage, thereby restricting visitor access to just the top posts from a particular day. Reddit has already informed the Internet Archive of these restrictions and will keep them "until they're able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content)," the spokesperson adds. In recent times, Reddit has made it abundantly clear that it doesn't mind AI companies scraping its user data -- as long as they are paying for it. The platform licenses its data to Google for $60 million a year, and has a similar deal with OpenAI as well. On the other hand, it recently sued Anthropic over claims that its bots accessed the platform over 100,000 times without permission. The folks at the Internet Archive, meanwhile, remain hopeful of an amicable solution. "We have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter," Mark Graham, director of the Wayback Machine, told The Verge.
[5]
Reddit is restricting its availability to the Internet Archive's Wayback Machine
The Internet Archive's Wayback Machine is the latest victim of Reddit's crackdown on data access. The company has begun to place new restrictions on what the archive site will be able to access in a move that will significantly limit the Wayback Machine's ability to preserve information from Reddit. With the change, the Wayback Machine, a project run by the nonprofit Internet Archive, will only be able to crawl Reddit's homepage. It will no longer be able to access comments, subreddit pages, post details, profiles and other data. The move is the latest step Reddit has taken on its quest to limit AI companies' ability to use its data to train large language models without paying licensing fees. It's also a notably different stance than the company took last year, when it explicitly said that it would not limit "good faith actors," including the Internet Archive. It's not clear what exactly has changed since then. Reddit seems to believe that AI companies are circumventing its rules by scraping data via the Wayback Machine. We've reached out to the Internet Archive for comment. Data licensing has become a significant business for Reddit. The company has struck multimillion-dollar deals with OpenAI and Google that allow them to use Reddit posts to help train their AI models. At the same time, Reddit has taken an increasingly hardline stance against companies that attempt to use its data without such arrangements. Earlier this year, the company sued Anthropic, alleging it scraped Reddit for years without permission.
[6]
Reddit Is Blocking the Wayback Machine From Archiving Posts
Reddit is limiting the Wayback Machine from indexing most of its site over concerns of unauthorized AI scraping. Reddit is blocking the Internet Archive’s Wayback Machine from indexing most of its site, after discovering that AI companies were scraping its data from the digital time capsule. The move comes as Reddit tightens its grip on user data. The company doesn’t mind AI firms training their models on Reddit posts, but they have to pay first. Reddit previously said it wouldn’t restrict “good faith actors” like the Internet Archive, but now it believes some are helping AI firms dodge licensing fees. Reddit’s sudden change of stance highlights how data licensing has become a major revenue source in the AI era. The Internet Archive is a nonprofit organization dedicated to building a vast digital library of websites and other online content. So far, it has archived billions of web pages, along with millions of books, videos, and software programs. Its signature tool, the Wayback Machine, lets users save snapshots of webpages and revisit them later to see exactly how they looked on a specific date. Reddit says it has evidence that some AI companies are exploiting the Wayback Machine to bypass its policies and scrape user content without permission. “Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” a Reddit spokesperson told Gizmodo in an emailed statement. “Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors.” Reddit told The Verge that the Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles. Instead, it will only be allowed to index Reddit’s homepage.
The restrictions begin “ramping up” today, and Reddit says it gave the Internet Archive a heads-up beforehand. The Internet Archive did not immediately respond to a request for comment from Gizmodo. Reddit has been tightening control over access to its data in recent years. While the company is open to licensing its data, it’s cracking down on companies that haven’t paid up. The company has already struck multimillion-dollar deals with Google and OpenAI. In the Google deal, Reddit partnered with Google for both search indexing and AI training data, then began blocking other search engines from surfacing recent Reddit posts in their search results.
[7]
Reddit blocks non-profit Wayback Machine from archiving the site - 9to5Mac
The Internet Archive's Wayback Machine is one of the most valuable free services available on the web, ensuring that important sources of information are protected from the vicissitudes of fate and tech companies. Until recently, the archive was able to capture the entirety of Reddit, but that is no longer the case following new restrictions implemented by the for-profit community discussion platform ... The archive has been in operation since 1996. We began in 1996 by archiving the Internet itself, a medium that was just beginning to grow in use. Like newspapers, the content published on the web was ephemeral - but unlike newspapers, no one was saving it. Today we have 28+ years of web history accessible through the Wayback Machine and we work with 1,200+ library and other partners through our Archive-It program to identify important web pages. To date, it has archived 835 billion web pages, alongside books, audio recordings, photos, videos, and apps. It is used by millions of people a day, from researchers and historians to the general public. Engadget reports that Reddit is almost completely blocking the Wayback Machine from crawling content on the platform. The company has begun to place new restrictions on what the archive site will be able to access in a move that will significantly limit the Wayback Machine's ability to preserve information from Reddit. With the change, the Wayback Machine, a project run by the nonprofit Internet Archive, will only be able to crawl Reddit's homepage. It will no longer be able to access comments, subreddit pages, post details, profiles and other data. This is despite the fact that Reddit said last year that it would not block good faith actors, specifically including the Internet Archive within this. Along with our updated robots.txt file, we will continue rate-limiting and/or blocking unknown bots and crawlers from accessing reddit.com. This update shouldn't impact the vast majority of folks who use and enjoy Reddit.
Good faith actors - like researchers and organizations such as the Internet Archive - will continue to have access to Reddit content for non-commercial use. The restrictions are the latest in a growing move by Reddit to sell access to user content while blocking free access to it. The focus on monetization was driven by the company's IPO. Google pays Reddit more than $60 million a year to access user content to help train its AI models, and a similar deal was struck with OpenAI. Following the conclusion of the Google deal, Reddit started blocking all other search engines. It's been speculated that some AI companies may have been indirectly scraping content from Reddit via the Wayback Machine, and that this may have driven the new restrictions.
[8]
Reddit is blocking Wayback Machine from archiving users' posts
Reddit will reportedly block the Internet Archive's Wayback Machine from saving users' posts. The social media platform states that the measure is intended to stop AI companies from scraping archived comments to train their algorithms. Or at least, prevent them from doing so without paying up. As reported by The Verge, Reddit is preventing the Wayback Machine from archiving users' post detail pages, comments, and profiles. The Reddit homepage is still fair game, meaning that the titles of the top posts each day will still be preserved, but anything beyond that will no longer be indexed in the Internet Archive's digital library. Reddit framed the decision as an effort to protect its users, stating that AI companies were violating its policies by scraping data from the Wayback Machine. "Until [the Internet Archive is] able to defend their site and comply with platform policies (e.g., respecting user privacy re. deleting removed content) we're limiting some of their access to Reddit data to protect redditors," Reddit spokesperson Tim Rathschmidt told The Verge. Yet despite such assertions, Reddit has demonstrated that it's happy to hand over users' data to AI companies provided that they pay up. In 2024, Reddit barred search engines such as Microsoft Bing and DuckDuckGo from crawling its platform. However, a $60 million deal between Reddit and Google enabled the tech giant to continue training its AI algorithms on redditors' data, as well as surface their posts in Search. Reddit made a similar $60 million deal with ChatGPT creator OpenAI as well. "Without these agreements, we don't have any say or knowledge of how our data is displayed and what it's used for, which has put us in a position now of blocking folks who haven't been willing to come to terms with how we'd like our data to be used or not used," Reddit CEO Steve Huffman told The Verge last August. 
Ironically, Reddit users themselves have little say in how the company uses their public posts, as it doesn't allow them to opt out of having such data sold or used to train AI algorithms. The only remedy for redditors to prevent such use is to simply stop posting to the platform altogether, though that still doesn't address posts they've previously made. Though concern for users' privacy may be a factor, Reddit's decision to block the Wayback Machine appears to be more obviously motivated by money. While AI companies were apparently scraping Reddit posts for free, cutting off such access will enable the social media platform to instead licence such data for a significant fee. "The Reddit corpus of data is really valuable," Huffman told the New York Times in 2023. "But we don't need to give all of that value to some of the largest companies in the world for free." Reddit has been fighting to reduce its financial losses in recent years, resulting in widely unpopular changes such as charging developers for access to its application programming interface (API), removing the ability to opt out of ad personalisation, and the planned introduction of paid subreddits. Unfortunately, there's still a long way to go before Reddit claws itself out of the red. The self-professed "heart of the internet" reported a whopping net loss of $484.3 million last year -- more than five times its $90.8 million net loss in 2023.
[9]
The internet is about to get a little worse as Reddit moves to block the Internet Archive so AI companies can't scrape its content
Google and OpenAI can scrape Reddit's content, but they paid for it. The internet, which was once a useful thing, is about to become a little less so: A new report from The Verge says Reddit is going to start blocking the Wayback Machine from indexing most of its content. The Wayback Machine, part of the Internet Archive, takes "snapshots" of websites as they exist at various points through their history -- even if those websites don't exist anymore. Want to know what the old BioWare forums looked like before they were closed in 2016? Wayback Machine's got you. It's also incredibly handy for tracking things like Steam page changes and answering questions like, "Hey, did the CIA ever run a Star Wars fan site?" (And yes, it did.) The Internet Archive's ability to do this is dependent on crawling and indexing websites, and that's what Reddit is going to block: In future, the Wayback Machine will only be able to index the reddit.com homepage, meaning individual subreddits and posts will be out of reach -- effectively rendering it useless. Reddit spokesperson Tim Rathschmidt said the block is being imposed because "we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine." The report says limits on the Wayback Machine's ability to scrape Reddit will start "ramping up" today. Rathschmidt said Reddit had been in touch with the Internet Archive in advance, to "inform them of the limits before they go into effect." I'm generally all for anything that makes life more difficult for AI companies, but I can't really hand it to Reddit in this case because the principle in question here appears to be, well, not principle, but money: Reddit made a deal with Google in 2024 to make its content available for AI training. Another deal with OpenAI followed a few months later. Reddit's thing isn't so much about preventing the abuses of AI training, then, as it is charging top dollar for the privilege. 
In that light, this really sucks: The Internet Archive is a non-profit organization, and the Wayback Machine -- in sharp contrast to AI-powered chatbots -- is genuinely useful, even vital given how quickly working links turn into dead ones. The Internet Archive provides a valuable service, accurately and without unprompted racist slurs. Cutting the Wayback crawler off from Reddit, a massive trove of information on just about every subject imaginable, is a loss for us all.
[10]
It's About to Get Harder to Read Old Reddit Threads, and You Can Blame AI
Reddit and the Internet Archive are still in talks about the decision. With more and more AI showing up in Google searches as of late, I've been leaning extra hard on that one magic word that makes the internet work: Reddit. It's got its problems, but appending "Reddit" to a search is still the surest bet I have of getting an honest opinion from a real person, which is more than I can say for some other platforms. Unfortunately, it seems like the "Reddit" trick is about to get a lot less useful, and once again, you can blame AI for it. The problem with any live forum is that information comes and goes as people delete old posts and new updates break older parts of the site. There used to be a way to get around this, but going forward, that loophole's getting closed. Yes, Reddit is about to start blocking the Internet Archive. The site, run by a nonprofit dedicated to preserving the open internet, is host to the Wayback Machine, a popular way to browse internet pages that are no longer active, or have changed significantly since they first went up. Simply enter a URL in the Machine's search box, and you'll be able to browse captures of what that page used to look like, sometimes going as far back as the 1990s. It's a useful way to see how a site has changed, or access information that's supposed to be long gone. In Reddit's case, you could use it to look at, say, a hotel review that's since been deleted. Sure, you might feel a bit awkward about reading a post that's been purposefully taken down, but because deleting all your threads when leaving the service is a common practice, the Wayback Machine is a great way to preserve useful content well into the future, and keep classic memes from becoming lost media. 
Unfortunately, while Reddit says it's not against the Wayback Machine in general, it's about to stop the Internet Archive from indexing anything but the Reddit homepage, which means the only archives it'll be able to keep going forward will be lists of what was popular on Reddit on a certain day. Individual subreddits and posts will be blocked. That's not totally useless, say if you're an internet researcher, but it will make all future Reddit threads way more temporary in nature, and will definitely hurt casual web searches down the line. If I review a hotel now, and then delete my thread, users in a month or two won't be able to easily see it. On the bright side, existing archives shouldn't be affected by this block, at least unless Reddit asks the Internet Archive to take down existing captures. But as time passes, the lack of Reddit archives is only going to become a bigger issue. So why is this happening? Basically, Reddit doesn't like AI companies scraping content from its site, at least without paying for it first. "Internet Archive provides a service to the open web," Reddit spokesperson Tim Rathschmidt told the Verge, "but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine." Essentially, Reddit wants to tightly control which AI companies it works with (it's sued over this before), and has blocked most of them from crawling its site. However, with some then turning to scraping Reddit pages captured by the Internet Archive instead, the company is now going to crack down on those captures as well. Basically, we're paying the price for a few bad apples. Rathschmidt told The Verge that limits on the Internet Archive will start "ramping up" today, although he wasn't entirely clear about how. I've reached out to Reddit for details, but for now, I did double check, and I'm still able to access archives that already exist, so at least Reddit hasn't gone nuclear yet. 
As for any future posts, all might not be lost. The Verge also spoke to Wayback Machine director Mark Graham, who said that the Internet Archive has a "longstanding relationship with Reddit," and that there are "ongoing discussions about this matter."
[11]
AI data wars push Reddit to block the Wayback Machine
As the battle to train artificial intelligence models becomes more intense and Reddit's rich content library becomes more valuable, the social media giant has taken steps to block the Internet Archive from indexing its pages. While the Wayback Machine has historically recorded all Reddit pages, comments and user profiles, the company has put limits on what the system can scrape. Moving forward, it will only be permitted to archive the site's home page, which shows popular posts and news headlines of the day, but no user comments or post history. The action comes as Reddit has become increasingly protective of the content on its site. Reddit, in May, announced it had struck a deal with OpenAI to use its content to help train ChatGPT. It previously announced a similar deal with Google - and blocked other search engines from crawling the site after that deal unless they struck financial agreements with Reddit as well. AI companies that are less well-financed, however, have reportedly been using the Internet Archive to scrape the site's previous posts and train their large language models from that content.
[12]
Reddit says its blocking the Internet Archive to stop sneaky AI scrapers accessing its content - SiliconANGLE
Reddit Inc. said today it has decided to block the Internet Archive from indexing its popular web forums in order to prevent sneaky artificial intelligence firms from scraping its content for training purposes. Reddit reportedly found evidence that AI companies were scraping its content via the Internet Archive's platform, after it restricted them from doing so using its official website. The decision means that the organization's popular Wayback Machine service will no longer be able to archive Reddit pages, threads, profiles or comments - nothing, except for what's shown on its homepage. According to a report in The Verge, the archive will, going forward, only be able to show what posts and news headlines were popular on any given day. Previously, Wayback Machine was able to archive every single page, documenting everything that was posted onto the "front page of the internet," as Reddit proclaims itself to be. Reddit did not say which AI companies were using the Wayback Machine to get around its prohibitions on them scraping its content. A spokesperson for the company told The Verge that it has "become aware of instances where AI companies violate platform policies... and scrape data from the Wayback Machine." The company seems to think that the Internet Archive should be taking steps to prevent this scraping, so there's hope that the decision won't be a permanent one. However, the report also highlights a concern by Reddit that Wayback Machine has a tendency to archive users' posts and comments that are later deleted, saying that this is problematic for user privacy. "Until they're able to defend their site and comply with platform policies, we're limiting some of their access to Reddit data to protect redditors," the company said. Although Reddit raises the issue of user privacy, it's likely that its primary motivation for blocking the scrapers is money.
AI companies are expressly prohibited from crawling its website, unless they're willing to pay to access that data. Several companies have taken Reddit up on that offer, notably Google LLC and OpenAI. Reddit has never revealed how much its deal with OpenAI is worth, but the agreement with Google is reportedly worth around $60 million. Reddit has also stated previously that it hopes to generate as much as $200 million from such licensing agreements over the next three years. One company that doesn't seem prepared to pay up is Anthropic PBC. In June, Reddit filed a lawsuit against it, saying it was continuing to scrape its content even after it claimed it was no longer doing so. The Internet Archive isn't the first organization to be blocked by Reddit over scraping concerns. In June 2024, the social media firm said it had blocked Microsoft Corp.'s Bing and smaller search engines, such as DuckDuckGo, Mojeek and Qwant, in order to prevent its content being scraped through their archives. It's not immediately clear if the Internet Archive will try and take steps to prevent its archives from being scraped so it can get Reddit's restrictions lifted. In a statement, Wayback Machine Director Mark Graham said his team is engaged in "ongoing discussions about this matter."
[13]
Reddit locks out Wayback machine to stop AI from scraping old posts
Reddit has restricted the Internet Archive's Wayback Machine from extensively capturing its content due to concerns over unauthorized AI data scraping. The platform will now allow only the homepage to be archived, aiming to protect user privacy and control content use. This move highlights the challenges of balancing digital preservation and data security in today's AI-driven world. Reddit has announced that it will restrict the Internet Archive's Wayback Machine to archiving only its homepage, blocking the tool from saving most of its site's content. This change comes as a direct response to increasing concerns about AI companies scraping Reddit data through the Wayback Machine, possibly risking Reddit's content policies and violating user privacy. According to Reddit spokesperson Tim Rathschmidt, the company has seen cases where artificial intelligence firms accessed Reddit's content via the Wayback Machine without adhering to Reddit's terms of service. This includes scraping of posts, comments, and even deleted or removed content. Such unauthorized activities challenge Reddit's ability to manage and protect its content. Rathschmidt emphasized that until the Internet Archive can guarantee compliance with Reddit's policies, this restriction will stay in place to safeguard users' privacy and preserve the integrity of removed content. The Wayback Machine is a widely used tool operated by the Internet Archive, designed to preserve snapshots of websites over time. This archival service enables users to view historical versions of web pages, which is useful for research, fact-checking, and maintaining internet history. With Reddit's new limitation, the Wayback Machine will no longer archive specific Reddit pages like posts or user profiles, only the homepage. This significantly reduces the breadth and depth of Reddit's content saved by the archive, restricting public access to old discussions and deleted data through this service. 
This restriction is part of Reddit's broader effort to control how its data is accessed and used, especially by AI companies. Reddit has recently taken several steps to protect its content, including modifying its application programming interfaces (APIs) to limit data scraping, negotiating paid data licenses with firms like Google and OpenAI, and pursuing legal action against companies such as Anthropic for unauthorized data collection. Reddit's goal is to balance user privacy, platform safety, and its business interests by carefully regulating which third parties can access its vast content. Mark Graham, director of the Wayback Machine, confirmed ongoing discussions with Reddit about this issue, but no formal announcement has been made. The Internet Archive community and users who rely on its archiving service await further updates to understand the long-term implications for internet preservation. This move by Reddit highlights the complex challenge of protecting user privacy while preserving internet content, especially as AI technologies rely on large datasets gathered from the web.
Q1. What is Reddit?
A1. Reddit is an online community where users share posts, comments, and discussions on various topics.
Q2. What is the Wayback Machine?
A2. The Wayback Machine is a tool that archives and lets people view past versions of websites.
[14]
Reddit Restricts Wayback Machine's Access To Only Its Homepage
Reddit has stopped the Internet Archive's Wayback Machine from indexing most of its content, saying that AI companies have been using it to bypass licensing fees and scrape user data, according to a report by The Verge. From now on, the Wayback Machine will not be able to archive posts, comments, or user profiles on the online discussion-hosting website. It will only be able to access Reddit's homepage, which means it can show which headlines and posts were popular on a given day, but not their details. "Internet Archive provides a service to the open web, but we've been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine," Reddit spokesperson Tim Rathschmidt told The Verge. He added: "Until they're able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we're limiting some of their access to Reddit data to protect redditors." This move will make it harder for users, journalists, and researchers to retrieve deleted posts or track how discussions changed over time. Notably, Reddit had earlier said that it would not restrict websites like the Internet Archive, but the company states it now has proof that AI firms were accessing the Internet Archive's copies of Reddit content to avoid paying licence fees.
What is the Internet Archive and the Wayback Machine?
The Internet Archive is a non-profit organisation founded in 1996 in the United States. It operates the website archive.org and offers free access to books, music, videos, software, and billions of archived web pages. Its mission is to provide "universal access to all knowledge". The Wayback Machine, launched in 2001 by the Internet Archive, lets users see what websites looked like in the past. It stores historical snapshots of web pages, including many that no longer exist online. As of now, it has archived more than 946 billion web pages.
Journalists, researchers, and Wikipedia editors widely use the tool to cross-check and verify content.
Why Reddit Made This Move
The social media platform said it has evidence that some AI firms are using the Wayback Machine to get its content without paying for it. The company had earlier said it would not block "good faith actors" like the Internet Archive, but it changed its stance after finding that AI firms were using such services to avoid licence fees. Mark Graham, the Director of the Wayback Machine, responded by saying: "We have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter."
Reddit's Recent Deals and Disputes with AI Companies
Reddit has been tightening control over its data in recent years. In June 2025, it sued Anthropic, claiming the AI company scraped Reddit content even though Anthropic claims not to train its models on 'stolen data'. Notably, the social media company's terms expressly prohibit data scraping for commercial gain. In 2024, Reddit signed a deal with Google worth a reported $60 million, allowing Google to access its data for AI model training. It also signed a licensing agreement with OpenAI that allows the AI company to access Reddit's Data application programming interface (API) and provides the social media platform with AI features for its users. However, Reddit said that this does not change its Data API Terms or Developer Terms. Notably, over the next few years, the social media company expects to make more than $200 million from such licensing deals. Furthermore, Reddit changed its API in 2023 to introduce a premium access point for third-party apps that require higher data usage limits. It said that content created and submitted is owned by redditors and cannot be used by a third party without permission.
[15]
Reddit Cuts Off Wayback Machine Over AI Data-Scraping Concerns
The action ends the Wayback Machine's long-standing practice of regularly preserving public Reddit pages for research and historical reasons. The shift is targeted directly at preventing artificial intelligence firms from circumventing Reddit's licensing agreements. While the Internet Archive is a 'good-faith actor,' Reddit indicates that AI companies have scraped archived Reddit comments from the Wayback Machine rather than negotiate direct access to the site's data. A Reddit spokesperson explained, "We've seen cases where AI businesses are breaking platform rules and scraping data off the Wayback Machine. We must ensure our privacy requirements and contracts are honored."
Reddit has implemented restrictions on the Internet Archive's Wayback Machine to prevent AI companies from scraping user data without permission, sparking debates about data privacy and AI training practices.
In a significant move to protect user data and enforce its platform policies, Reddit has implemented restrictions on the Internet Archive's Wayback Machine. This decision comes after the discovery that AI companies were using the archive to circumvent Reddit's data scraping restrictions [1].
Under the new policy, the Wayback Machine will only be allowed to archive Reddit's homepage, effectively limiting its ability to preserve the platform's vast content ecosystem. The restrictions prevent the archiving of post detail pages, comments, user profiles, and subreddit pages [2].
Reddit spokesperson Tim Rathschmidt stated that the company became "aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine" [3]. This move is part of Reddit's broader strategy to control access to its data, especially in the context of AI training.
The Internet Archive, a non-profit digital library, has been an essential resource for researchers and historians. The restrictions will significantly limit its ability to preserve Reddit's content, potentially impacting future studies on online culture and digital forensics [3].
Reddit has been actively pursuing data licensing agreements with AI companies. It has struck deals with Google and OpenAI, allowing them to use Reddit's content for AI training in exchange for substantial fees [4]. The platform's approach underscores the growing value of user-generated content in the AI era.
This incident highlights the ongoing tensions between AI companies, content platforms, and copyright holders. Several publishers and creators have filed lawsuits against AI firms for alleged copyright infringement, challenging the notion of "fair use" in AI training [3].
Reddit's decision raises questions about the future of data access for AI training. As platforms become more protective of their data, AI companies may need to reassess their data acquisition strategies and potentially negotiate more licensing agreements [5].
The Internet Archive has expressed a willingness to continue discussions with Reddit about this matter. Mark Graham, director of the Wayback Machine, stated that they have a "longstanding relationship with Reddit" and hope to find an amicable solution [4].
Summarized by Navi
[1]
[2]