Reddit Sues Perplexity AI and Data Firms Over Alleged Content Scraping

Reviewed by Nidhi Govil

34 Sources

[1]

Ars Technica

Reddit sues to block Perplexity from scraping Google search results

In a lawsuit filed on Wednesday, Reddit accused an AI search engine, Perplexity, of conspiring with several companies to illegally scrape Reddit content from Google search results, allegedly dodging anti-scraping methods that require substantial investments from both Google and Reddit. Reddit alleged that Perplexity feeds off Reddit and Google, claiming to be "the world's first answer engine" but really doing "nothing groundbreaking." "Its answer engine simply uses a different company's" large language model "to parse through a massive number of Google search results to see if it can answer a user's question based on those results," the lawsuit said. "But Perplexity can only run its 'answer engine' by wrongfully accessing and scraping Reddit content appearing in Google's own search results from Google's own search engine." Likening companies involved in the alleged conspiracy to "bank robbers," Reddit claimed it caught Perplexity "red-handed" stealing content that its "answer engine" should not have had access to. Baiting Perplexity with "the digital equivalent of marked bills," Reddit tested out posting content that could only be found in Google search engine results pages (SERPs) and "within hours, queries to Perplexity's 'answer engine' produced the contents of that test post." "The only way that Perplexity could have obtained that Reddit content and then used it in its 'answer engine' is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content and Perplexity then quickly incorporated that data into its answer engine," Reddit's lawsuit said. In a Reddit post, Perplexity denied any wrongdoing, describing its answer engine as summarizing Reddit discussions and citing Reddit threads in answers, just like anyone who shares links or posts on Reddit might do. Perplexity suggested that Reddit was attacking the open Internet by trying to extort licensing fees for Reddit content, despite knowing that Perplexity doesn't train foundational models. 
Reddit's endgame, Perplexity alleged, was to use the Perplexity lawsuit as a "show of force in Reddit's training data negotiations with Google and OpenAI." "We won't be extorted, and we won't help Reddit extort Google, even if they're our (huge) competitor," Perplexity wrote. "Perplexity will play fair, but we won't cave. And we won't let bigger companies use us in shell games." Reddit likely anticipated Perplexity's defense of the "open Internet," noting in its complaint that "Reddit's current Robots Exclusion Protocol file ('robots.txt') says, 'Reddit believes in an open Internet, but not the misuse of public content.'"

Google reveals how scrapers steal from search results

To block scraping, Reddit uses various measures, such as "registered user-identification limits, IP-rate limits, captcha bot protection, and anomaly-detection tools," the complaint said. Similarly, Google relies on "anti-scraping systems and teams dedicated to preventing unauthorized access to its products and services," Reddit said, noting Google prohibits "unauthorized automated access" to its SERPs. To back its claims, Reddit subpoenaed Google to find out more about how the search giant blocks AI scrapers from accessing content on SERPs. Google confirmed it relies on "a technological access control system called 'SearchGuard,' which is designed to prevent automated systems from accessing and obtaining wholesale search results and indexed data while allowing individual users -- i.e., humans -- access to Google's search results, including results that feature Reddit data." "SearchGuard prevents unauthorized access to Google's search data by imposing a barrier challenge that cannot be solved in the ordinary course by automated systems unless they take affirmative actions to circumvent the SearchGuard system," Reddit's complaint explained.
Bypassing these anti-scraping systems violates the Digital Millennium Copyright Act, Reddit alleged, as well as laws against unfair trade and unjust enrichment. Seemingly, Google's SearchGuard may currently be the easiest to bypass for alleged conspirators who supposedly pivoted to looting Google SERPs after realizing they couldn't access Reddit content directly on the platform.

Scrapers shocked by Reddit lawsuit

Reddit accused three companies of conspiring with Perplexity -- "a Lithuanian data scraper" called Oxylabs UAB, "a former Russian botnet" known as AWMProxy, and SerpApi, a Texas company that sells services for scraping search engines. Oxylabs "is explicit that its scraping service is meant to circumvent Google's technological measures," Reddit alleged, pointing to an Oxylabs web page called "How to Scrape Google Search Results." SerpApi touts the same service, including some options to scrape SERPs at "ludicrous speeds." To trick browsers, SerpApi's fastest option uses "a server-swarm to hide from, avoid, or simply overwhelm by brute force effective measures Google has put in place to ward off automated access to search engine results," Reddit alleged. SerpApi also allegedly provides users "with tips to reduce the chance of being blocked while web scraping, such as by sending 'fake user-agent string[s],' shifting IP addresses to avoid multiple requests from the same address, and using proxies 'to make traffic look like regular user traffic' and thereby 'impersonate' user traffic." According to Reddit, the three companies disguise "their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them." During a two-week span in July, they scraped "almost three billion" SERPs containing Reddit text, URLs, images, and videos, a subpoena requesting information from Google revealed. Ars could not immediately reach AWMProxy for comment.
However, the other companies were surprised by Reddit's lawsuit, while vowing to defend their business models. SerpApi's spokesperson told Ars that Reddit did not notify the company before filing the lawsuit. "We strongly disagree with Reddit's allegations and intend to vigorously defend ourselves in court," SerpApi's spokesperson said. "SerpApi stands firmly behind its business model and conduct, and we will continue to defend our rights to the fullest extent." Oxylabs' chief governance strategy officer, Denas Grybauskas, told Ars that Reddit's complaint seemed baffling since the other companies involved in the litigation are "unrelated and unaffiliated." "We are shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly or communicate any potential concerns," Grybauskas said. "Oxylabs has always been and will continue to be a pioneer and an industry leader in public data collection, and it will not hesitate to defend itself against these allegations. Oxylabs' position is that no company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price." Grybauskas defended Oxylabs' business as creating "real-world value for thousands of businesses and researchers, such as those driving open-source investigations, disinformation tackling, or environmental monitoring." "We strongly believe that our core business principles make the Internet a better place and serve the public good," Grybauskas said. "Oxylabs provides infrastructure for compliant access to publicly available information, and we demand every customer to use our services lawfully."

Reddit cited threats to licensing deals

Apparently, Reddit caught on to the alleged scheme after sending cease-and-desist letters to Perplexity to stop scraping Reddit content that its answer engine was citing.
Rather than ending the scraping, Reddit claimed Perplexity's citations increased "forty-fold." Since Perplexity is a customer listed on SerpApi's website, Reddit hypothesized the two were conspiring to skirt Google's anti-circumvention tools, the complaint said, along with the other companies. In a statement provided to Ars, Ben Lee, chief legal officer at Reddit, said that Oxylabs, AWMProxy, and SerpApi were "textbook examples" of scrapers that "bypass technological protections to steal data, then sell it to clients hungry for training material." "Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search," Lee said. "Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself." On Reddit, Perplexity pushed back on Reddit's claims that Perplexity ignored requests to license Reddit content. "Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content," Perplexity said. "Never has. So, it is impossible for us to sign a license agreement to do so." Reddit supposedly "insisted we pay anyway, despite lawfully accessing Reddit data," Perplexity said. "Bowing to strong arm tactics just isn't how we do business." Perplexity's spokesperson, Jesse Dwyer, told Ars the company chose to post its statement on Reddit "to illustrate a simple point." "It is a public Reddit link accessible to anyone, yet by the logic of Reddit's lawsuit, if you mention it or cite it in any way (which is your job as a reporter), they might just sue you," Dwyer said. But Reddit claimed that its business and reputation have been "damaged" by "misappropriation of Reddit data and circumvention of technological control measures." 
Without a licensing deal ensuring that Perplexity and others are respecting Reddit policies, Reddit cannot control who has access to data, how they're using data, and if data use conflicts with Reddit's privacy policy and user agreement, the complaint said. Further, Reddit's worried that Perplexity's workaround could catch on, potentially messing up Reddit's other licensing deals. All the while, Reddit noted, it has to invest "significant resources" in anti-scraping technology, with Reddit ultimately suffering damages, including "lost profits and business opportunities, reputational harm, and loss of user trust." Reddit's hoping the court will grant an injunction barring companies from scraping Reddit content from Google SERPs. It also wants companies blocked from both selling Reddit data and "developing or distributing any technology or product that is used for the unauthorized circumvention of technological control measures and scraping of Reddit data." If Reddit wins, companies could be required to pay substantial damages or to disgorge profits from the sale of Reddit content. Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder in Reddit.

[2]

CNET

'Would-Be Bank Robbers': Reddit Sues Perplexity, Data Firms Over AI Scraping

Reddit is suing the AI search developer Perplexity and the companies from which it buys AI training data, alleging the data firms are illegally scraping its content, violating its copyright protections. The lawsuit was filed on Wednesday in the US District Court for the Southern District of New York. In addition to Perplexity, three data firms are named as defendants: Oxylabs UAB, AWMProxy and SerpApi. In the filing, Reddit said the data firms circumvented Reddit and Google's technological barriers by accessing nearly three billion search engine result pages (SERPs) in a two-week period in July using techniques to mask their identities and locations. Reddit called them "would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead." Reddit said it traced that illegally scraped data back to Perplexity, which is why it previously issued a cease-and-desist letter. Perplexity is still listed as a customer of one of the data firms, SerpApi, according to its website, along with Meta, Samsung and Nvidia. Reddit is one of the most popular online platforms, with the company reporting over 110 million daily active users and more than 22 billion posts and comments. As such, it's become one of the most popular sources of the kind of human-created data that AI companies seek. Reddit has struck deals with OpenAI and Google to license its data. It's also sued Anthropic for misusing its data. Perplexity was also recently sued for copyright infringement by Encyclopedia Britannica, which owns the Merriam-Webster dictionary. Perplexity did not immediately respond to a request for comment. Copyright is one of the most contentious legal issues for AI companies. They need massive quantities of human-generated content -- like Reddit posts -- to train and refine their AI models.
Much of that content is copyrighted, which typically requires the company to negotiate with the rights holder in order to license and use it. While some AI companies have struck multimillion-dollar licensing deals with publishers like Axel Springer, others have said their use of copyrighted material is fair use and therefore doesn't require them to pay. A series of lawsuits are duking out the specifics in court, with Meta and Anthropic notching fair use victories this summer. (Disclosure: Ziff Davis, CNET's parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

[3]

Bloomberg

Reddit Sues Perplexity, Others Over Alleged Data Scraping

Reddit Inc. sued Perplexity AI Inc. and three other companies over alleged data scraping from the discussion site without permission, a sign of the growing demand and value of original data in the burgeoning AI industry. Three data scraping companies -- Oxylabs UAB, AWMProxy, and SerpApi -- have been collecting Reddit data via Google search results for the purpose of reselling it, according to the complaint filed Wednesday in federal court in Manhattan. Perplexity has been buying that data from at least one of the companies, the suit alleges.

[4]

The Register

Reddit to Perplexity: Get your filthy hands off our forums

Social media site continues legal campaign against those who take its content without a license

Reddit on Wednesday filed a lawsuit against Perplexity AI and three of its alleged data dealers for trafficking in unlawfully scraped information. The complaint, filed in the Southern District of New York, claims that Oxylabs UAB, AWM Proxy, and SerpApi unlawfully bypassed Reddit's and Google's defenses to harvest Reddit content and related search results. It also says that Perplexity chose to purchase the purloined data rather than license it from Reddit. Ben Lee, chief legal officer at Reddit, told The Register in an emailed statement that AI companies are desperate for quality content generated by real people and that need is fueling an industrial scale data laundering economy. "Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material," said Lee. "Reddit is a prime target because it's one of the largest and most dynamic collections of human conversation ever created." Lee claimed that Oxylabs UAB, a data scraping business based in Lithuania, AWM Proxy, a former Russian botnet, and SerpApi, which advertises real-time access to scraped Google search results, represent textbook examples of this sort of illegal behavior. "Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search," said Lee. "Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself." Reddit's complaint likens these three providers to "would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead."
Echoing Cloudflare CEO Matthew Prince's characterization of Perplexity, the Reddit legal filing describes Perplexity as "more akin to a 'North Korean hacker'" who will do whatever is necessary to obtain the data to fuel its AI answer engine, other than pay for a license. Google is not participating in the lawsuit but has tried to prevent automated scraping of its search results. The social media site contends that the defendants have violated the US Digital Millennium Copyright Act by bypassing its technological defenses against automated access to its servers. And it accuses SerpApi and Oxylabs specifically of violating the DMCA's prohibition on trafficking in technology circumvention products or services. Other claims include unfair competition, unjust enrichment, and civil conspiracy. Reddit is seeking an injunction to halt the unwanted scraping of its content and damages. In June, Reddit filed a similar complaint against Anthropic after it failed to convince the AI business to enter into a content licensing deal as OpenAI has done. Oxylabs, which advertises itself as "the largest ethical proxy network and advanced scraping solutions empowering the AI industry and beyond," did not immediately respond to a request for comment. SerpApi also did not respond to requests for comment. A spokesperson for Perplexity told The Register, "Perplexity has not yet received the lawsuit, but we will always fight vigorously for users' rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest." Reddit is not alone in its attempts to defend against its content being scraped and used to train AI models without consent. A lawsuit filed last month on behalf of two authors accuses Apple of "using Books3, a dataset of pirated copyrighted books" to train its OpenELM language models.
The complaint against Apple says that the company's AppleBot has been scraping web data for nine years and that data is now being used to improve Apple Intelligence models. Another case, Millette v. OpenAI (2024), contends that OpenAI scraped YouTube videos unlawfully to improve its models. The New York Times Co. v. Microsoft Corp., OpenAI (2023) makes similar allegations with regard to Microsoft's and OpenAI's alleged use of its news content. In August, content delivery network Cloudflare called out Perplexity for running web scraping bots that ignore websites' no-scraping directives.

[5]

Reddit launches copyright suit against AI search engine Perplexity

Social media platform Reddit has filed a copyright lawsuit against Perplexity, accusing the AI company of illegally scraping its data in order to train the model powering its search engine. The complaint filed in New York federal court on Wednesday marks the latest legal tussle between AI groups over alleged copyrighted material. Reddit also sued three smaller groups: Lithuanian data scraper Oxylabs UAB, former Russian botnet AWMProxy, and Texas start-up SerpApi. Reddit claims the three groups provided data-scraping services for hoovering up copyrighted Reddit content "by masking their identities, hiding their locations, and disguising their web scrapers as regular people". "AI companies are locked in an arms race for quality human content -- and that pressure has fuelled an industrial-scale 'data laundering' economy", Ben Lee, chief legal officer at Reddit, said in a statement. Perplexity was "a willing customer of at least one of its co-defendants", the social media company wrote in the filing, alleging that the San Francisco-based AI group "desperately" needed to fuel its "answer engine" by scraping data through Google search results. "We strongly disagree with Reddit's allegations and intend to vigorously defend ourselves in court," SerpApi said. Two people familiar with the matter told the Financial Times that Reddit had confronted Perplexity about its alleged theft and suggested they enter discussions about a paid partnership, but that its founder Aravind Srinivas was not interested. Reddit had also contacted Google with its concerns, asking the tech giant to investigate if Perplexity was scraping Reddit's proprietary data through its search engine and if so, to work out how to prevent this, the people added. A spokesman for Google declined to comment.
The suit adds to dozens of copyright lawsuits that have been filed against AI companies since the advent of generative AI systems, which are trained using vast amounts of text data, including content from the internet. Copyright holders have claimed their content has been used without consent or fair compensation. Reddit, which went public in March 2024 and is known for hosting devoted online communities, has struck multimillion-dollar partnerships with Google and OpenAI allowing them to train their large language models on its content. By contrast, Reddit alleged in the complaint that the defendants had circumvented their data protection measures to obtain its copyrighted material without permission. Lee said Reddit was "a prime target because it's one of the largest and most dynamic collections of human conversation ever created". In June, Reddit filed a similar lawsuit against Anthropic, alleging the AI start-up had scraped its platform more than 100,000 times since July 2024. Anthropic responded at the time that it "disagreed" with Reddit's claims and would "defend ourselves vigorously". Perplexity and Oxylabs did not immediately respond to a request for comment. AWMProxy could not be reached for comment.

[6]

Reuters

Reddit sues Perplexity for scraping data to train AI system

Oct 22 (Reuters) - Social media platform Reddit sued artificial intelligence startup Perplexity in New York federal court on Wednesday, accusing it and three other companies of unlawfully scraping its data to train Perplexity's AI-based search engine. Reddit said in the complaint that the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system. The case is one of many filed by content owners against tech companies over the alleged misuse of their copyrighted material to train AI systems. Reddit filed a similar lawsuit against AI startup Anthropic in June that is still ongoing. "Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest," Perplexity said in a statement. "AI companies are locked in an arms race for quality human content -- and that pressure has fueled an industrial-scale 'data laundering' economy," Reddit chief legal officer Ben Lee said in a statement. Reddit, which features thousands of interest-based "subreddit" web communities, said in the lawsuit that it is the most commonly cited source for AI-generated answers to user questions. It has licensed its content to Google, OpenAI and others for their AI training. Reddit said that Lithuania-based Oxylabs, Russia-based AWMProxy and Texas-based SerpApi scraped Reddit data from billions of search results without permission and that Perplexity, which does not have a license to use Reddit content, worked with at least one of the data-scraping companies to obtain Reddit material. "We strongly disagree with Reddit's allegations and intend to vigorously defend ourselves in court," a SerpApi spokesperson said. Spokespeople for Oxylabs did not immediately respond to a request for comment on the case, and AWMProxy could not be reached for comment.
Reddit said it sent Perplexity a cease-and-desist letter last year, after which it "increased the volume of citations to Reddit forty-fold." Reddit asked the court for unspecified monetary damages and an order blocking Perplexity from using its data. Reporting by Blake Brittain in Washington; editing by Nick Zieminski.

[7]

CNBC

Reddit sues Perplexity for scraping of posts, expanding user data battle with AI industry

It comes amid a similar lawsuit from Reddit against AI firm Anthropic, as the social media platform attempts to assert ownership over its user data through licensing agreements. Social media giant Reddit has launched a lawsuit against artificial intelligence company Perplexity, alleging that it illegally scraped user posts to train its AI model, marking the latest data-rights clash between content owners and the AI industry. The complaint filed in New York federal court on Wednesday also named three defendants, which Reddit says helped Perplexity collect its data: Lithuanian data scraper Oxylabs, "former Russian botnet" AWMProxy, and Texas startup SerpApi. Reddit alleged that the three smaller entities were able to extract its copyrighted content "by masking their identities, hiding their locations and disguising their web scrapers as regular people." Perplexity, which runs an AI-powered search engine, denied the allegations and accused Reddit of "extortion" and opposition to an open internet, while SerpApi told CNBC it "strongly disagrees" with Reddit's claims and intends to defend itself in court. The case represents one of many filed by content owners accusing AI firms of using copyrighted material without permission to train their large language models. Reddit, in particular, has been on the front lines of that battle, having launched a similar ongoing lawsuit against AI startup Anthropic in June. CNBC was unable to reach Oxylabs and AWMProxy. In a statement shared with CNBC, Ben Lee, Chief Legal Officer at Reddit, said that AI companies are "locked in an arms race for quality human content" and that pressure has fueled an "industrial-scale 'data laundering' economy." "Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material. Reddit is a prime target because it's one of the largest and most dynamic collections of human conversation ever created," Lee said.
Reddit -- which hosts over 100,000 interest-based "subreddit" communities -- said in its lawsuit that its user posts had become the most commonly cited source for AI-generated answers on Perplexity. It added that it sent Perplexity a cease-and-desist letter, after which it "increased the volume of citations to Reddit forty-fold." AI researchers have previously noted that Reddit's large volume of moderated conversations can help make AI chatbots produce more natural-sounding responses. In the age of artificial intelligence, Reddit has worked to leverage its massive data pool, permitting access to it only through AI-related licensing agreements. The social media company has signed such agreements with OpenAI and Alphabet's Google. In a response to the lawsuit, Perplexity, in a post on the Reddit platform, argued that it does not train AI models on content but merely summarizes and cites public Reddit discussions. Therefore, it said it is "impossible" to sign a license agreement. "A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong arm tactics just isn't how we do business," the statement read, going on to describe the suit as a "show of force in Reddit's training data negotiations with Google and OpenAI." "Perplexity believes this is a sad example of what happens when public data becomes a big part of a public company's business model," Perplexity added, noting that data licensing has become an increasingly important source of revenue for Reddit. In February, Reddit's COO Jen Wong told the trade publication Adweek that AI licensing deals with Google and OpenAI made up nearly 10% of Reddit's revenue.

[8]

Reddit sues AI company Perplexity and others for 'industrial-scale' scraping of user comments

Social media platform Reddit sued the artificial intelligence company Perplexity AI and three other entities on Wednesday, alleging their involvement in an "industrial-scale, unlawful" economy to "scrape" the comments of millions of Reddit users for commercial gain. Reddit's lawsuit in a New York federal court takes aim at San Francisco-based Perplexity, maker of an AI chatbot and "answer engine" that competes with Google, ChatGPT and others in online search. Also named in the lawsuit are Lithuanian data-scraping company Oxylabs UAB, a web domain called AWMProxy that Reddit describes as a "former Russian botnet," and Texas-based startup SerpApi. It's the second such lawsuit from Reddit since it sued another major AI company, Anthropic, in June. But the lawsuit filed Wednesday is different in the way that it confronts not just an AI company but the lesser-known services the AI industry relies on to acquire online writings needed to train AI chatbots. "Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material. Reddit is a prime target because it's one of the largest and most dynamic collections of human conversation ever created," said Ben Lee, Reddit's chief legal officer, in a statement Wednesday. Perplexity said it has not yet received the lawsuit but "will always fight vigorously for users' rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest." Oxylabs and SerpAPI didn't immediately respond to requests for comment Wednesday. AWMProxy could not immediately be reached for comment. Reddit compares the companies it is suing to "would-be bank robbers" who can't get into the bank vault, so they break into the armored truck instead. 
The lawsuit alleges they are evading Reddit's own anti-scraping measures while also "circumventing Google's controls and scraping Reddit content directly from Google's search engine results." Lee said that because they're unable to scrape Reddit directly, "they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search. Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself." Much like its lawsuit against Anthropic, maker of the chatbot Claude, Reddit claims that Perplexity has accessed Reddit's content despite being asked not to do so. Reddit made a similar argument in its lawsuit against Anthropic. That case was initially filed in California Superior Court but was later moved to federal court and has a hearing scheduled for January. Along with digitized books and news articles, websites such as Wikipedia and Reddit are deep troves of written materials that can help teach an AI assistant the patterns of human language. Reddit has previously entered licensing agreements with Google, OpenAI and other companies that are paying to be able to train their AI systems on the public commentary of Reddit's more than 100 million daily users. The licensing deals helped the 20-year-old online platform raise money ahead of its Wall Street debut as a publicly traded company last year.

[9]

Engadget

Reddit sues Perplexity and three other companies for allegedly using its content without paying

Reddit is suing companies SerpApi, Oxylabs, AWMProxy and Perplexity for allegedly scraping its data from search results and using it without a license, The New York Times reports. The new lawsuit follows legal action against AI startup Anthropic, which allegedly used Reddit content to train its Claude chatbot. As of 2023, Reddit charges companies looking to access posts and other content in the hopes of making money on data that could be used for AI training. The company has also signed licensing deals with companies like Google and OpenAI, and even built an AI answer machine of its own to leverage the knowledge in users' posts. Scraping search results for Reddit content avoids those payments, which is why the company is seeking financial damages and a permanent injunction that prevents companies from selling previously scraped Reddit material. Some of the companies Reddit is focused on, like SerpApi, Oxylabs and AWMProxy, are not exactly household names, but they've all made collecting data from search results and selling it a key part of their business. Perplexity's inclusion in the lawsuit might be more obvious. The AI company needs data to train its models, and has already been caught seemingly copying and regurgitating material it hasn't paid to license. That also includes reportedly ignoring the robots.txt protocol, a way for websites to communicate that they don't want their material scraped. Per a copy of the lawsuit provided to Engadget, Reddit had already sent a cease-and-desist to Perplexity asking it to stop scraping posts without a license. The company claimed it didn't use Reddit data, but it also continued to cite the platform in answers from its chatbot. Reddit says it was able to prove Perplexity was using scraped Reddit content by creating a "test post" that "could only be crawled by Google's search engine and was not otherwise accessible anywhere on the internet."
Within a few hours, queries made to Perplexity's answer engine were able to reproduce the content of the post. "The only way that Perplexity could have obtained that Reddit content and then used it in its 'answer engine' is if it and/or its co-defendants scraped Google [search results] for that Reddit content and Perplexity then quickly incorporated that data into its answer engine," the lawsuit claims. When asked to comment, Perplexity provided the following statement: This new lawsuit fits with the aggressive stance Reddit has taken towards protecting its data, including rate-limiting unknown bots and web crawlers in 2024, and even limiting what access the Internet Archive's Wayback Machine has to its site in August 2025. The company has also sought to define new terms around how websites are crawled by adopting the Really Simple Licensing standard, which adds licensing terms to robots.txt.

[10]

Gizmodo

Reddit Sues a Collection of Startups It Says Are Wrongly Scraping It for AI Training Data

Let's say a website makes it a violation of its terms of service for you to send bots onto its pages in order to vacuum up its text, which you want to package as AI training data and sell. Next, suppose you think of a workaround: you don't send your data scraping bots to that website, but to Google results pages that also have the text you're looking for. Are you a business genius, or a thief? If Reddit doesn't succeed with its latest long shot legal effort against data scrapers, and you're one of the companies doing this, you might just be a business genius, legally speaking anyway. Reddit's new suit, filed Wednesday in New York, is the latest round of legal Whac-A-Mole being played between established online platforms and the increasingly intricate data-sucking firms that want their precious data. Earlier this month LinkedIn filed suit against a firm called ProAPIs for using robotic accounts to ingest users' personal data -- which as we all know, LinkedIn keeps tucked away behind its irksome login wall. Reddit also sued Anthropic for something similar, saying the AI company claimed it had stopped visiting Reddit to scrape data, and then visited 100,000 more times. The new suit -- seeking damages, as well as the protection of a permanent injunction -- names four defendants. The most famous one is Perplexity AI, which markets an AI-based search engine, and is already famous for its brazenness around data scraping. The other three, Texas-based SerpApi, Lithuania's Oxylabs and AWMProxy, based in Russia, carried out versions of the more subtle plan outlined above, the suit claims. They then sold data to such tech giants as OpenAI and Meta. An Oxylabs representative, Denas Grybauskas, explained what may be the company's legal rationale to the New York Times, saying "no company should claim ownership of public data that does not belong to them." There are challenges in the way of legal victory for Reddit.
For one thing, it filed this suit in New York, and the companies it's suing are mostly in other countries. But second of all, these suits don't necessarily work out for platforms. Elon Musk's X had a similar suit dismissed last year, with the judge noting that the amount of control X was seeking over data "risks the possible creation of information monopolies that would disserve the public interest."

[11]

Washington Post

Analysis | The fight between AI companies and the websites that hate them

A lawsuit by online message board Reddit gives you a glimpse at the knockdown boxing match behind chatbot conversations. In one corner are artificial intelligence services that gobble information from across the internet to help you plan a vacation or create silly videos. In the other corner are companies that are sometimes unwilling or overwhelmed sources of that data. In its lawsuit, similar to ones against AI companies by news organizations, Hollywood studios, book authors and others, Reddit alleges that the start-up Perplexity benefited from improperly using its website as AI fuel. The claims are an example of warnings from Reddit, Wikipedia and others that say if the boxing match continues as-is, AI services may kill the websites and other source material that we love. I'll go over Reddit's allegations, which detail dramatic theft akin to a Louvre-style jewel heist, and what they say about our AI age. Dating back at least to the death of Napster a quarter-century ago, there have been constant fights over technology upstarts that remix media and information or deliver it in new ways. AI could be the most intractable fight of all.

AI 'bank robbers' vs. Reddit

The 20 years of our Reddit debates about the best Welsh restaurants and quiet air conditioners are gold for AI services. They typically need truckloads of online information like that to "train" their computers and serve up responses to your AI queries. Reddit knows how valuable it is and laid out ground rules for AI companies that wanted to profit from siphoning Reddit message boards in bulk: AI companies needed a paid contract with Reddit and to respect its guardrails.
Some companies, including Google and ChatGPT parent company OpenAI, agreed to Reddit's terms. For AI companies that didn't agree, Reddit put up digital walls to block AI companies' spiderlike software that crawls over websites to harvest their information. According to Reddit, Perplexity's CEO promised Reddit's top lawyer more than a year ago to respect Reddit's digital walls. Perplexity, which makes what it calls an AI "answer" engine and an AI-specialized web browser, instead found another way to siphon Reddit pages, the company says. (The Washington Post has partnerships with Perplexity and OpenAI.) Reddit's lawsuit, filed Wednesday in a New York federal court, said that Perplexity hired at least one data-siphoning middleman to grab many billions of pages of Reddit material indirectly, from Google search results. Those middlemen allegedly used technically sophisticated tactics to get around Google's digital defenses against unwanted siphoning by bots. Reddit said that it obtained this information from a subpoena to Google in a different, secret lawsuit. Reddit's lawsuit compared what Perplexity and the bot-for-hire middlemen did to "bank robbers" who know they can't get into the bank vault and "break into the armored truck carrying the cash instead." In a post on Reddit, Perplexity said that Reddit is after money. The lawsuit is a "sad example of what happens when public data becomes a big part of a public company's business model," Perplexity said. Google said that it has "strong technical measures to prevent this type of malicious abuse, because it undermines the choices websites make about who can access their content."

What this means for you

Experts have said that the law generally protects technology companies that take copyrighted materials like news articles, books and movies and put them to a new, creative use. Many AI companies say that their products meet that legal standard.
Blake Reid, an associate professor at the University of Colorado Law School, said that Reddit's case adds an extra wrinkle: The company doesn't hold the copyright to Reddit posts. The people who created those posts do. Reid said that helps make the lawsuit's outcome unpredictable. Regardless, AI keeps running into a paradox: To be useful, new forms of AI rely on ingesting vast swaths of the past, present and future internet. But doing so can increase costs and divert users from websites, which imperils the internet we use. We've heard similar complaints before. Entertainment companies sued YouTube for giving you free access to their creations. Music companies have howled over TikTok letting you create dance videos to Taylor Swift tunes. News organizations have groused that Google and Facebook let you browse the news without buying newspapers or visiting news websites. The content companies have typically found ways to grudgingly live with, and even profit from, the technology upstarts. AI is different, said Toshit Panigrahi, CEO of TollBit, which helps websites get paid for AI data collection. AI services grab information at warp speed and at industrial scale from so many places, including news and entertainment sites, cruise operators and furniture sellers. Panigrahi said that the old pattern -- technology changes are good for us and the owners of digital creations -- may no longer apply. "This is changing how the internet works fundamentally," he said.

[12]

Mashable

Reddit accuses Perplexity of stealing content to train AI

Reddit claims it caught Perplexity doing something it shouldn't have. The popular message board website filed a lawsuit against Perplexity, a notable AI firm, alleging that Perplexity engaged in improper data scraping to feed its AI program. The complaint (courtesy of The Verge) lists Perplexity alongside three data scraping firms: AWMProxy, Oxylabs, and SerpApi. According to Reddit, Perplexity does business with at least one of these companies, allegedly using them to get data from Reddit without the site's permission. While Reddit has signed agreements with other AI companies in the recent past, it has not done so with Perplexity. Reddit claims that it once sent a cease-and-desist letter to Perplexity for scraping Reddit content. Per Reddit's complaint, after the letter was sent, Perplexity started citing Reddit even more than before, not less. Where this really gets juicy is how Reddit claims it caught Perplexity in the alleged act of stealing data. In Reddit's words: "To confirm this hypothesis, Reddit created a 'test post' - the equivalent of a digital 'marked bill' - that could only be crawled by Google's search engine and was not otherwise accessible anywhere on the internet. Within hours, queries to Perplexity's 'answer engine' produced the contents of that test post. The only way that Perplexity could have obtained that Reddit content and then used it in its 'answer engine' is if it and/or its Co-Defendants scraped Google SERPs for that Reddit content and Perplexity then quickly incorporated that data into its answer engine." Perplexity provided a statement defending itself to The Verge. "Perplexity has not yet received the lawsuit, but we will always fight vigorously for users' rights to freely and fairly access public knowledge," the company told The Verge. "Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest."
We'll have to wait and see how the lawsuit pans out, but at least Reddit's tactic for allegedly catching Perplexity in the act is funny, if nothing else.

[13]

Axios

Reddit sues Perplexity over data scraping

Driving the news: Reddit is accusing Perplexity and the other firms of what it dubs "data laundering," whereby the data firms scrape loads of data and then sell it to AI firms, in this case Perplexity.
* The lawsuit alleges that the defendants evade Reddit anti-scraping measures and circumvent Google's controls by scraping Reddit results from Google search.
* Perplexity was not immediately available for comment.

What they're saying: "AI companies are locked in an arms race for quality human content -- and that pressure has fueled an industrial-scale 'data laundering' economy," Reddit Chief Legal Officer Ben Lee said in a statement.
* "Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material."
* "Defendants are similar to would-be bank robbers, who, knowing they cannot get into the bank vault, break into the armored truck carrying the cash instead," the suit says.

The big picture: Perplexity is also facing suits from other publishers, including Encyclopedia Britannica, The New York Times and other newspapers, while similar lawsuits have been brought against OpenAI and others.

[14]

Futurism

Perplexity Just Got Caught Breaking the Rules Red-Handed

Over two decades ago, the New Oxford American Dictionary wanted to see if any of its competitors were cribbing its definitions. So it set up a trap. In its first edition, published in 2001, NOAD included a word called "esquivalience," which it defined as the "willful avoidance of one's official responsibilities." The word was a fake. And the bait worked: the word reference website Dictionary.com was caught using "esquivalience," attributing it to Merriam Webster's New Millennium. Its guilt was undeniable, and the debacle gained considerable media coverage. These copyright traps have a name: "mountweazels" -- a term with its own curious history -- and an evolution of them is now being used by companies fending off AI data scrapers that devour vast swathes of the internet without asking permission. In a lawsuit against four tech companies filed Wednesday and covered by The New York Times, Reddit revealed how it managed to ensnare the AI startup Perplexity with its own sort of mountweazel. The forum-based social media platform put up a "test post" on its site that could "only be crawled by Google's search engine and was not otherwise accessible anywhere on the internet," it said. But within hours, Perplexity's AI-powered search engine showed the content from the trap Reddit post. "Perplexity's business model is effectively to take Reddit's content from Google search results," then feed it into an AI model and "call it a new product," Reddit lawyers argued in the suit, per the NYT. It's the latest lawsuit to put the AI industry's voracious use of scraped data under the spotlight. Training the powerful large language models that power AI products like ChatGPT would not have been possible without having free access to an unbelievable wealth of data, much of it copyrighted. Reddit itself is trying to cash in on the AI data demand by locking out scrapers and selling its user data at a premium. 
It expects to make over $200 million over the next few years through the data licensing venture. In addition to Perplexity, the Reddit suit targets three more data scraping firms: SerpApi, based in Texas; Oxylabs, a Lithuanian startup; and AWMProxy in Russia, which has been linked to a notorious malware botnet called Glupteba. Years before the AI boom, these companies scraped mountains of Google search data to provide search engine optimization services to businesses. Google's search results were themselves created by scraping websites and then organizing that data. For the most part, this created a mutually beneficial relationship, since scraping helped direct traffic to the websites the data came from through search results, the NYT explains. But then these SEO firms started selling their troves of scraped Google data directly to AI companies. The AI chatbots that were trained on these data sets don't direct a meaningful amount of traffic to the websites they get their data from -- if they give accurate attributions at all -- and suddenly, the relationship became one-sided. Reddit, which is experimenting with its own built-in AI, says that Perplexity bought these firms' scraped data sets, circumventing a cease-and-desist letter Reddit sent after it caught Perplexity directly scraping data from its posts without paying for it. The lawsuit noted that citations to Reddit data in Perplexity's AI search results had jumped "fortyfold," per the NYT.

[15]

Decrypt

Reddit Sues Perplexity AI, Alleging 'Industrial-Scale' Data Theft - Decrypt

The lawsuit names Perplexity, SerpApi, Oxylabs, and AWM Proxy as defendants. Social media platform Reddit sued Perplexity AI in federal court on Wednesday, alleging that the artificial intelligence company and its data partners orchestrated an "industrial-scale" scheme to scrape the platform's user-generated content. Reddit alleges that the other defendants, SerpApi, Oxylabs, and AWM Proxy, developed and sold tools specifically designed to break security measures protecting its content, enabling the large-scale scraping of Reddit data from search results. The tools were allegedly built with the intention of bypassing two layers of protection: first, by evading Reddit's own anti-scraping systems, and second, by circumventing Google's controls to extract Reddit content directly from its search engine results. The data companies operated as "data-scraping service providers" and "circumvented Google's technological control measures and automatedly accessed, without authorization, almost three billion search engine results pages," a copy of the lawsuit reads. Reddit claims Perplexity used data from the three firms for its answer engine even after receiving a cease-and-desist letter in May 2024. A representative from Perplexity shared a full response, posted on Reddit. Perplexity intentionally posted its response on Reddit "to illustrate a simple point: it's a public Reddit link accessible to anyone, yet by the logic of Reddit's lawsuit, if you refer to it in any way, they just might sue you too," the representative told Decrypt. Perplexity described the lawsuit as "a sad example of what happens when public data becomes a big part of a public company's business model." "Reddit thinks that's their right. But it is the opposite of an open internet," Perplexity stated.
A representative from SerpApi told Decrypt they did not receive "any communication or service from Reddit" on the matter, adding that they "strongly disagree with Reddit's allegations" and intend to seek legal recourse. "No company should claim ownership of public data that does not belong to them. It is possible that it is just an attempt to sell the same public data at an inflated price," Denas Grybauskas, chief governance and strategy officer at Oxylabs, told Decrypt in an emailed statement. Reddit similarly "made no attempt to speak" with Oxylabs, Grybauskas said. Decrypt has reached out to Reddit, Google, and AWM Proxy for comment and will update this article should they respond. In cases like this, courts would need to look first at whether the terms of service from platforms like Reddit "explicitly addresses AI training, data scraping, and commercial use," Andrew Rossow, public affairs attorney and director of strategic partnerships at video search and content intelligence platform Oriane, told Decrypt. If a user agreed to terms that "grant the platform a broad, perpetual, royalty-free license to their content," that license "generally governs the relationship between the user and the platform," Rossow explained. But it doesn't "automatically grant the AI company a license" to do the same, unless the terms permitted the platform "to sublicense or sell the data for that purpose," he added. Courts would then have to "distinguish between the user's copyright in their expression (the text of the post) and the use of the content for data mining (extracting patterns, facts, and language models)," he explained. Still, the supposed "knowledge" behind an LLM (large-language model) "is the product of millions of users' time, effort, and creative expression," Rossow argued. 
"Treating this human-generated content as a free, raw, undifferentiated resource is a form of labor exploitation that devalues online contributions," Rossow opined, adding that AI companies need to "respect digital citizenship and community norms," given how these are "the implicit and explicit rules of the digital public spaces they ingest."

[16]

AIM

Reddit Sues Perplexity for Alleged Illegal Data Scraping | AIM

This raises fresh concerns over content use in artificial intelligence. Reddit has filed a lawsuit against AI startup Perplexity in a New York federal court, accusing the company and three other firms of illegally scraping its data to train Perplexity's AI-powered search engine. According to the complaint, the defendants bypassed Reddit's data protection measures to access content essential for operating Perplexity's "answer engine" system. This lawsuit highlights a growing conflict between content creators and AI developers over the use of copyrighted material. Reddit initiated similar legal action against AI startup Anthropic in June, a case that is still ongoing. Ben Lee, Reddit's chief legal officer, described the situation as part of an "industrial-scale 'data laundering' economy," pointing to the competitive pressure AI companies face to acquire quality human-generated content. "AI companies are locked in an arms race for quality human content -- and that pressure has fueled an industrial-scale 'data laundering' economy," he said. Perplexity responded, asserting that its approach is "principled and responsible," emphasising the delivery of factual and accurate AI responses while rejecting threats that could harm openness and public interest. The case comes amid increasing scrutiny of AI companies over web scraping practices and unauthorised use of digital content. Several media outlets and creators have raised concerns about AI models being trained on copyrighted material without consent, prompting calls for clearer regulations and industry standards.

[17]

SiliconANGLE

Reddit is suing Perplexity and AI data scraping firms for using its data without permission - SiliconANGLE

Reddit Inc. has launched a lawsuit against startup Perplexity AI Inc. and three data-scraping service providers for trawling the company's copyrighted content to train AI models. Reddit compared the data scraping companies -- SerpApi, Oxylabs, and AWMProxy -- to "bank robbers," adding that one of the firms "will apparently do anything to get the Reddit data it desperately needs to fuel its 'answer engine' -- that is, anything other than enter into an agreement with Reddit directly, as some of its competitors have done." A number of AI companies have already made deals with Reddit, including OpenAI, which signed on the dotted line last year to use Reddit's trove of data to train its large language models. While no number was given, it was rumored that the deal was worth $60 million. At the time, Reddit said it hoped to bring in around $200 million from licensing agreements over the next three years, with Google LLC also signing on. The company later launched a lawsuit against Anthropic PBC, claiming it was scraping content on Reddit to train its Claude family of AI models, making this latest lawsuit, filed in the U.S. District Court for the Southern District of New York today, one of a handful currently ongoing. Data scraping firms are a fairly new phenomenon that appeared shortly after the generative AI explosion. According to the New York Times, SerpApi is based in Texas and serves a number of companies. Oxylabs is run out of Lithuania, and AWMProxy is Russian. "AI companies are locked in an arms race for quality human content -- and that pressure has fueled an industrial-scale 'data laundering' economy," Ben Lee, the chief legal officer at Reddit, told The Times. "Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material."
According to the lawsuit, Reddit claimed it set a trap for Perplexity by publishing a "test post" on its platform that was visible only to Google's search engine and inaccessible anywhere else on the internet. Within hours, the content of that hidden post appeared in Perplexity's search results, Reddit said. Perplexity has said it hasn't yet received the lawsuit, but told media it will "fight vigorously for users' rights to freely and fairly access public knowledge." It added, "Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest."

[18]

Fast Company

Reddit sues Perplexity and others for allegedly scraping millions of user comments

Reddit's lawsuit in a New York federal court takes aim at San Francisco-based Perplexity, maker of an AI chatbot and "answer engine" that competes with Google, ChatGPT, and others in online search. Also named in the lawsuit are Lithuanian data-scraping company Oxylabs UAB, a web domain called AWMProxy that Reddit describes as a "former Russian botnet," and Texas-based startup SerpApi, which lists Perplexity as a customer on its website. It's the second such lawsuit from Reddit since it sued another major AI company, Anthropic, in June.

[19]

Dataconomy

Reddit sues Perplexity over alleged large-scale data scraping

The complaint claims Perplexity obtained Reddit content indirectly through Google search results. Reddit has filed a lawsuit against the answer-engine company Perplexity and three data-scraping service providers, SerpApi, Oxylabs, and AWMProxy. The legal action seeks to halt what Reddit's complaint describes as the unlawful, industrial-scale circumvention of its data protections. The complaint alleges that Perplexity is a customer of at least one of these data-scraping firms. Reddit uses a metaphor to describe the alleged activity, comparing the providers to "would-be bank robbers" who, unable to access the company's data "vault" directly, instead target the "armored truck" carrying the information. This implies the defendants are accessing Reddit's content through indirect channels. The lawsuit asserts Perplexity is choosing to acquire data through these means rather than pursuing a direct licensing agreement, a path some of its competitors have taken. According to the court filing, Reddit issued a cease-and-desist letter to Perplexity in May 2024, demanding it stop scraping data from the platform. Following the delivery of this letter, the volume of citations from Reddit appearing on Perplexity's service reportedly increased. To further investigate, Reddit created a post on its platform that was configured to be crawlable only by Google. The company states that "within hours," Perplexity's answer engine "produced the contents" of this specific post. Reddit contends the only way Perplexity could have acquired this content was if it, or its co-defendants, scraped Google's search results for Reddit content and rapidly integrated it into its system. The platform's user-generated content, which consists of posts written and ranked by humans across a vast array of subjects, has become a valuable resource for training artificial intelligence models.
In 2023, Reddit implemented API changes that led to user protests; the company positioned these changes as a way to ensure it was compensated for the use of its data by AI developers. Since then, Reddit has secured data-licensing deals with companies including OpenAI and Google and is reportedly seeking additional arrangements. This is not Reddit's first legal challenge in this area; it previously sued Anthropic, alleging that its bots continued to access the site after the company had stated otherwise. Ben Lee, Reddit's chief legal officer, described the situation as an "industrial-scale 'data laundering' economy" fueled by an AI "arms race for quality human content." He stated, "Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material. Reddit is a prime target because it's one of the largest and most dynamic collections of human conversation ever created." Lee identified the co-defendants Oxylabs UAB, AWM Proxy, and SerpApi as "textbook examples of this illegal behavior," describing them as an obscure Lithuanian scraper, a former Russian botnet, and a company that advertises questionable tactics. He added, "Unable to scrape Reddit directly, they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search." In response to the lawsuit, Perplexity's head of communication, Jesse Dwyer, stated that the company had not yet received the legal filing. Dwyer told The Verge, "we will always fight vigorously for users' rights to freely and fairly access public knowledge." He added, "Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest."

[20]

ABC News

Reddit sues over 'industrial-scale' scraping of user comments

Reddit has sued Perplexity AI and three other entities for allegedly scraping user comments for commercial gain

Social media platform Reddit sued the artificial intelligence company Perplexity AI and three other entities on Wednesday, alleging their involvement in an "industrial-scale, unlawful" economy to "scrape" the comments of millions of Reddit users for commercial gain. Reddit's lawsuit in a New York federal court takes aim at San Francisco-based Perplexity, maker of an AI chatbot and "answer engine" that competes with Google, ChatGPT and others in online search. Also named in the lawsuit are Lithuanian data-scraping company Oxylabs UAB, a web domain called AWMProxy that Reddit describes as a "former Russian botnet," and Texas-based startup SerpApi. It's the second such lawsuit from Reddit since it sued another major AI company, Anthropic, in June. But the lawsuit filed Wednesday is different in the way that it confronts not just an AI company but the lesser-known services the AI industry relies on to acquire online writings needed to train AI chatbots. "Scrapers bypass technological protections to steal data, then sell it to clients hungry for training material. Reddit is a prime target because it's one of the largest and most dynamic collections of human conversation ever created," said Ben Lee, Reddit's chief legal officer, in a statement Wednesday. Perplexity said it has not yet received the lawsuit but "will always fight vigorously for users' rights to freely and fairly access public knowledge. Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest." Oxylabs and SerpApi didn't immediately respond to requests for comment Wednesday. AWMProxy could not immediately be reached for comment. Reddit compares the companies it is suing to "would-be bank robbers" who can't get into the bank vault, so they break into the armored truck instead.
The lawsuit alleges they are evading Reddit's own anti-scraping measures while also "circumventing Google's controls and scraping Reddit content directly from Google's search engine results." Lee said that because they're unable to scrape Reddit directly, "they mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search. Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself." Reddit claims that Perplexity has accessed Reddit's content despite being asked not to do so. Reddit made a similar argument in its lawsuit against Anthropic, maker of the chatbot Claude. That case was initially filed in California Superior Court but was later moved to federal court and has a hearing scheduled for January. Along with digitized books and news articles, websites such as Wikipedia and Reddit are deep troves of written materials that can help teach an AI assistant the patterns of human language. Reddit has previously entered licensing agreements with Google, OpenAI and other companies that are paying to be able to train their AI systems on the public commentary of Reddit's more than 100 million daily users. The licensing deals helped the 20-year-old online platform raise money ahead of its Wall Street debut as a publicly traded company last year.

[21]

Seattle Times

Reddit sues AI company Perplexity and others, claiming 'industrial-scale' scraping of user comments

[22]

U.S. News

Reddit Sues AI Company Perplexity and Others for 'Industrial-Scale' Scraping of User Comments

[23]

Gadgets 360

Reddit Sues Perplexity for Scraping Data to Train AI System

The firms have been accused of unlawfully scraping Reddit's data. Social media platform Reddit sued artificial intelligence startup Perplexity in New York federal court on Wednesday, accusing it and three other companies of unlawfully scraping its data to train Perplexity's AI-based search engine. Reddit said in the complaint that the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system. The case is one of many filed by content owners against tech companies over the alleged misuse of their copyrighted material to train AI systems. Reddit filed a similar lawsuit against AI startup Anthropic in June that is still ongoing. "Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest," Perplexity said in a statement. "AI companies are locked in an arms race for quality human content - and that pressure has fueled an industrial-scale 'data laundering' economy," Reddit chief legal officer Ben Lee said in a statement. Reddit, which features thousands of interest-based "subreddit" web communities, said in the lawsuit that it is the most commonly cited source for AI-generated answers to user questions. It has licensed its content to Google, OpenAI and others for their AI training. Reddit said that Lithuania-based Oxylabs, Russia-based AWMProxy and Texas-based SerpApi scraped Reddit data from billions of search results without permission and that Perplexity, which does not have a license to use Reddit content, worked with at least one of the data-scraping companies to obtain Reddit material. "We strongly disagree with Reddit's allegations and intend to vigorously defend ourselves in court," a SerpApi spokesperson said.
Oxylabs said in a statement that it was "shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly," and that it would also defend itself against the allegations. AWMProxy could not be reached for comment. Reddit said it sent Perplexity a cease-and-desist letter last year, after which Perplexity "increased the volume of citations to Reddit forty-fold." Reddit asked the court for unspecified monetary damages and an order blocking Perplexity from using its data.

[24]

Reddit sues Perplexity for scraping data to train AI system

Social media platform Reddit sued artificial intelligence startup Perplexity in New York federal court on Wednesday, accusing it and three other companies of unlawfully scraping its data to train Perplexity's AI-based search engine. Reddit said in the complaint that Perplexity circumvented its data-protection measures in order to steal the data that it "desperately needs" to power its "answer engine" system.

[25]

Benzinga

Reddit Accuses Perplexity, Other 'Data Scrapers' Of Stealing - Reddit (NYSE:RDDT)

Reddit Inc (NYSE:RDDT) filed a lawsuit Wednesday accusing four companies of illegally stealing its data by scraping Google search results containing Reddit content, according to The New York Times. The suit, filed in U.S. District Court for the Southern District of New York, targets SerpApi, Lithuanian start-up Oxylabs, Russian company AWMProxy and San Francisco-based Perplexity. Three of the companies allegedly sold scraped data to AI companies including OpenAI and Meta, while Perplexity operates its own AI search engine. "Recognizing they lack permission to access the data directly from Reddit, defendants have devised a scheme to scrape the data from Google's search results," the lawsuit states. The companies allegedly mask their identities and disguise web scrapers as regular users to bypass technical restrictions at industrial scale. Reddit, used by over 416 million people weekly, hosts discussions spanning diverse topics, making it particularly valuable for improving AI chatbot natural language capabilities. In 2023, Reddit began charging for data access and established licensing agreements with Google for its Gemini chatbot and OpenAI for ChatGPT. The report indicates that some companies allegedly evaded these deals through scrapers. Perplexity previously scraped Reddit without paying for it, but agreed to stop after receiving a cease-and-desist from the company. However, citations to Reddit data in Perplexity results jumped "fortyfold," according to the lawsuit. The New York Times reported that Reddit created a "trap" for Perplexity by posting test content that was accessible only to Google's search engine and was otherwise not obtainable. Within hours, that content surfaced in Perplexity search results, the lawsuit states. Reddit said it has invested "tens of millions of dollars" in anti-scraping systems.
The company is seeking a permanent injunction, financial damages and prohibition of any use or sale of previously scraped Reddit data. RDDT Price Action: Reddit shares were down 5.98% at $193.41 at the time of publication on Wednesday, according to Benzinga Pro.

[26]

PYMNTS

Reddit Sues Perplexity Over Alleged Data Scraping | PYMNTS.com

The suit names Perplexity AI, Oxylabs UAB, AWMProxy and SerpApi as defendants. Reddit said the firms allegedly obtained its data through Google search results. They then resold it to AI companies without consent or compensation. According to the filing, Perplexity purchased Reddit data from at least one of the scraping firms. Reddit Chief Legal Officer Ben Lee said the lawsuit represents a wider challenge for the industry. AI models depend increasingly on high-quality, human-generated text. "AI companies are locked in an arms race for quality human content, and that pressure has fueled an industrial-scale data laundering economy," Lee said in a statement quoted by Bloomberg. Reddit's repository of public conversations has become a critical resource for training generative AI models. The company has already signed paid data-licensing deals with OpenAI and Google. These grants offer structured access to its posts and comment threads. But Reddit claims other firms are exploiting its data without authorization. The company says this practice undermines fair competition and creator rights. Earlier this year, Reddit filed a similar case against Anthropic, alleging that the AI startup unlawfully used Reddit data to train its large language models. As PYMNTS reported, that lawsuit signaled Reddit's effort to assert ownership over its collection of human conversation as the AI industry races to secure training data. The case, Reddit Inc. v. SerpApi LLC, 25-cv-08736, could help define how U.S. courts interpret the legality of web-scraped content used in AI model training.
Spokespeople for Perplexity, SerpApi and Oxylabs did not respond to requests for comment. Legal experts say Reddit's lawsuit is part of a growing wave of disputes shaping data governance and compliance. As law firm Nelson Mullins noted, cases such as The New York Times v. OpenAI are forcing companies to reassess how they manage content ownership, consent and data provenance.

[27]

PYMNTS

Reddit Sues Perplexity AI and Data Firms Over Alleged Unauthorized Scraping | PYMNTS.com

The complaint, submitted Wednesday in federal court in Manhattan, names data-scraping firms Oxylabs UAB, AWMProxy, and SerpApi. The suit alleges these companies have been extracting Reddit content through Google search results and reselling it to third parties. Perplexity AI is accused of purchasing that data from at least one of those entities, per Bloomberg. Reddit is seeking financial compensation as well as an injunction to halt what it claims is unauthorized data collection and use in violation of U.S. copyright law. Shares of Reddit, based in San Francisco, reportedly dropped 6.5% in afternoon trading in New York following news of the suit, according to Bloomberg. The company's extensive repository of user-generated discussions has become a highly sought-after resource as AI developers race to train systems on real human interactions and opinions. Reddit has already secured licensing deals with OpenAI and Google, allowing them to use its data for AI training. However, it continues to pursue legal action against others it believes are exploiting its data without proper authorization. Earlier this year, Reddit also filed a lawsuit against AI startup Anthropic, alleging similar scraping practices. Ben Lee, Reddit's chief legal officer, told Bloomberg that "AI companies are locked in an arms race for quality human content -- and that pressure has fueled an industrial-scale 'data laundering' economy." He emphasized that Reddit's vast archive of conversations makes it an attractive target for companies seeking to train AI systems on human-generated material.
Perplexity's spokesperson Beejoli Shah said the company had not yet received the lawsuit but asserted that it would "fight vigorously for users' rights to freely and fairly access public knowledge." Shah added that Perplexity's approach is "principled and responsible" as it aims to provide accurate AI-generated answers. Representatives for SerpApi and Oxylabs declined to comment, while AWMProxy, identified in the filing as a Russian company, could not be reached, according to Bloomberg. The case, filed under Reddit Inc. v. SerpApi LLC, 25-cv-08736, is being heard in the U.S. District Court for the Southern District of New York.

[28]

MediaNama

Explained - Why Reddit Sued Perplexity AI & Other Web Scrapers?

Addressing alleged unauthorised data scraping by Perplexity AI, Reddit filed a lawsuit in a US District Court in New York to stop "the industrial-scale, unlawful circumvention of data protections". Besides Perplexity AI, the social media platform also included three other data-scraping companies, among which Perplexity is a publicised customer of at least one of them. The key allegation of the social media micro-blogging platform is that Perplexity benefitted alongside the above-mentioned data scraping companies, and it is accessing Reddit's "valuable copyrighted content" by not only disrespecting the scraping policies of Reddit, but also circumventing Google's technology controls. Meanwhile, Perplexity refuted Reddit's allegations, including using scraped Reddit data to train AI models, and stated that it does not train foundation models using the social media platform's data. According to Reddit's lawsuit accessed by MediaNama, the data scraping companies knew that Reddit's user agreements bar them from accessing the social media company's data through scraping. Under its policy, the users can agree to grant Reddit a broad, worldwide license to use, modify, distribute, and display their submitted content across all media, including for training AI and machine learning models, while allowing Reddit and its partners to remove metadata, as users agreed to waive any moral rights or attribution claims. So, these companies relied on Google as their back door channel and allegedly violated the search engine's privacy terms to scrape the data from its search results page, which was then used by Perplexity for its "answer-engine" chatbot. To illustrate this phenomenon, Reddit said that over a two-week period in July 2025, the scrapers unlawfully circumvented Google's technological controls and bypassed access restrictions to scrape nearly three billion search results pages containing Reddit's text, URLs, images, and videos. 
And, to test this hypothesis of unauthorised backdoor access, Reddit created a "test post", designed to be accessible only through Google's search engine and not available anywhere else on the internet. Within hours of posting the test post, queries sent to Perplexity's "answer engine" seem to have returned with the contents that only the test post contained. Therefore, according to Reddit the only way Perplexity or the other data-scraping companies could have obtained and used its test post was by scraping Google's search engine results page (SERP) for specifically Reddit's content. Apart from Perplexity, there are three other data-scraping companies that are named in the lawsuit. Among them, SerpApi, a Texas-based service provider, and Oxylabs, a Lithuanian company, both offer tools that enable large-scale web scraping and bypass technological access controls (such as those used by Google). Meanwhile, AWMProxy, a US-registered domain with past ties to Russia, facilitates large-scale unauthorised data scraping by selling proxy services that conceal users' identities and locations. While it is officially unclear if and how Perplexity is linked to Oxylabs and AWMProxy, the AI company is an advertised customer of Texas-based SerpApi, as publicised by the data-scraper company on its homepage. Notably, according to the lawsuit documents, SerpApi promotes its Google Search API service as a tool that allows users to access and extract data from Google's SERPs. It also says that the company encourages users to enter their location-specific information so that SerpApi can utilise the proxy servers near the designated locations to deliver more accurate search results. Addressing these practices, Reddit alleges that SerpApi unlawfully accessed and obtained Reddit's proprietary data from their "computers located in New York and/or by using proxy servers situated in New York, and/or by circumventing technological control measures on servers located in New York". 
Responding to the lawsuit, Perplexity posted on Reddit saying that "this is a sad example of what happens when public data becomes a big part of a public company's business model." It also defended using Reddit's content to answer user queries: "What does Perplexity actually do with Reddit content? We summarise Reddit discussions, and we cite Reddit threads in answers, just like people share links to posts here all the time." It added: "Perplexity invented citations in AI for two reasons: so that you can verify the accuracy of the AI-generated answers, and so you can follow the citation to learn more and expand your journey of curiosity." Notably, the AI company rejected Reddit's allegations, remarking that: "We're here to keep helping people pursue wisdom of any kind, cite our sources, and always have more questions than answers." While India doesn't have an absolute AI policy framework, the existing laws can penalise unauthorised data scraping, including publicly available data. The same was re-emphasised during the Rajya Sabha proceedings in February 2025. Answering a question, Jitin Prasada, Minister of State (MoS) under the Ministry of Electronics and Information Technology (MeitY), said that web scraping for training AI models is penalised under Section 43 of the Information Technology (IT) Act, 2000. For context, this section deals with unauthorised access to a computer system and compensation for damages to affected parties. He further cited the Information Technology (Intermediary Guidelines and Digital Media Ethics Code) Rules, 2021, which would require social media intermediaries to prevent users from sharing unlawful content, ensure data protection of users, and also guard against unauthorised data access. Prasada's response reiterated MeitY's stance about the similarities between unauthorised web scraping and unauthorised access to data stored in a computer system.
This allegation of illegal data scraping isn't the first time that Perplexity finds itself in hot water. In August 2025, cloud-infrastructure company Cloudflare blocked the verified bots of Perplexity from accessing websites which are under Cloudflare's network. Emphasising the bot-masking behaviour of Perplexity, Cloudflare wrote, "We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity." Addressing these allegations, the AI company replied by arguing that the open internet principles are valid for AI assistants, which the AI company differentiates from traditional bots. In its reply to Cloudflare, it explains that if their AI assistant doesn't have the answer to a queried question, then it goes to the internet to fetch the answers. "It goes to the relevant websites, reads the content, and brings back a summary tailored to your specific question," explains its blog post. Emphasising the key difference between a user agent and a bot, it explained that: "User-driven agents act only when users make specific requests and access only the content required to satisfy those requests." "The AI agents need not respect the protocols mentioned in the robots.txt file. Because it is a browser performing a task based on the user's query or given task," pointed out Kiran Jonnalagadda, Co-Founder of HasGeek, in an earlier interview with MediaNama. In a growing wave of legal challenges against AI companies, Encyclopaedia Britannica and its subsidiary Merriam-Webster recently sued Perplexity for copyright infringement, alleging that the AI firm "free rides" on their content by generating summaries that divert web traffic. Similarly, in June 2025, the BBC also threatened legal action, alleging that Perplexity had copied its news reports verbatim.
The British news organisation asked for the removal of its content from Perplexity, in addition to demanding compensation and guarantees to stop such further use. In a similar vein, the New York Times (NYT) issued a cease-and-desist notice to Perplexity in October 2024, claiming that the platform used its content for generative AI purposes despite the news publisher's efforts to block access. Elsewhere, NYT filed a lawsuit against OpenAI in December 2023, alleging that the AI company used "millions of articles published by the Times" to train its large language models (LLMs). Reddit also sued US-based AI company Anthropic in a San Francisco court earlier this year, alleging the company scraped and used its user-generated content without permission to train the Claude chatbot. The complaint said that Anthropic accessed Reddit's servers more than 100,000 times, used its posts as "good samples" to fine-tune Claude, and monetised the chatbot while publicly denying the use of unauthorised data. The social media platform accused Anthropic of violating its terms of service, infringing on user privacy, and misappropriating platform data for commercial gain. The company back then sought an injunction to stop further use of its data, delete all Reddit-derived training material, and provide financial restitution plus punitive damages. Earlier in 2024, Reddit signed an exclusive deal with Google, allowing the search engine company to train its AI models using Reddit's data. And apart from the exclusive content-licensing deals with Google and OpenAI, Reddit already has its own AI chatbot, called Reddit Answers, that fetches the answers to queries from its own user-generated database created since 2005. Interestingly, in one of Reddit's popular Ask Me Anything (AMA) sessions, Reddit's CEO, Steve Huffman, hinted that some of its content may go behind the paywall from 2025. 
Also in June 2023, Reddit had introduced a new premium tier for its Data API, requiring third-party apps with higher data usage to pay for more access. Such a shift in their policy was because, as outlined by Huffman, Reddit should not provide its valuable user-generated data to large companies for free, particularly since such data could enhance the performance of foundational LLMs.

[29]

New York Post

Reddit sues Perplexity AI over 'industrial-scale' data scraping

Social media giant Reddit is suing Perplexity AI and three other firms over alleged "industrial-scale" scraping of posts from its website. Perplexity - a San Francisco-based startup with its own chatbot and "answer engine" - has allegedly skirted Reddit's data protections to swipe posts from the site for its own use, according to the lawsuit filed Wednesday in New York federal court. In contrast, companies including Google and OpenAI have signed deals with Reddit, other social media firms and news outlets, which provide content used to train AI chatbots. Reddit is seeking unspecified damages in the new suit, accusing the defendants of unfair competition and enrichment, as well as breaking US copyright laws. The company filed a similar complaint against Anthropic in June. This time around, instead of just suing Perplexity, Reddit is taking aim at the smaller partners it relies on to scrape data behind the scenes. Along with Perplexity, the new suit names Oxylabs UAB, AWMProxy and SerpApi - which Reddit described as "a Lithuanian data scraper, a former Russian botnet, and a Texas company that publicly advertises its shady circumvention tactics," respectively. These data scrapers "mask their identities, hide their locations, and disguise their web scrapers to steal Reddit content from Google Search," Ben Lee, Reddit's chief legal officer, told The Post in a statement. "Perplexity is a willing customer of at least one of these scrapers, choosing to buy stolen data rather than enter into a lawful agreement with Reddit itself." Perplexity has denied the allegations and accused Reddit of "extortion." It did not immediately respond to The Post's request for comment. A spokesperson for SerpApi also denied the claims in the suit, adding that the company "stands firmly behind its business model and conduct." 
Denas Grybauskas, chief governance and strategy officer at Oxylabs, told The Post in a statement that Oxylabs "will not hesitate to defend itself against these allegations." "Oxylabs has always been and will continue to be a pioneer and industry leader in public data collection." AWMProxy could not immediately be reached for comment. Reddit boasts more than 100,000 "subreddit" communities on its site, where users debate and discuss everything from sports and politics to video games and TV shows. Researchers have said Reddit's trove of user responses can help train AI chatbots to generate more human-like responses. In the suit, Reddit said posts from users on its website had become the most frequently cited source for AI-generated answers across Perplexity. Reddit said it sent Perplexity a cease-and-desist letter - but afterwards, the AI platform's use of its content increased "forty-fold," according to the lawsuit.

[30]

Analytics Insight

AI Ethics Clash: Reddit Slaps Perplexity & Three Others with Lawsuit Over AI Data Theft Claims

The lawsuit also names three other firms: SerpApi, Oxylabs, and AWMProxy. Reddit sued another major AI company, Anthropic, in June over web scraping and data protection violations. The lawsuit, filed in the US District Court for the Southern District of New York, accuses these companies of unfair competition and unjust enrichment while also alleging that some violated US copyright laws. Reddit said in the complaint that the "data-scraping companies circumvented its data protection measures to steal data that Perplexity desperately needs to power its answer engine system." "AI companies are locked in an arms race for quality human content, and that pressure has fueled an industrial-scale 'data laundering' economy," Reddit chief legal officer Ben Lee said in a statement. Lee added, "scrapers bypass technological protections to steal data and sell it to clients looking for training material." He added that "Reddit is a prime target because it's one of the largest and most dynamic collections of human conversation ever created." Reddit, which hosts over 100,000 interest-based subreddit communities, said in its lawsuit that its user posts had become the most commonly cited source for AI-generated answers on Perplexity. It added that it sent Perplexity a cease-and-desist letter, after which Perplexity increased the volume of citations to Reddit "forty-fold."

[31]

BNN

Reddit sues AI company Perplexity and others for 'industrial-scale' scraping of user comments

[32]

Investing.com

Reddit sues Perplexity AI and others over alleged data scraping By Investing.com

Investing.com -- Reddit Inc (NYSE:RDDT) has filed a lawsuit against Perplexity AI and three data scraping companies for allegedly collecting and using Reddit data without permission. The lawsuit, filed Wednesday in federal court in Manhattan, claims that Oxylabs UAB, AWMProxy, and SerpApi have been collecting Reddit data through Google search results to resell it. According to the complaint, Perplexity AI has been purchasing this data from at least one of these companies. Reddit alleges these companies are circumventing its technological barriers by scraping data from Google's search engine results instead of directly from Reddit. The complaint states that during a two-week period in July 2025, the three data scraping defendants accessed almost three billion search engine results pages containing Reddit content. The social media platform claims it caught Perplexity "red-handed" using Reddit data acquired through scraping Google search results, despite sending the company a cease-and-desist letter. According to the lawsuit, Perplexity's citations to Reddit increased forty-fold after being told to stop. Reddit argues that some major AI companies like OpenAI and Google have properly entered into agreements to access Reddit data while protecting user rights, but the defendants chose not to follow this path. The lawsuit cites violations of the Digital Millennium Copyright Act, which prohibits circumventing technological measures that control access to copyrighted works. Reddit is seeking to end what it describes as the defendants' "circumvention of security measures" and "blatant misuse of Reddit content." Reddit stock is down 6.6% today amid AI bubble concerns. Earlier in the year, Reddit sued another AI startup leader, Anthropic, in California court for similar data scraping violations.

[33]

Market Screener

Reddit sues Perplexity for illegal data extraction

On Wednesday, Reddit filed a lawsuit against Perplexity and three other companies in federal court in New York, accusing them of illegally circumventing its protective measures to extract massive amounts of data from its platform. According to the social network, this content was used to train the startup's AI-powered search engine without authorization or compensation. In its complaint, Reddit denounces "large-scale data theft," citing a deliberate strategy to access restricted information. The company believes that Perplexity had a critical need for this content to improve the relevance of its automated response system, which is based on the analysis of human discussions and exchanges. Reddit's general counsel, Ben Lee, refers to an "industrial-scale data laundering economy" fueled by competition among AI players for access to quality content. This legal action is part of a broader context of tensions between online content owners and artificial intelligence companies. Reddit had already initiated similar proceedings in June against the start-up Anthropic. Several media outlets and publishers also accuse AI models of having been trained on their copyrighted content without compensation. Perplexity responded by stating that its methods are based on principles of responsibility and accuracy, and that it will not "tolerate threats to openness and the public interest." The rapidly expanding startup is seen as one of Alphabet's most serious competitors in the field of AI-augmented search engines.

[34]

Digit

Reddit sues Perplexity over alleged illegal data scraping to train its AI engine

Reddit has filed a lawsuit against AI startup Perplexity in a New York federal court, accusing the company of illegally scraping its data to train Perplexity's AI-based search engine. The complaint also names three other companies allegedly involved in the data scraping. According to Reddit, the data-scraping companies bypassed its data protection measures to steal content that Perplexity "desperately needs" to power its "answer engine" system, reports Reuters. This case is part of a growing wave of lawsuits where content owners are taking legal action against tech companies for using copyrighted material without permission to train artificial intelligence systems. In June, Reddit filed a similar lawsuit against AI startup Anthropic, which is still ongoing. "Our approach remains principled and responsible as we provide factual answers with accurate AI, and we will not tolerate threats against openness and the public interest," Perplexity was quoted as saying in the report. Reddit's chief legal officer, Ben Lee, said, "AI companies are locked in an arms race for quality human content - and that pressure has fueled an industrial-scale 'data laundering' economy." The social media platform, known for its thousands of topic-focused subreddit communities, pointed out in the lawsuit that it is one of the most frequently cited sources for AI-generated answers to user questions. Reddit has already licensed its content to major companies like Google and OpenAI for AI training. The lawsuit claims that Lithuania-based Oxylabs, Russia-based AWMProxy, and Texas-based SerpApi scraped Reddit data from billions of search results without permission.
Reddit alleges that Perplexity, which does not have a license to use Reddit's content, collaborated with at least one of these scraping companies to obtain Reddit material. "We strongly disagree with Reddit's allegations and intend to vigorously defend ourselves in court," a SerpApi spokesperson said. Meanwhile, Oxylabs said that it was "shocked and disappointed by this news, as Reddit has made no attempt to speak with us directly," and that it would defend itself against the accusations. Reddit stated that it had sent Perplexity a cease-and-desist letter last year. Following that, Reddit says Perplexity "increased the volume of citations to Reddit forty-fold." The company is seeking monetary damages and a court order to prevent Perplexity from using its content.

Reddit has filed a lawsuit against Perplexity AI and three data firms, accusing them of illegally scraping its content from Google search results. The case highlights the growing demand for quality data in AI training and the legal challenges surrounding data acquisition.

Reddit Takes Legal Action Against Perplexity AI and Data Firms

In a significant move that underscores the growing tensions in the AI industry over data access and copyright, Reddit has filed a lawsuit against Perplexity AI and three data firms, accusing them of illegally scraping its content from Google search results [1]. The lawsuit, filed in the US District Court for the Southern District of New York, names Perplexity AI, Oxylabs UAB, AWMProxy, and SerpApi as defendants [2].

The Allegations

Reddit claims that the data firms have been circumventing both Reddit's and Google's technological barriers to access nearly three billion search engine result pages (SERPs) in just a two-week period in July [2]. The social media platform alleges that these companies used techniques to mask their identities and locations, likening their actions to 'would-be bank robbers' [1].

According to the complaint, Perplexity AI is accused of purchasing this illegally scraped data rather than entering into a lawful agreement with Reddit [4]. Ben Lee, Reddit's chief legal officer, stated that this case exemplifies an 'industrial-scale data laundering economy' fueled by AI companies' desperate need for quality content generated by real people [4].

Google's Role and Anti-Scraping Measures

While Google is not a party to the lawsuit, Reddit's complaint reveals insights into the search giant's anti-scraping measures. Google reportedly uses a system called 'SearchGuard' to prevent automated access to its search results [1]. This information was obtained by Reddit through a subpoena to Google, highlighting the complex interplay between major tech platforms in this dispute [1].

Legal Implications and Industry Impact

The lawsuit alleges violations of the US Digital Millennium Copyright Act (DMCA) and unfair competition laws, and accuses the defendants of unjust enrichment and civil conspiracy [1]. This legal action is part of a broader trend of content creators and platforms seeking to protect their data from unauthorized use in AI training.

Reddit has previously struck deals with companies like OpenAI and Google to license its data [2]. However, the platform is taking a strong stance against unauthorized access, having filed a similar complaint against Anthropic in June 2025 [4].

Perplexity's Response and Industry Reactions

Perplexity AI has denied any wrongdoing, describing its answer engine as simply summarizing Reddit discussions and citing Reddit threads in answers, similar to how users might share links or posts on Reddit [1]. The company argues that Reddit is attacking the open Internet and attempting to extort licensing fees [1].

This case is part of a larger debate in the AI industry about the use of copyrighted material for training AI models. While some companies have negotiated licensing deals, others argue that their use of such content falls under fair use [2]. The outcome of this lawsuit could have significant implications for how AI companies access and use online data in the future.

As the AI industry continues to evolve rapidly, this legal battle highlights the complex challenges surrounding data rights, intellectual property, and the ethical development of AI technologies. The resolution of this case may set important precedents for future disputes in this fast-growing field.

References

Summarized by

Navi

[1]

Ars Technica

Reddit sues to block Perplexity from scraping Google search results

[2]

CNET

'Would-Be Bank Robbers': Reddit Sues Perplexity, Data Firms Over AI Scraping

[3]

Bloomberg

Reddit Sues Perplexity, Others Over Alleged Data Scraping

[4]

The Register

Reddit to Perplexity: Get your filthy hands off our forums

[5]

Reddit launches copyright suit against AI search engine Perplexity
