Curated by THEOUTPOST
On Thu, 24 Oct, 12:06 AM UTC
6 Sources
[1]
Former OpenAI Employee Condemns the Company's Data Scraping Practices
An artificial intelligence researcher who worked at OpenAI as recently as August says the company violates copyright law. Part of Suchir Balaji's job was to gather enormous amounts of data for OpenAI's GPT-4 multimodal AI, but at the time he treated it as a research project and didn't think that the product he was working on would ultimately turn out to be a chatbot with an integrated AI image generator. "With a research project, you can, generally speaking, train on any data," Balaji tells The New York Times. "That was the mindset at the time." Balaji says he was drawn to AI research because he thought the technology could do some good for the world. However, he now thinks it is causing more harm to society than good. The Berkeley graduate thinks that OpenAI is a threat to the very entities it took the data from to build its products -- including individuals, businesses, and internet services. "If you believe what I believe, you have to just leave the company," Balaji tells The Times. OpenAI builds products like ChatGPT and DALL-E by taking data from the open web and feeding it into a machine-learning program that learns from it. Balaji says it's not a "sustainable model for the internet ecosystem." In a statement to The Times, OpenAI says: "We build our AI models using publicly available data, in a manner protected by fair use and related principles, and supported by longstanding and widely accepted legal precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness." However, the fair use argument for AI training has yet to be tested in court, and OpenAI is facing numerous lawsuits -- predominantly from wordsmiths, including The New York Times. Balaji says OpenAI's practices don't meet the fair use criteria, arguing that the company makes copies of copyrighted data and amalgamates them. "The outputs aren't exact copies of the inputs, but they are also not fundamentally novel," he says. Balaji has published a mathematical analysis on his personal website to support his argument that OpenAI violates copyright law.
[2]
OpenAI Whistleblower Disgusted That His Job Was to Vacuum Up Copyrighted Data to Train Its Models
"If you believe what I believe, you have to just leave the company." A former OpenAI researcher is blowing the whistle on the company's AI training practices, alleging that OpenAI violated copyright law to train its AI models -- and arguing that OpenAI's current business model stands to upend the business of the internet as we know it, according to The New York Times. The ex-staffer, a 25-year-old named Suchir Balaji, worked at OpenAI for four years before deciding to leave the AI firm due to ethical concerns. As Balaji sees it, because ChatGPT and other OpenAI products have become so heavily commercialized, OpenAI's practice of scraping online material en masse to feed its data-hungry AI models no longer satisfies the criteria of the fair use doctrine. OpenAI -- which is currently facing several copyright lawsuits, including a high-profile case brought last year by the NYT -- has argued the opposite. "If you believe what I believe," Balaji told the NYT, "you have to just leave the company." Balaji's warnings, which he outlined in a post on his personal website yesterday, add to the ever-growing controversy around the AI industry's collection and use of copyrighted material to train AI models, which was largely conducted without comprehensive government regulation and outside of the public eye. "Given that AI is evolving so quickly," intellectual property lawyer Bradley Hulbert told the NYT, "it is time for Congress to step in." Balaji, who was hired in 2020, was one of several staffers tasked with collecting and organizing web-gathered training data that would eventually be fed into OpenAI's large language models (LLMs). Because OpenAI was still technically just a well-funded research company at the time, the issue of copyright wasn't as big of a deal. "With a research project, you can, generally speaking, train on any data," Balaji told the NYT. "That was the mindset at the time." But once ChatGPT was released in November 2022, Balaji says, his feelings started to change. After all, the chatbot was no longer a closed-door research project; instead, powered by OpenAI's LLMs, it was being commodified for commercial use -- including in cases where the AI was being used to produce content or services that directly reflected or mimicked the copyrighted source material it was trained on, thus threatening the livelihoods and profit models of those very individuals and businesses. "This is not a sustainable model," Bilaji told the NYT, "for the internet ecosystem as a whole." For its part, in a statement to the NYT, OpenAI -- which has since abandoned its non-profit roots entirely -- argued that it builds its "AI models using publicly available data, in a manner protected by fair use and related principles" and that is "critical for "US competitiveness."
[3]
Former OpenAI researcher says the company broke copyright law
Suchir Balaji spent nearly four years as an artificial intelligence researcher at OpenAI. Among other projects, he helped gather and organize the enormous amounts of internet data the company used to build its online chatbot, ChatGPT. At the time, he did not carefully consider whether the company had a legal right to build its products in this way. He assumed the San Francisco startup was free to use any internet data, whether it was copyrighted or not. But after the release of ChatGPT in late 2022, he thought harder about what the company was doing. He came to the conclusion that OpenAI's use of copyrighted data violated the law and that technologies like ChatGPT were damaging the internet.
[4]
Former OpenAI Staffer Says the Company Is Breaking Copyright Law and Destroying the Internet
Is OpenAI breaking U.S. copyright law? A former employee of the company says yes. A former researcher at OpenAI has come out against the company's business model, writing in a personal blog that he believes the company is not complying with U.S. copyright law. That makes him one of a growing chorus of voices that sees the tech giant's data-hoovering business as based on shaky (if not plainly illegitimate) legal ground. "If you believe what I believe, you have to just leave the company," Suchir Balaji recently told the New York Times. Balaji, a 25-year-old UC Berkeley graduate who joined OpenAI in 2020 and went on to work on GPT-4, said he originally became interested in pursuing a career in the AI industry because he felt the technology could "be used to solve unsolvable problems, like curing diseases and stopping aging." Balaji worked for OpenAI for four years before leaving the company this summer. Now, Balaji says he sees the technology being used for things he doesn't agree with, and believes that AI companies are "destroying the commercial viability of the individuals, businesses and internet services that created the digital data used to train these A.I. systems," the Times writes. This week, Balaji posted an essay on his personal website, in which he argued that OpenAI was breaking copyright law. In the essay, he attempted to show "how much copyrighted information" from an AI system's training dataset ultimately "makes its way to the outputs of a model." Balaji's conclusion from his analysis was that ChatGPT's output does not meet the standard for "fair use," the legal standard that allows the limited use of copyrighted material without the copyright holder's permission. "The only way out of all this is regulation," Balaji later told the Times, in reference to the legal issues created by AI's business model. Gizmodo reached out to OpenAI for comment. In a statement provided to the Times, the tech company offered the following rebuttal to Balaji's criticism: "We build our A.I. models using publicly available data, in a manner protected by fair use and related principles, and supported by longstanding and widely accepted legal precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness." It should be noted that the New York Times is currently suing OpenAI for unlicensed use of its copyrighted material. The Times claims that the company and its partner, Microsoft, used millions of the newspaper's articles to train models that now compete with it in the same market. The newspaper is not alone. OpenAI is currently being sued by a broad variety of celebrities, artists, authors, and coders, all of whom claim to have had their work ripped off by the company's data-hoovering algorithms. Other well-known people and organizations who have sued OpenAI include Sarah Silverman, Ta-Nehisi Coates, George R. R. Martin, Jonathan Franzen, John Grisham, the Center for Investigative Reporting, The Intercept, a variety of newspapers (including The Denver Post and the Chicago Tribune), and a variety of YouTubers, among others. Despite a mixture of confusion and disinterest from the general public, the list of people who have come out to criticize the AI industry's business model continues to grow. Celebrities, tech ethicists, and legal experts are all skeptical of an industry that continues to grow in power and influence while introducing troublesome new legal and social dilemmas to the world.
[5]
Former OpenAI Researcher Says Company Broke Copyright Law
Cade Metz has written about artificial intelligence for 15 years. Suchir Balaji spent nearly four years as an artificial intelligence researcher at OpenAI. Among other projects, he helped gather and organize the enormous amounts of internet data the company used to build its online chatbot, ChatGPT. At the time, he did not carefully consider whether the company had a legal right to build its products in this way. He assumed the San Francisco start-up was free to use any internet data, whether it was copyrighted or not. But after the release of ChatGPT in late 2022, he thought harder about what the company was doing. He came to the conclusion that OpenAI's use of copyrighted data violated the law and that technologies like ChatGPT were damaging the internet. In August, he left OpenAI because he no longer wanted to contribute to technologies that he believed would bring society more harm than benefit. "If you believe what I believe, you have to just leave the company," he said during a recent series of interviews with The New York Times. Mr. Balaji, 25, who has not taken a new job and is working on what he calls "personal projects," is among the first employees to leave a major A.I. company and speak out publicly against the way these companies have used copyrighted data to create their technologies. A former vice president at the London start-up Stability AI, which specializes in image- and audio-generating technologies, has made similar arguments. Over the past two years, a number of individuals and businesses have sued various A.I. companies, including OpenAI, arguing that they illegally used copyrighted material to train their technologies. Those who have filed suits include computer programmers, artists, record labels, book authors and news organizations. In December, The New York Times sued OpenAI and its primary partner, Microsoft, claiming they used millions of articles published by The Times to build chatbots that now compete with the news outlet as a source of reliable information. Both companies have denied the claims. Many researchers who have worked inside OpenAI and other tech companies have cautioned that A.I. technologies could cause serious harm. But most of those warnings have been about future risks, like A.I. systems that could one day help create new bioweapons or even destroy humanity. Mr. Balaji believes the threats are more immediate. ChatGPT and other chatbots, he said, are destroying the commercial viability of the individuals, businesses and internet services that created the digital data used to train these A.I. systems. "This is not a sustainable model for the internet ecosystem as a whole," he told The Times. OpenAI disagrees with Mr. Balaji, saying in a statement: "We build our A.I. models using publicly available data, in a manner protected by fair use and related principles, and supported by longstanding and widely accepted legal precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness." In 2013, an artificial intelligence start-up in London called DeepMind unveiled A.I. technology that learned to play classic Atari games on its own, including Space Invaders, Pong and Breakout. When Mr. Balaji was a teenager growing up in Cupertino, Calif., he stumbled across a news story about the technology. It captured his imagination, as did a later DeepMind creation that mastered the ancient game of Go. "I thought that A.I. 
was a thing that could be used to solve unsolvable problems, like curing diseases and stopping aging," he said. "I thought we could invent some kind of scientist that could help solve them." During a gap year after high school and as a computer science student at the University of California, Berkeley, Mr. Balaji began exploring the key idea behind DeepMind's technologies: a mathematical system called a neural network that could learn skills by analyzing digital data. In 2020, he joined a stream of Berkeley grads who went to work for OpenAI. In early 2022, Mr. Balaji began gathering digital data for a new project called GPT-4. This was a neural network that spent months analyzing practically all the English language text on the internet. He and his colleagues, Mr. Balaji said, treated it like a research project. Though OpenAI had recently transformed itself into a profit-making company and had started selling access to a similar technology called GPT-3, they did not think of their work as something that would compete with existing internet services. GPT-3 was not a chatbot. It was a technology that allowed businesses and computer coders to build other software apps. "With a research project, you can, generally speaking, train on any data," Mr. Balaji said. "That was the mind-set at the time." Then OpenAI released ChatGPT. Initially driven by a precursor to GPT-4 and later by GPT-4 itself, the chatbot grabbed the attention of hundreds of millions of people and quickly became a moneymaker. OpenAI, Microsoft and other companies have said that using internet data to train their A.I. systems meets the requirements of the "fair use" doctrine. The doctrine has four factors. The companies argue that those factors -- including that they substantially transformed the copyrighted works and were not competing in the same market with a direct substitute for those works -- play in their favor. Mr. Balaji does not believe these criteria have been met. When a system like GPT-4 learns from data, he said, it makes a complete copy of that data. From there, a company like OpenAI can then teach the system to generate an exact copy of the data. Or it can teach the system to generate text that is in no way a copy. The reality, he said, is that companies teach the systems to do something in between. "The outputs aren't exact copies of the inputs, but they are also not fundamentally novel," he said. This week, he posted an essay on his personal website that included what he describes as a mathematical analysis that aims to show that this claim is true. Mark Lemley, a Stanford University law professor, argued the opposite. Most of what chatbots put out, he said, is sufficiently different from their training data. "There are occasionally circumstances where an output looks like an input," he said. "A vast majority of things generated by a ChatGPT or an image generation system do not draw heavily from a particular piece of content." The technology violates the law, Mr. Balaji argued, because in many cases it directly competes with the copyrighted works it learned from. Generative models are designed to imitate online data, he said, so they can substitute for "basically anything" on the internet, from news stories to online forums. The larger problem, he said, is that as A.I. technologies replace existing internet services, they are generating false and sometimes completely made-up information -- what researchers call "hallucinations." The internet, he said, is changing for the worse. Bradley J.
Hulbert, a lawyer who specializes in intellectual property law, said that the copyright laws now in place were written well before the rise of A.I. and that no court has yet decided whether A.I. technologies like ChatGPT violate the law. He also argued that Congress should create a new law that addresses this technology. "Given that A.I. is evolving so quickly," he said, "it is time for Congress to step in." Mr. Balaji agreed. "The only way out of all this is regulation," he said.
[6]
Former OpenAI researcher says the company broke copyright law
Suchir Balaji spent nearly four years as an artificial intelligence researcher at OpenAI. Among other projects, he helped gather and organize the enormous amounts of internet data the company used to build its online chatbot, ChatGPT. At the time, he did not carefully consider whether the company had a legal right to build its products in this way. He assumed the San Francisco startup was free to use any internet data, whether it was copyrighted or not. But after the release of ChatGPT in late 2022, he thought harder about what the company was doing. He came to the conclusion that OpenAI's use of copyrighted data violated the law and that technologies like ChatGPT were damaging the internet. In August, Balaji, 25, left OpenAI because he no longer wanted to contribute to technologies that he believed would bring society more harm than benefit. "If you believe what I believe, you have to just leave the company," he said during a recent series of interviews with The New York Times. Balaji, who has not taken a new job and is working on what he calls "personal projects," is among the first employees to leave a major AI company and speak out publicly against the way these companies have used copyrighted data to create their technologies. A former vice president at the London startup Stability AI, which specializes in image- and audio-generating technologies, has made similar arguments. Over the past two years, a number of individuals and businesses have sued various AI companies, including OpenAI, arguing that they illegally used copyrighted material to train their technologies. Those who have filed suits include computer programmers, artists, record labels, book authors and news organizations. In December, The New York Times sued OpenAI and its primary partner, Microsoft, claiming they used millions of articles published by the Times to build chatbots that now compete with the news outlet as a source of reliable information. Both companies have denied the claims. Many researchers who have worked inside OpenAI and other tech companies have cautioned that AI technologies could cause serious harm. But most of those warnings have been about future risks, like AI systems that could one day help create new bioweapons or even destroy humanity. Balaji believes the threats are more immediate. ChatGPT and other chatbots, he said, are destroying the commercial viability of the individuals, businesses and internet services that created the digital data used to train these AI systems. "This is not a sustainable model for the internet ecosystem as a whole," he told the Times. OpenAI disagrees with Balaji, saying in a statement, "We build our AI models using publicly available data, in a manner protected by fair use and related principles, and supported by long-standing and widely accepted legal precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness." In 2013, an artificial intelligence startup in London called DeepMind unveiled AI technology that learned to play classic Atari games on its own, including Space Invaders, Pong and Breakout. When Balaji was a teenager growing up in Cupertino, California, he stumbled across a news story about the technology.
It captured his imagination, as did a later DeepMind creation that mastered the ancient game of Go. "I thought that AI was a thing that could be used to solve unsolvable problems, like curing diseases and stopping aging," he said. "I thought we could invent some kind of scientist that could help solve them." During a gap year after high school and as a computer science student at the University of California, Berkeley, Balaji began exploring the key idea behind DeepMind's technologies: a mathematical system called a neural network that could learn skills by analyzing digital data. In 2020, he joined a stream of Berkeley grads who went to work for OpenAI. In early 2022, Balaji began gathering digital data for a new project called GPT-4. This was a neural network that spent months analyzing practically all the English language text on the internet. He and his colleagues, Balaji said, treated it like a research project. Although OpenAI had recently transformed itself into a profit-making company and had started selling access to a similar technology called GPT-3, they did not think of their work as something that would compete with existing internet services. GPT-3 was not a chatbot. It was a technology that allowed businesses and computer coders to build other software apps. "With a research project, you can, generally speaking, train on any data," Balaji said. "That was the mindset at the time." Then OpenAI released ChatGPT. Initially driven by a precursor to GPT-4 and later by GPT-4 itself, the chatbot grabbed the attention of hundreds of millions of people and quickly became a moneymaker. OpenAI, Microsoft and other companies have said that using internet data to train their AI systems meets the requirements of the "fair use" doctrine. The doctrine has four factors. The companies argue that those factors -- including that they substantially transformed the copyrighted works and were not competing in the same market with a direct substitute for those works -- play in their favor. Balaji does not believe these criteria have been met. When a system like GPT-4 learns from data, he said, it makes a complete copy of that data. From there, a company like OpenAI can then teach the system to generate an exact copy of the data, or it can teach the system to generate text that is in no way a copy. The reality, he said, is that companies teach the systems to do something in between. "The outputs aren't exact copies of the inputs, but they are also not fundamentally novel," he said. This week, he posted an essay on his personal website that included what he describes as a mathematical analysis that seeks to show that this claim is true. Mark Lemley, a Stanford University law professor, argued the opposite. Most of what chatbots disseminate, he said, is sufficiently different from their training data. "There are occasionally circumstances where an output looks like an input," he said. "A vast majority of things generated by a ChatGPT or an image generation system do not draw heavily from a particular piece of content." The technology violates the law, Balaji argued, because in many cases, it directly competes with the copyrighted works it learned from. Generative models are designed to imitate online data, he said, so they can substitute for "basically anything" on the internet, from news stories to online forums.
The larger problem, he said, is that as AI technologies replace existing internet services, they are generating false and sometimes completely made-up information -- what researchers call "hallucinations." The internet, he said, is changing for the worse. Bradley J. Hulbert, a lawyer who specializes in intellectual property law, said that the copyright laws now in place were written well before the rise of AI and that no court has yet decided whether AI technologies like ChatGPT violate the law. He also argued that Congress should create a new law that addresses this technology. "Given that AI is evolving so quickly," he said, "it is time for Congress to step in." Balaji agreed. "The only way out of all this is regulation," he said.
Suchir Balaji, a former OpenAI employee, speaks out against the company's data scraping practices, claiming they violate copyright law and pose a threat to the internet ecosystem.
Suchir Balaji, a 25-year-old artificial intelligence researcher who worked at OpenAI for nearly four years, has come forward with serious allegations against the company's data practices. Balaji, who left OpenAI in August 2024, claims that the company's use of copyrighted data to train its AI models violates copyright law and poses a significant threat to the internet ecosystem [1][2].
During his time at OpenAI, Balaji was involved in gathering and organizing vast amounts of internet data used to build products like ChatGPT. Initially, he viewed his work as part of a research project, assuming that using any internet data, copyrighted or not, was acceptable in that context [3].
However, Balaji's perspective changed dramatically after the release of ChatGPT in late 2022. He realized that what was once a closed-door research project had transformed into a commercialized product, raising serious ethical and legal concerns [2].
Balaji argues that OpenAI's data scraping practices do not meet the criteria for fair use, a legal doctrine that allows limited use of copyrighted material without permission [1]. He contends that while the outputs of AI models like ChatGPT aren't exact copies of the inputs, they are also not fundamentally novel, potentially infringing on copyrights [4].
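To make the "not exact copies, yet not fundamentally novel" claim concrete: one common way researchers probe how much training text resurfaces in a model's output is to measure n-gram overlap between generated text and a candidate source. The sketch below is purely illustrative -- a toy metric with made-up example strings, not the mathematical analysis Balaji published on his website.

```python
# Illustrative only: a toy n-gram overlap metric, NOT Balaji's published analysis.
# It asks: what fraction of the generated text's n-word sequences also appear
# verbatim in a given source text?

def ngrams(text: str, n: int) -> set:
    """Return the set of word n-grams in `text` (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, source: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the source.

    1.0 suggests verbatim copying; 0.0 means no shared n-word runs.
    Real analyses are far more sophisticated than this sketch.
    """
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngrams(source, n)) / len(out_grams)

if __name__ == "__main__":
    # Hypothetical texts, chosen only to show the metric's behavior.
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    output = "a quick brown fox jumps over the lazy dog by the water"
    print(f"5-gram overlap: {overlap_ratio(output, source):.2f}")  # prints 0.50
```

In this toy example, half of the output's five-word sequences also appear in the source: exactly the gray zone Balaji describes, where the output is neither a verbatim copy nor clearly novel text.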
OpenAI, however, maintains that its use of publicly available data is protected by fair use principles and is critical for U.S. competitiveness [5]. The company is currently facing several lawsuits related to copyright infringement, including a high-profile case brought by The New York Times [2].
Balaji expresses deep concern about the sustainability of OpenAI's business model for the internet ecosystem. He argues that AI technologies like ChatGPT are "destroying the commercial viability of the individuals, businesses and internet services that created the digital data used to train these AI systems" [5].
This sentiment is echoed by others in the tech industry, with a growing chorus of voices questioning the legitimacy and ethics of AI companies' data-hoovering practices [4].
The controversy surrounding AI training practices has led to increased calls for government intervention. Bradley Hulbert, an intellectual property lawyer, suggests that "it is time for Congress to step in" given the rapid evolution of AI technology [2].
As the debate intensifies, the AI industry faces mounting pressure to address these concerns. OpenAI's transition from a non-profit research organization to a commercial entity has only heightened scrutiny of its practices [2][5].
Balaji's whistleblowing adds to the ongoing discussion about the future of AI development and its impact on society. While he initially joined the AI industry believing in its potential to solve major global challenges, he now sees the technology causing more harm than good [1][4].
As lawsuits pile up and former insiders speak out, the AI industry may be forced to reckon with its data practices and their long-term consequences for creativity, innovation, and the digital landscape as a whole.