Curated by THEOUTPOST
On Tue, 6 Aug, 12:01 AM UTC
2 Sources
[1]
Why AI has so much trouble with open source and vice versa
Also: AI scientist: 'We need to think outside the large language model box'

Oh, I know that when Meta CEO Mark Zuckerberg unveiled Llama 3.1 in a Threads post, he said, "Open-source AI is the path forward," and that Meta is "taking the next steps towards open-source AI becoming the industry standard."

At a SIGGRAPH keynote discussion with Nvidia CEO Jensen Huang, Zuckerberg admitted: "We're not pursuing [open source] out of altruism, though I believe it will benefit the ecosystem. We're doing it because we think it will enhance our offerings by creating a strong ecosystem. ... This might sound selfish, but after building this company for a while, one of my goals for the next 10 or 15 years is to ensure we can build the fundamental technology for our social experiences."

Zuckerberg is sincere about open source. As we've seen repeatedly, open source is the way to unite technologies. For example, we use a unified Linux now instead of multiple, incompatible versions of Unix because Linus Torvalds open-sourced Linux under GPLv2.

Zuck's not alone, though, in playing fast and loose with open source. From the name, you'd think OpenAI is open source. It was indeed open back when GPT-1 and GPT-2 were state-of-the-art. That was a long time -- and billions in revenue -- ago. Starting with GPT-3, OpenAI closed its doors.

As Mark Dingemanse, a language scientist at Radboud University in Nijmegen, Netherlands, said in a Nature article, some big firms are reaping the benefits of claiming to have open-source models while trying "to get away with disclosing as little as possible." Indeed, Dingemanse and his colleague Andreas Liesenfeld found only one AI chatbot that could truly be described as open: the Hugging Face-hosted large language model (LLM) BigScience/BloomZ. Other LLMs that come close are Falcon, FastChat-T5, and OpenLLaMA. But most LLMs contain proprietary, copyrighted, or simply unknown information their owners won't tell you about.
As the Electronic Frontier Foundation (EFF) observed, "Garbage In, Gospel Out."

Now, much of the innovative software driving AI is open source. Two frameworks quickly come to mind: TensorFlow, a versatile machine-learning framework that supports multiple programming languages, and PyTorch, popular for its dynamic computational graphs and ease of use in deep learning applications.

Also: How open source attracts some of the world's top innovators

The LLMs and programs built on them are another story. All the most popular AI chatbots and programs are proprietary. So, why are companies claiming their projects are open source? By "open-washing" their efforts, businesses hope to gild their programs with open source's positive connotations of transparency, collaboration, and innovation. They also hope to con developers into helping advance their own projects. It's all about marketing.

Clearly, we need to devise an open-source definition that fits AI programs to stop these faux-source efforts in their tracks. Unfortunately, that's easier said than done. While people constantly fuss over the finer details of what is and isn't open-source code, the Open Source Initiative (OSI) nailed down the definition, the Open Source Definition (OSD), almost twenty years ago. The convergence of open source and AI is much more complicated.

In fact, Joseph Jacks, founder of the venture capital (VC) firm OSS Capital, argued there is "no such thing as open-source AI" since "open source was invented explicitly for software source code." It's true. In addition, open source's legal foundation is copyright law. As Jacks observed, "Neural Net Weights (NNWs) [which are essential in AI] are not software source code -- they are unreadable by humans, nor are they debuggable."

As Stefano Maffulli, OSI executive director, has told me, software and data are mixed in AI, and existing open-source licenses are breaking down.
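Jacks's point about weights is easy to see with a toy sketch. The snippet below is an illustration only -- it mimics no real model format -- but it shows why a blob of learned parameters is nothing like source code:

```python
import struct

# Toy illustration, not any real model format: pretend these four
# floating-point values are a model's learned weights.
weights = [0.123, -1.5, 0.0042, 2.71]

# Serialize them the way weights are typically distributed: as a flat
# binary blob of 32-bit floats.
blob = struct.pack(f"{len(weights)}f", *weights)

# The blob carries no names, no logic, no control flow -- nothing a
# human can read, step through, or debug, unlike source code.
print(len(blob))  # 16 bytes: four 32-bit floats

# The numbers can be recovered, but they remain opaque parameters.
restored = struct.unpack(f"{len(weights)}f", blob)
```

Unlike a function you can read and patch, the only meaningful way to "modify" such a blob is to retrain or fine-tune the model that produced it -- which is exactly why Jacks argues weights are not source code.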
Specifically, trouble emerges when all that data and code are merged in AI/ML artifacts -- such as datasets, models, and weights. "Therefore, we need to make a new definition for open-source AI," said Maffulli.

Also: Switzerland's federal government requires releasing its software as open source

However, getting there hasn't been easy. The main point of contention is the extent of openness required, particularly regarding training data. Some argue that releasing pre-trained models without the training data is sufficient, while others argue that true open-source AI must also include access to the training data.

As julia ferraioli (she spells her name in all lowercase), Amazon Web Services (AWS) open source AI/ML strategist, observed in a blog post, under the current 0.0.8 draft of the OSI's open-source AI definition, "the only aspects of the data that a system desiring to be labeled as 'open source AI' would need to publish are: training methodologies and techniques; training data scope and characteristics; training data provenance (including how data was obtained and selected), training data labeling procedures, and training data cleaning methodology." None of that, ferraioli continued, "gives the prospective adopter of the AI system insight into the data that was used to train the system."

Without this data, can an AI be open? Ferraioli argues it can't. She's not the only one who holds that position. She quotes her colleague, AWS principal open source technical strategist Tom Callaway, who wrote, "Without requiring the data be open, it is not possible for anyone without the data to fully study or modify the LLM, or distribute all of its source code. You can only use it, tune/tweak it a bit, but you can't dive deep into it to understand why it does what it does."

Also: More than money, open-source pros want these 2 things from their next jobs

He has a good point. At its heart, open source is all about understanding the code. In AI's case, that means the data as well.
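To make ferraioli's objection concrete, here is a hedged sketch of what the draft's data requirements amount to: a release can satisfy every required disclosure about the training data without shipping a single byte of it. The field names below are invented for illustration and are not OSI's actual schema.

```python
# Hypothetical field names summarizing the data-related disclosures the
# draft definition asks for; these labels are illustrative, not OSI's.
REQUIRED_DATA_DISCLOSURES = {
    "training_methodology",
    "data_scope_and_characteristics",
    "data_provenance",
    "data_labeling_procedures",
    "data_cleaning_methodology",
}

def missing_disclosures(model_card: dict) -> set:
    """Return the required disclosure fields a model card omits."""
    return REQUIRED_DATA_DISCLOSURES - model_card.keys()

# A release that documents its data in prose -- yet includes no data at
# all -- passes the check, which is exactly ferraioli's objection.
card = {field: "described in prose, data not included"
        for field in REQUIRED_DATA_DISCLOSURES}
print(missing_disclosures(card))  # set(): nothing missing, no data shipped
```

The check only verifies that the data is described, never that it is available -- mirroring the gap ferraioli and Callaway point to between documenting training data and actually opening it.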
As Maffulli said at the recent United Nations OSPOs for Good conference, "While there's broad agreement on the overarching principles, it's becoming obvious that the devil is in the details." You can say that again.

At the same conference, Sasha Luccioni, Hugging Face's AI and climate lead, argued, "You can't really expect all companies to be 100% open source as the open source license defines it. You can't expect companies just to give up everything that they're making money off of and do so in a way they're comfortable with." Still, Luccioni believes "a responsible AI license can exist" -- one that is open-source friendly -- where you can define your terms of open source. By tweaking the language a little, she suggests, you can move forward in a way that companies, governments, and academia are all comfortable with, instead of simply declaring this or that project or license not open source.

Also: Why don't more people use desktop Linux? I have a theory you might not like

Open-source advocates disagreed. I suspect the arguments will continue for years to come. The OSI, with the help of 70 collaborators -- researchers, lawyers, policymakers, activists, and representatives from big tech companies such as Meta, Google, and Amazon, plus groups such as the Linux Foundation and the Alfred P. Sloan Foundation -- is wrestling with a workable definition. The goal is to present a stable version of the Open Source AI Definition at the next All Things Open conference in Raleigh, North Carolina, October 27 to 29. I'll be there.

So strap in, folks. The combination of open-source principles and AI development is driving significant advancements: faster innovation, broader collaboration, and democratized access to powerful AI tools. But its evolution promises to be a long, difficult process.
[2]
Can AI even be open source? It's complicated
Exploring the challenges and complexities at the intersection of AI and open-source software. The article delves into the reasons behind AI's struggle with open-source principles and the complications of making AI truly open source.
The relationship between artificial intelligence (AI) and open source software is fraught with challenges and complexities. As AI continues to advance rapidly, questions arise about its compatibility with open source principles and the feasibility of truly open source AI systems 1.
One of the primary obstacles in open sourcing AI is the nature of AI systems themselves. Unlike traditional software, AI models often rely on vast amounts of proprietary data and complex algorithms that are difficult to replicate or share openly. This fundamental difference makes it challenging to apply traditional open source concepts to AI development 2.
A significant hurdle in open sourcing AI is the data used to train these systems. Much of this data is proprietary or subject to privacy concerns, making it impossible to share freely. Without access to the training data, replicating or improving upon an AI model becomes extremely difficult, if not impossible 1.
AI systems, particularly large language models like GPT-3, are incredibly complex and require substantial computational resources to develop and run. This complexity makes it challenging for individual developers or smaller organizations to contribute meaningfully to open source AI projects, potentially limiting the diversity and innovation that open source typically fosters 2.
The development of AI raises numerous ethical and legal questions, particularly around issues of bias, fairness, and accountability. These concerns become even more pronounced in an open source context, where control over the development and use of AI systems may be more distributed 1.
Large technology companies play a dominant role in AI development, often keeping their most advanced AI technologies proprietary. This concentration of resources and talent in a few companies can hinder the growth of a robust open source AI ecosystem 2.
Despite these challenges, there are ongoing efforts to make AI more open. Some companies and organizations are releasing open source AI tools and frameworks, while others are working on developing more transparent and explainable AI systems. These initiatives aim to bridge the gap between AI and open source principles 1 2.
© 2025 TheOutpost.AI All rights reserved