2 Sources
[1]
Innovative detection method makes AI smarter by cleaning up bad data before it learns
In the world of machine learning and artificial intelligence, clean data is everything. Even a small number of mislabeled examples, known as label noise, can derail the performance of a model, especially models like support vector machines (SVMs) that rely on a few key data points to make decisions.

SVMs are a widely used type of machine learning algorithm, applied in everything from image and speech recognition to medical diagnostics and text classification. These models operate by finding a boundary that best separates different categories of data, and they rely on a small but crucial subset of the training data, known as support vectors, to determine this boundary. If these few examples are incorrectly labeled, the resulting decision boundary can be flawed, leading to poor performance on real-world data.

Now, a team of researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) within the College of Engineering and Computer Science at Florida Atlantic University, together with collaborators, has developed an innovative method to automatically detect and remove faulty labels before a model is ever trained -- making AI smarter, faster and more reliable. Before the AI even starts learning, the researchers clean the data using a mathematical technique that looks for odd or unusual examples that don't quite fit. These "outliers" are removed or flagged, ensuring the AI gets high-quality information right from the start. The paper is published in IEEE Transactions on Neural Networks and Learning Systems.

"SVMs are among the most powerful and widely used classifiers in machine learning, with applications ranging from cancer detection to spam filtering," said Dimitris Pados, Ph.D., Schmidt Eminent Scholar Professor of Engineering and Computer Science in the FAU Department of Electrical Engineering and Computer Science, director of CA-AI and an FAU Sensing Institute (I-SENSE) faculty fellow.
"What makes them especially effective -- but also uniquely vulnerable -- is that they rely on just a small number of key data points, called support vectors, to draw the line between different classes. If even one of those points is mislabeled -- for example, if a malignant tumor is incorrectly marked as benign -- it can distort the model's entire understanding of the problem. The consequences of that could be serious, whether it's a missed cancer diagnosis or a security system that fails to flag a threat. Our work is about protecting models -- any machine learning and AI model including SVMs -- from these hidden dangers by identifying and removing those mislabeled cases before they can do harm." The data-driven method that "cleans" the training dataset uses a mathematical approach called L1-norm principal component analysis. Unlike conventional methods, which often require manual parameter tuning or assumptions about the type of noise present, this technique identifies and removes suspicious data points within each class purely based on how well they fit with the rest of the group. "Data points that appear to deviate significantly from the rest -- often due to label errors -- are flagged and removed," said Pados. "Unlike many existing techniques, this process requires no manual tuning or user intervention and can be applied to any AI model, making it both scalable and practical." The process is robust, efficient and entirely touch-free -- even handling the notoriously tricky task of rank selection (which determines how many dimensions to keep during analysis) without user input. Researchers extensively tested their technique on real and synthetic datasets with various levels of label contamination. Across the board, it produced consistent and notable improvements in classification accuracy, demonstrating its potential as a standard pre-processing step in the development of high-performance machine learning systems. 
"What makes our approach particularly compelling is its flexibility," said Pados. "It can be used as a plug-and-play preprocessing step for any AI system, regardless of the task or dataset. And it's not just theoretical -- extensive testing on both noisy and clean datasets, including well-known benchmarks like the Wisconsin Breast Cancer dataset, showed consistent improvements in classification accuracy. "Even in cases where the original training data appeared flawless, our new method still enhanced performance, suggesting that subtle, hidden label noise may be more common than previously thought." Looking ahead, the research opens the door to even broader applications. The team is interested in exploring how this mathematical framework might be extended to tackle deeper issues in data science such as reducing data bias and improving the completeness of datasets. "As machine learning becomes deeply integrated into high-stakes domains like health care, finance and the justice system, the integrity of the data driving these models has never been more important," said Stella Batalama, Ph.D., dean of the FAU College of Engineering and Computer Science. "We're asking algorithms to make decisions that impact real lives -- diagnosing diseases, evaluating loan applications, even informing legal judgments. If the training data is flawed, the consequences can be devastating. That's why innovations like this are so critical. "By improving data quality at the source -- before the model is even trained -- we're not just making AI more accurate; we're making it more responsible. This work represents a meaningful step toward building AI systems we can trust to perform fairly, reliably and ethically in the real world."
[2]
FAU CA-AI Engineers Make AI Smarter by Cleaning Up Bad Data Before It Learns | Newswise
This work will appear in the Institute of Electrical and Electronics Engineers' (IEEE) Transactions on Neural Networks and Learning Systems. Co-authors, who are all IEEE members, are Shruti Shukla, Ph.D. student in CA-AI and the FAU Department of Electrical Engineering and Computer Science; George Sklivanitis, Ph.D., Charles E. Schmidt Research Associate Professor in CA-AI and the Department of Electrical Engineering and Computer Science, and I-SENSE faculty fellow; Elizabeth Serena Bentley, Ph.D.; and Michael J. Medley, Ph.D., United States Air Force Research Laboratory.

About the Center for Connected Autonomy and Artificial Intelligence (CA-AI): The Center for Connected Autonomy and Artificial Intelligence (CA-AI) at Florida Atlantic University is an interdisciplinary research center focused on advancing the theory and practice of artificial intelligence and autonomous systems. Located in the Engineering East building on FAU's Boca Raton campus, the center brings together experts in AI, machine learning, sensing, and real-time communications to develop solutions for land, sea, air, and space applications. With a mission to accelerate innovation in connected autonomy, CA-AI plays a key role in developing smart, resilient systems -- from autonomous navigation and adaptive networks to decision-making in complex environments. With support from the National Science Foundation, U.S. Department of Defense, Schmidt Family Foundation, and other partners, CA-AI is committed to education and workforce development, including the creation of Florida's first M.S. program in artificial intelligence. Through impactful research and educational initiatives, CA-AI is shaping the future of networked AI robotics for a smarter, more resilient world. Learn more at ca-ai.fau.edu.
About FAU's College of Engineering and Computer Science: The FAU College of Engineering and Computer Science is internationally recognized for cutting-edge research and education in the areas of computer science and artificial intelligence (AI), computer engineering, electrical engineering, biomedical engineering, civil, environmental and geomatics engineering, mechanical engineering, and ocean engineering. Research conducted by the faculty and their teams exposes students to technology innovations that push the current state of the art in these disciplines. The College's research efforts are supported by the National Science Foundation (NSF), the National Institutes of Health (NIH), the Department of Defense (DOD), the Department of Transportation (DOT), the Department of Education (DOEd), the State of Florida, and industry. The FAU College of Engineering and Computer Science offers degrees with a modern twist that bear specializations in areas of national priority such as AI, cybersecurity, internet-of-things, transportation and supply chain management, and data science. New degree programs include a Master of Science in AI (the first in Florida), a Master of Science and a bachelor's degree in Data Science and Analytics, and the new Professional Master of Science and Ph.D. in computer science for working professionals. For more information about the College, please visit eng.fau.edu.
Researchers at Florida Atlantic University have created a new technique to automatically detect and remove faulty labels in AI training data, improving the performance and reliability of machine learning models, particularly Support Vector Machines (SVMs).
Researchers from the Center for Connected Autonomy and Artificial Intelligence (CA-AI) at Florida Atlantic University have developed a new method to enhance the accuracy and reliability of artificial intelligence systems. The technique cleans training data before it is fed into machine learning models, particularly benefiting Support Vector Machines (SVMs) [1].
In the realm of machine learning, the quality of training data is paramount. Even a small number of mislabeled examples, known as label noise, can significantly impair a model's performance. This issue is especially critical for SVMs, which rely on a few key data points, called support vectors, to make decisions [2].
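To make the support-vector idea concrete, here is a minimal sketch (not the researchers' code) showing how few training points a linear SVM actually leans on. The dataset and parameters are invented for illustration.

```python
# Minimal illustration: a linear SVM's boundary is determined by only a
# handful of training points (the support vectors), which is why even a
# single mislabeled point among them can shift the whole boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated 2-D classes, 50 points each
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)),
               rng.normal(2.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Typically only a small fraction of the 100 points are support vectors
print(f"{len(clf.support_vectors_)} support vectors out of {len(X)} points")
```

Because the boundary depends only on those few points, flipping the label of one support vector changes the fitted model far more than flipping the label of a point deep inside its class.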
The research team, led by Dr. Dimitris Pados, has developed a data-driven method that "cleans" the training dataset using a mathematical approach called L1-norm principal component analysis. This technique identifies and removes suspicious data points within each class based on how well they fit with the rest of the group [1].
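The articles do not include code, so the snippet below is only a rough sketch of the per-class cleaning idea: fit a low-rank subspace to one class's features and flag the points that fit it worst. Note the substitutions, loudly: it uses ordinary (L2-norm) PCA with a hand-picked rank in place of the paper's L1-norm PCA with automatic rank selection, and the data, the `flag_suspects` helper, and all parameters are invented for illustration.

```python
# Hedged sketch of the cleaning step (not the authors' algorithm): within
# one class, fit a low-rank subspace and flag the points with the largest
# reconstruction error. The paper uses L1-norm PCA with automatic rank
# selection; this stand-in uses ordinary L2-norm PCA and a fixed rank.
import numpy as np

def flag_suspects(X, rank, n_flag):
    """Indices of the n_flag rows of X that deviate most from the
    class's rank-`rank` principal subspace (by reconstruction error)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:rank].T                        # top principal directions
    resid = np.linalg.norm(Xc - Xc @ V @ V.T, axis=1)
    return np.argsort(resid)[-n_flag:]

rng = np.random.default_rng(1)
# One class: 200 points lying near a 2-D subspace of a 5-D feature space
X = np.zeros((203, 5))
X[:200, :2] = rng.normal(scale=3.0, size=(200, 2))
X[:200, 2:] = rng.normal(scale=0.1, size=(200, 3))
# Three atypical points (e.g. mislabeled samples) far off that subspace
X[200:] = rng.normal(size=(3, 5))
X[200:, 2:] += 6.0

print(sorted(flag_suspects(X, rank=2, n_flag=3)))  # → [200, 201, 202]
```

In the described method this screening would run once per class before training, with the rank chosen automatically rather than fixed as here.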
The researchers rigorously tested their technique on both real and synthetic datasets with various levels of label contamination. The results consistently showed notable improvements in classification accuracy across the board [2].
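A small experiment in the spirit of that evaluation protocol can show why label contamination matters: inject label noise at several rates and measure a linear SVM's test accuracy. This is not the paper's benchmark; the Gaussian data, split sizes, and noise rates are all invented for illustration.

```python
# Illustrative label-noise experiment (invented data, not the paper's
# benchmark): flip a fraction of training labels and measure how a linear
# SVM's accuracy on a clean test set responds.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping 2-D Gaussian classes, split into train and test sets
n = 600
X = np.vstack([rng.normal(-1.5, 1.0, (n // 2, 2)),
               rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.array([0] * (n // 2) + [1] * (n // 2))
idx = rng.permutation(n)
Xtr, ytr = X[idx[:400]], y[idx[:400]]
Xte, yte = X[idx[400:]], y[idx[400:]]

accs = {}
for rate in (0.0, 0.1, 0.3):
    y_noisy = ytr.copy()
    flip = rng.choice(len(ytr), size=int(rate * len(ytr)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]      # mislabel a fraction of points
    accs[rate] = SVC(kernel="linear").fit(Xtr, y_noisy).score(Xte, yte)
    print(f"label-noise rate {rate:.0%}: test accuracy {accs[rate]:.3f}")
```

A cleaning step such as the one described above would be applied to `(Xtr, y_noisy)` before the fit, with the goal of recovering accuracy close to the noise-free case.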
This innovative method has potential applications in numerous fields where AI is increasingly used for critical decision-making, such as health care, finance, and the justice system.
The research team is exploring how this mathematical framework might be extended to address broader issues in data science, such as reducing data bias and improving dataset completeness [1].
As machine learning becomes more integrated into high-stakes domains, the integrity of the data driving these models is increasingly crucial. By improving data quality at the source, this innovation represents a significant step toward building AI systems that can be trusted to perform fairly, reliably, and ethically in real-world scenarios [2].