Curated by THEOUTPOST
On Sat, 1 Mar, 12:04 AM UTC
2 Sources
[1]
Next top model: Competition-based AI study aims to lower data center costs
Who, or rather what, will be the next top model? Data scientists and developers at the U.S. Department of Energy's Thomas Jefferson National Accelerator Facility are trying to find out, exploring some of the latest artificial intelligence (AI) techniques to help make high-performance computers more reliable and less costly to run.

The models in this case are artificial neural networks trained to monitor and predict the behavior of a scientific computing cluster, where torrents of numbers are constantly crunched. The goal is to help system administrators quickly identify and respond to troublesome computing jobs, reducing downtime for scientists processing data from their experiments.

In almost fashion-show style, these machine learning (ML) models are judged to see which is best suited for the ever-changing dataset demands of experimental programs. But unlike the hit reality TV series "America's Next Top Model" and its international spinoffs, it doesn't take an entire season to pick a winner. In this contest, a new "champion model" is crowned every 24 hours based on its ability to learn from fresh data.

"We're trying to understand characteristics of our computing clusters that we haven't seen before," said Bryan Hess, Jefferson Lab's scientific computing operations manager and a lead investigator -- or judge, so to speak -- in the study. "It's looking at the data center in a more holistic way, and going forward, that's going to be some kind of AI or ML model."

While these models don't win any glitzy photoshoots, the project recently took the spotlight in IEEE Software as part of a special edition dedicated to machine learning in data center operations (MLOps). The results of the study could have big implications for Big Science.

The need

Large-scale scientific instruments, such as particle accelerators, light sources and radio telescopes, are critical DOE facilities that enable scientific discovery.
At Jefferson Lab, it's the Continuous Electron Beam Accelerator Facility (CEBAF), a DOE Office of Science User Facility relied on by a global community of more than 1,650 nuclear physicists. Experimental detectors at Jefferson Lab collect faint signatures of tiny particles originating from the CEBAF electron beams. Because CEBAF produces beam 24/7, those signals translate into mountains of data. The information collected is on the order of tens of petabytes per year. That's enough to fill an average laptop's hard drive about once a minute.

Particle interactions are processed and analyzed in Jefferson Lab's data center using high-throughput computing clusters with software tailored to each experiment. Among the blinking lights and bundled cables, complex jobs requiring several processors (cores) are the norm. The fluid nature of these workloads means many moving parts -- and more things that could go wrong. Certain compute jobs or hardware problems can result in unexpected cluster behavior, referred to as "anomalies." They can include memory fragmenting or input/output overcommitments, resulting in delays for scientists.

"When compute clusters get bigger, it becomes tough for system administrators to keep track of all the components that might go bad," said Ahmed Hossam Mohammed, a postdoctoral researcher at Jefferson Lab and an investigator on the study. "We wanted to automate this process with a model that flashes a red light whenever something weird happens. That way, system administrators can take action before conditions deteriorate even further."

A DIDACT-ic approach

To address these challenges, the group developed an ML-based management system called DIDACT (Digital Data Center Twin). The acronym is a play on the word "didactic," which describes something that's designed to teach. In this case, it's teaching artificial neural networks.
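The "red light" rule Mohammed describes can be sketched as a simple statistical check on cluster metrics. The sketch below is illustrative only -- the metric names and the three-standard-deviation cutoff are assumptions for the example, not details from the study:

```python
import numpy as np

def flag_anomalies(history, latest, k=3.0):
    """Flag metrics whose latest reading deviates more than k standard
    deviations from their recent history (a simple 'red light' rule)."""
    mean = history.mean(axis=0)
    std = history.std(axis=0) + 1e-9        # avoid division by zero
    z = np.abs(latest - mean) / std
    return z > k

# Hypothetical per-node metrics: [memory_fragmentation, io_wait, load]
history = np.array([[0.20, 0.10, 4.0],
                    [0.25, 0.12, 4.2],
                    [0.22, 0.11, 3.9],
                    [0.21, 0.09, 4.1]])
latest = np.array([0.85, 0.10, 4.0])        # memory fragmentation spikes
print(flag_anomalies(history, latest))      # [ True False False]
```

A fixed threshold like this is the simplest possible detector; DIDACT's autoencoder-based models instead learn what "normal" looks like from the data itself.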
DIDACT is a project funded by Jefferson Lab's Laboratory Directed Research & Development (LDRD) program, which provides the resources for laboratory staff to pursue projects that could make rapid and significant contributions to critical national science and technology problems of mission relevance and/or advance the laboratory's core scientific and technical capabilities.

The DIDACT system is designed to detect anomalies and diagnose their source using an AI approach called continual learning. In continual learning, ML models are trained on data that arrive incrementally, similar to the lifelong learning experienced by people and animals. The DIDACT team trains multiple models in this fashion, with each representing the system dynamics of active computing jobs, then selects the top performer based on that day's data. The models are variations of unsupervised neural networks called autoencoders. One is equipped with a graph neural network (GNN), which looks at relationships between components.

"They compete using known data to determine which had lower error," said Diana McSpadden, a Jefferson Lab data scientist and lead on the MLOps study. "Whichever won that day would be the 'daily champion.'"

The method could one day help reduce downtime in data centers and optimize critical resources -- meaning lower costs and improved science.

To train the models without affecting day-to-day compute needs, the DIDACT team developed a testbed cluster called the "sandbox." Think of the sandbox as a runway where the models are scored, in this case based on their ability to train.

The DIDACT software is an ensemble of open-source and custom-built code used to develop and manage the ML models, monitor the sandbox cluster, and write out the data. All those numbers are visualized on a graphical dashboard. The system includes three pipelines for the ML "talent." One is for offline development, like a dress rehearsal. Another is for continual learning -- where the live competition takes place.
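The daily competition McSpadden describes -- lowest error wins -- can be sketched in a few lines. The lambda "models" below are toy stand-ins for trained autoencoders, and all names are hypothetical; a real autoencoder would reconstruct its input through a learned bottleneck:

```python
import numpy as np

def reconstruction_error(model, X):
    """Mean squared error between the data and the model's reconstruction."""
    return float(np.mean((X - model(X)) ** 2))

def pick_daily_champion(models, X_today):
    """Return the name of the candidate with the lowest error on today's data."""
    errors = {name: reconstruction_error(m, X_today) for name, m in models.items()}
    return min(errors, key=errors.get)

# Toy stand-ins for trained autoencoders: each maps metrics to a reconstruction.
rng = np.random.default_rng(0)
X_today = rng.normal(size=(100, 8))          # e.g. 8 cluster metrics per sample
models = {
    "autoencoder": lambda X: X * 0.90,       # cruder reconstruction
    "gnn_autoencoder": lambda X: X * 0.99,   # closer reconstruction, lower error
}
print(pick_daily_champion(models, X_today))  # gnn_autoencoder
```

Rerunning this selection on each day's fresh data is what lets a new champion unseat yesterday's winner as workloads change.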
Each time a new top model emerges, it becomes the primary monitor of cluster behavior in the real-time pipeline -- until it's unseated by the next day's winner.

"DIDACT represents a creative stitching together of hardware and open-source software," said Hess, who is also the infrastructure architect for the High Performance Data Facility Hub being built at Jefferson Lab in partnership with DOE's Lawrence Berkeley National Laboratory. "It's a combination of things that you normally wouldn't put together, and we've shown that it can work. It really draws on the strength of Jefferson Lab's data science and computing operations expertise."

In future studies, the DIDACT team would like to explore an ML framework that optimizes a data center's energy usage, whether by reducing the water flow used in cooling or by throttling down cores based on data-processing demands.

"The goal is always to provide more bang for the buck," Hess said, "more science for the dollar."
[2]
Next Top Model: Competition-Based AI Study Aims to Lower Data Center Costs
Researchers at Jefferson Lab are using AI models in a daily competition to improve data center efficiency and reduce costs for large-scale scientific experiments.
Researchers at the U.S. Department of Energy's Thomas Jefferson National Accelerator Facility are pioneering an innovative approach to data center management using artificial intelligence (AI). The project, dubbed DIDACT (Digital Data Center Twin), aims to enhance the reliability and cost-effectiveness of high-performance computing systems crucial for scientific research [1][2].
At Jefferson Lab, the Continuous Electron Beam Accelerator Facility (CEBAF) generates massive amounts of data - tens of petabytes annually - from particle physics experiments. This data deluge, equivalent to filling a laptop's hard drive every minute, requires robust computing infrastructure for processing and analysis [1][2].
DIDACT employs machine learning models, specifically artificial neural networks, to monitor and predict the behavior of scientific computing clusters. These models compete in a daily contest to detect anomalies and optimize system performance [1][2].
Bryan Hess, Jefferson Lab's scientific computing operations manager, explains, "We're trying to understand characteristics of our computing clusters that we haven't seen before. It's looking at the data center in a more holistic way" [1][2].
Unlike traditional AI training methods, DIDACT uses a continual learning approach. Multiple models, including variations of unsupervised neural networks called autoencoders, are trained on incrementally arriving data. Each day, a new "champion model" is crowned based on its ability to learn from fresh data and detect anomalies with the lowest error rate [1][2].
Diana McSpadden, a Jefferson Lab data scientist, describes the process: "They compete using known data to determine which had lower error. Whichever won that day would be the 'daily champion'" [1][2].
To avoid disrupting ongoing scientific computations, the team developed a testbed cluster called the "sandbox." This environment serves as a runway where AI models can be trained and evaluated without impacting day-to-day operations [1][2].
The DIDACT system has the potential to significantly reduce downtime in data centers and optimize critical resources. By automating the detection of anomalies and potential issues, it allows system administrators to take proactive measures, ultimately lowering costs and improving scientific productivity [1][2].
Ahmed Hossam Mohammed, a postdoctoral researcher at Jefferson Lab, highlights the importance of this automation: "When compute clusters get bigger, it becomes tough for system administrators to keep track of all the components that might go bad. We wanted to automate this process with a model that flashes a red light whenever something weird happens" [1][2].
The project has gained recognition in the scientific community, recently featured in IEEE Software as part of a special edition on machine learning in data center operations (MLOps) [1][2]. As large-scale scientific instruments continue to generate ever-increasing volumes of data, AI-driven management systems like DIDACT may become essential tools for maintaining efficient and cost-effective research infrastructure.
© 2025 TheOutpost.AI All rights reserved