Curated by THEOUTPOST
On Thu, 6 Mar, 12:08 AM UTC
4 Sources
[1]
AI infrastructure: The future of data centers and enterprise computing - SiliconANGLE
Three insights you may have missed from theCUBE's coverage of the 'Mastering AI: The New Infrastructure Rules' event

The rise of artificial intelligence is transforming the global data center industry, with enterprises facing critical decisions about how to build, deploy and optimize their AI infrastructure. With AI becoming the cornerstone of competitive advantage, enterprises must adopt new strategies to ensure they are not left behind.

Over the next decade, AI-driven computing will account for nearly 90% of all data center spending, fundamentally reshaping IT strategies, according to Dave Vellante, chief analyst of theCUBE Research. However, many organizations are struggling with challenges such as skills gaps, underutilized GPU clusters and the complexity of integrating AI into existing systems.

"We are witnessing the rise of a completely new computing era," Vellante said. "Within the next decade, a trillion-dollar-plus data center business is poised for transformation, powered by what we refer to as extreme parallel computing, or as some prefer to call it, accelerated computing. While artificial intelligence is the primary accelerant, the effects ripple across the entire technology stack."

During the "Mastering AI: The New Infrastructure Rules" event, Vellante spoke with Pete Manca (pictured), president of AI infrastructure provider Penguin Solutions Inc., and Trey Layton, the company's vice president of software and product management, about how organizations can successfully navigate AI adoption. (* Disclosure below.)

One of the biggest hurdles enterprises face when implementing AI is that traditional IT infrastructures are not designed for the demands of AI workloads. Many companies are experimenting with AI in the cloud but are hesitant to move their proprietary data there due to security and cost concerns. As a result, organizations are looking for ways to build AI infrastructure in-house, according to Manca.

"Traditional infrastructures are very different than AI infrastructures, and so, they have to rethink how they do IT," he said. "Building in-house is probably the preferred way to go, but it means literally a soup-to-nuts transformation from data center build-out, power, cooling, all the way through the architecture of their system."

Enterprises must consider key architectural decisions, such as whether to use liquid cooling, direct-to-chip cooling or traditional air cooling. They also need to decide on the best combination of chip vendors, storage solutions and networking technologies. With numerous options to consider, organizations often seek expert guidance from companies such as Penguin Solutions to design AI environments that can scale effectively, according to Manca.

"A lot of the technology you can pick and choose depending upon your use case," he said. "The trick is designing it right up front. What you don't want to do is put these pieces together."

Unlike conventional IT setups, which may involve managing hundreds of thousands of servers in-house, AI workloads -- particularly large language model training -- demand highly sophisticated networking solutions and specialized hardware configurations. This includes direct-to-chip connections, GPU-to-GPU communication technologies such as NVLink and advanced optical networking to bypass traditional CPU bottlenecks. By partnering with experts in AI infrastructure, enterprises can avoid costly missteps and ensure their AI deployments are built for long-term success, Manca explained.
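To make the GPU-to-GPU point concrete, here is a minimal sketch (not from the article) of how an engineer might check whether the GPUs in a server expose a direct peer-to-peer path such as NVLink or PCIe peer access. It assumes a multi-GPU host with PyTorch installed; the command-line tool `nvidia-smi topo -m` gives a similar view of the interconnect topology.

```python
# Minimal sketch: report GPU-to-GPU peer access on a multi-GPU host.
# Assumes CUDA GPUs and PyTorch are present; this is an illustrative
# check, not part of any vendor's tooling.
import torch

def report_peer_access() -> None:
    n = torch.cuda.device_count()
    print(f"Visible GPUs: {n}")
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            # True when the driver exposes a direct path (e.g., NVLink or
            # PCIe peer-to-peer) between the two devices.
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: {'direct' if ok else 'no direct path'}")

if __name__ == "__main__":
    if torch.cuda.is_available():
        report_peer_access()
    else:
        print("No CUDA devices visible on this host.")
```

Seeing "no direct path" between devices that are expected to share NVLink is the kind of configuration gap the architectural planning described above is meant to catch early.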
Here's theCUBE's complete video interview with Pete Manca:

Unlike traditional IT environments that focus on uptime and high availability, AI infrastructure must be optimized for maximum performance at all times. Enterprises often struggle with underutilized GPU clusters, leading to inefficiencies that drive up costs. To address this, organizations must implement intelligent compute environments that optimize workloads and minimize downtime, according to Layton.

"You're talking about a massively scalable parallel processing infrastructure that's designed to run at peak performance all the time -- that's different than what organizations of the past have built," he said.

One of the biggest challenges in AI infrastructure is ensuring that GPU clusters operate efficiently. GPUs fail 33 times more often than general-purpose CPUs because they run at full capacity continuously, according to Layton. Organizations need predictive failure analysis tools that can proactively identify and address potential failures before they impact operations.

Penguin Solutions has developed software solutions, such as ICE ClusterWare AIM, to tackle these challenges. The service enhances AI and HPC infrastructure by leveraging over 2 billion hours of GPU runtime expertise, using patent-pending software to prevent failures, automate maintenance and optimize performance at any cluster size, Layton added.

"We're actually monitoring for nominal variations in temperatures in the GPUs themselves," he said. "We're doing latency throughput testing on the InfiniBand fabric and any deviation outside of nominal parameters, we'll begin to institute automation that will attempt to remediate that in software. If we can't, then we remove that device from the production workload so it doesn't actually result in production outages."

By integrating AI-driven monitoring and remediation capabilities, enterprises can maintain high performance while reducing downtime, ensuring that their AI infrastructure operates as efficiently as possible, Layton added.

Here's theCUBE's complete video interview with Trey Layton:

One of the most significant obstacles to AI adoption is the lack of in-house expertise. AI infrastructure requires a unique skill set that blends traditional enterprise IT knowledge with HPC expertise. Many IT professionals are accustomed to managing virtualization and cloud environments but lack experience in designing high-performance AI clusters, Layton pointed out.

"The high-performance computing world needs to understand the problems of IT, and the IT world needs to understand the problems of high-performance computing," he said. "And in that we get a convergence of those two skills and that will be the future artificial intelligence infrastructure engineer ... one who gets both worlds."

To address this challenge, organizations must invest in training and seek out AI-focused technology partners. Companies such as Penguin Solutions provide AI-optimized architecture models and modular infrastructure solutions that allow businesses to scale their AI environments while maintaining operational flexibility, Layton pointed out.

Future-proofing AI infrastructure is another critical consideration. Given the rapid advancements in AI hardware and software, companies need modular architectures that can adapt to new technologies. Designing for long-term scalability is crucial for sustainable growth and efficiency.

"The reality is that there's a blistering pace of development with the underlying hardware that's out there," Layton said.
"You need an underlying architecture that is deployed in an environment that can accommodate those changes and also find ways to utilize some of those technologies." By adopting a modular, adaptable approach and leveraging the expertise of AI infrastructure specialists, enterprises can ensure that their AI investments remain viable and competitive in the long term, Layton concluded. Here's theCUBE's continuing conversation with Trey Layton:
[2]
Penguin aims to solve the architectural problem of AI - SiliconANGLE
How Penguin Solutions is driving HPC expertise into AI success

Companies are facing an architectural problem as they integrate power-hungry artificial intelligence models. In the chaos of AI adoption, Penguin Solutions Inc. has emerged as a strong player when it comes to managing high-performance computing for AI. Accelerated computing comprised 10% of all data center spending in 2022 and will account for almost 90% by 2030, according to theCUBE Research.

"The market's ... growing exponentially," said Pete Manca (pictured), president of Penguin. "The enterprise customers we speak to know they have to get some AI strategy in place, whether it's for simple things like increased service offerings or more complex things like fraud detection and other use cases we hear out there. But they don't know how to get there or they're not set up today in order to get there. Traditional infrastructures are very different than AI infrastructures, and so they have to rethink how they do IT."

Manca spoke with theCUBE's Dave Vellante at the "Mastering AI: The New Infrastructure Rules" event, during an exclusive broadcast on theCUBE, SiliconANGLE Media's livestreaming studio. They discussed Penguin's history with accelerated computing and structuring hybrid AI. (* Disclosure below.)

Penguin's background in managing HPC clusters and large-scale clusters has proved a boon for the current era of AI. The company guides customers through every step of the process of upgrading their computing infrastructure.

"You create a software abstraction layer that hides the complexity of the underlying hardware, and you make it simple for the end user to manage the environment while you abstract away the complexities of the underlying hardware," Manca explained. "That's something that ClusterWare does for our customers ... we try to abstract away all those complexities."

Many businesses struggle with architecting hybrid AI, so Penguin builds a solution with them from the ground up. It starts with the data center, deciding whether to use liquid cooling or direct-to-chip cooling, and goes up through the software layer, making the choice between a custom solution and an off-the-shelf solution. For companies needing to restructure their IT setup, Manca highlights two options: going to a tier-two service provider or leveraging their capabilities and building in-house.

"Building in-house is probably the preferred way to go, but it means literally a soup-to-nuts transformation, from data center build-out, power, cooling, all the way through the architecture of their system," he said. "It's a very complex environment, and they look to partners like Penguin Solutions to help guide them through that as a trusted advisor."

In tackling the architectural problem many companies face, Penguin has helped customers grow their uptime on GPU clusters from 50% to 90%. Predictive failure analysis, or understanding if a GPU might fail before it does, is one of the keys to keeping a network fast and reliable, according to Manca.

"You've got to get the data from the storage through the network into memory to feed these GPUs and keep them busy," he said. "Right there, you've got an architectural problem that you're trying to solve around how do I get very sophisticated high-speed parallel file systems to feed these GPUs as much data as possible? In some cases, in real time, it could be a batch or it could be a real-time processing engine. You've got to figure that out. Once you do that, then you've got to make sure that you keep the GPUs up and running. They're a little bit finicky. It's new technology."

Here's the complete video interview, part of SiliconANGLE's and theCUBE Research's coverage of the "Mastering AI: The New Infrastructure Rules" event:
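Manca's point about keeping GPUs fed comes down to overlapping storage reads with compute. The sketch below is a generic illustration of that producer-consumer pattern, not Penguin's stack: a background thread stages batches from storage into host memory so the training loop never stalls waiting on I/O. The file paths and the `consume_on_gpu` step are placeholders introduced for the example.

```python
# Minimal sketch of the "keep the GPUs fed" problem: overlap storage reads
# with compute by prefetching batches on a background thread. A production
# pipeline would sit on a parallel file system and a real data loader.
import queue
import threading
from pathlib import Path

PREFETCH_DEPTH = 8  # batches staged in host memory ahead of the GPU

def read_batch(path: Path) -> bytes:
    """Stand-in for a storage read (Lustre, GPFS, object store, etc.)."""
    return path.read_bytes()

def producer(paths: list[Path], staging: queue.Queue) -> None:
    for p in paths:
        staging.put(read_batch(p))   # blocks when the consumer falls behind
    staging.put(None)                # sentinel: no more data

def consume_on_gpu(batch: bytes) -> None:
    """Placeholder for the host-to-device copy plus a training step."""
    pass

def train(paths: list[Path]) -> None:
    staging: queue.Queue = queue.Queue(maxsize=PREFETCH_DEPTH)
    threading.Thread(target=producer, args=(paths, staging), daemon=True).start()
    while (batch := staging.get()) is not None:
        consume_on_gpu(batch)        # compute overlaps the next reads
```

The bounded queue is the crux: it lets reads run ahead of the accelerator without exhausting host memory, which is the same balancing act Manca describes between the file system, the network and the GPUs.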
[3]
Optimizing AI infrastructure for scalable and high-performance systems - SiliconANGLE
Optimizing AI infrastructure: Bridging IT and HPC for scalable, high-performance AI systems

As artificial intelligence applications grow in complexity and demand, optimizing AI infrastructure allows seamless scaling, ensuring that systems can handle increased workloads without performance degradation. Recognizing that AI's potential hinges on the strength of its underlying infrastructure, Penguin Solutions Inc. is taking a proactive approach by offering a sustainable operational model designed to enhance productivity and scalability, according to Trey Layton (pictured), vice president of software and product management at Penguin Solutions.

"The high-performance computing world needs to understand the problems of IT, and the IT world needs to understand the problems of high-performance computing," he said. "In that, we get a convergence of those two skills, and the future artificial intelligence infrastructure engineer is one who gets both worlds. When we accommodate those two things by building an infrastructure that is modular, you're acquiring partnerships with organizations that understand how to deal with the complexity and the scale simultaneously."

Layton spoke with theCUBE's Dave Vellante at the "Mastering AI: The New Infrastructure Rules" event, during an exclusive broadcast on theCUBE, SiliconANGLE Media's livestreaming studio. They discussed why optimizing AI infrastructure should be top of mind in the modern digital landscape. (* Disclosure below.)

Optimizing AI infrastructure enhances performance, reduces costs, ensures scalability and improves sustainability. As a result, taming the skills gap is needed for efficient management of AI workloads, according to Layton.

"When we're building an artificial intelligence environment, we're really talking about constructing an F1 car that's designed to run around a track, and you need a different set of tools to be able to construct that highly specialized solution to be able to deliver those outcomes," he said.

In the IT world, achieving peak performance is an occasional milestone, whereas AI infrastructure operates at peak efficiency continuously, demanding distinct skills and tools, Layton added. This contrast highlights the evolving need for specialized expertise and adaptive technologies to optimize both environments effectively.

HPC and advanced IT form the foundation of modern AI and serve as the driving force behind optimizing AI infrastructure, according to Layton.

"If you think about the modern HPC engineer, he's going to need to be versed in Kubernetes and microservices, where they're largely experienced in batch-based processing technologies like Slurm and things like that," he said. "Whereas the IT person has been skilled in virtualization and cloud technologies, and now they're going to have to learn storage technologies like parallel file systems and how to run massively scalable clustered outcomes. These two worlds are colliding, and the skills are unique to each particular environment."

Mitigating AI failures involves proactive planning, monitoring and refining AI systems to ensure accuracy, reliability and ethical use. This is why optimizing AI infrastructure is needed to mitigate errors, according to Layton.

"Our own internal analysis shows that GPUs fail about 33 times the rate of a general-purpose CPU," he said. "If you go back to that car analogy, when you're running a race car around a track, and the engine's running at full RPMs all the time, sometimes tires are going to blow, sometimes cylinders are going to blow, and that's what happens in these AI infrastructure solutions is that we're running all the devices at peak performance all the time. How do you construct the environment to accommodate those failure conditions?"

Here's the complete video interview, part of SiliconANGLE's and theCUBE Research's coverage of the "Mastering AI: The New Infrastructure Rules" event:
[4]
ICE ClusterWare automates AI infrastructure provisioning - SiliconANGLE
The future of AI infrastructure: Scaling intelligence with unified compute environments

Artificial intelligence is only as powerful as the infrastructure that supports it. With today's enterprises deploying hundreds, if not thousands, of GPUs, ensuring peak performance, efficiency and security is paramount. Penguin Solutions Inc. is addressing this need with ICE ClusterWare, a software solution designed to simplify and optimize AI infrastructure deployment, according to Trey Layton (pictured), vice president of software and product management at Penguin Solutions.

"I think the unique thing with artificial intelligence is we're talking about constructing an environment that needs to run at peak performance all the time, which is in a little bit of contrast to what IT organizations are typically used to managing," he said. "You're talking about a massively scalable parallel-processing infrastructure that's designed to run at peak performance all the time. That's different than what organizations of the past have built, and that's what we're focused on building."

Layton spoke with theCUBE's Dave Vellante at the "Mastering AI: The New Infrastructure Rules" event, during an exclusive broadcast on theCUBE, SiliconANGLE Media's livestreaming studio. They discussed insights on optimizing AI infrastructure, underscoring the need for integrated AI environments that adapt to evolving demands. (* Disclosure below.)

Many enterprises have experimented with cloud-based AI solutions, but scaling these experiments into production environments is costly and complex. Data gravity, latency concerns and limited AI expertise further complicate cloud-based approaches. To address these issues, organizations must shift toward unified AI infrastructure that combines on-premises computing power with cloud flexibility.

Deploying AI environments at scale requires specialized expertise, which many enterprises lack. To bridge this gap, Penguin has developed ICE ClusterWare, a software solution designed to automate the provisioning of AI clusters. It simplifies deployment, enabling organizations to build high-performance AI environments without requiring deep technical expertise, according to Layton.

"ICE ClusterWare is designed to provision these artificial intelligence clusters that are needed in numerous use cases out there when you provision these infrastructures," he said. "A lot of organizations don't have the skill sets to deploy these particular configurations, and this software is designed to automate these outcomes so it makes it easier for organizations to deploy those environments."

Beyond automation, the solution ensures optimal resource utilization by managing GPU clusters effectively. AI workloads demand constant fine-tuning of compute resources, and Penguin's software provides the necessary orchestration to maintain peak performance. With AI adoption accelerating across industries, solutions such as ICE ClusterWare offer a streamlined path for enterprises to scale their AI capabilities, Layton added.

AI environments operate under extreme conditions, often running at full capacity around the clock. This continuous strain on hardware increases the likelihood of silent failures: subtle issues that can cascade into system-wide disruptions if left undetected. To mitigate these risks, Penguin has introduced the ICE ClusterWare AIM service, a software tool that provides telemetry and predictive failure analysis, Layton explained.

"When you're running infrastructure at high performance, low latency, maximum performance, you're going to experience failures that are sometimes silent that lead to larger failures -- and you're going to experience outright hardware failures," he said. "The AIM software solution is designed to diagnose and remediate those failures before they impact the actual production environment."

Here's the complete video interview, part of SiliconANGLE's and theCUBE Research's coverage of the "Mastering AI: The New Infrastructure Rules" event:
The rise of AI is transforming data centers and enterprise computing, with new infrastructure requirements and challenges. Companies like Penguin Solutions are offering innovative solutions to help businesses navigate this complex landscape.
The rise of artificial intelligence (AI) is fundamentally reshaping the global data center industry, presenting enterprises with critical decisions about building, deploying, and optimizing their AI infrastructure. According to Dave Vellante, chief analyst of theCUBE Research, AI-driven computing is projected to account for nearly 90% of all data center spending over the next decade [1]. This shift is creating a new era of computing that will transform the trillion-dollar-plus data center business.
Organizations face several challenges when implementing AI infrastructure:

- Skills gaps between traditional enterprise IT and high-performance computing
- Underutilized GPU clusters that drive up costs
- The complexity of integrating AI into existing systems
- Security and cost concerns about moving proprietary data to the cloud

Pete Manca, president of Penguin Solutions, emphasizes that building in-house AI infrastructure is often preferred but requires a comprehensive transformation [1]. This includes rethinking data center build-out, power, cooling, and system architecture.
Enterprises must consider several factors when designing AI environments:

- Cooling strategy: liquid cooling, direct-to-chip cooling, or traditional air cooling
- The right combination of chip vendors, storage solutions, and networking technologies
- GPU-to-GPU communication technologies such as NVLink and advanced optical networking

Trey Layton, VP of software and product management at Penguin Solutions, highlights the need for massively scalable parallel processing infrastructures designed to run at peak performance continuously [2].
Unlike traditional IT setups focused on uptime and high availability, AI infrastructure must be optimized for maximum performance at all times. Key considerations include:

- Maximizing GPU cluster utilization to avoid costly inefficiencies
- Predictive failure analysis to catch hardware issues before they affect operations
- Automated monitoring and remediation to reduce downtime

Penguin Solutions has developed software solutions like ICE ClusterWare AIM to tackle these challenges, leveraging over 2 billion hours of GPU runtime expertise [3].
The convergence of high-performance computing (HPC) and IT skills is crucial for managing AI infrastructure effectively. Layton emphasizes the need for AI infrastructure engineers who understand both worlds [3]. This includes expertise in:

- Kubernetes and microservices
- Batch scheduling technologies such as Slurm
- Virtualization and cloud technologies
- Parallel file systems and massively scalable clustered storage

To address the complexity of deploying AI environments at scale, Penguin Solutions has developed ICE ClusterWare, a software solution designed to automate the provisioning of AI clusters [4]. This tool simplifies deployment and enables organizations to build high-performance AI environments without requiring deep technical expertise.

As AI applications grow in complexity and demand, optimizing infrastructure for scalability, efficiency, and reliability becomes crucial. The future of AI infrastructure lies in unified compute environments that combine on-premises computing power with cloud flexibility, allowing enterprises to scale their AI capabilities effectively while managing costs and performance [4].
References

[1] AI infrastructure: The future of data centers and enterprise computing - SiliconANGLE
[2] Penguin aims to solve the architectural problem of AI - SiliconANGLE
[3] Optimizing AI infrastructure for scalable and high-performance systems - SiliconANGLE
[4] ICE ClusterWare automates AI infrastructure provisioning - SiliconANGLE