4 Sources
[1]
AppleInsider.com
Real-world testing of Apple's latest implementation of Mac cluster computing proves it can help AI researchers work with massive models, thanks to pooling memory resources over Thunderbolt 5.

In November, Apple teased inbound features in macOS Tahoe 26.2 that stand to considerably change how AI researchers perform machine learning processing. At the time, the headline improvement to MLX, Apple's machine learning framework, was support for GPU-based neural accelerators, but Thunderbolt 5 clustering support was also a big change.

One month later, the benefits of Thunderbolt 5 for clustering are finally being seen in a real-world environment. YouTuber Jeff Geerling wrote a blog post and published a video on December 18 detailing his experience with a cluster of Mac Studios loaned to him by Apple.

The set of four Macs cost just short of $40,000 in total and was used to show off Thunderbolt 5 connectivity in relation to cluster computing. All were M3 Ultra models, each equipped with a 32-core CPU, an 80-core GPU, and a 32-core Neural Engine. Two of the machines had 512GB of unified memory and 8TB of storage, while the other two had 256GB of memory and 4TB of storage.

Put into a compact 10-inch rack, the collection of Mac Studios was described by Geerling as "almost whisper-quiet," running at under 250 watts apiece. However, the key is the combination of Thunderbolt 5 support between the Mac Studios and the capability to pool their memory.

Massive memory resources

The MLX changes in macOS Tahoe 26.2 included a new driver with Thunderbolt 5 support. This is important since it can considerably speed up inter-Mac connections in small clusters such as this one. Typical Ethernet-based cluster computing is limited to a maximum of 10Gb/s, depending on the Mac's specification and setting aside techniques such as link aggregation across multiple Ethernet ports.

To improve on this, researchers have used Thunderbolt to handle connections between Macs in a cluster, since it has much higher bandwidth. Previous efforts using Thunderbolt 4 topped out at 40Gb/s; with Thunderbolt 5, the maximum bandwidth is boosted to 80Gb/s.

The massive bandwidth is especially useful thanks to Apple's inclusion of RDMA (Remote Direct Memory Access) over Thunderbolt 5. Under RDMA, one CPU node in the cluster can directly read the memory of another, expanding its available memory pool to incorporate the others in the cluster. Crucially, as the name indicates, this happens directly, without requiring much processing from the other Mac's CPU at all.

In short, the different processors have access to all of a cluster's memory reserves at once. For the four Mac Studios loaned to Geerling, that's a total of 1.5 terabytes of memory in use. With Thunderbolt 5 improving inter-Mac bandwidth, that access has now improved considerably.

The upshot for researchers working in machine learning is that it's a way to use huge Large Language Models (LLMs) that go beyond the limits of a single Mac's memory capacity.

Building a cluster this way does have a limit, due to the use of Thunderbolt 5 itself. In the absence of any Thunderbolt 5 networking switch, all of the Mac Studios have to be daisy-chained, severely limiting the number of units that could be clustered together before network latency hobbles performance.
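To make the clustering model concrete, here is a minimal sketch of a distributed MLX program, assuming macOS Tahoe 26.2 with MLX's distributed backend already configured over Thunderbolt 5 per Apple's documentation; the host names in the launch comment are placeholders.

```python
# Minimal distributed MLX "hello world" sketch. Assumes the cluster is
# configured for MLX distributed execution over Thunderbolt 5.
# Launch across nodes with, e.g.:
#   mlx.launch --hosts mac1,mac2,mac3,mac4 hello.py
import mlx.core as mx

world = mx.distributed.init()                  # join the cluster
print(f"node {world.rank()} of {world.size()}")

# Each node contributes a local array; all_sum aggregates across nodes.
local = mx.ones((4,)) * world.rank()
total = mx.distributed.all_sum(local)
mx.eval(total)                                 # force evaluation
print(total)
```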
Real-world testing

Geerling was able to run benchmarks on the Mac Studio collection to determine how beneficial it actually can be. After running a command in Recovery Mode to enable RDMA, he used an open-source tool called Exo, as well as Llama.cpp, to run models across the cluster. The pair together served as a test of RDMA's effectiveness: Exo supports RDMA, while Llama.cpp does not.

An initial benchmark using Qwen3 235B showed promise. On a single node, meaning a single Mac from the cluster, Llama.cpp was faster at 20.4 tokens per second versus 19.5 tokens per second for Exo. But with two nodes in use, Llama.cpp dropped to 17.2 tokens per second while Exo improved considerably to 26.2 tokens per second. At four nodes, Llama.cpp fell again to 15.2 tokens per second while Exo rose to 31.9 tokens per second.

Similar improvements were seen using DeepSeek V3.1 671B, with Exo's performance going from 21.1 tokens per second on a single node to 27.8 tokens per second on two, and 32.5 tokens per second on four nodes.

There was also a test of a one-trillion-parameter model, Kimi K2 Thinking 1T A32B, albeit with only 32 billion parameters active at any time. This is a model that is simply too big for a single Mac Studio with 512GB of memory to handle. Over two nodes, Llama.cpp reported a speed of 18.5 tokens per second, with Exo's RDMA bumping that up to 21.6 tokens per second. Over four nodes, Exo reached 28.3 tokens per second.

Across the clustering tests, Exo improved considerably as more nodes became available, thanks to RDMA.

Big potential, with asterisks

The big takeaway from Geerling's testing is that there's a lot of performance available for researchers working in machine learning, especially when it comes to handling massive LLMs. Apple has demonstrated that it is possible, without sacrificing performance, thanks to RDMA and Thunderbolt 5's available bandwidth.

Creating a cluster like this is still expensive for the typical user, and it may be a bit too costly for hobbyists to undertake. However, a $40,000 setup like this is a fairly reasonably priced expense for teams at companies with a vested interest in AI development.

There are some reservations, though, such as reported stability issues stemming from running HPL benchmarks over Thunderbolt, and other bugs that surface in prerelease software. Geerling also admits to trust issues with the secretive development style of the team behind Exo, especially considering it's an open-source project.

However, there's also some unrealized potential here. The cluster uses the M3 Ultra because that's the fastest chip in a Mac that supports Thunderbolt 5 rather than the slower Thunderbolt 4. With an M4 Ultra chip apparently out of the picture, it's proposed that an M5 Ultra Mac Studio could be much better still, thanks to its GPU neural accelerator support. That should give even more of a boost to machine learning research, if Apple gets around to releasing that chip.

Geerling also wonders if Apple could extend the inter-device Thunderbolt 5 connectivity even further, to include SMB Direct. He reasons that network shares behaving at speeds similar to directly attached storage could be a big assist for people working with latency-sensitive, high-bandwidth applications, like video editing for YouTubers.
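As a quick sanity check on those numbers, a few lines of Python reproduce the scaling picture; the figures are the Qwen3 235B tokens-per-second results reported above.

```python
# Scaling arithmetic on the reported Qwen3 235B results (tokens/sec).
# Figures come straight from Geerling's published benchmarks above.
results = {
    "Llama.cpp (no RDMA)": {1: 20.4, 2: 17.2, 4: 15.2},
    "Exo (RDMA)":          {1: 19.5, 2: 26.2, 4: 31.9},
}
for runtime, tps in results.items():
    base = tps[1]
    for nodes, rate in sorted(tps.items()):
        print(f"{runtime}: {nodes} node(s) -> {rate:5.1f} tok/s "
              f"({rate / base:.2f}x vs one node)")
```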
[2]
Powerful Apple Mac Studio AI Supercomputer with 2TB of RAM
What if you could build a machine so powerful it could handle trillion-parameter AI models, yet so accessible it could sit right in your home office? In the video, NetworkChuck breaks down how he constructed a local AI supercomputer with a staggering 2TB of RAM, using nothing more than four Mac Studios and some clever optimizations. This isn't just a tech flex; it's a bold challenge to the notion that high-performance AI computing is reserved for massive corporations with endless budgets. By combining consumer-grade hardware with techniques like tensor parallelism and RDMA, he's crafted a system that rivals traditional supercomputers at a fraction of the cost.

This guide walks through the key takeaways from his build, including the hardware configuration, the software breakthroughs that slashed latency, and the real-world applications that make this setup more than just a theoretical exercise. Whether you're curious about how local AI computing can enhance data security or intrigued by the idea of running trillion-parameter models without relying on the cloud, there's plenty to unpack here. What does it mean when consumer hardware starts to close the gap with enterprise systems? Let's explore the possibilities, and the challenges, of this new project.

The hardware setup forms the backbone of this AI supercomputer. Each Mac Studio in the cluster carries 512GB of unified memory and an 80-core GPU; combined, the four machines deliver 2TB of RAM and 320 GPU cores. The system relies on Thunderbolt 5 and Ethernet networking to ensure fast and reliable communication between devices. At an estimated cost of $50,000, this configuration provides a cost-effective alternative to traditional high-performance computing systems, such as Nvidia H100 clusters, which can exceed $780,000. This affordability makes high-end AI computing accessible to smaller organizations and independent researchers.

One of the most significant challenges in clustering Mac Studios was latency. Initial attempts faced delays of up to 300 microseconds, causing performance drops of up to 91%. This issue was resolved with the introduction of macOS Tahoe 26.2, which includes support for Remote Direct Memory Access (RDMA). RDMA reduces latency to just 3 microseconds, allowing data to bypass the CPU during transfers and directly access memory across devices, which significantly improves the cluster's efficiency and keeps it operating at peak performance under demanding AI workloads.

To further enhance the cluster's capabilities, the system transitioned from pipeline parallelism to tensor parallelism. This approach divides large AI models into smaller tensors, which are processed simultaneously across multiple GPUs, maximizing utilization of the cluster's 320 GPU cores and ensuring efficient distribution of computational tasks; a toy illustration of the idea follows below. When combined with RDMA, tensor parallelism tripled the system's performance compared to earlier configurations. The cluster successfully ran trillion-parameter models, such as Kimi K2, showcasing its ability to handle some of the most complex AI models available today. This optimization highlights the potential of consumer-grade hardware to rival traditional supercomputers in specific applications.
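To make the tensor-parallel idea concrete, here is a toy sketch in plain NumPy: the weight matrix is split column-wise across hypothetical devices, each computes its shard, and the shards are concatenated. Real frameworks like Exo and MLX handle the sharding and the RDMA transfers automatically; this only illustrates the math.

```python
# Toy tensor parallelism: shard a matmul column-wise across N "devices".
import numpy as np

def tensor_parallel_matmul(x, W, n_devices=4):
    # Split the weight matrix column-wise, one shard per device.
    shards = np.split(W, n_devices, axis=1)
    # Each device multiplies its own shard (in parallel in practice).
    partials = [x @ shard for shard in shards]
    # Gather: concatenate partial outputs back into the full result.
    return np.concatenate(partials, axis=-1)

x = np.random.randn(2, 8)
W = np.random.randn(8, 16)   # 16 columns split evenly across 4 "devices"
assert np.allclose(tensor_parallel_matmul(x, W), x @ W)
print("sharded result matches the single-device matmul")
```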
The cluster underwent rigorous testing with a variety of AI models, and the tests confirmed the system's compatibility with real-world applications such as Open WebUI and Xcode. Running these models locally offers several advantages, including enhanced data security by reducing reliance on cloud-based solutions and lower operational costs by eliminating recurring cloud service fees. This capability is particularly valuable for organizations that handle sensitive data or operate on tight budgets.

At a price point of $50,000, this AI supercomputer represents a significant step toward widespread access to high-performance AI computing. It provides researchers, developers, and small organizations with the tools needed to innovate in fields such as machine learning, application development, and scientific research. By bridging the gap between consumer-grade hardware and enterprise-level capabilities, this project opens new doors for experimentation and discovery.

Despite its impressive achievements, the project encountered several challenges, underscoring the importance of continued development in both hardware and software to fully realize the potential of local AI clustering.

This project serves as a compelling proof of concept for the viability of local AI clustering on consumer-grade hardware. By addressing current limitations and leveraging ongoing advancements in networking and software, local AI clusters have the potential to rival traditional supercomputers, offering scalable and accessible solutions for a wide range of applications, from academic research to industrial innovation. The development of this AI supercomputer demonstrates how well-chosen hardware and optimized software can deliver exceptional performance at a fraction of the cost of traditional systems, and it encourages further exploration into clustering technologies and their practical applications.
[3]
Apple's AI Advantage On Its Mac Cluster Now Under Threat
The ability to pool computational power by clustering a number of Mac mini or Mac Studio devices via Thunderbolt 5 is indeed a potent tool, especially given Apple's unified memory architecture, which makes copious memory available at a time when every unit of memory is worth its weight in gold. Yet, even as Apple introduces tools to better exploit this unique advantage, its overall competitiveness is being chipped away by the expiration of its memory-focused long-term agreements (LTAs), setting the stage for a considerable surge in prices for its upcoming products, including the upcoming M5-based Mac mini and Mac Studio.

Alex Ziskind recently showed that it was cheaper to run less complicated machine learning (ML) and AI tasks on Apple silicon than on the NVIDIA RTX 4090, NVIDIA's most expensive consumer-oriented GPU. At the heart of this advantage lies Apple silicon's unified memory architecture, where the CPU and GPU share the same pool of memory. As an example, the M4 Pro Mac mini offers up to 64GB of RAM (unified memory) versus the RTX 4090's 24GB. When you link a number of Mac mini devices via Thunderbolt 5, the pooled memory scales rapidly for AI-related tasks. What's more, this advantage becomes all the more lucrative when combined with the superior processing power of the upcoming M5-based Mac mini and Mac Studio.

Meanwhile, Apple appears to be doing everything in its power to highlight this pooled computing advantage. For instance, macOS Tahoe 26.2 introduced a new driver to MLX, Apple's bespoke machine learning platform, complete with support for Thunderbolt 5. Unlike a typical Ethernet-based computing cluster, where the connection speed maxes out at around 10Gb/s, Thunderbolt 5 has a maximum bandwidth of 80Gb/s. What's more, Apple has implemented RDMA (Remote Direct Memory Access) over Thunderbolt 5, which allows any given CPU node in the cluster to read the memory of another without expending much processing power on the node being read.

To illustrate this concept, YouTuber Jeff Geerling recently built a cluster of four Mac Studios loaned to him by Apple. The cluster boasted around 1.5TB of unified memory and cost roughly $40,000. For comparison, pooling the same quantity of memory by clustering NVIDIA DGX Spark units would require 12 units at roughly $4,000 each, for a total cost of $48,000, giving Geerling's cluster of four Mac Studios an ~$8,000 advantage.

The cost advantage that Apple currently retains is at least partially due to its long-term agreements (LTAs) with some of the biggest names in the memory sphere. However, as we noted in a recent dedicated post, some of Apple's LTAs are set to end as soon as January 2026, with Samsung and SK Hynix chomping at the bit to increase their quotation prices for Apple as a result. Against this backdrop, we would not be surprised if the $8,000 advantage we detailed above shrinks to mere hundreds of dollars, or disappears entirely, come January.
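The arithmetic behind that comparison is easy to check; note that the 128GB-per-unit DGX Spark figure below is an assumption consistent with the 12-unit count the article implies, not a number the article states.

```python
# Cost and memory arithmetic from the comparison above.
# Assumption: each DGX Spark has 128GB (12 x 128GB ~= 1.5TB, matching
# the article's 12-unit figure); prices are the article's estimates.
spark_units, spark_price, spark_mem_gb = 12, 4_000, 128
mac_cluster_price, mac_cluster_mem_tb = 40_000, 1.5

spark_total = spark_units * spark_price
print(f"DGX Spark route: {spark_units * spark_mem_gb / 1024:.1f}TB "
      f"for ${spark_total:,}")
print(f"Mac Studio cluster: {mac_cluster_mem_tb}TB for ${mac_cluster_price:,}")
print(f"Apple's current advantage: ${spark_total - mac_cluster_price:,}")
```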
[4]
M4 Pro Macs Stack : Thunderbolt 5 Links Make Mac AI Go Way Faster
What if you could run trillion-parameter AI models on your desk without relying on expensive cloud infrastructure? In the video, Alex Ziskind breaks down Apple's latest innovations in artificial intelligence. With the release of Exo 1.0, macOS 26.2, and RDMA over Thunderbolt 5, Apple is reshaping how AI workflows operate on its hardware. Imagine clustering multiple Mac Studios or Mac minis to handle massive machine learning tasks with ease; this isn't just a technical upgrade, it's a bold step toward making advanced AI accessible to more people than ever before.

In this deep dive, we'll explore how Apple's ecosystem is transforming AI development, from the new tensor parallelism in Exo 1.0 to the lightning-fast data transfers enabled by RDMA. Whether you're a seasoned developer or just curious about the future of AI, you'll discover how these innovations eliminate bottlenecks, boost scalability, and redefine performance. The implications for researchers, businesses, and creators are enormous, but the real question is: how will this change what's possible in your own work?

At the heart of Apple's advancements lies Exo 1.0, a clustering solution designed to simplify distributed machine learning. With its user-friendly installer and intuitive interface, Exo 1.0 allows you to set up and manage clusters with remarkable ease, and its real-time dashboard provides detailed insights into cluster performance and model execution, making it accessible even to newcomers to distributed computing. Exo 1.0 introduces tensor parallelism, a method that divides large AI models into smaller, manageable segments for simultaneous processing across multiple devices. This approach optimizes model sharding, ensuring that even the most complex models can run efficiently. Whether you are a developer or a researcher, Exo 1.0 removes the technical barriers to clustering technology, letting you focus on innovation and results.

A standout feature of Apple's ecosystem is the integration of RDMA (Remote Direct Memory Access) over Thunderbolt 5, which transforms data transfer speeds between devices. This technology achieves communication speeds up to 10 times faster than traditional methods, significantly reducing data transfer delays. By eliminating bottlenecks, RDMA ensures that distributed AI tasks run smoothly and efficiently, even at scale. To take advantage of RDMA over Thunderbolt 5, you will need devices equipped with Apple's M4 Pro chips or later. This combination of hardware and software lets you scale AI workloads across multiple nodes without the latency issues that often hinder multi-machine setups, which is particularly beneficial for tasks requiring real-time processing or large-scale model training.

The MLX Distributed Framework is another cornerstone of Apple's AI ecosystem, specifically designed to maximize the potential of Apple silicon. Integrated with RDMA, MLX accelerates both model training and inference across a wide range of AI applications. It supports both dense and quantized models, providing flexibility based on your specific requirements and ensuring that you can optimize performance regardless of the complexity or scale of your AI tasks. Whether you are working on resource-intensive projects or lightweight applications, MLX provides the tools needed to achieve your goals efficiently.
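For a sense of what running one of these models locally looks like in practice, here is a minimal sketch using the mlx-lm package; the model repository name is an illustrative placeholder, not one mentioned in the video.

```python
# Minimal local-inference sketch with mlx-lm (pip install mlx-lm).
# The model id below is hypothetical; substitute any MLX-converted
# model that fits within your machine's (or cluster's) pooled memory.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Example-Model-4bit")  # placeholder id
prompt = "Summarize RDMA over Thunderbolt 5 in two sentences."
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```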
The release of macOS 26.2 further enhances Apple's AI capabilities by introducing native support for RDMA, creating a seamless integration between hardware and software. One of the most notable features of Apple silicon, unified memory, allows memory to be shared effortlessly across clusters. This capability enables you to run larger models on devices like the M4 Mac mini, which are more cost-effective than traditional high-end alternatives. By combining macOS 26.2 with unified memory, Apple ensures that AI workflows are not only faster but also more accessible. Whether you are using a compact Mac mini or a high-performance Mac Studio, this cohesive ecosystem lets you achieve remarkable results without expensive cloud-based infrastructure.

Apple's advancements in AI clustering technology open up a wide array of possibilities for real-world applications. Large language models (LLMs), which power technologies such as chatbots, natural language processing tools, and content generation systems, can now run locally on Apple silicon clusters. This eliminates reliance on costly cloud-based solutions, giving you greater control over your data while significantly reducing operational expenses. For developers and researchers, these tools provide a robust platform for experimentation and innovation, whether you are training new models, fine-tuning existing ones, or exploring novel AI applications.

Apple's integration of tensor parallelism, RDMA, and unified memory delivers substantial performance improvements for AI workflows. Key metrics, such as token generation rates (a critical measure of efficiency for LLMs, and one that is simple to measure, as sketched below), have seen significant gains. Apple's clustering technology scales across multiple nodes, allowing faster processing even for trillion-parameter models, so you can tackle demanding AI workloads without compromising speed or accuracy. By using Apple's ecosystem, you can achieve results that were previously only possible with large-scale cloud infrastructure, making the solution not only powerful but cost-effective for businesses and researchers alike.

Apple's latest releases represent a significant step forward in machine learning technology. By combining Exo 1.0, the MLX Distributed Framework, RDMA over Thunderbolt 5, and macOS 26.2, Apple has created an ecosystem that makes advanced AI more accessible, efficient, and scalable. These tools provide the performance, flexibility, and ease of use required by modern AI workflows. Whether you are an experienced AI professional or just beginning your journey, Apple's advancements offer a powerful platform for innovation, enhancing the way AI models are developed and deployed and paving the way for a future where artificial intelligence is more integrated into everyday life.
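Since token generation rate is the throughput metric quoted throughout these tests, here is a small hedged sketch of how one might measure it around any generate call; generate_fn is a placeholder for whatever runtime you use (Exo, mlx-lm, llama.cpp bindings).

```python
# Hedged sketch: timing token throughput around an arbitrary generator.
# generate_fn is a placeholder callable; plug in your runtime's call.
import time

def tokens_per_second(generate_fn, prompt: str, n_tokens: int = 128) -> float:
    start = time.perf_counter()
    generate_fn(prompt, n_tokens)             # produce exactly n_tokens
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Example with a dummy generator that just sleeps for half a second:
rate = tokens_per_second(lambda p, n: time.sleep(0.5), "hello", n_tokens=64)
print(f"{rate:.1f} tok/s")
```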
Apple's macOS Tahoe 26.2 introduces RDMA over Thunderbolt 5, allowing Mac clusters to pool up to 1.5TB of unified memory for running massive AI models. Real-world tests show performance tripling with four Mac Studios, handling trillion-parameter models at a fraction of traditional supercomputer costs. But expiring memory supply agreements threaten Apple's cost advantage.
Apple has introduced a significant enhancement to its machine learning capabilities with macOS Tahoe 26.2, which brings RDMA (Remote Direct Memory Access) support over Thunderbolt 5 to Mac cluster configurations. This advancement addresses a critical bottleneck in distributed machine learning, reducing latency from 300 microseconds to just 3 microseconds and allowing AI researchers to pool massive memory resources across multiple devices [1][2]. The technology enables one CPU node in a Mac cluster to directly read another's memory without consuming significant processing power, effectively creating a unified memory pool across all connected devices. YouTuber Jeff Geerling demonstrated this capability using four Apple Mac Studio units loaned by Apple, achieving a combined 1.5 terabytes of unified memory at a total cost of approximately $40,000 [1][3].
Source: AppleInsider
The integration of Thunderbolt 5 into Apple's clustering ecosystem represents a substantial leap from previous networking solutions. While typical Ethernet-based cluster computing maxes out at 10Gb/s, and Thunderbolt 4 offered 40Gb/s, Thunderbolt 5 doubles that capacity to 80Gb/s [1][4]. This bandwidth increase proves essential when running Large Language Models that exceed the memory capacity of a single device. Geerling's testing with M3 Ultra models equipped with 32-core CPUs, 80-core GPUs, and 32-core Neural Engines showed dramatic performance improvements when RDMA was enabled. Using the open-source tool Exo 1.0, which supports RDMA, performance on the Qwen3 235B model jumped from 19.5 tokens per second on a single node to 31.9 tokens per second across four nodes [1]. By comparison, Llama.cpp without RDMA support actually decreased from 20.4 to 15.2 tokens per second as more nodes were added, highlighting the critical role of RDMA in distributed machine learning.

Apple's MLX Distributed Framework, enhanced in macOS Tahoe 26.2, now supports tensor parallelism alongside RDMA capabilities. Tensor parallelism divides large AI models into smaller segments that can be processed simultaneously across multiple GPUs, maximizing utilization of the cluster's 320 GPU cores when four Mac Studios are connected [2][4]. This approach proved essential when testing the Kimi K2 Thinking 1T A32B model, a trillion-parameter AI model that simply couldn't fit within a single Mac Studio's 512GB memory capacity. Over four nodes, the system achieved 28.3 tokens per second, demonstrating that consumer-grade Apple Silicon can handle workloads previously reserved for enterprise systems [1]. The MLX framework's seamless integration with RDMA accelerates both model training and inference, supporting both dense models and quantized models depending on specific project requirements [4].
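Those link speeds translate directly into transfer time for model shards moving between nodes; the 20GB payload below is an arbitrary illustrative figure, not one from the articles.

```python
# Back-of-envelope transfer times at the three link speeds cited above.
# The 20GB payload is illustrative; real shard sizes vary by model.
payload_gb = 20
for link, gbps in [("10Gb/s Ethernet", 10),
                   ("Thunderbolt 4", 40),
                   ("Thunderbolt 5", 80)]:
    seconds = payload_gb * 8 / gbps    # gigabytes -> gigabits, then divide
    print(f"{link:>16}: {seconds:5.1f}s per {payload_gb}GB transfer")
```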
Source: Wccftech
The local AI supercomputer configuration offers a compelling cost advantage compared to traditional enterprise solutions. At $40,000 to $50,000 for a four-unit Mac cluster, the setup costs significantly less than NVIDIA H100 clusters, which can exceed $780,000 [2][3]. When comparing memory pooling capabilities, achieving 1.5TB of unified memory through NVIDIA DGX Spark units would require 12 devices at approximately $4,000 each, totaling $48,000 and giving Apple an $8,000 cost advantage [3]. However, this advantage faces potential erosion as Apple's long-term agreements with memory suppliers like Samsung and SK Hynix expire as soon as January 2026. Industry observers anticipate that these suppliers will increase quotation prices once current contracts end, potentially shrinking or eliminating Apple's current pricing edge for upcoming M5-based Mac mini and Mac Studio devices [3].
Source: Geeky Gadgets
Running AI models locally on a Mac cluster provides several strategic advantages beyond raw performance metrics. Organizations handling sensitive data can maintain enhanced data security by eliminating reliance on cloud infrastructure, keeping proprietary information within their own controlled environment [2][4]. The setup also eliminates recurring cloud service fees, reducing long-term operational costs for researchers, developers, and small organizations working with machine learning applications. Testing confirmed compatibility with real-world tools including Open WebUI and Xcode, demonstrating practical utility beyond benchmark scenarios [2]. The compact rack configuration runs almost whisper-quiet at under 250 watts per unit, making it suitable for office environments rather than requiring dedicated data center facilities [1]. However, the daisy-chain requirement for Thunderbolt 5 connections limits scalability, as adding more units without a dedicated networking switch would introduce network latency that could impact performance [1].

Summarized by Navi