3 Sources
[1]
AppleInsider.com
Real-world testing of Apple's latest implementation of Mac cluster computing proves it can help AI researchers work with massive models, thanks to pooled memory resources over Thunderbolt 5.

In November, Apple teased inbound features in macOS Tahoe 26.2 that stand to considerably change how AI researchers perform machine learning processing. At the time, the headline improvement to MLX, Apple's machine learning framework, was support for GPU-based neural accelerators, but Thunderbolt 5 clustering support was also a big change. One month later, the benefits of Thunderbolt 5 for clustering are finally being seen in a real-world environment.

YouTuber Jeff Geerling wrote a blog post and published a video on December 18 detailing his experience with a cluster of Mac Studios loaned to him by Apple. The set of four Macs cost just short of $40,000 in total and was used to show off Thunderbolt 5 connectivity in cluster computing. All were M3 Ultra models, each equipped with a 32-core CPU, 80-core GPU, and a 32-core Neural Engine. Two of them had 512GB of unified memory and 8TB of storage, while the other two had 256GB of memory and 4TB of storage. Mounted in a compact 10-inch rack, the collection of Mac Studios was, according to Geerling, "almost whisper-quiet" and ran at under 250 watts apiece. The key, however, is the combination of Thunderbolt 5 support between the Mac Studios and the ability to pool their memory.

Massive memory resources

The MLX changes in macOS Tahoe 26.2 included a new driver with Thunderbolt 5 support. This matters because it can considerably speed up inter-Mac connections in small clusters such as this one. Typical Ethernet-based cluster computing is limited to around 10Gb/s on most Mac configurations, barring techniques such as link aggregation across multiple Ethernet ports. To improve on this, researchers have used Thunderbolt to handle connections between Macs in a cluster, since it offers much higher bandwidth. Under previous efforts using Thunderbolt 4, the maximum bandwidth was 40Gb/s. With Thunderbolt 5, that rises to a maximum of 80Gb/s.

The extra bandwidth is especially useful thanks to Apple's inclusion of RDMA (Remote Direct Memory Access) over Thunderbolt 5. Under RDMA, one CPU node in the cluster can directly read the memory of another, expanding its available memory pool to incorporate the others in the cluster. Crucially, as the name indicates, the read is performed directly, requiring barely any processing from the second Mac's CPU. In short, the processors collectively have access to all of the cluster's memory at once. For the four Mac Studios loaned to Geerling, that's a total of 1.5 terabytes of memory, and with Thunderbolt 5 improving the inter-Mac bandwidth, access to it has improved considerably.

The upshot for researchers working in machine learning is a way to use huge Large Language Models (LLMs) that exceed the memory capacity of any single Mac. Building a cluster this way does have a limit, though, imposed by Thunderbolt 5 itself. In the absence of any Thunderbolt 5 networking switch, all of the Mac Studios have to be daisy-chained, severely limiting the number of units that can be clustered before network latency hobbles performance.
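To put those interconnect figures in perspective, here is a quick back-of-the-envelope sketch. The 100GB shard size is a hypothetical round number, and real-world throughput sits below each link's theoretical maximum:

```python
# Rough transfer-time comparison at each interconnect's theoretical maximum.
# The 100GB shard size is a hypothetical round number; real-world throughput
# sits well below these peak link rates.

SHARD_GB = 100

links_gbps = {
    "10GbE Ethernet": 10,
    "Thunderbolt 4": 40,
    "Thunderbolt 5": 80,
}

for name, gbps in links_gbps.items():
    seconds = SHARD_GB * 8 / gbps  # gigabytes -> gigabits, divided by link rate
    print(f"{name:>15}: {seconds:5.1f} s to move {SHARD_GB} GB")

# 10GbE Ethernet:  80.0 s to move 100 GB
#  Thunderbolt 4:  20.0 s to move 100 GB
#  Thunderbolt 5:  10.0 s to move 100 GB
```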
Real-world testing

Geerling ran benchmarks on the Mac Studio collection to determine how beneficial it actually is. After running a command in recovery mode to enable RDMA, he used an open-source tool called Exo, as well as Llama.cpp, to run models across the cluster. The pair served as a test of RDMA's effectiveness: Exo supports RDMA, while Llama.cpp does not.

An initial benchmark using Qwen3 235B showed the system's promise. On a single node, meaning a single Mac from the cluster, Llama.cpp was faster at 20.4 tokens per second versus 19.5 tokens per second for Exo. But with two nodes in use, Llama.cpp dropped to 17.2 tokens per second while Exo improved considerably to 26.2 tokens per second. At four nodes, Llama.cpp shrank again to 15.2 tokens per second while Exo rose to 31.9 tokens per second. Similar improvements were seen using DeepSeek V3.1 671B, with Exo's performance going from 21.1 tokens per second on a single node to 27.8 for two nodes and 32.5 for four.

There was also a test of a one-trillion-parameter model, Kimi K2 Thinking 1T A32B, although only 32 billion of its parameters are active at any time. This model is simply too big for a single Mac Studio with 512GB of memory to handle. Over two nodes, Llama.cpp reported a speed of 18.5 tokens per second, with Exo's RDMA bumping that up to 21.6 tokens per second. Over four nodes, Exo reached 28.3 tokens per second. Across the clustering tests, Exo improved considerably as more nodes became available, thanks to RDMA.

Big potential, with asterisks

The big takeaway from Geerling's testing is that there's a lot of performance available for researchers working in machine learning, especially when it comes to handling massive LLMs. Apple has demonstrated that this is possible without sacrificing performance, thanks to RDMA and Thunderbolt 5's available bandwidth. Creating a cluster like this is still expensive for the typical user, and may be too costly for hobbyists to undertake. However, a $40,000 setup like this one is a fairly reasonable expense for teams at companies with a vested interest in AI development.

There are some reservations, though, such as reported stability issues when running HPL benchmarks over Thunderbolt, and other bugs that surface in prerelease software. Geerling also admits to trust issues with the secretive development team behind Exo, especially considering it's an open-source project.

There's some unrealized potential here, too. The cluster uses the M3 Ultra because it's the fastest chip in a Mac that supports Thunderbolt 5 rather than the slower Thunderbolt 4. With an M4 Ultra chip seemingly out of the picture, it's proposed that an M5 Ultra Mac Studio could do much better, thanks to its GPU neural accelerator support. That should give even more of a boost to machine learning research, if Apple gets around to releasing such a chip. Geerling also wonders whether Apple could extend inter-device Thunderbolt 5 connectivity even further to include SMB Direct. He reasons that network shares behaving as if they were directly attached to the Mac could be a big assist for people working with latency-sensitive, high-bandwidth applications, like video editing for YouTubers.
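For a clearer view of the scaling trend, the reported figures can be tabulated. This short script only restates the numbers above and computes each run's speedup relative to its single-node baseline:

```python
# Tokens-per-second figures as reported in Geerling's tests, tabulated to
# show the scaling trend: Exo (RDMA) gains with node count, llama.cpp loses.

results = {
    "Qwen3 235B / llama.cpp":   {1: 20.4, 2: 17.2, 4: 15.2},
    "Qwen3 235B / Exo":         {1: 19.5, 2: 26.2, 4: 31.9},
    "DeepSeek V3.1 671B / Exo": {1: 21.1, 2: 27.8, 4: 32.5},
}

for run, by_nodes in results.items():
    base = by_nodes[1]  # single-node throughput as the baseline
    for nodes, tps in sorted(by_nodes.items()):
        print(f"{run:26}  {nodes} node(s): {tps:5.1f} tok/s ({tps / base:.2f}x)")
```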
[2]
Apple's AI Advantage On Its Mac Cluster Now Under Threat
The ability to pool computational power by clustering a number of Mac mini or Mac Studio devices via Thunderbolt 5 is indeed a potent tool, especially given Apple's unified memory architecture, which makes copious memory available at a time when every quantum of memory is worth its weight in gold. Yet, even as Apple introduces tools to better exploit this unique advantage, its overall competitiveness is being chipped away by the expiration of its memory-focused long-term agreements (LTAs), setting the stage for a considerable surge in prices for its upcoming products, including the M5-based Mac mini and Mac Studio.

Alex Ziskind recently showed that it was cheaper to run less complicated machine learning (ML) and AI tasks on Apple silicon than on the NVIDIA RTX 4090, NVIDIA's most expensive consumer-oriented GPU. At the heart of this advantage lies Apple silicon's unified memory architecture, in which the CPU and GPU share the same pool of memory. As an example, the M4 Pro Mac mini offers 64GB of RAM (unified memory) versus the RTX 4090's 24GB. When you link a number of Mac mini devices via Thunderbolt 5, the pooled memory scales rapidly for AI-related tasks. This advantage becomes all the more lucrative when combined with the superior processing power of the upcoming M5-based Mac mini and Mac Studio.

Meanwhile, Apple appears to be doing everything in its power to highlight this pooled computing advantage. For instance, macOS Tahoe 26.2 introduced a new driver to MLX, Apple's bespoke machine learning platform, replete with support for Thunderbolt 5. Unlike a typical Ethernet-based computing cluster, where the connection speed maxes out at around 10Gb/s, Thunderbolt 5 has a maximum bandwidth of 80Gb/s. What's more, Apple has implemented RDMA (Remote Direct Memory Access) over Thunderbolt 5, which allows any given CPU node in the cluster to read the memory of another without expending much processing power on the node being read.

To illustrate this concept, YouTuber Jeff Geerling recently built a cluster of four Mac Studios loaned to him by Apple. The cluster boasted around 1.5TB of unified memory and cost roughly $40,000. For comparison, pooling the same quantum of memory by clustering NVIDIA DGX Spark units would require you to acquire 12 of them, at roughly $4,000 apiece. That equates to a total cost of $48,000, giving Geerling's cluster of four Mac Studios an ~$8,000 advantage.

This cost advantage is at least partially due to Apple's long-term agreements (LTAs) with some of the biggest names in the memory sphere. However, as we noted in a recent dedicated post, some of Apple's LTAs are set to end as soon as January 2026, with Samsung and SK Hynix chomping at the bit to increase their quotation prices for Apple as a result. Against this backdrop, we would not be surprised if the $8,000 advantage detailed above shrinks to mere hundreds of dollars, or disappears entirely, come January.
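The arithmetic behind that comparison is easy to verify. The sketch below assumes the 128GB of memory per DGX Spark unit that the article's 12-unit figure implies:

```python
# Reproducing the article's cost math. The 128GB-per-unit figure for the
# DGX Spark is implied by the article's 12-unit count, not stated outright.

mac_cluster_cost = 40_000                  # four M3 Ultra Mac Studios
mac_cluster_memory_gb = 2 * 512 + 2 * 256  # = 1536 GB pooled

dgx_unit_cost = 4_000
dgx_unit_memory_gb = 128
dgx_units = -(-mac_cluster_memory_gb // dgx_unit_memory_gb)  # ceiling division -> 12

dgx_total = dgx_units * dgx_unit_cost
print(f"DGX Spark units needed: {dgx_units}, totalling ${dgx_total:,}")
print(f"Mac Studio cluster advantage: ${dgx_total - mac_cluster_cost:,}")  # $8,000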
[3]
M4 Pro Macs Stack: Thunderbolt 5 Links Make Mac AI Go Way Faster
What if you could run trillion-parameter AI models on your desk without relying on expensive cloud infrastructure? In the video, Alex Ziskind breaks down Apple's latest innovations in artificial intelligence, and they're nothing short of remarkable. With the release of Exo 1.0, macOS 26.2, and RDMA over Thunderbolt 5, Apple is reshaping how AI workflows operate on its hardware. Imagine clustering multiple Mac Studios or Mac minis to handle massive machine learning tasks with ease; this isn't just a technical upgrade, it's a bold step toward making advanced AI accessible to more people than ever before.

In this deep dive, we'll explore how Apple's ecosystem is transforming AI development, from the new tensor parallelism in Exo 1.0 to the lightning-fast data transfers enabled by RDMA. Whether you're a seasoned developer or just curious about the future of AI, you'll discover how these innovations eliminate bottlenecks, boost scalability, and redefine performance. The implications for researchers, businesses, and creators are enormous, but the real question is: how will this change what's possible in your own work?

At the heart of Apple's advancements lies Exo 1.0, a clustering solution designed to simplify distributed machine learning. With its user-friendly installer and intuitive interface, Exo 1.0 lets you set up and manage clusters with remarkable ease. Its real-time dashboard provides detailed insights into cluster performance and model execution, making it accessible even if you are new to distributed computing. Exo 1.0 introduces tensor parallelism, a method that divides large AI models into smaller, manageable segments for simultaneous processing across multiple devices (see the sketch below). This approach optimizes model sharding, ensuring that even the most complex models can run efficiently. Whether you are a developer or a researcher, Exo 1.0 removes the technical barriers to clustering technology, letting you focus on innovation and results.

A standout feature of Apple's ecosystem is the integration of RDMA (Remote Direct Memory Access) over Thunderbolt 5, which transforms data transfer speeds between devices. This technology achieves communication speeds up to 10 times faster than traditional methods, significantly reducing data transfer delays. By eliminating bottlenecks, RDMA ensures that distributed AI tasks run smoothly and efficiently, even at scale. To take advantage of RDMA over Thunderbolt 5, you will need devices equipped with Apple's M4 Pro chips or later. This combination of hardware and software lets you scale AI workloads across multiple nodes without the latency issues that often hinder multi-machine setups, which is particularly beneficial for tasks requiring real-time processing or large-scale model training.

The MLX Distributed Framework is another cornerstone of Apple's AI ecosystem, specifically designed to maximize the potential of Apple Silicon. Seamlessly integrated with RDMA, MLX accelerates both model training and inference, offering strong performance for a wide range of AI applications. It supports both dense and quantized models, providing flexibility based on your specific requirements. This adaptability ensures that you can optimize performance regardless of the complexity or scale of your AI tasks. Whether you are working on resource-intensive projects or lightweight applications, MLX provides the tools needed to achieve your goals efficiently.
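To make the tensor parallelism idea above concrete, here is a minimal, illustrative sketch. It mimics column-wise sharding of a single linear layer with plain NumPy arrays standing in for devices; it is not Exo's actual implementation:

```python
import numpy as np

# Toy illustration of tensor parallelism: split one linear layer's weight
# matrix column-wise across "devices" (plain array slices here), compute the
# partial outputs independently, then stitch them back together. Exo's real
# sharding is far more involved; this only demonstrates the core idea.

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))        # a single activation vector
W = rng.standard_normal((512, 2048))     # the full layer weight

n_devices = 4
shards = np.split(W, n_devices, axis=1)  # each "device" holds a 512x512 slice

# Each device multiplies the same input by its own weight slice...
partials = [x @ shard for shard in shards]

# ...and the slices are concatenated (an all-gather over the interconnect
# in a real cluster) to recover the full output.
y = np.concatenate(partials, axis=1)

assert np.allclose(y, x @ W)             # matches the single-device result
```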
The release of macOS 26.2 further enhances Apple's AI capabilities by introducing native support for RDMA, creating a seamless integration between hardware and software. One of the most notable features of Apple Silicon, unified memory, allows memory to be shared effortlessly across clusters. This capability enables you to run larger models on devices like the M4 Mac mini, which are more cost-effective than traditional high-end alternatives. By combining macOS 26.2 with unified memory, Apple ensures that AI workflows are not only faster but also more accessible. Whether you are using a compact Mac mini or a high-performance Mac Studio, this cohesive ecosystem enables remarkable results without the need for expensive cloud-based infrastructure.

Apple's advancements in AI clustering technology open up a wide array of possibilities for real-world applications. Large language models (LLMs), which power technologies such as chatbots, natural language processing tools, and content generation systems, can now run locally on Apple Silicon clusters. This eliminates the reliance on costly cloud-based solutions, giving you greater control over your data while significantly reducing operational expenses. For developers and researchers, these tools provide a robust platform for experimentation and innovation. Whether you are training new models, fine-tuning existing ones, or exploring novel AI applications, Apple's ecosystem offers the resources and scalability needed to push the boundaries of what is possible in artificial intelligence.

Apple's integration of tensor parallelism, RDMA, and unified memory delivers substantial performance improvements for AI workflows. Key metrics, such as token generation rates (a critical measure of efficiency for LLMs), have seen significant enhancements. Apple's clustering technology scales across multiple nodes, allowing faster processing even for trillion-parameter models. This scalability ensures you can tackle demanding AI workloads without compromising speed or accuracy. By using Apple's ecosystem, you can achieve results that were previously only possible with large-scale cloud infrastructure, making Apple's solution not only powerful but also cost-effective for businesses and researchers alike.

Apple's latest releases represent a significant leap forward in machine learning technology. By combining Exo 1.0, the MLX Distributed Framework, RDMA over Thunderbolt 5, and macOS 26.2, Apple has created an ecosystem that makes advanced AI more accessible, efficient, and scalable. These tools provide the performance, flexibility, and ease of use required to meet the demands of modern AI workflows. Whether you are an experienced AI professional or just beginning your journey, Apple's advancements offer a powerful platform for innovation. With these technologies, Apple is not only enhancing how AI models are developed and deployed but also paving the way for a future where artificial intelligence is more integrated into everyday life.
Apple's macOS Tahoe 26.2 introduces RDMA support over Thunderbolt 5, transforming Mac cluster computing for AI researchers. Real-world tests show a four-Mac Studio cluster with 1.5TB pooled memory can run trillion-parameter models locally, delivering up to 32.5 tokens per second. But rising memory prices could threaten Apple's cost advantage over NVIDIA solutions.
Apple's latest update to macOS Tahoe 26.2 has introduced a significant enhancement to its machine learning framework that stands to reshape how AI researchers approach complex computational tasks. The update brings Remote Direct Memory Access (RDMA) support over Thunderbolt 5 to Mac cluster computing, allowing multiple Mac devices to pool their memory resources with unprecedented efficiency. This development addresses a critical bottleneck in AI workflows: accessing sufficient memory to run massive large language models without relying on expensive cloud infrastructure.
[1]
The implementation leverages Thunderbolt 5's maximum bandwidth of 80Gb/s, doubling the 40Gb/s available with Thunderbolt 4 and vastly outperforming typical Ethernet-based cluster computing limited to 10Gb/s. More importantly, RDMA enables one CPU node in a cluster to directly read another's memory without requiring significant processing power from the secondary device, effectively creating a unified memory pool across multiple machines.
[1]
YouTuber Jeff Geerling conducted real-world testing using four M3 Ultra Mac Studio units loaned by Apple, collectively worth just under $40,000. The cluster featured a combined 1.5 terabytes of unified memory, with each Mac Studio equipped with a 32-core CPU, 80-core GPU, and 32-core Neural Engine. Running at under 250 watts apiece and remaining "almost whisper-quiet," the compact setup demonstrated the practical viability of desktop-scale AI processing.
[1]
Geerling's benchmarks using Exo 1.0, which supports RDMA, versus Llama.cpp, which does not, revealed striking differences. Testing the Qwen3 235B model showed Exo improving from 19.5 tokens per second on a single node to 31.9 tokens per second across four nodes. In contrast, Llama's performance degraded from 20.4 to 15.2 tokens per second as nodes increased. Similar results emerged with DeepSeek V3.1 671B, where Exo achieved 32.5 tokens per second on four nodes compared to 21.1 on a single node.
[1]
Perhaps most impressive was the successful execution of Kimi K2 Thinking 1T A32B, a one-trillion-parameter model too large for a single Mac Studio with 512GB memory. Across four nodes, Exo delivered 28.3 tokens per second, demonstrating that Apple Silicon clusters can handle models previously accessible only through enterprise-grade infrastructure.
[1]
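A rough memory estimate shows why this model needs pooled memory in the first place. The 4-bit quantization figure below is our assumption, not something the sources state:

```python
# Back-of-the-envelope check of why a ~1T-parameter model overflows a single
# 512GB Mac Studio. The 4-bit weight assumption is ours, not the article's;
# higher-precision weights would need several times more memory.

params = 1_000_000_000_000  # Kimi K2 Thinking: ~1T total parameters
bits_per_param = 4          # assumed aggressive quantization
weights_gb = params * bits_per_param / 8 / 1e9

print(f"Weights alone: ~{weights_gb:,.0f} GB")  # ~500 GB
# Add the KV cache, activations, and macOS itself, and 512GB no longer
# suffices; hence the model only ran once two or more nodes pooled memory.
```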
The unified memory architecture inherent to Apple Silicon creates a distinct advantage for accelerating AI workflows. As Alex Ziskind demonstrated, running machine learning tasks on Apple Silicon proves more cost-effective than using NVIDIA's RTX 4090 for less complex operations. The M4 Pro Mac mini offers 64GB of unified memory compared to the RTX 4090's 24GB, and when multiple Mac mini or Mac Studio devices connect via Thunderbolt 5, this pooled memory scales rapidly.
[2]
The cost comparison becomes compelling when examining equivalent memory configurations. Geerling's four-Mac Studio cluster with approximately 1.5TB of memory cost roughly $40,000. Achieving the same memory capacity by clustering NVIDIA DGX Spark units would require 12 devices at approximately $4,000 each, totaling $48,000—an $8,000 disadvantage for NVIDIA.
[2]
Apple's MLX Distributed Framework, specifically designed to maximize Apple Silicon potential, integrates seamlessly with RDMA to accelerate both model training and inference. The framework supports dense and quantized models, providing flexibility for various AI applications. Combined with Exo 1.0's tensor parallelism—which divides large AI models into smaller segments for simultaneous processing—the ecosystem removes technical barriers that previously made distributed computing inaccessible to many developers and AI researchers.
[3]
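For a taste of what driving MLX's distributed API looks like, here is a hedged sketch. One copy runs per machine, typically started with MLX's mlx.launch helper over a hosts list; exact setup and backend selection vary by MLX version, so treat this as an illustration rather than a verified recipe:

```python
import mlx.core as mx

# Each node contributes a local array; all_sum combines them across every
# machine in the group, the same collective used for distributed training.
world = mx.distributed.init()
print(f"node {world.rank()} of {world.size()}")

local = mx.ones((4,)) * world.rank()
total = mx.distributed.all_sum(local)
mx.eval(total)  # MLX is lazy; force the collective to actually run
print(total)    # identical on every node: the element-wise sum of all inputs
```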
Exo 1.0's user-friendly installer and real-time dashboard provide detailed insights into cluster performance, making distributed machine learning more approachable. This democratization of advanced AI capabilities means researchers and businesses can experiment with and deploy large language models locally, maintaining data control while reducing operational expenses associated with cloud-based solutions.
[3]
Despite these technical advances, Apple's cost advantage faces significant headwinds. The company's long-term agreements with major memory suppliers including Samsung and SK Hynix are set to expire as soon as January 2026. Industry observers anticipate substantial price increases as these suppliers renegotiate terms, potentially eroding or eliminating the $8,000 cost advantage Apple currently holds over comparable NVIDIA configurations.
[2]
This pricing pressure could significantly impact upcoming M5-based Mac mini and Mac Studio devices, potentially making Apple's clustering solution less attractive compared to traditional GPU-based alternatives. For organizations planning long-term AI infrastructure investments, this uncertainty introduces risk into what otherwise appears to be a compelling technical solution.
While Thunderbolt 5 RDMA support represents a substantial leap forward, the technology carries inherent limitations. The absence of a Thunderbolt 5 networking switch means Mac Studio or Mac mini units must be daisy-chained, severely restricting cluster size before network latency degrades performance. This constraint makes Apple's solution most viable for small to medium-scale deployments rather than large enterprise clusters.
[1]
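A toy model illustrates why the daisy-chain topology caps cluster size. The per-hop latency constant is hypothetical, chosen only to make the linear growth of the worst-case path visible:

```python
# Toy latency model for a daisy-chained cluster: with no Thunderbolt 5
# switch, the two end nodes talk through every Mac in between, so the
# worst-case path grows linearly with cluster size. A switch would keep
# it constant. The per-hop figure is hypothetical.

PER_HOP_US = 10  # assumed one-hop latency, in microseconds

for nodes in (2, 4, 8, 16):
    hops = nodes - 1  # worst case: traffic traverses the whole chain
    print(f"{nodes:2d} nodes: worst-case path {hops:2d} hops, ~{hops * PER_HOP_US} us")
```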
AI researchers and developers should monitor how Apple addresses scaling challenges and whether third-party networking solutions emerge. The current implementation works exceptionally well for teams needing to run models that exceed single-device memory capacity but don't require dozens of nodes. As model sizes continue growing and memory prices potentially increase, the window for Apple's cost-effective clustering advantage may narrow, making near-term adoption particularly attractive for organizations seeking alternatives to cloud infrastructure.
Summarized by Navi