2 Sources
[1]
Apple details how it trained its new AI models, see highlights - 9to5Mac
During WWDC25, Apple announced new versions of its on-device and cloud-based foundation models. Now, it has published a tech report detailing how those models were trained, optimized, and evaluated. And the report includes some genuinely interesting under-the-hood tidbits. In a comprehensive document called "Apple Intelligence Foundation Language Models - Tech Report 2025", the company walks through multiple aspects of the new models, including their architecture, data sources, pre-training, post-training, tool use development, optimizations, and benchmarks. It is a very technical, but very worthwhile read if you like to get into the nuts and bolts of this sort of stuff. Here are a few particularly interesting highlights.

We already knew that Apple's on-device model (the one developers will get to tap into) has around 3 billion parameters. Now, the company has detailed that this model is actually divided into two blocks: "Block 1 contains 62.5% of the total transformer layers, while Block 2 contains the remaining 37.5% of the transformer layers, but had the key and value projections removed." In practice, this means that the local model requires 37.5% less memory for caching, and the time it takes to output the first token (basically, a fragment of a word) was also cut by about 37.5%. Still, Apple structured the split in a way that it says preserves the model's overall performance and output quality.

As a side note, a few years ago, Apple published a study that looked at swapping parts of an LLM between RAM and flash storage as needed, in order to pack a local model that was bigger than what would otherwise fit in the device's memory. While Apple ultimately took a different route, it is interesting to note the different ways the company has been experimenting to offer good local performance, even on memory-constrained devices.

For its server model, Apple built a custom architecture that was tailor-made for its Private Cloud Compute platform. It's called Parallel-Track Mixture-of-Experts (PT-MoE), and the way it works is pretty neat. In a nutshell (and at the risk of oversimplifying things), Mixture of Experts is when, instead of relying on one huge AI model, the model is split into smaller subnetworks (or experts) that are only activated when the task is related to something they're... well, an expert in. So if your prompt is about cooking, only cooking-related experts are activated, while others remain dormant. The result is still a massive overall model, but its modular design allows it to respond faster (and often more accurately) than if everything were running through one huge, unified model for every prompt. (IBM has published an eight-minute Mixture of Experts explainer, if you want a deeper dive.)

Apple built a new kind of Transformer called the Parallel Track Transformer, then scaled it up with Mixture of Experts (MoE) layers. That sounds way too complicated, but the gist of it is: traditional Transformers process tokens through a single stack of layers, one after the other. Rather than using this single-track approach to calculate every token, Apple's design splits the model into multiple, parallel tracks. Each track processes tokens independently, and only syncs up at certain points. Then, inside each of those tracks, Apple replaced every other regular transformer layer with an MoE layer, which activates just a few experts for each token, while the rest stay idle.
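To make the "only a few experts per token" idea concrete, here is a toy Mixture-of-Experts layer in PyTorch. It is a generic sketch of the technique, not the PT-MoE layer from Apple's report, and the dimensions, expert count, and top-k value are arbitrary assumptions:

```python
# Toy Mixture-of-Experts layer: a router scores the experts for each token and
# only the top-k experts actually run; the rest stay idle. Generic sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                     # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)           # expert scores per token
        topw, topi = weights.topk(self.k, dim=-1)             # keep only the k best
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                hit = topi[:, slot] == e                      # tokens routed to expert e
                if hit.any():
                    out[hit] += topw[hit, slot, None] * expert(x[hit])
        return out

tokens = torch.randn(10, 64)
print(ToyMoE()(tokens).shape)                                 # torch.Size([10, 64])
```

The property worth noticing is that each token only pays for the k experts it is routed to, which is how a model with a very large total parameter count can stay relatively cheap to run per prompt.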
And because each track has its own local experts, the model avoids the processing bottlenecks that happen when everything has to coordinate across the entire system. Add to that a clever setup that balances local context with big-picture understanding (called Interleaving Global and Local Attention Layers), and the result is a very modular, efficient, and scalable model that's faster and leaner, but still pretty smart.

One of the biggest knocks against the initial rollout of Apple Intelligence was (and still is) limited language support beyond English. With its new models, Apple has expanded language support, and the document details the steps it took in order to do that. According to the document, Apple increased the amount of multilingual data used during training from 8% to 30%. This includes both organic and synthetic content. Apple also expanded its tokenizer (which is basically the model's token vocabulary) by 50%. This means that its model now knows 150K different tokens, up from the previous 100K. The company says that these changes led to "significant gains" in performance across non-English benchmarks, especially after reinforcement learning fine-tuning.

In the document, Apple explains that evaluations were conducted using prompts written by native speakers (rather than translations), and the model was tested on both accuracy and how natural its responses sounded in local contexts. If this sounds familiar, you probably read our recent coverage of an Apple Research study on that topic. In practice, all of this means that features like Writing Tools should work more reliably in the supported languages.

Like with its first models, most of the training data came from crawling the web. But Apple says that its Applebot crawler respects exclusions, meaning that if a website doesn't want Apple to scrape its content, it can say so, and Applebot will leave it alone. That said, here is how Apple says it sourced the data for its new models: publicly available web data collected by Applebot, content licensed from publishers, synthetic data generated with the help of smaller models, and visual data such as image-caption pairs.

There has been no shortage of news on Apple's internal drama, technical struggles, and overall inability to gain the momentum it needs to bridge the gap (which some might call a chasm) between its AI offerings and the competition. All of those are true. Yet, the fact that Apple is largely perceived as being behind on AI doesn't mean the company is standing still. This report offers an interesting insight into the under-the-hood improvements (and shortcomings) of Apple's newest models, along with extensive details on a privacy-conscious approach that few companies are even attempting.
[2]
Despite Its Dip In Popularity, Apple Reveals AI Model Training Tactics - From Mass Web Scraping To Secret Licensing Deals And Synthetic Content
While WWDC mainly revolved around the new visual design language coming to Apple's operating systems, called Liquid Glass, the company also announced the next generation of its AI foundation models, built for both on-device and cloud use. After the event, the tech giant is letting users and the tech community dig into how its models are trained and optimized through an elaborate technical report, which allows for a better understanding of Apple's AI strategy. The company emphasizes in the report that the models were trained with privacy and efficiency at the core.

Despite losing ground in the AI space, Apple has released a detailed report on its foundation models, called "Apple Intelligence Foundation Language Models - Tech Report 2025," which gives in-depth information on the key elements of its latest AI models. The document covers pretty much everything, from the models' architecture to pre-training, post-training, and fine-tuning. It also explores the methods used to make the models more efficient without compromising privacy.

While Apple had previously shared that its on-device model, the one available for developers to use, has about 3 billion parameters, details about its structure were sparse until now. As per the report, the model is split into two parts to boost efficiency. The first part, referred to as Block 1, contains more than 60 percent of the transformer layers, the core building blocks through which the model understands language and generates responses. The second part, Block 2, is lighter because two memory-hungry components, the key and value projections, were removed. Thanks to this design, the model uses about 38 percent less memory and produces its first response faster.

The company has been looking into ways to improve its AI models' local performance for some time; a few years ago, it explored the idea of running a model larger than a device's memory could handle. Although it did not go with that approach, it keeps looking for ways to counter hardware limitations and other challenges.

On the server side, Apple built a custom architecture for its Private Cloud Compute system. The approach is called Parallel-Track Mixture-of-Experts (PT-MoE) and, put simply, it breaks a large AI model into smaller parts called experts. By dividing the model this way, the whole network does not need to run for every request; instead, only the experts relevant to the task at hand are activated, which saves compute and increases efficiency. Apple additionally designed a new Transformer architecture called the Parallel Track Transformer, in which multiple tracks work independently and only come together at key points. Because of this design, the model avoids system-wide bottlenecks.

The Cupertino tech giant has also tackled one of the biggest pain points of Apple Intelligence: limited language support. With its new models, it has meaningfully improved multilingual capabilities.
To expand language support, Apple increased the share of non-English data in its training process from 8 percent to 30 percent, which includes both real and AI-generated content, giving the model a better grasp of a broader range of languages. This should allow features like Writing Tools to work better outside of English.

When it comes to training its new AI systems, Apple relied heavily on web data collected by Applebot, the company's own web crawler, which was also used for previous models. The notable part is that, in keeping with its privacy stance, if a website does not want to be crawled, Apple will not use its content. The company uses multiple techniques to train its models; mainly, public web data serves as the training material, and Apple filters out irrelevant content to focus on datasets that are useful and to the point. Similarly, the tech giant also relies on licensed content from publishers, although it does not disclose the names of the media companies involved. The company also uses smaller models to generate synthetic data, especially for image-language tasks, code, and instruction following, to improve fine-tuning. The multi-pronged approach also involves visual data, as the company has over 10 billion image-caption pairs, including screenshots and handwritten notes, and its own models are used to generate richer captions. All of these training methods help Apple build smarter and more capable models.

Apple's approach to training its AI models is well-articulated. It is a balanced strategy that ensures the system remains powerful and versatile without compromising on its core value: privacy.
Apple has released a detailed technical report on its new AI foundation models, revealing innovative training methods, architectural improvements, and expanded language support, showcasing its commitment to AI development while prioritizing efficiency and privacy.
Apple has released a comprehensive technical report detailing the training and optimization of its latest AI foundation models, showcasing significant advancements in both on-device and cloud-based AI capabilities [1][2]. The report, titled "Apple Intelligence Foundation Language Models - Tech Report 2025," provides insights into the company's innovative approaches to AI development.
Apple's on-device AI model, containing approximately 3 billion parameters, has been strategically divided into two blocks to enhance efficiency [1]:
- Block 1 contains 62.5% of the total transformer layers.
- Block 2 contains the remaining 37.5% of the transformer layers, with the key and value projections removed.
This structure results in a 37.5% reduction in memory requirements for caching and a roughly 37.5% decrease in the time needed to output the first token, while maintaining overall performance and output quality [1].
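For intuition, here is a minimal PyTorch sketch of the two-block idea, assuming, as the report describes, that the later block drops its own key and value projections and attends over the cache produced by the earlier block. The class names, dimensions, and wiring are illustrative guesses, not Apple's implementation:

```python
# Minimal sketch of the two-block split: Block 2 layers have no K/V projections
# and reuse the keys/values produced by Block 1. Sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, H = 256, 4  # illustrative model width and head count

class Block1Layer(nn.Module):
    """Standard self-attention layer: owns Q, K, V projections and a KV cache."""
    def __init__(self):
        super().__init__()
        self.q, self.k, self.v, self.o = (nn.Linear(D, D) for _ in range(4))

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        y = F.scaled_dot_product_attention(
            *(t.view(x.shape[0], -1, H, D // H).transpose(1, 2) for t in (q, k, v)))
        return self.o(y.transpose(1, 2).reshape_as(x)), (k, v)   # hand KV onward

class Block2Layer(nn.Module):
    """KV-free layer: only a query projection; reuses keys/values from Block 1."""
    def __init__(self):
        super().__init__()
        self.q, self.o = nn.Linear(D, D), nn.Linear(D, D)

    def forward(self, x, shared_kv):
        k, v = shared_kv
        q = self.q(x)
        y = F.scaled_dot_product_attention(
            *(t.view(x.shape[0], -1, H, D // H).transpose(1, 2) for t in (q, k, v)))
        return self.o(y.transpose(1, 2).reshape_as(x))

x = torch.randn(1, 16, D)            # (batch, tokens, width)
h, kv = Block1Layer()(x)             # Block 1 produces the only KV cache
out = Block2Layer()(h, kv)           # Block 2 attends without its own K/V
print(out.shape)                     # torch.Size([1, 16, 256])
```

Because keys and values only need to be cached for Block 1's layers (62.5% of the stack), the KV cache shrinks by roughly the 37.5% figure quoted above.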
For its server-side model, Apple has developed a custom architecture called Parallel-Track Mixture-of-Experts (PT-MoE) [1][2]. This innovative approach combines:
- A Parallel Track Transformer, which splits the model into multiple tracks that process tokens independently and sync up only at certain points (a toy sketch of this idea follows below).
- Mixture-of-Experts (MoE) layers, which replace every other regular transformer layer inside each track and activate only a few experts per token.
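Here is that sketch: several small stacks of layers process the same tokens independently and only exchange information at a sync point where their outputs are combined. This is a loose, simplified interpretation for illustration, not the actual PT-MoE wiring, and the track count and layer shapes are invented:

```python
# Parallel tracks that run independently and only coordinate at a sync point.
import torch
import torch.nn as nn

class Track(nn.Module):
    """One independent track: a small stack of feed-forward layers."""
    def __init__(self, dim=64, depth=2):
        super().__init__()
        self.layers = nn.Sequential(*[nn.Sequential(nn.Linear(dim, dim), nn.GELU())
                                      for _ in range(depth)])

    def forward(self, x):
        return self.layers(x)

tracks = nn.ModuleList(Track() for _ in range(4))    # four parallel tracks
x = torch.randn(16, 64)                              # (tokens, dim)

track_outputs = [t(x) for t in tracks]               # tracks run independently
x = torch.stack(track_outputs).mean(dim=0)           # sync point: combine tracks
print(x.shape)                                       # torch.Size([16, 64])
```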
This modular design allows for faster and more efficient processing while maintaining high accuracy. The architecture also incorporates Interleaving Global and Local Attention Layers to balance local context with broader understanding [1].
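The "Interleaving Global and Local Attention Layers" idea is easiest to picture as attention masks. Below is a generic sketch assuming a common interleaving pattern in which some layers see the full causal context while others are restricted to a sliding window; the window size and the alternation pattern are illustrative, not values from the report:

```python
# Alternating global (full causal) and local (sliding-window) attention masks.
import torch

def causal_mask(n):
    """Global causal mask: token i may attend to every token <= i."""
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def local_mask(n, window):
    """Local causal mask: token i may attend only to the last `window` tokens."""
    i = torch.arange(n)
    return (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < window)

n, window = 8, 3
layer_masks = [local_mask(n, window) if layer % 2 else causal_mask(n)
               for layer in range(4)]       # alternate global, local, global, ...
print(layer_masks[0].int())                 # dense lower triangle (global context)
print(layer_masks[1].int())                 # banded lower triangle (local window)
```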
Addressing previous limitations in non-English language support, Apple has significantly improved its multilingual capabilities [1][2]:
- The share of multilingual data used during training was increased from 8% to 30%, covering both organic and synthetic content.
- The tokenizer vocabulary was expanded by 50%, from 100K to 150K tokens.
- Evaluations used prompts written by native speakers rather than translations, judging both accuracy and how natural responses sound in local contexts.
These enhancements have led to substantial improvements in non-English language performance, particularly after reinforcement learning fine-tuning [1].
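As a rough illustration of what moving from 8% to 30% multilingual data means for a training pipeline, here is a toy batch sampler. The two ratios come from the report; the sampler itself and the document names are invented for illustration and are not Apple's data pipeline:

```python
# Toy sampler: draw training batches with a target share of non-English documents.
import random

def sample_batch(english_docs, multilingual_docs, multilingual_share, batch_size=10):
    """Draw a batch where roughly `multilingual_share` of documents are non-English."""
    return [random.choice(multilingual_docs) if random.random() < multilingual_share
            else random.choice(english_docs)
            for _ in range(batch_size)]

english = ["en_doc_%d" % i for i in range(100)]
other = ["multi_doc_%d" % i for i in range(100)]
print(sample_batch(english, other, multilingual_share=0.08))   # old mix: ~8% non-English
print(sample_batch(english, other, multilingual_share=0.30))   # new mix: ~30% non-English
```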
Apple's approach to data collection for AI model training emphasizes diversity and privacy [1][2]:
- Publicly available web data gathered by Applebot, which honors websites' requests to be excluded.
- Content licensed from publishers, whose names Apple does not disclose.
- Synthetic data generated with the help of smaller models, particularly for image-language tasks, code, and instruction following.
- Over 10 billion image-caption pairs, including screenshots and handwritten notes, with Apple's own models generating richer captions.
The company employs filtering techniques to focus on relevant and high-quality datasets, ensuring the models are trained on valuable information [2].
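The "respects exclusions" point is essentially the standard robots.txt mechanism. Below is a small sketch, using Python's built-in robots.txt parser, of the kind of check a crawler performs before fetching a page. The URLs are placeholders and this is not Applebot's code, though "Applebot" is the user-agent string Apple documents for its crawler:

```python
# Check a site's robots.txt rules before crawling a page.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()                                    # fetch and parse the site's rules

page = "https://example.com/articles/some-post"
if robots.can_fetch("Applebot", page):
    print("allowed to crawl:", page)
else:
    print("site opted out, skipping:", page)     # respected exclusion
```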
Throughout the development process, Apple has maintained a strong emphasis on privacy and efficiency [2]:
- Website owners can opt out of having their content crawled and used for training.
- The on-device model is structured to run within tight memory budgets.
- Server-side requests are handled by Apple's Private Cloud Compute platform.
This approach aligns with Apple's core values while still pushing the boundaries of AI capabilities [2].
As Apple continues to advance its AI technologies, these innovations demonstrate the company's commitment to bridging the perceived gap between its offerings and those of competitors in the AI space [1][2].