3 Sources
[1]
New Apple research could unlock fast-talking Siri
This breakthrough could lead to a significantly faster and more responsive Siri, addressing user complaints about the assistant's sluggish performance. Hopes for a more accurate and functional Siri voice assistant currently lean heavily on a short-term fix: Apple's recently announced partnership with Google to use the latter's Gemini tech to improve its own AI offerings. But in the longer term, a new research paper offers a method that could allow Apple to make Siri faster all by itself.

The paper, Principled Coarse-Grained Acceptance for Speculative Decoding in Speech, was written by five researchers working for Apple and Tel-Aviv University and published late last month (via 9to5Mac). It proposes a new approach that could, in the researchers' words, "accelerate speech token generation while maintaining speech quality."

The key to speed, the researchers argue, is avoiding unnecessary strictness. "For speech LLMs that generate acoustic tokens," they write, "exact token matching is overly restrictive: many discrete tokens are acoustically or semantically interchangeable, reducing acceptance rates and limiting speedups." In other words, at a certain level of similarity, it doesn't matter which of two possible speech tokens is selected, since they sound or mean essentially the same thing, and it wastes time and processing resources to insist on working out which one is right.

The proposed solution is to group acoustically similar tokens together. "We propose Principled Coarse-Graining (PCG), a framework that replaces exact token matching with group-level verification," the paper explains. "We construct Acoustic Similarity Groups (ASGs) in the target model's token embedding space, capturing its internal organization of semantic and acoustic similarity. PCG performs speculative sampling on the coarse-grained distribution over ASGs and carries out rejection sampling at the group level."

The researchers claim this will increase speed without significantly lowering reliability. In experiments (see page 4 of the paper), increasing the number of tokens per second slightly lowers accuracy, but far less than with standard speculative decoding. The paper is rather technical, but it's not very long; check out the PDF to read the whole thing.
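The group-level acceptance the paper describes can be sketched concretely. Below is a minimal, hypothetical Python illustration, not Apple's implementation: the grouping `group_of`, the toy distributions, and all names are invented for the example. It shows the core move: verify the drafted token's group rather than the exact token, then resample the concrete token within the accepted group.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of 6 speech tokens, partitioned into 3 Acoustic
# Similarity Groups (hypothetical grouping, for illustration only).
group_of = np.array([0, 0, 1, 1, 2, 2])  # token index -> group id

def group_prob(token_probs, gid):
    """Coarse-grained probability: sum of member-token probabilities."""
    return token_probs[group_of == gid].sum()

def accept_draft(draft_token, draft_probs, target_probs):
    """Group-level rejection sampling: accept the drafted token if its
    ASG passes the coarse-grained test; on acceptance, resample the
    concrete token from the target's distribution within that group."""
    gid = group_of[draft_token]
    p_g = group_prob(target_probs, gid)  # target prob of the group
    q_g = group_prob(draft_probs, gid)   # draft prob of the group
    if rng.random() < min(1.0, p_g / q_g):
        members = np.flatnonzero(group_of == gid)
        within = target_probs[members] / target_probs[members].sum()
        return members[rng.choice(len(members), p=within)]
    return None  # group rejected: fall back to the target model

# Toy distributions: draft and target disagree on the exact token but
# broadly agree on which group is likely.
draft_probs  = np.array([0.05, 0.45, 0.20, 0.10, 0.10, 0.10])
target_probs = np.array([0.40, 0.15, 0.15, 0.10, 0.10, 0.10])

print(accept_draft(draft_token=1, draft_probs=draft_probs,
                   target_probs=target_probs))
```

In this toy example, exact token matching would accept draft token 1 only about a third of the time (0.15/0.45), while the group-level test always accepts it, because the two models agree that the group as a whole is likely.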
[2]
AppleInsider.com
Apple Intelligence researchers are proposing a new approach to text-to-speech that would make Siri quicker to respond. That might also make conversations flow more naturally.

Apple may have lost the odd AI researcher, but it continues to publish significant papers on the topic. It has previously published work on limiting AI from taking actions a user didn't approve, and on preventing hallucinations. Now, in a study called "Principled Coarse-Grained Acceptance for Speculative Decoding in Speech," researchers from Apple and Tel Aviv University have focused on text-to-speech applications.

In AI, speech is sometimes generated from tokens, or very short samples of sound. These are phonetic sounds, measured in milliseconds, which are then assembled into sentences. We've all heard Siri in Apple Maps give the odd peculiar pronunciation of a place or road name, and that comes down to which phonetic sounds were chosen. Maps directions have to be delivered on time if they're to be of any use, so the speed of generating speech is crucial. It's also important in other circumstances, where a prompt response helps conversational exchanges.

What Apple's researchers argue in this new paper is that the process of taking text and looking up the best speech token can be done faster than at present. They argue that previous methods, which work through every token using autoregression (narrowing down results as the search continues), are not optimal. Apple says that working through each token this way means the process ignores "acoustic similarity" and also risks "erroneous acceptances."

In this proposal, Apple suggests replacing this exact token-matching system and instead looking first for what the researchers define as Acoustic Similarity Groups (ASGs). Claiming "two key innovations," Apple says that ASGs contain "perceptually similar sounds," and that a sound can belong to multiple, overlapping groups. Using probabilities, such a text-to-speech system can narrow the search down to a smaller set of tokens. Within the relevant ASGs, the process can use autoregression to eliminate the wrong sounds within each group, then use probabilities to select, from across the groups, what should be the most accurate speech token for its spoken response.

Apple argues that its full process is faster "while better preserving generation quality" than previous models. What this should mean is that conversations with systems such as Siri would flow more quickly. The speed difference is unlikely to be enormous, but humans are used to human speech, and delays are noticeable. The paper does not focus on improving how natural a text-to-speech system sounds, but speed would help.

Separately, Apple researchers have long been looking at ways to improve how Siri's spoken responses could be tailored to suit a user's preferences or environment.
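The paper says ASGs are constructed in the target model's token embedding space. As a rough picture of what that could mean (the paper's actual construction may differ, and everything named here is a stand-in), clustering the embedding table is one simple way to obtain such groups:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for a speech LLM's token embedding table: 1,000 acoustic
# tokens, each a 64-dimensional vector (random here; in practice these
# would come from the trained target model).
embeddings = rng.normal(size=(1000, 64))

# Group tokens whose embeddings sit close together. Plain k-means is
# just one illustrative way to carve up the embedding space.
n_groups = 64
labels = KMeans(n_clusters=n_groups, n_init=10,
                random_state=0).fit_predict(embeddings)

# group_of[token_id] -> ASG id, the lookup a decoder would use at
# generation time.
group_of = labels
print("tokens in group 0:", np.flatnonzero(group_of == 0)[:10])
```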
[3]
Apple Researchers Figure Out A Way To Unlock Faster, More Natural-Sounding Conversations With Siri
Apple might have adopted a Google Gemini crutch to compensate for its own AI-related shortcomings, but that has not stopped researchers at the Cupertino giant from exploring novel ways to make Siri noticeably better. Now, a new research paper from Apple researchers aims to unlock faster, more natural-sounding responses from Siri.

AI models typically generate speech from tokens, short snippets of phonetic sound often spanning just milliseconds. The model then selects which phonetic sound (speech token) to use in its responses by employing autoregression. This approach, however, introduces an inherent response delay, along with the occasional weird-sounding pronunciation, given the limited number of phonetic snippets used to train that particular AI model.

In a new study, Apple researchers argue that replacing the current token-matching system with one that employs Acoustic Similarity Groups (ASGs) could lead to faster, more natural-sounding responses from Siri. ASGs band speech tokens together based on how perceptually similar they sound, with inevitable overlap between some ASGs. By then employing probabilistic search and autoregression within ASGs, a given AI model can arrive at the most appropriate speech token much faster.

While not groundbreaking in any particular sense, the paper does show Apple's continuing focus on improving its own AI and machine-learning capabilities. The effort also serves as a testament of sorts to Apple's overarching ambition to eventually adopt a holistically bespoke AI solution for its devices, doing away with third-party crutches such as Google's Gemini models.
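Because the groups can overlap, membership is not a single hard label per token. One toy way to get overlapping groups, purely illustrative and not taken from the paper (the anchors and threshold are invented), is to admit into each group every token whose embedding is similar enough to an anchor token:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy token embeddings (8 tokens, 4 dims), normalized so the dot
# product is cosine similarity.
emb = rng.normal(size=(8, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Seed each group on one "anchor" token, then admit every token whose
# cosine similarity to the anchor clears a threshold. A token can
# clear several anchors' thresholds, so groups overlap, as the
# article describes.
anchors, threshold = [0, 3, 5], 0.3
groups = {a: set(np.flatnonzero(emb @ emb[a] >= threshold))
          for a in anchors}

for a, members in groups.items():
    print(f"ASG anchored at token {a}: {sorted(members)}")
```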
Apple researchers have published a study proposing a new text-to-speech approach that could make Siri noticeably faster to respond. The research introduces Acoustic Similarity Groups (ASGs) to accelerate speech token generation while maintaining quality, potentially addressing long-standing complaints about the voice assistant's sluggish response times.
Apple researchers have unveiled a promising approach to improving Siri's performance through faster, more natural-sounding conversations [1]. The research paper, titled "Principled Coarse-Grained Acceptance for Speculative Decoding in Speech," was published late last month by five researchers from Apple and Tel-Aviv University [1]. This development comes as the company seeks long-term solutions beyond its recently announced Google Gemini partnership to enhance Apple Intelligence capabilities [3].
Source: AppleInsider
The breakthrough could address persistent user complaints about Siri's response time, a critical factor in making voice assistant interactions feel more human-like [2]. While speed differences may not be enormous, humans are sensitive to delays in conversational exchanges, making even incremental improvements noticeable [2].

The core innovation centers on text-to-speech technology and how AI models generate spoken responses. Current systems rely on speech tokens: phonetic sounds measured in milliseconds that are assembled into sentences [2]. AI models typically select these tokens using autoregression, which narrows down results as the search continues but introduces inherent response delays [3].
Source: Wccftech
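For context, the baseline being sped up looks roughly like this: plain autoregressive decoding commits one speech token per full model call, so latency grows with the length of the utterance. The snippet below is a toy sketch with a dummy model, not any real Siri component; `target_model` and the vocabulary size are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 256  # toy acoustic-token vocabulary size

def target_model(tokens):
    """Stand-in for the speech LLM: returns a next-token distribution.
    A real model would run a full transformer forward pass here."""
    probs = rng.random(VOCAB)
    return probs / probs.sum()

def generate_autoregressive(prompt_tokens, n_new):
    """Plain autoregression: one model call per generated speech token,
    so latency scales linearly with the length of the utterance."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        probs = target_model(tokens)        # one full forward pass
        tokens.append(int(probs.argmax()))  # commit a single token
    return tokens

print(generate_autoregressive([1, 2, 3], n_new=5))
```

Speculative decoding attacks exactly this bottleneck by letting a small draft model propose several tokens that the large model verifies in one pass; the paper's contribution is making that verification less strict.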
Apple's researchers argue that exact token matching is overly restrictive for speech LLMs that generate acoustic tokens. Many discrete tokens are acoustically or semantically interchangeable, meaning that at a certain level of similarity, it doesn't matter which of two possible speech tokens is selected, since they sound or mean essentially the same thing [1]. The current approach wastes time and processing resources insisting on determining which token is precisely right [1].

The proposed solution groups acoustically similar tokens together into Acoustic Similarity Groups (ASGs). These groups contain perceptually similar sounds, and tokens can belong to multiple, overlapping groups [2]. The researchers introduce Principled Coarse-Graining (PCG), a framework that replaces exact token matching with group-level verification [1].

PCG constructs ASGs in the target model's token embedding space, capturing its internal organization of semantic and acoustic similarity. The framework performs speculative sampling on the coarse-grained distribution over ASGs and carries out rejection sampling at the group level [1]. Using probabilities, the text-to-speech system narrows the search to a smaller set of tokens, then uses autoregression to further eliminate incorrect sounds within each group before selecting the most accurate speech token [2].
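Concretely, "rejection sampling at the group level" maps onto the standard speculative-sampling acceptance rule, lifted from individual tokens to ASGs. The following is the generic form of that rule, writing p for the target model's distribution and q for the draft model's; the paper's exact formulation may differ:

```latex
% Standard speculative sampling accepts a drafted token x with
% probability min(1, p(x)/q(x)). PCG applies the same test one level
% up, to the token's Acoustic Similarity Group g(x), using the
% coarse-grained (summed) probabilities:
\[
  P(g) = \sum_{x' \in g} p(x'), \qquad
  Q(g) = \sum_{x' \in g} q(x'), \qquad
  \Pr\big[\text{accept } g(x)\big]
    = \min\!\left(1, \frac{P(g(x))}{Q(g(x))}\right).
\]
% On acceptance, the concrete token is resampled within the group
% from the target model's own distribution.
```

The intuition behind the speedup: coarse-graining can only shrink the mismatch (total-variation distance) between the draft and target distributions, and that mismatch is exactly what drives rejections, so verifying groups instead of exact tokens raises acceptance rates.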
The researchers claim their approach can accelerate speech token generation while maintaining speech quality [1]. In experiments detailed on page 4 of the paper, increasing the number of tokens per second slightly lowers accuracy, but far less than with standard speculative decoding [1]. Apple argues that its full process is faster "while better preserving generation quality" than previous models [2].

This research demonstrates Apple's continuing focus on improving its own AI and machine-learning capabilities [3]. The effort serves as evidence of Apple's overarching ambition to eventually adopt a holistically bespoke AI solution for its devices and move away from third-party dependencies such as Google's Gemini models [3]. While the paper does not focus explicitly on making text-to-speech sound more natural, faster responses would help conversations flow more naturally [2].