2 Sources
[1]
A startup claims it broke through a bottleneck that's holding back LLMs
Subquadratic has now shared more details about its new model. But some are still skeptical.
Miami-based AI startup Subquadratic came out of stealth mode last month with a huge claim. It announced that it had solved a mathematical bottleneck that had been holding back large language models for almost a decade. The details were thin, and many people were unconvinced. But Subquadratic has started to bring the receipts, sharing the results of an independent evaluation of its new tech. The results suggest that the company's claims might be worth paying attention to. According to Subquadratic, it has developed a new kind of LLM, called SubQ, that is faster and cheaper and uses a lot less energy than any other model on the market. The company also claims that SubQ is able to process up to 12 times as much text at once as most other models, allowing it to carry out a range of data-heavy tasks, such as analyzing entire code bases or documents that are hundreds of pages long. What's more, Subquadratic says, SubQ does this while more or less matching the performance of the best models put out by Google DeepMind, OpenAI, and Anthropic on key tasks like coding. The problem was that the company at first provided little evidence for its claims beyond a handful of self-published test scores. And it has yet to make SubQ widely available for people to try out themselves. So it's no surprise that Subquadratic's claims were met with skepticism. Dan McAteer, an artificial intelligence engineer, captured the overall response on X: "SubQ is either the biggest breakthrough since the Transformer ... or it's AI Theranos." A month on, the company has published more information about its model, including the results of additional independent tests run by third-party firm Appen. "We expected healthy skepticism," says Subquadratic cofounder and chief technology officer Alex Whedon.
"In hindsight, releasing the third-party benchmarks alongside the initial announcement would have preempted much of the skepticism, which is why we're taking the time to make sure any future results are fully verified before putting them out." Subquadratic asked Appen, which evaluates other companies' models, to run its tests on SubQ. The results seem to back up a lot of Subquadratic's claims. "That was really exciting to me, it validated their architecture," says Jeanine Sinanan-Singh, Appen's director of generative AI research. "I was like, 'Wow, this could be a game changer,' because models struggle with speed and inefficiency," she adds. "But when you have kind of shocking results, it's really not as credible when you say it yourself." SubQ won't replace existing top models across the board, but for certain tasks it could offer huge increases in speed at a fraction of the typical cost. In the long run, though, Subquadratic insists that its breakthrough could change how LLMs are built. "We hope we're kicking off a new age of efficiency," says Justin Dangel, the firm's cofounder and CEO. "We don't think anybody will be building on transformers in a few years."
Attention!
To understand why Subquadratic's claims are a big deal, let's dig into how most LLMs work. The key mechanism inside an LLM is a type of neural network called a transformer, which runs a process known as dense attention. Today's LLMs typically stack many transformer layers on top of one another. (The foundational paper of the LLM era, published by researchers at Google in 2017, was titled "Attention Is All You Need.") Dense attention works like this: When a transformer processes a chunk of text, it first encodes each word (or part of a word, known as a token) with a number. To capture the meaning of the full text, it then multiplies each of those numbers by every other number for that text. For example, a piece of text 10,000 words long contains almost 50 million distinct pairs of words, so it would kick off almost 50 million individual multiplications.
That's a lot of computation, and it's the main reason that LLMs are notorious power hogs. "If you want to summarize The Great Gatsby, you have to look at the first word and the last word together, and then you have to look at every other combination," says Dangel. As the length of the text increases, the number of computations skyrockets. That's because each new token's number must be multiplied by the numbers of all the other tokens in the text. Double the number of words, and you roughly quadruple the number of computations, a rate of increase known as a quadratic expansion. (You can picture this yourself: Draw a circle and mark dots around its edge. Each dot is a token. Then draw lines between pairs of dots to represent the multiplication of those two tokens. A circle with five dots will have 10 lines crossing it. Make it 10 dots and you will have 45 lines, 20 dots and you will have 190 lines, and so on.)
Slashing costs
Subquadratic's solution is to ditch dense attention, the core operation of a transformer, in favor of what's known as sparse attention, which slashes the number of computations needed. Instead of multiplying the number assigned to each token by every other number, sparse attention selects just some of the numbers to multiply. The idea is that not all relationships between words in a piece of text matter. "Sparse attention says not all of those relationships are important, because they're not," says Whedon. "If you're reading a book, you're not going to look at the first and second words, first and third -- that's insane." It's a simple approach, and Subquadratic is not the first to try it. "Pretty much everything under the sun has been attempted," says Will Depue, an independent AI researcher who previously worked at OpenAI. "It's not impossible, but it's akin to running a four-minute mile."
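The cost that sparse attention tries to avoid is easy to check with a little arithmetic. The "almost 50 million" figure for a 10,000-word text matches counting each distinct pair of words once, which gives n(n-1)/2 pairs for n words. A minimal Python sketch (the function names are illustrative, not from any library) reproduces that count and the doubling rule. It simplifies real models: an attention layer also scores each token against itself, and the causal attention used in LLMs scores each token only against earlier ones, so exact counts differ, but the growth is quadratic either way.

```python
def pair_count(n_tokens: int) -> int:
    """Distinct token pairs ("n choose 2"): each token paired once with every other."""
    return n_tokens * (n_tokens - 1) // 2


def growth(short_len: int, long_len: int) -> float:
    """How many times more pairs the longer text has than the shorter one."""
    return pair_count(long_len) / pair_count(short_len)


if __name__ == "__main__":
    print(f"{pair_count(10_000):,}")               # 49,995,000, i.e. "almost 50 million"
    print(f"{growth(10_000, 20_000):.4f}")         # 4.0002: doubling roughly quadruples
    print(f"{growth(1_000_000, 12_000_000):.1f}")  # 144.0: 12x the text, ~144x the pairs
```

The last line applies the same formula to context windows: under dense attention, reading 12 times as much text would mean roughly 144 times as many pairs, which is why very long inputs are where a sparse approach stands to save the most.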
Previous techniques for selecting which numbers to multiply and which to ignore have not produced a mechanism that can capture the meaning of a document as well as dense attention can. Subquadratic claims to have cracked the problem at last. It pitches SubQ as the first sparse-attention LLM that rivals mainstream dense-attention models in performance. "Historically, most mechanisms have used fixed patterns, like always comparing the first word to the fifth," says Whedon. "That's pretty limiting. Language is too sophisticated for that. And so, one of the things that makes our mechanism unique is that we dynamically select which ones are important." The firm won't say exactly how SubQ chooses which words to focus on, but the selection is calculated on the fly and differs for each piece of text the model is given. "That's kind of where the secret sauce is," says Whedon. The upshot is that for certain tasks, SubQ may be faster and cheaper to run than most other models. Appen evaluated SubQ on a handful of standard tests. In a straight-up speed test, which measures how fast a model can run in principle rather than what it can actually do, Appen found that SubQ was 56 times faster than models using FlashAttention, a widely used technique that speeds up standard dense attention rather than replacing it. On LiveCodeBench, a test that looks at how well models perform on competitive coding problems taken from real contests, SubQ scored 89.7%, putting it in the same ballpark as other top coding models. "This model continues to provide frontier-level performance in coding," says Appen's Sinanan-Singh. Subquadratic's claims about cost are harder to verify because SubQ is not yet widely available. According to Dangel, it costs $2,600 to run Anthropic's LLM Opus 4.6 through RULER 128, a test developed by Nvidia to assess a model's ability to retrieve information from large data sets. And SubQ? "It cost us eight dollars," he says. SubQ does seem to be able to handle very large data sets.
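The fixed-versus-dynamic distinction Whedon draws can be illustrated with a toy, pure-Python attention function in which each query token scores only the key positions that a selection rule allows. Three illustrative rules are compared: dense (every key), a fixed sliding window, and a content-dependent top-k rule. The top-k rule ranks keys with a cheap low-dimensional proxy score; that is one generic assumption chosen for illustration, not SubQ's method, which the company has not disclosed. All names here are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, select):
    """Scaled dot-product attention where query i scores only the key
    positions returned by select(i, q). Returns (outputs, scores_computed)."""
    d = len(queries[0])
    outputs, scores_computed = [], 0
    for i, q in enumerate(queries):
        idx = select(i, q)
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in idx]
        scores_computed += len(idx)
        weights = softmax(scores)
        # Weighted average of the selected value vectors.
        outputs.append([sum(w * values[j][c] for w, j in zip(weights, idx))
                        for c in range(len(values[0]))])
    return outputs, scores_computed


def dense_select(n):
    # Dense attention: every query scores every key (n * n scores in total).
    return lambda i, q: list(range(n))


def window_select(n, w):
    # Fixed pattern: each query scores only neighbours within distance w,
    # no matter what the text says.
    return lambda i, q: list(range(max(0, i - w), min(n, i + w + 1)))


def topk_select(keys, k, proxy_dims):
    # Hypothetical content-dependent rule: rank every key with a cheap proxy
    # score (only the first proxy_dims dimensions), then keep the k best.
    # The chosen positions change with each query and each input text.
    def select(i, q):
        ranked = sorted(range(len(keys)),
                        key=lambda j: dot(q[:proxy_dims], keys[j][:proxy_dims]),
                        reverse=True)
        return sorted(ranked[:k])
    return select


if __name__ == "__main__":
    random.seed(0)
    n, d = 64, 8

    def rand_matrix():
        return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

    Q, K, V = rand_matrix(), rand_matrix(), rand_matrix()
    rules = [("dense", dense_select(n)),
             ("fixed window, w=2", window_select(n, 2)),
             ("dynamic top-k, k=5", topk_select(K, 5, 2))]
    for name, rule in rules:
        _, count = attend(Q, K, V, rule)
        print(f"{name}: {count} full query-key scores")
```

With 64 tokens, the script reports 4,096 full scores for dense attention, 314 for the window and 320 for top-k with k = 5. The top-k count leaves out the proxy pass, which still touches every key, so making the selection step both cheap and accurate is the hard part.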
The model has a context window (roughly akin to a working memory) of up to 12 million tokens. Most top models today have context windows of one million tokens. In a demo that Whedon ran for me, he asked SubQ to perform a task that required it to reason about information contained in 400 documents. It responded in seconds. When he gave Perplexity, a popular LLM-powered search engine, the same task, it failed to load all 400 documents. Appen also ran the needle-in-a-haystack test, which assesses how well a model can retrieve specific information buried in a large amount of data. In its report, Appen states that SubQ scored 98% with context windows of six million and 12 million tokens, "sustaining near-perfect long-context retrieval at scales few models are tested at."
Too good to be true?
Despite the high scores, benchmarks paint an incomplete picture of what a model can and cannot do. Testing under very specific conditions is not a substitute for running a model on a wide range of real tasks. Subquadratic is offering SubQ as a model tailored to coding and to searching very large data sets. It says that tens of thousands of potential users have already signed up for early access, including more than 500 enterprise customers. But there's a long waitlist, and the firm has given very few people access so far. Subquadratic's response is that it is a new, small company with limited resources and cannot serve too many people at once. Until more people get their hands on the model and try it out for themselves, some skepticism is justified. One nagging issue is that Subquadratic reused the weights (values set within a model during training that determine how it will behave) from a version of the Chinese open-source model Qwen to bootstrap SubQ, rather than training it from scratch. That's a common thing for model makers to do, but it cuts across Subquadratic's claim that it has fully reinvented how LLMs work.
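Appen's report is only summarized here, so its exact protocol (filler text, needle wording, insertion depths, scoring) is unknown. As a rough illustration of how needle-in-a-haystack tests are commonly built, this sketch hides a known fact at several relative depths in filler text and scores the fraction of answers that contain it. `ask_model` stands in for a call to whatever model is being tested; it, the filler sentences and the pass rule are all hypothetical, and the sketch counts sentences rather than tokens for simplicity.

```python
import random


def build_haystack(needle, filler, n_sentences, depth, seed=0):
    """Return n_sentences of filler with `needle` inserted at a relative
    depth (0.0 = very start, 1.0 = very end)."""
    rng = random.Random(seed)
    sentences = [rng.choice(filler) for _ in range(n_sentences)]
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def score(answers, expected):
    """Fraction of answers that contain the expected value (a simple pass rule)."""
    hits = sum(expected.lower() in a.lower() for a in answers)
    return hits / len(answers)


def run_trials(ask_model, needle, question, expected, filler, n_sentences, depths):
    """Hide the needle once per depth, ask the question, and score the answers."""
    answers = [ask_model(build_haystack(needle, filler, n_sentences, depth), question)
               for depth in depths]
    return score(answers, expected)


if __name__ == "__main__":
    filler = ["The sky was grey.", "The trains ran late.", "The coffee was cold."]
    needle = "The secret code is 7413."

    def fake_model(context, question):
        # Placeholder: a real harness would send context + question to an LLM.
        return "7413" if "7413" in context else "I don't know"

    depths = [i / 10 for i in range(11)]
    print(run_trials(fake_model, needle, "What is the secret code?",
                     "7413", filler, 1_000, depths))  # prints 1.0
```

The stand-in "model" simply searches the text, so it always passes; a real run would swap in an LLM call and use contexts millions of tokens long.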
"They may have built something real and useful," says Depue. "But the public evidence does not yet justify the stronger claim that they have solved the quadratic attention bottleneck." In the meantime, Subquadratic cofounder Whedon insists that making something different was his only option. If you want to build a competitive model, you have to have new ideas, he says: "We're more up against it than OpenAI is."
[2]
A startup says it cracked the bottleneck holding back AI
Subquadratic, a Miami startup, says its new model breaks the 'quadratic attention' bottleneck that has made large language models slow and power-hungry for almost a decade. The claim drew Theranos comparisons. Now independent benchmarks back a lot of it, though doubts remain.
A Miami startup says it has cracked a maths problem that has made AI models slow and power-hungry for almost a decade. The claim was bold enough to draw comparisons with Theranos. Now, though, the company has independent test results that back much of it up. The startup is called Subquadratic. It came out of stealth in May with $29mn in seed funding and a new language model named SubQ. According to the company, SubQ is faster, cheaper, and far less energy-hungry than today's leading models. It can also read up to 12 times as much text at once as most other models.
The decade-old bottleneck
To see why that matters, it helps to know how most large language models work. At their core sits a "transformer", introduced by Google researchers in 2017. The transformer runs a process called dense attention. Dense attention is thorough, but it is expensive. It compares every word in a text with every other word. So when you double the length of the text, the work roughly quadruples. That "quadratic" scaling is the main reason LLMs guzzle so much compute and power.
Subquadratic's fix
Subquadratic's answer is to drop dense attention for "sparse attention". Instead of comparing every word with every other, sparse attention keeps only the pairs that matter. The idea is old, and plenty of teams have tried it. Until now, however, none had matched dense attention's quality. The company says its version finally does. Crucially, it picks which words to focus on dynamically, based on the content rather than a fixed pattern. "That's kind of where the secret sauce is," says co-founder and chief technology officer Alex Whedon.
The receipts
At first, the claims rested on a handful of self-published scores.
Naturally, the reaction was sceptical. One AI engineer summed it up on X: SubQ is "either the biggest breakthrough since the Transformer ... or it's AI Theranos". So the company brought in a third party. It asked Appen, a firm that evaluates other companies' models, to run the tests. The results were striking. On a raw speed test, SubQ ran 56 times faster than FlashAttention, a leading existing method. On a tough coding benchmark, it scored 89.7 per cent, close to the best models around. The cost gap looks just as wide. By the startup's account, running one long-context test on Anthropic's top model costs about $2,600. On SubQ, it says, the same test cost eight dollars.
Still too good to be true?
Even so, there are reasons for caution. Benchmarks are not the same as real-world use. SubQ is also not widely available yet. Tens of thousands have joined the waitlist, but only a handful have access. There is a wrinkle in the origin story, too. Rather than train SubQ from scratch, Subquadratic started from an existing open-weight model and swapped in its new attention method. That is common practice. However, it sits awkwardly next to the claim of fully reinventing how LLMs work. "They may have built something real and useful," says Will Depue, an independent researcher who used to work at OpenAI. "But the public evidence does not yet justify the stronger claim that they have solved the quadratic attention bottleneck."
Why it matters
If the results hold, the payoff is large. Cheaper, faster long-context models could read entire codebases, contract sets, or document troves in one pass. They would also cut the cost and energy of running AI. That prize is one the whole industry is chasing: the economics of AI agents are already spiralling, and other startups, such as Thomas Reardon's Flourish, are attacking efficiency from other angles. Subquadratic, though, is betting the whole field will follow it.
"We don't think anybody will be building on transformers in a few years," says chief executive Justin Dangel.
Subquadratic emerged from stealth with bold claims about solving a decade-old mathematical bottleneck in large language models. The Miami-based AI startup says its SubQ model replaces dense attention with sparse attention, and in an independent speed test it ran 56 times faster than FlashAttention; the company also says it slashes costs dramatically. Third-party tests from Appen back many of its claims, though skeptics say more proof is needed before declaring the quadratic attention problem solved.
Subquadratic, a Miami-based startup, emerged from stealth mode last month with $29 million in seed funding and an audacious claim: it has solved a mathematical bottleneck that has held back large language models for nearly a decade [2]. The company announced its SubQ language model, which it says is faster, cheaper, and far less energy-hungry than competing models from Google DeepMind, OpenAI, and Anthropic [1]. The initial announcement drew immediate skepticism, with one AI engineer framing it on X as an either-or: "SubQ is either the biggest breakthrough since the Transformer ... or it's AI Theranos" [1].
To grasp why Subquadratic's claims matter, it helps to understand how the transformer architecture works. Most LLMs rely on dense attention, the core mechanism of the transformer architecture that Google researchers introduced in 2017. This process encodes each word or token with a number, then multiplies each number by every other number to capture the full meaning of the text [1]. For a 10,000-word text, this kicks off almost 50 million individual multiplications, creating massive computational overhead [1]. The problem intensifies as text length increases: double the words and you roughly quadruple the computations, a rate known as quadratic expansion [2]. This quadratic scaling is the primary reason LLMs consume enormous amounts of power and compute resources.
Source: MIT Tech Review
Subquadratic's approach ditches the dense attention mechanism in favor of sparse attention, which dramatically slashes the number of computations required [1]. Instead of comparing every word with every other word, sparse attention keeps only the pairs that matter most [2]. While the concept isn't new, previous attempts failed to match the quality of dense attention. Subquadratic's version picks which words to focus on dynamically, based on content rather than fixed patterns. "That's kind of where the secret sauce is," says co-founder and chief technology officer Alex Whedon [2]. The company claims SubQ can process up to 12 times as much text at once as most other models, enabling it to handle entire code bases or documents hundreds of pages long [1].
Facing widespread skepticism after initially providing only self-published test scores, Subquadratic brought in Appen, a third-party firm that evaluates AI models, to run independent tests [1]. The results proved striking. On a raw speed test, SubQ ran 56 times faster than FlashAttention, a leading existing method [2]. On LiveCodeBench, a benchmark built from competitive coding problems, it scored 89.7 percent, in the same ballpark as the best coding models available [1]. The cost claims, which come from the company rather than from Appen, are even more dramatic: by the startup's account, running one long-context test on Anthropic's top model costs approximately $2,600, while the same test on SubQ cost just eight dollars [2]. "That was really exciting to me, it validated their architecture," says Jeanine Sinanan-Singh, Appen's director of generative AI research [1].
Despite the independent validation, caution persists among AI researchers. Will Depue, an independent researcher who previously worked at OpenAI, notes that "the public evidence does not yet justify the stronger claim that they have solved the quadratic attention bottleneck" [2]. A key concern centers on SubQ's development approach: rather than training from scratch, Subquadratic started with an existing open-weight model, a version of the Chinese model Qwen [1], and swapped in its new attention method [2]. While this practice is common in the industry, it sits awkwardly alongside claims of fundamentally reinventing how LLMs work. Additionally, SubQ isn't widely available yet: tens of thousands have joined the waitlist, but only a handful have access [2]. Benchmarks don't always translate to real-world performance, making broader testing critical.
If Subquadratic's results hold up under scrutiny, the implications for energy-efficient AI could be substantial. Cheaper, faster models with superior long-context processing could analyze entire codebases, contract sets, or document collections in a single pass [2]. That would address a pressing need, since AI already strains against spiraling economics, particularly for AI agents and other compute-intensive applications. Other startups, such as Thomas Reardon's Flourish, are attacking efficiency from different angles, but Subquadratic believes its approach will reshape the entire field [2]. "We hope we're kicking off a new age of efficiency," says Justin Dangel, the firm's cofounder and CEO. "We don't think anybody will be building on transformers in a few years" [1]. Whether this AI startup can deliver on its ambitious vision remains to be seen, but the independent benchmarks suggest the technology deserves serious attention as the industry seeks alternatives to power-hungry transformer models.
Summarized by Navi
[1]
[2]