Curated by THEOUTPOST
On Sat, 2 Nov, 8:01 AM UTC
2 Sources
[1]
Apple researchers ran an AI test that exposed a fundamental 'intelligence' flaw
Apple just shipped its first Apple Intelligence features and launched new AI-optimized Macs. But for all the AI hype, there are clear limits to the technology's intelligence, and one of them was highlighted by a recent experiment from Apple's AI researchers.

See if you can solve this arithmetic problem: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

If you answered "190," congratulations: You did as well as the average grade school kid by getting it right. (Friday's 44 plus Saturday's 58 plus Sunday's 44 multiplied by 2, or 88, equals 190; see the sketch below.) You also did better than more than 20 state-of-the-art artificial intelligence models tested by an AI research team at Apple. The AI bots, the researchers found, consistently got it wrong.

The research paper explains that even the best and brightest LLMs saw "catastrophic performance drops" when answering simple math problems written out this way, primarily when those problems included irrelevant data, which even schoolchildren quickly learn to disregard. That calls into question AI's current intelligence capabilities.

Based on the variety of tests the research entailed, the paper concludes that current AI models are "not capable of genuine logical reasoning." That may be something we're generally aware of, but it stands as an important cautionary note as more and more trust is placed in AI's "intelligence." AI optimists might assume the problem is an easy fix, but Apple's team disagreed: "Can scaling data, models, or compute fundamentally solve this? We don't think so!"

Ultimately, Apple's paper is not meant to dampen enthusiasm over AI's capabilities, but rather to provide a measure of common sense. AI can perform some tasks as though it's extremely intelligent, but in many ways that "intelligence" isn't what it appears to be. What do you make of Apple's AI findings? Let us know in the comments.
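To make the arithmetic explicit, here is a minimal Python sketch of the kiwi problem above. The variable names are ours, chosen for illustration; the only point is that the remark about smaller kiwis should not change the count.

```python
# Kiwi problem: the clause that five kiwis were smaller than average is a
# distractor and does not affect how many kiwis Oliver has.
friday = 44
saturday = 58
sunday = 2 * friday            # "double the number he did on Friday"

total = friday + saturday + sunday
print(total)                   # 190; the five undersized kiwis still count
```

Models that subtracted the five undersized kiwis arrived at 185 instead, the error described in the second article below.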
[2]
Michael Hiltzik: These Apple researchers just showed that AI bots can't think, and possibly never will
By Michael Hiltzik, Los Angeles Times / Tribune Content Agency

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

If you answered "190," congratulations: You did as well as the average grade school kid by getting it right. (Friday's 44 plus Saturday's 58 plus Sunday's 44 multiplied by 2, or 88, equals 190.) You also did better than more than 20 state-of-the-art artificial intelligence models tested by an AI research team at Apple. The AI bots, they found, consistently got it wrong.

The Apple team found "catastrophic performance drops" by those models when they tried to parse simple mathematical problems written in essay form. In this example, the systems tasked with the question often didn't understand that the size of the kiwis has nothing to do with the number of kiwis Oliver has. Some, consequently, subtracted the five undersized kiwis from the total and answered "185." Human schoolchildren, the researchers posited, are much better at detecting the difference between relevant information and inconsequential curveballs.

The Apple findings were published earlier this month in a technical paper that has attracted widespread attention in AI labs and the lay press, not only because the results are well-documented, but also because the researchers work for the nation's leading high-tech consumer company - and one that has just rolled out a suite of purported AI features for iPhone users.

"The fact that Apple did this has gotten a lot of attention, but nobody should be surprised at the results," says Gary Marcus, a critic of how AI systems have been marketed as reliably, well, "intelligent."

Indeed, Apple's conclusion matches earlier studies that have found that large language models, or LLMs, don't actually "think" so much as match language patterns in materials they've been fed as part of their "training." When it comes to abstract reasoning - "a key aspect of human intelligence," in the words of Melanie Mitchell, an expert in cognition and intelligence at the Santa Fe Institute - the models fall short.

"Even very young children are adept at learning abstract rules from just a few examples," Mitchell and colleagues wrote last year after subjecting GPT bots to a series of analogy puzzles. Their conclusion was that "a large gap in basic abstract reasoning still remains between humans and state-of-the-art AI systems."

That's important because LLMs such as GPT underlie the AI products that have captured the public's attention. But the LLMs tested by the Apple team were consistently misled by the language patterns they were trained on.

The Apple researchers set out to answer the question, "Do these models truly understand mathematical concepts?" as one of the lead authors, Mehrdad Farajtabar, put it in a thread on X. Their answer is no. They also pondered whether the shortcomings they identified can be easily fixed, and their answer is also no: "Can scaling data, models, or compute fundamentally solve this?" Farajtabar asked in his thread. "We don't think so!"

The Apple research, along with other findings about AI bots' cogitative limitations, is a much-needed corrective to the sales pitches coming from companies hawking their AI models and systems, including OpenAI and Google's DeepMind lab. The promoters generally depict their products as dependable and their output as trustworthy.
In fact, their output is consistently suspect, posing a clear danger when they're used in contexts where the need for rigorous accuracy is absolute, say in healthcare applications. Not every use demands that level of accuracy, though. "There are some problems which you can make a bunch of money on without having a perfect solution," Marcus told me. Recommendation engines powered by AI, such as those that steer buyers on Amazon toward products they might also like, are one example. If those systems get a recommendation wrong, it's no big deal; a customer might spend a few dollars on a book he or she didn't like. "But a calculator that's right only 85% of the time is garbage," Marcus says. "You wouldn't use it."

The potential for damagingly inaccurate outputs is heightened by AI bots' natural language capabilities, which let them offer even absurdly inaccurate answers with convincingly cocksure elan. Often they double down on their errors when challenged. These errors are typically described by AI researchers as "hallucinations." The term may make the mistakes seem almost innocuous, but in some applications even a minuscule error rate can have severe ramifications.

That's what academic researchers concluded in a recently published analysis of Whisper, an AI-powered speech-to-text tool developed by OpenAI, which can be used to transcribe medical discussions or jailhouse conversations monitored by corrections officials. The researchers found that about 1.4% of Whisper-transcribed audio segments in their sample contained hallucinations: wholly fabricated statements added to the transcribed conversations, among them portrayals of "physical violence or death ... (or) sexual innuendo" and demographic stereotyping.

That may sound like a minor flaw, but the researchers observed that the errors could be incorporated in official records such as transcriptions of court testimony or prison phone calls - which could lead to official decisions based on "phrases or claims that a defendant never said." Updates to Whisper in late 2023 improved its performance, the researchers said, but the updated Whisper "still regularly and reproducibly hallucinated."

That hasn't deterred AI promoters from unwarranted boasting about their products. In an Oct. 29 tweet, Elon Musk invited followers to submit "x-ray, PET, MRI or other medical images to Grok (the AI application for his X social media platform) for analysis." Grok, he wrote, "is already quite accurate and will become extremely good." It should go without saying that, even if Musk is telling the truth (not an absolutely certain conclusion), any system used by healthcare providers to analyze medical images needs to be a lot better than "extremely good," however one might define that standard.

That brings us to the Apple study. It's proper to note that the researchers aren't critics of AI as such but believers that its limitations need to be understood. Farajtabar was formerly a senior research scientist at DeepMind, where another author interned under him; other co-authors hold advanced degrees and professional experience in computer science and machine learning.

The team plied their subject AI models with questions drawn from a popular collection of more than 8,000 grade school arithmetic problems testing schoolchildren's understanding of addition, subtraction, multiplication and division. When the problems incorporated clauses that might seem relevant but weren't, the models' performance plummeted.
That was true of all the models, including versions of the GPT bots developed by OpenAI, Meta's Llama, Microsoft's Phi-3, Google's Gemma and several models developed by the French lab Mistral AI. Some did better than others, but all showed a decline in performance as the problems became more complex.

One problem involved a basket of school supplies including erasers, notebooks and writing paper. Solving it requires multiplying the number of each item by its price and adding the results together to determine how much the entire basket costs. But when the bots were also told that "due to inflation, prices were 10% cheaper last year," they reduced the cost by 10%. That produces a wrong answer, since the question asked what the basket would cost now, not last year (a worked sketch of the intended calculation appears below).

Why did this happen? The answer is that LLMs are developed, or trained, by feeding them huge quantities of written material scraped from published works or the internet - not by trying to teach them mathematical principles. LLMs function by gleaning patterns in the data and trying to match a pattern to the question at hand. But they become "overfitted to their training data," Farajtabar explained via X. "They memorized what is out there on the web and do pattern matching and answer according to the examples they have seen. It's still a [weak] type of reasoning but according to other definitions it's not a genuine reasoning capability." (The brackets are his.)

That's likely to impose boundaries on what AI can be used for. In mission-critical applications, humans will almost always have to be "in the loop," as AI developers say, vetting answers for obvious or dangerous inaccuracies or providing guidance to keep the bots from misinterpreting their data, misstating what they know, or filling gaps in their knowledge with fabrications.

To some extent, that's comforting, for it means that AI systems can't accomplish much without human partners at hand. But it also means that we humans need to be aware of the tendency of AI promoters to overstate their products' capabilities and conceal their limitations. The issue is not so much what AI can do as what users can be gulled into thinking it can do. "These systems are always going to make mistakes because hallucinations are inherent," Marcus says. "The ways in which they approach reasoning are an approximation and not the real thing. And none of this is going away until we have some new technology."
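To spell out the basket calculation described above: the article does not quote the benchmark problem's actual quantities or prices, so the numbers in this Python sketch are invented for illustration. The only point is that the "prices were 10% cheaper last year" clause should play no role in the answer.

```python
# Illustrative only: the counts and prices below are made up, since the
# article does not give the benchmark problem's exact figures.
basket = {
    "erasers":       {"count": 7, "price": 0.50},
    "notebooks":     {"count": 4, "price": 2.25},
    "writing paper": {"count": 3, "price": 1.75},
}

# Correct approach: multiply each item's count by its current price and sum.
# The inflation clause is irrelevant because the question asks what the
# basket costs now, not last year.
total_now = sum(item["count"] * item["price"] for item in basket.values())
print(f"{total_now:.2f}")        # 17.75 with these made-up numbers

# The failure mode the researchers describe: models applied the 10%
# discount anyway, in effect answering for last year instead of now.
wrong_answer = total_now * 0.9   # 15.975, the kind of error reported
```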
Apple researchers conducted tests revealing significant limitations in AI models' ability to perform simple arithmetic and logical reasoning, raising questions about the true intelligence of current AI systems.
A recent study conducted by Apple's AI research team has exposed a fundamental flaw in the intelligence capabilities of current artificial intelligence models. The research, which has garnered widespread attention in AI labs and the press, demonstrates that state-of-the-art AI models struggle with simple arithmetic problems when presented in a narrative format [1].
The researchers presented AI models with a straightforward arithmetic problem involving kiwi picking. While the average grade school child could easily solve the problem, more than 20 advanced AI models consistently failed to provide the correct answer. The AI systems often misinterpreted irrelevant information, such as the size of some kiwis, leading to incorrect calculations [2].
This study calls into question the true intelligence capabilities of current AI systems. The Apple team concluded that these models are "not capable of genuine logical reasoning," highlighting a significant gap between human and AI cognitive abilities. Even young children can easily distinguish between relevant and irrelevant information, a skill that seems to elude even the most advanced AI models [1].
Experts like Gary Marcus and Melanie Mitchell support Apple's findings, noting that large language models (LLMs) primarily engage in pattern matching rather than true abstract reasoning. This limitation becomes evident when AI systems are presented with problems requiring logical thinking beyond mere language processing [2].
The Apple researchers express skepticism about easy solutions to this problem. They argue that simply scaling up data, models, or computing power is unlikely to fundamentally solve the issue of logical reasoning in AI systems [1].
These findings have significant implications for the deployment of AI in critical areas such as healthcare, where accuracy is paramount. While AI can be useful in certain contexts, such as recommendation engines, its limitations in logical reasoning raise concerns about its reliability in more complex decision-making scenarios [2].
The research also touches on the issue of AI "hallucinations," where systems confidently provide inaccurate information. This phenomenon has been observed in various AI applications, including speech-to-text tools, potentially leading to serious consequences in legal or medical contexts [2].
Apple's research serves as a reality check on the current state of AI technology. While not intended to dampen enthusiasm for AI's potential, it emphasizes the need for a more measured and realistic understanding of AI capabilities. As AI continues to be integrated into various aspects of technology and daily life, recognizing its limitations becomes crucial for responsible development and application [1].
A recent study by Apple researchers exposes significant flaws in the mathematical reasoning capabilities of large language models (LLMs), challenging the notion of AI's advanced reasoning skills and raising questions about their real-world applications.
17 Sources
Apple rolls out its AI features, Apple Intelligence, with a focus on privacy and security. The update brings new capabilities but faces criticism for inconsistent performance and battery drain issues.
4 Sources
Apple's voice assistant Siri lags behind competitors, causing delays in product launches and raising questions about the company's AI strategy. This struggle reflects broader challenges in the consumer tech industry's push for AI integration.
3 Sources
A recent study shows that a majority of Apple and Samsung smartphone users find AI features on their devices to be of little value, raising questions about the future of AI in mobile technology.
15 Sources
Apple's rollout of Apple Intelligence, its AI suite, showcases a measured approach to AI integration. Despite initial limitations, it could normalize AI use and significantly impact user perceptions.
3 Sources