2 Sources
[1]
Analysis | The AI industry is awash in hype, hyperbole and horrible charts
The mockery about "chart crimes" -- big boo-boos with data graphics -- nearly overshadowed the technology upgrades announced by two artificial intelligence start-ups.

During a demonstration Thursday of ChatGPT's newest version, GPT-5, the company showed a visual in which it appeared 52.8 percent was a larger number than 69.1 percent, which also was somehow equal to 30.8 percent. It was the chart that launched a thousand snarky tweets -- or X posts, whatever. Several more times in the demonstration, ChatGPT parent company OpenAI showed confusing or dubious graphics, including others in which a smaller number appeared visually larger than an actually bigger number.

Also last week, the start-up Anthropic showed two bars comparing the accuracy rates of current and previous generations of its AI chatbot, Claude. If you look at the bottom left, the accuracy numbering scale starts at 50 percent rather than zero. That's typically a major no-no among data nerds. Starting the scale halfway to 100 percent magnifies a relatively small difference in accuracy. A long-standing meme shows how a difference of a few inches in height can look comically massive when this type of graphical representation doesn't start at zero. (Or here's a version with pop stars.)

The AI chart crimes were ridiculed on X and on the "Limitless" tech podcast. They became an instant classic on an X account devoted to bad charts, Graph Crimes. Conspiracy theories started that AI generated the botched data visuals. (An OpenAI employee apologized for the "unintentional chart crime," and CEO Sam Altman said on Reddit that staff messed up charts in rushing to get their work done. Asked for further comment, OpenAI referred to Altman's Reddit remarks.)

AI geniuses making oopsies is relatable to us human mortals. But to some data experts and AI specialists, the chart crimes are a symptom of an AI industry that regularly wields fuzzy numbers to stoke hype and score bragging points against rivals.

Chart crime is rampant in Silicon Valley

I showed a couple of the charts from OpenAI and Anthropic to Alberto Cairo, a professor of visualization design at the University of Miami and author of the book "How Charts Lie: Getting Smarter About Visual Information."

"They're terrible," Cairo said.

He wasn't only irked about the basic arithmetic abuses. Cairo also was dubious about OpenAI's and Anthropic's use of graphs for two or three numbers that people could understand without any charts at all. "Sometimes a chart doesn't really add anything," he said. (The Washington Post has a content partnership with OpenAI.)

Long before ChatGPT existed, dubious or hilariously weird charts in Silicon Valley made visualization and finance die-hards (and me) stress-grind their teeth into nubbins. Big technology companies and start-ups love charts that appear to show impressive growth in sales or other business goals, but that have no disclosed scale revealing the numbers behind those graphics. The scale could be as skewed as that of the women's heights meme and you'd never know. To the companies, these charts offer a glimpse of their success without overexposing their finances.
To Cairo -- well, he described charts like those with an expletive and said the companies behind them "should be mocked mercilessly."

Cairo pointed to research that may help explain why companies gravitate to charts: They ooze authority and objectivity, and people may be more likely to trust the information.

Bad charts aren't just about the charts

Jessica Dai, a PhD student at the University of California at Berkeley's AI research lab, said that her big beef with the Anthropic chart was the "hypocrisy," not the off-base scale. The company has previously prodded researchers evaluating AI effectiveness to include what are called confidence intervals, or a range of expected values if a data study is repeated many times. Dai wasn't sure that's the right approach, but also said that Anthropic didn't even follow its own recommendation. If Anthropic had, Dai said, it might have wiped out statistical evidence of an accuracy difference between old and new versions of Claude.

To her and some other AI specialists that I spoke with, misguided charts may point to a tendency in the industry to use confidently expressed but unverified data to boast about the technology or bash competitors. The Post previously found that AI detection companies claiming to be up to 99 percent accurate had largely untested capabilities. Meta was mocked this spring for apparently gaming its AI to boost the company's standings in a technology scoreboard. Rival AI companies squabbled recently over software performance in a math competition for high school students.

"Just because you put a number on it, that's supposed to be more rigorous and more real," Dai said. "It's all over this industry."

Niloofar Mireshghallah, an AI specialist and an incoming Carnegie Mellon University professor, said that people largely evaluate AI not by metrics but on their subjective feel for how useful the technology seems. And the vibes, she said, are actually pretty good for the latest version of ChatGPT.
[2]
OpenAI's performance charts in the GPT-5 launch video are such a mess you have to think GPT-5 itself probably made them, and the company's attempted fixes raise even more questions
What to make of OpenAI's latest GPT-5 chatbot? Let's just say the reception from users has been sufficiently mixed to have OpenAI head honcho Sam Altman posting apologetically on X. And more than once. But one thing we can say for sure: the charts in the launch video were a bizarre mess that OpenAI has since attempted to tidy up, with mixed results.

Most obviously, the claimed SWE-bench performance of GPT-5 versus older models shown on launch day was badly botched. The chart showed accuracy figures of 74.9% for GPT-5, 69.1% for OpenAI o3 and 30.8% for GPT-4o. Problem is, the bar heights were exactly the same for the latter two, giving the at-a-glance impression of total dominance for GPT-5 when in fact it is only marginally superior to OpenAI o3. It's a basic enough mistake that you have to wonder whether OpenAI used, well, GPT-5 itself to make the charts and couldn't be bothered to proof them.

Later on in the video there's another graph showing, not a little ironically, the deception rate of GPT-5. In this chart, it shows the bars for GPT-5 and OpenAI o3. GPT-5 scores a "coding deception" rate of 50%, OpenAI o3's is 47.4%. But the bar for OpenAI o3 is rendered roughly three times higher than that of GPT-5. Now, you could recognise that a lower deception rate is better and make some kind of convoluted argument for therefore making OpenAI o3's bar higher. Apart from the fact that this approach still doesn't account for the large discrepancy in bar height, the problem is that on the same slide OpenAI also shows stats for "CharXiv missing image". And here the bars are accurately proportional to the percentage results, with the 9% for GPT-5 a tiny fraction of the height of the 86.7% for OpenAI o3.

Another wonky chart cooked up by AI? Some kind of subtle satire? Just lazy, sloppy work? The usual adage probably applies, so assuming conspiracy where mere incompetence will suffice is probably unjustified. But it certainly implies a level of complacency that squares with the overall sense of entitlement and lack of rigour and accountability that surrounds the AI industry at large.

OpenAI has since posted some updated charts on its website. The new deception rate chart certainly suggests that a mere mistake was made. The revised stats show GPT-5's coding deception rate at 16.5%, which squares with the bar heights in the launch video. However, while the bar height in the SWE-bench chart has also been corrected, OpenAI added a further disclaimer, pointing out that the figures were achieved using 477 tasks within the SWE-bench suite, not the full 500. That has led some observers to question whether a few inconvenient tasks were left out in order to allow GPT-5 to hit 74.9% and thus edge marginally ahead of the 74.5% score racked up by Anthropic's Claude Opus 4.1 model. Indeed, this raised the eyebrow of none other than Elon Musk, too.

Meanwhile, the original launch video with the messed-up charts is still there on OpenAI's YouTube channel, implying OpenAI isn't all that bothered. Whatever is going on, exactly, it's all fairly unsightly. At best it's unbecoming of an organisation that supposedly produces artificial "intelligence." At worst, it's thoroughly unnerving if managing the dangers of AI, and all our safety, depends on these people.
OpenAI's GPT-5 launch presentation featured misleading charts and data visualization errors, sparking criticism and raising questions about the AI industry's approach to presenting performance metrics.
OpenAI's recent unveiling of GPT-5, the latest version of its AI language model, has been overshadowed by a series of data visualization errors in the presentation charts. The incident has sparked widespread criticism and mockery, raising questions about the AI industry's approach to presenting performance metrics [1].
During the GPT-5 demonstration, OpenAI displayed several confusing and dubious graphics. One chart appeared to show that 52.8% was larger than 69.1%, which was somehow equal to 30.8%. This glaring error quickly became the subject of ridicule on social media platforms [1].
Another chart comparing the "coding deception" rates of GPT-5 and OpenAI o3 showed inconsistent bar heights that did not accurately represent the percentages. The bar for OpenAI o3 (47.4%) was rendered roughly three times higher than that of GPT-5 (50%), despite the latter having the higher percentage [2].
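A quick back-of-the-envelope check in Python (an illustration, not anything from OpenAI's own materials) shows how far off the rendering was: bars drawn to scale from the reported figures would be nearly identical in height.

```python
# Sanity check: in an honest bar chart, bar heights are proportional
# to the values they encode. Using the deception rates reported in
# the launch video, the two bars should be almost the same height.
gpt5_rate = 50.0   # "coding deception" rate reported for GPT-5 (%)
o3_rate = 47.4     # rate reported for OpenAI o3 (%)

# Expected height ratio if drawn to scale: about 0.95, i.e. near-equal.
print(f"o3 bar / GPT-5 bar, drawn to scale: {o3_rate / gpt5_rate:.2f}")
# The video instead rendered o3's bar roughly three times taller.
```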
The chart controversies extend beyond OpenAI. Anthropic, another AI startup, presented a chart comparing accuracy rates of its AI chatbot Claude, where the scale started at 50% instead of zero. This technique, frowned upon by data visualization experts, can magnify small differences and potentially mislead viewers [1].
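To see why a truncated axis draws this kind of criticism, consider the short matplotlib sketch below. The accuracy numbers are placeholders, not Anthropic's actual results: the same two bars look dramatically different depending on where the axis starts.

```python
# Hypothetical illustration of axis truncation; the accuracy numbers
# below are made up for demonstration, not Anthropic's Claude results.
import matplotlib.pyplot as plt

models = ["Previous Claude", "New Claude"]
accuracy = [62.0, 66.0]  # placeholder percentages

fig, (ax_zero, ax_cut) = plt.subplots(1, 2, figsize=(8, 4))

# Left: a zero-based axis shows the ~4-point gap as the modest
# difference it is.
ax_zero.bar(models, accuracy)
ax_zero.set_ylim(0, 100)
ax_zero.set_ylabel("Accuracy (%)")
ax_zero.set_title("Axis starts at 0")

# Right: starting the axis at 50 makes the same gap look dramatic.
ax_cut.bar(models, accuracy)
ax_cut.set_ylim(50, 70)
ax_cut.set_title("Axis starts at 50")

plt.tight_layout()
plt.show()
```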
Alberto Cairo, a professor of visualization design at the University of Miami, described the charts as "terrible." He criticized not only the arithmetic errors but also the unnecessary use of graphs for small sets of numbers that could be easily understood without visual aids [1].
Jessica Dai, a PhD student at UC Berkeley's AI research lab, pointed out the hypocrisy in Anthropic's approach. The company had previously advocated for including confidence intervals in AI evaluations but failed to follow its own recommendation in this instance; had it done so, Dai suggested, the statistical evidence of an accuracy difference between old and new versions of Claude might have disappeared [1].
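For a sense of what Dai is describing, here is a minimal sketch of a standard 95% confidence interval for the difference between two models' accuracy on the same benchmark. The counts are invented for illustration; they are not Anthropic's data. If the interval straddles zero, the apparent improvement is not statistically established.

```python
# Minimal sketch of the check Dai describes: a 95% confidence interval
# for the difference in accuracy between two models on one benchmark.
# The counts are invented for illustration, not Anthropic's data.
import math

def accuracy_diff_ci(correct_new, n_new, correct_old, n_old, z=1.96):
    """Normal-approximation (Wald) 95% CI for a difference of proportions."""
    p_new, p_old = correct_new / n_new, correct_old / n_old
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)
    diff = p_new - p_old
    return diff - z * se, diff + z * se

# Example: new model answers 330/500 tasks correctly, old model 310/500.
low, high = accuracy_diff_ci(330, 500, 310, 500)
print(f"95% CI for the accuracy difference: [{low:.3f}, {high:.3f}]")
# Roughly [-0.019, 0.099]: the interval includes zero, so the apparent
# 4-point improvement would not be statistically established.
```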
These visualization errors have reignited discussions about the AI industry's tendency to use confidently expressed but unverified data to boast about technology or criticize competitors. Previous instances of such behavior include AI detection companies claiming untested high accuracy rates and Meta allegedly gaming its AI to improve its standings in a technology scoreboard [1].
OpenAI has since attempted to address the issues by posting updated charts on its website. However, these revisions have raised further questions. A disclaimer added to the SWE-bench chart revealed that the figures were based on 477 tasks instead of the full 500, leading some observers to speculate whether inconvenient tasks were omitted to edge out competitors [2].
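Some rough arithmetic shows why the task count matters. The solved-task count below is inferred from the reported score, and the all-failures scenario is a hypothetical worst case, not a claim about how the omitted tasks would actually have gone.

```python
# Rough arithmetic on the 477-vs-500 discrepancy. The solved-task count
# is inferred from the reported score; the "all failures" scenario is a
# hypothetical worst case, not a claim about the omitted tasks.
reported_score = 0.749   # GPT-5's reported SWE-bench accuracy
tasks_run = 477          # tasks OpenAI says were evaluated
tasks_full = 500         # size of the full suite

solved = round(reported_score * tasks_run)  # ~357 tasks
print(f"as reported: {solved}/{tasks_run} = {solved / tasks_run:.1%}")

# If the 23 omitted tasks had all been failures, the full-suite score
# would fall to about 71.4%, below Claude Opus 4.1's reported 74.5%.
print(f"worst case:  {solved}/{tasks_full} = {solved / tasks_full:.1%}")
```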
An OpenAI employee apologized for the "unintentional chart crime," while CEO Sam Altman said on Reddit that the errors came from staff rushing to complete their work [1].
The chart controversies surrounding OpenAI's GPT-5 launch have highlighted ongoing issues within the AI industry regarding data representation and transparency. As the field continues to advance rapidly, these incidents serve as a reminder of the need for rigorous standards in presenting AI performance metrics to the public and stakeholders.