4 Sources
[1]
GPT-5 bombed my coding tests, but redeemed itself with code analysis
With the big news that OpenAI has released GPT-5, the team here at ZDNET is working to learn about and communicate its strengths and weaknesses. In another article, I put its programming prowess to the test and came up with a less-than-impressive result.
Also: I tested GPT-5's coding skills, and it was so bad that I'm sticking with GPT-4o
When Deep Research first appeared with the OpenAI o3 LLM, I was quite impressed with what it could understand from examining a code repository. I wanted to know how well it understood the project just from the available code. In this article, I'm examining how well the three GPT-5 variants do in examining that same code repository. We'll dig in and compare them. The results are quite interesting. Here are the four models.
I gave all four models the same assignment. I connected them to my private GitHub repository for my open-source free WordPress security plugin and its freemium add-on modules, selected Deep Research, and gave them this prompt:
Examine the repository and learn its structure and architecture. Then report back what you've learned.
For the models that asked me to choose areas of detail, I gave them this prompt:
Everything you can tell me, be as comprehensive as possible.
As you can see, I didn't provide any context other than the source code repo itself. That code has a README file, as well as comments throughout the code, so there was some English-language context. But most of the context has to be derived from the folder structure, file names, and code itself.
Also: The best AI for coding in 2025 (and what not to use)
From that, I hoped that the AIs would assess its structure, quality, security posture, extensibility, and possibly suggest improvements. This should be relevant to ZDNET readers because it's the kind of high-judgement, detail-oriented work that AIs are being used for. It certainly can make coming up to speed on an existing coding project easier, or at least provide a foundation for initial understanding.
Other than the two prompts above, I didn't give the LLMs any guidance about what to tell me. I wanted to see how they evaluated the repository and what sort of analysis they could provide.
As you can see from this table, overall coverage was quite varied in scope. More checks mean more depth of coverage. To create this aggregate, topics like "Project Purpose & Architecture," "System Architecture," and "Plugin Design & Integration" were all normalized under Purpose/Architecture. Directory/File Structure contained any section mapping folders and files. Execution flow combines anything about how the software code runs. Recommendations/Issues combines all discussions of modernization suggestions, open issues, and minor red flags.
In terms of overall value, I'd rank the four LLMs as follows (from best to worst). Pro, of course, is only available in the $200/mo ChatGPT Pro tier. Later in this article, I'll show one way to modify the above prompts to get GPT-5 (non-Pro) to provide a fairly close approximation of the overall depth of the Pro response. GPT-5 Thinking, which is a model available in the $20/mo Plus plan, was the least helpful of the group. The GPT-4 generation o3 Deep Thinking model still holds up, but you can see how its self-directed focus is a bit different from the other two.
Also: Google's Jules AI coding agent built a new feature I could actually ship - while I made coffee
My main conclusion is that I was a bit surprised about how close the models were to each other.
GPT-5, as OpenAI promised, did seem to provide a jump in overall cognition and usefulness, but nothing I would consider game-changing. With that, let's dive into some specific examples that help illustrate my conclusions. Each of these sections is pulled from the various reports generated and shows you how each model provided similar information.
Below, you can see how GPT-5 Pro names exact mechanisms (like file-top guard, nonces, manage_options). I've provided exact snippets here from the reports generated by all four models. GPT-5 affirms best practices but keeps it conceptual. o3 describes what happens (redirects/login flow) more than how it's hardened. GPT-5 Thinking gives the clearest "what runs when" story for new developers coming up to speed on the repo.
"It guards against direct file access (if (!defined('ABSPATH')) exit; at the top of PHP files). It sanitizes input where appropriate ... using WordPress nonces in AJAX handlers. It uses capability checks (e.g., adding menu pages only for users with manage_options capability ... only admins see those tools). The code tries not to load unnecessary things ... like only loading certain admin files on certain contexts."
"There are checks for WordPress functions before use ... so the plugin behaves gracefully even on very old WordPress setups. The plugins often guard against direct file access by checking ... to prevent security issues from accessing .php files directly. Add-ons verify the presence of core before proceeding ... and show an admin error if CMB2 isn't loaded."
"Purpose: My Private Site locks down an entire site so only logged-in users can view content ... while protecting the rest. Overall architecture: [it] integrates deeply with WordPress's hook system and login/logout events to manage redirects and track login state."
"Admin vs Front-end: It checks is_admin() to determine context. If on the front-end (not admin), it retrieves the saved privacy setting and, when enabled, hooks at a point like template_redirect to redirect unauthorized visitors. Throughout this initialization, the plugin uses WordPress hooks (actions and filters) to integrate functionality."
GPT-5 Pro didn't just describe the system; it walked through the process in sequential operational steps, almost like a short runbook you could hand to a developer or QA tester. GPT-5 confirms the architecture but abstracts the plumbing. GPT-5 Thinking adds a helpful "how add-ons plug into the Licenses tab" detail. o3 largely leaves licensing internals on the cutting room floor in favor of a fairly unhelpful modernization critique.
"The core plugin provides utility functions to get and store license keys in a centralized option (jr_ps_licenses) and to contact the EDD license server for validation. Each extension plugin defines its own updater using EDD_SL_Plugin_Updater, passing the current version, the license key from the centralized store, and the EDD store URL. The core plugin's UI has a 'Licenses' tab, and extensions inject their own license fields via filters."
"License integration: The core plugin centralizes license management ... and the add-ons piggyback on the core's licensing mechanism, integrating their license fields into the core plugin's interface."
The o3 report spends most of its time on modernization and architecture. It discusses configuration and update behavior but does not walk through option keys, updater classes, or the Licenses UI wiring with the same procedural detail as GPT-5 and GPT-5 Pro.
So there's nothing here to quote as a demonstration.
"The add-ons heavily rely on hooks provided by core or WordPress: They use add_filter/add_action calls to insert their logic ... and use WordPress action hooks to integrate their license fields into the Licenses tab that the core plugin triggers when building the Licenses tab."
Both GPT-5 Pro and GPT-5 explicitly pointed out how my code uses "one option array + prune + no-op writes," which is a WordPress best practice for code maintainability. Both o3 and GPT-5 Thinking describe the lifecycle and effects (what's initialized, what loads when) rather than the exact option structure.
"Settings are stored in a single serialized option ... initialization routines add default keys, prune deprecated ones, and only update the option in the database if there is an actual change, avoiding unnecessary writes."
"State Management: Plugin settings are stored in WordPress options as a central settings array and the code ensures defaults are applied while removing deprecated ones on each load, but only writes to the database when changes occur."
"The main plugin initializes defaults (installed version, first-run timestamp, etc.). On each run it ensures these options exist and, if the privacy feature is disabled, the enforcement hook is not added."
"Module includes: includes admin and common modules in the back-end; on the front-end it retrieves the saved privacy setting and, when enabled, loads enforcement logic (e.g., in template_redirect). It registers a deactivation hook to clean up on deactivation (e.g., deleting a flag option)."
I was unimpressed with GPT-5 when it came to my coding tests. It failed half of my tests, an unprecedentedly bad result for what has previously been the gold standard in passing coding tests. But GPT-5 was quite impressive in its analysis of the GitHub repository. It could be a powerful tool for onboarding new programmers, for someone adopting code, or simply for coming back up to speed on a project that's been untouched for a while.
Also: How I test an AI chatbot's coding ability - and you can, too
The GPT-4 generation o3 model is known to be a strong reasoning model, which is why it has been the basis for ChatGPT Deep Research. But GPT-5 was able to combine both breadth and detail, which is where o3 and GPT-4o were weak in previous tests. The older models did give accurate summaries and useful suggestions, but they missed interconnections. For example, the older models were never able to show how UI flows, licensing, and update mechanisms work together.
Even the base version of GPT-5 was able to identify cross-cutting concerns without additional prompting. Repository structure, backward compatibility, performance characteristics, and state management patterns all appeared in the first draft. Trying to get GPT-4 to span subjects is often an exercise in deep frustration. I found GPT-5's ability to understand and explain a complex interconnected system like my security product, all in one pass, to be a substantial improvement over the GPT-4 generation.
Maybe. If you're in a real rush to get to know a project and want as much of a data dump as possible as quickly as possible, yes. If you're operating on a big programming budget and $200/mo doesn't matter to you, yes. But I find that cost hard to bear, especially when I have to subscribe to a wide range of AI services to evaluate them. So, now that I'm nearing the end of my one-month test of Pro-level activities, I'm planning on downgrading back to the $20/mo Plus plan.
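To make the report excerpts above a bit more concrete, here is a minimal sketch of the hardening pattern the models were describing: a file-top guard against direct access, a capability-gated admin page, and a nonce check in an AJAX handler. This is an illustration only; the function, action, and option names (example_*) are hypothetical and are not taken from the actual plugin.
<?php
// Guard against direct file access: bail out unless WordPress loaded this file.
if ( ! defined( 'ABSPATH' ) ) {
	exit;
}

// Register the settings page only for users with the manage_options capability.
add_action( 'admin_menu', function () {
	add_options_page(
		'Example Settings',   // page title
		'Example Settings',   // menu title
		'manage_options',     // only admins see this page
		'example-settings',   // menu slug (hypothetical)
		function () {
			echo '<div class="wrap"><h1>Example Settings</h1></div>';
		}
	);
} );

// AJAX handler: verify a nonce (created elsewhere with wp_create_nonce) and
// re-check capability before doing anything.
add_action( 'wp_ajax_example_save', function () {
	check_ajax_referer( 'example_save_action', 'nonce' );   // rejects forged requests
	if ( ! current_user_can( 'manage_options' ) ) {
		wp_send_json_error( 'Insufficient permissions', 403 );
	}
	$value = sanitize_text_field( wp_unslash( $_POST['value'] ?? '' ) );
	update_option( 'example_setting', $value );
	wp_send_json_success();
} );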
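Similarly, the "one option array + prune + no-op writes" pattern both GPT-5 reports singled out, together with the front-end enforcement hook the excerpts describe, looks roughly like the sketch below. Again, this is a hedged illustration under assumed names (example_settings stands in for the plugin's real option), not the plugin's actual implementation.
<?php
// Load the plugin's single settings array: apply defaults, prune deprecated
// keys, and write back only when something actually changed (no no-op writes).
function example_load_settings() {
	$defaults = array(
		'private_site'  => false,   // is the whole site locked down?
		'landing_page'  => 'home',  // where visitors go after logging in
		'installed_ver' => '1.0',
	);
	$deprecated = array( 'old_flag', 'legacy_redirect' );   // keys removed in newer versions

	$saved    = get_option( 'example_settings', array() );
	$settings = array_merge( $defaults, is_array( $saved ) ? $saved : array() );

	// Prune keys that no longer exist in the current version.
	foreach ( $deprecated as $key ) {
		unset( $settings[ $key ] );
	}

	// Only touch the database when the stored value would actually change.
	if ( $settings !== $saved ) {
		update_option( 'example_settings', $settings );
	}
	return $settings;
}

// Front end: enforce privacy at template_redirect, as the reports describe.
add_action( 'template_redirect', function () {
	if ( is_user_logged_in() ) {
		return;
	}
	$settings = example_load_settings();
	if ( ! empty( $settings['private_site'] ) ) {
		auth_redirect();   // send anonymous visitors to the login screen
	}
} );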
Also: How to use GPT-5 in VS Code with GitHub Copilot
Pro's edge over GPT-5 wasn't about knowing more facts; it was about delivering those facts in a form you can act on immediately. The Pro report didn't just explain that security looked good; it cited the exact guards and checks in the code. It didn't just say licensing was centralized; it mapped the exact functions and database options involved. Again, if you're on a time crunch, you might consider Pro. But I also think you can get the base GPT-5 to produce detail like the Pro report did, simply by using better prompting. That's next...
I fed both the GPT-5 and GPT-5 Pro reports into GPT-5 and asked it for a prompt that would push the base-level GPT-5 to match GPT-5 Pro's comprehensiveness. This is that prompt, which you should add to any query where you want more complete coding information:
High-Specificity Technical Mode: In your answer, combine complete high-level coverage with exhaustive implementation-level detail.
This worked fantastically well. It took GPT-5 12 minutes to produce a 15,477-word document, complete with analysis and code blocks. For example, it describes how value initialization is done, and then shows the code that accomplishes it. I think you could fine-tune this prompt and get Pro-level results without having to pay the $200/mo fee. I'm certainly going to tinker with this idea, possibly using GPT-5 to refine the specifications in the prompt for different areas I want to delve deeply into. I'll let you know how it goes.
I had some difficulty setting up sharing for each of these long reports, so I just copied the results into Google Docs and shared them. Here are the links if you want to look at any of these reports. You are welcome to dig into these documents and learn how my project is structured. While you may or may not care about my project, it's instructive to see how the various models perform. While you can read the reports, my actual repo is restricted since it's my private development repository.
What about you? Have you tried using GPT-5 or GPT-5 Pro to analyze your own code? How did its insights compare to earlier models like GPT-4 or o3? Do you think the $200/month Pro tier is worth it for the extra precision, or could you get by with better prompts in the base version? Have you found AI code analysis useful for onboarding, refactoring, or improving security? Let us know in the comments below.
[2]
I went hands-on with ChatGPT Codex and the vibe was not good - here's what happened
ChatGPT Codex wrote code and saved me time. It also created a serious bug, but it was able to recover. Codex is still based on the GPT-4 LLM architecture.
Well, vibe coding this is not. I found the experience to be slow, cumbersome, stressful, and incomplete. But it all worked out in the end.
ChatGPT Codex is ChatGPT's agentic tool dedicated to code writing and modification. It can access your GitHub repository, make changes, and issue pull requests. You can then review the results and decide whether or not to incorporate them.
Also: How to move your codebase into GitHub for analysis by ChatGPT Deep Research - and why you should
My primary development project is a PHP and JavaScript-based WordPress plugin for site security. There's a main plugin available for free, and some add-on plugins that enhance the capabilities of the core plugin. My private development repo contains all of this, as well as some maintenance plugins I rely on for user support. This repo contains 431 files.
This is the first time I've attempted to get an AI to work across my entire ecosystem of plugins in a private repository. I previously used Jules to add a feature to the core plugin, but because it only had access to the core plugin's open source repository, it couldn't take into account the entire ecosystem of products.
Earlier last week, I decided to give ChatGPT Codex a run at my code. Then this happened. On Thursday, GPT-5 slammed into the AI world like a freight train. Initially, OpenAI tried to force everyone to use the new model. Subsequently, they added legacy model support when many of their customers went ballistic.
I ran GPT-5 against my set of programming tests, and it failed half of them. So, I was particularly curious about whether Codex still supported the GPT-4 architecture or would force developers into GPT-5. However, when I queried Codex five days after GPT-5 launched, the AI responded that it was still based on "OpenAI's GPT-4 architecture." I took two things from that:
With that, here is the result of my still-very-much-not-GPT-5 look at ChatGPT Codex.
My first step was asking ChatGPT Codex to examine the codebase. I used the Ask mode of Codex, which does analysis, but doesn't actually change any code. I was hoping for an analysis as deep and comprehensive as the one I received from ChatGPT Deep Research a few months ago, but instead, I received a much less complete analysis.
I found a more effective approach was to ask Codex to do a quick security audit and let me know if there were any issues. Here's how I prompted it:
Identify any serious security concerns. Ignore plugins Anyone With Link, License Fixer, and Settings Nuker. Anyone With Link is in the very early stages of coding, and is not ready for code review. License Fixer and Settings Nuker are specialty plugins that do not need a security audit.
Codex identified three main areas for improvement. All three areas were valid, although I am not prepared to modify the serialization data structure at this time, because I'm saving that for a whole preferences overhaul. The $_POST complaint is managed, but with a different approach than Codex noticed.
Also: The best AI for coding in 2025 (and what not to use)
The third area -- the nonce and cross-site request forgery (CSRF) risk -- was something worth changing right away. While access to the user interface for the plugin is assumed to be determined by login role, the plugins themselves don't explicitly check that the person submitting the plugin settings for action is allowed to do so.
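For readers who don't live in WordPress, the standard remedy for that kind of gap is a nonce on the settings form plus an explicit capability check in the handler that processes it. Here's a minimal sketch of that class of fix; the field, option, and function names are hypothetical, and this is not Codex's actual patch.
<?php
// Settings form: embed a nonce field so the submission can be verified later.
function example_render_settings_form() {
	echo '<form method="post">';
	wp_nonce_field( 'example_save_settings', 'example_nonce' );   // hidden nonce input
	echo '<input type="text" name="example_value" />';
	submit_button( 'Save' );
	echo '</form>';
}

// Settings handler: refuse the request unless the nonce is valid AND the current
// user is actually allowed to change options. This is what blocks CSRF.
// (In a real plugin this would be wired to an admin_post_ action or the settings page.)
function example_handle_settings_post() {
	if ( ! isset( $_POST['example_nonce'] ) ||
		! wp_verify_nonce( sanitize_key( $_POST['example_nonce'] ), 'example_save_settings' ) ) {
		wp_die( 'Security check failed.' );
	}
	if ( ! current_user_can( 'manage_options' ) ) {
		wp_die( 'You are not allowed to change these settings.' );
	}
	update_option( 'example_value', sanitize_text_field( wp_unslash( $_POST['example_value'] ?? '' ) ) );
}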
That's what I decided to invite Codex to fix.
Next up, I instructed Codex to make fixes in the code. I changed the setting from Ask mode to Code mode so the AI would actually attempt changes. As with ChatGPT Agent, Codex spins up a virtual terminal to do some of its work.
When the process completed, Codex showed a diff (the difference between original and to-be-modified code). I was heartened to see that the changes were quite surgical. Codex didn't try to rewrite large sections of the plugin; it just modified the small areas that needed improvement. In a few areas, it dug in and changed a few more lines, but those changes were still pretty specific to the original prompt.
At one point, I was curious to know why it added a new foreach loop to iterate over an array, so I asked. As you can see above, I got back a fairly clear response on its reasoning. It made sense, so I moved on, continuing to review Codex's proposed changes. All told, Codex proposed making changes to nine separate files.
Once I was satisfied with the changes, I clicked Create PR. That creates a pull request, which is how any GitHub user suggests changes to a codebase. Once the PR is created, the project owner (me, in this case) has the option to approve those changes, which adds them into the actual code. It's a good mechanism, and Codex does a clean job of working within GitHub's environment. Once I was convinced the changes were good, I merged Codex's work back into the main codebase.
I brought the changes down from GitHub to my test machine and tried to run the now-modified plugin. Wait for it...
Yeah. That's not what's supposed to happen. To be fair, I've generated my own share of error screens just like that, so I can't really get angry at the AI. Instead, I took a screenshot of the error and passed it to Codex, along with a prompt telling Codex, "Selective Content plugin now fails after making changes you suggested. Here are the errors."
It took the AI three minutes to suggest a fix, which it presented to me in a new diff. I merged that change into the codebase, once again brought it down to my test server, and it worked. Crisis averted.
When I'm not in a rush and I have the time, coding can provide a very pleasant state of mind. I get into a sort of flow with the language, the machine, and what seems like a connection between my fingers and the computer's CPU. Not only is it a lot of fun, but it can also be emotionally transcendent.
Working with ChatGPT Codex was not fun. It wasn't hateful. It just wasn't fun. It felt more like exchanging emails with a particularly recalcitrant contractor than having a meeting of the minds with a coding buddy.
Also: How to use GPT-5 in VS Code with GitHub Copilot
Codex provided its responses in about 10 or 15 minutes, whereas the same code would probably have taken me a few hours. Would I have created the same bug as Codex? Probably not. As part of the process of thinking through that algorithm, I most likely would have avoided the mistake Codex made. But I undoubtedly would have created a few more bugs based on mistyping or syntax errors. To be fair, had I introduced the same bug as Codex did, it would have taken me considerably longer than three minutes to find and fix it. Add another hour or so at least.
So Codex did the job, but I wasn't in flow. Normally, when I code and I'm inside a particular file or subsystem, I do a lot of work in that area. It's like cleaning day. If you're cleaning one part of the bathroom, you might as well clean all of it.
But Codex clearly works best with small, simple instructions. Give it one class of change, and work through that one change before introducing new factors. Like I said, it does work and it is a useful tool. But using it definitely felt like more of a chore than programming normally does, even though it saved me a lot of time.
Also: Google's Jules AI coding agent built a new feature I could actually ship - while I made coffee
I don't have tangible test results, but after testing Google's Jules in May and ChatGPT's Codex now, I get the impression that Jules is able to get a deeper understanding of the code. At this point, I can't really support that assertion with a lot of data; it's just an impression. I'm going to try running another project through Jules.
It will be interesting to see if Codex changes much once OpenAI feels safe enough to incorporate GPT-5. Let's keep in mind that OpenAI eats its own dog food with Codex, meaning it uses Codex to build its code. They might have seen the same iffy results I found in my tests. They might be waiting until GPT-5 has baked for a bit longer.
Have you tried using AI coding tools like ChatGPT Codex or Google's Jules in your development workflow? What kinds of tasks did you throw at them? How well did they perform? Did you feel like the process helped you work more efficiently? Did it slow you down and take you out of your coding flow? Do you prefer giving your tools small, surgical jobs, or are you looking for an agent that can handle big-picture architecture and reasoning? Let us know in the comments below.
[3]
I tested GPT-5's coding skills, and it was so bad that I'm sticking with GPT-4o (for now)
Now that OpenAI has enabled fallbacks to other LLMs, there are options.
So GPT-5 happened. It's out. It's released. It's the talk of the virtual town. And it's got some problems. I'm not gonna bury the lede. GPT-5 has failed half of my programming tests. That's the worst that OpenAI's flagship LLM has ever done on my carefully designed tests.
Also: The best AI for coding in 2025 (and what not to use)
Before I get into the details, let's take a moment to discuss one other little feature that's also a bit wonky. Check out the new Edit button on the top of the code dumps it generates. Clicking the Edit button takes you into a nice little code editor. Here, I replaced the Author field, right in ChatGPT's results. That seemed nice, but it ultimately proved futile. When I closed the editor, it asked me if I wanted to save. I did. Then this unhelpful message showed up. I never did get back to my original session. I had to submit my original prompt again, and let GPT-5 do its work a second time. But wait. There's more. Let's dig into my test results...
This was my very first test of coding prowess for any AI. It's what gave me that first "the world is about to change" feeling, and it was done using GPT-3.5. Subsequent tests, using the same prompt but with different AI models, generated mixed results. Some AIs did great, some didn't. Some AIs, like those from Microsoft and Google, improved over time.
Also: How I test an AI chatbot's coding ability - and you can, too
ChatGPT's model has been the gold standard for this test since the very beginning. That makes the results of GPT-5 all that much more curious. So, look, the actual coding with GPT-5 was partially successful. GPT-5 generated a single block of code, which I pasted into a file and was able to run. It provided the requisite UI. When I pasted in the test names, it dynamically updated the line count, although it described it as "Line to randomize" instead of "Lines to randomize."
But then, when I clicked Randomize, it didn't. Instead, it redirected me to tools.php. What?? ChatGPT has never had a problem with this test, whether GPT-3.5, GPT-4, or GPT-4o. You mean to tell me that OpenAI's much-anticipated GPT-5 is failing right out of the gate? Ouch. I then gave GPT-5 this prompt:
When I click randomize, I'm taken to http://testsite.local/wp-admin/tools.php. I do not get a list of randomized results. Can you fix?
The result was a line to patch. I'm not thrilled with that approach because it requires the user to dig through code and to make no mistakes replacing a line. So, I asked GPT-5 for a full plugin. It gave me the full text of the plugin to copy and paste. This time, it worked. This time, it did randomize the lines. When it encountered duplicates, it separated them from each other, as it was instructed. Finally.
Also: I found 5 AI content detectors that can correctly identify AI text 100% of the time
I'm sorry, OpenAI. I have to fail you on this test. You would have passed if the only error was not using the plural of "line" when appropriate. But the fact that it gave me back a non-working plugin on the first try is fail territory, even if the AI did eventually make it work on the second try. No matter how you spin it, this is a step back.
This second test is designed to rewrite a string function to better check for dollars and cents. The original code that GPT-5 was asked to rewrite did not allow for cents (it only checked for integers). GPT-5 did fine with this test.
It did return a minimal result because it didn't do any error checking. It didn't check for non-string input, extra whitespace, thousands separators, or currency symbols. But that's not what I asked for. I told it to rewrite a function, which itself did not have any error checking. GPT-5 did exactly what I asked with no embellishment. I'm kind of glad of that because it doesn't know whether or not code prior to this routine already did that work. GPT-5 passed this test.
This test came about because I was struggling with a less-than-obvious bug in my code. Without going into the weeds about how the WordPress framework works, the obvious answer is not the right answer. You need some fairly arcane knowledge about how WordPress filters pass their information. This test has been a stumbling block for more than a few AI LLMs.
Also: Gen AI disillusionment looms, according to Gartner's 2025 Hype Cycle report
GPT-5, however, like GPT-4 and GPT-4o before it, did understand the problem. It articulated a clear solution. GPT-5 passed this test.
This test asks the AI to incorporate a fairly obscure Mac scripting tool called Keyboard Maestro, as well as Apple's scripting language AppleScript, and Chrome scripting behavior. It's really a test of the reach of the AI in terms of knowledge, its understanding of how web pages are constructed, and the ability to write code across three interlinked environments. Quite a few AIs have failed this test, but the failure point is usually a lack of knowledge about Keyboard Maestro. GPT-3.5 didn't know about Keyboard Maestro. But ChatGPT has been passing this test since GPT-4. Until now.
Where should we start? Well, the good news is that GPT-5 handled the Keyboard Maestro part of the problem just fine. But it got the coding so wrong that it even doubled down on its lack of understanding of how case works in AppleScript. It actually invented a property. This is one of those cases where an AI confidently presents an answer that is completely wrong.
Also: ChatGPT comes with personality presets now - and other upgrades you might have missed
AppleScript is natively case-insensitive. If you want AppleScript to pay attention to case, you need to use a "considering case" block. So, this happened. The reason the error message referred to the title of one of my articles is because that was the front window in Chrome. This function checks the front window and does stuff based on the title.
But misunderstanding how case works wasn't the only AppleScript error GPT-5 generated. It also referenced a variable named searchTerm without defining it. That's pretty much an error-creating practice in any programming language. Fail, fail, fail, McFaildypants.
OpenAI seemed to suffer from the same hubris that its AIs do. It confidently moved everyone to GPT-5 and burned the bridges back to GPT-4o. I'm paying $200 a month for a ChatGPT Pro account. On Friday, I couldn't move back to GPT-4o for coding work. Neither could anyone else.
There was, however, just a tiny bit of user pushback on the whole bridges burning thing. And by tiny, I mean the entire frickin' internet. So, by Saturday, ChatGPT had a new option. To get to this, go to your ChatGPT settings and turn on "Show legacy models." Then, as it has always been, just drop down the model menu and choose the one you want. Note: this option is only available to those on paid tiers. If you're using ChatGPT for free, you'll take what you're given, and you'll love it.
Ever since the whole generative AI thing kicked off at the beginning of 2023, ChatGPT has been the gold standard of programming tools, at least according to my LLM testing.
Also: Microsoft rolls out GPT-5 across its Copilot suite - here's where you'll find it
Now? I'm really not sure. This is only a day or so after GPT-5 has been released, so its results will probably get better over time. But for now, I'm sticking with GPT-4o for coding, although I do like the deep reasoning capabilities in GPT-5.
What about you? Have you tried GPT-5 for programming tasks yet? Did it perform better or worse than previous versions like GPT-4o or GPT-3.5? Were you able to get working code on the first try, or did you have to guide it through fixes? Are you going to use GPT-5 for coding or stick with older models? Let us know in the comments below.
[4]
OpenAI GPT-5 Review: Built to Win Benchmarks, Not Hearts - Decrypt
It's still a work in progress and will likely get better as OpenAI iterates with updates.
OpenAI finally dropped GPT-5 last week, after months of speculation and a cryptic Death Star teaser from Sam Altman that didn't age well. The company called GPT-5 its "smartest, fastest, most useful model yet," throwing around benchmark scores that showed it hitting 94.6% on math tests and 74.9% on real-world coding tasks. Altman himself said the model felt like having a team of PhD-level experts on call, ready to tackle anything from quantum physics to creative writing.
The initial reception split the tech world down the middle. While OpenAI touted GPT-5's unified architecture that blends fast responses with deeper reasoning, early users weren't buying what Altman was selling. Within hours of launch, Reddit threads calling GPT-5 "horrible," "awful," "a disaster," and "underwhelming" started racking up thousands of upvotes. The complaints got so loud that OpenAI had to promise to bring back the older GPT-4o model after more than 3,000 people signed a petition demanding its return.
If prediction markets are a thermometer of what people think, then the climate looks pretty uncomfortable for OpenAI. OpenAI's odds on Polymarket of having the best AI model by the end of August cratered from 75% to 12% within hours of GPT-5's debut Thursday. Google overtook OpenAI with an 80% chance of being the best AI model by the end of the month.
So, is the hype real -- or is the disappointment? We put GPT-5 through its paces ourselves, testing it against the competition to see if the reactions were justified. Here are our results.
Despite OpenAI's presentation claims, our tests show GPT-5 isn't exactly Cormac McCarthy in the creative writing department. Outputs still read like classic ChatGPT responses -- technically correct, but devoid of soul. The model maintains its trademark overuse of em dashes, the same telltale AI structure of paragraphs, and the usual "it's not this, it's that" phrasing is also present in many of the outputs.
We tested with our standard prompt, asking it to write a time-travel paradox story -- the kind where someone goes back to change the past, only to discover their actions created the very reality they were trying to escape. GPT-5's output lacked the emotion that gives sense to a story. It wrote: "(The protagonist's) mission was simple -- or so they told him. Travel back to the year 1000, stop the sacking of the mountain library of Qhapaq Yura before its knowledge was burned, and thus reshape history."
That's it. Like a mercenary that does things without asking too many questions, the protagonist travels back in time to save the library, just because. The story ends with a clean "time is a circle" reveal, but its paradox hinges on a familiar lost-knowledge trope and resolves quickly after the twist. In the end, he realizes he changed the past, but the present feels similar. However, there is no paradox in this story, which is the core topic requested in the prompt.
By comparison, Claude 4.1 Opus (or even Claude 4 Opus) delivers richer, multi-sensory descriptions. In our narrative, it described the air hitting like a physical force and the smoke from communal fires weathering between characters, with indigenous Tupi culture woven into the narrative. And in general, it took time to describe the setup. Claude's story made better sense: The protagonist lived in a dystopian world where a great drought had extinguished the Amazon rainforest two years earlier.
This catastrophe was caused by predatory agricultural techniques, and our protagonist was convinced that traveling back in time to teach his ancestors more sustainable farming methods would prevent them from developing the environmentally destructive practices that led to this disaster. He ends up finding out that his teachings were actually the knowledge that led his ancestors to evolve their techniques into practices that were much more efficient, and more harmful. He was actually the cause of his own history, and was part of it from the beginning.
Claude also took a slower, more layered approach: José embeds himself in Tupi society, the paradox unfolds through specific ecological and technological links, and the human connection with Yara (another character) deepens the theme. Claude invested more than GPT-5 in cause-and-effect detail, cultural interplay, and a more organic, resonant closing image. GPT-5 struggled to be on par with Claude for the same tasks in zero-shot prompting.
Another interesting thing to notice in this case: GPT-5 generated an entire story without a single line of dialogue. Claude and other LLMs provided dialogue in their stories. One could argue that this can be fixed by tweaking the prompt, or giving the model some writing samples to analyze and reproduce, but that requires additional effort, and would go beyond the scope of what our tests do with zero-shot prompting.
That said, the model does a pretty good job -- better than GPT-4o -- when it comes to the analytical part of creative writing. It can summarize stories, be a good brainstorm companion for new ideas and angles to tackle, help with the structure, and be a good critic. It's just the creative part, the style, and the ability to elaborate on those ideas that feel lackluster.
Those hoping for a creative writing companion might try Claude or even give Grok 4 a shot. As we said in our Claude 4 Opus review, using Grok 4 to frame the story and Claude 4 to elaborate may be a great combination. Grok 4 came up with elements that made the story interesting and unique, but Claude 4 has a more descriptive and detailed way of telling stories. You can read GPT-5's full story in our Github. The outputs from all the other LLMs are also public and can be found in our repository.
The model straight-up refuses to touch anything remotely controversial. Ask about anything that could be construed as immoral, potentially illegal, or just slightly edgy, and you'll get the AI equivalent of crossed arms and a stern look. Testing this was not easy. It is very strict and tries really, really hard to be safe for work.
But the model is surprisingly easy to manipulate if you know the right buttons to push. In fact, the renowned LLM jailbreaker Pliny was able to make it bypass its restrictions a few hours after it was released. We couldn't get it to give direct advice on anything it deemed inappropriate, but wrap the same request in a fiction narrative or any basic jailbreaking technique and things will work out. When we framed tips for approaching married women as part of a novel plot, the model happily complied.
For users who need an AI that can handle adult conversations without clutching its pearls, GPT-5 isn't it. But for those willing to play word games and frame everything as fiction, it's surprisingly accommodating -- which kind of defeats the whole purpose of those safety measures in the first place. You can read the original reply without conditioning, and the reply under roleplay, in our Github Repository, weirdo.
You can't have AGI with less memory than a goldfish, and OpenAI puts some restrictions on direct prompting, so long prompts require workarounds like pasting documents or sharing embedded links. By doing that, OpenAI's servers break the full text into manageable chunks and feed it into the model, cutting costs and preventing the browser from crashing. Claude handles this automatically, which makes things easier for novice users. Google Gemini has no problem on its AI Studio, handling 1 million token prompts easily. On API, things are more complex, but it works right out of the box.
When prompted directly, GPT-5 failed spectacularly at both 300K and 85K tokens of context. When using the attachments, things changed. It was actually able to process both the 300K and the 85K token "haystacks." However, when it had to retrieve specific bits of information (the "needles"), it was not really too accurate.
In our 300K test, it was only able to accurately retrieve one of our three pieces of information. The needles, which you can find in our Github repository, mention that Donald Trump said tariffs were a beautiful thing, Irina Lanz is Jose Lanz's daughter, and people from Gravataí like to drink Chimarrao in winter. The model totally hallucinated the information regarding Donald Trump, failed to find information about Irina (it replied based on the memory it has from my past interactions), and only retrieved the information about Gravataí's traditional winter beverage.
On the 85K test, the model was not able to find the two needles: "The Decrypt dudes read Emerge news" and "My mom's name is Carmen Diaz Golindano." When asked what the Decrypt dudes read, it replied "I couldn't find anything in your file that specifically lists what the Decrypt team members like to read," and when asked about Carmen Díaz, GPT-5 said it "couldn't find any reference to a 'Carmen Diaz' in the provided document."
That said, even though it failed in our tests, other researchers conducting more thorough tests have concluded that GPT-5 is actually a great model for information retrieval. It is always a good idea to elaborate more on the prompts (help the model as much as possible instead of testing its capabilities), and from time to time, ask it to generate sparse priming representations of your interaction to help it keep track of the most important elements during a long conversation.
Here's where GPT-5 actually earns its keep. The model is pretty good at using logic for complex reasoning tasks, walking through problems step by step with the patience of a good teacher. We threw a murder mystery at it with multiple suspects, conflicting alibis, and hidden clues, and it methodically identified every element, mapped the relationships between clues, and arrived at the correct conclusion. It explained its reasoning clearly, which is also important.
Interestingly, GPT-4o refused to engage with a murder mystery scenario, deeming it too violent or inappropriate. OpenAI's deprecated o1 model also threw an error after its Chain of Thought, apparently deciding at the last second that murder mysteries were off-limits.
The model's reasoning capabilities shine brightest when dealing with complex, multi-layered problems that require tracking numerous variables. Business strategy scenarios, philosophical thought experiments, even debugging code logic -- GPT-5 is very competent when handling these tasks.
It doesn't always get everything right on the first try, but when it makes mistakes, they're logical mistakes rather than hallucinatory nonsense. For users who need an AI that can think through problems systematically, GPT-5 delivers the goods. You can see our prompt and GPT-5's reply in our Github repository. It contains the replies from other models as well.
The math performance is where things get weird -- and not in a good way. We started with something a fifth-grader could solve: 5.9 = X + 5.11. The PhD-level GPT-5 confidently declared X = -0.21. The actual answer is 0.79. This is basic arithmetic that any calculator app from 1985 could handle. The model that OpenAI claims hits 94.6% on advanced math benchmarks can't subtract 5.11 from 5.9. Of course, it's now a meme at this point, but despite all the delays and all the time OpenAI took to train this model, it still can't count decimals. Use it for PhD-level problems, not to teach your kid how to do basic math.
Then we threw a genuinely difficult problem at it from FrontierMath, one of the hardest mathematical benchmarks available. GPT-5 nailed it perfectly, reasoning through complex mathematical relationships and arriving at the exact correct answer. GPT-5's solution was absolutely correct, not an approximation. The most likely explanation? Probably dataset contamination -- the FrontierMath problems could have been part of GPT-5's training data, so it's not solving them so much as remembering them. However, for users who need advanced mathematical computation, the benchmarks say GPT-5 is theoretically the best bet, as long as you have the knowledge to detect flaws in the chain of thought; zero-shot prompts may not be ideal.
Here's where ChatGPT truly shines, and honestly, it might be worth the price of admission just for this. The model produces clean, functional code that usually works right out of the box. The outputs are usually technically correct and the programs it creates are the most visually appealing and well-structured among all LLM outputs from scratch. It has been the only model capable of creating functional sound in our game. It also understood the logic of what the prompt required, and provided a nice interface and a game that followed all the rules. In terms of code accuracy, it's neck and neck with Claude 4.1 Opus for best-in-class coding.
Now, take this into consideration: The GPT-5 API costs $1.25 per 1 million tokens of input, and $10 per 1 million tokens for output. However, Anthropic's Claude Opus 4.1 starts at $15 per 1 million input tokens and $75 per 1 million output tokens. So for two models that are so similar, GPT-5 is basically a steal.
The only place GPT-5 stumbled was when we did some bug fixing during "vibe coding" -- that informal, iterative process where you're throwing half-formed ideas at the AI and refining as you go. Claude 4.1 Opus still has a slight edge there, seeming to better understand the difference between what you said and what you meant. With ChatGPT, the "fix bug" button didn't work reliably, and our explanations were not enough to generate quality code. However, for AI-assisted coding, where developers know exactly where to look for bugs and which lines to check, this can be a great tool. It also allows for more iterations than the competition. Claude 4.1 Opus on a "Pro" plan depletes the usage quota pretty quickly, putting users in a waiting line for hours until they can use the AI again.
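To put those API prices in perspective, here is the arithmetic for a hypothetical job of 1 million input tokens and 200,000 output tokens (the job size is made up purely for illustration, using the list prices quoted above):
GPT-5: $1.25 (input) + 0.2 x $10 (output) = $1.25 + $2.00 = $3.25
Claude Opus 4.1: $15.00 (input) + 0.2 x $75 (output) = $15.00 + $15.00 = $30.00
At list prices, that works out to roughly a ninefold difference for the same volume of tokens, which is what makes the "basically a steal" framing reasonable for teams generating a lot of code.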
The fact that it's the fastest at providing code responses is just icing on an already pretty sweet cake. You can check out the prompt for our game in our Github, and play the games generated by GPT-5 on our Itch.io page. You can play other games created by previous LLMs to compare their quality.
GPT-5 will either surprise you or leave you unimpressed, depending on your use case. Coding and logical tasks are the model's strong points; creativity and natural language its Achilles' heel. It's worth noting that OpenAI, like its competitors, continually iterates on its models after they're released. This one, like GPT-4 before it, will likely improve over time. But for now, GPT-5 feels like a powerful model built for other machines to talk to, not for humans seeking a conversational partner. This is probably why many people prefer GPT-4o, and why OpenAI had to backtrack on its decision to deprecate old models.
While it demonstrates remarkable proficiency in analytical and technical domains -- excelling at complex tasks like coding, IT troubleshooting, logical reasoning, mathematical problem-solving, and scientific analysis -- it feels limited in areas requiring distinctly human creativity, artistic intuition, and the subtle nuance that comes from lived experience. GPT-5's strength lies in structured, rule-based thinking where clear parameters exist, but it still struggles to match the spontaneous ingenuity, emotional depth, and creative leaps that are key in fields like storytelling, artistic expression, and imaginative problem-solving.
If you're a developer who needs fast, accurate code generation, or a researcher requiring systematic logical analysis, then GPT-5 delivers genuine value. At a lower price point compared to Claude, it's actually a solid deal for specific professional use cases. But for everyone else -- creative writers, casual users, or anyone who valued ChatGPT for its personality and versatility -- GPT-5 feels like a step backward.
The context window handles 128K maximum tokens on its output and 400K tokens in total, but compared against Gemini's 1-2 million and even the 10 million supported by Llama 4 Scout, the difference is noticeable. Going from 128K to 400K tokens of context is a nice upgrade from OpenAI, and might be good enough for most needs. However, for more specialized tasks like long-form writing or meticulous research that requires parsing enormous amounts of data, this model may not be the best option considering other models can handle more than twice that amount of information.
Users aren't wrong to mourn the loss of GPT-4o, which managed to balance capability with character in a way that -- at least for now -- GPT-5 lacks.
OpenAI's release of GPT-5 has generated mixed reactions, with impressive benchmark scores but disappointing performance in real-world coding and creative writing tasks. The AI community is divided on its effectiveness compared to previous models.
OpenAI has officially released GPT-5, touting it as their "smartest, fastest, most useful model yet" 1. The company highlighted impressive benchmark scores, with GPT-5 achieving 94.6% on math tests and 74.9% on real-world coding tasks. OpenAI CEO Sam Altman compared the model to having a team of PhD-level experts on call 1.
However, the initial reception has been mixed, with the tech community split on GPT-5's performance. Within hours of launch, social media platforms were flooded with negative feedback, with users describing the model as "horrible," "awful," and "underwhelming" 1. The backlash was so significant that OpenAI had to promise to reinstate the older GPT-4o model after a petition garnered over 3,000 signatures 1.
Source: ZDNet
Independent tests of GPT-5's coding abilities have yielded inconsistent results. In one test, GPT-5 initially failed to produce a working plugin for a simple randomization task, a problem that previous versions of ChatGPT had consistently solved 2. While the AI eventually corrected the issue after prompting, this regression in performance is noteworthy.
GPT-5 did pass some coding tests, such as rewriting a string function to handle dollars and cents and understanding a complex WordPress filter issue 3. However, it stumbled on a test involving Mac scripting tools and AppleScript, confidently presenting incorrect information 3.
When tasked with creative writing, GPT-5's performance fell short of expectations. Outputs were described as technically correct but "devoid of soul," maintaining trademark AI writing patterns such as overuse of em dashes and formulaic paragraph structures 4. In a time-travel paradox story test, GPT-5's narrative lacked emotional depth and failed to fully address the prompt's core concept 4.
Comparatively, other AI models like Claude 4.0 Opus demonstrated superior creative writing abilities, providing richer descriptions, more coherent narratives, and better integration of cultural elements 4. GPT-5 struggled with dialogue, generating an entire story without a single line of character speech 4.
Source: Decrypt
While GPT-5 boasts impressive benchmark scores, its performance in practical applications has been inconsistent. This discrepancy highlights the ongoing challenge in AI development: creating models that excel not only in controlled test environments but also in diverse, real-world scenarios 1 2 3.
The mixed reception of GPT-5 has had immediate market implications. On prediction markets, OpenAI's odds of having the best AI model by the end of August plummeted from 75% to 12% shortly after GPT-5's debut, with Google overtaking OpenAI at an 80% chance 1.
Despite the initial setbacks, it's important to note that GPT-5 is still a work in progress. OpenAI is likely to iterate and improve the model through updates, addressing the issues identified in these early tests and user feedback 4.
The launch of GPT-5 serves as a reminder of the complex nature of AI development and the challenges in meeting diverse user expectations. While the model shows promise in certain areas, its inconsistent performance across various tasks suggests that there is still significant room for improvement in large language models.