Curated by THEOUTPOST
On Fri, 23 Aug, 12:05 AM UTC
2 Sources
[1]
LLM Inference - Consumer GPU performance
In our ongoing effort to assess hardware performance for AI and machine learning workloads, today we're publishing results from the built-in benchmark tool of llama.cpp, focusing on a variety of NVIDIA GeForce GPUs, from the RTX 4090 down to the now-ancient (in tech terms) GTX 1080 Ti. Although this round of testing is limited to NVIDIA graphics cards, we plan to expand our scope in future benchmarks to include AMD offerings. If you're interested in how NVIDIA's professional GPUs performed using this benchmark, then follow this link to check out those results.

Llama.cpp build 3140 was used for these tests, with CUDA version 12.2.0 and Microsoft's Phi-3-mini-4k-instruct model in 4-bit GGUF. Both the prompt processing and token generation tests were performed using the default values of 512 tokens and 128 tokens respectively, with 25 repetitions apiece, and the results averaged.

Within the latest GeForce RTX 4000 series, the rankings of the cards in the prompt processing test are as we would expect based on the models' positioning within the product stack. We found that the RTX 4090 was 28.5% faster than the RTX 4080 SUPER, which was only 6.2% faster than the standard RTX 4080. Especially with its hefty 24GB of VRAM, the RTX 4090 continues to be a great choice for LLMs, but the RTX 4080 SUPER is also worth considering since it actually has a lower MSRP than the RTX 4080. At the 16GB mark, the RTX 4070 Ti SUPER is a worthwhile contender to the RTX 4080 SUPER, with a lower overall cost and a similar price-to-performance ratio. All of this assumes, however, that prompt processing performance is your main concern over token generation, which is an unlikely scenario in many workflows.

Interestingly, we find that last generation's RTX 3080 Ti came out ahead of the RTX 4070 SUPER and RTX 4070, and the venerable RTX 2080 Ti managed to edge out the RTX 4060 Ti. Finally, with its complete lack of tensor cores, the GTX 1080 Ti truly shows its age, scoring five times slower than its closest competition, the RTX 4060.

When we compare these results with the technical specifications of the GPUs, it becomes clear that FP16 performance has a direct impact on how quickly they are able to process prompts in the llama.cpp benchmark. FP16 performance is almost exclusively a function of the number of tensor cores and the generation of tensor cores the GPUs were manufactured with. This explains why, with its complete lack of tensor cores, the GTX 1080 Ti's FP16 performance is anemic compared to the rest of the GPUs tested. However, the fact that the RTX 3080 Ti was able to come out ahead of the RTX 4070 SUPER indicates that FP16 performance is not the only factor at work during prompt processing, and the following section should shine some light on what else we should be considering.

Once again, the RTX 4090 shows its dominance by landing at the top of the token generation chart, but surprisingly, the RTX 3080 Ti took second place, jumping up four positions compared to the prompt processing results, with a score functionally equivalent to the much newer RTX 4080 SUPER. If we refer back to the technical specifications of these GPUs, we can see how the older RTX 3080 Ti achieved this result: through its notable memory bandwidth. Although the two RTX 4080 variants have considerably more FP16 compute capability than the RTX 3080 Ti (~50 TFLOPS vs. ~35 TFLOPS), the roughly 25% higher memory bandwidth of the RTX 3080 Ti allows it to come out just ahead of the newer GPUs during token generation.
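A rough way to see why memory bandwidth dominates token generation is to note that each generated token requires at least one full read of the model weights, so tokens per second is capped at roughly bandwidth divided by weight size. Below is a minimal sketch of that back-of-the-envelope estimate; the ~2.4 GB weight size for the 4-bit Phi-3-mini GGUF and the bandwidth figures are approximations for illustration, not values measured in this benchmark.

    # Back-of-the-envelope ceiling on token generation speed:
    # each new token needs (at least) one full pass over the model weights,
    # so tokens/s is bounded by memory_bandwidth / weight_bytes.
    # All figures below are approximations for illustration, not measurements.

    WEIGHT_GB = 2.4  # rough size of a 4-bit Phi-3-mini (3.8B params) GGUF

    gpus_gbps = {          # approximate memory bandwidth in GB/s
        "RTX 4090":       1008,
        "RTX 4080 SUPER":  736,
        "RTX 3080 Ti":     912,
        "RTX 4070":        504,
    }

    for name, bw in gpus_gbps.items():
        ceiling = bw / WEIGHT_GB   # theoretical upper bound, tokens/s
        print(f"{name:>14}: <= {ceiling:5.0f} tokens/s (bandwidth-bound estimate)")

Real throughput lands well below these ceilings, but the ordering mirrors the chart, with the RTX 3080 Ti's wider memory bus keeping it ahead of the 4080-class cards despite its compute deficit.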
Compared to the prompt processing results, we also see that the token generation test narrowed the performance gap between certain models. For example, in prompt processing, the increase from the RTX 4070 Ti SUPER to the RTX 4080 SUPER was about 22%, but during token generation, the increase was only 8%. Likewise, the 25% gap between the RTX 4070 and the RTX 4070 Ti in prompt processing shrinks to the two cards achieving nearly identical token generation scores.

But similar to the prompt processing results, if we look at where older cards like the RTX 2080 Ti and GTX 1080 Ti land on the chart, we can see that memory bandwidth is not the end-all-be-all specification for token generation, and FP16 compute performance still has a role to play. Otherwise, we'd expect the GTX 1080 Ti to achieve a better result, considering its memory bandwidth is comparable to that of the RTX 4070 and its variants.

One somewhat anomalous result is the unexpectedly low tokens per second from the RTX 2080 Ti. It has more memory bandwidth and FP16 performance than the RTX 4060 series GPUs, yet achieves similar results. We expect this is a result of either software optimizations for the newer generations of GPUs or increased overhead from using a larger number of less capable tensor cores (544 second-generation cores vs. 96-136 fourth-generation cores).

These results emphasize an important consideration when choosing GPUs for LLM usage: while raw memory capacity is very important, it is not the only factor that should be taken into account. It's also important to consider the memory bandwidth and overall compute performance of a GPU in order to get a comprehensive understanding of its suitability for LLMs.

This is just the starting point for our LLM testing series. Future updates will cover more topics, such as inference with larger models, multi-GPU configurations, testing with AMD and Intel GPUs, and model training. We're eager to hear from you - if there's a specific aspect of LLM performance you'd like us to investigate, please let us know in the comments!
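For readers who want to try a comparable run on their own hardware, llama.cpp's built-in llama-bench tool drives both tests. Below is a minimal, illustrative wrapper: the binary and model paths are placeholders, and the JSON field names are assumptions to verify against your own build rather than guaranteed output.

    import json
    import subprocess

    # Placeholder paths; point these at your llama-bench binary and model file.
    LLAMA_BENCH = "./llama-bench"
    MODEL = "./models/Phi-3-mini-4k-instruct-q4.gguf"

    # Mirror the article's settings: a 512-token prompt processing test and a
    # 128-token generation test, 25 repetitions each, emitted as JSON.
    cmd = [
        LLAMA_BENCH,
        "-m", MODEL,
        "-p", "512",   # prompt processing length
        "-n", "128",   # token generation length
        "-r", "25",    # repetitions per test
        "-o", "json",  # machine-readable output
    ]

    result = subprocess.run(cmd, capture_output=True, text=True, check=True)

    for test in json.loads(result.stdout):
        # Field names ("n_gen", "avg_ts") are assumptions about llama-bench's
        # JSON output; check them against your build before relying on this.
        label = "prompt processing" if test.get("n_gen", 0) == 0 else "token generation"
        print(f"{label}: {test.get('avg_ts', 'n/a')} tokens/s on average")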
[2]
LLM Inference - Professional GPU performance
As part of our goal to evaluate benchmarks for AI & machine learning tasks in general, and LLMs in particular, today we'll be sharing results from llama.cpp's built-in benchmark tool across a number of GPUs from NVIDIA's professional lineup. Because we were able to fold the llama.cpp Windows CUDA binaries into a benchmark series we were already running for other purposes, this round of testing only includes NVIDIA GPUs, but we do intend to include AMD cards in future benchmarks. If you're interested in how NVIDIA's consumer GPUs performed using this benchmark and system configuration, then follow this link to check out those results.

It's worth mentioning that maximizing performance or price-to-performance is not typically the main reason someone would choose a professional GPU over a consumer-oriented model. The primary value propositions that both NVIDIA's and AMD's pro-series cards offer are improved reliability (both in terms of hardware and drivers), higher VRAM capacity, and designs more appropriate for multi-GPU configurations. If raw performance is your main deciding factor, then outside of multi-GPU configurations, a top-end consumer GPU is almost always going to be the better option.

Llama.cpp build 3140 was used for these tests, with CUDA version 12.2.0 and Microsoft's Phi-3-mini-4k-instruct model in 4-bit GGUF. Both the prompt processing and token generation tests were performed using the default values of 512 tokens and 128 tokens respectively, with 25 repetitions apiece, and the results averaged.

Starting with the prompt processing portion of the benchmark, the Ada GPU results are not particularly surprising, with the RTX 6000 Ada achieving the top result and the RTX 4000 Ada the lowest score. It's interesting to see that the older RTX 6000 is essentially dead even with the RTX 4500 Ada, despite nominally being a much higher-end model. Once we dig into the cards' specifications (table below), the picture becomes clearer. Here, we find that the prompt processing results track closely with the cards' FP16 performance, which is based almost entirely on the number of tensor cores and the generation of tensor cores the GPUs were manufactured with. Ultimately, prompt processing appears to be constrained by the compute performance of the GPU rather than by other factors like memory bandwidth.

In contrast to the prompt processing results, we find that token generation scales more closely with the GPUs' memory bandwidth (table below) than with tensor core count. Although the RTX 6000 Ada is still the clear winner, the older RTX 6000 moves up into second place, ahead of the Ada models that outperformed it during the prompt processing phase of the benchmark. However, by comparing the RTX A6000 and the RTX 5000 Ada, we can also see that memory bandwidth is not the only factor determining performance during token generation. Although the RTX 5000 Ada has only 75% of the memory bandwidth of the RTX A6000, it is still able to achieve 90% of the performance of the older card. This indicates that compute performance still plays a role during token generation, just not to the same degree as during prompt processing.
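That last comparison can be made concrete with two ratios. The sketch below uses approximate published bandwidth specs for the two cards and the roughly 90% performance figure described above, not exact benchmark scores.

    # If token generation were purely memory-bandwidth-bound, throughput should
    # scale directly with bandwidth. The bandwidth figures are approximate
    # published specs, and the 0.90 ratio paraphrases the article's result.
    A6000_BW_GBPS = 768       # RTX A6000, approximate GB/s
    ADA5000_BW_GBPS = 576     # RTX 5000 Ada, approximate GB/s

    bandwidth_ratio = ADA5000_BW_GBPS / A6000_BW_GBPS   # ~0.75
    observed_tg_ratio = 0.90                             # RTX 5000 Ada vs RTX A6000

    print(f"expected ratio if purely bandwidth-bound: {bandwidth_ratio:.2f}")
    print(f"observed token generation ratio:          {observed_tg_ratio:.2f}")
    print(f"gap closed by the newer card's compute:   {observed_tg_ratio - bandwidth_ratio:.2f}")

In other words, a meaningful share of the bandwidth deficit is hidden by the Ada card's stronger compute, which is consistent with token generation being mostly, but not entirely, bandwidth-bound.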
This benchmark helps highlight an important point: there are several GPU specifications to weigh when deciding which GPU or GPUs are the most appropriate option for use with LLMs. These results show that VRAM capacity should not be the only characteristic considered when choosing GPUs for LLM usage. A lot of emphasis is placed on maximizing VRAM, which is certainly an important variable, but it's also important to consider the performance characteristics of that VRAM, notably the memory bandwidth. Beyond the specifications of the VRAM, it's still important to consider the raw compute performance of GPUs as well, in order to get a more holistic view of how the cards stack up against each other.

This is only the beginning of our LLM testing, and we plan to do much more in the future. Larger models, multi-GPU configurations, AMD and Intel GPUs, and model training are all on the horizon. If there is anything else you would like us to report on, please let us know in the comments!
A comprehensive analysis of GPU performance for Large Language Model (LLM) inference, comparing consumer and professional graphics cards. The study reveals surprising results and practical implications for AI enthusiasts and professionals.
In a recent study conducted by Puget Systems, consumer GPUs have demonstrated remarkable capabilities in Large Language Model (LLM) inference tasks. The analysis, which benchmarked llama.cpp running Microsoft's Phi-3-mini-4k-instruct model, revealed that high-end consumer cards such as the RTX 4090 and 4080 performed exceptionally well, often matching or surpassing their professional counterparts [1].
While professional GPUs like the RTX 6000 Ada and A6000 showed strong performance, they didn't always justify their higher price tags in terms of LLM inference capabilities. The study found that in many cases, consumer cards offered better price-to-performance ratios, challenging the notion that professional GPUs are always superior for AI workloads [2].
One crucial factor in GPU performance for LLM inference is the available VRAM. Larger models require more memory, and this is where some professional GPUs shine. Cards like the RTX 6000 Ada, with 48GB of VRAM, can handle larger models that consumer cards simply cannot load [2]. However, for models that fit within consumer GPU memory limits, the performance gap narrows significantly.
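As a rough illustration of where that limit bites, a model's weight footprint can be estimated from its parameter count and quantization width. The sketch below is a ballpark rule of thumb rather than a loader check: the 70B-parameter entry is a hypothetical example (not a model tested in the studies), and the KV cache and activations need VRAM on top of the weights.

    # Rough rule of thumb for GGUF weight size: parameters * bits-per-weight / 8,
    # padded ~10% for embeddings, quantization scales, and runtime overhead.
    # Ballpark illustration only; KV cache and activations need additional VRAM.

    def weight_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
        # 1e9 params * (bytes per weight) lands in decimal GB directly
        return params_billion * bits_per_weight / 8 * overhead

    models = {
        "Phi-3-mini (3.8B)": 3.8,   # the model used in the benchmarks
        "70B-class model":   70.0,  # hypothetical larger model, for scale
    }
    cards = {"RTX 4090 (24GB)": 24, "RTX 6000 Ada (48GB)": 48}

    for name, params in models.items():
        need = weight_gb(params, bits_per_weight=4.5)  # ~4-bit quantization with scales
        fits = [card for card, vram in cards.items() if need < vram] or ["neither card"]
        print(f"{name}: ~{need:.1f} GB of weights -> fits on {', '.join(fits)}")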
Perhaps the most unexpected result was the strong showing of older GPU architectures. The previous-generation RTX 3080 Ti, for instance, proved to be highly competitive, often outperforming newer, more expensive options in certain scenarios [1]. This finding suggests that users may not always need the latest hardware for effective LLM inference.
These results have significant implications for both AI enthusiasts and professionals. For many users, high-end consumer GPUs like the RTX 4090 offer an excellent balance of performance and cost-effectiveness for LLM inference tasks [1]. However, those working with very large models or requiring ECC memory may still benefit from professional-grade options.

The study also highlighted the importance of software optimization in LLM inference performance. Different inference frameworks and quantization techniques can significantly impact results, sometimes more so than hardware differences [2]. This underscores the need for users to consider both hardware and software aspects when setting up their LLM inference environments.

As LLM technology continues to evolve rapidly, the landscape of GPU performance for inference tasks is likely to change. The current findings suggest a trend towards more accessible and cost-effective solutions for AI workloads, potentially democratizing access to powerful LLM capabilities [1][2]. However, the development of larger, more complex models may continue to push the boundaries of what consumer hardware can handle.
References
[1] Puget Systems - LLM Inference: Consumer GPU Performance
[2] Puget Systems - LLM Inference: Professional GPU Performance