AMD Rome Second Generation EPYC Review: 2x 64-core Benchmarked
by Johan De Gelas on August 7, 2019 7:00 PM EST

Better Core in Zen 2
Just in case you missed it: in our microarchitecture analysis article, Ian explained in great detail why AMD claims that its new Zen 2 is a significantly better architecture than Zen 1:
- a different second-stage branch predictor, known as a TAGE predictor
- doubling of the micro-op cache
- doubling of the L3 cache
- increase in integer resources
- increase in load/store resources
- support for two AVX-256 instructions per cycle (instead of having to combine two 128-bit units)
All of these on-paper improvements show that AMD is attacking its key markets in both consumer and enterprise performance. With the extra compute and promised efficiency, we can surmise that AMD has the ambition to take the high-performance market back too. Unlike the Xeon, the 2nd Gen EPYC does not declare lower clocks when running AVX2 code - instead it runs a power-aware scheduler that supplies as much frequency as possible within the power constraints of the platform.
Users might question, especially with Intel so embedded in high performance and machine learning, why AMD hasn't gone with an AVX-512 design. As a snap back at the incumbent market leader, AMD has stated that not all 'routines can be parallelized to that degree', as well as sending a very clear signal that 'it is not a good use of our silicon budget'. I do believe that we may require pistols at dawn. Nonetheless, it will be interesting to see how each company approaches vector parallelization as new generations of hardware come out. But as it stands, AMD is pumping up its FP performance without going full-on AVX-512.
AMD claims an overall 15% IPC increase for Zen 2, and we saw this borne out in our analysis of Zen 2 in the consumer processor line, which was released last month: Andrei checked and found that Zen 2 is indeed 15-17% faster. Along with the performance improvements, there have also been security hardening updates, improved virtualization support, and new but proprietary instructions for cache and memory bandwidth Quality of Service (QoS). (The QoS features seem very similar to what Intel introduced in Broadwell/Xeon E5 v4 and Skylake - AMD is catching up in that area.)
Rome Layout: Simple Makes It a Lot Easier
When we analyzed AMD's first generation of EPYC, one of the big disadvantages was the complexity. AMD had built its 32-core Naples processors by combining four 8-core silicon dies, attaching each one to two memory channels, resulting in a non-uniform memory architecture (NUMA). Due to this 'quad NUMA' layout, a number of applications saw quite a few NUMA balancing issues. This happened in almost every OS, and in some cases we saw reports that system administrators and others had to do quite a bit of optimization work to get the best performance out of the EPYC 7001 series.
The new 2nd Gen EPYC, Rome, has solved this. The CPU design implements a central I/O hub through which all off-chip communications occur. The full design uses eight core chiplets, called Core Complex Dies (CCDs), with one central die for I/O, called the I/O Die (IOD). All of the CCDs communicate with this central I/O hub through dedicated high-speed Infinity Fabric (IF) links, and through it the cores can reach the DRAM and PCIe lanes contained within, or other cores.
The CCDs consist of two four-core Core CompleXes (1 CCD = 2 CCX). Each CCX consists of four cores and 16 MB of L3 cache; these CCXs are at the heart of Rome. The top 64-core Rome processors have 16 CCX in total, and those CCX can only communicate with each other over the central I/O die: there is no direct inter-chiplet CCD communication.
This is what this diagram shows. On the left we have Naples, the first generation EPYC, which uses four Zeppelin dies, each connected to the others with IF links. On the right is Rome, with eight CCDs in green around the outside, and a centralized I/O die in the middle with the DDR and PCIe interfaces.
As Ian reported, the CCDs are made at TSMC using its latest 7 nm process technology, while the I/O die is built on GlobalFoundries' 14 nm process. Since I/O circuitry is notoriously hard to scale down to smaller process nodes, especially when compared to cache and logic circuitry, AMD is being clever here: using a very mature process technology for the IOD helps improve time to market and cost.
This topology is clearly visible when you take off the hood.
The main advantage is that the 2nd Gen 'EPYC 7002' family is much easier to understand and optimize for, especially from a software point of view, than Naples. Each processor now has only one memory latency environment, as each core has the same latency to all eight memory channels - compared to the first generation Naples, which had four NUMA regions per CPU due to its directly attached memory.
As seen in the image below, this means that in a dual socket setup, a Rome system acts like the traditional dual-socket NUMA environment that most software engineers are familiar with.
Ultimately the only other way to achieve this is with a large monolithic die, which on smaller process nodes is becoming less palatable when it comes to yields and pricing. In that respect, AMD has a significant advantage: being able to build small 7 nm dies with high yields also provides a substantial advantage when it comes to binning for frequency.
How a system sees the new NUMA environment is quite interesting. For the Naples EPYC 7001 CPUs, this was rather complicated in a dual socket setup:
Here each number shows the 'weighting' given to the delay to access each of the other NUMA domains. Within the same domain, the weighting is a light 10; a NUMA domain on the same chip is given a 16; and jumping off the chip bumps this up to 32.
This changed significantly on Rome EPYC 7002:
Although there are situations where the EPYC 7001 CPUs communicated faster, the fact that the topology is much simpler from the software point of view is worth a lot. It makes getting good performance out of the chip much easier for everyone who has to use it, which will save a lot of money in the enterprise, but also help accelerate adoption.
180 Comments
AnonCPU - Friday, August 9, 2019 - link
The gain in hmmer on EPYC with GCC8 is not due to the TAGE predictor; hmmer gains a lot on EPYC only because of GCC8.
The vectorizer has been improved in GCC8 and hmmer gets vectorized heavily, which was not the case with GCC7. The same run on an Intel machine would have shown the same kind of improvement.
JohanAnandtech - Sunday, August 11, 2019 - link
Thanks, do you have a source for that? Interested in learning more!

AnonCPU - Monday, August 12, 2019 - link
That should be due to the improvements on loop distribution: https://gcc.gnu.org/gcc-8/changes.html
"The classic loop nest optimization pass -ftree-loop-distribution has been improved and enabled by default at -O3 and above. It supports loop nest distribution in some restricted scenarios;"
There are also some references here in what was missing for hmmer vectorization in GCC some years ago:
https://gcc.gnu.org/ml/gcc/2017-03/msg00012.html
And a page where you can see that LLVM was missing (at least in 2015) a good loop distribution algo useful for hmmer:
https://www.phoronix.com/scan.php?page=news_item&a...
AnonCPU - Monday, August 12, 2019 - link
And more: https://community.arm.com/developer/tools-software...
just4U - Friday, August 9, 2019 - link
I guess the question to ask now is: can they churn these puppies out like no tomorrow? Is the demand there? What about other hardware, motherboards and the like? Do they have 100,000 of these ready to go? The window of opportunity for AMD is always fleeting, and if they're going to capitalize on this they need to be able to put the product out there.
name99 - Friday, August 9, 2019 - link
No obvious reason why not. The chiplets are not large, and TSMC ships 200 million Apple chips a year on essentially the same process, so yields should be there.
Manufacturing the chiplet assembly also doesn't look any different from the Naples assembly (details differ, yes, but no new envelopes being pushed: no much higher frequency signals or denser traces -- the flip side to that is that there's scope for some optimization come Milan...)
So it seems like there is nothing to obviously hold them back...
fallaha56 - Saturday, August 10, 2019 - link
Perhaps Hyperthreading should be off on the Intel systems to better reflect e.g. Google's reality / proper security standards, now we know Intel isn't secure?

Targon - Monday, August 12, 2019 - link
That is why Google is going to be buying many Epyc based servers going forward. Mitigations do not mean a problem has been fixed.

imaskar - Wednesday, August 14, 2019 - link
Why do you think AWS, GCP, Azure, etc. mitigated the vulnerabilities? They only patched Meltdown at most. All other things are too costly and hard to execute. They just don't care so much for your data. Lose 2x cloud capacity for that? No way. And for security conscious serious customers they offer private clusters, so your workloads run on separate servers.

ballsystemlord - Saturday, August 10, 2019 - link
Spelling and grammar errors:
"This happened in almost every OS, and in some cases we saw reports that system administrators and others had to do quite a bit optimization work to get the best performance out of the EPYC 7001 series."
Missing "of":
"This happened in almost every OS, and in some cases we saw reports that system administrators and others had to do quite a bit of optimization work to get the best performance out of the EPYC 7001 series."
"...to us it is simply is ridiculous that Intel expect enterprise users to cough up another few thousand dollars per CPU for a model that supports 2 TB,..."
Excess "is" and missing "s":
"...to us it is simply ridiculous that Intel expects enterprise users to cough up another few thousand dollars per CPU for a model that supports 2 TB,..."
"Although the 225W TDP CPUs needs extra heatspipes and heatsinks, there are still running on air cooling..."
Excess "s" and incorrect "there":
"Although the 225W TDP CPUs need extra heatspipes and heatsinks, they're still running on air cooling..."
"The Intel L3-cache keeps latency consistingy low as long as you stay within the L3-cache."
"consistently" not "consistingy":
"The Intel L3-cache keeps latency consistently low as long as you stay within the L3-cache."
"For example keeping a large part of the index in the cache improve performance..."
Missing comma and missing "s" (you might also consider making cache plural, but you seem to be talking strictly about the L3):
"For example, keeping a large part of the index in the cache improves performance..."
"That is a real thing is shown by the fact that Intel states that the OLTP hammerDB runs 60% faster on a 28-core Intel Xeon 8280 than on EPYC 7601."
Missing "it":
"That it is a real thing is shown by the fact that Intel states that the OLTP hammerDB runs 60% faster on a 28-core Intel Xeon 8280 than on EPYC 7601."
In general, the beginning of the sentence appears quite poorly worded; how about:
"That L3 cache latency is a matter for concern is shown by the fact that Intel states that the OLTP hammerDB runs 60% faster on a 28-core Intel Xeon 8280 than on EPYC 7601."
"In NPS4, the NUMA domains are reported to software in such a way as it chiplets always access the near (2 channels) DRAM."
Missing "s":
"In NPS4, the NUMA domains are reported to software in such a way as its chiplets always access the near (2 channels) DRAM."
"The fact that the EPYC 7002 has higher DRAM bandwidth is clearly visible."
Wrong number (maybe you meant the series?):
"The fact that the EPYC 7742 has higher DRAM bandwidth is clearly visible."
"...but show very significant improvements on EPYC 7002."
Wrong number (maybe you meant the series?):
"...but show very significant improvements on EPYC 7742."
"Using older garbage collector because they happen to better at Specjbb"
Badly worded.
"Using an older garbage collector because it happens to be better at Specjbb"
"For those with little time: at the high end with socketed x86 CPUs, AMD offers you up to 50 to 100% higher performance while offering a 40% lower price."
"Up to" requires 1 metric, not 2. Try:
"For those with little time: at the high end with socketed x86 CPUs, AMD offers you from 50 up to 100% higher performance while offering a 40% lower price."