The Ice Lake Benchmark Preview: Inside Intel's 10nm
by Dr. Ian Cutress on August 1, 2019 9:00 AM EST
Posted in: CPUs, Intel, GPUs, 10nm, Core, Ice Lake, Cannon Lake, Sunny Cove, 10th Gen Core
Security Updates, Improved Instruction Performance and AVX-512 Updates
With every new microarchitecture update, there are goals on several fronts: add new instructions, decrease the latency of current instructions, increase the throughput of current instructions, and remove bugs. The big headline addition for Sunny Cove and Ice Lake is AVX-512, which hasn’t yet appeared on a mainstream, widely distributed consumer processor; technically we saw it in Cannon Lake, but that was a limited-run CPU. Nonetheless, a lot of what went into Cannon Lake also shows up in the Sunny Cove design. To complicate matters, AVX-512 comes in plenty of different flavors. On top of that, Intel also improved a significant number of existing instructions throughout the design.
Big thanks to InstLatX64 for his help in analyzing the benchmark results.
Security
On security, almost all the documented hardware security fixes are in place with Sunny Cove. Through the CPUID results, we can determine that SSBD is enabled, as is IA32_ARCH_CAPABILITIES, L1D_FLUSH, STIBP, IBPB/IBRS and MD_CLEAR.
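As a rough illustration of how CPUID results like these are read, the sketch below decodes the EDX register returned by CPUID leaf 7 (sub-leaf 0) into the mitigation-related feature names mentioned above. The bit positions follow Intel's published CPUID enumeration as best we recall it; the function name and the sample register value are made up for the example, and obtaining the raw register value (for instance from a `cpuid` dump) is left out.

```python
# Mitigation-related feature bits in CPUID.(EAX=07H, ECX=0):EDX.
# Bit positions are quoted from memory of Intel's documentation; verify before relying on them.
LEAF7_EDX_BITS = {
    10: "MD_CLEAR",
    26: "IBRS/IBPB",
    27: "STIBP",
    28: "L1D_FLUSH",
    29: "IA32_ARCH_CAPABILITIES",
    31: "SSBD",
}


def decode_leaf7_edx(edx):
    """Return the names of the mitigation features whose bits are set in EDX."""
    return [name for bit, name in sorted(LEAF7_EDX_BITS.items()) if (edx >> bit) & 1]


# Hypothetical register value with all six bits set, for illustration only.
sample_edx = sum(1 << bit for bit in LEAF7_EDX_BITS)
print(decode_leaf7_edx(sample_edx))
```

A register value with none of these bits set decodes to an empty list, which is what one would expect from a core that predates the mitigations.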
This aligns with Intel’s list of Sunny Cove security improvements:
Sunny Cove Security
AnandTech | Description | Name | Solution |
BCB | Bound Check Bypass | Spectre V1 | Software |
BTI | Branch Target Injection | Spectre V2 | Hardware+OS |
RDCL | Rogue Data Cache Load | V3 | Hardware |
RSSR | Rogue System Register Read | V3a | Hardware |
SSB | Speculative Store Bypass | V4 | Hardware+OS |
L1TF | Level 1 Terminal Fault | Foreshadow | Hardware |
MFBDS | uArch Fill Buffer Data Sampling | RIDL | Hardware |
MSBDS | uArch Store Buffer Data Sampling | Fallout | Hardware |
MLPDS | uArch Load Port Data Sampling | - | Hardware |
MDSUM | uArch Data Sampling Uncachable Memory | - | Hardware |
Aside from Spectre V1, which has no suitable hardware solution, almost all of the rest have been solved through hardware/firmware (Intel won’t distinguish which, but to a certain extent it doesn’t matter for new hardware). This is a step in the right direction, but of course it may have a knock-on effect on performance. Conversely, any performance regained by moving a mitigation from firmware into hardware will be rolled into any advertised IPC increase.
Also on the security side is SGX, or Intel’s Software Guard Extensions. Sunny Cove now becomes Intel’s first public processor to enable both AVX-512 and SGX in the same design. Technically the first chip with both SGX and AVX-512 should have been Skylake-X; however, that feature was ultimately disabled after failing some validation test cases. But it now comes together for Sunny Cove in Ice Lake-U, which is also a consumer processor.
Instruction Improvements and AVX-512
As mentioned, Sunny Cove pulls a number of key improvements from the Cannon Lake design, despite the Cannon Lake chip having the same cache configuration as Skylake. One of the key points here is 64-bit division, which goes from a 97-cycle latency to an 18-cycle latency, blowing past AMD’s 45-cycle latency. Speaking as an ex-researcher who worked on high-precision math code with no idea about instruction latency or compiler options, this speedup would have been critical.
- IDIV -> 97-cycle to 18-cycle
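To put that latency change in perspective, here is a back-of-the-envelope sketch of a chain of dependent 64-bit divisions, where each divide has to wait for the previous result. The 97-cycle and 18-cycle latencies are the figures quoted above; the one million divisions and the 3.0 GHz clock are assumptions picked purely for illustration, and the model ignores everything else the core is doing.

```python
def chain_time_ns(n_ops, latency_cycles, clock_ghz):
    """Time in nanoseconds for n_ops back-to-back dependent operations."""
    return n_ops * latency_cycles / clock_ghz


N_DIVS = 1_000_000   # hypothetical number of dependent divisions
CLOCK_GHZ = 3.0      # assumed clock speed, identical for both cores

old = chain_time_ns(N_DIVS, 97, CLOCK_GHZ)   # 97-cycle IDIV
new = chain_time_ns(N_DIVS, 18, CLOCK_GHZ)   # 18-cycle IDIV

print(f"97-cycle IDIV: {old / 1e6:.1f} ms")
print(f"18-cycle IDIV: {new / 1e6:.1f} ms")
print(f"speedup on the dependent chain: {old / new:.2f}x")
```

Because the clock cancels out, the ratio depends only on the two latencies: 97/18, or a little over 5x, for code that is entirely bound by division latency. Real workloads will see less, since few are nothing but dependent divides.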
For the general purpose registers, we see a lot of changes, and most of them quite sizable.
Sunny Cove GPR Changes
AnandTech | Instruction | Skylake | Sunny Cove |
Complex LEA | Complex Load Effective Address | 3 cycle latency, 1 per cycle | 1 cycle latency, 2 per cycle |
SHL/SHR | Shift Left/Right | 2 cycle latency, 0.5 per cycle | 1 cycle latency, 1 per cycle |
ROL/ROR | Rotate Left/Right | 2 cycle latency, 0.5 per cycle | 1 cycle latency, 1 per cycle |
SHLD/SHRD | Double Precision Shift Left/Right | 4 cycle latency, 0.5 per cycle | 4 cycle latency, 1 per cycle |
4*MOV | Four repeated string MOVS | Limited instructions | 104 bits/clock, All MOVS* Instructions |
In the past we’ve seen x87 instructions being regressed, made slower, as they become obsolete. For whatever reason, Sunny Cove decreases the FMUL latency from 5 cycles to 4 cycles.
The SIMD units also go through some changes:
Sunny Cove SIMD
AnandTech | Instruction | Skylake | Sunny Cove |
SIMD Packing | SIMD Packing now slower | 1 cycle latency, 1 per cycle | 3 cycle latency, 1 per cycle |
AES* | AES Crypto Instructions (for 128-bit / 256-bit) | 4 cycle latency, 2 per cycle | 3 cycle latency, 2 per cycle |
CLMUL | Carry-Less Multiplication | 7 cycle latency, 1 per cycle | 6 cycle latency, 1 per cycle |
PHADD/PHSUB | Packed Horizontal Add/Subtract and Saturate | 3 cycle latency, 0.5 per cycle | 2 cycle latency, 1 per cycle |
VPMOV* xmm | Vector Packed Move | 2 cycle latency, 0.5 per cycle | 2 cycle latency, 1 per cycle |
VPMOV* ymm | Vector Packed Move | 4 cycle latency, 0.5 per cycle | 2 cycle latency, 1 per cycle |
VPMOVZX/SX* xmm | Vector Packed Move | 1 cycle latency, 1 per cycle | 1 cycle latency, 2 per cycle |
POPCNT | Microcode 50% faster than SW (under L1-D size) |
REP STOS* | Repeated Store String | 62 bits/cycle | 54 bits/cycle |
VPCONFLICT | Still Microcode Only |
We’ve already gone through all of the new AVX-512 instructions in our Sunny Cove microarchitecture disclosure. These include the following families:
- AVX-512_VNNI (Vector Neural Network Instructions)
- AVX-512_VBMI (Vector Byte Manipulation Instructions)
- AVX-512_VBMI2 (second level VBMI)
- AVX-512_BITALG (bit algorithms)
- AVX-512_IFMA (Integer Fused Multiply Add)
- AVX-512_VAES (Vector AES)
- AVX-512_VPCLMULQDQ (Carry-Less Multiplication of Long Quad Words)
- AVX-512+GFNI (Galois Field New Instructions)
- SHA (not AVX-512, but still new)
- GNA (Gaussian & Neural Accelerator)
(Intel also has the GMM (Gaussian Mixture Model) inside the core since Skylake, but I’ve yet to see any information on this outside a single line in the coding manual.)
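For readers who want to check which of these families their own machine reports, the following is a minimal sketch that looks for the corresponding flags in Linux's `/proc/cpuinfo`. The flag spellings are the ones we believe the Linux kernel uses and should be treated as assumptions; the function and dictionary names are made up for the example. GNA is left out because it is not reported as a CPU flag in this way.

```python
# Assumed Linux /proc/cpuinfo flag names for the extensions listed above.
ICE_LAKE_FLAGS = {
    "AVX-512_VNNI": "avx512_vnni",
    "AVX-512_VBMI": "avx512vbmi",
    "AVX-512_VBMI2": "avx512_vbmi2",
    "AVX-512_BITALG": "avx512_bitalg",
    "AVX-512_IFMA": "avx512ifma",
    "VAES": "vaes",
    "VPCLMULQDQ": "vpclmulqdq",
    "GFNI": "gfni",
    "SHA": "sha_ni",
}


def detect_extensions(cpuinfo_text):
    """Map each extension name to True/False using the first 'flags' line found."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            break
    return {name: flag in flags for name, flag in ICE_LAKE_FLAGS.items()}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or file unreadable: everything reports as absent
    for name, present in detect_extensions(text).items():
        print(f"{name:16s} {'yes' if present else 'no'}")
```

Parsing is kept separate from file access so the function can be fed saved `cpuinfo` dumps from other machines as well.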
For all these new AVX-512 instructions, it’s worth noting that they can be run in 128-bit, 256-bit, or 512-bit mode, depending on the data types passed to them. Each mode can have its own latencies and throughputs, which often get worse in 512-bit mode, but assuming you can fill the register with a 512-bit data type, the raw processing will still be faster overall, even with the frequency differential. Note that this doesn’t take into account any additional overhead for entering the 512-bit power state.
Most of these new instructions are relatively fast, with most of them only 1-3 cycles of latency. We observed the following:
Sunny Cove Vector Instructions
AnandTech | Instruction | Metric | XMM | YMM | ZMM |
VNNI | Vector Neural Network Instructions | Latency | 5-cycle | 5-cycle | 5-cycle |
VNNI | | Throughput | 2/cycle | 2/cycle | 1/cycle |
VPOPCNT* | Return the number of bits set to 1 | Latency | 3-cycle | 3-cycle | 3-cycle |
VPOPCNT* | | Throughput | 1/cycle | 1/cycle | 1/cycle |
VPCOMPRESS* | Store Packed Data | Latency | 3-cycle | 3-cycle | 3-cycle |
VPCOMPRESS* | | Throughput | 0.5/cycle | 0.5/cycle | 0.5/cycle |
VPEXPAND* | Load Packed Data | Latency | 5-cycle | 5-cycle | 5-cycle |
VPEXPAND* | | Throughput | 0.5/cycle | 0.5/cycle | 0.5/cycle |
VPSHLD* | Vector Shift | Latency | 1-cycle | 1-cycle | 1-cycle |
VPSHLD* | | Throughput | 2/cycle | 2/cycle | 1/cycle |
VAES* | Vector AES Instructions | Latency | 3-cycle | 3-cycle | 3-cycle |
VAES* | | Throughput | 2/cycle | 2/cycle | 1/cycle |
VPCLMUL | Vector Carry-Less Multiply | Latency | 6-cycle | 8-cycle | 8-cycle |
VPCLMUL | | Throughput | 1/cycle | 0.5/cycle | 0.5/cycle |
GFNI | Galois Field New Instructions | Latency | 3-cycle | 3-cycle | 3-cycle |
GFNI | | Throughput | 2/cycle | 2/cycle | 1/cycle |
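One way to read the throughput rows is to convert them into peak bits processed per cycle at each register width, which shows how a lower instruction rate at 512-bit can still mean as much or more raw work. The sketch below does that arithmetic using the throughput figures from the table above; the dictionary and function names are ours, and the model is a simple peak estimate (throughput times register width) that ignores frequency changes and memory effects.

```python
# Instructions per cycle at XMM, YMM, ZMM, copied from the table above.
THROUGHPUT = {
    "VNNI":        (2, 2, 1),
    "VPOPCNT*":    (1, 1, 1),
    "VPCOMPRESS*": (0.5, 0.5, 0.5),
    "VPEXPAND*":   (0.5, 0.5, 0.5),
    "VPSHLD*":     (2, 2, 1),
    "VAES*":       (2, 2, 1),
    "VPCLMUL":     (1, 0.5, 0.5),
    "GFNI":        (2, 2, 1),
}
WIDTHS = (128, 256, 512)  # register width in bits for XMM, YMM, ZMM


def bits_per_cycle(name):
    """Peak bits processed per cycle at each width: throughput x register width."""
    return tuple(int(t * w) for t, w in zip(THROUGHPUT[name], WIDTHS))


for name in THROUGHPUT:
    xmm, ymm, zmm = bits_per_cycle(name)
    print(f"{name:12s} XMM {xmm:4d}  YMM {ymm:4d}  ZMM {zmm:4d}  bits/cycle")
```

On these numbers, the instructions that drop from 2/cycle to 1/cycle at ZMM end up level with YMM in peak bits per cycle, while those with a flat throughput across widths double at ZMM.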
For all of the common AVX2 instructions, xmm/ymm latencies and throughputs are identical to Skylake; however, zmm is often a few cycles slower for DIV/SQRT variants.
Other Noticeable Observations
From our testing, we were also able to probe some of the other parts of the core, such as the added store ports and shuffle units.
Our data shows that the second store port is not identical to the first, which explains the imbalance when it comes to writes: rather than supporting 2x64-bit with loads, it only supports either 1x64-bit write, or 1x32-bit write, or 2x16-bit writes. This means we mainly see speed ups with GPR/XMM data, and the result is only a small improvement for 512-bit SCATTER instructions. Otherwise, it seems not to work with any 256-bit or 512-bit operand (you can however use it with 64-bit AVX-512 mask registers). This is going to cause a slight headache for anyone currently limited by SCATTER stores.
The new shuffle unit is only 256-bit wide. It will handle a number of integer instructions (UNPCK, PSLLDQ, SHUF*, MOVSHDUP, but not PALIGNR or PACK), but only a couple of floating point instructions (SHUFPD, SHUFPS).
261 Comments
zodiacfml - Friday, August 2, 2019
Yes and No. Intel at 10nm should have made AMD nervous but products only at 4 cores, there is nothing or little benefit with 10nm. I reckon, AMD's 7nm mobile parts will mostly start at 6 cores.
Kevin G - Thursday, August 1, 2019
Those 3D particle movement tests seem to be too good to be true. There should be a gigantic jump due to an optimized AVX-512 code path and ICL's enhanced caching structure but it is beyond that in the comparison. I'm not actually suspecting the ICL system given the disclosures in the article (odd that the note about AVX-512 intrinsics for the 3DPM test is mentioned around SPEC compiler settings) but rather the other test systems. Were the Whiskey Lake or Kaby Lake systems power or thermal constrained at all? On those Huawei laptops, were you able to set their fan to a fixed 100% to match that of the ICL system?
Ian Cutress - Thursday, August 1, 2019
The AVX-512 tests were similar when we compared Cannon Lake to Kaby Lake at the same frequency. Against unoptimized SSE code, AVX-512 is killer.
Kevin G - Friday, August 2, 2019
Getting a bit more than double the performance from AVX2 vs. AVX-512 should be possible using some of the new Ice Lake extensions and the obvious doubling of SIMD width. But going from a score of 1802 in Whiskey Lake 25W to 9242 for Ice Lake 25W, over a factor of 5! Ice Lake would have to remove some other bottleneck that the 3DPM test hits really hard (division?).
Looking back at your previous reviews ( https://www.anandtech.com/show/13400/intel-9th-gen... ), you can see a similar speed up from AVX-512 between the i9 9900K and the i9 7820X but that is explained from Skylake-X having both double the SIMD width and double the number of SIMD execution units. The client version of Ice Lake shouldn't have the same AVX-512 throughput as Sky Lake server.
CSMR - Thursday, August 1, 2019
> the one area where Ice Lake excels in is graphics. Moving from 24 EUs to 64 EUs, plus an increase in memory bandwidth to >50 GB/s, makes for some easy reading.
I don't understand the comparison here and in this article. If you say a high-end Intel processor update excels in graphics, you should compare to previous high-end processors (e.g. i7-8559U with Iris Plus 655). These have 48 EUs, not 24, and have 128MB eDRAM at 100 GB/s, unlike the Ice Lake.
I am very interested in how the best Ice Lake processors compare to the best previous-gen processors, not how they compare to mediocre previous-gen processors.
Could the article be updated with some appropriate comparisons?
eastcoast_pete - Thursday, August 1, 2019
Agree on adding the best previous generation graphics to the comparison. Also, while the over 1 TFlops for the 64EU Gen 11 sounds (and is) impressive (within the Intel iGPU world), didn't the 48EU with Crystal Well get close to that already?
Rudde - Thursday, August 1, 2019
The first apu with 1TFlops performance statement is full of asterisks. First, you have to exclude AMD; second, you have to exclude Intel Iris gpus with eDRAM.
Phynaz - Thursday, August 1, 2019
AMD mobile chips are hot garbage
eva02langley - Friday, August 2, 2019
Your opinion is not a fact... and it is garbage for real.
Phynaz - Friday, August 2, 2019
Hahaha. It’s a fact. It’s why they have 0% market share.