Hot Chips 2020 Live Blog: Next Gen Intel Xeon, Ice Lake-SP (9:30am PT)
by Dr. Ian Cutress on August 17, 2020 11:15 AM EST- Posted in
- CPUs
- Intel
- Xeon
- Enterprise CPUs
- 10nm
- Live Blog
- Ice Lake
- Ice Lake-SP
- Hot Chips 32
12:09PM EDT - Our first talk of the day is from Intel, about its next-generation Ice Lake Xeon Scalable processor.
12:10PM EDT - We're 20 minutes from the Intel talk starting, but Hot Chips will commence with a 15-minute intro talk to the conference, which we'll cover here
12:10PM EDT - This is the first 'Virtual' Hot Chips, due to COVID. Last year's attendance was 1200-1400 or so (I'm still waiting on exact numbers)
12:10PM EDT - With the conference going virtual, they cut prices, which means there has been an uptick in signups I'm told
12:11PM EDT - Highest cost for the conference and tutorials was $160. Bargain
12:11PM EDT - Tutorials were yesterday, whereas the main conference starts today
12:12PM EDT - Today there are a lot of talks on CPU and GPU. Intel, IBM, AMD, more Intel, then NVIDIA A100, Intel Xe, and Xbox Series X to finish around 6pm PT
12:18PM EDT - And here we go with the intro to the conference
12:19PM EDT - Record registration numbers. 2100+ as of this morning, still growing
12:20PM EDT - Intel is the Rhodium sponsor
12:20PM EDT - That paid for some of the equipment for streaming, and provided the studio for the event
12:20PM EDT - Platinum sponsor is AMD
12:21PM EDT - Now going through some of the attendee info - links to help with logins and such
12:23PM EDT - Presentations and recordings are usually made public by end-of-year
12:29PM EDT - Two keynotes, one from Raja
12:32PM EDT - Questions go through Slack throughout the event
12:32PM EDT - And now the first session begins
12:33PM EDT - First up is Intel Ice Lake Xeon
12:34PM EDT - Speaker was lead on Nehalem-EX, and featured in Sandy, Ice
12:34PM EDT - 10nm+ process
12:34PM EDT - New 2-socket Whitley platform
12:34PM EDT - Uses Sunny Cove
12:35PM EDT - New ISA
12:35PM EDT - 384 OoO window, 128+72 in flight loads/stores
12:35PM EDT - vs cascade
12:35PM EDT - 48 kB L1D
12:36PM EDT - 1.25 MB L2 cache
12:36PM EDT - ~18% IPC over Cascade
12:36PM EDT - second FMA
12:37PM EDT - New instructions
12:37PM EDT - AVX-512 IFMA, VPMADD52
12:37PM EDT - Vector AES, GFNI, SHA-NI
12:37PM EDT - VBMI, VPOPCNT*
12:38PM EDT - (not much more detail than what's on the slides)
12:38PM EDT - Updating current software to boost performance
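For readers unfamiliar with the IFMA instructions named above, here is a minimal Python sketch of what one 64-bit lane of VPMADD52LUQ and VPMADD52HUQ computes, based on the publicly documented instruction semantics rather than anything shown in the talk. The function names and the sample operands are illustrative only; real code would use the AVX-512 intrinsics across eight lanes at once.

```python
MASK52 = (1 << 52) - 1  # operands are treated as unsigned 52-bit integers
MASK64 = (1 << 64) - 1  # each accumulator lane is 64 bits wide


def vpmadd52luq(acc, a, b):
    """One lane: add the LOW 52 bits of the 104-bit product a*b to acc."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product & MASK52)) & MASK64


def vpmadd52huq(acc, a, b):
    """One lane: add the HIGH 52 bits of the 104-bit product a*b to acc."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


# Illustrative operands: multiply two 52-bit "limbs" of a big integer.
a = MASK52           # largest 52-bit value
b = 0x123456789ABCD  # arbitrary 52-bit value
lo = vpmadd52luq(0, a, b)
hi = vpmadd52huq(0, a, b)

# The two halves recombine into the full 104-bit product.
assert (hi << 52) + lo == a * b
```

Splitting the product into two 52-bit halves leaves 12 spare bits in each 64-bit lane, which is what lets big-integer code (RSA-style modular multiplication, for example) accumulate several partial products before it has to propagate carries.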
12:40PM EDT - New infrastructure architecture
12:40PM EDT - New control structure
12:40PM EDT - Distributed control and telemetry fabric
12:41PM EDT - One new fabric dedicated for power, one for other
12:41PM EDT - P-Unit for power
12:41PM EDT - Communication streamlined
12:42PM EDT - Control is IP independent
12:42PM EDT - Building new SoCs becomes easier
12:43PM EDT - Migration from Cascade to Ice
12:43PM EDT - 28 core to 28 core
12:43PM EDT - Move from 6x3 mesh to 7x3 mesh
12:43PM EDT - Memory is now 2 channels per segment, not 3
12:43PM EDT - So 8 memory channels total
12:44PM EDT - IOs on north and south of die
12:44PM EDT - PCIe Gen 4 (x64?)
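As a rough sanity check on what a PCIe Gen 4 x64 configuration would mean, here is a small sketch of the peak link bandwidth. The 16 GT/s per-lane rate and 128b/130b encoding come from the PCIe 4.0 specification, not from the talk, and the x64 lane count is still the unconfirmed guess from the entry above. The function name is made up for illustration.

```python
def pcie_peak_gb_per_s(lanes, gt_per_s=16.0, encoding=128 / 130):
    """Peak bandwidth in GB/s, one direction, before protocol overhead.

    gt_per_s: raw transfer rate per lane (16 GT/s for PCIe 4.0).
    encoding: fraction of raw bits carrying payload (128b/130b line code).
    """
    return lanes * gt_per_s * encoding / 8  # divide by 8 to go from bits to bytes


x16 = pcie_peak_gb_per_s(16)  # a single x16 slot
x64 = pcie_peak_gb_per_s(64)  # the speculated per-socket total

print(f"x16: {x16:.1f} GB/s per direction")
print(f"x64: {x64:.1f} GB/s per direction")
```

Under those assumptions an x16 slot works out to roughly 31.5 GB/s and x64 to roughly 126 GB/s per direction. Real throughput will be lower once packet headers and flow control are accounted for.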
12:45PM EDT - New IO virtualization implementation, up to 3x bw scaling
12:45PM EDT - larger TLBs and large page sizes
12:45PM EDT - 3 UPI links, independently clocked
12:45PM EDT - Doesn't say if 10.2 GT/s
12:46PM EDT - Each UPI agent has its own fabric stop for better comms to other sockets
12:46PM EDT - New memory controller design with optimizations - built from ground up, built with efficiency in mind
12:47PM EDT - Best efficiency across all frequencies. Supports top DDR4 speeds (3200 at 2DPC?)
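To put the eight memory channels in context, here is a quick sketch of theoretical peak DRAM bandwidth per socket. It assumes each DDR4 channel is 64 bits (8 bytes) wide, and it takes DDR4-3200 as the top speed, which the talk has not confirmed. The Cascade Lake comparison point (6 channels at DDR4-2933) is my own addition from Intel's published specs, not something from the slides.

```python
def ddr_peak_gb_per_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s for one socket.

    bytes_per_transfer: 8 for a standard 64-bit DDR4 channel (ECC bits excluded).
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


ice_lake = ddr_peak_gb_per_s(8, 3200)      # assumes DDR4-3200 is supported
cascade_lake = ddr_peak_gb_per_s(6, 2933)  # assumption: published Cascade Lake config

print(f"Ice Lake-SP (8ch, DDR4-3200):  {ice_lake:.1f} GB/s")
print(f"Cascade Lake (6ch, DDR4-2933): {cascade_lake:.1f} GB/s")
print(f"Ratio: {ice_lake / cascade_lake:.2f}x")
```

On those assumptions the peak figures are 204.8 GB/s against about 140.8 GB/s. These are ceilings, not measured numbers; sustained bandwidth depends on access patterns and DIMM population.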
12:47PM EDT - TME using AES-XTS 128-bit, enabled by BIOS
12:47PM EDT - When enabled, entire memory is encrypted. Key is not accessible from BIOS or software. HW generated key
12:47PM EDT - Overhead is a few percent perf impact
12:48PM EDT - Support for Optane-200 DCPMM
12:48PM EDT - At top DDR4 speed? DDR4-3200? I thought 200 was 2666 only
12:48PM EDT - New mechanisms for latency and coherence
12:49PM EDT - Dynamic prefetch throttling - modulates prefetching when memory bandwidth is constrained, to keep performance up rather than flooding memory with prefetches
12:50PM EDT - Non-Temporal Write optimization helps low core count writes by not waiting for snoop responses - pull data from core early
12:52PM EDT - OSB - opportunistic snoop broadcast updated, support for new opcodes to reduce latency for socket cache-to-cache by ~70ns
12:54PM EDT - Bandwidth increases compared to Cascade
12:54PM EDT - Now power management latency
12:55PM EDT - P-state and C-state transition latency were hurting performance
12:55PM EDT - New PLL design allows for not locking
12:55PM EDT - Allows transitions almost not-visible
12:56PM EDT - Latency spikes disappear when P-states change
12:56PM EDT - Also new Fabric frequency change - used to drain buffers and restart clocks. Now no longer needed, reduces latency by 3x
12:56PM EDT - Latencies on bottom right of slide
12:57PM EDT - AVX512 frequency is low compared to SSE - now some improvements
12:57PM EDT - Better power analysis of specific AVX512 instructions
12:57PM EDT - AVX512 now has smarter mapping between instructions and power levels
12:57PM EDT - 3 new power levels for AVX512
12:58PM EDT - For specific instructions, end up with better frequency for 256-bit and 512-bit instructions
12:58PM EDT - Provides software writers more incentive to use AVX-512
12:59PM EDT - Speed Select Features
12:59PM EDT - SST-PP: Performance Profile
12:59PM EDT - SST-BF: Base Frequency
12:59PM EDT - SST-CP: Core Power
01:00PM EDT - SST-TF: Turbo Frequency
01:00PM EDT - Select Ice Lake SKUs will have Intel SST enabled, allowing customers to change the performance profile of the CPU based on cooling or requirements
01:00PM EDT - Dynamically adjusted at runtime
01:02PM EDT - Wrap up - Sunny Cove in Xeon on 10nm. Better infrastructure and fabric control
01:03PM EDT - Ice Lake: A Balanced CPU for All Server Usages
01:04PM EDT - Now Q&A
01:04PM EDT - Q: What is the perf impact when TME enabled? A: Target was to be less than 5%. We are seeing 1-2% on pre-prod samples. Not more than that.
01:05PM EDT - Q: How will base frequency scale for AVX-512? Only turbo was in the presentation. A: Similar improvements will apply. Less loss of freq for similar instructions
01:06PM EDT - Q: Support additional crypto? A: Reach out to Intel if you want additional algorithms
01:06PM EDT - Q: What change in PCIe for VM improvement? A: New Virtualization engine design. Increased TLB. VT-D IOMMU running at double speed. Large page support for translation requests as well. All new, that's how 2x
01:07PM EDT - Q: 18% IPC at iso-core. How does it compare with Cascade/Cooper? A: Cascade and Cooper were the same architecture. No comment on SoC level performance. We will see substantial improvements at SoC level.
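Since the ~18% IPC figure is quoted at iso-core and Intel would not comment on SoC-level performance, here is a simple sketch of how IPC and clock speed trade off. It uses the first-order model that single-thread performance scales with IPC times frequency, which ignores memory, power limits and core counts. The function names and the 10% clock deficit in the example are hypothetical, not figures from the talk.

```python
def relative_performance(ipc_gain, freq_ratio):
    """Per-core performance relative to the old core (1.0 = parity).

    ipc_gain: fractional IPC uplift, e.g. 0.18 for +18%.
    freq_ratio: new frequency divided by old frequency.
    """
    return (1 + ipc_gain) * freq_ratio


def break_even_freq_ratio(ipc_gain):
    """Frequency ratio at which the new core only matches the old one."""
    return 1 / (1 + ipc_gain)


ipc_gain = 0.18  # the ~18% figure from the talk

print(f"Break-even clock: {break_even_freq_ratio(ipc_gain):.1%} of the old clock")
# Hypothetical example: the new part clocks 10% lower than the old one.
print(f"At 0.90x clock: {relative_performance(ipc_gain, 0.90):.3f}x per-core perf")
```

In this model an 18% IPC gain is cancelled out if clocks fall to about 84.7% of the previous generation, so the final shipping frequencies matter as much as the IPC number.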
01:08PM EDT - That's a wrap. Next talk is IBM, head on over to that live blog
24 Comments
Spunjji - Tuesday, August 18, 2020
I'm genuinely interested to see whether this ends up being one of their "shipped for revenue" releases that they stop talking about once a successor rolls around, or whether it actually gets out there in volume.
JayNor - Tuesday, August 18, 2020
tomshardware picked up on this tidbit on IF type features added in Ice Lake Server's fabric: "Intel redesigned the chip to support two new sideband fabrics, one controlling power management and the other used for general-purpose management traffic. These provide telemetry data and control to the various IP blocks, like execution cores, memory controllers, PCIe/UPI controllers, and the like. This is akin to AMD's Infinity Fabric, which also features a sideband telemetry/control mechanism for SoC structures."
https://www.tomshardware.com/news/intel-10nm-xeon-...
JayNor - Tuesday, August 18, 2020
there was also this tidbit in the toms article ... 3x fabric bandwidth seems significant: "The die includes a separate peer-to-peer (P2P) fabric to improve bandwidth between cores, and the I/O subsystem was also virtualized, which Intel says offers up to three times the fabric bandwidth compared to Cascade Lake."
TomWomack - Tuesday, August 25, 2020
Silly question, but how wide are the FP registers into which the unit is renaming? Are they full 256-bit SIMD with other instructions using a power-gated bottom quarter, or are SIMD instructions renaming into multiple registers?