big.LITTLE: Heterogeneous ARM MP

The Cortex A15 is going to be a significant step forward in performance for ARM architectures. ARM hopes it will be enough to actually begin to threaten the low end of the x86 space, which gives you an idea of just how powerful these cores are going to be. The A15 will also find its way into smartphones and tablets, ultimately replacing the Cortex A9s used by high-end devices today. 

For heavy workloads, the Cortex A15 is expected to be more power efficient than the A9. The core may draw more instantaneous power, but it will do so for a shorter period of time thus allowing the CPU(s) to get to sleep quicker and reducing average power.

As ARM has often argued (particularly against Intel) however, these big out-of-order microprocessor architectures are inefficient at dealing with lightweight mobile workloads. In particular, things like background tasks running on your phone while it’s locked in your pocket simply don’t demand the performance of a Cortex A15. ARM further argues that the power consumed by an A15 running these tasks, even though only for a short period of time, is greater than it would be on a much simpler in-order architecture. This is where the A7 comes into play.

Although the Cortex A7 is fully capable of being used on its own (and it most definitely will be), ARM’s partners are free to integrate Cortex A7 cores alongside Cortex A15 cores in a big.LITTLE (or little.BIG?) configuration. 

Since the A7 and A15 are equally capable of executing the same ARM instruction set, any applications running on one core can just as easily be migrated to run on the other. In the example above there are a pair of A15s and a pair of A7s on a single SoC. In this particular configuration, the OS only believes there are two cores in the machine. ARM’s own power management firmware determines which core cluster to activate depending on performance states requested by the OS. If the OS wants a high performance state, ARM returns the A15 cores at a high p-state. If it wants a low performance state, the chip will put the A15s to sleep and schedule everything on the A7s. Cache coherency is guaranteed via the CCI-400 interconnect, so any data invalidated by one core cluster will be reflected in the other cluster’s cache. ARM claims it can switch between core clusters in this configuration in as quick as 20 microseconds.

If everything works the way ARM has described it, a big.LITTLE configuration should be perfectly transparent to the OS (similar to what NVIDIA is promising with Kal-el). ARM did add that SoC vendors are free to expose all cores to the OS if they would like, although doing so would obviously require OS awareness of the different core types.

Core Configurations, Process Technology & Final Words

ARM’s Cortex A7 will be available in 1 - 4 core configurations, both as the primary CPU in an SoC as well as in a big.LITTLE configuration alongside some A15s. ARM expects that we will see some 40nm A7 designs as early as the end of next year for use in low end smartphones (~$100). Most smartphone configurations, even at these price points will likely use dual-core A7 implementations. It’s only in emerging markets that ARM is expecting to see single core Cortex A7 smartphone devices. This is pretty big news as it means that even value smartphones will be dual-core by 2013.

Costs will keep the A7 on 40nm for a while although the cores will be offered at 28nm for integration into A15 designs as well as for even higher performance/lower power implementations.

I have to say that I’m pretty excited about the Cortex A7 announcement across the board. It looks like this core will not only enable much better performance at the value end of the device spectrum but it should bring battery life improvements at the high end as well. Chip architects have argued for years that we were going to see heterogeneous computing as the next phase in the evolution of microprocessors, it’s fascinating to see that we may get the first consumer application of it in ultra mobile devices.

Comments Locked


View All Comments

  • kgardas - Wednesday, October 19, 2011 - link

    While this is nice article I really hate people usually forget Marvell Armada 628 -- heterogenous triple-core and just talk about NVidia Kal-El.
    Ptherwise combination of single-core A7 + 4-8 cores of A15 looks like the killer SoC. :-)
  • fteoath64 - Thursday, October 20, 2011 - link

    @Kgardas: While the chip looks good on paper, it supports USB3 which is no use in mobile, so it is likely to be used in NAS, MediaPlayer and STB applications. The chip also looks big in size so I guess its power consumption on full bore is significant in ARM terms.
  • introiboad - Thursday, October 20, 2011 - link

    The OMAP family also does this with Cortex-A and Cortex-M cores put together in a die. It's quite similar to what ARM describes here, except of course the instruction sets are not the same
  • lancedal - Wednesday, October 19, 2011 - link

    But is very hard to realize. Today's OS are not application aware. Meaning it would not know if a thread is from application X and need 1GHz vs. another thread from application Y that only require 1MHz. As such, it would not be able to dynamically moving thread from one core to another without guarantee not missing deadline.
    If the small core is dedicate to do housekeeping thread only (i.e. sync, standby etc), that is all good but there is no need to do that anyway because such tasks are so infrequency (every hundred ms or so). Therefore, you can wakeup the big core, and shut it down.
  • hechacker1 - Wednesday, October 19, 2011 - link

    I don't think it's so hard. The scheduler would start all processes on the slow core, and if the CPU utilization doesn't exceed its maximum over a very short period of time, keep it there because it obviously doesn't need the extra processing power.

    The Android schedulers (CFS and BFS) are nano-second time aware, so the latency penalty could be managed.

    Of course, it would be best if the programmer could explicitly place their program into a core, but you can already do that with sched_realtime, sched_fifo, and sched_batch policies. The question is really how far Android optimizes for this sort of thing. Right now I think they treat everything as realtime fifo queues, instead of letting the built-in Linux schedulers do their thing.
  • lancedal - Thursday, October 20, 2011 - link

    Keep in mind that there are performance and latency penalty when power on/off the big core. When you power up the big core, it would take time (to power up, reset and boot). Its L2$ are empty, so it would need to be heated up. All of this added to performance impact that is very hard to take into account by the O.S.

    as you said, if programmer can specify performance/deadline/cpu requirement, then everything would be simple. However, we can't expect that from million of developers out there. It's just not practical.
  • Penti - Thursday, October 20, 2011 - link

    Boot? They would lay there as cores in a low power state, for which there would be several, it won't take tens of seconds to start up. I'm sure they can handle the power states, clocks and scheduling pretty well in the os's. The OS would know when it's in a power saving mode or not or when it needs the performance or not. It would very much depend on power profiles and so on. I would not expect them to be in a deep sleep mode at every time when resources are needed. But it's point is of course battery-life not performance. A flag on programs that want to run on the big cores would probably be easy to implement on a system like Android, I wouldn't think as it's not just pure Linux ELFs. But we will see what kind of schemes there will be soon I guess.
  • lancedal - Tuesday, October 25, 2011 - link

    Leakage is the reason they have "big-small" setup. So if the big core is not used, it will be "shut-down". So it will boot up from ROM when woken up. In 28 and 20nm, leakage will be the dominating power factor.
  • metafor - Thursday, October 20, 2011 - link

    You'd be surprised. Most modern OS's (including Android) have not only profiling but API support for applications to poll for resources such as CPU. Most of the time, you won't be seeing something like Pandora take the CPU up to 100% even if it could, in theory, burst process a lot of data and then go to sleep.

    The problem is that a lot of things can indeed be done faster -- web browser rendering is one primary example of something that would hog up as much CPU as it can.

    And there's not really a way for a user to specify "hey, I don't mind if the page renders slower, stop using so much power".
  • lancedal - Thursday, October 20, 2011 - link

    True, the OSs do profiling, but it is hardly accurate to guarantee real-time performance. Yes, the API would help, but not many application, if any, that specify the resource it needs as that would depend on system. It's just not practical to require software developer to specify the resource needed given how many developers and applications we have today. Big guy like Pandora, sure, but not million of little guys out there.
    Deciding when to switch back and forth between the LITTLE and BIG core is hard because it's not free. It cost power and performance (latency). If you switch to often, then you end up costing more power. The problem is there is no fix criteria to switch.
    If you have the little core to handle "system tasks" and the big core to handle application (like Tegra-3), then it may work. However, that only help standby power and wont' do much for extend web-browsing time.

Log in

Don't have an account? Sign up now