OpenCL Programming Model and Suitability for FPGAs

OpenCL is an open-standard programming interface developed by the Khronos Group and designed particularly for parallel and heterogeneous computing. It can be used to program various types of hardware, including CPUs, GPGPUs, FPGAs and many-core coprocessors such as Xeon Phi or Adapteva's Epiphany. Each hardware vendor that wants its hardware to be exposed to OpenCL needs to provide an OpenCL driver for it. For example, OpenCL drivers are available for various CPUs and GPUs on Windows, Linux and Mac, and Altera is now providing OpenCL drivers and associated development tools for its FPGAs. Prior to OpenCL, there was no standard programming language that exposed coprocessors on an equal footing. Thanks to the rise of GPGPU, the idea of accelerators and coprocessors has entered the mainstream, and having a standard interface to accelerators is a big win for FPGA vendors. In many ways, the concepts that apply to programming a discrete GPGPU placed on a PCIe board also apply to FPGAs.

Let us go over some OpenCL terminology.

Host and devices: The CPU is called the host, and each piece of hardware that has an OpenCL driver is called a device. Each device can have one or more compute units, and each compute unit can have multiple processing elements. This is shown visually below (figure from Hands on OpenCL course).

For example, Nvidia's GTX Titan is built on the GK110 chip, which contains 15 SMX units (14 of them are enabled on the GTX Titan). Each SMX unit corresponds to a compute unit in OpenCL and has 192 processing elements. On FPGAs, the number and complexity of the compute units are not fixed and are instead customized to your application.

Unlike, say, C, C++, Java or Python, OpenCL cannot be used standalone. Instead, the main program runs on the CPU (the host) as usual, and typically only the computationally intensive parts of the program are written in OpenCL and called from the main program. However, work is not automatically distributed across the devices in a system. Instead, the application queries the OpenCL runtime for the list of all OpenCL-compatible devices and chooses the appropriate device for each computation.

Device memory: Each device has its own memory space where it can allocate arrays of data (called buffers) that can be read and written from OpenCL programs. In a discrete GPU or an FPGA, the buffer objects will typically reside in the RAM placed on the PCIe-based board that contains the GPU or FPGA chip. For example, on a GPU such as the Radeon HD 7970, the buffer objects will typically be placed in the GDDR5 RAM. OpenCL provides functions to copy data between host (CPU) memory and device memory. Some vendors also allow data to be transferred directly between multiple devices in a system without CPU intervention.

Kernels: OpenCL programs consist of kernels, which are similar to functions in C. Kernels can read from and write to buffer objects that are passed to them as arguments. Kernels are written in a C-like programming language, and the OpenCL driver for a given piece of hardware compiles each kernel to the appropriate format. For CPUs and GPUs, the vendor's OpenCL driver will compile the kernel to the native instruction set of the processor. We will get into how kernels are compiled by Altera's SDK in the next section.

Work-items, work-groups and parallelism: Unlike, say, C, where a function call usually leads to the execution of a single instance of a function, the host launches a kernel across a 1D, 2D or 3D grid of "work-items". Each work-item can be thought of as a conceptual thread, and each work-item executes the same kernel function. However, each work-item knows its index in the grid and will typically compute a different part of the solution.

For example, let us say you wanted to add two vectors of length N. This is how you would do it in plain C:
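A minimal sketch of such a loop follows. The function name `vec_add_c` and the use of `float` elements are assumptions made here for illustration.

```c
#include <stddef.h>

/* Adds two vectors element by element: c[i] = a[i] + b[i] for i in [0, n).
   The name vec_add_c and the float element type are illustrative choices. */
void vec_add_c(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

The loop body is the only part that does real work; the loop itself just walks the index `i` from 0 to N-1, one element after another.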

You can write a kernel where each work-item adds one element of the vector corresponding to its index. Here is the sample OpenCL kernel.
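A minimal sketch of such a kernel is given here. The kernel name `vec_add`, the `float` element type and the argument names are assumptions. The `#ifndef __OPENCL_VERSION__` block is not part of OpenCL: it is a small host-only shim, added for illustration, so that the same source also compiles as plain C and can be stepped through on a CPU. The variable `emulated_global_id` is likewise hypothetical.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (an assumption for illustration, not part of OpenCL).
   An OpenCL compiler defines __OPENCL_VERSION__ and skips this block. */
#include <stddef.h>
#define __kernel
#define __global
static size_t emulated_global_id = 0;
static size_t get_global_id(unsigned int dim)
{
    (void)dim;                  /* only dimension 0 is emulated */
    return emulated_global_id;
}
#endif

/* Each work-item adds the single element selected by its global index.
   The kernel must be launched with exactly one work-item per element. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```

Note that the kernel contains no loop. In a real program the host would enqueue it with `clEnqueueNDRangeKernel` and a global size of N. With the shim, a launch can be imitated on the CPU by setting `emulated_global_id` to each index in turn and calling `vec_add(a, b, c)` once per index.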

In this case, each work-item performs the work done by one loop iteration in the C code. Thus, to add vectors of size 1000, you would launch this kernel with 1000 parallel work-items. OpenCL is an inherently parallel API and is particularly suited to highly parallel problems.

Work-items are organized into work-groups, which are small grids of, say, 8x8 work-items. Items within a work-group can synchronize with each other, but items from different work-groups cannot. This work-item and work-group organization maps particularly well to GPUs. FPGAs also prefer highly parallel workloads, but the way kernels get compiled for FPGAs is very different, and we will get to that soon.
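The relationship between a work-item's position in the whole grid and its position inside its work-group can be sketched in plain C. This is an illustration only: the type `work_item_pos` and both function names are hypothetical, and a zero global offset is assumed. In OpenCL C the corresponding values come from `get_global_id`, `get_group_id`, `get_local_id` and `get_local_size`.

```c
#include <stddef.h>

/* Position of one work-item along a single dimension of the grid. */
typedef struct {
    size_t group_id;  /* which work-group the item belongs to */
    size_t local_id;  /* index of the item inside that work-group */
} work_item_pos;

/* Splits a global index into (group, local) for a given work-group size.
   Assumes a zero global offset and local_size > 0. */
work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    work_item_pos p;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* Inverse mapping: rebuilds the global index from (group, local). */
size_t join_global_id(work_item_pos p, size_t local_size)
{
    return p.group_id * local_size + p.local_id;
}
```

For a 2D grid with 8x8 work-groups, the same split applies to each dimension separately: the work-item at global position (19, 5) sits in work-group (2, 0) at local position (3, 5).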

 

Local memory: Accessing memory is an expensive operation. CPUs include hardware-managed caches in the hope that data that is reused in the program can be brought into the cache once and then read or written multiple times from the cache. However, some architectures, such as all recent desktop GPUs from AMD, Nvidia and Intel, include a small amount of fast on-chip memory that acts as a software-managed cache. OpenCL provides a construct called "local memory" to expose such software-managed caches. Each work-group can allocate local memory (typically up to 32 or 64kB per work-group), and all work-items in the work-group can read from and write to it. Local memory is implemented via the software-managed cache on GPUs, while CPUs allocate it in regular RAM and hope that it will end up in the cache during program execution. FPGAs also include on-chip memory that can be used to implement OpenCL's local memory construct in hardware. Some members of the Stratix V series include up to 52Mbit (~6.5MB) of on-chip memory that can be used as local memory. In comparison, the Radeon HD 7970 includes about 2MB of local memory on-chip, and a GTX Titan includes about 450kB of local memory.
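The usual way local memory is used, staging a tile of data once and then reusing it many times, can be sketched as a sequential plain-C emulation. This is not device code: `GROUP_SIZE`, `group_sum` and `partial_sums` are hypothetical names, the loops over `lid` stand in for work-items that would run in parallel, and the array `local_tile` stands in for a `__local` buffer.

```c
#include <stddef.h>

#define GROUP_SIZE 8  /* work-items per work-group; an illustrative value */

/* Emulates one work-group: stage a tile of the input in a small scratch
   array (standing in for local memory), then reduce the tile from there. */
static float group_sum(const float *global_in, size_t group_id)
{
    float local_tile[GROUP_SIZE];  /* stand-in for a __local buffer */

    /* Phase 1: each "work-item" copies one element from global memory. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        local_tile[lid] = global_in[group_id * GROUP_SIZE + lid];
    }
    /* On a device, barrier(CLK_LOCAL_MEM_FENCE) would be needed here. */

    /* Phase 2: tree reduction that reads and writes only the tile. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid) {
            local_tile[lid] += local_tile[lid + stride];
        }
        /* On a device, another barrier would follow each step. */
    }
    return local_tile[0];
}

/* Writes one partial sum per work-group; n must be a multiple of GROUP_SIZE. */
void partial_sums(const float *in, float *out, size_t n)
{
    for (size_t g = 0; g < n / GROUP_SIZE; ++g) {
        out[g] = group_sum(in, g);
    }
}
```

Each input element is read from the slow memory exactly once; every later access during the reduction hits the small tile, which is the reuse that local memory is meant to make cheap.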

You can learn more about OpenCL on the official Khronos page or from tutorials such as the recently released Hands on OpenCL. Overall, the OpenCL programming model looks to be a surprisingly good fit for FPGAs. Concepts such as host/device separation, device memory versus CPU memory, an inherently parallel programming model and, finally, the local memory abstraction all look to be very well suited to FPGAs.


56 Comments


  • Atiom - Wednesday, October 9, 2013

    Great article. I was thinking about using FPGAs in my projects, where I mainly use microcontrollers, but I still haven't done it because of the VHDL language, which I haven't had the time to learn. But now with OpenCL, things may get more interesting; I just hope these devices get more affordable. It would be nice if you could keep up this kind of article.
  • Jon Tseng - Wednesday, October 9, 2013

    Tx for the piece. Interesting Altera say much the same thing about high performance compute when I speak to them also.

    Rahul, curious on your thoughts about whether CUDA is a barrier to adoption here. NVIDIA have done a lot driving adoption and supported users. Is this a barrier to switching code to OpenCL? Or are you thinking about FPGA for stuff currently running on x86 or greenfield work?
  • Todd Thompson - Wednesday, October 9, 2013

    Rahul, thanks for this article...you did a great job of messaging the value and use-case for using an FPGA for compute. Please keep up the good work and write more about FPGAs and OpenCL!
  • Todd Thompson - Wednesday, October 9, 2013

    As an aside, I'm working on the Zedboard/Zynq/ARM platform to experiment with using FPGA as a co-processor on an SOC. I will be doing some benchmarking by comparing results of b+ tree database indexing with and without Zynq as co-proc. I cannot wait for Xilinx to support OpenCL and overall OpenCL support for less expensive FPGA products.
  • dneto - Wednesday, October 9, 2013

    Hi, this is David from Altera. :-)

    Good article, and thanks for the shout-out.

    Regarding the development cycle. One of the great things about a standard like OpenCL is that you can prototype your code on a CPU or a GPU and then port it to the FPGA. You do have to watch that you use a common subset of the features available on all platforms, but this will get you a long way toward a more comfortable development flow. You focus on getting a *working* program on CPU/GPU, and then move to the Altera FPGA to run and optimize. Altera publishes a programming guide to help you optimize for our devices. For OpenCL in general, it is well known that optimizing a kernel for absolute best results often requires recoding or restructuring your device code or data.

    Legalese FYI: The official name of our SDK is the "Altera SDK for OpenCL". OpenCL is a trademark of Apple, on license to Khronos.
  • Araemo - Wednesday, October 9, 2013

    I am actually really surprised I see no mention of LLVM in this article. It seems like this is the kind of job that LLVM is well-suited for, based on how many other implementations I've seen of taking one programming language in, and outputting another, more specific language.

    I wonder if LLVM IS involved, and they just aren't talking about it, or if LLVM isn't actually well-suited to this work, but merely easy to extend to arbitrary languages.
  • dneto - Wednesday, October 9, 2013

    David from Altera here.
    Yes, LLVM is part of our compiler toolchain. It's one of many technologies, open source and proprietary, used in our SDK.
    LLVM is a compiler toolkit, with some finished backends. Using LLVM gets you a long way to supporting an OpenCL C compiler. But it doesn't get you the whole way.
  • Araemo - Wednesday, October 9, 2013

    Thanks for the response - I definitely understand that you still have to write significant portions of it to make it output sensible (and efficient) Verilog, but like you said, LLVM is designed with the kind of modularity that makes swapping output backends to add, say, VHDL support easier, and based on other projects I've seen that were made 'possible' by LLVM, I would have been surprised if you ignored it and rolled your own entirely. :)
  • MrSpadge - Wednesday, October 9, 2013

    It could give Altera a huge push if your FPGAs could provide break-through efficiency in any BOINC projects using OpenCL. There are a few, POEM@home, Einstein@home and Collatz@home come to mind, but there are probably more. OpenCL itself is supported by BOINC and currently detects AMD, nVidia and Intel GPUs. But having integrated support for this many coprocessors I'd expect further additions to be smooth.

    Currently spending a few thousand bucks on hardware just for number crunching would be asking for a lot. Current GPUs only cost hundreds of $/€.. but there are quite a few people out there buying significantly more than 1 of them. So the money is there. And electricity cost is a serious concern: e.g. in Germany you pay approximately as much as the GPU cost each year just to keep it crunching 24/7.

    So if Altera can be more efficient than GPUs they could offer cheaper and smaller FPGAs, which might cost 100 - 500 $/€, perform as fast as a GPU (the chip could be smaller for a healthy profit margin, if the algorithm is suitable) and thereby consume significantly less energy.. they'd have a winner!
  • MrSpadge - Wednesday, October 9, 2013

    BTW: if the larger FPGAs could thereby be made cheaper there'd very probably also be a market for them. People are even buying Titans just for BOINC, despite them being significantly worse in cost per performance than smaller nVidias.
