OpenCL Programming Model and Suitability for FPGAs

OpenCL is an open-standard programming interface developed by the Khronos Group and designed for parallel and heterogeneous computing. It can be used to program many types of hardware, including CPUs, GPGPUs, FPGAs and many-core coprocessors such as Xeon Phi or Adapteva's Epiphany. Each hardware vendor that wants to expose its hardware to OpenCL needs to provide an OpenCL driver for it. For example, OpenCL drivers are available for various CPUs and GPUs on Windows, Linux and Mac, and Altera now provides OpenCL drivers and associated development tools for its FPGAs. Prior to OpenCL, there was no standard programming language that exposed coprocessors on an equal footing. Thanks to the rise of GPGPU, the idea of accelerators and coprocessors has entered the mainstream, and having a standard interface to accelerators is a big win for FPGA vendors. In many ways, the concepts that apply to programming a discrete GPGPU on a PCIe board also apply to FPGAs.

Let us go over some OpenCL terminology.

Host and devices: The CPU is called the host, and each piece of hardware that has an OpenCL driver is called a device. Each device can have one or more compute units, and each compute unit can have multiple processing elements. This is shown visually below (figure from Hands on OpenCL course).

For example, Nvidia's GTX Titan GPU has 14 active SMX units. Each SMX unit corresponds to a compute unit in OpenCL, and each SMX has 192 processing elements. On FPGAs, the number and complexity of compute units is not fixed and is instead customized to your application.

Unlike, say, C, C++, Java or Python, OpenCL cannot be used standalone. Instead, the main program runs on the CPU (the host) as usual, and typically only the computationally intensive parts of the program are written in OpenCL and called from the main program. However, work is not automatically distributed across devices. Instead, the application can query the OpenCL runtime for the list of all OpenCL-compatible devices in the system and choose the appropriate device for each computation.

Device memory: Each device has its own memory space in which it can allocate arrays of data (called buffers) that OpenCL programs can read and write. On a discrete GPU or an FPGA, the buffer objects will typically reside in the RAM on the PCIe-based board that carries the GPU or FPGA chip. For example, on a GPU such as the Radeon HD 7970, the buffer objects will typically be placed in the GDDR5 RAM. OpenCL provides functions to copy data between host (CPU) memory and device memory. Some vendors also allow data to be transferred directly between multiple devices in a system without CPU intervention.

Kernels: OpenCL programs consist of kernels, which are similar to functions in C. Kernels can read and write the buffer objects that are passed to them as arguments. Kernels are written in a C-like programming language, and the OpenCL driver for a given piece of hardware compiles each kernel to the appropriate format. For CPUs and GPUs, the vendor's OpenCL driver compiles the kernel to the native instruction set of the processor. We will get into how kernels are compiled by Altera's SDK in the next section.

Work-items, work-groups and parallelism: In C, a function call usually leads to the execution of a single instance of the function. In OpenCL, by contrast, the host launches a kernel across a 1D, 2D or 3D grid of "work-items". Each work-item can be thought of as a conceptual thread, and every work-item executes the same kernel function. However, each work-item knows its index in the grid and will typically compute a different part of the solution.

For example, let us say you want to add two vectors of length N. This is how you would do it in plain C:
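A minimal sketch of such a loop, assuming single-precision floats. The function name vec_add_c and its parameter names are chosen here for illustration only:

```c
#include <stddef.h>

/* Adds two vectors of length n element by element: c[i] = a[i] + b[i].
   vec_add_c and its parameter names are illustrative, not from any library. */
void vec_add_c(const float *a, const float *b, float *c, size_t n)
{
    /* One loop iteration handles one element, one after another. */
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```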

In OpenCL, you can write a kernel where each work-item adds the one element of the vectors that corresponds to its index. Here is the sample OpenCL kernel.
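A sketch of what such a kernel might look like. The OpenCL C source is held in a string, as host programs commonly do before handing it to the OpenCL runtime, and a plain-C emulation of a 1D launch follows so the behaviour can be checked without a device. The names vec_add, vec_add_src, vec_add_work_item and launch_1d are assumptions made for this illustration.

```c
#include <stddef.h>

/* OpenCL C source for the kernel, kept as a string for the host program.
   The kernel name vec_add and its argument names are illustrative. */
const char *vec_add_src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c)\n"
    "{\n"
    "    size_t i = get_global_id(0);  /* index of this work-item */\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

/* Plain-C stand-in for the kernel body.
   gid plays the role of get_global_id(0). */
void vec_add_work_item(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] + b[gid];
}

/* Emulates launching the kernel over a 1D grid of n work-items.
   A real device may run them in parallel; here they run one by one. */
void launch_1d(size_t n, const float *a, const float *b, float *c)
{
    for (size_t gid = 0; gid < n; ++gid)
        vec_add_work_item(gid, a, b, c);
}
```

Note that the kernel itself contains no loop: the loop over elements is replaced by the grid of work-items supplied at launch time.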

In this case, each work-item performs the work done by one loop iteration in the C code. Thus, if you wanted to add vectors of size 1000, you would launch this kernel with 1000 parallel work-items. OpenCL is an inherently parallel API and is particularly suited to highly parallel problems.

Work-items are organized into work-groups, which are small grids of, say, 8x8 work-items. Items within a work-group can synchronize with each other, but items from different work-groups cannot. This work-item and work-group organization maps particularly well to GPUs. FPGAs also prefer highly parallel workloads, but the way kernels get compiled to FPGAs is very different, and we will get to that soon.
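To make the grouping concrete, here is a small sketch of how a work-item's global index along one dimension relates to its work-group and its position inside that group. It assumes no global offset and a global size that is a multiple of the work-group size; the names item_pos, split_global_id and num_groups are hypothetical.

```c
#include <stddef.h>

/* Position of one work-item along a single dimension of the grid. */
struct item_pos {
    size_t group_id;  /* which work-group, like get_group_id(d) */
    size_t local_id;  /* index inside the group, like get_local_id(d) */
};

/* Splits a global index into (group, local), assuming no global offset,
   so that global_id == group_id * local_size + local_id. */
struct item_pos split_global_id(size_t global_id, size_t local_size)
{
    struct item_pos p;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* Number of work-groups along one dimension, assuming global_size
   is a multiple of local_size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

For a 2D grid with 8x8 work-groups, the same split would be applied independently to each of the two dimensions.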

 

Local memory: Accessing memory is an expensive operation. CPUs include hardware-managed caches in the hope that data reused by the program can be brought into the cache once and then read and written multiple times from there. However, some architectures, such as all recent desktop GPUs from AMD, Nvidia and Intel, include a small amount of fast on-chip memory that acts as a software-managed cache. OpenCL provides a construct called "local memory" to expose such software-managed caches. Each work-group can allocate local memory (typically up to 32 or 64kB per work-group), and all work-items in the work-group can read and write it. Local memory is implemented via the software-managed cache on GPUs, while CPUs allocate it in regular RAM and hope that it will end up in the cache during program execution. FPGAs also include on-chip memory that can be used to implement OpenCL's local memory construct in hardware. Some members of the Stratix V series include up to 52Mbit (~6.5MB) of on-chip memory that can be used as local memory. In comparison, the Radeon HD 7970 includes about 2MB of local memory on-chip and a GTX Titan includes about 450kB of local memory.
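The typical usage pattern for local memory is to stage a tile of data from device memory once and then reuse it from the fast on-chip copy. The following plain-C sketch emulates a single work-group doing this for a sum; the work-group size of 8 and the names GROUP_SIZE and group_sum are assumptions for illustration, and the local array only stands in for what would be a __local array in a real kernel.

```c
#include <stddef.h>

#define GROUP_SIZE 8  /* hypothetical work-group size, a power of two */

/* Emulates one work-group: stage a tile in "local" memory, then reduce it.
   global_in stands in for a buffer in device memory. */
float group_sum(const float *global_in, size_t group_id)
{
    float local_tile[GROUP_SIZE];  /* stands in for a __local array */

    /* Phase 1: each work-item copies one element from global to local. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        local_tile[lid] = global_in[group_id * GROUP_SIZE + lid];

    /* On a device, barrier(CLK_LOCAL_MEM_FENCE) would go here, and again
       after every reduction step, so all items see each other's writes. */

    /* Phase 2: tree reduction in local memory, halving the number of
       active work-items at each step. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2)
        for (size_t lid = 0; lid < stride; ++lid)
            local_tile[lid] += local_tile[lid + stride];

    return local_tile[0];
}
```

Each input element is read from device memory only once; all later reads and writes hit the local tile.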

You can learn more about OpenCL at the official page at Khronos or look over tutorials such as the recently released Hands on OpenCL. Overall, the OpenCL programming model looks to be a surprisingly good fit for FPGAs. Concepts such as host/device separation, device memory vs CPU memory, an inherently parallel programming model and, finally, the local memory abstraction all look to be very well suited to FPGAs.


56 Comments

  • MrSpadge - Wednesday, October 9, 2013

    BTW 2: David, you might want to contact Slicker, the admin of Collatz@Home. His project is fairly simple (and not that useful.. but people like it nevertheless) and has regularly been at the forefront of new technology (CUDA, ATI Stream, OpenCL, Intel GPUs..). Usually he's also very responsive. I could imagine a deal like: you give him access to your hardware, and if he succeeds you could get loads of publicity (attracting buyers and further developers) and quite a few sales.
  • viv32 - Wednesday, October 9, 2013

    Application driven reconfigurable hardware is an exciting idea. I am not sure how dense the FPGA should be to support the complexity of today's GPUs (if they want FPGAs to replace GPU ASICs). We design network processors and our FPGA emulation boards need at least 4 Stratixes for complete emulation. If the FPGA gate count can match the GPU then can they still be cost effective? My 2c.. please correct me if I'm wrong (I'm no FPGA jockey).
  • rahulgarg - Wednesday, October 9, 2013

    Well it depends. The objective in this case isn't to emulate the GPU at all. If the GPU is actually already a very good fit for your application, then going to FPGAs won't gain you much. But let us say in an application that does not use the GPU's texture units, you don't really want to generate texture units on an FPGA. The idea isn't to emulate the GPU's units or its pipeline, rather it is to generate a *different* pipeline that is more suitable for your application.
  • wyx087 - Wednesday, October 9, 2013

    The benefit of using hardware description languages such as VHDL is just that: it describes the hardware, forcing you to think in terms of the cells that get placed down. OpenCL is a compute language; its programmers won't take into account something as simple as the fact that multipliers are very expensive in hardware unless done in powers of 2.

    Also, a vast number of university courses do VHDL/Verilog/SystemVerilog as standard. Electronics is the course title. I have no doubt the number of HDL experts is much greater than the number of OpenCL experts on this planet.

    The way around "slow compile" is simulation. I see no mention of simulation tools for designing OpenCL on FPGA. Without simulation tools, it is impossible for this to take off. Simulation is the way we verify our design on a functional level.

    The "compile" (it is known as synthesis and implementation, or map, place and route) time is indeed in hours for large designs. Remember you are not just generating a binary for a processor, you are generating a binary file that describes the actual hardware. Put simply, you are generating THE processor.

    - Professional VHDL programmer
  • loki1725 - Sunday, October 13, 2013

    This is actually what I was going to say. When I was an undergrad in EE (1997-2001) our embedded electronics course used VHDL. I taught in the EE department of a different university from 2009 to 2013 and we offered several courses that used VHDL. While there may not be more VHDL courses than OpenCL, the numbers are probably comparable.

    Still, really cool article, and anything that helps drive the adoption of FPGAs is a good step forward.
  • toyotabedzrock - Wednesday, October 9, 2013

    So the compile time happens beforehand, but how long does it take for the FPGA to configure itself when you run a program?
  • rahulgarg - Wednesday, October 9, 2013

    Well once you have done the compilation, my understanding is that flashing the binary is actually very fast so that is not an issue.
  • John32 - Wednesday, October 9, 2013

    Are you saying Altera doesn't provide a simulation stage for testing functionality? That's done before generating the binary file for all designs. Generating the binary file is the last thing you do after verifying everything works functionally.

    You say Altera generates Verilog code, and then I assume that goes through their standard synthesis, place and route tools. I don't see why you can't do a software and "hardware" (i.e. the Verilog code) co-simulation. That's what is normally done during verification. I have C/C++ code that talks to the Verilog code. The C/C++ code is compiled to a binary file and the Verilog code is compiled within an HDL simulator. Then the entire thing is simulated together. Once that checks out, I generate the binary file and load it into the FPGA. I use the same C/C++ code but now with the actual FPGA.
  • John32 - Wednesday, October 9, 2013

    Also, the whole "will it fit into the FPGA" issue is probably going to be a big problem for the likely target audience for this. You have no idea how the OpenCL code is being translated into hardware (i.e. gates, LUTs, flip-flops, etc.). That all depends on your code and Altera's software-to-hardware algorithm.

    This reminds me of Xilinx's System Generator for MATLAB. It's a nice and easy way to get scientists to test their algorithms in hardware to see a ballpark figure of how fast it can be, but it's definitely not the way to go for a final product.
  • John32 - Wednesday, October 9, 2013

    I guess there's also the "will it meet timing" problem. What clock speeds does Altera use? Do they just use whatever clock speed they can achieve (i.e. one design clocks at 400 MHz while another can only go 100 MHz)?
