OpenCL™ – Programming for CPU Performance
This white paper is the third in a series of white papers on OpenCL™ describing how to best utilize the underlying Intel hardware architecture using OpenCL. It covers programming considerations both for host-side device orchestration and for OpenCL kernels running on the CPU.
Disclaimer: This article is based on personal experience as well as on conversations with the OpenCL team at Intel. It provides insights into performance with the current Intel® OpenCL SDK. Intel may support OpenCL on future devices to bring more performance to the platform, but no announcement has been made on specific platforms or release dates. Nevertheless, you can use today's guidelines to scale to the next generation of Intel platforms.
The Intel® OpenCL SDK 1.1 implementation for CPU (Intel® Core™2 Duo or later CPUs) can be retrieved from http://software.intel.com/en-us/articles/opencl-sdk. It is still evolving alongside the OpenCL specification, so feel free to try it and provide feedback to us at the Intel OpenCL SDK Support Forum. At present, the Intel OpenCL SDK 1.1 runs on 64-bit Linux*, and on the Microsoft Windows 7* (with SP1) and Microsoft Windows Vista* operating systems (32-bit and 64-bit).
The inherently heterogeneous nature of OpenCL allows developers to target devices with very different architectures. CPUs are traditionally well suited to large, complex kernels, thanks to their big out-of-order cores and sizable caches.
Performance Considerations for Devices
OpenCL programming requires explicit host-side management of queues, contexts, and devices. To be efficient, the host-side logic therefore needs to incorporate some knowledge of the target architecture in order to utilize any given device in the best way. There are also different strategies for dividing work among multiple devices, which involve coordinating work using events and asynchronous callbacks.
Let us first go over a top-level view of an OpenCL program.
Both the OpenCL kernels and the host program have to ensure that the underlying hardware is utilized efficiently. Figure 1 illustrates that communication with discrete graphics devices goes over a PCI-E link, which is roughly 10x slower than communicating with the CPU through the memory/cache hierarchy. If data is needed back on the host, programmers need to take the cost of these data transfers into account when evaluating the performance of an algorithm.
OpenCL™ – Using Events
This white paper is the fourth in a series of white papers on OpenCL describing how to set up and use events in a multithreaded design. It goes over various design choices using OpenCL™ user events and command-queue events for kernels running on CPUs.
The Intel® OpenCL 1.1 specification Beta implementation for CPU (Intel® Core™2 Duo or later CPUs) can be retrieved from http://software.intel.com/en-us/articles/opencl-sdk. It is still evolving into a mature product, so feel free to try it and provide feedback to us in the Intel® OpenCL SDK Support Forum. At present, Intel OpenCL 1.1 runs only on 64-bit Linux*, and on the Microsoft Windows 7* (with SP1) and Microsoft Windows Vista* operating systems (32-bit and 64-bit).
Intel OpenCL 1.1 events are primarily used to synchronize commands in a context. An event object tracks which of the four states (CL_QUEUED, CL_SUBMITTED, CL_RUNNING, or CL_COMPLETE) a given command in a command queue is in. User events are used to trigger processing when host threads detect that certain conditions are met. Because user events can be signaled whenever needed, and commands in a command queue can wait on them as needed, user events are the best way to orchestrate command execution when commands are submitted to multiple command queues or to an out-of-order command queue.
Non-user events start in the CL_QUEUED state, while user events start in the CL_SUBMITTED state. Since the OpenCL specification does not specify what should happen when commands are terminated (the behavior is implementation-specific), programmers need to use the context-creation callback function to handle command termination errors effectively.
In a single in-order command queue configuration, events are usually used to synchronize host-thread memory management (e.g., managing buffer ownership in CL/GL or CL/D3D10 sharing, or clearing/recycling buffers) with kernel execution. Since the command queue executes all commands in order, there is no need to synchronize commands within the queue itself. The host thread may either call clFinish() (wait for all submitted commands to finish), or use clWaitForEvents()/clGetEventInfo()/clEnqueueBarrier() (which ensure that the relevant previous commands are finished), to synchronize memory management with kernel execution. clFinish() is a heavy-handed, brute-force way to make sure all work is done before proceeding, as it does not return until all submitted work has completed. clWaitForEvents() also blocks the host thread, but only until the commands listed in its event list have finished.
A better way is to register an event callback for CL_COMPLETE (only available in OpenCL 1.1) and set up buffers inside the callback function as needed when the event completes (the callback may be invoked asynchronously, so make sure it is thread-safe). This does not block the host thread, which remains free to work on other tasks.
Fig 1.0 Using Events and Event Callbacks in a single in-order Command Queue
In a single out-of-order command queue configuration, events are used to synchronize host-thread memory management and kernel execution, as well as to enforce the command execution order required by the algorithm within the queue. The host thread may use clFinish() (wait for all submitted commands to finish), or use clWaitForEvents()/clGetEventInfo()/clEnqueueBarrier() (which ensure that the relevant previous commands are finished), to synchronize memory management with kernel execution or to explicitly synchronize various sets of commands.
clFinish() blocks the host thread and prevents it from issuing further commands that could otherwise execute while earlier ones complete.
clWaitForEvents() is a little better: it also blocks the host thread, but only until the commands listed in its event list have finished. Still, this explicit style of managing commands is not the best way to fully utilize the device.
Instead, developers should submit commands with event wait lists that reflect only the dependencies the algorithm really requires. This gives the command queue much more flexibility in deciding which commands can execute while others are still in the pipeline. Here is a simple example of this approach.