OpenCL

OpenCL API
OpenCL logo
Original author(s) Apple Inc.
Developer(s) Khronos Group
Initial release August 28, 2009 (2009-08-28)
Stable release 2.2-3[1] / May 12, 2017 (2017-05-12)
Written in C/C++
Operating system Android (vendor dependent),[2] FreeBSD,[3] Linux, macOS, Windows
Platform ARMv7, ARMv8,[4] Cell, IA-32, POWER, x86-64
Type Heterogeneous computing API
License OpenCL specification license
Website www.khronos.org/opencl
OpenCL C/C++
Paradigm Imperative (procedural), structured, object-oriented (C++ only)
Family C
Stable release OpenCL C++ 1.0 revision 24[5] / OpenCL C 2.0 revision 33[6] / May 12, 2017 (2017-05-12)
Typing discipline Static, weak, manifest, nominal
Implementation language Implementation specific
Filename extensions .cl
Website www.khronos.org/opencl
Major implementations
AMD, Apple, freeocl, Gallium Compute, IBM, Intel Beignet, Intel SDK, Nvidia, pocl
Influenced by
C99, CUDA, C++14

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. OpenCL specifies programming languages (based on C99 and C++11) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism.

OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. Conformant implementations are available from Altera, AMD, Apple, ARM, Creative, IBM, Imagination, Intel, Nvidia, Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS.[7][8]

Overview

OpenCL views a computing system as consisting of a number of compute devices, which might be central processing units (CPUs) or "accelerators" such as graphics processing units (GPUs), attached to a host processor (a CPU). It defines a C-like language for writing programs. Functions executed on an OpenCL device are called "kernels".[9]:17 A single compute device typically consists of several compute units, which in turn comprise multiple processing elements (PEs). A single kernel execution can run on all or many of the PEs in parallel. How a compute device is subdivided into compute units and PEs is up to the vendor; a compute unit can be thought of as a "core", but the notion of core is hard to define across all the types of devices supported by OpenCL (or even within the category of "CPUs"),[10]:49–50 and the number of compute units may not correspond to the number of cores claimed in vendors' marketing literature (which may actually be counting SIMD lanes).[11]
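
As an illustration (not part of the original article), the following minimal C host program queries the available devices and reports how many compute units the runtime exposes for each; the device count is capped arbitrarily for brevity, and on macOS the header is <OpenCL/opencl.h> instead.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);            // first available platform

    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
    if (num_devices > 8) num_devices = 8;            // arbitrary cap for this sketch

    cl_device_id devices[8];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

    for (cl_uint i = 0; i < num_devices; ++i) {
        char name[256];
        cl_uint compute_units;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        printf("%s: %u compute units\n", name, compute_units);
    }
    return 0;
}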

In addition to its C-like programming language, OpenCL defines an application programming interface (API) that allows programs running on the host to launch kernels on the compute devices and manage device memory, which is (at least conceptually) separate from host memory. Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices.[12] The OpenCL standard defines host APIs for C and C++; third-party APIs exist for other programming languages and platforms such as Python,[13] Java and .NET.[10]:15 An implementation of the OpenCL standard consists of a library that implements the API for C and C++, and an OpenCL C compiler for the compute device(s) targeted.
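
The fragment below is a hedged sketch of this run-time compilation step: it hands an OpenCL C source string to the implementation, builds it, and prints the build log on failure. The variables context and device, the kernel name scale, and the source string are illustrative assumptions, not code from the article; stdio.h is assumed to be included.

// Assumed: cl_context context and cl_device_id device already created.
const char *src =
    "__kernel void scale(__global float *buf, float k) {"
    "    buf[get_global_id(0)] *= k;"
    "}";

cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, NULL);
if (clBuildProgram(program, 1, &device, "", NULL, NULL) != CL_SUCCESS) {
    // On failure, retrieve the device compiler's build log.
    char log[4096];
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          sizeof(log), log, NULL);
    fprintf(stderr, "OpenCL build failed:\n%s\n", log);
}
cl_kernel kernel = clCreateKernel(program, "scale", NULL);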

In order to open the OpenCL programming model to other languages or to protect the kernel source from inspection, the Standard Portable Intermediate Representation (SPIR)[14] can be used as a target-independent way to ship kernels between a front-end compiler and the OpenCL back-end.

More recently, the Khronos Group has ratified SYCL,[15] a higher-level programming model for OpenCL: a single-source domain-specific embedded language (DSEL) based on pure C++14, intended to improve programming productivity.

Memory hierarchy

OpenCL defines a four-level memory hierarchy for the compute device:[12] global memory, shared by all processing elements but with high access latency (__global); read-only memory, which is smaller, has lower latency, and is writable by the host CPU but not by the compute devices (__constant); local memory, shared by a group of processing elements (__local); and per-element private memory (registers; __private).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.
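
The following illustrative kernel (a sketch, not taken from the article) shows the local level of the hierarchy and the explicit synchronization it requires: each work-group stages its input in __local memory, and the barrier makes those writes visible to every work-item in the group before any of them reads another work-item's value.

// Illustrative kernel: reverse the elements within each work-group.
__kernel void reverse_within_group(__global const float *in,
                                   __global float *out,
                                   __local float *scratch)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[gid];             // write into local memory
    barrier(CLK_LOCAL_MEM_FENCE);       // explicit synchronization point
    out[gid] = scratch[lsz - 1 - lid];  // safely read another work-item's value
}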

Devices may or may not share memory with the host CPU.[12] The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.
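
A minimal sketch of such transfers, assuming a cl_context context and cl_command_queue queue already exist (neither the buffer size nor the names come from the article):

float host_data[1024] = { 0 };

// Allocate a device buffer of the same size as the host array.
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(host_data), NULL, NULL);

// Host -> device (blocking write).
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(host_data),
                     host_data, 0, NULL, NULL);

// ... enqueue kernels that use "buf" ...

// Device -> host (blocking read).
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host_data),
                    host_data, 0, NULL, NULL);

clReleaseMemObject(buf);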

OpenCL C language

The programming language that is used to write compute kernels is called OpenCL C and is based on C99,[16] but adapted to fit the device model in OpenCL. Memory buffers reside in specific levels of the memory hierarchy, and pointers are annotated with the region qualifiers __global, __local, __constant, and __private, reflecting this. Instead of a device program having a main function, OpenCL C functions are marked __kernel to signal that they are entry points into the program to be called from the host program. Function pointers, bit fields and variable-length arrays are omitted, and recursion is forbidden.[17] The C standard library is replaced by a custom set of standard functions, geared toward math programming.

OpenCL C is extended to facilitate use of parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups.[17] In particular, besides scalar types such as float and double, which behave similarly to the corresponding types in C, OpenCL provides fixed-length vector types such as float4 (4-vector of single-precision floats); such vector types are available in lengths two, three, four, eight and sixteen for various base types.[16]:§6.1.2 Vectorized operations on these types are intended to map onto SIMD instruction sets, e.g., SSE or VMX, when running OpenCL programs on CPUs.[12] Other specialized types include 2-D and 3-D image types.[16]:10–11
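
As a hedged illustration (not from the article), a kernel using the float4 vector type might look as follows; the element-wise addition on float4 values is what an implementation may lower to SIMD instructions on a CPU:

// Illustrative kernel: add two arrays four floats at a time.
__kernel void vec_add4(__global const float4 *a,
                       __global const float4 *b,
                       __global float4 *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];   // component-wise addition of two float4 values
}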

Example: matrix-vector multiplication

Each invocation (work-item) of the kernel takes a row of the matrix (A in the code), multiplies this row with the vector (x) and places the result in an entry of the output vector (y). The number of columns n is passed to the kernel as ncols; the number of rows is implicit in the number of work-items produced by the host program.

The following is a matrix-vector multiplication algorithm in OpenCL C.

// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);              // Global id, used as the row index.
    __global float const *a = &A[i*ncols];    // Pointer to the i'th row.
    float sum = 0.f;                          // Accumulator for dot product.
    for (size_t j = 0; j < ncols; j++) {
        sum += a[j] * x[j];
    }
    y[i] = sum;
}

The kernel function matvec computes, in each invocation, the dot product of a single row of a matrix A and a vector x:

$y_i = \sum_{j=0}^{n-1} A_{ij} x_j.$

To extend this into a full matrix-vector multiplication, the OpenCL runtime maps the kernel over the rows of the matrix. On the host side, the clEnqueueNDRangeKernel function does this; it takes the command queue, the kernel to execute (whose arguments have been set beforehand with clSetKernelArg), and a number of work-items corresponding to the number of rows in the matrix A.
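
A minimal host-side sketch of that launch is shown below; the buffer objects A, x and y, the sizes, and the already-created queue and kernel are assumptions for illustration, not code from the article.

// Assumed: cl_command_queue queue, cl_kernel kernel (built from the matvec
// source above), and cl_mem buffers A, x, y already created and filled.
cl_uint ncols = 1024;
size_t  nrows = 2048;                    // one work-item per matrix row

clSetKernelArg(kernel, 0, sizeof(cl_mem), &A);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &x);
clSetKernelArg(kernel, 2, sizeof(cl_uint), &ncols);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &y);

// Map the kernel over the rows: nrows work-items in one dimension; the
// implementation chooses the work-group size (local size passed as NULL).
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &nrows, NULL, 0, NULL, NULL);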

Example: computing the FFT

This example will load a fast Fourier transform (FFT) implementation and execute it. The implementation is shown below.[18] The code asks the OpenCL library for the first available graphics card, creates memory buffers for reading and writing (from the perspective of the graphics card), JIT-compiles the FFT-kernel and then finally asynchronously runs the kernel. The result from the transform is not read in this example.

#include <stdio.h>
#include <stdlib.h>
#include <CL/opencl.h>

#define NUM_ENTRIES 1024

int main(void)
{
    // Load the OpenCL C source of the kernel from its file, so it can be JIT-compiled below.
    FILE *src_file = fopen("fft1D_1024_kernel_src.cl", "rb");
    fseek(src_file, 0, SEEK_END);
    long src_size = ftell(src_file);
    rewind(src_file);
    char *KernelSource = calloc(src_size + 1, 1);
    fread(KernelSource, 1, src_size, src_file);
    fclose(src_file);

    // Look up the available GPUs.
    cl_uint num = 0;
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 0, NULL, &num);

    cl_device_id devices[1];
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, devices, NULL);

    // Create a compute context with the GPU device.
    cl_context context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // Create a command queue on the first GPU device.
    cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, NULL);

    // Allocate the buffer memory objects (input and output, NUM_ENTRIES complex values each).
    cl_mem memobjs[] = {
        clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * 2 * NUM_ENTRIES, NULL, NULL),
        clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * 2 * NUM_ENTRIES, NULL, NULL)
    };

    // Create the compute program from source.
    cl_program program = clCreateProgramWithSource(context, 1, (const char **)&KernelSource, NULL, NULL);

    // Build the compute program executable.
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the compute kernel.
    cl_kernel kernel = clCreateKernel(program, "fft1D_1024", NULL);

    // Set the argument values: the two buffers and two local-memory scratch areas.
    size_t local_work_size[1] = { 256 };

    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    clSetKernelArg(kernel, 2, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);
    clSetKernelArg(kernel, 3, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);

    // Create an N-D range with the work-item dimensions and execute the kernel.
    size_t global_work_size[1] = { NUM_ENTRIES };

    local_work_size[0] = 64; // Nvidia: 192 or 256
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);

    // The kernel runs asynchronously; its result is not read back in this example.
    return 0;
}

The actual calculation (based on Fitting FFT onto the G80 Architecture):[19]

  // This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
  // calls to a radix 16 function, another radix 16 function and then a radix 4 function

  __kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy) {
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;  out = out + blockIdx;

    globalLoads(data, in, 64); // coalesced global reads
    fftRadix16Pass(data);      // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);               // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);      // radix-4 function number 1
    fftRadix4Pass(data + 4);  // radix-4 function number 2
    fftRadix4Pass(data + 8);  // radix-4 function number 3
    fftRadix4Pass(data + 12); // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
  }

A full, open source implementation of an OpenCL FFT can be found on Apple's website.[20]

History

OpenCL was initially developed by Apple Inc., which holds trademark rights, and refined into an initial proposal in collaboration with technical teams at AMD, IBM, Qualcomm, Intel, and Nvidia. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed[21] with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details of the specification for OpenCL 1.0 by November 18, 2008.[22] This technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008.[23]

OpenCL 1.0

OpenCL 1.0 was released with Mac OS X Snow Leopard on August 28, 2009. According to an Apple press release:[24]

Snow Leopard further extends support for modern hardware with Open Computing Language (OpenCL), which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications. OpenCL is based on the C programming language and has been proposed as an open standard.

AMD decided to support OpenCL instead of the now deprecated Close to Metal in its Stream framework.[25][26] RapidMind announced their adoption of OpenCL underneath their development platform to support GPUs from multiple vendors with one interface.[27] On December 9, 2008, Nvidia announced its intention to add full support for the OpenCL 1.0 specification to its GPU Computing Toolkit.[28] On October 30, 2009, IBM released its first OpenCL implementation as a part of the XL compilers.[29]

OpenCL 1.1

OpenCL 1.1 was ratified by the Khronos Group on June 14, 2010[30] and adds significant functionality for enhanced parallel programming flexibility, functionality, and performance including:

OpenCL 1.2

On November 15, 2011, the Khronos Group announced the OpenCL 1.2 specification,[31] which added significant functionality over the previous versions in terms of performance and features for parallel programming. Most notable features include:

OpenCL 2.0

On November 18, 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification.[33] Updates and additions to OpenCL 2.0 include:

OpenCL 2.1

The ratification and release of the OpenCL 2.1 provisional specification was announced on March 3, 2015 at the Game Developers Conference in San Francisco. It was released on November 16, 2015.[34] It introduced the OpenCL C++ kernel language, based on a subset of C++14, while maintaining support for the preexisting OpenCL C kernel language. Vulkan and OpenCL 2.1 share SPIR-V as an intermediate representation, allowing high-level language front ends to share a common compilation target. Updates to the OpenCL API include:

AMD, ARM, Intel, HPC, and YetiWare have declared support for OpenCL 2.1.[35][36]

OpenCL 2.2

OpenCL 2.2 brings the OpenCL C++ kernel language into the core specification for significantly enhanced parallel programming productivity.[37][38][39] It was released on 16 May 2017.[40]

Future

The International Workshop on OpenCL (IWOCL) is held by the Khronos Group.

When releasing OpenCL version 2.2, the Khronos Group announced that OpenCL would be merging into Vulkan in the future.[41]

Implementations

OpenCL consists of a set of headers and a shared object that is loaded at runtime. An installable client driver (ICD) must be installed on the platform for every vendor that the runtime needs to support. For example, to support Nvidia devices on a Linux platform, the Nvidia ICD must be installed so that the OpenCL runtime (the ICD loader) can locate the vendor's ICD and redirect the calls appropriately. The standard OpenCL header is used by the consumer application; calls to each function are then proxied by the OpenCL runtime to the appropriate driver using the ICD. Each vendor must implement each OpenCL call in their driver.[42]
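
As an illustration of the loader's role (a sketch, not from the cited specification), the host program below enumerates the platforms exposed by the installed ICDs; every subsequent call made with objects belonging to a platform is dispatched by the loader to that vendor's driver. The cap of 16 platforms is an arbitrary assumption.

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint n = 0;
    clGetPlatformIDs(0, NULL, &n);       // typically one platform per installed ICD
    if (n > 16) n = 16;

    cl_platform_id platforms[16];
    clGetPlatformIDs(n, platforms, NULL);

    for (cl_uint i = 0; i < n; ++i) {
        char name[256], vendor[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(vendor), vendor, NULL);
        printf("Platform %u: %s (%s)\n", i, name, vendor);
    }
    return 0;
}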

The Apple,[43] Nvidia,[44] RapidMind[45] and Gallium3D[46] implementations of OpenCL are all based on LLVM compiler technology and use the Clang compiler as their front end.

MESA Gallium Compute 
An implementation of OpenCL (currently an incomplete OpenCL 1.1, mostly complete for AMD Radeon GCN) for a number of platforms is maintained as part of the Gallium Compute Project,[47] which builds on the work of the Mesa project to support multiple platforms. Formerly this was known as CLOVER.[48]
BEIGNET 
An implementation by Intel for its Ivy Bridge and newer hardware, released in 2013.[49] This software, from Intel's China team, has attracted criticism from developers at AMD and Red Hat,[50] as well as from Michael Larabel of Phoronix.[51] The current version, 1.3.1, has complete OpenCL 1.2 support (Ivy Bridge and higher) and optional OpenCL 2.0 support for Skylake and newer.[52][53] Support for Android has also been added to Beignet.[54]
ROCm 
Created as part of AMD's GPUOpen, ROCm (Radeon Open Compute) is an open source Linux project built on OpenCL 1.2 with language support for 2.0. The system is compatible with all modern AMD CPUs and APUs, as well as Intel Gen7.5+ CPUs.[55][56]
POCL 
A CPU-only version building on Clang and LLVM, called pocl, is intended to be a portable OpenCL implementation.[57][58] As of version 0.14, OpenCL 1.2 is nearly fully implemented.[59][60] The next version adds an experimental CUDA backend for running on Nvidia GPUs.[61]
Shamrock 
A port of Mesa Clover to ARM with full support for OpenCL 1.2.[62][63]
FreeOCL 
A CPU-focused implementation of OpenCL 1.2 that uses an external compiler to create a more reliable platform.[64]
triSYCL
A free implementation of the SYCL 1.2.1 and 2.2 C++ layers on top of OpenCL 2.2.[65]
Khronos Conformance Test Suite (CTS)
Since 2017, the CTS has been freely available to all developers for all current OpenCL versions.[66]

Timeline of vendor implementations

Devices

As of 2016, OpenCL runs on graphics processing units, CPUs with SIMD instructions, FPGAs, the Movidius Myriad 2, the Adapteva Epiphany and DSPs.

Conformant products

The Khronos Group maintains an extended list of OpenCL-conformant products.[4]

Synopsis of OpenCL conformant products[4]
AMD APP SDK (supports OpenCL on CPU and accelerated processing unit devices; GPU: TeraScale 1: OpenCL 1.1, TeraScale 2: 1.2, GCN 1: 1.2+, GCN 2+: 2.0+)
Supported CPUs: x86 + SSE2 (or higher) compatible CPUs, 64-bit and 32-bit[106]
Operating systems: Linux 2.6 PC, Windows Vista/7/8.x/10 PC
Devices: AMD Fusion E-350, E-240, C-50, C-30 with HD 6310/HD 6250; AMD Radeon/Mobility HD 6800 and HD 5x00 series GPUs; iGPU HD 6310/HD 6250; HD 7xxx, HD 8xxx, R2xx, R3xx, RX 4xx; AMD FirePro Vx800 series GPUs and later; Radeon Pro

Intel SDK for OpenCL Applications 2013[107] (supports Intel Core processors and Intel HD Graphics 4000/2500; current 2016 R3 release with OpenCL 2.1, Gen7+)
Supported CPUs: Intel CPUs with SSE 4.1, SSE 4.2 or AVX support[108][109]
Operating systems: Microsoft Windows, Linux
Devices: Intel Core i7, i5, i3; 2nd-generation Intel Core i7/5/3; 3rd-generation Intel Core processors with Intel HD Graphics 4000/2500; Intel Core 2 Solo, Duo, Quad, Extreme; Intel Xeon 7x00, 5x00, 3x00 (Core-based)

IBM Servers with OpenCL Development Kit for Linux on Power, running on Power VSX[110][111]
Devices: IBM Power 755 (PERCS), 750; IBM BladeCenter PS70x Express; IBM BladeCenter JS2x, JS43; IBM BladeCenter QS22

IBM OpenCL Common Runtime (OCR)[112]
Supported CPUs: x86 + SSE2 (or higher) compatible CPUs, 64-bit and 32-bit[113]
Operating systems: Linux 2.6 PC
Devices: AMD Fusion, Nvidia Ion and Intel Core i7, i5, i3; 2nd-generation Intel Core i7/5/3; AMD Radeon, Nvidia GeForce and Intel Core 2 Solo, Duo, Quad, Extreme; ATI FirePro, Nvidia Quadro and Intel Xeon 7x00, 5x00, 3x00 (Core-based)

Nvidia OpenCL Driver and Tools[114] (chips: Tesla, Fermi: OpenCL 1.1 (driver 340+); Kepler, Maxwell, Pascal: OpenCL 1.2 (driver 370+); OpenCL 2.0 beta (378.66))
Devices: Nvidia Tesla C/D/S; Nvidia GeForce GTS/GT/GTX; Nvidia Ion; Nvidia Quadro FX/NVX/Plex, Quadro, Quadro K, Quadro M, Quadro P

All standard-conformant implementations can be queried using one of the clinfo tools (there are multiple tools with the same name and similar feature set).[115][116][117]

Version support

Products and their version of OpenCL support include:[118]

OpenCL 2.2 support

None yet: the Khronos test suite is ready, and with a driver update, all hardware with OpenCL 2.0 and 2.1 support could become 2.2-capable.

OpenCL 2.1 support

OpenCL 2.0 support

OpenCL 1.2 support

OpenCL 1.1 support

OpenCL 1.0 support

Portability, performance and alternatives

A key feature of OpenCL is portability, via its abstracted memory and execution model; the programmer cannot directly use hardware-specific technologies such as inline Parallel Thread Execution (PTX) for Nvidia GPUs without giving up portability to other platforms. It is possible to run any OpenCL kernel on any conformant implementation.

However, performance of the kernel is not necessarily portable across platforms. Existing implementations have been shown to be competitive when kernel code is properly tuned, though, and auto-tuning has been suggested as a solution to the performance portability problem,[119] yielding "acceptable levels of performance" in experimental linear algebra kernels.[120] Portability of an entire application containing multiple kernels with differing behaviors was also studied, and shows that portability only required limited tradeoffs.[121]
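
A very small example of what such tuning can look like in practice is sketched below (it is not taken from the cited studies): the same kernel is timed with several work-group sizes via event profiling and the fastest is kept. The variables queue (created with CL_QUEUE_PROFILING_ENABLE), kernel and global (assumed to be a multiple of every candidate size) are illustrative assumptions.

// Assumed: cl_command_queue queue with profiling enabled, cl_kernel kernel,
// and size_t global divisible by each candidate work-group size.
size_t candidates[] = { 32, 64, 128, 256 };
size_t best = 0;
cl_ulong best_ns = (cl_ulong)-1;

for (int i = 0; i < 4; ++i) {
    cl_event ev;
    size_t local = candidates[i];
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    if (end - start < best_ns) { best_ns = end - start; best = local; }
    clReleaseEvent(ev);
}
// "best" now holds the fastest work-group size among the candidates.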

A study at Delft University that compared CUDA programs and their straightforward translation into OpenCL C found CUDA to outperform OpenCL by at most 30% on the Nvidia implementation. The researchers noted that their comparison could be made fairer by applying manual optimizations to the OpenCL programs, in which case there was "no reason for OpenCL to obtain worse performance than CUDA". The performance differences could mostly be attributed to differences in the programming model (especially the memory model) and to NVIDIA's compiler optimizations for CUDA compared to those for OpenCL.[119]

Another study at D-Wave Systems Inc. found that "The OpenCL kernel’s performance is between about 13% and 63% slower, and the end-to-end time is between about 16% and 67% slower" than CUDA's performance.[122]

The fact that OpenCL allows workloads to be shared by CPU and GPU, executing the same programs, means that programmers can exploit both by dividing work among the devices.[123] This leads to the problem of deciding how to partition the work, because the relative speeds of operations differ among the devices. Machine learning has been suggested to solve this problem: Grewe and O'Boyle describe a system of support vector machines trained on compile-time features of the program that can decide the device partitioning problem statically, without actually running the programs to measure their performance.[124]
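
The sketch below shows the simplest static form of such a partition (an illustration, not Grewe and O'Boyle's system): a 1-D range is split between a CPU queue and a GPU queue using a global work offset. The names cpu_queue, gpu_queue and kernel are assumptions, the program is assumed to be built for both devices in one shared context, and the fixed 50/50 ratio merely stands in for the split a trained model would predict.

// Assumed: cl_command_queue cpu_queue and gpu_queue on two devices of the
// same context, and cl_kernel kernel with its arguments already set.
size_t total = 1 << 20;
size_t split = total / 2;              // first half -> CPU, second half -> GPU
size_t rest  = total - split;
size_t zero  = 0;

clEnqueueNDRangeKernel(cpu_queue, kernel, 1, &zero,  &split, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(gpu_queue, kernel, 1, &split, &rest,  NULL, 0, NULL, NULL);

clFinish(cpu_queue);
clFinish(gpu_queue);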

See also

References

  1. "The OpenCL Specification". Khronos Group. 12 May 2017. Retrieved 17 May 2017.
  2. "Android Devices With OpenCL support". Google Docs. ArrayFire. Retrieved April 28, 2015.
  3. "FreeBSD Graphics/OpenCL". FreeBSD. Retrieved 23 December 2015.
  4. "Conformant Products". Khronos Group. Retrieved May 9, 2015.
  5. Sochacki, Bartosz (12 May 2017). "The OpenCL C++ 1.0 Specification" (PDF). Khronos OpenCL Working Group. Retrieved 17 May 2017.
  6. Munshi, Aaftab; Howes, Lee; Sochaki, Barosz (13 April 2016). "The OpenCL C Specification Version: 2.0 Document Revision: 33" (PDF). Khronos OpenCL Working Group. Retrieved 29 April 2016.
  7. "Conformant Companies". Khronos Group. Retrieved April 8, 2015.
  8. Gianelli, Silvia E. (January 14, 2015). "Xilinx SDAccel Development Environment for OpenCL, C, and C++, Achieves Khronos Conformance". PR Newswire. Xilinx. Retrieved April 27, 2015.
  9. Howes, Lee (November 11, 2015). "The OpenCL Specification Version: 2.1 Document Revision: 23" (PDF). Khronos OpenCL Working Group. Retrieved November 16, 2015.
  10. Gaster, Benedict; Howes, Lee; Kaeli, David R.; Mistry, Perhaad; Schaa, Dana (2012). Heterogeneous Computing with OpenCL: Revised OpenCL 1.2 Edition. Morgan Kaufmann.
  11. Tompson, Jonathan; Schlachter, Kristofer (2012). "An Introduction to the OpenCL Programming Model" (PDF). New York University Media Research Lab. Retrieved July 6, 2015.
  12. Stone, John E.; Gohara, David; Shi, Guochin (2010). "OpenCL: a parallel programming standard for heterogeneous computing systems". Computing in Science & Engineering. doi:10.1109/MCSE.2010.69.
  13. Klöckner, Andreas; Pinto, Nicolas; Lee, Yunsup; Catanzaro, Bryan; Ivanov, Paul; Fasih, Ahmed (2012). "PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation". Parallel Computing. 38 (3): 157–174. arXiv:0911.3456. doi:10.1016/j.parco.2011.09.001.
  14. "SPIR - The first open standard intermediate language for parallel compute and graphics". Khronos Group.
  15. "SYCL - C++ Single-source Heterogeneous Programming for OpenCL". Khronos Group.
  16. Aaftab Munshi, ed. (2014). "The OpenCL C Specification, Version 2.0" (PDF). Retrieved June 24, 2014.
  17. "Introduction to OpenCL Programming 201005" (PDF). AMD. pp. 89–90. Archived from the original (PDF) on May 16, 2011. Retrieved August 8, 2017.
  18. "OpenCL" (PDF). SIGGRAPH2008. August 14, 2008. Retrieved August 14, 2008.
  19. "Fitting FFT onto G80 Architecture" (PDF). Vasily Volkov and Brian Kazian, UC Berkeley CS258 project report. May 2008. Retrieved November 14, 2008.
  20. "OpenCL on FFT". Apple. November 16, 2009. Retrieved December 7, 2009.
  21. "Khronos Launches Heterogeneous Computing Initiative" (Press release). Khronos Group. June 16, 2008. Retrieved June 18, 2008.
  22. "OpenCL gets touted in Texas". MacWorld. November 20, 2008. Retrieved June 12, 2009.
  23. "The Khronos Group Releases OpenCL 1.0 Specification" (Press release). Khronos Group. December 8, 2008. Retrieved December 4, 2016.
  24. "Apple Previews Mac OS X Snow Leopard to Developers" (Press release). Apple Inc. June 9, 2008. Retrieved June 9, 2008.
  25. "AMD Drives Adoption of Industry Standards in GPGPU Software Development" (Press release). AMD. August 6, 2008. Retrieved August 14, 2008.
  26. "AMD Backs OpenCL, Microsoft DirectX 11". eWeek. August 6, 2008. Retrieved August 14, 2008.
  27. "HPCWire: RapidMind Embraces Open Source and Standards Projects". HPCWire. November 10, 2008. Archived from the original on December 18, 2008. Retrieved November 11, 2008.
  28. "Nvidia Adds OpenCL To Its Industry Leading GPU Computing Toolkit" (Press release). Nvidia. December 9, 2008. Retrieved December 10, 2008.
  29. "OpenCL Development Kit for Linux on Power". alphaWorks. October 30, 2009. Retrieved October 30, 2009.
  30. "Khronos Drives Momentum of Parallel Computing Standard with Release of OpenCL 1.1 Specification". Retrieved 2016-02-24.
  31. "Khronos Releases OpenCL 1.2 Specification". Khronos Group. November 15, 2011. Retrieved June 23, 2015.
  32. "OpenCL 1.2 Specification" (PDF). Khronos Group. Retrieved June 23, 2015.
  33. "Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing". Khronos Group. November 18, 2013. Retrieved February 10, 2014.
  34. "Khronos Releases OpenCL 2.1 and SPIR-V 1.0 Specifications for Heterogeneous Parallel Programming". Khronos Group. November 16, 2015. Retrieved November 16, 2015.
  35. "Khronos Announces OpenCL 2.1: C++ Comes to OpenCL". AnandTech. March 3, 2015. Retrieved April 8, 2015.
  36. "Khronos Releases OpenCL 2.1 Provisional Specification for Public Review". Kronos Group. March 3, 2015. Retrieved April 8, 2015.
  37. "OpenCL Overview". Unknown parameter |publihser= ignored (|publisher= suggested) (help)
  38. 1 2 "Khronos Releases OpenCL 2.2 Provisional Specification with OpenCL C++ Kernel Language for Parallel Programming". Khronos Group. 18 April 2016.
  39. Trevett, Neil (April 2016). "OpenCL – A State of the Union" (PDF). IWOCL. Vienna: Khronos Group. Retrieved 2017-01-02.
  40. "Khronos Releases OpenCL 2.2 With SPIR-V 1.2". Khronos Group. 16 May 2017.
  41. "Breaking: OpenCL Merging Roadmap into Vulkan".
  42. "OpenCL ICD Specification". Retrieved June 23, 2015.
  43. "Apple entry on LLVM Users page". Retrieved August 29, 2009.
  44. "Nvidia entry on LLVM Users page". Retrieved August 6, 2009.
  45. "Rapidmind entry on LLVM Users page". Retrieved October 1, 2009.
  46. "Zack Rusin's blog post about the Gallium3D OpenCL implementation". Retrieved October 1, 2009.
  47. "GalliumCompute". dri.freedesktop.org. Retrieved June 23, 2015.
  48. "Clover Status Update" (PDF).
  49. Larabel, Michael (January 10, 2013). "Beignet: OpenCL/GPGPU Comes For Ivy Bridge On Linux". Phoronix.
  50. Larabel, Michael (April 16, 2013). "More Criticism Comes Towards Intel's Beignet OpenCL". Phoronix.
  51. Larabel, Michael (December 24, 2013). "Intel's Beignet OpenCL Is Still Slowly Baking". Phoronix.
  52. "Beignet". Unknown parameter |publihser= ignored (|publisher= suggested) (help)
  53. "beignet - Beignet OpenCL Library for Intel Ivy Bridge and newer GPUs".
  54. "Intel Brings Beignet To Android For OpenCL Compute".
  55. "ROCm". GitHub.
  56. "RadeonOpenCompute/ROCm: ROCm - Open Source Platform for HPC and Ultrascale GPU Computing". GitHub.
  57. Jääskeläinen, Pekka; Sánchez de La Lama, Carlos; Schnetter, Erik; Raiskila, Kalle; Takala, Jarmo; Berg, Heikki (2014). "pocl: A Performance-Portable OpenCL Implementation". Int'l J. Parallel Programming. doi:10.1007/s10766-014-0320-y.
  58. "pocl: A Performance-Portable OpenCL Implementation" (PDF).
  59. "April 2017: pocl v0.14 released".
  60. "Issues - pocl/pocl". GitHub.
  61. "April 2017: NVIDIA GPU support via CUDA backend".
  62. "About". Git.Linaro.org.
  63. Gall, T.; Pitney, G. (2014-03-06). "LCA14-412: GPGPU on ARM SoC" (PDF). Amazon Web Services. Retrieved 2017-01-22.
  64. "zuzuf/freeocl". GitHub. Retrieved 2017-04-13.
  65. "triSYCL/triSYCL: An open source implementation of OpenCL SYCL from Khronos Group". GitHub.
  66. "KhronosGroup/OpenCL-CTL: The OpenCL Conformance Tests". GitHub.
  67. "OpenCL Demo, AMD CPU". December 10, 2008. Retrieved March 28, 2009.
  68. "OpenCL Demo, Nvidia GPU". December 10, 2008. Retrieved March 28, 2009.
  69. "Imagination Technologies launches advanced, highly-efficient POWERVR SGX543MP multi-processor graphics IP family". Imagination Technologies. March 19, 2009. Retrieved January 30, 2011.
  70. "AMD and Havok demo OpenCL accelerated physics". PC Perspective. March 26, 2009. Archived from the original on April 5, 2009. Retrieved March 28, 2009.
  71. "Nvidia Releases OpenCL Driver To Developers". Nvidia. April 20, 2009. Retrieved April 27, 2009.
  72. "AMD does reverse GPGPU, announces OpenCL SDK for x86". Ars Technica. August 5, 2009. Retrieved August 6, 2009.
  73. Moren, Dan; Snell, Jason (June 8, 2009). "Live Update: WWDC 2009 Keynote". MacWorld.com. MacWorld. Retrieved June 12, 2009.
  74. "ATI Stream Software Development Kit (SDK) v2.0 Beta Program". Archived from the original on August 9, 2009. Retrieved October 14, 2009.
  75. "S3 Graphics launched the Chrome 5400E embedded graphics processor". Archived from the original on December 2, 2009. Retrieved October 27, 2009.
  76. "VIA Brings Enhanced VN1000 Graphics Processor]". Retrieved December 10, 2009.
  77. "ATI Stream SDK v2.0 with OpenCL 1.0 Support". Retrieved October 23, 2009.
  78. "OpenCL". ZiiLABS. Retrieved June 23, 2015.
  79. "Intel discloses new Sandy Bridge technical details". Retrieved September 13, 2010.
  80. "WebCL related stories". Khronos Group. Retrieved June 23, 2015.
  81. "Khronos Releases Final WebGL 1.0 Specification". Khronos Group. Retrieved June 23, 2015.
  82. "OpenCL Development Kit for Linux on Power".
  83. "About the OpenCL Common Runtime for Linux on x86 Architecture".
  84. "Nokia Research releases WebCL prototype". Khronos Group. May 4, 2011. Retrieved June 23, 2015.
  85. KamathK, Sharath. "Samsung's WebCL Prototype for WebKit". Github.com. Retrieved June 23, 2015.
  86. "AMD Opens the Throttle on APU Performance with Updated OpenCL Software Development ". Amd.com. August 8, 2011. Retrieved June 16, 2013.
  87. "AMD APP SDK v2.6". Forums.amd.com. March 13, 2015. Retrieved June 23, 2015.
  88. "The Portland Group Announces OpenCL Compiler for ST-Ericsson ARM-Based NovaThor SoCs". Retrieved May 4, 2012.
  89. "WebCL Latest Spec". Khronos Group. November 7, 2013. Retrieved June 23, 2015.
  90. "Altera Opens the World of FPGAs to Software Programmers with Broad Availability of SDK and Off-the-Shelf Boards for OpenCL". Altera.com. Retrieved January 9, 2014.
  91. "Altera SDK for OpenCL is First in Industry to Achieve Khronos Conformance for FPGAs". Altera.com. Retrieved January 9, 2014.
  92. "Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing". Khronos Group. November 18, 2013. Retrieved June 23, 2015.
  93. "WebCL 1.0 Press Release". Khronos Group. March 19, 2014. Retrieved June 23, 2015.
  94. "WebCL 1.0 Specification". Khronos Group. March 14, 2014. Retrieved June 23, 2015.
  95. "Intel OpenCL 2.0 Driver".
  96. "AMD OpenCL 2.0 Driver". Support.AMD.com. June 17, 2015. Retrieved June 23, 2015.
  97. "Xilinx SDAccel development environment for OpenCL, C, and C++, achieves Khronos Conformance - khronos.org news". The Khronos Group. Retrieved 2017-06-26.
  98. "Release 349 Graphics Drivers for Windows, Version 350.12" (PDF). April 13, 2015. Retrieved February 4, 2016.
  99. "AMD APP SDK 3.0 Released". Developer.AMD.com. August 26, 2015. Retrieved September 11, 2015.
  100. "Khronos Releases OpenCL 2.1 and SPIR-V 1.0 Specifications for Heterogeneous Parallel Programming". Khronos Group. 16 November 2015.
  101. "What's new? Intel® SDK for OpenCL™ Applications 2016, R3". Intel Software.
  102. "NVIDIA 378.66 drivers for Windows offer OpenCL 2.0 evaluation support". Khronos Group. 17 February 2017.
  103. "NVIDIA enables OpenCL 2.0 beta-support".
  104. "NVIDIA beta-support for OpenCL 2.0 works on Linux too".
  105. "Khronos Releases OpenCL 2.2 With SPIR-V 1.2".
  106. "OpenCL and the AMD APP SDK". AMD Developer Central. developer.amd.com. Archived from the original on August 4, 2011. Retrieved August 11, 2011.
  107. "About Intel OpenCL SDK 1.1". software.intel.com. intel.com. Retrieved August 11, 2011.
  108. "Product Support". Retrieved August 11, 2011.
  109. "Intel OpenCL SDK – Release Notes". Archived from the original on July 17, 2011. Retrieved August 11, 2011.
  110. "Announcing OpenCL Development Kit for Linux on Power v0.3". Retrieved August 11, 2011.
  111. "IBM releases OpenCL Development Kit for Linux on Power v0.3 – OpenCL 1.1 conformant release available". OpenCL Lounge. ibm.com. Retrieved August 11, 2011.
  112. "IBM releases OpenCL Common Runtime for Linux on x86 Architecture". Retrieved September 10, 2011.
  113. "OpenCL and the AMD APP SDK". AMD Developer Central. developer.amd.com. Archived from the original on September 6, 2011. Retrieved September 10, 2011.
  114. "Nvidia Releases OpenCL Driver". Retrieved August 11, 2011.
  115. "clinfo by Simon Leblanc". Retrieved 27 January 2017.
  116. "clinfo by Oblomov". Retrieved 27 January 2017.
  117. "clinfo: openCL INFOrmation". Retrieved 27 January 2017.
  118. "Khronos Products". The Khronos Group. Retrieved 2017-05-15.
  119. Fang, Jianbin; Varbanescu, Ana Lucia; Sips, Henk (2011). A Comprehensive Performance Comparison of CUDA and OpenCL (PDF). Proc. Int'l Conf. on Parallel Processing. doi:10.1109/ICPP.2011.45.
  120. Du, Peng; Weber, Rick; Luszczek, Piotr; Tomov, Stanimire; Peterson, Gregory; Dongarra, Jack (2012). "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming". Parallel Computing. 38 (8): 391–407. doi:10.1016/j.parco.2011.10.002.
  121. Dolbeau, Romain; Bodin, François; de Verdière, Guillaume Colin (September 7, 2013). "One OpenCL to rule them all?". Archived from the original on January 16, 2014. Retrieved January 14, 2014.
  122. Karimi, Kamran; Dickson, Neil G.; Hamze, Firas (2011). "A Performance Comparison of CUDA and OpenCL". arXiv:1005.2581v3.
  123. A Survey of CPU-GPU Heterogeneous Computing Techniques, ACM Computing Surveys, 2015.
  124. Grewe, Dominik; O'Boyle, Michael F. P. (2011). A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL. Proc. Int'l Conf. on Compiler Construction. doi:10.1007/978-3-642-19861-8_16.
  125. "Coriander Project: Compile CUDA Codes To OpenCL, Run Everywhere". Phoronix.
  126. Perkins, Hugh (2017). "cuda-on-cl" (PDF). IWOCL. Retrieved 2017-08-08.
  127. "hughperkins/coriander: Build NVIDIA® CUDA™ code for OpenCL™ 1.2 devices". GitHub.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.