ILGPU


Upgrade: v0.6 to v0.7.X

A new OpenCL backend has been added that supports OpenCL C 2.0 (or higher) compatible GPUs. The OpenCL backend does not require an OpenCL SDK to be installed or configured. All supported OpenCL accelerators can be queried via CLAccelerator.CLAccelerators. Since NVIDIA GPUs typically do not support OpenCL C 2.0 (or higher), they are usually not contained in this list. However, if you still want to access those devices via the OpenCL API, you can query CLAccelerator.AllCLAccelerators. Note that the global list of all accelerators, Accelerator.Accelerators, contains supported accelerators only. It is highly recommended to use the CudaAccelerator class for NVIDIA GPUs and the CLAccelerator class for Intel and AMD GPUs. Furthermore, there is no need to worry about native library dependencies regarding OpenCL (except, of course, for the actual GPU drivers).
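For illustration, a minimal sketch of enumerating and creating OpenCL accelerators (assuming the v0.7 ILGPU.Runtime.OpenCL API surface named above; member names may differ slightly in later versions):

using System;
using ILGPU;
using ILGPU.Runtime.OpenCL;

class Program
{
    static void Main()
    {
        using var context = new Context();

        // Enumerate all OpenCL C 2.0 (or higher) compatible devices
        foreach (var acceleratorId in CLAccelerator.CLAccelerators)
        {
            // Create an accelerator instance for each device
            using var accelerator = new CLAccelerator(context, acceleratorId);
            Console.WriteLine(accelerator.Name);
        }
    }
}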

The XMath class has been removed, as it contained many software implementations for different platforms that are not related to the actual ILGPU compiler. However, several math functions that are supported on all platforms are still exposed via the new IntrinsicMath class. There is also a class IntrinsicMath.CPU, which contains implementations of all math functions for the CPUAccelerator. Please note that these functions are supported on the CPUAccelerator only. If you want to use the full range of math functions, refer to the XMath class of the ILGPU.Algorithms library.
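A short sketch of the distinction (the IntrinsicMath members Abs and Max used here are assumptions based on the cross-platform subset; functions like Sin are not part of IntrinsicMath):

class ...
{
    static void Kernel(Index index, ArrayView<float> view)
    {
        // Cross-platform intrinsics: available on all accelerators
        view[index] = IntrinsicMath.Max(IntrinsicMath.Abs(view[index]), 1.0f);

        // NOT available here: IntrinsicMath has no Sin/Cos/Exp etc.
        // Use ILGPU.Algorithms.XMath for the full set of math functions.
    }
}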

The new version of the ILGPU.Algorithms library offers support for a set of commonly used algorithms (like Scan or Reduce). Moreover, it offers GroupExtensions and WarpExtensions to support group/warp-wide reductions or prefix sums within kernels.
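As a sketch, a group-wide reduction inside a kernel might look as follows (the AddInt32 operation type and the exact GroupExtensions.Reduce signature are assumptions based on the ILGPU.Algorithms samples; consult the repository for the precise API):

using ILGPU;
using ILGPU.Algorithms;
using ILGPU.Algorithms.ScanReduceOperations;

class ...
{
    static void ReductionKernel(GroupedIndex index, ArrayView<int> input, ArrayView<int> output)
    {
        // Compute a group-wide sum of one input element per thread
        int sum = GroupExtensions.Reduce<int, AddInt32>(input[index.ComputeGlobalIndex()]);

        // The first thread of each group writes the result
        if (index.GroupIdx == 0)
            output[index.GridIdx] = sum;
    }
}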

Refer to the updated samples in the GitHub repository for more information.

New Algorithms Library

The new ILGPU.Algorithms library comes in a separate NuGet package. In order to use any of the exposed group/warp/math extensions, you have to enable the library. This sets up all internal ILGPU hooks and custom code generators to emit code that realizes the extensions in the right places. It is achieved via the new extension and intrinsic API.


using ILGPU.Algorithms;
class ...
{
    static void ...(...)
    {
        using var context = new Context();

        // Enable all algorithms and extension methods
        context.EnableAlgorithms();

        ...
    }
}


Math Functions

As mentioned above, the XMath class has been removed from the actual GPU compiler framework. Leverage the IntrinsicMath class for math functions that are available on all supported accelerators. If you want to access the full set of math functions, use the newly designed XMath class of the ILGPU.Algorithms library.


class ...
{
    static void ...(...)
    {
        // Old way (obsolete and no longer supported)
        float x = ILGPU.XMath.Sin(...);

        // New way
        // 1) Don't forget to enable algorithm support ;)
        context.EnableAlgorithms();
        
        // 2) Use the new XMath class
        float x = ILGPU.Algorithms.XMath.Sin(...);
    }
}


Warp & Group Intrinsics

Previous versions of ILGPU had several warp-shuffle overloads that exposed the native warp and group intrinsics to the programmer. However, these functions were typically available for the int and float data types only. More complex or larger types required implementing custom IShuffleOperation interfaces that had to be passed to the shuffle functions. This inconvenient style of programming is no longer required. The new warp and group intrinsics support generic data structures, and ILGPU automatically generates the required code for every target platform and use case.

The intrinsics Group.Broadcast and Warp.Broadcast have been added. In contrast to Warp.Shuffle, the Warp.Broadcast intrinsic requires that all participating threads read from the same lane. Warp.Shuffle supports different source lanes in every thread. Group.Broadcast works like Warp.Broadcast, but for all threads in a group.


class ...
{
    static void ...(...)
    {
        ComplexDataType y = ...;
        ComplexDataType x = Warp.Shuffle(y, threadIdx);

        ...

        ComplexDataType y = ...;
        ComplexDataType x = Group.Broadcast(y, groupIdx);
    }
}


Grid and Group Indices

It is no longer required to access grid and group indices via the GroupedIndex(|2|3) index parameter of a kernel. Instead, you can access the static properties Grid.Index(X|Y|Z) and Group.Index(X|Y|Z) from every function in the scope of a kernel. This simplifies programming of helper methods significantly. Furthermore, this also feels natural to experienced Cuda and OpenCL developers.


class ...
{
    static void ...(GroupedIndex index)
    {
        // Common ILGPU way (still supported)
        int gridIdx = index.GridIdx;
        int groupIdx = index.GroupIdx;

        // New ILGPU way
        int gridIdx = Grid.IndexX;
        int groupIdx = Group.IndexX;
    }
}


Upgrade: v0.3.X to v0.5.X

The ILGPU compiler has been redesigned and no longer relies on LLVM. The Cuda SDK is also no longer required to compile and run ILGPU kernels. Hence, there is no need to worry about native library dependencies (except, of course, for the actual GPU drivers) or environment variables.

ILGPU features a new parallel processing model. It allows parallel code generation and transformation phases to reduce compile time and improve overall performance. However, parallel code generation in the frontend module is disabled by default in the current release (beta version). It can be enabled via the enumeration flag ContextFlags.EnableParallelCodeGenerationInFrontend.

The flags enumeration CompileUnitFlags has been replaced by the enumeration ContextFlags. Unlike previous versions, these flags are passed directly to the ILGPU.Context constructor and affect the entire code-generation process. Note that the CompileUnitFlags.UseGPUMath flag has no counterpart in the new version, since all mathematical operations are automatically mapped to their corresponding XMath counterparts.
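For example, the parallel frontend flag mentioned above is now passed directly to the context (a sketch, assuming a Context(ContextFlags) constructor overload; flags can be combined with | as usual for a flags enumeration):

class ...
{
    static void ...(...)
    {
        // Pass ContextFlags directly to the context; they affect
        // the entire code-generation process
        using (var context = new Context(
            ContextFlags.EnableParallelCodeGenerationInFrontend))
        {
            ...
        }
    }
}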

The GPUMath class has been renamed to XMath in order to reflect its cross platform math capabilities (community request). Note that XMath will be replaced by a new high performance math library in a future release.

The high-level algorithms Warp.Reduce and Warp.AllReduce have been moved to the ILGPU Lightning/Algorithm library.

New GPU debugging and profiling features have been added. Refer to the documentation for more information.

Refer to the updated samples in the GitHub repository for more information.

ArrayViews and VariableViews

The ArrayView and VariableView structures have been adapted to the C# 'ref' features. This renders explicit Load and Store methods obsolete. In addition, all methods that accept VariableView<X> parameter types have been adapted to the parameter types ref X. This applies, for example, to all methods of the class Atomic.


class ...
{
    static void ...(...)
    {
        // Old way (obsolete and no longer supported)
        ArrayView<int> someView = ...
        var variableView = someView.GetVariableView(X);
        Atomic.Add(variableView);
        ...
        variableView.Store(42);

        // New way
        ArrayView<int> someView = ...
        Atomic.Add(ref someView[X]);
        ...
        someView[X] = 42;

        // or
        ref var variable = ref someView[X];
        variable = 42;

        // or
        var variableView = someView.GetVariableView(X);
        variableView.Value = 42;
    }
}


Shared Memory

The general concept of shared memory has been redesigned. The previous model required SharedMemoryAttribute attributes on specific parameters that should be allocated in shared memory. The new model uses the static class SharedMemory to allocate this kind of memory procedurally in the scope of kernels. This simplifies programming and kernel-delegate creation, and enables non-kernel methods to allocate their own pools of shared memory.

Note that array lengths must be constants in this ILGPU version. Hence, dynamic allocation of shared memory is currently not supported.

The kernel loader methods LoadSharedMemoryKernelX and LoadSharedMemoryStreamKernelX have been removed. They are no longer required, since a kernel does not have to declare its shared memory allocations in the form of additional parameters.


class ...
{
    static void SharedMemoryKernel(GroupedIndex index, ...)
    {
        // Allocate an array of 32 integers
        ArrayView<int> sharedMemoryArray = SharedMemory.Allocate<int>(32);

        // Allocate a single variable of type long in shared memory
        ref long sharedMemoryVariable = ref SharedMemory.Allocate<long>();

        ...
    }
}


CPU Debugging

Starting a kernel in debug mode is a common task that developers perform many times a day. Although ILGPU has been optimized for performance, you may not want to wait even a few milliseconds every time you start your program to debug a kernel on the CPU. For this reason, the context flag ContextFlags.SkipCPUCodeGeneration has been added. It suppresses IR code generation for CPU kernels and uses the .NET runtime directly. Warning: this skips general kernel analysis and verification checks. It should only be used by experienced users.
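A sketch of such a debug setup (assuming a Context(ContextFlags) constructor overload and the CPUAccelerator class from ILGPU.Runtime.CPU):

class ...
{
    public static void Main(string[] args)
    {
        // Skip IR code generation for CPU kernels; kernels run directly
        // on the .NET runtime (no analysis/verification checks!)
        using (var context = new Context(ContextFlags.SkipCPUCodeGeneration))
        using (var accelerator = new CPUAccelerator(context))
        {
            ...
        }
    }
}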

Internals

The old LLVM-based concept of CompileUnit objects is obsolete and has been replaced by a completely new IR. The new IR leverages IRContext objects to manage IR objects that are derived from the class ILGPU.IR.Node. Unlike previous versions, an IRContext is not tied to a specific Backend instance and can be reused across different hardware architectures.

The global optimization process can be controlled with the enumeration OptimizationLevel. This level can be specified by passing the desired level to the ILGPU.Context constructor. If the optimization level is not explicitly specified, the level is determined by the current build mode (either Debug or Release).
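For example (a sketch, assuming a Context constructor overload that accepts both flags and an optimization level):

class ...
{
    static void ...(...)
    {
        // Force release-level optimizations, even in a Debug build
        using (var context = new Context(ContextFlags.None, OptimizationLevel.Release))
        {
            ...
        }
    }
}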

Upgrade: v0.1.X to v0.2.X

If you rely on the LightningContext class (of ILGPU.Lightning v0.1.X) for high-level kernel loading or other high-level operations, you will have to adapt your projects to the API changes. The new API does not require a LightningContext instance. Instead, all operations are extension methods on the ILGPU Accelerator class. This simplifies programming and makes the general API more consistent. Furthermore, kernel caching and convenient kernel loading are now included in the ILGPU runtime system and do not require any ILGPU.Lightning operations. Moreover, if you make use of the low-level kernel-loading functionality of the ILGPU runtime system (in order to avoid additional library dependencies on ILGPU.Lightning), you will also benefit from the new API changes.

Note that all functions from v0.1.X will still work to ensure backwards compatibility. However, they will be removed in future versions.

The Obsolete Lightning Context

The LightningContext class is obsolete and will be removed in future versions. It encapsulated an ILGPU Accelerator instance and provided useful kernel caching and loading features. Moreover, all extension functions (like sorting, for example) were based on a LightningContext.

We recommend that you replace all occurrences of a LightningContext with an ILGPU Accelerator. Furthermore, replace the LightningContext creation code with an appropriate accelerator construction from ILGPU. Note that kernel caching and loading are now natively provided by an Accelerator object.


class ...
{
    public static void Main(string[] args)
    {
        // Create the required ILGPU context
        using (var context = new Context())
        {
            // Deprecated code snippets for creating a LightningContext
            var ... = LightningContext.CreateCPUContext(context);
            var ... = LightningContext.CreateCudaContext(context);
            var ... = LightningContext.Create(context, acceleratorId);

            // New version: use default ILGPU accelerators and perform
            // all required operations on an accelerator instance.
            var ... = new CPUAccelerator(context);
            var ... = new CudaAccelerator(context);
            var ... = Accelerator.Create(context, acceleratorId);


            // Old sample for an Initialize command
            var lc = LightningContext.Create(context, ...);
            lc.Initialize(targetView);

            // New version
            var accl = Accelerator.Create(context, acceleratorId);
            accl.Initialize(targetView);
        }
    }
}