Algorithms for ILGPU

A standard algorithms library for ILGPU


Real-world applications typically require a standard library and a set of standard algorithms that "simply work". The ILGPU Algorithms library meets these requirements by offering a set of auxiliary functions and high-level algorithms (e.g. sorting or prefix sum). All algorithms can be run on all supported accelerator types. The CPU accelerator support is especially useful for kernel debugging. And best of all, it's free! The ILGPU Algorithms library is released under the University of Illinois/NCSA Open Source License.

Refer to the samples for more information.

Library architecture

Simplifies Programming

Common operations like initialize, transform or reduce can be expressed with a single line of code.

High-Level Algorithms

ILGPU Algorithms offers a set of essential high-level algorithms for accelerators like sort and scan.

Full CPU Accelerator Support

All functions in the scope of the algorithms library can also be executed in CPU-accelerator mode.

New Algorithms Library (v0.9.0) released on August 20th

A new version of the ILGPU.Algorithms library has been released. The biggest update resolves several critical issues in the RadixSort, Scan, and XMath implementations and their intrinsic functions.

Special thanks to MoFtZ for contributing to this release.

Note that ILGPU.Algorithms v0.9 is fully compatible with ILGPU v0.9.X.

New Algorithms Library (v0.8.0) released on May 15th

A new version of the ILGPU.Algorithms library has been released. The biggest update is the set of newly added math-function implementations contributed by MoFtZ.

Special thanks to MoFtZ for contributing to this release. MoFtZ worked on a wide variety of issues in the OpenCL Backend, the PTX Backend, and the internal IR.

Note that ILGPU.Algorithms v0.8 is fully compatible with ILGPU v0.8.X.

New Algorithms Library (v0.7.0) released on December 1st

The first version of the ILGPU.Algorithms library has been released. It replaces the obsolete ILGPU.Lightning library and supports .Net 4.7, .Net Core 2.0, and .Net Standard 2.1 compatible frameworks (e.g. .Net Core 3.0). Furthermore, it offers a new XMath class that implements a wide variety of math functions for all supported accelerators.

Note that ILGPU.Algorithms v0.7 is fully compatible with ILGPU v0.7.X.

Current Main Features


Radix Sort

Performs different types of radix-sort operations. Ascending and descending sorting of bytes, unsigned shorts, unsigned ints, and unsigned longs is supported.


Scan

Implements different parallel scan functions, such as inclusive and exclusive scan. Supported types are (signed) bytes, (unsigned) shorts, (unsigned) ints, and (unsigned) longs.


Reduce

Realizes generic global reductions that operate on buffers. You can choose between reduction algorithms with or without atomic operations.
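As a sketch of how such a reduction can be invoked (assuming an existing `accelerator` instance, the `Reduce` extension method, and the `AddInt32` operation from `ILGPU.Algorithms.ScanReduceOperations`; exact type and method names may differ between versions):

```csharp
using ILGPU;
using ILGPU.Algorithms;
using ILGPU.Algorithms.ScanReduceOperations;

// Sum all elements of an input buffer into a single-element output buffer.
using (var input = accelerator.Allocate<int>(1024))
using (var output = accelerator.Allocate<int>(1))
{
    // Fill the input with ones via the Initialize extension.
    accelerator.Initialize(accelerator.DefaultStream, input.View, 1);

    // Reduce the buffer using the AddInt32 operation.
    accelerator.Reduce<int, AddInt32>(
        accelerator.DefaultStream,
        input.View,
        output.View);

    accelerator.Synchronize();
    // output should now contain the sum of all elements.
}
```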


Initialize

Initializes a buffer with a given value (Init operation) using a GPU kernel. This is especially useful if you want to initialize a buffer with a certain value without using memory transfers.


Transform

Transforms generic values in a buffer with a given transform function. This is an implementation of a functional map operation.
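As an illustration, a map-style transform might look like the following sketch (assuming an `ITransformer<TSource, TTarget>` interface, a `Transform` extension method, and pre-allocated `source` and `target` buffers; these names are assumptions and may differ between versions):

```csharp
using ILGPU;
using ILGPU.Algorithms;

// A custom transformer that squares each element.
public readonly struct SquareTransformer : ITransformer<int, int>
{
    public int Transform(int value) => value * value;
}

// Apply the transformer to every element: target[i] = source[i] * source[i].
accelerator.Transform(
    accelerator.DefaultStream,
    source.View,
    target.View,
    new SquareTransformer());
```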


Sequence

Creates different value sequences (e.g. 1...n). You can create sequences using the default sequencers (available in ILGPU.Algorithms.Sequencers) or custom sequencers. Moreover, you can create batched, repeated, or batched-repeated sequences.
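For example, filling a buffer with the sequence 0, 1, 2, ... can be sketched as follows (assuming an existing `accelerator`, an allocated `buffer`, and the `Int32Sequencer` from `ILGPU.Algorithms.Sequencers`):

```csharp
using ILGPU;
using ILGPU.Algorithms;
using ILGPU.Algorithms.Sequencers;

// Fill the buffer with 0, 1, 2, ..., length - 1 using the
// default Int32Sequencer.
accelerator.Sequence(
    accelerator.DefaultStream,
    buffer.View,
    new Int32Sequencer());
```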

Math Functions

Provides a wide variety of general-purpose math functions.
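XMath functions are called directly inside kernels, for example (a sketch; `XMath.Sin` and `XMath.Exp` are part of the library, while the kernel itself is illustrative):

```csharp
using ILGPU;
using ILGPU.Algorithms;

// A kernel that uses XMath functions, which map to fast
// accelerator-specific implementations where available.
static void MathKernel(Index1 index, ArrayView<float> data)
{
    data[index] = XMath.Sin(data[index]) + XMath.Exp(-data[index]);
}
```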

Group Extensions

Provides useful functions that operate on the group level - like a group-wide reduction.
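For example, a group-wide sum inside an explicitly grouped kernel might be sketched as follows (assuming `GroupExtensions.Reduce` and the `AddInt32` operation; member names like `Grid.GlobalIndex` may differ between versions):

```csharp
using ILGPU;
using ILGPU.Algorithms;
using ILGPU.Algorithms.ScanReduceOperations;

// Every thread of the group contributes one element; all threads
// receive the group-wide sum from the reduction.
static void GroupReduceKernel(ArrayView<int> input, ArrayView<int> output)
{
    var globalIndex = Grid.GlobalIndex.X;
    int sum = GroupExtensions.Reduce<int, AddInt32>(input[globalIndex]);

    // Let the first thread of each group write the result.
    if (Group.IsFirstThread)
        output[Grid.IdxX] = sum;
}
```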

Warp Extensions

Provides useful functions that operate on the warp level - like a warp-wide prefix sum.
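A warp-wide inclusive prefix sum could be sketched like this (assuming `WarpExtensions.InclusiveScan` and the `AddInt32` operation; these names are assumptions based on the library's conventions):

```csharp
using ILGPU;
using ILGPU.Algorithms;
using ILGPU.Algorithms.ScanReduceOperations;

// Each lane of the warp receives the inclusive prefix sum over
// the values of all lanes up to and including its own.
static void WarpScanKernel(ArrayView<int> data)
{
    var globalIndex = Grid.GlobalIndex.X;
    data[globalIndex] =
        WarpExtensions.InclusiveScan<int, AddInt32>(data[globalIndex]);
}
```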

Using the Algorithms Library

The algorithms library must be enabled in order to register all extension methods and intrinsic operations. The following code snippet shows how to enable it.

Note that the algorithms library must also be enabled in order to use the XMath class.

using ILGPU;
using ILGPU.Algorithms;

using (var context = new Context())
{
    // Enable the algorithms library
    context.EnableAlgorithms();

    // ... create accelerators and launch kernels ...
}

Performance vs. Convenience

All functions can be used either in a convenient way that optimizes coding efficiency or in a high-performance way that optimizes runtime performance. The Algorithms Library exposes most high-level algorithms via extension methods that directly execute the requested operation:

accelerator.Initialize(bufferView, 42);
// or
accelerator.Initialize(stream, bufferView, 42);

This is highly convenient; however, it involves a kernel compilation upon the first invocation of such a method. Moreover, ILGPU's new caching concept may dispose of the created initialization kernel (in this example) during the next GC run if there are no additional references to it. This can cause the native kernel resources to be reloaded, and (depending on the disposed resources) an additional compilation step, during later invocations of the Initialize method.

For this reason, you should not use the convenient extension methods in performance-critical parts of your application. If you need deterministic, high performance, use the high-performance approach described in the next section.

The High Performance Approach

Unlike the previous approach, you can use additional extension methods to instantiate a desired algorithm and retrieve a delegate:

var initializer = accelerator.CreateInitializer<int>();
initializer(stream, bufferView, 42);

This can trigger a first compilation step when creating the initializer in this example. Afterwards, the compiled and loaded initializer is always accessible via the created delegate. Note that ILGPU's new kernel-caching concept disposes of all resources of the initializer as soon as all references fall out of scope.
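Putting this together, a typical pattern is to create the delegate once and keep a strong reference to it for the lifetime of the application (a sketch; the buffer size and value are arbitrary):

```csharp
// Create the initializer once; the strong reference keeps the cached
// kernel alive between invocations.
var initializer = accelerator.CreateInitializer<int>();

using (var buffer = accelerator.Allocate<int>(1024))
{
    // Reuse the same delegate for every call; no recompilation occurs.
    initializer(accelerator.DefaultStream, buffer.View, 42);
    accelerator.Synchronize();
}
```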

The Memory Cache

Some algorithms require additional temporary memory for their operations (e.g. reduce). By default, these algorithms leverage the default temporary buffer hosted by each accelerator instance. However, if you perform different operations in parallel (e.g. with an accelerator stream), it makes sense to use different temporary buffers.

// Create a new sort provider that performs all operations using an
// instantiated memory buffer.
// Note that the provider has to be disposed manually.
using (var radixSortProvider = accelerator.CreateRadixSortProvider())
{
    // Perform sort operations via the provider; they use the provider's
    // dedicated temporary buffer instead of the accelerator's default one.
}
