Algorithms for ILGPU

A standard algorithms library for ILGPU

ILGPU Algorithms

An algorithms library for ILGPU


Real-world applications typically require a standard library and a set of standard algorithms that "simply work". The ILGPU Algorithms library meets these requirements by offering a set of auxiliary functions and high-level algorithms (e.g. sorting or prefix sum). All algorithms can be run on all supported accelerator types. The CPU accelerator support os especially useful for kernel debugging. And best of all, it's free! The ILGPU Algorithms library is released under the University of Illinois/NCSA Open Source License.

Earlier versions required customized LightningContext instances that wrapped Accelerator classes. As of version v0.2, this is no longer necessary because the library is now implemented with the help of extension methods that fit seamlessly into the ILGPU framework. Refer to the samples for more information.

Lib architecture

Simplifies Programming

Common operations like initialize, transform or reduce can be expressed with a single line of code.

High-Level Algorithms

ILGPU Algorithms offers a set of essential high-level algorithms for accelerators like sort and scan.

Full CPU Accelerator Support

All functions in the scope of the algorithms library can also be executed in CPU-accelerator mode.

Upcoming Changes in v0.3: .Net Standard 2.0 support

The algorithms library will feature .Net Standard 2.0 support in the next release. This simplifies integration into many different applications.

Upcoming Changes in v0.4: ILGPU.Algorithms

The current name of the algorithms library is ILGPU.Lightning. However, this name will be changed to ILGPU.Algorithms in one of the next releases. We will also remove the dependency to the native NVIDIA CUB library, which requires a custom recompilation for each target platform. Instead, we will implement ILGPU-based versions of RadixSort and ParallelScan.

Current Main Features

RadixSort

Performs different types of radix-sort operations. Supported are ascending and descending sorting of bytes, unsigned shorts, unsigned ints and unsigned longs.

Scan

Implements different parallel scan functions like inclusive and exclusive scan. Supported types are (signed) bytes, (unsigned) shorts, (unsigned) ints and (unsigned) longs.

Reduce

Realizes generic global reductions that work on buffers. You can select reduction algorithms with or without the use of atomic operations.

Initialize

Initializes a buffer with a given value (Init operation) using a GPU kernel. This is especially useful if you want to initialize buffer with a certain value without the use of memory transfers.

Transform

Transforms generic values in a buffer with a given transfer function. This is an implementation of a functional map operation.

Sequence

Creates different value sequences (e.g. 1...n). You can create sequences using default (available through ILGPU.Lightning.Sequencers) or custom sequencers. Moreover, you can create bachted, repeated or batched-repeated sequences.

GridExtensions

Provides useful functions that operate on the grid level - like the implementation of a grid-stride loop.

GroupExtensions

Provides useful functions that operate on the group level - like a group-wide reduction.

Random Number Generators

The algorithms library contains several RNGs that can be used in the scope of kernels.

Performance vs. Convenience

All functions can be used either very conveniently to optimize coding efficiency or in a high-performance way to optimize runtime performance. The Algorithms Library exposes most high-level algorithms via extension methods that directly execute the requested operation:

accelerator.Initialize(bufferView, 42);
// or
accelerator.Initialize(stream, bufferView, 42);
                    

This is highly convenient, however, involves a kernel compilation upon the first invocation of such a method. Moreover, ILGPU's new caching concept may dispose the created initialization kernel (in this example) during the next GC run, if there are no additional references to the initialization kernel. This can cause a reloading operation of the native kernel resources and an additional compilation step (depending on the disposed resources) during other invocations of the Initialize method.

For this reason, you should not use the convenient extensions methods is performance-critical parts of your application. If you need deterministic and high performance, you must use the The High Performance Approach.

The High Performance Approach

Unlike the previous approach, you can use additional extension methods to instantiate a desired algorithm and retrieve a delegate:

var initializer = accelerator.CreateInitializer<int>();
...
initializer(stream, bufferView, 42);
                    

This can lead to a first compilation step when creating the initializer in this example. Afterwards, the compiled and loaded custom initializer is allways accessible via the created delegate. Note that new kernel caching concept of ILGPU disposes all resources of the initializer as soon as all references fall out of scope.

The Memory Cache

Some algorithms require additional temporary memory for their operations (e.g. reduce). By default, these algorithms leverage the default temporary buffer hosted by each accelerator instance. However, if you perform different operations in parallel (e.g. with an accelerator stream), it makes sense to use different temporary buffers.

// Create a new sort provider that performs all operations using an
// instantiated memory buffer.
// Note that the provider has to be disposed manually.
using (var radixSortProvider = accelerator.CreateRadixSortProvider())
{
radixSortProvider.SortAscending(...);
}
                    

Fork me on GitHub