ILGPU


Getting Started

Create a new C# or VB.Net project based on .Net Framework 4.7 or .Net Core 2.0 and install the required ILGPU NuGet package. We recommend that you disable the "Prefer 32bit" option in the application build-settings panel. This ensures that the application is typically executed in native-OS mode (e.g. 64bit on a 64bit OS). Several functions are available only in native-OS mode since they require direct interaction with the graphics-driver API.

The ILGPU compiler has been redesigned and does not rely on native libraries. Therefore, it is not necessary to worry about such dependencies (except, of course, for the actual GPU drivers) or environment variables.

While GPU programming can be done using only the ILGPU package, we recommend using the ILGPU.Lightning library that realizes useful functions like scan, reduce and sort. Note that any version < 0.5 of the Lightning library is incompatible with the current ILGPU version.

Note that all available samples can be found in the GitHub repository.


// Create new C# or VB.Net project with .Net 4.7 (or higher) or .Net Core 2.0.
nuget install ILGPU

Kernels

Kernels are static functions that can work on value types and can invoke other functions that work on value types. Class (reference) types are currently not supported in the scope of kernels. Note that exception handling, boxing and recursive programs are also not supported and will be rejected by the ILGPU compiler. The type of the first parameter must always be a supported index type. The other parameters are uniform constants that are passed from the CPU to the GPU via constant memory. All parameter types must be value types and must not be passed by reference (e.g. via out or ref keywords in C#).

Since memory buffers are classes that are allocated and disposed on the CPU, they cannot be passed directly to kernels. However, you can pass array views to these buffers by value as kernel arguments (see Array Views).

Note that you must not pass pointers to non-accessible memory regions since these are also passed by value.


class ...
{
    static void Kernel(
        [IndexType] index,
        [Kernel Parameters]...)
    {
        // Kernel code
    }
}

Index Types

Index types implement the (often required) index computations and hide them from the user.

The pre-defined index types:

  • Index

    A simple 1D index of type int.

  • Index2

    A simple 2D index consisting of two ints.

  • Index3

    A simple 3D index consisting of three ints.

  • GroupedIndex

    An index type that differentiates between global grid and group indices in 1D.

  • GroupedIndex2

    An index type that differentiates between global grid and group indices in 2D.

  • GroupedIndex3

    An index type that differentiates between global grid and group indices in 3D.

A Grouped Index

A grouped index consists of the following components:

  • GridIdx

    The grid index in the scope of the dispatched grid.

  • GroupIdx

    The thread index in the scope of the current execution group.

Hint: Use index.ComputeGlobalIndex() to compute a global ungrouped index to access global memory.

using ILGPU;

...
Index i1 = 42;
Index2 i2 = new Index2(1, 2);
Index3 i3 = new Index3(1, 2, 3);

GroupedIndex gi1 = new GroupedIndex(i1, 23);
GroupedIndex2 gi2 = new GroupedIndex2(i2, new Index2(3, 4));
GroupedIndex3 gi3 = new GroupedIndex3(new Index3(4, 5, 6), i3);

i1 = gi1.ComputeGlobalIndex();
i2 = gi2.ComputeGlobalIndex();
i3 = gi3.ComputeGlobalIndex();

var size1 = i1.Size; // i1.X;
var size2 = i2.Size; // i2.X * i2.Y;
var size3 = i3.Size; // i3.X * i3.Y * i3.Z;

Implicitly Grouped Kernels

Implicitly grouped kernels allow very convenient high-level kernel programming. They can be launched with automatically configured group sizes (that are determined by ILGPU) or manually defined group sizes.

Such kernels must not use shared memory, group or warp functionality since there is no guaranteed group size or thread participation inside a warp. The details of the kernel invocation are hidden from the programmer and managed by ILGPU. There is no way to access or manipulate the low-level peculiarities from the user's point of view. Use explicitly grouped kernels for full control over GPU-kernel dispatching.


class ...
{
    static void ImplicitlyGrouped_Kernel(
        [Index|Index2|Index3] index,
        [Kernel Parameters]...)
    {
        // Kernel code
    }
}

Explicitly Grouped Kernels

Explicitly grouped kernels offer the full kernel-programming power and behave similarly to Cuda kernels. These kernels receive grouped index types as their first parameter that reflect the grid and group sizes. Moreover, these kernels offer access to shared memory, Group and other Warp-specific intrinsics. However, the kernel-dispatch dimensions have to be managed manually.


class ...
{
    static void ExplicitlyGrouped_Kernel(
        [GroupedIndex|GroupedIndex2|GroupedIndex3] index,
        [Kernel Parameters]...)
    {
        // Kernel code
    }
}

Simple Kernels

The kernel MyKernel below is a simple kernel that works on float values. Note that this kernel relies on the high-level functionality of implicitly grouped kernels to avoid custom grouping and custom bounds checks (assuming the dispatched kernel dimension is equal to the minimum of the lengths of the array views a, b and c). In contrast to this high-level kernel, MyGroupedKernel realizes the same functionality with the help of explicitly grouped kernels. Note that the bounds check is required in general, since we cannot ensure at this point that the views a, b and c have the required dimensions, i.e. a multiple of the dispatched group size. If we know that these views will always have the right dimensions, we can remove the bounds check (as in MyGroupedKernelAssert).

Note that in Debug mode, every access to an ArrayView is bounds-checked. Hence, the kernel versions MyKernel and MyGroupedKernelAssert automatically rely on assertions in Debug mode. However, an out-of-bounds access in Release mode causes undefined program behavior.


class ...
{
    static void MyKernel(
        Index idx,
        ArrayView<float> a, ArrayView<float> b, ArrayView<float> c,
        float d)
    {
        a[idx] = b[idx] * c[idx] + d;
    }

    static void MyGroupedKernel(
        GroupedIndex idx,
        ArrayView<float> a, ArrayView<float> b, ArrayView<float> c,
        float d)
    {
        var globalIdx = idx.ComputeGlobalIndex();
        if (globalIdx >= a.Length)
            return;
        a[globalIdx] = b[globalIdx] * c[globalIdx] + d;
    }

    static void MyGroupedKernelAssert(
        GroupedIndex idx,
        ArrayView<float> a, ArrayView<float> b, ArrayView<float> c,
        float d)
    {
        var globalIdx = idx.ComputeGlobalIndex();
        a[globalIdx] = b[globalIdx] * c[globalIdx] + d;
    }
}

TLDR - Quick Start

Create a new ILGPU Context instance that initializes ILGPU. Create Accelerator instances that target specific hardware devices. Compile and load the desired kernels and launch them with allocated chunks of memory. Retrieve the data and you're done :)

Refer to the related ILGPU sample for additional insights.


class ...
{
    static void MyKernel(
        Index index, // The global thread index (1D in this case)
        ArrayView<int> dataView, // A view to a chunk of memory (1D in this case)
        int constant) // A sample uniform constant
    {
        dataView[index] = index + constant;
    }

    public static void Main(string[] args)
    {
        // Create the required ILGPU context
        using (var context = new Context())
        {
            using (var accelerator = new CPUAccelerator(context))
            {
                // accelerator.LoadAutoGroupedStreamKernel creates a typed launcher
                // that implicitly uses the default accelerator stream.
                // In order to create a launcher that receives a custom accelerator stream
                // use: accelerator.LoadAutoGroupedKernel<Index, ArrayView<int>, int>(...)
                var myKernel = accelerator.LoadAutoGroupedStreamKernel<Index, ArrayView<int>, int>(MyKernel);

                // Allocate some memory
                using (var buffer = accelerator.Allocate<int>(1024))
                {
                    // Launch buffer.Length many threads and pass a view to buffer
                    myKernel(buffer.Length, buffer.View, 42);

                    // Wait for the kernel to finish...
                    accelerator.Synchronize();

                    // Resolve data
                    var data = buffer.GetAsArray();
                    // ...
                }
            }
        }
    }
}


ILGPU Context

All ILGPU classes and functions rely on the global ILGPU Context. Instances of classes that require a context reference have to be disposed before disposing of the main context. Note that operations on a context and its children are generally not thread-safe.

The ILGPU.Lightning library provides many useful functions to simplify GPU programming. However, they also require a valid ILGPU Context to work.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            // ILGPU functionality
            // Dispose all other classes before disposing the ILGPU context
        }
    }
}

Math Functions

The default math functions in .Net are realized with static methods of the Math class. However, many operations work on doubles by default (like Math.Sin) and there is often no float overload. This causes many floating-point operations to be performed on 64bit floats, even when this precision is not required. ILGPU offers the XMath class that includes 32bit-float overloads for all math functions. Invoking these functions ensures that the operations are performed on 32bit floats on the GPU hardware.

Fast-math can be enabled using the ContextFlags.FastMath flag and enables the use of fast (and imprecise) math functions. Unlike previous versions, the fast-math mode applies to all math instructions, even to default math operations like x / y.

Your kernels might rely on third-party functions that are not under your control. These functions typically depend on the default .Net Math class, and thus, work on 64bit floating-point operations. You can force the use of 32bit floating-point operations in all cases using the ContextFlags.Force32BitMath flag. Caution: all doubles will be considered floats to circumvent issues with third-party code. However, this also affects the address computations of array-view elements. Avoid the use of this flag unless you know exactly what you are doing.
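
The following sketch illustrates both points. XMath and the flags are described above; the Context constructor overload accepting ContextFlags is an assumption:


class ...
{
    static void FloatMathKernel(Index index, ArrayView<float> view)
    {
        // Math.Sin works on doubles and forces a 64bit operation
        view[index] = (float)Math.Sin(view[index]);

        // XMath.Sin offers a 32bit-float overload that keeps the
        // computation on 32bit floats on the GPU
        view[index] = XMath.Sin(view[index]);
    }

    static void Main(string[] args)
    {
        // Assumed constructor overload: enables fast (and imprecise)
        // math functions for all math instructions in this context
        using (var context = new Context(ContextFlags.FastMath))
        {
            // ...
        }
    }
}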



Accelerators

Accelerators represent hardware or software GPU devices. They store information about different devices and allow memory allocation and kernel loading on a particular device. A kernel launch on an accelerator is performed asynchronously by default. Synchronization with the accelerator or the associated stream is required in order to wait for completion and to fetch results.

Note that instances of classes that depend on an accelerator reference have to be disposed before disposing of the associated accelerator object. However, this does not apply to automatically managed kernels, which are cached inside the accelerator object.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (var cpuAccelerator = new CPUAccelerator(context))
            { }

            using (var cudaAccelerator = new CudaAccelerator(context))
            { }

            foreach (var acceleratorId in Accelerator.Accelerators)
            {
                using (var accl = Accelerator.Create(context, acceleratorId))
                {
                    // Perform operations
                }
            }
        }
    }
}

Memory Buffers

MemoryBuffers represent allocated memory regions (allocated arrays) of a given value type on specific accelerators. Data can be copied to and from any accelerator using sync or async copy operations (see Streams). ILGPU supports linear, 2D and 3D buffers out of the box; nD buffers can also be allocated and managed using custom index types.

Note that MemoryBuffers have to be disposed manually and cannot be passed to kernels; only views to memory regions can be passed to kernels.


class ...
{
    public static void MyKernel(Index index, ...)
    {
        // ...
    }

    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (var accelerator = ... )
            {
                using (var buffer = accelerator.Allocate<int>(1024))
                {
                    ...
                }
            }
        }
    }
}
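
The following sketch outlines the copy functionality mentioned above. GetAsArray is used in the samples in this document; the exact CopyFrom overload (source array, source offset, target offset, extent) is an assumption:


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        using (var accelerator = new CPUAccelerator(context))
        using (var buffer = accelerator.Allocate<int>(1024))
        {
            var input = new int[1024];

            // Copy input data from the CPU to the buffer
            // (assumed overload: source, sourceOffset, targetOffset, extent)
            buffer.CopyFrom(input, 0, 0, input.Length);

            // Copy the results back into a new CPU array
            var output = buffer.GetAsArray();
        }
    }
}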

Array Views

ArrayViews realize views to specific memory-buffer regions. Views comprise pointers and length information. They can be passed to kernels and simplify index computations.

Similar to memory buffers, there are specialized views for 1D, 2D and 3D scenarios. However, it is also possible to use the generic structure ArrayView<Type, IndexType> to create views to nD-regions.

Accesses on ArrayViews are bounds-checked via Debug assertions. Hence, these checks are not performed in Release mode, which benefits performance.


class ...
{
    static void MyKernel(Index index, ArrayView<int> view1, ArrayView<float> view2)
    {
        ConvertToFloatSample(
            view1.GetSubView(0, view1.Length / 2),
            view2.GetSubView(0, view2.Length / 2));
    }

    static void ConvertToFloatSample(ArrayView<int> source, ArrayView<float> target)
    {
        for (Index i = 0, e = source.Extent; i < e; ++i)
            target[i] = source[i];
    }

    static void Main(string[] args)
    {
        ...
        using (var buffer = accelerator.Allocate<...>(...))
        {
            var mainView = buffer.View;
            var subView = mainView.GetSubView(0, 1024);
        }
    }
}

Variable Views

A VariableView is a specialized array view that points to exactly one element. VariableViews are useful since default C# 'ref' variables cannot be stored in structures, for instance.


class ...
{
    struct DataView
    {
        public VariableView<int> Variable;
    }

    static void MyKernel(Index index, DataView view)
    {
        // ...
    }

    static void Main(string[] args)
    {
        using (var buffer = accelerator.Allocate<...>(...))
        {
            var mainView = buffer.View;
            var firstElementView = mainView.GetVariableView(0);
        }
    }
}

Accelerator Streams

AcceleratorStreams represent asynchronous operation queues to which operations can be submitted. Custom accelerator streams have to be synchronized manually. Using multiple streams increases the parallelism of applications. Every accelerator encapsulates a default accelerator stream that is used for all operations by default.


class ...
{
    static void Main(string[] args)
    {
        ...

        var defaultStream = accelerator.DefaultStream;
        using (var secondStream = accelerator.CreateStream())
        {

            // Perform actions using default stream...

            // Perform actions on second stream...

            // Wait for results from the first stream.
            defaultStream.Synchronize();

            // Use results async compared to operations on the second stream...

            // Wait for results from the second stream
            secondStream.Synchronize();

            ...
        }
    }
}

Optimizations and Compile Time

ILGPU features a new parallel processing model. It allows parallel code generation and transformation phases to reduce compile time and improve overall performance. However, parallel code generation in the frontend module is disabled by default in the current release (beta version). It can be enabled via the enumeration flag ContextFlags.EnableParallelCodeGenerationInFrontend.

The global optimization process can be controlled with the enumeration OptimizationLevel. This level can be specified by passing the desired level to the ILGPU.Context constructor. If the optimization level is not explicitly specified, the level is determined by the current build mode (either Debug or Release).
The OptimizationLevel.Release level uses additional transformations that increase compile time but yield potentially better GPU code. For best performance, we recommend using this level in Release builds only.
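
A minimal sketch of both settings, assuming a Context constructor overload that accepts ContextFlags and an OptimizationLevel:


class ...
{
    static void Main(string[] args)
    {
        // Enable parallel code generation in the frontend and explicitly
        // request Release-level optimizations, independent of the build mode
        using (var context = new Context(
            ContextFlags.EnableParallelCodeGenerationInFrontend,
            OptimizationLevel.Release))
        {
            // ...
        }
    }
}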

Internal Caches

ILGPU uses a set of internal caches to speed up the compilation process. The KernelCache is based on WeakReferences and its own GC thread to avoid memory leaks. As soon as a kernel is garbage collected by the .Net GC, the ILGPU GC thread can release the associated data structures. Although each Accelerator instance is assigned a MemoryBufferCache instance, ILGPU does not use this cache anywhere. It was added to help users write custom accelerator extensions that require temporary memory. If you do not use the corresponding MemoryBufferCaches, you should not get into trouble regarding caching.

However, these caches could not be controlled, disabled or cleared in the past. The current version provides different flags to control the automatic caching and memory-management behavior. All kernel caches can be disabled using the enumeration flag ContextFlags.DisableKernelCaching. The flag ContextFlags.DisableAutomaticKernelDisposal disables automatic disposal of all kernels. The flag ContextFlags.DisableAutomaticBufferDisposal disables automatic disposal of all buffers. The flag ContextFlags.DisableAcceleratorGC disables the whole ILGPU GC thread of every accelerator. Note that this flag automatically disables kernel caching as well as automatic disposal of memory buffers and kernels.

Use Context.ClearCache(ClearCacheMode.Everything) to clear all internal caches to recover allocated memory. In addition, each accelerator has its own cache for type information and kernel arguments. Use Accelerator.ClearCache(ClearCacheMode.Everything) to clear the cache on the desired accelerator. Please note that clearing the caches is not thread-safe in general.
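
A short sketch of the cache-control functionality described above; the Context constructor overload accepting ContextFlags is an assumption:


class ...
{
    static void Main(string[] args)
    {
        // Disable all kernel caches for this context
        using (var context = new Context(ContextFlags.DisableKernelCaching))
        using (var accelerator = new CPUAccelerator(context))
        {
            // ...

            // Recover memory held by internal caches
            // (not thread-safe in general)
            accelerator.ClearCache(ClearCacheMode.Everything);
            context.ClearCache(ClearCacheMode.Everything);
        }
    }
}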

Debugging and Profiling

The best debugging experience can be achieved with the CPUAccelerator. Debugging with the software emulation layer is very convenient due to the very good properties of the .Net debugging environments. Currently, detailed kernel debugging is only possible with the CPU accelerator. The debugging capabilities will be extended in future versions.

Although ILGPU has been optimized for performance, you may not want to wait a few milliseconds every time you start your program to debug a kernel on the CPU. For this reason, the context flag ContextFlags.SkipCPUCodeGeneration has been added. It suppresses IR code generation for CPU kernels and uses the .Net runtime directly. Warning: This avoids general kernel analysis/verification checks. It should only be used by experienced users.

Assertions on GPU hardware devices can be enabled using the ContextFlags.EnableAssertions flag (disabled by default). Note that enabling assertions using this flag will cause them to be enabled in Release builds as well. Be sure to disable this flag if you want to get the best runtime performance.

Source-line-based debugging information can be turned on via the flag ContextFlags.EnableDebugInformation (disabled by default). Note that only the new portable PDB format is supported. Enabling debug information is essential to identify problems and to hit breakpoints on GPU hardware. It is also very useful for kernel profiling, as you can link the profiling insights to your source lines. You may want to disable inlining via ContextFlags.NoInlining to significantly increase the accuracy of your debugging information, at the expense of runtime performance. Note that the inspection of variables, registers, and global memory is currently not supported.
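
A combined sketch of these debugging-related flags, again assuming a Context constructor overload that accepts ContextFlags (individual flags are combined via bitwise or):


class ...
{
    static void Main(string[] args)
    {
        // Enable assertions and source-line debug information on GPU hardware;
        // disable inlining for more accurate debug information
        using (var context = new Context(
            ContextFlags.EnableAssertions |
            ContextFlags.EnableDebugInformation |
            ContextFlags.NoInlining))
        {
            // ...
        }
    }
}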

Loading & Launching Kernels

Kernels have to be loaded by an accelerator before they can be executed. See the ILGPU kernel sample for details. There are two possibilities in general: the high-level (described here) and the low-level loading API. We strongly recommend the high-level API, which simplifies programming, is less error-prone and features automatic kernel caching and disposal.

An accelerator object offers different functions to load and configure kernels:

  • LoadAutoGroupedStreamKernel

    Loads an implicitly grouped kernel with an automatically determined group size (uses the default accelerator stream)

  • LoadAutoGroupedKernel

    Loads an implicitly grouped kernel with an automatically determined group size (requires an accelerator stream)

  • LoadImplicitlyGroupedStreamKernel

    Loads an implicitly grouped kernel with a custom group size (uses the default accelerator stream)

  • LoadImplicitlyGroupedKernel

    Loads an implicitly grouped kernel with a custom group size (requires an accelerator stream)

  • LoadStreamKernel

    Loads explicitly and implicitly grouped kernels. However, implicitly grouped kernels will be launched with a group size that is equal to the warp size (uses the default accelerator stream)

  • LoadKernel

    Loads explicitly and implicitly grouped kernels. However, implicitly grouped kernels will be launched with a group size that is equal to the warp size (requires an accelerator stream)

Functions following the naming pattern LoadXXXStreamKernel use the default accelerator stream for all operations. If you want to specify the associated accelerator stream, you will have to use the LoadXXXKernel functions.

Every function returns a typed delegate (a kernel launcher) that can be called in order to invoke the actual kernel execution. These launchers are specialized methods that are dynamically generated and specialized for every kernel. They avoid boxing and realize high-performance kernel dispatching. In contrast to older versions of ILGPU, all kernels loaded with these functions will be managed by their associated accelerator instances.

Note that a kernel-loading operation will trigger a kernel compilation in the case of an uncached kernel. The compilation step will happen in the background and is transparent for the user. However, if you require custom control over the low-level kernel-compilation process, refer to Advanced Low-Level Functionality.


class ...
{
    static void MyKernel(Index index, ArrayView<int> data, int c)
    {
        data[index] = index + c;
    }

    static void Main(string[] args)
    {
        ...
        var buffer = accelerator.Allocate<int>(1024);

        // Load a sample kernel MyKernel using one of the available overloads
        var kernelWithDefaultStream = accelerator.LoadAutoGroupedStreamKernel<
                     Index, ArrayView<int>, int>(MyKernel);
        kernelWithDefaultStream(buffer.Extent, buffer.View, 1);

        // Load a sample kernel MyKernel using one of the available overloads
        var kernelWithStream = accelerator.LoadAutoGroupedKernel<
                     Index, ArrayView<int>, int>(MyKernel);
        kernelWithStream(someStream, buffer.Extent, buffer.View, 1);

        ...
    }
}

Backends

A Backend represents code-generation functionality for a specific target device. It can be used to manually compile kernels for a specific platform.

Note that you do not have to create custom backend instances on your own when using the ILGPU runtime. Accelerators already carry associated and configured backends that are used for high-level kernel loading.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            // Creates a user-defined MSIL backend for .Net code generation
            using (var cpuBackend = new DefaultILBackend(context))
            {
                // Use custom backend
            }

            // Creates a user-defined backend for NVIDIA GPUs using compute capability 5.0
            using (var ptxBackend = new PTXBackend(
                context,
                PTXArchitecture.SM_50,
                TargetPlatform.X64))
            {
                // Use custom backend
            }
        }
    }
}

IRContext

An IRContext manages and caches intermediate-representation (IR) code, which can be reused during the compilation process. It can be created using a general ILGPU Context instance. An IRContext is not tied to a specific Backend instance and can be reused across different hardware architectures.

Note that the main ILGPU Context already has an associated IRContext that is used for all high-level kernel-loading functions. Consequently, users are not required to manage their own contexts in general.


class ...
{
    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (var irContext = new IRContext(context))
            {
                // ...
            }
        }
    }
}

Compiling Kernels

Kernels can be compiled manually by requesting a code-generation operation from the backend, yielding a CompiledKernel object. The resulting kernel object can be loaded by an Accelerator instance from the runtime system. Alternatively, you can cast a CompiledKernel object to its backend-specific counterpart in order to access the generated target-specific assembly code.

Note that the default MSIL backend does not provide additional insights, since the MSILBackend does not require custom assembly code.

We recommend that you use the high-level kernel-loading concepts of ILGPU instead of the low-level interface.


class ...
{
    public static void MyKernel(Index index, ...)
    {
        // ...
    }

    static void Main(string[] args)
    {
        using (var context = new Context())
        {
            using (Backend b = new PTXBackend(context, ...))
            {
                // Compile kernel using no specific KernelSpecialization settings
                var compiledKernel = b.Compile(
                    typeof(...).GetMethod(nameof(MyKernel), BindingFlags.Public | BindingFlags.Static),
                    default);

                // Cast kernel to backend-specific PTXCompiledKernel to access the PTX assembly
                var ptxKernel = compiledKernel as PTXCompiledKernel;
                System.IO.File.WriteAllBytes("MyKernel.ptx", ptxKernel.PTXAssembly);
            }
        }
    }
}

Loading Compiled Kernels

Compiled kernels have to be loaded by an accelerator before they can be executed. See the ILGPU low-level kernel sample for details. Caution: manually loaded kernels have to be disposed before the associated accelerator object is disposed.

An accelerator object offers different functions to load and configure kernels:

  • LoadAutoGroupedKernel

    Loads an implicitly grouped kernel with an automatically determined group size

  • LoadImplicitlyGroupedKernel

    Loads an implicitly grouped kernel with a custom group size

  • LoadKernel

    Loads explicitly and implicitly grouped kernels. However, implicitly grouped kernels will be launched with a group size that is equal to the warp size


class ...
{
    static void Main(string[] args)
    {
        ...
        var compiledKernel = backend.Compile(...);

        // Load implicitly grouped kernel with an automatically determined group size
        var k1 = accelerator.LoadAutoGroupedKernel(compiledKernel);

        // Load implicitly grouped kernel with custom group size
        var k2 = accelerator.LoadImplicitlyGroupedKernel(compiledKernel);

        // Load any kernel (explicitly and implicitly grouped kernels).
        // However, implicitly grouped kernels will be dispatched with a group size
        // that is equal to the warp size of its associated accelerator
        var k3 = accelerator.LoadKernel(compiledKernel);

        ...

        k1.Dispose();
        k2.Dispose();
        k3.Dispose();
    }
}

Direct Kernel Launching

A loaded kernel can be dispatched using the Launch method. However, the dispatch method takes an object array as its argument: all arguments are boxed upon invocation and there is no type safety at this point. For performance reasons, we strongly recommend the use of typed kernel launchers that avoid boxing.


class ...
{
    static void MyKernel(Index index, ArrayView<int> data, int c)
    {
        data[index] = index + c;
    }

    static void Main(string[] args)
    {
        ...
        var buffer = accelerator.Allocate<int>(1024);

        // Load a sample kernel MyKernel
        var compiledKernel = backend.Compile(...);
        using (var k = accelerator.LoadAutoGroupedKernel(compiledKernel))
        {
            k.Launch(buffer.Extent, buffer.View, 1);

            ...

            accelerator.Synchronize();
        }

        ...
    }
}

Typed Kernel Launchers

Kernel launchers are delegates that provide an alternative to direct kernel invocations. These launchers are specialized methods that are dynamically generated and specialized for every kernel. They avoid boxing and realize high-performance kernel dispatching. You can create a custom kernel launcher using the CreateLauncherDelegate method, which creates a specialized launcher for the associated kernel. Besides all required kernel parameters, the launcher also receives a parameter of type AcceleratorStream as its first argument.

Note that the high-level kernel-loading functionality simply returns such a launcher delegate instead of a kernel object. These loading methods work similarly to the versions shown here; e.g. LoadAutoGroupedStreamKernel loads a kernel with a custom delegate type that is linked to the default accelerator stream.


class ...
{
    static void MyKernel(Index index, ArrayView<int> data, int c)
    {
        data[index] = index + c;
    }

    static void Main(string[] args)
    {
        ...
        var buffer = accelerator.Allocate<int>(1024);

        // Load a sample kernel MyKernel
        var compiledKernel = backend.Compile(...);
        using (var k = accelerator.LoadAutoGroupedKernel(compiledKernel))
        {
            var launcherWithCustomAcceleratorStream = k.CreateLauncherDelegate<AcceleratorStream, Index, ArrayView<int>, int>();
            launcherWithCustomAcceleratorStream(someStream, buffer.Extent, buffer.View, 1);

            ...
        }

        ...
    }
}