Releases: m4rs-mt/ILGPU

Release v0.10.1

06 Apr 11:33
v0.10.1
3b9707a

The new stable version contains several bug fixes and improves the code quality of the generated kernel programs (get the NuGet package).

It is strongly recommended to upgrade to this version as soon as possible to avoid known bugs and some CPU-buffer deallocation issues.

Changes

  • Added CopySign intrinsic (#438).
  • Added intrinsic mappings for BitConverter functions (#437).
  • Added call stack recording during compilation for error reporting (#436).
  • Gracefully fail when loading symbols from in-memory assemblies (#435).
  • Fixed invalid detection of loop bodies (#452).
  • Fixed incorrect assertion on repeating successors (#447).
  • Fixed emitting switch statement with constant condition (#442).
  • Fixed invalid disposal of CPU buffers (#440).
  • Fixed applications blocking during tear-down by changing Accelerator GC thread to run in the background (#439).
  • Fixed bounds check on large views (#433).
  • Fixed retrieving field from structure types (#426).
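As a rough illustration of what the CopySign and BitConverter entries above refer to, the following Python sketch mirrors the semantics of the corresponding .NET functions. It is not ILGPU code: the helper names are hypothetical, and which BitConverter functions #437 actually covers is an assumption here.

```python
import math
import struct


def copy_sign(magnitude: float, sign: float) -> float:
    """Return `magnitude` with the sign of `sign` (semantics of a sign copy)."""
    return math.copysign(magnitude, sign)


def single_to_int32_bits(value: float) -> int:
    """Reinterpret the 32 bits of a single-precision float as a signed int."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


def int32_bits_to_single(bits: int) -> float:
    """Reinterpret a signed 32-bit integer as a single-precision float."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]
```

The bit reinterpretation changes only the type, never the underlying bits, which is why such functions can be mapped to cheap intrinsics on a device.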

Special thanks

Special thanks to @MoFtZ, @marcin-krystianc and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.10.0

14 Feb 22:56
v0.10.0
658ae3d

The new stable version offers significant performance improvements to the generated kernel programs and contains critical resource deallocation fixes (get the NuGet package).

It is strongly recommended to upgrade to this version as soon as possible to avoid resource- and GC-related deallocation issues.

Breaking changes

  • The inheritance hierarchy of the ExchangeBuffer class has been changed to avoid exposing internal memory buffers. If you previously relied on ExchangeBufferBase inheriting directly from MemoryBuffer, you have to adapt your program to use the intermediate base class MemoryBuffer<T, TIndex> instead (see diff).
  • Properties exposing internal memory buffers of the high-level MemoryBufferXD classes have been removed to avoid ownership-related issues caused by the GC freeing buffers (see diff).

Why are there breaking changes?

We have decided to remove dangerous properties from several memory buffer classes. The use of these properties can lead to program crashes, since buffers could be disposed asynchronously in the background by the GC without further notice.

Changes

  • Improved performance of kernel launchers by passing packed argument structures (#358, #372).
  • Graduated different optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
  • Graduated O2 optimizations in the Cuda backend to O1 pipeline to generate vectorized IO operations in release builds (#350).
  • Added support for managed sizeof IL instruction (#380).
  • Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
  • Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (#375). To enable assertion checks, use the flag ContextFlags.EnableAssertions or attach a debugger to your application. Use the portable debug information format for detailed source location information.
  • Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
  • Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
  • Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
  • Added enhanced support for local memory via a new LocalMemory class (#316).
  • Added support for several PopCount, CLZ and CTZ operations (#324).
  • Added new MemSet functions to all memory buffers (#338).
  • Added new IfConditionalConversion pass to the O2 pipeline to fold nested and-also and or-else block chains (#328).
  • Added new local memory optimizations to simplify array accesses (#317).
  • Added simple 64-bit-based LongGlobalIndex helper to simplify correct computations using 64-bit integers (#337).
  • Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
  • Removed support for .NET Core 2.0 (#353).
  • Prevent using SharedMemory in implicitly grouped kernels (#354).
  • Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
  • Fixed critical GC-related resource deallocation issues (#376, #393).
  • Fixed returning correct length of dynamic shared memory buffers (#357).
  • Fixed invalid alignment information in the presence of reinterpret casts (#386).
  • Fixed invalid address computations of fixed array buffers (#361).
  • Fixed invalid PTX calling convention (#362).
  • Fixed edge cases in LoopUnrolling (#373).
  • Fixed invalid printf formats for int64 and uintX types (#391).
  • Fixed invalid DebugArrayView implementations (#345).
  • Fixed invalid initializations of local memory arrays (#287).
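The LongGlobalIndex entry above concerns index computations that no longer fit into 32 bits. The Python sketch below shows the general problem under the assumption that a global index is computed as grid index times group size plus group index. The function names are hypothetical, and this is not how ILGPU implements the helper.

```python
def wrap_int32(value: int) -> int:
    """Emulate two's-complement wrap-around of a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def global_index_32(grid_idx: int, group_dim: int, group_idx: int) -> int:
    """Global index computed with 32-bit arithmetic (may overflow)."""
    return wrap_int32(wrap_int32(grid_idx * group_dim) + group_idx)


def global_index_64(grid_idx: int, group_dim: int, group_idx: int) -> int:
    """Global index computed with wide arithmetic (no overflow here)."""
    return grid_idx * group_dim + group_idx
```

With 3,000,000 groups of 1,024 threads, the 32-bit variant wraps around to a negative index, while the wide variant yields the intended position.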

Major internal changes:

  • Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
  • Updated default optimizations for ILGPU debug builds (#384).
  • Added support for unit tests running on .NET Framework 4.7 (#355).
  • Migrated from FxCop analyzers to .NET analyzers (#352).
  • Redesigned internal address-space inference passes (#364).

Special thanks

Special thanks to @MoFtZ, @Ruberik and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.10.0-beta2

25 Jan 10:29
v0.10.0-beta2
64ff328
Pre-release

This new beta version offers important bug fixes, performance improvements to the generated kernel programs and a set of new features (get the NuGet package).

  • Improved performance of kernel launchers by passing packed argument structures (#358, #372).
  • Added support for managed sizeof IL instruction (#380).
  • Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
  • Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (#375). To enable assertion checks, use the flag ContextFlags.EnableAssertions or attach a debugger to your application. Use the portable debug information format for detailed source location information.
  • Removed support for .NET Core 2.0 (#353).
  • Prevent using SharedMemory in implicitly grouped kernels (#354).
  • Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
  • Fixed critical GC-related resource deallocation issues (#376, #393).
  • Fixed returning correct length of dynamic shared memory buffers (#357).
  • Fixed invalid alignment information in the presence of reinterpret casts (#386).
  • Fixed invalid address computations of fixed array buffers (#361).
  • Fixed invalid PTX calling convention (#362).
  • Fixed edge cases in LoopUnrolling (#373).
  • Fixed invalid printf formats for int64 and uintX types (#391).

Major internal changes:

  • Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
  • Updated default optimizations for ILGPU debug builds (#384).
  • Added support for unit tests running on .NET Framework 4.7 (#355).
  • Migrated from FxCop analyzers to .NET analyzers (#352).
  • Redesigned internal address-space inference passes (#364).

Special thanks to @MoFtZ and @Ruberik for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.10.0-beta1

10 Dec 08:48
v0.10.0-beta1
634881d
Pre-release

This new beta version offers significant performance improvements to the generated kernel programs and a set of new features (get the NuGet package).

  • Graduated different optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
  • Graduated O2 optimizations in the Cuda backend to O1 pipeline to generate vectorized IO operations in release builds (#350).
  • Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
  • Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
  • Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
  • Added enhanced support for local memory via a new LocalMemory class (#316).
  • Added support for several PopCount, CLZ and CTZ operations (#324).
  • Added new MemSet functions to all memory buffers (#338).
  • Added new IfConditionalConversion pass to the O2 pipeline to fold nested and-also and or-else block chains (#328).
  • Added new local memory optimizations to simplify array accesses (#317).
  • Added simple 64-bit-based LongGlobalIndex helper to simplify correct computations using 64-bit integers (#337).
  • Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
  • Fixed invalid DebugArrayView implementations (#345).
  • Fixed invalid initializations of local memory arrays (#287).
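For readers unfamiliar with the PopCount, CLZ and CTZ operations listed above, here is a minimal Python sketch of what they compute for 32-bit values. The function names are hypothetical. Returning the full bit width for a zero input is an assumption of this sketch, and ILGPU's behavior for that case may differ.

```python
def pop_count(x: int) -> int:
    """Number of set bits in the low 32 bits of x."""
    return bin(x & 0xFFFFFFFF).count("1")


def leading_zero_count(x: int) -> int:
    """Number of zero bits above the highest set bit (CLZ), 32 for zero."""
    x &= 0xFFFFFFFF
    return 32 - x.bit_length()


def trailing_zero_count(x: int) -> int:
    """Number of zero bits below the lowest set bit (CTZ), 32 for zero."""
    x &= 0xFFFFFFFF
    if x == 0:
        return 32
    # x & -x isolates the lowest set bit.
    return (x & -x).bit_length() - 1
```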

Special thanks to @MoFtZ and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.9.2

22 Nov 12:28
v0.9.2
48ea882

The new stable version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Added new convenience Launch methods to Accelerator class to launch kernels without pre-loading/compiling them (#319).
  • Changed default inlining behavior to AggressiveInlining to improve performance of (usually) performance-critical GPU programs (#294).
  • Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
  • Added support for RTX 30xx cards (#302, #305, #311).
  • Added support for tuple-types in kernel functions (#266).
  • Added support for Span<T> in the scope of MemoryBuffer copy operations (#122, #276).
  • Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
  • Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
  • Added support for unrolling of loop nests to improve performance (#281).
  • Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
  • Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
  • Improved alignment of padding in fixed-size structures (#315).
  • Fixed invalid Unix OpenCL library names (#327).
  • Fixed calling ambiguous OpenCL 64-bit atomic functions (#321).
  • Fixed invalid unrolling of loops in some cases (#292).
  • Fixed invalid loading of unsigned fields from structures (#314).
  • Fixed invalid handling of FP16 types on unsupported devices (#312).
  • Fixed invalid constant folding of LHS constants in compare operations (#326).
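The loop invariant code motion (LICM) transformation mentioned above can be pictured as follows. This Python sketch only illustrates the general idea on source code, with hypothetical function names. ILGPU applies the transformation to its internal representation, which is not shown here.

```python
def scale_naive(values, a, b):
    """Recomputes the loop-invariant factor in every iteration."""
    out = []
    for v in values:
        factor = a * a + b  # does not depend on the loop variable
        out.append(v * factor)
    return out


def scale_hoisted(values, a, b):
    """Same result after hoisting the invariant computation out of the loop."""
    factor = a * a + b  # computed once, before the loop
    out = []
    for v in values:
        out.append(v * factor)
    return out
```

Both functions return the same list. The second one performs the invariant multiplication and addition only once.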

Major internal changes:

  • Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
  • Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
  • Added additional debugging capabilities via new dumper methods (#282).

Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.9.2-beta1

01 Nov 23:20
v0.9.2-beta1
748af76
Pre-release

This new beta version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Changed default inlining behavior to AggressiveInlining to improve performance of (usually) performance-critical GPU programs (#294).
  • Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
  • Added support for RTX 30xx cards (#302, #305).
  • Added support for tuple-types in kernel functions (#266).
  • Added support for Span<T> in the scope of MemoryBuffer copy operations (#122, #276).
  • Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
  • Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
  • Added support for unrolling of loop nests to improve performance (#281).
  • Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
  • Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
  • Fixed invalid unrolling of loops in some cases (#292).

Major internal changes:

  • Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
  • Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
  • Added additional debugging capabilities via new dumper methods (#282).

Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.9.1

01 Oct 21:22
v0.9.1
ce0809b

The new stable version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Added initial loop unrolling capabilities for innermost loops (#259).
  • Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
  • Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
  • Added support for FP16x2 (Half2) types (#273).
  • Added support for non-capturing lambda kernels (#186).
  • Added additional copy operations to ExchangeBuffer (#255).
  • Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
  • Fixed invalid accelerator synchronization in OpenCL (#246).
  • Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
  • Fixed invalid handling of unsafe array buffers in several cases (#262, #263, #285).
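The loop unrolling of innermost loops listed above can be sketched as follows for a loop with a known trip count of four. This Python example is a conceptual illustration with hypothetical function names. It does not reflect the heuristics or the internal representation ILGPU uses.

```python
def dot4_loop(a, b):
    """Dot product of two 4-element sequences using an explicit loop."""
    total = 0
    for i in range(4):
        total += a[i] * b[i]
    return total


def dot4_unrolled(a, b):
    """The same computation with the loop body replicated four times."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3]
```

Unrolling removes the loop counter and the branch per iteration, at the cost of larger code.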

Major internal changes:

  • Added new enhanced loop-analysis classes to get detailed insights about loops in ILGPU programs (#259).
  • Refactored the internal static-program analysis framework (#247).
  • Updated native DLL-interop API (#249).
  • Fixed code analysis warnings (#248).

Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.9.1-beta1

21 Sep 22:10
v0.9.1-beta1
4f0f587
Pre-release

This new beta version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Added initial loop unrolling capabilities for innermost loops (#259).
  • Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
  • Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
  • Added support for FP16x2 (Half2) types (#273).
  • Added support for non-capturing lambda kernels (#186).
  • Added additional copy operations to ExchangeBuffer (#255).
  • Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
  • Fixed invalid accelerator synchronization in OpenCL (#246).
  • Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
  • Fixed invalid handling of unsafe array buffers in several cases (#262, #263).

Major internal changes:

  • Added new enhanced loop-analysis classes to get detailed insights about loops in ILGPU programs (#259).
  • Refactored the internal static-program analysis framework (#247).
  • Updated native DLL-interop API (#249).
  • Fixed code analysis warnings (#248).

Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Release v0.9.0

03 Jan 03:03
v0.9.0
8235a32

This new stable version offers significant performance and code quality improvements to the generated kernel programs.

  • Fixed invalid range checks in memory buffer implementations.
  • Fixed invalid 32-bit offsets in memory buffer implementations.
  • Fixed if-conversion transformation generating invalid programs in some cases (#232, #233).
  • Fixed code-analysis issues that could cause invalid analysis results (#220).
  • Added support for 64-bit length buffers and views (#196, #210, #215, #216).
    Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
  • Added new if-conversion transformation to improve performance (#183).
  • Added support for 16-bit float (Half) types (#180, #208).
  • Added initial support for fixed array buffers (#200).
  • Added support for non-capturing lambda kernels (#79, #136).
  • Added support for multidimensional ExchangeBuffers (#148).
  • Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
  • Fixed invalid lowering of arrays in divergent control flow (#201).
  • Fixed invalid handling of prefixed IL instructions (#204, #211).
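The if-conversion transformation listed above replaces a short branch by a select-style computation, which tends to help on GPUs where divergent branches are costly. The Python sketch below shows the idea for integers with hypothetical function names. It is not a description of ILGPU's actual pass.

```python
def clamp_branch(x: int, limit: int) -> int:
    """Clamp x to an upper limit using explicit control flow."""
    if x > limit:
        result = limit
    else:
        result = x
    return result


def select(cond: bool, if_true: int, if_false: int) -> int:
    """Branch-free select for integers: picks one of two ready values."""
    c = int(cond)
    return c * if_true + (1 - c) * if_false


def clamp_select(x: int, limit: int) -> int:
    """The same clamp after if-conversion: both values exist, one is chosen."""
    return select(x > limit, limit, x)
```

The conversion is only valid when evaluating both sides has no side effects.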

Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.

Release v0.9.0-beta1

03 Jan 03:02
v0.9.0-beta1
Pre-release
  • Added support for 64-bit length buffers and views (#196, #210, #215, #216).
    Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
  • Added new if-conversion transformation to improve performance (#183).
  • Added support for 16-bit float (Half) types (#180, #208).
  • Added initial support for fixed array buffers (#200).
  • Added support for non-capturing lambda kernels (#79, #136).
  • Added support for multidimensional ExchangeBuffers (#148).
  • Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
  • Fixed invalid lowering of arrays in divergent control flow (#201).
  • Fixed invalid handling of prefixed IL instructions (#204, #211).

Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.