Release v0.9.2

@m4rs-mt released this 22 Nov 12:28
· 1398 commits to master since this release
v0.9.2
48ea882

The new stable version significantly improves the performance of the generated kernel programs (get the NuGet package).

  • Added new convenience Launch methods to Accelerator class to launch kernels without pre-loading/compiling them (#319).
  • Changed default inlining behavior to AggressiveInlining to improve performance of (usually) performance-critical GPU programs (#294).
  • Significantly improved performance of CUDA programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
  • Added support for RTX 30xx cards (#302, #305, #311).
  • Added support for tuple-types in kernel functions (#266).
  • Added support for Span<T> in the scope of MemoryBuffer copy operations (#122, #276).
  • Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
  • Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
  • Added support for unrolling of loop nests to improve performance (#281).
  • Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
  • Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
  • Improved alignment of padding in fixed-size structures (#315).
  • Fixed invalid Unix OpenCL library names (#327).
  • Fixed calling ambiguous OpenCL 64-bit atomic functions (#321).
  • Fixed invalid unrolling of loops in some cases (#292).
  • Fixed invalid loading of unsigned fields from structures (#314).
  • Fixed invalid handling of FP16 types on unsupported devices (#312).
  • Fixed invalid constant folding of LHS constants in compare operations (#326).
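
For readers unfamiliar with the loop invariant code motion (LICM) transformation mentioned above, the following is a minimal sketch of the general idea in Python. It is not ILGPU's implementation. The instruction format, the function name `hoist_invariants`, and the assumption that every operation is side-effect free and every destination is assigned exactly once are all hypothetical simplifications made for illustration.

```python
def hoist_invariants(body, loop_vars):
    """Split a loop body into (hoisted, remaining) instruction lists.

    Each instruction is a tuple (dest, op, lhs, rhs). Assumptions of this
    sketch: every dest is assigned exactly once (SSA-like form) and every
    op is pure, so moving it in front of the loop cannot change behavior.
    """
    # Names whose value may differ between iterations: the loop variables
    # plus everything that is currently defined inside the loop body.
    variant = set(loop_vars) | {dest for dest, _, _, _ in body}
    hoisted, remaining = [], list(body)

    changed = True
    while changed:  # iterate until no further instruction can be moved
        changed = False
        for instr in list(remaining):
            dest, _, lhs, rhs = instr
            # An instruction is invariant if neither operand varies.
            if lhs not in variant and rhs not in variant:
                hoisted.append(instr)
                remaining.remove(instr)
                variant.discard(dest)  # dest is now computed once, up front
                changed = True
    return hoisted, remaining


# Hypothetical loop body: t0 and t1 do not depend on the loop variable i.
body = [
    ("t0", "mul", "a", "b"),
    ("t1", "add", "t0", 4),
    ("t2", "add", "t1", "i"),
    ("t3", "mul", "t2", "t0"),
]
hoisted, remaining = hoist_invariants(body, loop_vars=["i"])
```

In this example `t0` and `t1` move in front of the loop, while `t2` and `t3` stay inside because they depend, directly or indirectly, on `i`. The loop body shrinks, which is the code-size effect the release note refers to.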

Major internal changes:

  • Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
  • Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
  • Added additional debugging capabilities via new dumper methods (#282).
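
As background for the unreachable code elimination item above, here is a minimal sketch of how such a pass can work in general: blocks that cannot be reached from the entry block of a control-flow graph are dropped. This is an illustration only, not ILGPU's code. The dictionary-based graph representation and the name `remove_unreachable` are assumptions made for this example.

```python
def remove_unreachable(cfg, entry):
    """Return a copy of cfg that keeps only blocks reachable from entry.

    cfg maps a block name to the list of its successor block names
    (a hypothetical representation chosen for this sketch).
    """
    reachable, worklist = {entry}, [entry]
    while worklist:
        block = worklist.pop()
        for succ in cfg.get(block, ()):
            if succ not in reachable:
                reachable.add(succ)
                worklist.append(succ)
    # Keep reachable blocks only; successor lists are copied unchanged.
    return {block: list(succs) for block, succs in cfg.items()
            if block in reachable}


# Hypothetical graph: "dead" has no path from "entry" and is removed.
cfg = {
    "entry": ["loop"],
    "loop": ["loop", "exit"],
    "dead": ["exit"],
    "exit": [],
}
pruned = remove_unreachable(cfg, "entry")
```

After the call, `pruned` contains `entry`, `loop` and `exit`. A real compiler pass would additionally update any values in the surviving blocks that referenced the removed ones, which this sketch leaves out.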

Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.