Release v0.9.2
This new stable version significantly improves the performance of generated kernel programs (get the NuGet package).
- Added new convenience `Launch` methods to the `Accelerator` class to launch kernels without pre-loading/compiling them (#319).
- Changed default inlining behavior to `AggressiveInlining` to improve performance of (usually) performance-critical GPU programs (#294).
- Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag `ContextFlags.EnhancedPTXBackendFeatures` (#274, #303).
- Added support for RTX 30xx cards (#302, #305, #311).
- Added support for tuple-types in kernel functions (#266).
- Added support for `Span<T>` in the scope of `MemoryBuffer` copy operations (#122, #276).
- Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
- Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
- Added support for unrolling of loop nests to improve performance (#281).
- Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
- Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
- Improved alignment of padding in fixed-size structures (#315).
- Fixed invalid Unix OpenCL library names (#327).
- Fixed calling ambiguous OpenCL 64-bit atomic functions (#321).
- Fixed invalid unrolling of loops in some cases (#292).
- Fixed invalid loading of unsigned fields from structures (#314).
- Fixed invalid handling of FP16 types on unsupported devices (#312).
- Fixed invalid constant folding of LHS constants in compare operations (#326).
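
The control-flow scheduling item above says the new algorithm can be enabled via O2 or the `ContextFlags.EnhancedPTXBackendFeatures` flag. A minimal sketch of both options follows. It assumes the `Context` constructor overloads taking `ContextFlags` and `OptimizationLevel` are available in this version; the variable names are illustrative only.

```csharp
using ILGPU;

// Sketch only: assumes ILGPU 0.9.2 is referenced via NuGet and that the
// Context constructor overloads used below exist in that version.
class Program
{
    static void Main()
    {
        // Option 1: request the O2 optimization level for all kernels
        // compiled through this context.
        using (var o2Context = new Context(ContextFlags.None, OptimizationLevel.O2))
        {
            // Create accelerators and load kernels from o2Context as usual.
        }

        // Option 2: keep the default optimization level and opt in
        // through the flag named in the release notes.
        using (var flagContext = new Context(ContextFlags.EnhancedPTXBackendFeatures))
        {
            // Create accelerators and load kernels from flagContext as usual.
        }
    }
}
```

Either setting applies to every kernel compiled through that context, so it may be worth benchmarking a representative kernel with and without it before switching an existing application over.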
Major internal changes:
- Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
- Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
- Added additional debugging capabilities via new dumper methods (#282).
Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.