Releases: m4rs-mt/ILGPU
Release v0.10.1
The new stable version contains several bug fixes and improves the code quality of the generated kernel programs (get the NuGet package).
It is strongly recommended to upgrade to this version as soon as possible to avoid known bugs and some CPU-buffer deallocation issues.
Changes
- Added CopySign intrinsic (#438).
- Added intrinsic mappings for BitConverter functions (#437).
- Added call stack recording during compilation for error reporting (#436).
- Gracefully fail when loading symbols from in-memory assemblies (#435).
- Fixed invalid detection of loop bodies (#452).
- Fixed incorrect assertion on repeating successors (#447).
- Fixed emitting switch statement with constant condition (#442).
- Fixed invalid disposal of CPU buffers (#440).
- Fixed applications blocking during tear-down by changing Accelerator GC thread to run in the background (#439).
- Fixed bounds check on large views (#433).
- Fixed retrieving field from structure types (#426).
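For readers unfamiliar with the two new intrinsic groups, the operations they correspond to can be sketched outside of ILGPU. The Python snippet below only illustrates what a copy-sign operation and bit-reinterpreting conversions compute. The helper names `float_to_int32_bits` and `int32_bits_to_float` are hypothetical, and these notes do not state which BitConverter overloads are actually mapped.

```python
import math
import struct


def float_to_int32_bits(value: float) -> int:
    """Reinterpret the bits of a 32-bit float as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


def int32_bits_to_float(bits: int) -> float:
    """Reinterpret a signed 32-bit integer as a 32-bit float."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]


# Copy-sign: magnitude of the first argument, sign of the second.
print(math.copysign(3.0, -0.0))   # -3.0

# Bit reinterpretation: no numeric conversion, only a change of view.
print(float_to_int32_bits(1.0))   # 1065353216 (0x3F800000)
print(int32_bits_to_float(float_to_int32_bits(-2.5)))  # -2.5
```

Note that copy-sign also honors the sign of negative zero, which a simple comparison against zero would miss.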
Special thanks
Special thanks to @MoFtZ, @marcin-krystianc and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.10.0
The new stable version offers significant performance improvements to the generated kernel programs and contains critical resource deallocation fixes (get the NuGet package).
It is strongly recommended to upgrade to this version as soon as possible to avoid resource and GC related deallocation issues.
Breaking changes
- The inheritance hierarchy of the ExchangeBuffer class has been changed to avoid exposing internal memory buffers. If you previously relied on the immediate inheritance from ExchangeBufferBase on MemoryBuffer, you have to adapt your program to use the intermediate base class MemoryBuffer<T, TIndex> instead (see diff).
- Properties exposing internal memory buffers of the high-level MemoryBufferXD classes have been removed to avoid ownership related GC-free issues (see diff).
Why are there breaking changes?
We have decided to remove dangerous properties from several memory buffer classes. The use of these properties can lead to program crashes, since buffers could be disposed asynchronously in the background by the GC without further notice.
Changes
- Improved performance of kernel launchers by passing packed argument structures (#358, #372).
- Graduated different optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
- Graduated O2 optimizations in the Cuda backend to the O1 pipeline to generate vectorized IO operations in release builds (#350).
- Added support for the managed sizeof IL instruction (#380).
- Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
- Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (use the flag ContextFlags.EnableAssertions or attach a debugger to your application to enable assertion checks; make sure to use the portable debug information format for detailed source location information) (#375).
- Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
- Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
- Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
- Added enhanced support for local memory via a new LocalMemory class (#316).
- Added support for several PopCount, CLZ and CTZ operations (#324).
- Added new MemSet functions to all memory buffers (#338).
- Added new IfConditionalConversion to fold nested and-also and or-else block chains to the O2 pipeline (#328).
- Added new local memory optimizations to simplify array accesses (#317).
- Added simple 64-bit-based LongGlobalIndex helper to simplify correct computations using 64-bit integers (#337).
- Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
- Removed support for .NET Core 2.0 (#353).
- Prevent using SharedMemory in implicitly grouped kernels (#354).
- Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
- Fixed critical GC-related resource deallocation issues (#376, #393).
- Fixed returning correct length of dynamic shared memory buffers (#357).
- Fixed invalid alignment information in the presence of reinterpret casts (#386).
- Fixed invalid address computations of fixed array buffers (#361).
- Fixed invalid PTX calling convention (#362).
- Fixed edge cases in LoopUnrolling (#373).
- Fixed invalid printf formats for int64 and uintX types (#391).
- Fixed invalid DebugArrayView implementations (#345).
- Fixed invalid initializations of local memory arrays (#287).
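As a reminder of what the PopCount, CLZ and CTZ operations listed above compute, here is a small reference sketch in Python. It is not ILGPU code: the function names are hypothetical, and the result for a zero input (returning the bit width) is an assumption of this sketch, since hardware instructions differ on that case.

```python
def pop_count(x: int, bits: int = 32) -> int:
    """Number of set bits in the low `bits` bits of x."""
    return bin(x & ((1 << bits) - 1)).count("1")


def clz(x: int, bits: int = 32) -> int:
    """Count leading zeros; returns `bits` for x == 0 (assumed convention)."""
    x &= (1 << bits) - 1
    return bits - x.bit_length()


def ctz(x: int, bits: int = 32) -> int:
    """Count trailing zeros; returns `bits` for x == 0 (assumed convention)."""
    x &= (1 << bits) - 1
    if x == 0:
        return bits
    # x & -x isolates the lowest set bit.
    return (x & -x).bit_length() - 1


print(pop_count(0b1011))  # 3
print(clz(1))             # 31
print(ctz(8))             # 3
```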
Major internal changes:
- Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
- Updated default optimizations for ILGPU debug builds (#384).
- Added support for unit tests running on .NET Framework 4.7 (#355).
- Migrated from FxCop analyzers to .NET analyzers (#352).
- Redesigned internal address-space inference passes (#364).
Special thanks
Special thanks to @MoFtZ, @Ruberik and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.10.0-beta2
This new beta version offers important bug fixes and performance improvements to the generated kernel programs, along with a set of new features (get the NuGet package).
- Improved performance of kernel launchers by passing packed argument structures (#358, #372).
- Added support for the managed sizeof IL instruction (#380).
- Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
- Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (use the flag ContextFlags.EnableAssertions or attach a debugger to your application to enable assertion checks; make sure to use the portable debug information format for detailed source location information) (#375).
- Removed support for .NET Core 2.0 (#353).
- Prevent using SharedMemory in implicitly grouped kernels (#354).
- Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
- Fixed critical GC-related resource deallocation issues (#376, #393).
- Fixed returning correct length of dynamic shared memory buffers (#357).
- Fixed invalid alignment information in the presence of reinterpret casts (#386).
- Fixed invalid address computations of fixed array buffers (#361).
- Fixed invalid PTX calling convention (#362).
- Fixed edge cases in LoopUnrolling (#373).
- Fixed invalid printf formats for int64 and uintX types (#391).
Major internal changes:
- Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
- Updated default optimizations for ILGPU debug builds (#384).
- Added support for unit tests running on .NET Framework 4.7 (#355).
- Migrated from FxCop analyzers to .NET analyzers (#352).
- Redesigned internal address-space inference passes (#364).
Special thanks to @MoFtZ and @Ruberik for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.10.0-beta1
This new beta version offers significant performance improvements to the generated kernel programs and a set of new features (get the NuGet package).
- Graduated different optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
- Graduated O2 optimizations in the Cuda backend to the O1 pipeline to generate vectorized IO operations in release builds (#350).
- Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
- Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
- Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
- Added enhanced support for local memory via a new LocalMemory class (#316).
- Added support for several PopCount, CLZ and CTZ operations (#324).
- Added new MemSet functions to all memory buffers (#338).
- Added new IfConditionalConversion to fold nested and-also and or-else block chains to the O2 pipeline (#328).
- Added new local memory optimizations to simplify array accesses (#317).
- Added simple 64-bit-based LongGlobalIndex helper to simplify correct computations using 64-bit integers (#337).
- Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
- Fixed invalid DebugArrayView implementations (#345).
- Fixed invalid initializations of local memory arrays (#287).
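The motivation for a 64-bit global index can be illustrated with a short sketch. The Python snippet below emulates wrapping 32-bit arithmetic to show how a flattened index of the form grid index times group size plus thread index can overflow. It is an assumption-laden illustration, not the implementation of LongGlobalIndex; all function names are hypothetical.

```python
def wrap_int32(value: int) -> int:
    """Wrap an integer to the signed 32-bit range, like unchecked int arithmetic."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def global_index_32(grid_idx: int, group_dim: int, group_idx: int) -> int:
    """Flattened index computed with wrapping 32-bit arithmetic."""
    return wrap_int32(wrap_int32(grid_idx * group_dim) + group_idx)


def global_index_64(grid_idx: int, group_dim: int, group_idx: int) -> int:
    """Flattened index computed without 32-bit wrap-around."""
    return grid_idx * group_dim + group_idx


# 2**21 groups of 1024 threads: the product is 2**31 and no longer fits in int32.
print(global_index_32(2**21, 1024, 7))  # -2147483641
print(global_index_64(2**21, 1024, 7))  # 2147483655
```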
Special thanks to @MoFtZ and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.9.2
The new stable version offers significant performance improvements to the generated kernel programs (get the NuGet package).
- Added new convenience Launch methods to Accelerator class to launch kernels without pre-loading/compiling them (#319).
- Changed default inlining behavior to AggressiveInlining to improve performance of (usually) performance critical GPU programs (#294).
- Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
- Added support for RTX 30xx cards (#302, #305, #311).
- Added support for tuple-types in kernel functions (#266).
- Added support for Span<T> in the scope of MemoryBuffer copy operations (#122, #276).
- Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
- Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
- Added support for unrolling of loop nests to improve performance (#281).
- Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
- Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
- Improved alignment of padding in fixed-size structures (#315).
- Fixed invalid Unix OpenCL library names (#327).
- Fixed calling ambiguous OpenCL 64-bit atomic functions (#321).
- Fixed invalid unrolling of loops in some cases (#292).
- Fixed invalid loading of unsigned fields from structures (#314).
- Fixed invalid handling of FP16 types on unsupported devices (#312).
- Fixed invalid constant folding of LHS constants in compare operations (#326).
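To illustrate what a loop invariant code motion (LICM) transformation does in general, the following Python sketch shows a loop before and after hoisting an invariant expression. This is a hand-written example of the technique, not output of the ILGPU optimizer; the function names are hypothetical.

```python
def scale_before(values, a, b):
    out = []
    for v in values:
        # a * a + b does not depend on the loop variable,
        # yet it is recomputed in every iteration.
        factor = a * a + b
        out.append(v * factor)
    return out


def scale_after(values, a, b):
    # The invariant expression has been hoisted out of the loop.
    factor = a * a + b
    out = []
    for v in values:
        out.append(v * factor)
    return out


print(scale_before([1, 2, 3], 2, 1))  # [5, 10, 15]
print(scale_after([1, 2, 3], 2, 1))   # [5, 10, 15]
```

Both versions compute the same result; the second one does less work per iteration, which also reduces the size of the loop body.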
Major internal changes:
- Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
- Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
- Added additional debugging capabilities via new dumper methods (#282).
Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.9.2-beta1
This new beta version offers significant performance improvements to the generated kernel programs (get the NuGet package).
- Changed default inlining behavior to AggressiveInlining to improve performance of (usually) performance critical GPU programs (#294).
- Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
- Added support for RTX 30xx cards (#302, #305).
- Added support for tuple-types in kernel functions (#266).
- Added support for Span<T> in the scope of MemoryBuffer copy operations (#122, #276).
- Added new Capability API to enable specific extensions in the scope of OpenCL programs and to provide better error messages (#103, #279).
- Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
- Added support for unrolling of loop nests to improve performance (#281).
- Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
- Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in a huge variety of additional cases (#304).
- Fixed invalid unrolling of loops in some cases (#292).
Major internal changes:
- Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
- Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
- Added additional debugging capabilities via new dumper methods (#282).
Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.9.1
The new stable version offers significant performance improvements to the generated kernel programs (get the NuGet package).
- Added initial loop unrolling capabilities for innermost loops (#259).
- Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
- Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
- Added support for FP16x2 (Half2) types (#273).
- Added support for non-capturing lambda kernels (#186).
- Added additional copy operations to ExchangeBuffer (#255).
- Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
- Fixed invalid accelerator synchronization in OpenCL (#246).
- Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
- Fixed invalid handling of unsafe array buffers in several cases (#262, #263, #285).
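The sign-extension fix above concerns unsigned types: byte and ushort are unsigned in .NET, so widening them should zero-extend rather than sign-extend. The Python sketch below contrasts the two behaviors on raw bit patterns. The helper names are hypothetical, and the details of the original bug in #239 are not described in these notes.

```python
def zero_extend(value: int, bits: int) -> int:
    """Widen an unsigned `bits`-wide value: upper bits become zero."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Widen a signed `bits`-wide value: upper bits copy the sign bit."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


# The byte pattern 0xFF means 255 for an unsigned byte ...
print(zero_extend(0xFF, 8))     # 255
# ... but would turn into -1 if it were sign-extended by mistake.
print(sign_extend(0xFF, 8))     # -1
print(zero_extend(0xFFFF, 16))  # 65535
```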
Major internal changes:
- Added new enhanced loop-analyses classes to get detailed insights about loops in ILGPU programs (#259).
- Refactored the internal static-program analysis framework (#247).
- Updated native DLL-interop API (#249).
- Fixed code analysis warnings (#248).
Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.9.1-beta1
This new beta version offers significant performance improvements to the generated kernel programs (get the NuGet package).
- Added initial loop unrolling capabilities for innermost loops (#259).
- Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
- Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
- Added support for FP16x2 (Half2) types (#273).
- Added support for non-capturing lambda kernels (#186).
- Added additional copy operations to ExchangeBuffer (#255).
- Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
- Fixed invalid accelerator synchronization in OpenCL (#246).
- Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
- Fixed invalid handling of unsafe array buffers in several cases (#262, #263).
Major internal changes:
- Added new enhanced loop-analyses classes to get detailed insights about loops in ILGPU programs (#259).
- Refactored the internal static-program analysis framework (#247).
- Updated native DLL-interop API (#249).
- Fixed code analysis warnings (#248).
Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.
Release v0.9.0
This new stable version offers significant performance and code quality improvements of the generated kernel programs.
- Fixed invalid range checks in memory buffer implementations.
- Fixed invalid 32-bit offsets in memory buffer implementations.
- Fixed if-conversion transformation generating invalid programs in some cases (#232, #233).
- Fixed code-analyses issues that could cause invalid analysis results (#220).
- Added support for 64-bit length buffers and views (#196, #210, #215, #216). Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
- Added new if-conversion transformation to improve performance (#183).
- Added support for 16-bit float (Half) types (#180, #208).
- Added initial support for fixed array buffers (#200).
- Added support for non-capturing lambda kernels (#79, #136).
- Added support for multidimensional ExchangeBuffers (#148).
- Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
- Fixed invalid lowering of arrays in divergent control flow (#201).
- Fixed invalid handling of prefixed IL instructions (#204, #211).
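As background on the if-conversion transformation mentioned above: in general, if-conversion replaces a small branch with a branch-free select, which tends to help on GPUs where divergent branches are costly. The Python sketch below shows the idea on a clamp operation with integer operands. It is a generic illustration with hypothetical names, not the ILGPU pass itself.

```python
def clamp_branchy(x: int, limit: int) -> int:
    # Original form: control flow decides which value is produced.
    if x > limit:
        result = limit
    else:
        result = x
    return result


def select(condition: bool, if_true: int, if_false: int) -> int:
    """Branch-free select for integers using a bit mask."""
    mask = -int(condition)  # all ones if condition holds, zero otherwise
    return (if_true & mask) | (if_false & ~mask)


def clamp_converted(x: int, limit: int) -> int:
    # If-converted form: both values are available, a select picks one.
    return select(x > limit, limit, x)


for v in (-5, 3, 10, 42):
    print(clamp_branchy(v, 10), clamp_converted(v, 10))
```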
Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.
Release v0.9.0-beta1
- Added support for 64-bit length buffers and views (#196, #210, #215, #216). Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
- Added new if-conversion transformation to improve performance (#183).
- Added support for 16-bit float (Half) types (#180, #208).
- Added initial support for fixed array buffers (#200).
- Added support for non-capturing lambda kernels (#79, #136).
- Added support for multidimensional ExchangeBuffers (#148).
- Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
- Fixed invalid lowering of arrays in divergent control flow (#201).
- Fixed invalid handling of prefixed IL instructions (#204, #211).
Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.