-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cached interface dispatch for coreclr #111771
base: main
Are you sure you want to change the base?
Cached interface dispatch for coreclr #111771
Conversation
… on shared things, and parts that are not shared
…an be switched between
… not yet supported
…ce dispatch or virtual stub dispatch
Tagging subscribers to this area: @mangod9 |
…te that this requires adding the -mcx16 switch to clang, so that cmpxchg16b instruction gets generated, which is an increase in the baseline CPU required by CoreCLR on Linux, and isn't likely to be OK for shipping publicly
…veAOT cached interface dispatch implementation (as it isn't actually used) Update IsIPinVirtualStub to check the AVLocations, not the stub entry points
…e_dispatch_for_coreclr
…hook up the VTable offset logic and such (vtable paths are untested)
- Enable generating double pointer indirection cells in R2R files using command line switch. - Fix VTableOffset calculation - Add logic in ExternalMethodFixupWorker to handle the double pointer indirection cells.
…llocating the memory for the dispatch using the LoaderHeap Also tweak a collectible assembly test to actually use cached interface dispatch
…interface_dispatch_for_coreclr
src/coreclr/clrfeatures.cmake
Outdated
else() | ||
set(FEATURE_CORECLR_VIRTUAL_STUB_DISPATCH 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set(FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH 1) | |
else() | |
set(FEATURE_CORECLR_VIRTUAL_STUB_DISPATCH 1) | |
set(FEATURE_CACHED_INTERFACE_DISPATCH 1) | |
else() | |
set(FEATURE_VIRTUAL_STUB_DISPATCH 1) |
reg = REG_R11; | ||
regMask = RBM_R11; | ||
} | ||
reg = REG_R11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AAGGHH.. reading this has sent me down the rabbit hole of sadness which has told me that I need to move both CoreCLR and NativeAOT to use r10, and not to r11 as that is the only way to keep CFG from exploding sadly. (Or I can keep the difference between CoreCLR and NativeAOT).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of getting these unified as you may guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather... this won't break stuff, but the codegen for CFG is required to be a bit non-ideal in native aot..OTOH, it appears with some quick testing that this doesn't actually effect the codegen in any case, since our CFG generation logic isn't what I'd call great right now, so I'd like to keep moving everything to r11. If we want to use r10 we can swap back all of the amd64 stuff at once.
src/coreclr/vm/jitinterface.cpp
Outdated
DispatchToken token = VirtualCallStubManager::GetTokenFromFromOwnerAndSlot(ownerType, slot); | ||
|
||
INTERFACE_DISPATCH_CACHED_OR_VSD( | ||
return FALSE; // R2R interface dispatch currently only supports fixups with a single pointer, return FALSE to skip using the method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR has changes to add this support. Did you hit any roadblocks with making this actually work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not. But the support is only for fixups which are used directly for dispatch. This code path is used when gathering a fixup as a function pointer to use directly such as embedded into a delegate or something. I believe that this logic may only actually used for R2R versioning where we replace a non-virtual method with a virtual method and construct a delegate. I'm not entirely sure. At the very least I should write a clearer comment. I'll also see if I can actually trigger this scenario. It's quite possible that the existing R2R behavior with VSD is actually not compliant with our latest VM logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually completely dead code. The VIRTUAL_ENTRY_SLOT path was specific to NGEN. (So are the VIRTUAL_ENTRY_DEF/REF_TOKEN paths, but those might actually be useful for R2R as a load simplification, so I'm not touching those. I'm going to delete the logic that implements VIRTUAL_ENTRY_SLOT.
@@ -76,3 +76,16 @@ class VMToOSInterface | |||
// true if it succeeded, false if it failed | |||
static bool ReleaseRWMapping(void* pStart, size_t size); | |||
}; | |||
|
|||
#if defined(HOST_64BIT) && defined(FEATURE_CACHED_INTERFACE_DISPATCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if we added the InterlockedCompareExchange128
to the VMToOSInterface
above and implemented it for Unix / Windows in the respective subfolders.
When I have started with this minipal, the intent was that all stuff that needs to be platform specific will go there. Once we get rid of coreclr PAL, this was meant to be the only place for platform specific code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this is a case where we want to have a shared set of PAL apis between CoreCLR and NativeAOT, and NativeAOT hasn't moved towards the VmToOSInterface idea (And this change is already WAY too big to add that to it.) I'd welcome some rationalization after this is all merged.
@@ -1,9 +1,11 @@ | |||
set(GC_DIR ../../gc) | |||
set(SHARED_RUNTIME_DIR ../../shared_runtime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit - we don't have any folders that have names with underscores, it would be nice to stay consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to hear opinions on the directory structure from several people here. @AaronRobinsonMSFT might also be interested to comment on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jit
and gc
are shared too, but they do not have shared_
in the name. I am not a fan of the shared_
prefix.
Maybe just call it runtime
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runtime it is. I'll make the changes soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be leaving the CMAKE name of the directory as SHARED_RUNTIME_DIR though, as RUNTIME_DIR is already taken for src/coreclr/nativeaot/Runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename RUNTIME_DIR
-> NATIVEAOT_RUNTIME_DIR
and SHARED_RUNTIME_DIR
-> RUNTIME_DIR
? There are only 25 instances of RUNTIME_DIR
spread over 4 files.
# Allow 16 byte compare-exchange (cmpxchg16b) | ||
add_compile_options($<${FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH}:-mcx16>) | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check for both UNIX/AMD64 and if FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH
is set?
I'm definitely not a CMake expert, but I think this syntax means if FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH
is non-zero, then apply the flag. If so, can't we remove the explicit if
?
My push back here if we want to expand this, then at present we need to update two locations instead of one, where FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH
is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can combine all the options into the generator expression, but it usually becomes pretty unreadable with all the $< that get accumulated. So I usually use it just for the build configs or stuff where the expression is simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that. My point is about the need for the if check at all. Isn't the generator already conditioned on ${FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH}
being true? If so, why do we need the check above it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's right, I've missed this point. We don't need the if
around that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to set this flag on Windows or when targeting any other processor type than X64. So we still need the if. FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH is sometimes enabled for Windows and it is really intended to be used for Arm64 in most cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're saying that FEATURE_CORECLR_CACHED_INTERFACE_DISPATCH
is valid without the 16 byte compare exchange?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm saying that the -mcx16 switch is only valid/useful for the targeting X64 on unix platforms. I can't find ANY documentation that hints that this switch is useful on architectures other than X64. On Arm64, the baseline hardware we support requires 16 byte compare and swap, and the needed instructions are unconditionally available (via libatomic, they will either be the ldaxp
and stxp
pair or the caspal
instruction, and on Mac/iOS the compiler will always general the caspal
instruction. On Winodws, the -mcx16 switch generates a compiler warning and isn't useful, as all hardware which runs supported versions of Windows has the compare exchange 16 byte instruction.
if (pVcsMgr->cache_entry_heap != NULL) heapsToEnumerate.Push(pVcsMgr->cache_entry_heap); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#endif | |
#endif // FEATURE_VIRTUAL_STUB_DISPATCH |
@@ -3613,14 +3613,16 @@ ClrDataAccess::TraverseVirtCallStubHeap(CLRDATA_ADDRESS pAppDomain, VCSHeapType | |||
break; | |||
|
|||
case CacheEntryHeap: | |||
#ifdef FEATURE_VIRTUAL_STUB_DISPATCH | |||
pLoaderHeap = pVcsMgr->cache_entry_heap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get a small comment here on why the CacheEntryHeap is not used in the caching logic? I'm ignorant of this area and it feels odd that the VirtualCallStubManager is used, but ignored in some cases. Perhaps an assert or some other breadcrumb to indicate why this isn't appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use it, and possibly should, but as I noted in the PR description, I intentionally disabled this feature since it was more work to enable, and this change is already too big for easy development. If we decide that this sort of caching is advantageous, it would be appropriate in a separate PR to enable it for the cached interface dispatch scenario.
@@ -3663,7 +3665,9 @@ static const char *LoaderAllocatorLoaderHeapNames[] = | |||
"FixupPrecodeHeap", | |||
"NewStubPrecodeHeap", | |||
"IndcellHeap", | |||
#ifdef FEATURE_VIRTUAL_STUB_DISPATCH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds to the confusion from above considering case CacheEntryHeap:
remains, but we define out the string.
"CacheEntryHeap", | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#endif | |
#endif // FEATURE_VIRTUAL_STUB_DISPATCH |
pLoaderHeaps[i++] = HOST_CDADDR(pVcsMgr->cache_entry_heap); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#endif | |
#endif // FEATURE_VIRTUAL_STUB_DISPATCH |
} | ||
static inline DispatchToken FromCachedInterfaceDispatchToken(UINT_PTR token) | ||
{ | ||
return DispatchToken(token >> 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return DispatchToken(token >> 1); | |
_ASSERTE(IsCachedInterfaceDispatchToken(token)); | |
return DispatchToken(token >> 1); |
if (cellCacheHeader != NULL) | ||
{ | ||
InterfaceDispatch_DiscardCacheHeader(cellCacheHeader); | ||
pDispatchCell->m_pCache = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is odd. We have a getter, GetCache()
, above, but then we unilaterally manipulate the field. We should either always access the field or do it through methods/abstractions.
#ifdef FEATURE_CACHED_INTERFACE_DISPATCH | ||
if (VirtualCallStubManager::isCachedInterfaceDispatchStubAVLocation(f_IP)) | ||
return TRUE; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#endif | |
#endif // FEATURE_CACHED_INTERFACE_DISPATCH |
@@ -1,9 +1,11 @@ | |||
set(GC_DIR ../../gc) | |||
set(SHARED_RUNTIME_DIR ../../shared_runtime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…interface_dispatch_for_coreclr
Enabling cached interface dispatch as an options for CoreCLR (should reduce memory usage/remove RWX pages, at the cost of reducing performance)
DOTNET_UseCachedInterfaceDispatch
environment variable to 1. (NOTE: Enabling this feature requires running on a processor which supports 128 bit compare and swap, which has implications on Linux X64 builds, and would have implications for Loongarch/RiscV if we enable the code there.)VirtualCallStubManager
infrastructure for all non-code-generation driven lookups, but to replace the stub generation logic with the CachedInterfaceDispatch paths from NativeAOT.Known issues addressed before making a non-draft PR
Consider enabling the resolve cache for cached interface dispatch scenarios (will not do until perf results show that the current scheme is slow)PalInterlockedCompareExchange128
to the PAL or minipal