workaround cdt's small const memcpy host function calls in EOS VM OC #1016

spoonincode · 2024-11-06T20:45:00Z

AntelopeIO/cdt#102 is a long time thorn and while obviously it would be ideal to fix cdt to not generate these host function calls in the first place, it's not clear when that will happen and regardless there will be many contracts that may never be recompiled and years of historical blocks that will have these calls forever.

This implements a workaround in EOS VM OC that identifies small constant size memcpy host function calls during code generation and replaces them with simple memory loads and stores plus a small call to a native function that verifies the memcpy ranges do not overlap as required by protocol rules.

As a simple example, a contract such as,

(module
 (import "env" "memcpy" (func $memcpy (param i32 i32 i32) (result i32)))
 (export "apply" (func $apply))
 (memory 1)
 (func $apply (param $0 i64) (param $1 i64) (param $2 i64)
    (drop (call $memcpy (i32.const 4) (i32.const 200) (i32.const 8)))
 )
)

Will generate machine code (annotated by me)

    pushq	%rax
    decl	%gs:-18888     ;;decrement depth count and jump if zero
    je  	51
    movq	%gs:200, %rax  ;;load 8 bytes from address 200 in to register
    movq	%rax, %gs:4    ;;store 8 bytes to address 4 from register
    movl	$4, %edi       ;;prepare the 3 parameters to check_memcpy_params
    movl	$200, %esi
    movl	$8, %edx
    callq	*%gs:-21056    ;;call check_memcpy_params
    incl	%gs:-18888     ;;increment depth count
    popq	%rax
    retq
    callq	*%gs:-18992    ;;call depth_assert

(in practice the source and destination addresses are typically not constants)

For some tested block ranges starting at 346025446, 348083350, 371435719, and 396859576 replay performance increases within a range of 3.5% to 6%. This is a bit lower than I expected -- I was expecting consistently north of 5% -- the reason for missing my expectation is that there are still many memcpy host function calls that are not optimized out, and the overlap check can still consume significant amount of CPU.

A run of the LLVM exhaustive workflow is at,
https://github.com/AntelopeIO/spring/actions/runs/11711214896

spoonincode · 2024-11-06T20:45:24Z

libraries/chain/include/eosio/chain/webassembly/eos-vm-oc/intrinsic.hpp

@@ -25,4 +25,7 @@ using intrinsic_map_t = std::map<std::string, intrinsic_entry>;

 const intrinsic_map_t& get_intrinsic_map();

+static constexpr unsigned minimum_const_memcpy_intrinsic_to_optimize = 1;
+static constexpr unsigned maximum_const_memcpy_intrinsic_to_optimize = 128;


128 is a bit of a dice roll. I don't have data to know where the exact threshold is between this local copy being faster vs slower than the full call off to memcpy. I wouldn't want to go higher than this anyways though, since it bloats the code a little more.

spoonincode · 2024-11-06T20:45:38Z

libraries/chain/include/eosio/chain/webassembly/eos-vm-oc/intrinsic.hpp

@@ -25,4 +25,7 @@ using intrinsic_map_t = std::map<std::string, intrinsic_entry>;

 const intrinsic_map_t& get_intrinsic_map();

+static constexpr unsigned minimum_const_memcpy_intrinsic_to_optimize = 1;
+static constexpr unsigned maximum_const_memcpy_intrinsic_to_optimize = 128;
+


This is a weird place for this, but header files for OC are a little weird because of some parts still are compiled with C++17 -- need to place these in a header file that is consumed by the C++17 code.

spoonincode · 2024-11-06T20:46:21Z

libraries/chain/webassembly/runtimes/eos-vm-oc/LLVMEmitIR.cpp

+					  const_memcpy_sz->getZExtValue() >= minimum_const_memcpy_intrinsic_to_optimize &&
+					  const_memcpy_sz->getZExtValue() <= maximum_const_memcpy_intrinsic_to_optimize) {
+					const unsigned sz_value = const_memcpy_sz->getZExtValue();
+					llvm::IntegerType* type_of_memcpy_width = llvm::Type::getIntNTy(context, sz_value*8);


The use of a non-power-of-2 type size here is rather unorthodox, though one would have to imagine something like _BitInt from C23 would do exactly this. Unit tests exhaustively explore every size (1 through 128). An oddball size does generate mostly expected machine code (loads and stores are maybe not in "natural" order one would expect but that doesn't matter of course), for example for size 13:

pushq %rax decl %gs:-18888 je 83 movl %gs:208, %eax ;;load 4 bytes (208-211) movq %gs:200, %rcx ;;load 8 bytes (200-207) movb %gs:212, %dl ;;load 1 byte (212) movb %dl, %gs:16 ;;store 1 byte (16) movq %rcx, %gs:4 ;;store 8 bytes (4-11) movl %eax, %gs:12 ;;store 4 bytes (12-15) movl $4, %edi movl $200, %esi movl $13, %edx callq *%gs:-21056 incl %gs:-18888 popq %rax retq callq *%gs:-18992

spoonincode · 2024-11-06T20:46:55Z

libraries/chain/webassembly/runtimes/eos-vm-oc/executor.cpp

+   const unsigned ulength = length;
+
+   //this must remain the same behavior as the memcpy host function
+   if((size_t)(std::abs((ptrdiff_t)udest - (ptrdiff_t)usrc)) >= ulength)


This check compiles down to about ~11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever can shave off another 0.5%.

wonder if

if ((std::max(udest, usrc) - std::min(udest, usrc)) >= ulength)

might be faster?

This check compiles down to about ~11 instructions but can still consume a substantial amount of CPU (beyond 1% in some cases). Perhaps something more clever can shave off another 0.5%.

Did you try plain if comparisons without using std::abs?

Both of these interesting suggestions and worth trying. Though, I'd suggest we keep the existing impl initially as it's most like what already exists (and has already been performance tested at the moment).

I'm also curious what would happen if the check was moved completely inline. The obviously bad aspect of this is that it's going to put a lot of pressure on the branch prediction cache. But.. there is a non-trivial amount of work just to get the values in to the right registers to pass off to a standard ABI function; then again maybe these get pipelined well. Plenty to experiment with here.

ericpassmore · 2024-11-22T23:20:51Z

Note:start
category: Other
component: OC
summary: Improve EOS EVM by working around CDT's memcpy function.
Note:end

spoonincode · 2025-02-03T02:57:26Z

exhaustive LLVM run here,
https://github.com/AntelopeIO/spring/actions/runs/13105241743

chicken dance pending

spoonincode · 2025-02-03T02:59:15Z

libraries/chain/include/eosio/chain/webassembly/eos-vm-oc/eos-vm-oc.hpp

@@ -44,7 +44,7 @@ enum eosvmoc_exitcode : int {
   EOSVMOC_EXIT_EXCEPTION
 };

-static constexpr uint8_t current_codegen_version = 1;
+static constexpr uint8_t current_codegen_version = 2;


This small change will effectively purge a pre-1.1 cache since on load,

spring/libraries/chain/webassembly/runtimes/eos-vm-oc/code_cache.cpp

Lines 345 to 354 in 3984428

for(unsigned i = 0; i < number_entries; ++i) {

code_descriptor cd;

fc::raw::unpack(ds, cd);

if(cd.codegen_version != current_codegen_version) {

allocator->deallocate(code_mapping + cd.code_begin);

allocator->deallocate(code_mapping + cd.initdata_begin);

continue;

}

_cache_index.push_back(std::move(cd));

}

libraries/libfc/include/fc/crypto/sha256.hpp

spoonincode · 2025-02-04T15:53:25Z

Replay test completed successfully

spoonincode added 2 commits November 6, 2024 15:17

make throw_internal_exception a little more generic

af124c6

handle small constant sized memcpy host function calls internal to OC

b2ca768

spoonincode commented Nov 6, 2024

View reviewed changes

spoonincode linked an issue Nov 6, 2024 that may be closed by this pull request

experiment with workaround for cdt inefficient memcpy usage #994

Closed

greg7mdp approved these changes Nov 20, 2024

View reviewed changes

spoonincode added 3 commits January 29, 2025 15:13

Merge remote-tracking branch 'spring/main' into HEAD

491addc

move fc::hash_value(sha256) to common location

d287f80

bump OC codegen version

b1579ed

heifner approved these changes Jan 30, 2025

View reviewed changes

spoonincode marked this pull request as ready for review February 3, 2025 02:57

spoonincode commented Feb 3, 2025

View reviewed changes

greg7mdp approved these changes Feb 3, 2025

View reviewed changes

libraries/libfc/include/fc/crypto/sha256.hpp Show resolved Hide resolved

greg7mdp approved these changes Feb 4, 2025

View reviewed changes

spoonincode merged commit c9d657e into main Feb 4, 2025
46 checks passed

spoonincode deleted the oc_opt_small_const_memcpy branch February 4, 2025 17:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

workaround cdt's small const memcpy host function calls in EOS VM OC #1016

workaround cdt's small const memcpy host function calls in EOS VM OC #1016

spoonincode commented Nov 6, 2024

spoonincode Nov 6, 2024

spoonincode Nov 6, 2024

spoonincode Nov 6, 2024

spoonincode Nov 6, 2024

greg7mdp Nov 6, 2024

linh2931 Nov 7, 2024

spoonincode Jan 30, 2025

ericpassmore commented Nov 22, 2024 •

edited

Loading

spoonincode commented Feb 3, 2025

spoonincode Feb 3, 2025

spoonincode commented Feb 4, 2025

	for(unsigned i = 0; i < number_entries; ++i) {
	code_descriptor cd;
	fc::raw::unpack(ds, cd);
	if(cd.codegen_version != current_codegen_version) {
	allocator->deallocate(code_mapping + cd.code_begin);
	allocator->deallocate(code_mapping + cd.initdata_begin);
	continue;
	}
	_cache_index.push_back(std::move(cd));
	}

workaround cdt's small const memcpy host function calls in EOS VM OC #1016

workaround cdt's small const memcpy host function calls in EOS VM OC #1016

Conversation

spoonincode commented Nov 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ericpassmore commented Nov 22, 2024 • edited Loading

spoonincode commented Feb 3, 2025

Choose a reason for hiding this comment

spoonincode commented Feb 4, 2025

ericpassmore commented Nov 22, 2024 •

edited

Loading