Lightfuzz cleanup #2300

Merged: 34 commits from lightfuzz-cleanup into lightfuzz, Feb 27, 2025
Conversation

liquidsec (Collaborator) commented:

Final cleanup and preparation for move to dev


codecov bot commented Feb 20, 2025

Codecov Report

Attention: Patch coverage is 88.60759% with 18 lines in your changes missing coverage. Please review.

Project coverage is 93%. Comparing base (7800fb3) to head (95bf6e1).
Report is 52 commits behind head on lightfuzz.

| Files with missing lines | Patch % | Missing lines |
|--------------------------|--------:|--------------:|
| bbot/modules/lightfuzz/lightfuzz.py | 80% | 5 ⚠️ |
| bbot/modules/lightfuzz/submodules/crypto.py | 87% | 4 ⚠️ |
| bbot/modules/lightfuzz/submodules/serial.py | 72% | 4 ⚠️ |
| bbot/modules/lightfuzz/submodules/cmdi.py | 67% | 2 ⚠️ |
| bbot/modules/lightfuzz/submodules/xss.py | 75% | 1 ⚠️ |
| bbot/test/test_step_1/test__module__tests.py | 67% | 1 ⚠️ |
| .../test_step_2/module_tests/test_module_lightfuzz.py | 93% | 1 ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff            @@
##           lightfuzz   #2300   +/-   ##
=========================================
+ Coverage         93%     93%   +1%
=========================================
  Files            395     396    +1
  Lines          32528   32586   +58
=========================================
+ Hits           30088   30161   +73
+ Misses          2440    2425   -15
```


@liquidsec (Collaborator, Author) commented:

pyahocorasick benchmark

Total times in seconds over 1,000 iterations:

| Text size | Substrings | `string_scan` | `string_scan_yara` | Python `in` |
|----------:|-----------:|--------------:|-------------------:|------------:|
| 10,000 | 10 | 0.07 | 1.39 | 0.02 |
| 10,000 | 100 | 0.21 | 2.97 | 0.19 |
| 10,000 | 1,000 | 0.97 | 64.44 | 1.88 |
| 100,000 | 10 | 0.57 | 1.69 | 0.25 |
| 100,000 | 100 | 1.39 | 3.43 | 2.40 |
| 100,000 | 1,000 | 2.59 | 65.54 | 23.56 |
| 1,000,000 | 10 | 5.84 | 3.29 | 2.41 |
| 1,000,000 | 100 | 13.23 | 5.54 | 23.45 |
| 1,000,000 | 1,000 | 18.80 | 69.83 | 231.80 |

Benchmark code:

```python
import time
import random
import string

from bbot.core.helpers.misc import string_scan, string_scan_yara


def generate_random_text(length):
    """Generate a random alphanumeric string of the specified length."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))


def generate_random_substrings(count, length):
    """Generate a list of random substrings."""
    return [generate_random_text(length) for _ in range(count)]


def stress_test(iterations=1000, substring_counts=(10, 100, 1000), text_sizes=(10000, 100000, 1000000)):
    for text_size in text_sizes:
        for count in substring_counts:
            # Accumulators for the total time of each method
            total_ahocorasick_time = 0
            total_yara_time = 0
            total_python_in_time = 0

            for _ in range(iterations):
                # Generate a large random text and a list of random substrings (length 10)
                text = generate_random_text(text_size)
                substrings = generate_random_substrings(count, 10)

                # Test string_scan (pyahocorasick)
                start_time = time.time()
                result_ahocorasick = string_scan(substrings, text)
                total_ahocorasick_time += time.time() - start_time

                # Test string_scan_yara
                start_time = time.time()
                result_yara = string_scan_yara(substrings, text)
                total_yara_time += time.time() - start_time

                # Test the native Python 'in' operator
                start_time = time.time()
                result_python_in = [s for s in substrings if s in text]
                total_python_in_time += time.time() - start_time

                # Verify that all methods return the same results
                assert set(result_ahocorasick) == set(result_yara) == set(result_python_in), "Results differ between methods!"

            # Print total times for this (text_size, count) combination
            print(f"\nTotal time for string_scan with text size {text_size} and {count} substrings: {total_ahocorasick_time:.2f} seconds")
            print(f"Total time for string_scan_yara with text size {text_size} and {count} substrings: {total_yara_time:.2f} seconds")
            print(f"Total time for Python 'in' check with text size {text_size} and {count} substrings: {total_python_in_time:.2f} seconds")


if __name__ == "__main__":
    stress_test()
```

The `string_scan_yara` function used in the benchmark (with the missing `import yara` added, and the return value fixed to map matched rule names back to the substrings themselves rather than their rule indices, so the cross-method assert compares like with like):

```python
import yara


def string_scan_yara(substrings, text, case_insensitive=True):
    # Create one YARA rule per substring; rule_<idx> maps back to substrings[idx]
    rules = []
    for idx, substring in enumerate(substrings):
        condition = f'"{substring}"'
        if case_insensitive:
            condition = f"/{substring}/ nocase"
        rules.append(f"rule rule_{idx} {{ strings: $a = {condition} condition: $a }}")

    # Compile the YARA rules (this happens on every call, which dominates the runtime)
    compiled_rules = yara.compile(source="\n".join(rules))

    # Scan the text and map each matched rule name back to its substring
    matches = compiled_rules.match(data=text)
    return [substrings[int(match.rule.split("_")[1])] for match in matches]
```

My conclusions:

  • For the one use case we have currently, the standard Python `x in y` check is probably fine
  • If we got above roughly the 100-string mark, we'd want to avoid the standard Python approach
  • Any YARA solution where we compile per run is not a good one: around the point where the native Python solution starts to degrade, so does YARA, because of the increased compile time
  • The ahocorasick solution is clearly the best for a large number of strings, and/or large text, where we don't know the strings ahead of time

I'd say the only viable options are:

1. Remove it entirely and use native Python, since we still have a low string count
2. Keep the current pyahocorasick solution in, and keep things the same

But there is very little cost of any kind to having it in there.

@liquidsec liquidsec merged commit 3021ee9 into lightfuzz Feb 27, 2025
12 of 13 checks passed
@liquidsec liquidsec deleted the lightfuzz-cleanup branch February 27, 2025 20:35