Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework recursive process_file core calls #191

Merged
merged 6 commits into from
Jan 27, 2022
Merged

Conversation

kukovecz
Copy link
Contributor

Fixes #181

@kukovecz kukovecz requested a review from vlaci January 21, 2022 11:04
@kukovecz kukovecz force-pushed the 181-remove-recursion branch 5 times, most recently from 832600e to 07b1789 Compare January 25, 2022 14:43
@vlaci vlaci mentioned this pull request Jan 26, 2022
@kukovecz kukovecz force-pushed the 181-remove-recursion branch 6 times, most recently from 7c76c9e to cdd7e82 Compare January 27, 2022 14:11
We agreed that we need to rework the recursion around the `process_file`
function, as this makes handling the unblob core codebase cumbersome to
deal with.

Now the `process_file` function calls are converted to `Task` objects,
which are put to a `Queue` and during unblobbing, we are getting
elements from the `Queue` and process them just like before.

Additionally, we need to think of multiprocessing. The only hard part is
to find the right way to recognize when there is nothing more to
process, and never will be - a right exit condition (because every file
processing can yield to multiple additional files to process). In this
matter, the multiprocessing.JoinableQueue can help us by manually
marking finished each Task in the Queue with the `task_done` function.
In this way the worker processes can use the blocking `queue.get`
indefinitely and the main process can check for the whole status by
`queue.join`.
These are static and never changing values, we can easily pass them
through the functions rather then always put the same values into each
`Task`.
Now that the `process_file` is not called recursively, there is no need
to provide these as parameters, as their initial value should be
calculated from the `process_file` itself.

Also renamed `current_depth` to `depth`.
We designed max_depth to be optional in the CLI, having a default value
hardcoded in unblob. It should work the same way when directly calling
the `process_file` function.

Also add showing the default value to the CLI.
Using the builtin `breakpoint` function does not work with
multiprocessing, as the Process forks has different STDIN so the pdb can
not attach to it. Use this `multiprocessing_breakpoint` function
instead, when you need a breakpoint for debugging.
@kukovecz kukovecz force-pushed the 181-remove-recursion branch from cdd7e82 to 5dd6e6a Compare January 27, 2022 15:29
@kissgyorgy kissgyorgy self-requested a review January 27, 2022 15:48
@kissgyorgy kissgyorgy merged commit b22ac58 into main Jan 27, 2022
@kissgyorgy kissgyorgy deleted the 181-remove-recursion branch January 27, 2022 15:49
@qkaiser
Copy link
Contributor

qkaiser commented Jan 27, 2022

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework recursive process_file core calls
3 participants