OSquery fails to run after upgrade from 8.16.1 to 8.17.1 #6792
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Relevant docs: https://osquery.readthedocs.io/en/stable/deployment/extensions/
|
We need to test whether this happens with fresh 8.17.1 installs or exclusively with upgrades. We don't have any internal reports of this yet since OSQuery is enabled in our internal InfoSec deployments of agent but I've reached out to double check. |
I'll try a fresh install and report my findings here. |
I've just tested a fresh install from the zip and Osquery works without any problem. |
There are apparently 5 machines in our internal InfoSec Fleet that have this. It feels like this problem doesn't happen every time, consistent with Tiago being able to reproduce it 1 out of 3 times, as mentioned in our internal support case about this problem. |
I'm still having a hard time reproducing it. Yesterday I wrote an integration test to keep trying this upgrade scenario, but even with that running multiple times I have not managed to reproduce it or gather more information about the failure. :/ |
My bet is the problem is in this code, and delayed visibility of directory entries on Windows could contribute. When we see a file before its parent directory, we create the directory with placeholder permissions. Then we call [...]. If we wanted to prove this were happening, we would want to compare the directory permissions when the failure happens with what they are naturally in the .zip. We could perhaps also detect this with debug logging, as we'd need the "Unpacking file" log line to come before the "Unpacking directory" line for this to happen. |
elastic-agent/internal/pkg/agent/application/upgrade/step_unpack.go Lines 158 to 162 in 1eefbe0
I wonder if making the Chmod above unconditional could fix this; that way we aren't dependent on the use of os.Stat at all. It's only slightly less efficient because it will call Chmod unnecessarily if MkdirAll above it actually created the directory. |
It looks like the components directory wants 0755, which would default to 0770 if we created the file first without knowing the directory permissions, and then the directory would be narrowed to 0750 if things were working correctly.
elastic-agent/dev-tools/packaging/packages.yml Lines 196 to 204 in 1eefbe0
@belimawr if you try to run an agent with osquery in the policy and set the permissions on the |
Let me try. |
If this really is the problem, what about unpacking into a directory on the same partition with a unique name (use os.MkdirTemp) and then extracting into that? We don't have to check for the existence of any directories because it is all new, and we can set any perms we require. Then when that is done use |
No luck on Linux :/ I even tried changing the ownership of the file to another user:
Let me try on Windows. |
No luck on Windows 2019 either:
Tomorrow I can try again with |
I managed to reproduce it on Linux by changing the permissions to 777 with [...]. I'll look into implementing the changes to enforce the correct permissions when extracting the zip. |
After lots of testing I managed to craft a situation where this (or a similar?) issue is reproducible. However, it is not triggered by an upgrade; it is triggered by a change in permissions of the Elastic-Agent folder (or the [...]). The good news is that my PR (#6998) fixes it. Here is the step by step to reproduce it.
|
I've just tested this on a physical Windows machine, my test device running |
Actually, in their latest response before closing the case, the customer mentioned that they would be following here, so I believe they are aware of all your tests and the fix. I'll keep an eye out to see if they open any other case related to this. |
@cmacknz, @blakerouse I believe it's better to keep the discussion for a permanent fix here instead of using the PR. So, given the discussions we had today and the details I explained on #6792 (comment), it seems there are two independent actions we can take:
The one case I can think of right now is an actor (either good or bad, intentionally or unintentionally) that ends up changing permissions/files. Anyways, I would not touch |
For confirmed bugs, please report:
Steps to Reproduce:
8.17.1
Then some errors communicating with it:
And other OSquery errors:
Out of 3 attempts on Windows Server 2019, only the first failed; all other attempts worked fine.