OSquery fails to run after upgrade from 8.16.1 to 8.17.1 #6792

Open
belimawr opened this issue Feb 10, 2025 · 19 comments · May be fixed by #6998
Labels
bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team

Comments

@belimawr
Contributor

For confirmed bugs, please report:

Steps to Reproduce:

  1. Install Elastic-Agent 8.16.1 with Osquery Manager and Osquery Logs integration
  2. Wait for the Elastic-Agent to become healthy
  3. Upgrade to 8.17.1
  4. The Elastic-Agent might become unhealthy, and you'll see errors from OSquery like this:
W0204 10:53:17.582293   844 extensions.cpp:426] Will not autoload extension with unsafe directory permissions: C:\Program Files\Elastic\Agent\data\elastic-agent-8.17.1-b46c28\components\osquery-extension.exe

E0204 10:53:17.615715   844 shutdown.cpp:79] Cannot activate osq_config config plugin: Unknown registry plugin: osq_config

Then some errors communicating with it:

failed to connect, reconnect might be attempted, err: dialing pipe '\\.\pipe\elastic\osquery\2084fe12-fa4c-4555-a3b1-196c9a370738': open \\.\pipe\elastic\osquery\2084fe12-fa4c-4555-a3b1-196c9a370738: The system cannot find the file specified.

attempt 1 out of 11 failed, err: dialing pipe '\\.\pipe\elastic\osquery\2084fe12-fa4c-4555-a3b1-196c9a370738': open \\.\pipe\elastic\osquery\2084fe12-fa4c-4555-a3b1-196c9a370738: The system cannot find the file specified.

And other OSquery errors:

osquerybeat Run exited with error: I0204 10:53:20.079067 10368 init.cpp:413] osquery initialized [version=5.12.1]
I0204 10:53:20.081629 10368 dispatcher.cpp:78] Adding new service: UsersService (0000024ED6BD4240) to thread: 12900 (0000024ED6C14F40) in process 12916
I0204 10:53:20.081629 10368 dispatcher.cpp:78] Adding new service: GroupsService (0000024ED6BD5670) to thread: 8372 (0000024ED6C14D40) in process 12916
W0204 10:53:20.082434 10368 extensions.cpp:426] Will not autoload extension with unsafe directory permissions: C:\Program Files\Elastic\Agent\data\elastic-agent-8.17.1-b46c28\components\osquery-extension.exe
I0204 10:53:20.082934 10368 rocksdb.cpp:90] Opening RocksDB handle: osquery\osquery.db
I0204 10:53:20.091150  8372 groups_service.cpp:55] Groups cache initialized
I0204 10:53:20.097339 12900 users_service.cpp:149] Users cache initialized
I0204 10:53:20.120223 10368 dispatcher.cpp:78] Adding new service: ExtensionWatcher (0000024ED76F0260) to thread: 22012 (0000024ED76C45A0) in process 12916
I0204 10:53:20.120223 10368 dispatcher.cpp:78] Adding new service: ExtensionRunnerCore (0000024ED7706C00) to thread: 25828 (0000024ED76C4740) in process 12916
E0204 10:53:20.120223 10368 shutdown.cpp:79] Cannot activate osq_config config plugin: Unknown registry plugin: osq_config
I0204 10:53:20.120223 10368 dispatcher.cpp:149] Thread: 10368 requesting a stop
I0204 10:53:20.120223 10368 dispatcher.cpp:156] Service: 0000024ED6BD4240 has been interrupted
I0204 10:53:20.120223 10368 dispatcher.cpp:156] Service: 0000024ED6BD5670 has been interrupted
I0204 10:53:20.120223 10368 dispatcher.cpp:156] Service: 0000024ED76F0260 has been interrupted
I0204 10:53:20.120223 25828 interface.cpp:299] Extension manager service starting: \\.\pipe\elastic\osquery\2084fe12-fa4c-4555-a3b1-196c9a370738
I0204 10:53:20.144812 10368 dispatcher.cpp:156] Service: 0000024ED7706C00 has been interrupted
I0204 10:53:20.144812 10368 dispatcher.cpp:122] Thread: 10368 requesting a join
I0204 10:53:20.149353 10368 dispatcher.cpp:140] Service thread: 0000024ED76C4740 has joined
I0204 10:53:20.149353 10368 dispatcher.cpp:140] Service thread: 0000024ED76C45A0 has joined
I0204 10:53:20.149353 10368 dispatcher.cpp:140] Service thread: 0000024ED6C14D40 has joined
I0204 10:53:20.149353 10368 dispatcher.cpp:140] Service thread: 0000024ED6C14F40 has joined
I0204 10:53:20.149353 10368 dispatcher.cpp:144] Services and threads have been cleared: exit status 78

Out of 3 attempts on Windows Server 2019, only the first failed; the other two worked fine.

@belimawr belimawr added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Feb 10, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented Feb 10, 2025

Relevant docs: https://osquery.readthedocs.io/en/stable/deployment/extensions/

Extensions Binary Permissions
First, a note: the osquery agent will refuse to load an extension executable from the filesystem if the file's permissions allow write or modify by non-privileged accounts. Before loading an extension, change the owner of the your_extension.ext file to be the root account.

On Windows, because of permission inheritance, just changing the owner of a file is not sufficient. You must also change the owner of the parent directory, remove all inherited DACLs, and disable inheritance. For example, if your osquery extensions are in the .\Extensions directory, the following commands will set permissions that satisfy osquery:

icacls .\Extensions /setowner Administrators /t
icacls .\Extensions /grant Administrators:f /t
icacls .\Extensions /inheritance:r /t
icacls .\Extensions /inheritance:d /t

@cmacknz
Member

cmacknz commented Feb 10, 2025

We need to test whether this happens with fresh 8.17.1 installs or exclusively with upgrades.

We don't have any internal reports of this yet, even though OSQuery is enabled in our internal InfoSec deployments of the agent, but I've reached out to double check.

@belimawr
Contributor Author

We need to test whether this happens with fresh 8.17.1 installs or exclusively with upgrades.

I'll try a fresh install and report my findings here.

@belimawr
Contributor Author

I've just tested a fresh install from the zip and Osquery works without any problem.

@cmacknz
Member

cmacknz commented Feb 11, 2025

There are apparently 5 machines in our internal InfoSec Fleet that show this "Will not autoload extension with unsafe directory permissions" warning.

It feels like this problem doesn't happen every time, consistent with Tiago being able to reproduce it 1 out of 3 times as mentioned in our internal support case about this problem.

@belimawr
Contributor Author

I'm still having a hard time reproducing it. Yesterday I wrote an integration test that keeps retrying this upgrade scenario, but even after running it multiple times I have not managed to reproduce the failure or gather more information about it. :/

@cmacknz
Member

cmacknz commented Feb 20, 2025

My bet is the problem is in this code, and delayed visibility of directory entries on Windows could contribute.

When we see a file before its parent directory, we create that directory with placeholder permissions. Then, when we reach the directory entry itself, we call os.Stat to either create the directory if it doesn't exist or, if it was already created with placeholder permissions as described above, fix the permissions here.

If we wanted to prove this is happening, we would want to compare the directory permissions at the time of the failure with what they are naturally in the .zip.

We could perhaps also detect this with debug logging as we'd need the "Unpacking file" log to come before the "Unpacking directory" line for this to happen.
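As a complement to the debug-log check, the entry order inside the archive can be inspected directly. The following is a minimal sketch, assuming the ordering in the .zip itself is what matters; `filesBeforeParentDir` and `buildZip` are hypothetical helpers and the entry names are illustrative only, not taken from a real agent package.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"path"
)

// filesBeforeParentDir returns the file entries that appear in the archive
// before the explicit entry of their parent directory.
func filesBeforeParentDir(r *zip.Reader) []string {
	// First pass: remember the position of every directory entry.
	dirIndex := map[string]int{}
	for i, f := range r.File {
		if f.FileInfo().IsDir() {
			dirIndex[path.Clean(f.Name)] = i
		}
	}
	// Second pass: flag files whose parent directory entry comes later.
	var out []string
	for i, f := range r.File {
		if f.FileInfo().IsDir() {
			continue
		}
		if j, ok := dirIndex[path.Dir(f.Name)]; ok && j > i {
			out = append(out, f.Name)
		}
	}
	return out
}

// buildZip creates an in-memory archive with empty entries in the given
// order. A trailing slash makes the entry a directory.
func buildZip(names ...string) (*zip.Reader, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for _, n := range names {
		if _, err := w.Create(n); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return zip.NewReader(bytes.NewReader(buf.Bytes()), int64(buf.Len()))
}

func main() {
	r, err := buildZip("components/osquery-extension.exe", "components/")
	if err != nil {
		panic(err)
	}
	fmt.Println(filesBeforeParentDir(r))
}
```

Running the same function over the real package would show whether any file is listed ahead of its directory, which is the precondition for the placeholder-permission path.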

@cmacknz
Member

cmacknz commented Feb 20, 2025

// directory already exists, set the appropriate permissions
err = os.Chmod(dstPath, f.Mode().Perm()&0770)
if err != nil {
	return fmt.Errorf("setting permissions %O for directory %q: %w", f.Mode().Perm()&0770, dstPath, err)
}

I wonder if making the Chmod above unconditional could fix this, that way we aren't dependent on use of os.Stat at all. It's only slightly less efficient because it will call Chmod unnecessarily if Mkdirall above it actually created the directory.

@cmacknz
Member

cmacknz commented Feb 20, 2025

It looks like the components directory wants 0755. It would default to 0770 if we created the file first without knowing the directory permissions, and the directory would then be narrowed to 0750 if things were working correctly.

- &agent_windows_binary_spec
  <<: *common
  files:
    <<: *agent_binary_files
    'data/{{.BeatName}}-{{ commit_short }}/components':
      source: '{{.AgentDropPath}}/{{.GOOS}}-{{.AgentArchName}}.zip/'
      mode: 0755
      config_mode: 0644
      skip_on_missing: true
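For reference, the 0750 figure follows from applying the &0770 mask in the Chmod snippet quoted earlier to the packaged mode of 0755. A tiny sketch of that arithmetic; `maskPerm` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
)

// maskPerm applies the same &0770 mask as the Chmod call quoted earlier.
func maskPerm(mode os.FileMode) os.FileMode {
	return mode.Perm() & 0770
}

func main() {
	// 0755 = rwxr-xr-x; masking with 0770 drops the "other" bits.
	fmt.Printf("%04o\n", uint32(maskPerm(0755))) // prints 0750
}
```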

@belimawr if you try to run an agent with osquery in the policy and set the permissions on the components sub-directory to 0770 do osquery extensions load? I think you can try this on Linux as we aren't actually doing anything Windows specific, but the contributing factor may be the visibility of filesystem updates on Windows systems which would be hard to reproduce.

@belimawr
Contributor Author

@belimawr if you try to run an agent with osquery in the policy and set the permissions on the components sub-directory to 0770 do osquery extensions load?

Let me try.

@leehinman
Contributor

My bet is the problem is in this code, and delayed visibility of directory entries on Windows could contribute.

If this really is the problem, what about unpacking into a new directory with a unique name on the same partition (created with os.MkdirTemp)? We wouldn't have to check for the existence of any directories because everything is new, and we can set any permissions we require. When that is done, use os.Rename to move the old directory out of the way if it exists, and os.Rename again to move the temporary directory to its final destination. If that succeeds, you can remove the old directory (assuming it existed). You shouldn't have any race conditions, and if you hit errors you can always go back to the original directory structure.

@belimawr
Contributor Author

No luck on Linux :/ I even tried changing the ownership of the file to another user:

root@archlinux /opt/Elastic/Agent/data/elastic-agent-9.0.0-beta1-aa8178/components % ll
total 582M
drwxr-x--- 1 root    root      18 Feb 20 17:55 certs
drwxr-x--- 1 root    root    5.3K Feb 20 17:55 lenses
drwxr-x--- 1 root    root     404 Feb 20 17:55 module
-rwxr-x--- 1 root    root    357M Feb 20 17:55 agentbeat
-rw------- 1 root    root     17K Feb 20 17:55 agentbeat.spec.yml
-rwxr-x--- 1 root    root     26M Feb 20 17:55 endpoint-security
-rwxr-x--- 1 root    root     27M Feb 20 17:55 endpoint-security-resources.zip
-rw-r----- 1 root    root    3.9K Feb 20 17:55 endpoint-security.spec.yml
-rwxrwxrwx 1 vagrant vagrant 4.3M Feb 20 17:55 osquery-extension.ext
-rwxr-x--- 1 root    root     83M Feb 20 17:55 osqueryd
-rwxr-x--- 1 root    root     87M Feb 20 17:55 pf-host-agent
-rw-r----- 1 root    root     406 Feb 20 17:55 pf-host-agent.spec.yml
root@archlinux /opt/Elastic/Agent/data/elastic-agent-9.0.0-beta1-aa8178/components % elastic-agent status
┌─ fleet
│  └─ status: (STARTING) 
└─ elastic-agent
   └─ status: (HEALTHY) Running
root@archlinux /opt/Elastic/Agent/data/elastic-agent-9.0.0-beta1-aa8178/components % 

Let me try on Windows.

@belimawr
Contributor Author

No luck on Windows 2019 either:

C:\Program Files\Elastic\Agent\data\elastic-agent-9.0.0-beta1-aa8178\components\osquery-extension.exe BUILTIN\Administrators:(F)
                                                                                                      BUILTIN\Administrators:(I)(F)
                                                                                                      VAGRANT\vagrant:(I)(F)

C:\Program Files\Elastic\Agent\data\elastic-agent-9.0.0-beta1-aa8178\components\osqueryd.exe BUILTIN\Administrators:(F)
                                                                                             BUILTIN\Administrators:(I)(F)
                                                                                             VAGRANT\vagrant:(I)(F)

Successfully processed 9 files; Failed processing 0 files
PS C:\Users\vagrant\Downloads\elastic-agent-9.0.0-beta1-windows-x86_64\elastic-agent-9.0.0-beta1-windows-x86_64>

Image

PS C:\Users\vagrant\Downloads\elastic-agent-9.0.0-beta1-windows-x86_64\elastic-agent-9.0.0-beta1-windows-x86_64> & 'C:\Program Files\Elastic\Agent\elastic-agent.exe' status --output=full
+- fleet
¦  +- status: (STARTING)
+- elastic-agent
   +- status: (HEALTHY) Running
   +- info
   ¦  +- id: f87e2a16-35a1-4077-9d93-a57f6dd6b1cf
   ¦  +- version: 9.0.0-beta1
   ¦  +- commit: aa817844993223ad0190bd78036d036ad2063027
   +- beat/metrics-monitoring
   ¦  +- status: (HEALTHY) Healthy: communicating with pid '5588'
   ¦  +- beat/metrics-monitoring
   ¦  ¦  +- status: (HEALTHY) Healthy
   ¦  ¦  +- type: OUTPUT
   ¦  +- beat/metrics-monitoring-metrics-monitoring-beats
   ¦     +- status: (HEALTHY) Healthy
   ¦     +- type: INPUT
   +- filestream-monitoring
   ¦  +- status: (HEALTHY) Healthy: communicating with pid '88'
   ¦  +- filestream-monitoring
   ¦  ¦  +- status: (HEALTHY) Healthy
   ¦  ¦  +- type: OUTPUT
   ¦  +- filestream-monitoring-filestream-monitoring-agent
   ¦     +- status: (HEALTHY) Healthy
   ¦     +- type: INPUT
   +- http/metrics-monitoring
   ¦  +- status: (HEALTHY) Healthy: communicating with pid '4240'
   ¦  +- http/metrics-monitoring
   ¦  ¦  +- status: (HEALTHY) Healthy
   ¦  ¦  +- type: OUTPUT
   ¦  +- http/metrics-monitoring-metrics-monitoring-agent
   ¦     +- status: (HEALTHY) Healthy
   ¦     +- type: INPUT
   +- log-default
   ¦  +- status: (HEALTHY) Healthy: communicating with pid '6392'
   ¦  +- log-default
   ¦  ¦  +- status: (HEALTHY) Healthy
   ¦  ¦  +- type: OUTPUT
   ¦  +- log-default-logfile-osquery-768e37a6-1b08-4f6a-b79b-795dd62e6551
   ¦  ¦  +- status: (HEALTHY) Healthy
   ¦  ¦  +- type: INPUT
   ¦  +- log-default-logfile-system-794804cf-cdc4-4e28-a3ea-f86749e82d2a
   ¦     +- status: (HEALTHY) Healthy
   ¦     +- type: INPUT
   +- osquery-default
   ¦  +- status: (STARTING) Starting: spawned pid '4444'
   ¦  +- osquery-default
   ¦  ¦  +- status: (STARTING) Starting: spawned pid '4444'
   ¦  ¦  +- type: OUTPUT
   ¦  +- osquery-default-cd0dfb0f-a014-4af5-a4e3-bbeeb8d26f0d
   ¦     +- status: (STARTING) Starting: spawned pid '4444'
   ¦     +- type: INPUT
   +- system/metrics-default
   ¦  +- status: (HEALTHY) Healthy: communicating with pid '3188'
   ¦  +- system/metrics-default
   ¦  ¦  +- status: (HEALTHY) Healthy
   ¦  ¦  +- type: OUTPUT
   ¦  +- system/metrics-default-system/metrics-system-794804cf-cdc4-4e28-a3ea-f86749e82d2a
   ¦     +- status: (HEALTHY) Healthy
   ¦     +- type: INPUT
   +- winlog-default
      +- status: (HEALTHY) Healthy: communicating with pid '3288'
      +- winlog-default
      ¦  +- status: (HEALTHY) Healthy
      ¦  +- type: OUTPUT
      +- winlog-default-winlog-system-794804cf-cdc4-4e28-a3ea-f86749e82d2a
         +- status: (HEALTHY) Healthy
         +- type: INPUT
PS C:\Users\vagrant\Downloads\elastic-agent-9.0.0-beta1-windows-x86_64\elastic-agent-9.0.0-beta1-windows-x86_64>

Tomorrow I can try again with 8.17.x.

@belimawr
Contributor Author

I managed to reproduce it on Linux by changing the permissions to 777 with 8.17.2.

I'll look into implementing the changes to enforce the correct permissions when extracting the zip.
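For anyone wanting to check a host quickly, here is a rough POSIX-only approximation of the condition that triggered the failure. It is an assumption that "unsafe" can be approximated as "writable by group or others" (mask 0022); osquery's real check is its own implementation and also looks at ownership and, on Windows, ACLs. `writableByNonOwner` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
)

// writableByNonOwner reports whether the permission bits let group or
// others write. The 0022 mask is an assumption, not osquery's actual rule.
func writableByNonOwner(mode os.FileMode) bool {
	return mode.Perm()&0022 != 0
}

func main() {
	dir, err := os.MkdirTemp("", "components-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Reproduce the 777 case described above on a scratch directory.
	if err := os.Chmod(dir, 0777); err != nil {
		panic(err)
	}
	info, err := os.Stat(dir)
	if err != nil {
		panic(err)
	}
	fmt.Println("writable by non-owner:", writableByNonOwner(info.Mode()))
}
```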

@belimawr belimawr linked a pull request Feb 24, 2025 that will close this issue
8 tasks
@belimawr
Contributor Author

belimawr commented Mar 4, 2025

After lots of testing I managed to craft a situation where this (or a similar?) issue is reproducible. However, it is not triggered by an upgrade; it is triggered by a change in the permissions of the Elastic-Agent folder (or the osquery-extension.ext binary) followed by a restart of the Elastic-Agent/OSQuerybeat.

The good news is that my PR (#6998) fixes it.

Here is the step by step to reproduce it.

  1. Start the win2019 vagrant VM: vagrant up win2019

  2. Open VirtualBox, find the VM and select "Show" to see the GUI. The password is vagrant

  3. Install the Standalone Elastic-Agent (any recent version) using the following configuration (adjust the output settings to match your Elastic stack deployment)

    elastic-agent.yml

    outputs:
      default:
        type: elasticsearch
        hosts:
          - https://elasticsearch:9200
        ssl.verification_mode: none
        username: elastic
        password: changeme
    
    agent:
      download:
        sourceURI: https://artifacts.elastic.co/downloads/
      monitoring:
        enabled: true
        use_output: default
        logs: true
        metrics: false
        traces: true
        namespace: default
    inputs:
      - id: osquery-ea-input-id
        name: osquery_manager-1
        type: osquery
        use_output: default
        data_stream:
          namespace: default
        streams:
          - id: >-
              osquery-osquery_manager.action.responses-c92a6eb1-cf10-4a5e-ac3a-53ed558697d3
            data_stream:
              dataset: osquery_manager.action.responses
              type: logs
            query: null
          - id: osquery-osquery_manager.result-c92a6eb1-cf10-4a5e-ac3a-53ed558697d3
            data_stream:
              dataset: osquery_manager.result
              type: logs
            query: null
        osquery:
          options: {}
          schedule:
            system_info:
              query: SELECT * FROM system_info;
              interval: 15

    .\elastic-agent.exe install -if
    
  4. Go to Kibana -> Discover and ensure that

    1. You can see the Elastic-Agent logs (logs-* dataview).
    2. You can see the OSQuery events in the logs-osquery_manager.result* dataview. Look at the following fields: osquery.cpu_brand and osquery.cpu_type.
  5. Open the File Explorer and go to C:\Program Files\Elastic

  6. Double-click the Agent folder. A pop-up will appear giving you the option to get permanent access to the folder; accept by clicking on "Continue"
    Image

  7. Stop the Elastic-Agent service (search for "Services", open it, find the Elastic-Agent and click on "Stop")
    Image

  8. Ensure the Elastic-Agent has stopped, confirm in Kibana you're not getting any data from this host

  9. Restart the Elastic-Agent service by clicking on "Start"

  10. Go back to Kibana, look at the Elastic-Agent logs, you'll see errors like this one

    Exiting: W0304 15:16:51.386566  4340 extensions.cpp:426] Will not autoload extension with unsafe directory permissions: C:\Program Files\Elastic\Agent\data\elastic-agent-8.16.1-b6da7f\components\osquery-extension.exe
    E0304 15:16:51.518429  4340 shutdown.cpp:79] Cannot activate osq_config config plugin: Unknown registry plugin: osq_config: exit status 78
    
  11. This puts OSQuerybeat into a restart loop, so the elastic-agent status command might show the Elastic-Agent as healthy and OSQuerybeat as starting.

  12. Open the properties of C:\Program Files\Elastic\Agent, then remove the user "Vagrant" from the permissions (security tab)
    Image

  13. Restart the Elastic-Agent service

  14. OSQuerybeat should be working again.

@belimawr
Contributor Author

belimawr commented Mar 5, 2025

I've just tested this on a physical Windows machine, my test device running Microsoft Windows 11 Pro, and had the same experience and results.

@Cris-Maggi

Actually, in their latest response before closing the case, the customer mentioned that they would be following along here, so I believe they are aware of all your tests and the fix. I'll keep an eye out in case they open any other case related to this.

@belimawr
Contributor Author

belimawr commented Mar 5, 2025

@cmacknz, @blakerouse I believe it's better to keep the discussion for a permanent fix here instead of using the PR.

So, given the discussions we had today and the details I explained on #6792 (comment), it seems there are two independent actions we can take:

  1. First and foremost, the FixPermissions we run on install needs to be updated to disable inheritance at least in the components directory, ideally in the whole tree, just to be on the safe side.
  2. Look into options to fix permissions while the Elastic-Agent is running, so that if any users have Elastic-Agents stuck in this state, it is possible to fix them without access to the machine.

#2 is definitely the more complex one. I can think of creating a new action, auto-fix (or something like that), that would fix the permissions of all folders and maybe even re-install the same Elastic-Agent version. However, it makes me wonder in which cases this would be useful. Aside from this case, I'm having a hard time picturing situations where we actually need to fix the Elastic-Agent folder or check its integrity.

The one case I can think of right now is an actor (either good or bad, intentionally or unintentionally) that ends up changing permissions/files.

Anyways, I would not touch FixPermission until #7059 gets merged to avoid conflicts.


5 participants