[Kubernetes manifest] Use unique identifier for the state file path #5187

Open
tetianakravchenko opened this issue Jul 23, 2024 · 5 comments
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@tetianakravchenko
Contributor

tetianakravchenko commented Jul 23, 2024

Describe the enhancement:

In the manifest, the elastic-agent-state volume uses a predefined hostPath:

        # Mount /var/lib/elastic-agent-managed/kube-system/state to store elastic-agent state
        # Update 'kube-system' with the namespace of your agent installation
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state
            type: DirectoryOrCreate

As a result, when a customer removes the installation (kubectl delete -f manifest.yaml) and installs a new one with a different FLEET_URL and FLEET_ENROLLMENT_TOKEN, the existing state file is reused, which leads to the following error:

"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://XXXXXX.fleet.region.aws.found.io:443 ...

What is the definition of done?

  • create 2 Elastic Stack deployments: stack1 and stack2
  • install elastic-agent to the k8s cluster (with stack1 credentials)
  • delete it
  • install elastic-agent to the k8s cluster (with stack2 credentials)
  • no errors occur

A few ideas:
We could include the Fleet URL in the path: /var/lib/elastic-agent-managed/<fleet_url>/kube-system/state (for example, /var/lib/elastic-agent-managed/f437b90409bb4804b1647665fa19f7a0.fleet.us-central1.gcp.cloud.es.io/kube-system/state; for a local setup, /var/lib/elastic-agent-managed/fleet-server/kube-system/state).
But what do we do when there is no Fleet Server? Fall back to the default, /var/lib/elastic-agent-managed/kube-system/state?
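
A minimal sketch of the path idea, assuming the path is computed by whatever tooling renders the manifest (a hostPath in a static manifest cannot compute it at runtime). The function name state_path and its parameters are hypothetical and are not part of Elastic Agent; only the host part of the Fleet URL is used, matching the example paths in this issue.

```python
from urllib.parse import urlsplit

DEFAULT_ROOT = "/var/lib/elastic-agent-managed"


def state_path(fleet_url, namespace="kube-system", root=DEFAULT_ROOT):
    """Return a state directory that is unique per Fleet host (hypothetical helper).

    Falls back to the current default path when no Fleet URL is given.
    """
    host = None
    if fleet_url:
        # urlsplit only recognizes a host when the URL has a "//" authority
        # marker, so add one for bare "host:port" values.
        candidate = fleet_url if "://" in fleet_url else "//" + fleet_url
        host = urlsplit(candidate).hostname
    if host:
        return f"{root}/{host}/{namespace}/state"
    # No Fleet Server configured: keep the existing default path.
    return f"{root}/{namespace}/state"
```

With this approach, two installs that point at different Fleet hosts would never share a state directory, while a standalone agent would keep the path it uses today. It would not help when the same host serves a re-created deployment.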

tetianakravchenko added the Team:Cloudnative-Monitoring label Jul 23, 2024
@cmacknz
Member

cmacknz commented Jul 23, 2024

I think we need to treat a change in the FLEET_URL or FLEET_ENROLLMENT_TOKEN environment variables as equivalent to executing the elastic-agent enroll command.

@blakerouse
Contributor

@cmacknz I disagree. There are many reasons you might change those values after the Elastic Agent is already running without wanting your Elastic Agents to re-enroll. Say you are updating the FLEET_URL because you just moved the cluster, or you just updated the FLEET_ENROLLMENT_TOKEN as part of a security policy of rotating tokens periodically.

Would be interesting to see if we could possibly make an anonymous call to Fleet Server and determine if this is the same Fleet Server?

@cmacknz
Member

cmacknz commented Jul 23, 2024

> Would be interesting to see if we could possibly make an anonymous call to Fleet Server and determine if this is the same Fleet Server?

Is just checking in, or doing anything that uses the stored API key enough to check this?

We could make calling the enroll endpoint idempotent in some situations, perhaps by allowing an optional agent.id parameter. This would allow getting the API key of an existing agent, instead of a net new one though which I don't love from a security perspective (edit: or the response could just not include the existing API key so that this is only an "is an agent with this ID enrolled" check).

@blakerouse
Contributor

> > Would be interesting to see if we could possibly make an anonymous call to Fleet Server and determine if this is the same Fleet Server?
>
> Is just checking in, or doing anything that uses the stored API key enough to check this?
>
> We could make calling the enroll endpoint idempotent in some situations, perhaps by allowing an optional agent.id parameter. This would allow getting the API key of an existing agent, instead of a net new one though which I don't love from a security perspective (edit: or the response could just not include the existing API key so that this is only an "is an agent with this ID enrolled" check).

@cmacknz I like the idempotent idea. We could just change it to return an HTTP conflict, or a specific response saying that the agent already exists, and not return the API key again.
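
A rough sketch of that idempotent enroll behavior, using an in-memory stand-in rather than the real Fleet Server API. The class EnrollRegistry, its enroll method, and the response fields are all hypothetical; the only parts taken from this thread are the optional agent ID, the conflict response, and not returning the existing API key. Issuing a fresh ID when the supplied one is unknown is an assumption of this sketch.

```python
import secrets
import uuid


class EnrollRegistry:
    """In-memory stand-in for an enroll endpoint (hypothetical, for illustration)."""

    def __init__(self):
        self._api_keys = {}  # agent_id -> API key issued at enrollment

    def enroll(self, agent_id=None):
        """Return (status_code, body) for an enroll request."""
        if agent_id is not None and agent_id in self._api_keys:
            # Already enrolled here: report a conflict and never return
            # the existing API key.
            return 409, {"agent_id": agent_id, "message": "agent already enrolled"}
        # Unknown or missing ID: enroll as a new agent with a fresh ID and key.
        new_id = str(uuid.uuid4())
        api_key = secrets.token_hex(16)
        self._api_keys[new_id] = api_key
        return 200, {"agent_id": new_id, "api_key": api_key}
```

An agent starting with an existing state file could send its stored ID: a conflict would mean the state belongs to this Fleet Server and can be kept, while a successful enrollment would mean the old state came from a different one.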

@blakerouse
Contributor

I just wanted to add a note here: if you set FLEET_FORCE=true in the container's environment, the agent will re-enroll on every restart. This doesn't actually solve this issue, but it is a workaround when you are trying to migrate from one Fleet to another.


3 participants