
Update all usages of fleettools to use the installed Agent ID #7054

Open · wants to merge 10 commits into main from integration-use-agent-id-with-fleet
Conversation

@blakerouse (Contributor) commented Feb 26, 2025

What does this PR do?

Updates all integration tests to use the installed Elastic Agent's ID, taken from the status output, when checking with Fleet for information about that specific Elastic Agent.

Why is it important?

This ensures that the tests in the integration framework only communicate with Fleet about that specific Elastic Agent. It removes the need to filter by hostname or to do any paging with the Kibana API to find that specific Elastic Agent. Since the test installed the Elastic Agent, we know its ID, and the test should always use that ID.
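
As an editorial illustration of the approach (not code from this PR), a test that already knows the agent ID can fetch that single agent from Kibana's Fleet API instead of listing and filtering; the exact endpoint path, headers, and response shape below are assumptions:

```go
package example

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// fleetAgent models just the fields a test might care about; the real
// Fleet API response contains much more.
type fleetAgent struct {
	Item struct {
		ID     string `json:"id"`
		Status string `json:"status"`
	} `json:"item"`
}

// getAgentByID asks Kibana's Fleet API for one specific agent by ID,
// avoiding hostname filters and paging entirely.
func getAgentByID(kibanaURL, apiKey, agentID string) (*fleetAgent, error) {
	url := fmt.Sprintf("%s/api/fleet/agents/%s", kibanaURL, agentID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "ApiKey "+apiKey)
	req.Header.Set("kbn-xsrf", "true")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fleet returned %s", resp.Status)
	}

	var agent fleetAgent
	if err := json.NewDecoder(resp.Body).Decode(&agent); err != nil {
		return nil, err
	}
	return &agent, nil
}
```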

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works (all integration tests)
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool (testing only)
  • I have added an integration test or an E2E test

Disruptive User Impact

None

How to test this PR locally

```
mage integration:test
```

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@blakerouse added labels on Feb 26, 2025: Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), skip-changelog, backport-8.x (Automated backport to the 8.x branch with mergify), backport-8.18 (Automated backport to the 8.18 branch), backport-9.0 (Automated backport to the 9.0 branch)
@blakerouse blakerouse self-assigned this Feb 26, 2025
@blakerouse blakerouse requested a review from a team as a code owner February 26, 2025 22:11
@elasticmachine (Contributor):

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@jlind23 (Contributor) commented Feb 27, 2025

There was cloud instability while creating the deployment; restarting tests.

@blakerouse (Contributor Author):

buildkite test this

```go
defer cancel()

var lastErr error
for {
```
Contributor:

I believe any retry logic is more appropriately left up to the caller to implement if necessary; thus, this function shouldn't implicitly keep retrying to get the status of an agent. wdyt? 🙂

Contributor Author:

That means every caller in the testing framework would have to add retry logic. I am not a fan of that; it would pollute the testing code. Since this is testing code, I thought it would be best to place the logic here.

I was thinking of splitting it into two functions, one with retry and one without, but I couldn't find a single place where I would prefer the function with no retry over the function with retry.

@pkoutsovasilis (Contributor) commented Feb 27, 2025

pollution 😄 I never thought of it like that, ok I will keep it in mind next time.

I couldn't find a single place where I would prefer the function with no retry over the function with retry.

To me that sounds like you would prefer to always call the one with retry, just to stay on the "safe" side; otherwise you wouldn't introduce it to begin with?!

Ok, let's do what you say then: if you need to retry because the tests currently lack a wait for the AgentID to become available, let's try to minimise the "pollution" with a separate call that at least allows the caller to specify the retry knobs, and call that from everywhere 😉

Member:

I also think that deciding whether being unable to get an agent ID is a showstopper, or whether retries should be performed (maybe using an assert.Eventually() or some other assertion), should be up to the specific test.

Also, assuming that this can take up to 1 minute may be wrong depending on the testcase.

I would prefer to not have "hidden" mechanisms in the utility functions, if we need to change the test code so be it: explicit testcase code is preferable in my opinion.

One more thing: what is the case where an installed and enrolled agent does not have an AgentID, exactly? Wouldn't that be an issue with either the enroll operation or the test structure?
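
For reference, the explicit per-test retry pattern suggested above might look like the following sketch using testify's require.Eventually; fixtureAgentID is a placeholder for whatever helper the framework exposes, not the PR's actual API:

```go
package example

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// fixtureAgentID is a placeholder; a real test would call the
// framework's helper that reads the ID from the status output.
func fixtureAgentID(ctx context.Context) (string, error) {
	return "agent-id-from-status", nil
}

func TestAgentIDIsEventuallyAvailable(t *testing.T) {
	ctx := context.Background()
	var agentID string
	// The test, not the helper, decides the retry policy.
	require.Eventually(t, func() bool {
		id, err := fixtureAgentID(ctx)
		if err != nil || id == "" {
			return false
		}
		agentID = id
		return true
	}, time.Minute, time.Second, "agent never reported an ID")
	_ = agentID // would then be used for Fleet API queries
}
```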

Contributor Author:

I don't know why we would want the same retry logic in every test. Most developers want a DRY style of development, where the same code is not repeated everywhere. Not having it in the function also leads to more errors and flakiness in tests if the developer doesn't add the extra code to ensure that retries are performed. Retry logic in the function gives a cleaner implementation in the test code, which is where we should strive for improved readability. If anything, I would prefer to see this type of logic placed into other helper functions as well.

I have updated the code to allow the caller to disable retries and to adjust the timeout and interval as well. Honestly, I don't see the need for those, but let's see if you like that better.
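
A sketch of the shape described here, using functional options; every name, default, and signature is hypothetical rather than taken from the PR:

```go
package example

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// agentIDOpts holds hypothetical retry knobs for an AgentID helper.
type agentIDOpts struct {
	retry    bool
	timeout  time.Duration
	interval time.Duration
}

// AgentIDOpt tunes how AgentID waits for the ID to become available.
type AgentIDOpt func(*agentIDOpts)

func WithoutRetry() AgentIDOpt                { return func(o *agentIDOpts) { o.retry = false } }
func WithTimeout(d time.Duration) AgentIDOpt  { return func(o *agentIDOpts) { o.timeout = d } }
func WithInterval(d time.Duration) AgentIDOpt { return func(o *agentIDOpts) { o.interval = d } }

// AgentID retries by default but lets the caller disable or tune the
// retries. statusID stands in for reading the ID from the status output.
func AgentID(ctx context.Context, statusID func(context.Context) (string, error), opts ...AgentIDOpt) (string, error) {
	o := agentIDOpts{retry: true, timeout: time.Minute, interval: time.Second}
	for _, opt := range opts {
		opt(&o)
	}
	if !o.retry {
		return statusID(ctx)
	}
	ctx, cancel := context.WithTimeout(ctx, o.timeout)
	defer cancel()
	var lastErr error
	for {
		id, err := statusID(ctx)
		if err == nil && id != "" {
			return id, nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = errors.New("status reported an empty agent ID")
		}
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("gave up waiting for agent ID: %w", lastErr)
		case <-time.After(o.interval):
		}
	}
}
```

With retry on by default, existing call sites stay unchanged, while a test with tighter needs could pass WithTimeout(10*time.Second) or WithoutRetry().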

Contributor Author:

I would actually prefer to move this logic into ExecStatus, because that is where it really belongs: it is very much a retry on a failed connection to a remote GRPC server. It might even be better to place this directly in the elastic-agent status command, but that would not fix previous versions. Since many of the tests install old versions and upgrade them to the latest version, that approach would not be helpful in tests.

Contributor:

Alright, in that case I'm fine with it. Thanks for the explanation! It would be nice to add a comment to the empty-string check to make it clear it is a failsafe.

Incidentally, this isn't related to your change, but how does an agent actually end up with the control protocol server running but no ID?

Contributor Author:

I have done just that: moved the retry logic into ExecStatus, as that is the appropriate place for it. This is very much about retries for communication with the Elastic Agent daemon, which is a local GRPC server.
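
A sketch of what that could look like inside a status helper; getStatus stands in for the actual daemon call, and none of this is the PR's real ExecStatus code:

```go
package example

import (
	"context"
	"fmt"
	"time"
)

// execStatusWithRetry keeps retrying a status query until the local
// Elastic Agent daemon answers or the timeout elapses.
func execStatusWithRetry(ctx context.Context, getStatus func(context.Context) ([]byte, error)) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	var lastErr error
	for {
		out, err := getStatus(ctx)
		if err == nil {
			return out, nil
		}
		lastErr = err // typically a failed connection to the daemon's GRPC server
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("agent daemon never became reachable: %w", lastErr)
		case <-time.After(time.Second):
		}
	}
}
```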

Contributor Author:

Incidentally, this isn't related to your change, but how does an agent actually end up with the control protocol server running but no ID?

Honestly, I just think it could happen; I was being defensive. To me, the definition of quality software is how it handles the unknown. I don't actually know; maybe it is always set.

I have removed that for now. I guess we will see if it becomes an issue. If we start getting failures saying the Agent ID is empty, we will know.

Contributor:

I have removed that for now. I guess we will see if it becomes an issue. If we start getting failures saying the Agent ID is empty, we will know.

I like that decision. If it can happen, I'd consider it a bug, so a test catching it would be a good thing.

mergify bot (Contributor) commented Feb 28, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

```
git fetch upstream
git checkout -b integration-use-agent-id-with-fleet upstream/integration-use-agent-id-with-fleet
git merge upstream/main
git push upstream integration-use-agent-id-with-fleet
```

@pkoutsovasilis (Contributor) left a comment:

Until we discuss with the team what DRY is, when to apply it, and what developers of integration tests should aim for, and until all of us collectively agree on a definition, I believe that this PR should not be merged.

@blakerouse (Contributor Author):

Until we discuss with the team what DRY is, when to apply it, and what developers of integration tests should aim for, and until all of us collectively agree on a definition, I believe that this PR should not be merged.

Happy to discuss.


```go
if err != nil {
	return "", err
}
return status.Info.ID, nil
```
Member:

This is possibly misleading, because even standalone agents have IDs, which are then replaced with one generated by Fleet after enrollment succeeds.

For example, I see the following with a local standalone agent. Notice "is_managed": false there, but "id": "913ce739-2c6c-45e9-90f5-2226a14bca70" being populated.

```
sudo elastic-development-agent status --output=json
{
    "info": {
        "id": "913ce739-2c6c-45e9-90f5-2226a14bca70",
        "version": "9.1.0",
        "commit": "d2047ac48df2f4536ca69a86ad4922b3e264501a",
        "build_time": "2025-02-25 21:52:49 +0000 UTC",
        "snapshot": true,
        "pid": 70294,
        "unprivileged": false,
        "is_managed": false
    },
    "state": 2,
    "message": "Running",
```

Just looking at the ID at any one point in time is not going to give you a valid ID for making requests to Fleet.

We probably want an explicit entry in the status output for the ID as assigned by Fleet, so we can poll for it to be populated. Otherwise I worry there will be race conditions in tests where the standalone ID is sometimes picked up before it is replaced by the one assigned during enrollment.

Contributor Author:

It is not misleading, but it does greatly depend on when you ask for the AgentID: you must check after enrollment has occurred. You don't need to worry about it picking up the wrong ID, as long as you are calling it at the correct time. I think AgentID() is also useful in the standalone case, so I don't think checking is_managed: true would be correct for this type of call.
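
As an editorial aside: for call sites that only ever care about the Fleet-assigned ID, the race described above could be guarded against with something along these lines (the field names mirror the status JSON shown earlier; everything else is a placeholder, not the PR's helper):

```go
package example

import (
	"context"
	"time"
)

// statusInfo mirrors the relevant fields of the status JSON above.
type statusInfo struct {
	ID        string `json:"id"`
	IsManaged bool   `json:"is_managed"`
}

// managedAgentID waits until the agent reports is_managed=true before
// returning the ID, so a pre-enrollment standalone ID is never used
// for Fleet requests.
func managedAgentID(ctx context.Context, status func(context.Context) (statusInfo, error)) (string, error) {
	for {
		info, err := status(ctx)
		if err == nil && info.IsManaged && info.ID != "" {
			return info.ID, nil
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```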

@elasticmachine (Contributor):

💚 Build Succeeded


cc @blakerouse
