[Flaky Test]: Multiple tests failing with "unable to create policy: Package at top-level directory must contain a top-level manifest.yml file." #4102
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Pinging @elastic/fleet (Team:Fleet)
Example of a failing request causing this failure: elastic-agent/testing/integration/fqdn_test.go Lines 71 to 85 in 84a2d5c
Will keep an eye out for this; if it goes away we can just close this. Our tests do this a lot, so if the bug is still there I think it will show itself quickly.
The same issue is happening again in elastic-agent/testing/integration/container_cmd_test.go Lines 60 to 63 in 2555088
The build sample: https://buildkite.com/elastic/elastic-agent/builds/6741#018d6a53-dfe4-4946-8dd6-7776c95e1c73
Another failure like this in TestFQDN: https://buildkite.com/elastic/elastic-agent/builds/6741#018d6ad7-dc9c-492a-a9bd-53ef8bd44087
@nchaulet @kpollich this continues to happen; any ideas on what might be causing this on the Fleet side, or what we could do to resolve it? Would retrying help?
The root cause here is that the Elastic Agent package archive available in the Kibana instance used for these tests is corrupt in some fashion. One thing that might be helpful is checking the Kibana logs while the tests are running to see if there's any kind of installation error reported there.
Is the Kibana instance that's available during these test runs a cloud instance in a particular environment/region? Maybe we can try reproducing the issue by manually provisioning a Kibana instance the same way these tests do.
We use the CFT region for these tests.
Oddly, all of the failures use the …
What is the best way to reproduce this? Triggering a new build on main?
We log the deployment ID in the test runs; if you look in the console output you should see something like the following.
The deployments are eventually cleaned up, so I'm not sure if they'll be kept running long enough to catch this. If I look at the simplest test that reproduced this, it is just creating a simple policy: elastic-agent/testing/integration/fqdn_test.go Lines 70 to 91 in b6c24c1
So you could attempt to call that function repeatedly with a randomly varying name. Effectively what our tests are doing is creating lots of different policies in rapid succession, possibly concurrently, because the various test runners for each tested agent architecture run in parallel.
The error we get is from the last line in the block I linked above.
To be clear, I think you can probably just create your own deployment in the CFT region and use something like the code I linked to create policies rapidly and concurrently, without depending on our test framework at all. I don't think there is anything special in the test framework here besides the pattern of our API calls, which you should be able to replicate outside of the framework.
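For illustration, a minimal sketch of that reproduction outside the test framework might look like the Go program below. It is not the code from the test suite: the `KIBANA_URL`/`KIBANA_API_KEY` environment variables, the `createPolicy` helper, the policy name scheme, and the request body fields are assumptions made for the sketch; only the `POST /api/fleet/agent_policies` endpoint is the one the tests actually hit.

```go
// repro.go - a rough sketch that creates agent policies in rapid succession
// against POST /api/fleet/agent_policies. Names, env vars, and body fields
// are assumptions; adjust to whatever your deployment requires.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// createPolicy posts a single agent policy creation request to Kibana.
func createPolicy(kibanaURL, apiKey, name string) error {
	// Assumed body fields for the Fleet "create agent policy" API.
	body := []byte(fmt.Sprintf(`{"name":%q,"namespace":"default","monitoring_enabled":["logs","metrics"]}`, name))

	req, err := http.NewRequest(http.MethodPost, kibanaURL+"/api/fleet/agent_policies", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true") // Kibana requires this header on write requests
	req.Header.Set("Authorization", "ApiKey "+apiKey)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	if resp.StatusCode >= 300 {
		return fmt.Errorf("creating policy %q: status %d: %s", name, resp.StatusCode, out)
	}
	return nil
}

func main() {
	kibanaURL := os.Getenv("KIBANA_URL")  // assumed: the Kibana endpoint of the test deployment
	apiKey := os.Getenv("KIBANA_API_KEY") // assumed: an API key with Fleet privileges

	// Create many policies in rapid succession with varying names, mimicking
	// what the parallel integration-test runners end up doing.
	for i := 0; i < 50; i++ {
		name := fmt.Sprintf("repro-policy-%d-%d", time.Now().UnixNano(), i)
		if err := createPolicy(kibanaURL, apiKey, name); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```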
Ok, I'll give it a go with a new deployment to reproduce, thanks.
Just for the record, another instance of it: TestEndpointSecurityUnprivileged: https://buildkite.com/elastic/elastic-agent/builds/6892#018d876e-3454-4372-865c-14ce199ba51c
Could you clarify how to select the CFT region? I don't seem to have access to manually create a deployment in production. I've tested with a script in this staging instance to create agent policies with agent monitoring concurrently, but couldn't reproduce the issue. Another question: are these integration tests being run concurrently?
This suddenly escalated in the recent build https://buildkite.com/elastic/elastic-agent/builds/6936#018d8947-2e57-43fe-8b01-e1543bd867f2: 45 tests failed, and it seems that in all of them there was some kind of issue communicating with the Fleet Server, particularly during enrollment.
I am able to reproduce it locally; it seems to be an issue with concurrent installs of a package. Trying to figure out what the right fix will be here: https://elastic.slack.com/archives/C02BPSXUSKF/p1707410812298429?thread_ts=1707311534.367209&cid=C02BPSXUSKF

How to reproduce the bug: I introduce an artificial delay in the package install process:

```diff
--- a/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts
+++ b/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts
@@ -150,6 +150,7 @@ export async function _installPackage({
       spaceId,
       verificationResult,
     });
+    await new Promise((resolve) => setTimeout(resolve, 30 * 1000));
   }
   logger.debug(`Package install - Installing Kibana assets`);
   const kibanaAssetPromise = withPackageSpan('Install Kibana assets', () =>
```

Then doing two concurrent requests to create an agent policy reproduces it (repro.mov).
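To mirror the "two concurrent requests to create an agent policy" step without Kibana dev tooling, a rough sketch like the following could be used (same caveats as the earlier sketch: the environment variables and request body fields are assumptions; the endpoint is the one from the issue description):

```go
// Fire two concurrent POST /api/fleet/agent_policies requests, racing the
// package installation that the artificial delay above keeps in flight.
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
	"sync"
)

func main() {
	kibanaURL := os.Getenv("KIBANA_URL")  // assumed environment variable
	apiKey := os.Getenv("KIBANA_API_KEY") // assumed environment variable

	var wg sync.WaitGroup
	for _, name := range []string{"repro-policy-a", "repro-policy-b"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			// Same request shape as the earlier sketch; body fields are assumptions.
			body := fmt.Sprintf(`{"name":%q,"namespace":"default","monitoring_enabled":["logs","metrics"]}`, name)
			req, err := http.NewRequest(http.MethodPost, kibanaURL+"/api/fleet/agent_policies", strings.NewReader(body))
			if err != nil {
				fmt.Fprintln(os.Stderr, name, err)
				return
			}
			req.Header.Set("Content-Type", "application/json")
			req.Header.Set("kbn-xsrf", "true")
			req.Header.Set("Authorization", "ApiKey "+apiKey)

			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				fmt.Fprintln(os.Stderr, name, err)
				return
			}
			defer resp.Body.Close()
			fmt.Println(name, resp.Status)
		}(name)
	}
	wg.Wait()
}
```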
The CFT region is a production region, not staging. Will DM you details since the links are private.
Another occurrence in a recent build (2024-02-10T00:01:30.609Z), the day after the fix was merged (2024-02-09T13:43:26Z). This time the error message changed: https://buildkite.com/elastic/elastic-agent/builds/6993#018d9052-86fa-49d4-8986-b2fd55437893
There is a chance that a new snapshot was not used for this run, but since the error message changed it tells me it might be something else.
No failures in the last 2 days. Closing for now. |
It appears to be back in https://buildkite.com/elastic/elastic-agent/builds/7776#018e3a44-b483-40fd-8255-c68a28340465
Is it the same issue or a new one?
@juliaElastic may I ask you to have a look at #4102 (comment)?
@nchaulet Any idea if this is the same issue as before?
Something seems weird, @rdner; it is possible that the test tries to do concurrent requests to … I was able to get a repro and am working on a fix for that: elastic/kibana#178993
@nchaulet I checked the code; this test does not make the requests concurrently, but it does make 2.
The first request is coming from here: elastic-agent/testing/integration/logs_ingestion_test.go Lines 82 to 88 in ddd832b
which eventually comes down to this function: elastic-agent/pkg/testing/tools/tools.go Lines 41 to 75 in ddd832b
It creates a new policy and enrolls the currently running agent: elastic-agent/pkg/testing/tools/tools.go Line 49 in ddd832b
elastic-agent/pkg/testing/tools/tools.go Line 73 in ddd832b
But there is a sub-test case that also creates a policy here: elastic-agent/testing/integration/logs_ingestion_test.go Lines 97 to 99 in ddd832b
elastic-agent/testing/integration/logs_ingestion_test.go Lines 365 to 375 in ddd832b
@belimawr since you worked on this test, could you give us more context on why we create 2 policies here?
The test elastic-agent/testing/integration/logs_ingestion_test.go Lines 82 to 88 in 0159e54
elastic-agent/testing/integration/logs_ingestion_test.go Lines 493 to 518 in 0159e54
My understanding when I wrote this test is that the call to …
@belimawr I just expected a PUT or PATCH method for updating a policy, my bad.
https://buildkite.com/elastic/elastic-agent/builds/6362
Observed for both TestFQDN and TestInstallAndUnenrollWithEndpointSecurity/unprotected so far.
Edit 2024-02-08: Also observed on TestEndpointSecurityUnprivileged on this build.
This error originates from Fleet, not the agent. This is an error returned from the /api/fleet/agent_policies API. The failing line is: elastic-agent/pkg/testing/tools/tools.go Lines 38 to 52 in 84a2d5c
The error in Fleet seems to originate from https://github.com/elastic/kibana/blob/a7ee92712608755a163a7da6547ce2ba1c31c7f8/x-pack/plugins/fleet/server/services/epm/archive/parse.ts#L217-L225