
[Flaky Test]: Multiple tests failing with "unable to create policy: Package at top-level directory must contain a top-level manifest.yml file." #4102

Closed
cmacknz opened this issue Jan 18, 2024 · 31 comments · Fixed by elastic/kibana#176532 or elastic/kibana#178993
Labels: flaky-test (Unstable or unreliable test cases), Team:Elastic-Agent (Label for the Agent team), Team:Fleet (Label for the Fleet team)

Comments

@cmacknz (Member) commented Jan 18, 2024

https://buildkite.com/elastic/elastic-agent/builds/6362

Observed for both TestFQDN and TestInstallAndUnenrollWithEndpointSecurity/unprotected so far.
Edit 2024-02-08: Also observed on TestEndpointSecurityUnprivileged on this build.

=== RUN   TestFQDN
    fqdn_test.go:66: Set FQDN on host to exnykc.baz.io
    fqdn_test.go:70: Enroll agent in Fleet with a test policy
    fqdn_test.go:91: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/fqdn_test.go:91
        	Error:      	Received unexpected error:
        	            	unable to create policy: Package at top-level directory  must contain a top-level manifest.yml file.
        	Test:       	TestFQDN
--- FAIL: TestFQDN (2.07s)
=== RUN   TestInstallAndUnenrollWithEndpointSecurity/unprotected
    endpoint_security_test.go:227: Enrolling the agent in Fleet
    endpoint_security_test.go:251: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/endpoint_security_test.go:251
        	            				/home/ubuntu/agent/testing/integration/endpoint_security_test.go:114
        	Error:      	Received unexpected error:
        	            	unable to create policy: Package at top-level directory  must contain a top-level manifest.yml file.
        	Test:       	TestInstallAndUnenrollWithEndpointSecurity/unprotected
--- FAIL: TestInstallAndUnenrollWithEndpointSecurity/unprotected (1.94s)

This error originates from Fleet, not the agent. It is returned by the /api/fleet/agent_policies API. The failing line is:

// InstallAgentWithPolicy creates the given policy, enrolls the given agent
// fixture in Fleet using the default Fleet Server, waits for the agent to be
// online, and returns the created policy.
func InstallAgentWithPolicy(ctx context.Context, t *testing.T,
    installOpts atesting.InstallOpts,
    agentFixture *atesting.Fixture,
    kibClient *kibana.Client,
    createPolicyReq kibana.AgentPolicy) (kibana.PolicyResponse, error) {
    t.Helper()

    // Create policy
    policy, err := kibClient.CreatePolicy(ctx, createPolicyReq)
    if err != nil {
        return policy, fmt.Errorf("unable to create policy: %w", err)
    }

The error in Fleet seems to originate from https://github.com/elastic/kibana/blob/a7ee92712608755a163a7da6547ce2ba1c31c7f8/x-pack/plugins/fleet/server/services/epm/archive/parse.ts#L217-L225

  // The package must contain a manifest file ...
  const manifestFile = path.posix.join(toplevelDir, MANIFEST_NAME);
  const manifestBuffer = assetsMap[manifestFile];
  logger.debug(`Verifying archive - checking manifest file and manifest buffer`);
  if (!paths.includes(manifestFile) || !manifestBuffer) {
    throw new PackageInvalidArchiveError(
      `Package at top-level directory ${toplevelDir} must contain a top-level ${MANIFEST_NAME} file.`
    );
  }
cmacknz added the Team:Elastic-Agent and flaky-test labels on Jan 18, 2024
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz added the Team:Fleet label on Jan 18, 2024
@elasticmachine (Contributor) commented:

Pinging @elastic/fleet (Team:Fleet)

@cmacknz (Member, Author) commented Jan 18, 2024

Example of a failing request causing this failure:

createPolicyReq := kibana.AgentPolicy{
    Name:        "test-policy-fqdn-" + strings.ReplaceAll(fqdn, ".", "-"),
    Namespace:   info.Namespace,
    Description: fmt.Sprintf("Test policy for FQDN E2E test (%s)", fqdn),
    MonitoringEnabled: []kibana.MonitoringEnabledOption{
        kibana.MonitoringEnabledLogs,
        kibana.MonitoringEnabledMetrics,
    },
    AgentFeatures: []map[string]interface{}{
        {
            "name":    "fqdn",
            "enabled": false,
        },
    },
}

@nchaulet (Member) commented:

The issue seems to happen when trying to install elastic_agent (not sure why yet; maybe there was an issue with the bundled package). I am not able to reproduce it on the latest SNAPSHOT.

[Screenshot: 2024-01-18 at 4:31:30 PM]

@cmacknz (Member, Author) commented Jan 18, 2024

Will keep an eye out for this; if it goes away we can just close this. Our tests do this a lot, so if the bug is still there I think it will show itself quickly.

cmacknz closed this as completed on Jan 29, 2024
@rdner (Member) commented Feb 2, 2024

The same issue is happening again in

policy, err := info.KibanaClient.CreatePolicy(ctx, createPolicyReq)
if err != nil {
    t.Fatalf("could not create Agent Policy: %s", err)
}

TestContainerCMD
    container_cmd_test.go:62: could not create Agent Policy: Package at top-level directory  must contain a top-level manifest.yml file.

The build sample https://buildkite.com/elastic/elastic-agent/builds/6741#018d6a53-dfe4-4946-8dd6-7776c95e1c73

rdner reopened this on Feb 2, 2024
@rdner (Member) commented Feb 2, 2024

Another failure like this in TestFQDN https://buildkite.com/elastic/elastic-agent/builds/6741#018d6ad7-dc9c-492a-a9bd-53ef8bd44087

@cmacknz (Member, Author) commented Feb 5, 2024

=== RUN   TestFQDN
    fqdn_test.go:66: Set FQDN on host to hkbgjg.baz.io
    fqdn_test.go:70: Enroll agent in Fleet with a test policy
    fqdn_test.go:91: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/fqdn_test.go:91
        	Error:      	Received unexpected error:
        	            	unable to create policy: Package at top-level directory  must contain a top-level manifest.yml file.
        	Test:       	TestFQDN
--- FAIL: TestFQDN (2.51s)

@nchaulet @kpollich this continues to happen. Any ideas on what might be causing this on the Fleet side, or what we could do to resolve it? Would retrying help?

@kpollich (Member) commented Feb 6, 2024

The root cause here is that the Elastic Agent package archive available in the Kibana instance used for these tests is corrupt in some fashion. One thing that might be helpful is checking the Kibana logs while the tests are running to see if there's any kind of installation error reported there. In general, the elastic_agent package is bundled with the Kibana distributable so it should be valid and installable no matter what.

Is the Kibana instance that's available during these test runs a cloud instance in a particular environment/region? Maybe we can try reproducing the issue by manually provisioning a Kibana instance the same way these tests do.

@cmacknz (Member, Author) commented Feb 6, 2024

We use the CFT region for these tests.

@cmacknz (Member, Author) commented Feb 6, 2024

Oddly, all of the failures use the linux-arm64-ubuntu runner, but I wouldn't expect that to affect the stack deployment.

@juliaElastic (Contributor) commented:

What is the best way to reproduce this? Triggering a new build on main?
I was thinking of logging in to the Kibana instance of the deployment to check what is in the problematic elastic_agent bundled package. Is there a way to find the Kibana credentials from the build?

@cmacknz (Member, Author) commented Feb 7, 2024

We log the deployment ID in the test runs; if you look in the console output you should see something like the following.

>>> Created cloud stack 8.13.0-SNAPSHOT [stack_id: 8130-SNAPSHOT, deployment_id: <REDACTED>]
2024-02-02 17:33:30 UTC >>> Waiting for cloud stack 8.13.0-SNAPSHOT to be ready [stack_id: 8130-SNAPSHOT, deployment_id: <REDACTED>]

The deployments are eventually cleaned up so I'm not sure if they'll be kept running long enough to catch this.

If I look at the simplest test that reproduced this, it is just creating a simple policy:

t.Log("Enroll agent in Fleet with a test policy")
createPolicyReq := kibana.AgentPolicy{
Name: "test-policy-fqdn-" + strings.ReplaceAll(fqdn, ".", "-"),
Namespace: info.Namespace,
Description: fmt.Sprintf("Test policy for FQDN E2E test (%s)", fqdn),
MonitoringEnabled: []kibana.MonitoringEnabledOption{
kibana.MonitoringEnabledLogs,
kibana.MonitoringEnabledMetrics,
},
AgentFeatures: []map[string]interface{}{
{
"name": "fqdn",
"enabled": false,
},
},
}
installOpts := atesting.InstallOpts{
NonInteractive: true,
Force: true,
}
policy, err := tools.InstallAgentWithPolicy(ctx, t, installOpts, agentFixture, kibClient, createPolicyReq)
require.NoError(t, err)

So you could attempt to call that function repeatedly with a randomly varying name.

Effectively what our tests are doing is creating lots of different policies in rapid succession, possibly concurrently, because the various test runners for each tested agent architecture run in parallel.
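For reference, a minimal standalone sketch of that reproduction loop could look like the following (assumptions: the import path is the kibana client package the integration tests use, the client is already configured against the test deployment, and the function name is illustrative):

// Sketch only: creates many agent policies concurrently with randomized
// names to mimic the parallel test runners. The import path below is an
// assumption; CreatePolicy and kibana.AgentPolicy are used exactly as in
// the test code quoted above.
package repro

import (
    "context"
    "fmt"
    "log"
    "math/rand"
    "sync"

    "github.com/elastic/elastic-agent-libs/kibana" // assumed import path
)

func reproduceConcurrentPolicyCreation(ctx context.Context, kibClient *kibana.Client) {
    var wg sync.WaitGroup
    for i := 0; i < 20; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            createPolicyReq := kibana.AgentPolicy{
                Name:        fmt.Sprintf("repro-policy-%d-%d", i, rand.Intn(1000000)),
                Namespace:   "default",
                Description: "repro for flaky policy creation",
                MonitoringEnabled: []kibana.MonitoringEnabledOption{
                    kibana.MonitoringEnabledLogs,
                    kibana.MonitoringEnabledMetrics,
                },
            }
            // The same call that fails in the tests with the manifest.yml error.
            if _, err := kibClient.CreatePolicy(ctx, createPolicyReq); err != nil {
                log.Printf("policy %d: %v", i, err)
            }
        }(i)
    }
    wg.Wait()
}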

@cmacknz (Member, Author) commented Feb 7, 2024

The error we get is from the last line in the block I linked above:

=== RUN   TestFQDN
    fqdn_test.go:66: Set FQDN on host to hkbgjg.baz.io
    fqdn_test.go:70: Enroll agent in Fleet with a test policy
    fqdn_test.go:91: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/fqdn_test.go:91
        	Error:      	Received unexpected error:
        	            	unable to create policy: Package at top-level directory  must contain a top-level manifest.yml file.
        	Test:       	TestFQDN
--- FAIL: TestFQDN (2.51s)

@cmacknz (Member, Author) commented Feb 7, 2024

So you could attempt to call that function repeatedly with a randomly varying name.

To be clear, I think you can probably just create your own deployment in the CFT region and use something like the code I linked to create policies rapidly and concurrently without depending on our test framework at all.

I don't think there is anything special in the test framework here, besides the pattern of our API calls, which you should be able to replicate outside of the framework.

@juliaElastic (Contributor) commented:

Ok I'll give it a go with a new deployment to reproduce, thanks.

@AndersonQ (Member) commented:

Just for the record, another instance of it in TestEndpointSecurityUnprivileged: https://buildkite.com/elastic/elastic-agent/builds/6892#018d876e-3454-4372-865c-14ce199ba51c

@juliaElastic (Contributor) commented Feb 8, 2024

create your own deployment in the CFT region

Could you clarify how to select the CFT region? I don't seem to have access to manually create a deployment in production.
The test created a deployment in this region: GCP - Los Angeles (us-west2). In staging I only see GCP - Iowa (us-central1).

I've tested with a script in this staging instance to create agent policies with agent monitoring concurrently, but couldn't reproduce the issue.
I think the error log could be improved to log more details; is it possible to run these tests with a custom Kibana image?
I created a PR cloud deployment which uses the same GCP us-west2 zone, but still can't reproduce the issue.

Another question, are these integration tests being run concurrently?

@rdner (Member) commented Feb 8, 2024

This suddenly escalated in the recent build https://buildkite.com/elastic/elastic-agent/builds/6936#018d8947-2e57-43fe-8b01-e1543bd867f2

45 tests failed; it seems like in all of them there was some kind of issue communicating with the Fleet Server, particularly during enrollment.

Error: fail to enroll: fail to execute request to fleet-server: status code: 0, fleet-server returned an error: , message: Unknown resource.

@nchaulet (Member) commented Feb 8, 2024

I am able to reproduce it locally. It seems to be an issue with concurrent installs of a package; I'm trying to figure out what the right fix will be here: https://elastic.slack.com/archives/C02BPSXUSKF/p1707410812298429?thread_ts=1707311534.367209&cid=C02BPSXUSKF

How to reproduce the bug: I introduced an artificial delay into the package install process:

--- a/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts
+++ b/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts
@@ -150,6 +150,7 @@ export async function _installPackage({
         spaceId,
         verificationResult,
       });
+      await new Promise((resolve) => setTimeout(resolve, 30 * 1000));
     }
     logger.debug(`Package install - Installing Kibana assets`);
     const kibanaAssetPromise = withPackageSpan('Install Kibana assets', () =>

Then I made two concurrent requests to create an agent policy:

[Attached video: repro.mov]
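
For illustration, here is a hedged sketch of that last step as a small standalone program making two concurrent requests (assumptions: the Kibana URL and credentials are placeholders, the kbn-xsrf header is the standard requirement for Kibana HTTP APIs, and the body fields mirror the agent policy requests shown earlier in this issue):

// Sketch only: fires two concurrent POST /api/fleet/agent_policies requests,
// which should be enough to hit the race once the artificial delay above is in place.
package main

import (
    "bytes"
    "fmt"
    "net/http"
    "sync"
)

const kibanaURL = "http://localhost:5601" // placeholder

func createAgentPolicy(name string) error {
    body := fmt.Sprintf(`{"name":%q,"namespace":"default","monitoring_enabled":["logs","metrics"]}`, name)
    req, err := http.NewRequest(http.MethodPost, kibanaURL+"/api/fleet/agent_policies", bytes.NewBufferString(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("kbn-xsrf", "true")
    req.SetBasicAuth("elastic", "changeme") // placeholder credentials
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    fmt.Println(name, resp.Status)
    return nil
}

func main() {
    var wg sync.WaitGroup
    for _, name := range []string{"repro-policy-a", "repro-policy-b"} {
        wg.Add(1)
        go func(n string) {
            defer wg.Done()
            if err := createAgentPolicy(n); err != nil {
                fmt.Println(n, "error:", err)
            }
        }(name)
    }
    wg.Wait()
}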

@cmacknz (Member, Author) commented Feb 8, 2024

Could you clarify how to select the CFT region? I don't seem to have access to manually create a deployment in production.
The test created a deployment in this region: GCP - Los Angeles (us-west2). In staging I only see GCP - Iowa (us-central1).

The CFT region is a production region, not staging. Will DM you details since the links are private.

@rdner (Member) commented Feb 12, 2024

Another occurrence in a recent build (2024-02-10T00:01:30.609Z), the day after the fix was merged (2024-02-09T13:43:26Z). This time the error message changed:

https://buildkite.com/elastic/elastic-agent/builds/6993#018d9052-86fa-49d4-8986-b2fd55437893

=== RUN   TestFQDN
    fqdn_test.go:66: Set FQDN on host to dnyskp.baz.io
    fqdn_test.go:70: Enroll agent in Fleet with a test policy
    fqdn_test.go:91: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/fqdn_test.go:91
        	Error:      	Received unexpected error:
        	            	unable to create policy: Manifest file manifest.yml not found in paths.
        	Test:       	TestFQDN
--- FAIL: TestFQDN (2.26s)
=== RUN   TestEndpointSecurityUnprivileged
    endpoint_security_test.go:592: Enrolling the agent in Fleet
    endpoint_security_test.go:609: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/endpoint_security_test.go:609
        	Error:      	Received unexpected error:
        	            	unable to create policy: Manifest file manifest.yml not found in paths.
        	Test:       	TestEndpointSecurityUnprivileged
        	Messages:   	Policy Response was: {{    []     0 0 [] false} 0001-01-01 00:00:00 +0000 UTC  0 false []}
--- FAIL: TestEndpointSecurityUnprivileged (1.73s)

There is a chance that a new snapshot was not used for this run, but since the error message changed it tells me it might be something else.

rdner reopened this on Feb 12, 2024
@juliaElastic (Contributor) commented Feb 12, 2024

@rdner The error message was changed by a recent PR; it is coming from the same code path. From what I found, the snapshot used in the Saturday night build is from Friday night, which didn't contain the fix yet.
Let's see if the error occurs again in the builds this week; hopefully not.

@rdner (Member) commented Feb 14, 2024

No failures in the last 2 days. Closing for now.

rdner closed this as completed on Feb 14, 2024
@rdner (Member) commented Mar 14, 2024

It appears to be back in https://buildkite.com/elastic/elastic-agent/builds/7776#018e3a44-b483-40fd-8255-c68a28340465

=== RUN   TestLogIngestionFleetManaged/Normal_logs_with_flattened_data_stream_are_shipped
    logs_ingestion_test.go:378: received a non 200-OK when adding package to policy. Status code: 400
    logs_ingestion_test.go:385: ================================================================================
    logs_ingestion_test.go:386: Kibana error response:
    logs_ingestion_test.go:387: HTTP/1.1 400 Bad Request
        Content-Length: 99
        Cache-Control: private, no-cache, no-store, must-revalidate
        Content-Security-Policy: script-src 'report-sample' 'self'; worker-src 'report-sample' 'self' blob:; style-src 'report-sample' 'self' 'unsafe-inline'; report-to violations-endpoint
        Content-Type: application/json; charset=utf-8
        Cross-Origin-Opener-Policy: same-origin
        Date: Thu, 14 Mar 2024 00:41:15 GMT
        Elastic-Api-Version: 2023-10-31
        Kbn-License-Sig: aae020e5403c04f90cee724f1b0e9d9bf953cfbd825ce1b7a661696fc918d7c8
        Kbn-Name: instance-0000000000
        Permissions-Policy: camera=(), display-capture=(), fullscreen=(self), geolocation=(), microphone=(), web-share=()
        Referrer-Policy: strict-origin-when-cross-origin
        Reporting-Endpoints: violations-endpoint="https://at-ingest-ci-8140-snapshot.kb.us-west2.gcp.elastic-cloud.com:9243/internal/security/analytics/_record_violations"
        X-Cloud-Request-Id: jpqhTfvBR6q3EiCurZEHMg
        X-Content-Type-Options: nosniff
        X-Found-Handling-Cluster: 766134025f594be89bcc257a84173981
        X-Found-Handling-Instance: instance-0000000000
        
        {"statusCode":400,"error":"Bad Request","message":"Manifest file manifest.yml not found in paths."}
    logs_ingestion_test.go:388: ================================================================================
    logs_ingestion_test.go:389: Rendered policy:
    logs_ingestion_test.go:390: 
        {
          "policy_id": "b84f0de3-e6a8-41fb-9c14-efe9b83e34d4",
          "package": {
            "name": "log",
            "version": "2.3.0"
          },
          "name": "Log-Input-TestLogIngestionFleetManaged/Normal_logs_with_flattened_data_stream_are_shipped-2024-03-14T00:41:15Z",
          "namespace": "testlogingestionfleetmanagednormallogswithflatteneddatastreamareshippednamespace17172997543648347858",
          "inputs": {
            "logs-logfile": {
              "enabled": true,
              "streams": {
                "log.logs": {
                  "enabled": true,
                  "vars": {
                    "paths": [
                      "/tmp/fleet-ingest-1774416147/log.log" 
                    ],
                    "data_stream.dataset": "testlogingestionfleetmanagednormallogswithflatteneddatastreamareshippeddataset"
                  }
                }
              }
            }
          }
        }
    logs_ingestion_test.go:391: ================================================================================
--- FAIL: TestLogIngestionFleetManaged/Normal_logs_with_flattened_data_stream_are_shipped (0.08s)

Is it the same issue or a new one?

rdner reopened this on Mar 14, 2024
@rdner (Member) commented Mar 19, 2024

@juliaElastic may I ask you to have a look at #4102 (comment)?

@juliaElastic (Contributor) commented Mar 19, 2024

@nchaulet Any idea if this is the same issue as before?
It does look similar. I tested manually adding a Custom log integration and didn't get the issue, so it seems there still might be a concurrency issue with the package installation somewhere.

@nchaulet (Member) commented Mar 19, 2024

Something seems weird, @rdner. It is possible that the test tries to do concurrent requests to POST /package_policies; from the logs, the requests roughly match the error.

[Screenshots: 2024-03-19 at 1:15:32 PM and 1:17:33 PM]

I was able to get a repro and I am working on a fix for it: elastic/kibana#178993

@rdner (Member) commented Mar 20, 2024

Something seems weird, @rdner. It is possible that the test tries to do concurrent requests to POST /package_policies; from the logs, the requests roughly match the error.

@nchaulet I checked the code; this test does not make requests concurrently, but it does make 2 POST /api/fleet/agent_policies requests that are quite close to each other, as you can see in the logs you mentioned.

The first request comes from here:

policy, err := tools.InstallAgentWithPolicy(
    ctx,
    t,
    installOpts,
    agentFixture,
    info.KibanaClient,
    createPolicyReq)

which comes to this function eventually:

func InstallAgentWithPolicy(ctx context.Context, t *testing.T,
    installOpts atesting.InstallOpts,
    agentFixture *atesting.Fixture,
    kibClient *kibana.Client,
    createPolicyReq kibana.AgentPolicy) (kibana.PolicyResponse, error) {
    t.Helper()

    // Create policy
    policy, err := kibClient.CreatePolicy(ctx, createPolicyReq)
    if err != nil {
        return policy, fmt.Errorf("unable to create policy: %w", err)
    }

    if createPolicyReq.IsProtected {
        // If protected fetch uninstall token and set it for the fixture
        resp, err := kibClient.GetPolicyUninstallTokens(ctx, policy.ID)
        if err != nil {
            return policy, fmt.Errorf("failed to fetch uninstal tokens: %w", err)
        }
        if len(resp.Items) == 0 {
            return policy, fmt.Errorf("expected non-zero number of tokens: %w", err)
        }
        if len(resp.Items[0].Token) == 0 {
            return policy, fmt.Errorf("expected non-empty token: %w", err)
        }
        uninstallToken := resp.Items[0].Token
        t.Logf("Protected with uninstall token: %v", uninstallToken)
        agentFixture.SetUninstallToken(uninstallToken)
    }

    err = InstallAgentForPolicy(ctx, t, installOpts, agentFixture, kibClient, policy.ID)
    return policy, err
}

It creates a new policy and enrolls the currently running agent:

policy, err := kibClient.CreatePolicy(ctx, createPolicyReq)

err = InstallAgentForPolicy(ctx, t, installOpts, agentFixture, kibClient, policy.ID)

But there is a sub-test case that also creates a policy here:

t.Run("Normal logs with flattened data_stream are shipped", func(t *testing.T) {
testFlattenedDatastreamFleetPolicy(t, ctx, info, policy)
})

// 2. Call Kibana to create the policy.
// Docs: https://www.elastic.co/guide/en/fleet/current/fleet-api-docs.html#create-integration-policy-api
resp, err := info.KibanaClient.Connection.Send(
http.MethodPost,
"/api/fleet/package_policies",
nil,
nil,
bytes.NewBufferString(agentPolicy))
if err != nil {
t.Fatalf("could not execute request to Kibana/Fleet: %s", err)
}

@belimawr since you worked on this test, could you give us more context why we create 2 policies here?

@belimawr (Contributor) commented:

@belimawr since you worked on this test, could you give us more context why we create 2 policies here?

The test TestLogIngestionFleetManaged creates a single policy

policy, err := tools.InstallAgentWithPolicy(
    ctx,
    t,
    installOpts,
    agentFixture,
    info.KibanaClient,
    createPolicyReq)
then it adds a package to the policy in func testFlattenedDatastreamFleetPolicy. If you look at the template, it contains the existing policy ID:
var policyJSON = `
{
  "policy_id": "{{.PolicyID}}",
  "package": {
    "name": "log",
    "version": "2.3.0"
  },
  "name": "{{.Name}}",
  "namespace": "{{.Namespace}}",
  "inputs": {
    "logs-logfile": {
      "enabled": true,
      "streams": {
        "log.logs": {
          "enabled": true,
          "vars": {
            "paths": [
              "{{.LogFilePath | js}}" {{/* we need to escape windows paths */}}
            ],
            "data_stream.dataset": "{{.Dataset}}"
          }
        }
      }
    }
  }
}`

My understanding when I wrote this test is that the call to /api/fleet/package_policies only adds packages to the policy.

@rdner (Member) commented Mar 20, 2024

@belimawr I just expected method PUT or PATCH for updating a policy, my bad.
