TestStandaloneDowngradeToSpecificSnapshotBuild in daily builds can fail with busy file elastic-agent error due to upgrade hash collision #4089

Closed
rdner opened this issue Jan 17, 2024 · 6 comments · Fixed by #4095
Assignees
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@rdner
Member

rdner commented Jan 17, 2024

Flaky Test

This test started failing quite consistently on 12 January 2024. The commit where the failures began (158fd47) contains no code changes.

Stack

Linux

  fixture.go:261: Extracting artifact elastic-agent-8.13.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestStandaloneDowngradeToSpecificSnapshotBuild3993626959/002
    fixture.go:274: Completed extraction of artifact elastic-agent-8.13.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestStandaloneDowngradeToSpecificSnapshotBuild3993626959/002
    fixture.go:820: Components were not modified from the fetched artifact
    fixture.go:615: >> running binary with: [/tmp/TestStandaloneDowngradeToSpecificSnapshotBuild3993626959/001/elastic-agent-8.13.0-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    fixture.go:615: >> running binary with: [/tmp/TestStandaloneDowngradeToSpecificSnapshotBuild3993626959/002/elastic-agent-8.13.0-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    upgrader.go:201: Installing version "8.13.0-SNAPSHOT"
    fixture_install.go:103: [test TestStandaloneDowngradeToSpecificSnapshotBuild] Inside fixture install function
    fixture.go:615: >> running binary with: [/tmp/TestStandaloneDowngradeToSpecificSnapshotBuild3993626959/001/elastic-agent-8.13.0-SNAPSHOT-linux-arm64/elastic-agent install --force --non-interactive]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrader.go:236: Upgrading from version "8.13.0-SNAPSHOT" to version "8.13.0-SNAPSHOT"
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent upgrade 8.13.0-SNAPSHOT --source-uri file:///home/ubuntu/agent/.agent-testing/artifact --skip-verify]
    upgrade_downgrade_test.go:93: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_downgrade_test.go:93
        	Error:      	Received unexpected error:
        	            	failed to start agent upgrade to version "8.13.0": exit status 1
        	            	Error: Failed trigger upgrade of daemon: TarInstaller: creating file /opt/Elastic/Agent/data/elastic-agent-17f048/elastic-agent: open /opt/Elastic/Agent/data/elastic-agent-17f048/elastic-agent: text file busy
        	            	For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
        	Test:       	TestStandaloneDowngradeToSpecificSnapshotBuild
    fixture_install.go:137: [test TestStandaloneDowngradeToSpecificSnapshotBuild] Inside fixture cleanup function
    fixture_install.go:152: collecting diagnostics; test failed
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestStandaloneDowngradeToSpecificSnapshotBuild-diagnostics-2024-01-16T08:18:37Z.zip]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
--- FAIL: TestStandaloneDowngradeToSpecificSnapshotBuild (66.93s)

Windows

fixture.go:261: Extracting artifact elastic-agent-8.13.0-SNAPSHOT-windows-x86_64.zip to C:\Users\windows\AppData\Local\Temp\TestStandaloneDowngradeToSpecificSnapshotBuild3775977433\002
    fixture.go:274: Completed extraction of artifact elastic-agent-8.13.0-SNAPSHOT-windows-x86_64.zip to C:\Users\windows\AppData\Local\Temp\TestStandaloneDowngradeToSpecificSnapshotBuild3775977433\002
    fixture.go:820: Components were not modified from the fetched artifact
    fixture.go:615: >> running binary with: [C:\Users\windows\AppData\Local\Temp\TestStandaloneDowngradeToSpecificSnapshotBuild3775977433\001\elastic-agent-8.13.0-SNAPSHOT-windows-x86_64\elastic-agent.exe version --binary-only --yaml]
    fixture.go:615: >> running binary with: [C:\Users\windows\AppData\Local\Temp\TestStandaloneDowngradeToSpecificSnapshotBuild3775977433\002\elastic-agent-8.13.0-SNAPSHOT-windows-x86_64\elastic-agent.exe version --binary-only --yaml]
    upgrader.go:201: Installing version "8.13.0-SNAPSHOT"
    fixture_install.go:103: [test TestStandaloneDowngradeToSpecificSnapshotBuild] Inside fixture install function
    fixture.go:615: >> running binary with: [C:\Users\windows\AppData\Local\Temp\TestStandaloneDowngradeToSpecificSnapshotBuild3775977433\001\elastic-agent-8.13.0-SNAPSHOT-windows-x86_64\elastic-agent.exe install --force --non-interactive]
    fixture.go:615: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe status --output json]
    upgrader.go:236: Upgrading from version "8.13.0-SNAPSHOT" to version "8.13.0-SNAPSHOT"
    fixture.go:615: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe upgrade 8.13.0-SNAPSHOT --source-uri file://C:\Users\windows\agent\.agent-testing\artifact --skip-verify]
    upgrade_downgrade_test.go:93: 
        	Error Trace:	C:/Users/windows/agent/testing/integration/upgrade_downgrade_test.go:93
        	Error:      	Received unexpected error:
        	            	failed to start agent upgrade to version "8.13.0": exit status 1
        	            	Error: Failed trigger upgrade of daemon: open C:\Program Files\Elastic\Agent\data\elastic-agent-17f048\elastic-agent.exe: The process cannot access the file because it is being used by another process.
        	            	For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
        	Test:       	TestStandaloneDowngradeToSpecificSnapshotBuild
    fixture_install.go:137: [test TestStandaloneDowngradeToSpecificSnapshotBuild] Inside fixture cleanup function
    fixture_install.go:152: collecting diagnostics; test failed
    fixture.go:615: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe diagnostics -f C:\Users\windows\agent\build\diagnostics\TestStandaloneDowngradeToSpecificSnapshotBuild-diagnostics-2024-01-16T08-25-31Z.zip]
    fixture.go:615: >> running binary with: [C:\Program Files\Elastic\Agent\elastic-agent.exe uninstall --force]
--- FAIL: TestStandaloneDowngradeToSpecificSnapshotBuild (83.98s)
@rdner rdner self-assigned this Jan 17, 2024
@pierrehilbert pierrehilbert added the Team:Elastic-Agent Label for the Agent team label Jan 17, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@rdner
Member Author

rdner commented Jan 17, 2024

upgrader.go:236: Upgrading from version "8.13.0-SNAPSHOT" to version "8.13.0-SNAPSHOT"

According to this line, the test is mistakenly trying to upgrade to the same version.

This message comes from the following code:

t.Logf("Testing Elastic Agent upgrade from %s to %s...", define.Version(), endParsedVersion.String())

define.Version() comes from our environment variable; the other version comes from the artefact API. It looks like the artefact API started returning something different 5 days ago, so the test now needs a code change or additional checks.

@rdner
Member Author

rdner commented Jan 17, 2024

The main theory for now is that multiple builds from the artefact API have the same commit hash for the elastic-agent binary which leads to re-using the same path.

To verify this, I extended the logging in #4090.

Once it's verified, we will need to fix the test by implementing the following:

  1. Iterate backwards through the builds received from the artefact API until we find a build whose hash does not match the currently running version.
  2. If the current version has no such SNAPSHOT build (e.g. every 8.13.0-SNAPSHOT build matches the current hash from main), skip the test.

@cmacknz
Member

cmacknz commented Jan 17, 2024

The main theory for now is that multiple builds from the artefact API have the same commit hash for the elastic-agent binary which leads to re-using the same path.

The build this happened for is the daily build, which is not triggered by a new commit. If there were no commits to main between the time the stack 8.13.0 snapshot build (which builds from main) last completed and the time this job was triggered, then this is exactly the result we would get.

@cmacknz
Member

cmacknz commented Jan 17, 2024

The last successful snapshot for main was with build ID 8.13.0-sq0d327c. This matches what I see being downloaded in the tests:

Downloading artifact from https://snapshots.elastic.co/8.13.0-sq0d327c/downloads/beats/elastic-agent/elastic-agent-8.13.0-SNAPSHOT-windows-x86_64.zip

Looking at the stack agent package Buildkite job for that snapshot (or inspecting the agent artifacts directly), we see it is from commit 17f0480:

Daily build including beats serverless tests
Build #6287
main
17f048008

Looking at the Buildkite job, it is also from 17f0480.

So, without looking at the log at all, we can confirm this is in fact a hash collision.
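
To illustrate why a shared commit leads to the busy-file error, here is a small sketch. It assumes, based only on the `data/elastic-agent-17f048` path in the logs above, that the install directory is named after the first six characters of the commit hash; `versionedHome` is a hypothetical helper, not the agent's real function.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// versionedHome mimics the directory layout seen in the failure logs:
// <top>/data/elastic-agent-<first six characters of the commit hash>.
// Hypothetical helper for illustration only.
func versionedHome(topPath, commit string) string {
	const hashLen = 6
	short := commit
	if len(short) > hashLen {
		short = short[:hashLen]
	}
	return filepath.Join(topPath, "data", "elastic-agent-"+short)
}

func main() {
	// The running daily build and the downloaded snapshot share a commit.
	installed := versionedHome("/opt/Elastic/Agent", "17f048008")
	target := versionedHome("/opt/Elastic/Agent", "17f0480")

	fmt.Println("installed:", installed)
	fmt.Println("target:   ", target)
	// When both resolve to the same directory, the upgrade tries to
	// overwrite the binary that is currently executing.
	fmt.Println("collision:", installed == target)
}
```

When the two paths are equal, the installer ends up writing over the executable that is currently running, which matches the "text file busy" error on Linux and the "being used by another process" error on Windows.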

@cmacknz cmacknz changed the title TestStandaloneDowngradeToSpecificSnapshotBuild fails with busy file elastic-agent Daily builds can fail with busy file elastic-agent error due to upgrade hash collision Jan 17, 2024
@cmacknz cmacknz changed the title Daily builds can fail with busy file elastic-agent error due to upgrade hash collision TestStandaloneDowngradeToSpecificSnapshotBuild in daily builds can fail with busy file elastic-agent error due to upgrade hash collision Jan 17, 2024
@cmacknz
Member

cmacknz commented Jan 17, 2024

Edited the title to be a bit more general, since there is at least one duplicate of this problem in #4091.

@rdner rdner added the flaky-test Unstable or unreliable test cases. label Jan 18, 2024