Description
We recently had a support case where the host Windows system appeared to crash or shut down shortly after the upgrade entered the "Upgrade Replacing" state. This is the state where the top-level elastic-agent symlink has been pointed at the newly downloaded and extracted version of the agent. This was the state of the upgrade marker at the time of the problem:
version: 9.0.3+build202507110136
hash: 5f1479
versioned_home: data\elastic-agent-9.0.3+build202507110136-5f1479
updated_on: 2025-07-15T18:21:24.0025395-07:00
prev_version: 9.0.3
prev_hash: 5f1479
prev_versioned_home: data\elastic-agent-9.0.3-5f1479
acked: false
action:
  id: 1072a966-3972-4af0-a7e4-bdfd6fc90f5c
  type: UPGRADE
  version: 9.0.3+build202507110136
details:
  target_version: 9.0.3+build202507110136
  state: UPG_REPLACING
  action_id: 1072a966-3972-4af0-a7e4-bdfd6fc90f5c
  metadata:
    download_percent: 1
    retry_until: null
desired_outcome: UPGRADE
In this specific case, the contents of the .elastic-agent.active.commit and .package-version files had the commit hash corrupted into the bytes 0000000000000a instead of the expected 5f147906dab8702d8738a71f627a07a0725b5abc for the 9.0.3+build202507110136 hotfix release.
All elastic-agent commands were failing with the following error:
Error initializing version information: parsing version "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00": version string does not match expected format
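The error above suggests that a version file can be checked for plausibility before anything tries to parse it. A minimal sketch in Go, assuming a commit file should hold a 40-character lowercase hexadecimal hash like the expected value quoted above. The helper name looksLikeCommitHash is hypothetical and is not part of the elastic-agent code base:

```go
package main

import (
	"bytes"
	"fmt"
)

// looksLikeCommitHash is a hypothetical helper: it reports whether raw,
// after trimming surrounding whitespace, is a 40-character lowercase
// hexadecimal string. A run of NUL bytes fails this check.
func looksLikeCommitHash(raw []byte) bool {
	s := bytes.TrimSpace(raw)
	if len(s) != 40 {
		return false
	}
	for _, c := range s {
		isDigit := c >= '0' && c <= '9'
		isHexLetter := c >= 'a' && c <= 'f'
		if !isDigit && !isHexLetter {
			return false
		}
	}
	return true
}

func main() {
	good := []byte("5f147906dab8702d8738a71f627a07a0725b5abc\n")
	corrupt := make([]byte, 40) // all NUL bytes, as in the error message
	fmt.Println(looksLikeCommitHash(good), looksLikeCommitHash(corrupt))
}
```

A check like this only detects the corruption. What to do once it is detected (for example, falling back to the previous version) is a separate design decision.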
In this situation, the upgrade watcher will not have been running because the host system rebooted. The OS will attempt to start the elastic-agent service because the path to the executable is still valid, but the service likely exited because of its dependency on valid version files. This is likely a situation where the existing upgrade design could have recovered if we started the upgrade watcher as early as possible, with no dependency on the version (perhaps having to discover the path to the watcher differently if the path files are corrupted).
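One way to picture the recovery described above is a decision that relies only on the two versioned homes recorded in the upgrade marker, not on the version files of the running binary. The following Go sketch is an illustration under that assumption, not the watcher's actual logic. The marker struct, the chooseHome function, and the usable callback are all hypothetical names:

```go
package main

import "fmt"

// marker holds only the upgrade marker fields this sketch needs.
// The field names mirror versioned_home and prev_versioned_home.
type marker struct {
	VersionedHome     string
	PrevVersionedHome string
}

// chooseHome returns the versioned home that should stay active.
// usable is a caller-supplied check (hypothetical) that reports whether
// the agent in a given home can be expected to start, for example by
// validating its version files.
func chooseHome(m marker, usable func(home string) bool) (string, error) {
	if usable(m.VersionedHome) {
		return m.VersionedHome, nil
	}
	if m.PrevVersionedHome != "" && usable(m.PrevVersionedHome) {
		return m.PrevVersionedHome, nil
	}
	return "", fmt.Errorf("neither %q nor %q is usable",
		m.VersionedHome, m.PrevVersionedHome)
}

func main() {
	m := marker{
		VersionedHome:     `data\elastic-agent-9.0.3+build202507110136-5f1479`,
		PrevVersionedHome: `data\elastic-agent-9.0.3-5f1479`,
	}
	// Simulate the observed case: only the previous home is healthy.
	home, err := chooseHome(m, func(h string) bool {
		return h == m.PrevVersionedHome
	})
	fmt.Println(home, err)
}
```

Passing the health check in as a function keeps the decision testable without a real installation on disk.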
We should set up tests that stage an agent in various representative versions of this situation and make changes so that it can recover. The conditions would be:
- The agent service is not yet started (or agent is not installed to simplify control)
- Two versions of the agent on disk in the data sub-directory
- The upgrade watcher is not yet running and/or has been stopped
- The active version of the agent has a runtime problem but is able to start
- The upgrade marker could be correct
- The version files could be corrupt
- The path layout might not be what is expected
- Etc
- The agent watcher should be able to start and switch the symlink to the working version of the agent.
If the top-level elastic-agent symlink were to be corrupted, or the active version of the agent executable were to be corrupted when the previous one was not, we would likely need another design change to address this situation. We can discuss doing this depending on complexity, but we should prioritize fixing the known and observed bug above in the current design first.