Unable to recover from corrupted upgrades when the host system reboots #9099

@cmacknz

We recently had a support case where the host Windows system appeared to crash or shut down shortly after we entered the "Upgrade Replacing" state. This is the state where the top level elastic-agent symlink has been pointed at the newly downloaded and extracted version of the agent. This was the state of the upgrade marker at the time of the problem:

version: 9.0.3+build202507110136
hash: 5f1479
versioned_home: data\elastic-agent-9.0.3+build202507110136-5f1479
updated_on: 2025-07-15T18:21:24.0025395-07:00
prev_version: 9.0.3
prev_hash: 5f1479
prev_versioned_home: data\elastic-agent-9.0.3-5f1479
acked: false
action:
    id: 1072a966-3972-4af0-a7e4-bdfd6fc90f5c
    type: UPGRADE
    version: 9.0.3+build202507110136
details:
    target_version: 9.0.3+build202507110136
    state: UPG_REPLACING
    action_id: 1072a966-3972-4af0-a7e4-bdfd6fc90f5c
    metadata:
        download_percent: 1
        retry_until: null
desired_outcome: UPGRADE

In this specific case, the contents of the .elastic-agent.active.commit and .package-version files were corrupted: the commit hash had been replaced by the bytes 0000000000000a instead of the expected 5f147906dab8702d8738a71f627a07a0725b5abc for the 9.0.3+build202507110136 hotfix release.

All elastic-agent commands were failing with the following error:

Error initializing version information: parsing version "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00": version string does not match expected format
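
To make the failure mode concrete, here is a minimal sketch of a check that would distinguish a plausible commit hash from the NUL-filled content seen in this case. The function name `looksLikeCommitHash` is hypothetical and is not part of the agent's code. It assumes the file is expected to hold a 40-character hexadecimal hash, as in the expected value quoted above.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

// looksLikeCommitHash (hypothetical helper) reports whether raw is a plausible
// commit hash: exactly 40 hexadecimal characters, optionally surrounded by
// whitespace. NUL bytes are not whitespace, so a zero-filled file is rejected.
func looksLikeCommitHash(raw []byte) bool {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) != 40 {
		return false
	}
	_, err := hex.DecodeString(string(trimmed))
	return err == nil
}

func main() {
	good := []byte("5f147906dab8702d8738a71f627a07a0725b5abc\n")
	bad := bytes.Repeat([]byte{0}, 24) // zero-filled content, as in the error above

	fmt.Println(looksLikeCommitHash(good))
	fmt.Println(looksLikeCommitHash(bad))
}
```

A check like this could let startup code treat a corrupted file as "unknown version" and continue far enough to attempt recovery, rather than failing every command.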

In this situation, the upgrade watcher would not have been running because the host system rebooted. The OS will attempt to start the elastic-agent service because the path to the executable is still valid, but the service likely exited because of its dependency on valid version files. This is likely a situation where the existing upgrade design could have recovered if we started the upgrade watcher as early as possible, with no dependency on the version (perhaps discovering the path to the watcher differently if the path files are corrupted).
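
One possible way to discover paths without reading any version file is to enumerate the versioned home directories by name. The sketch below is an illustration only: `candidateHomes` and `demo` are hypothetical names, and the `data/elastic-agent-*` pattern is inferred from the `versioned_home` values in the upgrade marker above, not taken from the agent's code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// candidateHomes (hypothetical) lists versioned home directories under
// topDir/data by directory name alone, so it does not depend on the
// contents of any version file.
func candidateHomes(topDir string) ([]string, error) {
	matches, err := filepath.Glob(filepath.Join(topDir, "data", "elastic-agent-*"))
	if err != nil {
		return nil, err
	}
	var homes []string
	for _, m := range matches {
		if info, err := os.Stat(m); err == nil && info.IsDir() {
			homes = append(homes, m)
		}
	}
	sort.Strings(homes)
	return homes, nil
}

// demo stages two empty versioned homes in a temporary directory and
// returns the base names that candidateHomes discovers.
func demo() ([]string, error) {
	top, err := os.MkdirTemp("", "agent-top-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(top)

	for _, name := range []string{
		"elastic-agent-9.0.3-5f1479",
		"elastic-agent-9.0.3+build202507110136-5f1479",
	} {
		if err := os.MkdirAll(filepath.Join(top, "data", name), 0o755); err != nil {
			return nil, err
		}
	}

	homes, err := candidateHomes(top)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(homes))
	for _, h := range homes {
		names = append(names, filepath.Base(h))
	}
	return names, nil
}

func main() {
	names, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

Directory names alone do not say which version is healthy, so a real implementation would still need the upgrade marker (or some other signal) to decide which home to roll back to.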

We should set up tests that stage an agent in various representative versions of this situation and make changes so that it can recover. The conditions would be:

  • The agent service is not yet started (or agent is not installed to simplify control)
  • Two versions of the agent on disk in the data sub-directory
  • The upgrade watcher is not yet running and/or has been stopped
  • The active version of the agent has a runtime problem but is able to start
    • The upgrade marker could be correct
    • The version files could be corrupt
    • The path layout might not be what is expected
    • Etc
  • The agent watcher should be able to start and switch the symlink to the working version of the agent.
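
The staging part of the conditions above could be sketched roughly as follows. This is a fixture builder only, under stated assumptions: the helper names are hypothetical, placing `.elastic-agent.active.commit` at the top level is an assumption of this sketch, and the agent binaries, symlink, and upgrade marker that a real test would also need are omitted.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

const (
	prevHome = "elastic-agent-9.0.3-5f1479"
	newHome  = "elastic-agent-9.0.3+build202507110136-5f1479"
	// Assumption: this sketch places the commit file directly under topDir.
	commitFile = ".elastic-agent.active.commit"
)

// stageCorruptedUpgrade (hypothetical) creates two versioned homes under
// topDir/data and writes a commit file consisting only of NUL bytes.
func stageCorruptedUpgrade(topDir string) error {
	for _, home := range []string{prevHome, newHome} {
		if err := os.MkdirAll(filepath.Join(topDir, "data", home), 0o755); err != nil {
			return err
		}
	}
	return os.WriteFile(filepath.Join(topDir, commitFile), make([]byte, 24), 0o644)
}

// hasNULBytes reports whether the file at path contains at least one NUL byte.
func hasNULBytes(path string) (bool, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return false, err
	}
	return bytes.IndexByte(b, 0) >= 0, nil
}

// runScenario stages the fixture in a temporary directory and returns the
// number of entries in data/ and whether the commit file is corrupted.
func runScenario() (int, bool, error) {
	top, err := os.MkdirTemp("", "agent-fixture-")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(top)

	if err := stageCorruptedUpgrade(top); err != nil {
		return 0, false, err
	}
	entries, err := os.ReadDir(filepath.Join(top, "data"))
	if err != nil {
		return 0, false, err
	}
	corrupt, err := hasNULBytes(filepath.Join(top, commitFile))
	if err != nil {
		return 0, false, err
	}
	return len(entries), corrupt, nil
}

func main() {
	homes, corrupt, err := runScenario()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("homes=%d corrupt=%t\n", homes, corrupt)
}
```

A real test would then start the watcher against this layout and assert that the symlink ends up pointing at the working version.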

If the top level elastic-agent symlink were to be corrupted, or the active version of the agent executable were to be corrupted when the previous one was not, we would likely need another design change to address this situation. We can discuss doing this depending on complexity, but we should prioritize fixing the known and observed bug above in the current design first.
