
[8.18](backport #42398) Handle leak of process info in hostfs provider for add_session_metadata #45322


Merged
haesbaert merged 3 commits into 8.18 from mergify/bp/8.18/pr-42398 on Jul 16, 2025

Conversation

mergify[bot]
Contributor

@mergify mergify bot commented Jul 11, 2025

Proposed commit message

Fixes #42317

So, it turns out that the process DB used by the procfs provider in add_session_metadata expects events to come in order, which won't always be the case under load. If we get an exit event before the exec event, we'll drop the exit event, and then the process event will remain in the db.processes map indefinitely. In addition to this, auditbeat is configured to tell netlink to drop events, meaning that under load, we can lose either the exec or the exit event, potentially leading to a leak if we can never pair up the two for a given process.

This alters the DB so we don't drop orphaned exit events, and instead the DB reaper will wait a few iterations of reapProcs() to try to match the orphaned exit. We also optionally reap process exec events. I've tested this under load, and it does prevent the process DB from growing indefinitely.
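
The matching behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the type procDB, its fields, the methods insertExec, insertExit and reap, and the maxPasses setting are all hypothetical names chosen for the example. It only shows the idea of keeping an orphaned exit around for a few reaper passes instead of dropping it; the optional reaping of exec events is not shown.

```go
package main

import "fmt"

// procDB is a hypothetical, heavily simplified stand-in for the process DB.
// Field and method names are illustrative and not taken from the PR.
type procDB struct {
	processes  map[uint32]string // pid -> process name from the exec event
	removalMap map[uint32]int    // pid -> reaper passes a pending exit has waited
	maxPasses  int               // passes an unmatched exit may wait before it is dropped
}

func newProcDB(maxPasses int) *procDB {
	return &procDB{
		processes:  map[uint32]string{},
		removalMap: map[uint32]int{},
		maxPasses:  maxPasses,
	}
}

// insertExec records an exec event for pid.
func (db *procDB) insertExec(pid uint32, name string) {
	db.processes[pid] = name
}

// insertExit always records the exit, even when no exec has been seen yet,
// so an out-of-order exit is no longer lost.
func (db *procDB) insertExit(pid uint32) {
	db.removalMap[pid] = 0
}

// reap removes processes whose exit has been seen, and ages out exits that
// have stayed unmatched for maxPasses passes.
func (db *procDB) reap() {
	for pid, passes := range db.removalMap {
		if _, ok := db.processes[pid]; ok {
			// Exec and exit are paired: drop both entries.
			delete(db.processes, pid)
			delete(db.removalMap, pid)
			continue
		}
		if passes+1 >= db.maxPasses {
			// Orphaned exit waited long enough: give up on it.
			delete(db.removalMap, pid)
			continue
		}
		db.removalMap[pid] = passes + 1
	}
}

func main() {
	db := newProcDB(3)
	db.insertExit(42)       // exit arrives first (out of order)
	db.insertExec(42, "ls") // exec arrives late
	db.reap()
	fmt.Println("processes:", len(db.processes), "pending exits:", len(db.removalMap))
}
```

In this sketch an exit that never finds its exec costs one map entry for at most maxPasses reaper runs, which is the memory trade-off mentioned in the caveats.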

There's a few caveats to this as-is:

  • We're now putting every single exit event into our db.removalMap, which means we'll be using more memory until those exit events are reaped. I can't really think of a good way around this.
  • This processor still uses a lot of resources, and under high-load situations, we may still end up using an unacceptable amount of memory.
  • If we need to reap processes, it can result in data loss if the processes don't exist in /proc.

There's also a few smaller changes to the process DB:

  • The removal list has been changed from a heap type to a map. This is less performant, but needed, as we're looking up exit events with every exec.
  • We expose a number of new config vars.
  • This adds metrics to the DB, to further help out with any issues in the future.
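
To illustrate why a lookup on every exec favors a map, here is a small sketch under assumed types. exitEvent, findInSlice and findInMap are hypothetical names; the slice stands in for a heap-backed list, which is ordered for removal rather than indexed by pid.

```go
package main

import "fmt"

// exitEvent is a hypothetical record of a pending process exit.
type exitEvent struct {
	pid      uint32
	exitCode int
}

// findInSlice models lookup in a heap-backed list: entries are ordered for
// removal, not by pid, so finding one pid means scanning the entries.
func findInSlice(pending []exitEvent, pid uint32) (exitEvent, bool) {
	for _, ev := range pending {
		if ev.pid == pid {
			return ev, true
		}
	}
	return exitEvent{}, false
}

// findInMap models the map-based removal list: a direct lookup by pid.
func findInMap(pending map[uint32]exitEvent, pid uint32) (exitEvent, bool) {
	ev, ok := pending[pid]
	return ev, ok
}

func main() {
	list := []exitEvent{{pid: 10, exitCode: 0}, {pid: 11, exitCode: 1}}
	byPid := map[uint32]exitEvent{10: list[0], 11: list[1]}

	a, okA := findInSlice(list, 11)
	b, okB := findInMap(byPid, 11)
	fmt.Println(a, okA, b, okB)
}
```

The map gives up the heap's cheap access to the next entry due for removal, which is the performance cost noted in the list above.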

I'm still running performance tests on this, as the behavior is a bit bursty and hard to measure without some proper scripts. Will update when I have results.

How to test

Run auditbeat with the following:

- module: auditd
  # Load audit rules from separate files. Same format as audit.rules(7).
  audit_rule_files: [ '${path.config}/audit.rules.d/*.conf' ]
  audit_rules: |
    -a exit,always -F arch=b64 -S fork
    -a exit,always -F arch=b64 -S vfork
    ## set_sid
    -a exit,always -F arch=b64 -F euid=0 -S execve -k rootact
    -a exit,always -F arch=b32 -F euid=0 -S execve -k rootact
    -a always,exit -F arch=b64 -S connect -F a2=16 -F success=1 -F key=network_connect_4
    -a always,exit -F arch=b64 -F exe=/bin/bash -F success=1 -S connect -k "remote_shell"
    -a always,exit -F arch=b64 -F exe=/usr/bin/bash -F success=1 -S connect -k "remote_shell" 
    -a always,exit -F arch=b64 -S exit_group
    -a exit,always -F arch=b64 -S close
    -a always,exit -F arch=b64 -S exit
    -a exit,always -F arch=b64 -S kill
    -a always,exit -F arch=b64 -S setsid 
    -a always,exit -F arch=b64 -S execve,execveat -k exec

processors:
  - add_session_metadata:
      backend: "procfs"

logging.level: debug

Grep for the REAPER: log line to examine the state of the various DB maps.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Data: ProcessDB under load

Screenshot 2025-02-06 at 12 51 07 PM

Data: Memory during and after load

Screenshot 2025-02-06 at 2 02 26 PM

This is an automatic backport of pull request #42398 done by [Mergify](https://mergify.com).

…adata` (#42398)

* handle leak in hostfs provider for sessionmd

* add metrics, clean up

* fix tests

* add process reaper for dropped exit events

* remove test code

* linter

* more testing, fix mock provider

* fix error checks

* clean up, add session maps to reaper, expand metrics

* fix tests

* fix tests

* format

* docs

(cherry picked from commit d6ff82b)
@mergify mergify bot requested a review from a team as a code owner July 11, 2025 13:00
@mergify mergify bot added the backport label Jul 11, 2025
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jul 11, 2025
@github-actions github-actions bot added bug Auditbeat Team:Security-Linux Platform Linux Platform Team in Security Solution labels Jul 11, 2025
@elasticmachine
Collaborator

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jul 11, 2025
Contributor Author

mergify bot commented Jul 14, 2025

This pull request has not been merged yet. Could you please review and merge it @fearful-symmetry? 🙏

We now check for G115. Most overflows are impossible, like converting s63
seconds to u64 seconds for date (will overflow in 292 billion years).

Pids are actually 32-bit in the kernel, so casting * -> u32 is safe.

This is a backport, and I'd hate to introduce a bug by adding unnecessary
overflow handling.
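
For context, G115 is the gosec rule that flags integer conversions which could overflow. The sketch below shows what an explicitly checked conversion might look like, for contrast with the plain cast the commit message argues is safe. pidToUint32 is a hypothetical helper and is not code from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// pidToUint32 is a hypothetical helper showing the kind of explicit bounds
// check that satisfies an overflow linter. The commit message argues this is
// unnecessary for pids, since they are 32-bit in the kernel.
func pidToUint32(pid int) (uint32, error) {
	if pid < 0 || int64(pid) > math.MaxUint32 {
		return 0, fmt.Errorf("pid %d out of range for uint32", pid)
	}
	return uint32(pid), nil
}

func main() {
	v, err := pidToUint32(1234)
	fmt.Println(v, err)

	_, err = pidToUint32(-1)
	fmt.Println(err)
}
```

The alternative is to keep the direct cast and annotate it for the linter, which avoids adding new error paths in a backport.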
@haesbaert haesbaert added backport-9.0 Automated backport to the 9.0 branch and removed backport-9.0 Automated backport to the 9.0 branch labels Jul 16, 2025
Contributor

@nicholasberlin nicholasberlin left a comment


overflow lint handling LGTM

@haesbaert haesbaert merged commit a9daa98 into 8.18 Jul 16, 2025
25 checks passed
@haesbaert haesbaert deleted the mergify/bp/8.18/pr-42398 branch July 16, 2025 14:56
Labels
Auditbeat backport bug Team:Security-Linux Platform Linux Platform Team in Security Solution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants