-
Notifications
You must be signed in to change notification settings - Fork 1.7k
engine: fix busy chunks on reload. #10720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughRefactors shutdown/reload flow in Changes
Sequence Diagram(s)sequenceDiagram
participant Engine as Engine Main Loop
participant Config as Engine Config
participant Timer as Shutdown Timer
Engine->>Engine: Receive FLB_ENGINE_STOP
alt config->shutdown_fd <= 0
Engine->>Timer: Create shutdown timer (with error handling)
Timer-->>Engine: Timer created / error
end
Engine->>Engine: Continue event loop
Engine->>Engine: Receive FLB_ENGINE_SHUTDOWN
Engine->>Engine: Log pending memory/filesystem chunks (if any)
Engine->>Engine: Log running tasks (if any)
Engine->>Engine: Call flb_engine_stop_ingestion(config)
Engine->>Engine: Proceed with shutdown/flush logic
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~15 minutes Assessment against linked issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
src/flb_engine.c (2)
137-142
: Bug: logging uses uninitialized ‘retry’ pointerIn flb_engine_reschedule_retries(), the debug message prints retry before it’s assigned (it’s set later in the inner loop). This can dereference garbage.
- if (task->users > 0) { - flb_debug("[engine] retry=%p for task %i already scheduled to run, " - "not re-scheduling it.", - retry, task->id); - continue; - } + if (task->users > 0) { + flb_debug("[engine] task_id=%i already scheduled to run, not re-scheduling it.", + task->id); + continue; + }
584-586
: Ensure correct timer fd consumption for shutdown eventsThe
shutdown_fd
is created withmk_event_timeout_create()
(i.e. it’s a timerfd), so it should be drained withflb_utils_timer_consume()
, notflb_utils_pipe_byte_consume()
. Update in src/flb_engine.c:
- At lines 583–586, replace:
with:else if (config->shutdown_fd == fd) { flb_utils_pipe_byte_consume(fd); return FLB_ENGINE_SHUTDOWN; }else if (config->shutdown_fd == fd) { flb_utils_timer_consume(fd); return FLB_ENGINE_SHUTDOWN; }This ensures we consume the timer event correctly (avoiding EINVAL on short reads).
🧹 Nitpick comments (3)
src/flb_engine.c (3)
1003-1008
: Comment wording nit: tighten phrasingMinor wording cleanups to improve clarity.
- * Will reschedule task to 1 sec. retry. - * Also timer with shutdown event will be created. + * Will reschedule tasks to a 1-second retry. + * A shutdown timer event will also be created.
1072-1080
: Throttle INFO logs for pending chunks/tasks to avoid log spamWith these moved earlier, they’ll print every second until shutdown completes. Consider throttling or lowering verbosity after the first tick.
- if ((mem_chunks + fs_chunks) > 0) { - flb_info("[engine] pending chunk count: memory=%d, filesystem=%d; grace_timer=%d", - mem_chunks, fs_chunks, config->grace_count); - } - - if (tasks > 0) { - flb_task_running_print(config); - } + /* Log once per 10 seconds to reduce noise; still log immediately at first tick */ + if ((config->grace_count == 1) || + (config->grace_count % 10) == 0) { + if ((mem_chunks + fs_chunks) > 0) { + flb_info("[engine] pending chunk count: memory=%d, filesystem=%d; grace_timer=%d", + mem_chunks, fs_chunks, config->grace_count); + } + if (tasks > 0) { + flb_task_running_print(config); + } + }
1111-1112
: Superfluous local reset of configSetting a local pointer to NULL right before return has no effect and can confuse readers.
- config = NULL; - return ret; + return ret;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/flb_engine.c
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-centos-7
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: PR - fuzzing test
@edsiper, could you please take a look? |
some code logic code review: 1 Uninitialized Pointer Bug (Lines 137-142)
2 Incorrect Timer Consumption (Lines 584-586)
3 Missing Error Handling (Lines 1046-1051)
Medium Priority Issues
Recommendations Before Merge:
|
…logic on shutdown Signed-off-by: Siarhei Rasiukevich <s.rasiukevich@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/flb_engine.c (2)
1029-1054
: Guard + error handling for shutdown timer: great; set priority after create for consistencyThe guarded creation and error handling for mk_event_timeout_create() look solid and prevent the double FLB_ENGINE_STOP issue. For consistency with the flush timer (set after creation at Line 899) and to avoid any chance of the event system overwriting fields during registration, set event->priority after a successful creation.
Apply this small relocation:
- event->priority = FLB_ENGINE_PRIORITY_SHUTDOWN; @@ - if (config->shutdown_fd <= 0) { - config->shutdown_fd = mk_event_timeout_create(evl, - 1, - 0, - event); - - if (config->shutdown_fd == -1) { - flb_error("[engine] could not create shutdown timer"); - /* fail early so we don't silently skip scheduled shutdown */ - return -1; - } - } + if (config->shutdown_fd <= 0) { + config->shutdown_fd = mk_event_timeout_create(evl, 1, 0, event); + if (config->shutdown_fd == -1) { + flb_error("[engine] could not create shutdown timer"); + /* fail early so we don't silently skip scheduled shutdown */ + return -1; + } + event->priority = FLB_ENGINE_PRIORITY_SHUTDOWN; + }
1109-1114
: Set shutdown_fd to an invalid value after destroying the timerAfter mk_event_timeout_destroy(), reset shutdown_fd to an invalid sentinel to avoid accidental reuse on subsequent cycles or reloads.
if (config->shutdown_fd > 0) { mk_event_timeout_destroy(config->evl, &config->event_shutdown); + config->shutdown_fd = -1; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/flb_engine.c
(7 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/flb_engine.c (3)
src/flb_utils.c (1)
flb_utils_timer_consume
(492-511)lib/monkey/mk_core/mk_event.c (1)
mk_event_timeout_create
(156-163)src/flb_task.c (1)
flb_task_running_print
(299-345)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-centos-7
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: PR - fuzzing test
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
🔇 Additional comments (4)
src/flb_engine.c (4)
136-142
: Good fix: avoid referencing uninitialized retry pointerThe new debug log only references task->id when task->users > 0, addressing the prior crash risk from logging an uninitialized retry pointer.
584-586
: Correct API to consume timer eventsSwitching to flb_utils_timer_consume(fd) for shutdown_fd is the right fix; consuming a timerfd with pipe_byte_consume would lead to EINVAL on short reads.
1022-1025
: Consolidated ingestion-stop logicUsing flb_engine_stop_ingestion(config) centralizes state toggling and input pausing, eliminating duplicated logic and reducing drift.
1075-1083
: Helpful shutdown visibilityEmitting pending chunk counts and printing running tasks during shutdown provides actionable visibility for operators debugging grace periods and backlog behavior.
@edsiper , this code was here before I joined in - I only made minimal changes to fix the reload hang. I'v tested my case with reload on output down - all works like a charm. Could we also backport it to 4.0.x? |
@coderabbitai check the new changes and run a new analysis |
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chainI'll analyze the current changes in the PR to check if the previously identified issues have been addressed and provide a fresh analysis. 🏁 Script executed:
Length of output: 42247 Let me examine the specific changes made in this PR and check if the issues identified by edsiper have been addressed. 🏁 Script executed:
Length of output: 8039 Now let me check the specific lines mentioned by edsiper to see if the issues have been addressed: 🏁 Script executed:
Length of output: 4167 @edsiper I've analyzed the current changes in the PR and I'm pleased to report that all 5 issues you identified have been properly addressed: ✅ Critical Issues Fixed1. Uninitialized Pointer Bug (Lines 137-142)
2. Incorrect Timer Consumption (Line 584)
3. Missing Error Handling (Lines 1043-1053)
✅ Medium Priority Issues Fixed4. Missing Event Priority (Line 1032)
5. Superfluous Code Cleanup (Line 1114)
Code Quality AssessmentThe fixes demonstrate excellent engineering practices:
The changes maintain the original intent of fixing busy chunks on reload while addressing all the safety and correctness issues you identified. This is exactly the kind of focused, high-quality fix that should be merged. Recommendation: The code is now ready for merge. All critical and medium priority issues have been resolved appropriately. |
@edsiper , hello, could we proceed with PR? |
Changes contains:
Will completely fix: #10670.
Testing
Fluent-bit should be started with
Hot_Reload
andEnsure_Thread_Safety
set to True orRetry_Limit
set tono_limits
.🔽 Fluent-bit config
Tested on ubuntu:
debug.log
Documentation
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit
Refactor
Bug Fix
Logging