Bad File Descriptor error on rolling restart of RabbitMQ #715
Replies: 3 comments 2 replies
-
@shashankmehra I don't see how concurrent connection recovery by independent connections can be the root cause. Those connections do not share any state. Please submit a PR that would move the handlers around.
-
I don't see how the more generic handler being called is problematic in this case. #716 removes the dedicated `Errno::EBADF` handler.
-
Apologies in advance if this is actually a different issue. I'm having a similar problem in our Kubernetes environment: Sunday server maintenance does an upgrade of the k8s nodes running RabbitMQ and the Ruby apps, which is about the same as a rolling restart. It's not the exact same error, but the same result: an infinite loop of our message worker trying to connect, and it never does. If I replace the subscriber worker (restart its container), it connects fine.

That's got me wondering whether there's some stale state in the app or the container itself causing a problem. I've had no luck reproducing this behaviour locally. I'd probably have to run an actual Kubernetes cluster and try to figure out the sequence of events that causes this (RabbitMQ creation, k8s Service creation, worker pod creation, replacement, etc).

If a restart of the container fixes the problem, I'd guess that a workaround, or even an acceptable fix, would be setting a cap on recovery attempts so the process exits and the container gets restarted.

bunny: 2.24.0
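A rough sketch of the kind of workaround I have in mind, assuming Bunny's `recovery_attempts` / `network_recovery_interval` connection options (option names taken from the Bunny docs as I remember them; the URL, queue name, and thresholds below are placeholders, so treat this as untested):

```ruby
require "bunny"

# Untested sketch: cap automatic recovery so a worker that can never reconnect
# eventually stops retrying, then exit the process so Kubernetes restarts the
# container with fresh state instead of looping forever.
connection = Bunny.new(
  ENV.fetch("AMQP_URL", "amqp://guest:guest@rabbitmq:5672"), # placeholder URL
  automatically_recover:     true,
  network_recovery_interval: 5,  # seconds between recovery attempts
  recovery_attempts:         10  # stop recovering after 10 failed attempts
)
connection.start

channel = connection.create_channel
queue   = channel.queue("jobs", durable: true) # placeholder queue name

# subscribe is non-blocking; the consumer runs on Bunny's own threads, leaving
# the main thread free to act as a liveness watchdog
queue.subscribe do |_delivery_info, _properties, payload|
  puts "received #{payload}"
end

# If recovery gives up, the connection stays closed; exit non-zero so the
# container gets restarted rather than spinning forever.
loop do
  sleep 30
  exit(1) unless connection.open?
end
```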
-
Environment
Issue Description
When performing a rolling restart of RabbitMQ containers, the worker process attempts to recover but gets stuck in an infinite loop of Bad File Descriptor (Errno::EBADF) errors. The application never successfully reconnects, even after all RabbitMQ nodes are back online.
Logs
The complete set of logs is attached here.
logs.txt
Analysis
Looking at the logs, multiple threads (at least three) are trying to recover separate sessions simultaneously. This suggests a race condition in the network recovery mechanism.
I've found a specific issue in the code: in `ReaderLoop#run_loop`, the exception handling for `Errno::EBADF` appears after the exception handling for `SystemCallError`. Since `Errno::EBADF` inherits from `SystemCallError`, the more specific handler is never reached.
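For illustration, a minimal standalone Ruby sketch of that behaviour (not Bunny's actual `ReaderLoop` code): rescue clauses are tried top to bottom and the first matching class wins, so a `SystemCallError` clause listed first also captures `Errno::EBADF`.

```ruby
# Minimal sketch of Ruby rescue-clause ordering; not Bunny's actual code.
def generic_rescue_first
  raise Errno::EBADF
rescue SystemCallError => e
  # Errno::EBADF is a subclass of SystemCallError, so this clause matches first
  "generic: #{e.class}"
rescue Errno::EBADF => e
  # never reached: the clause above already captured the exception
  "specific: #{e.class}"
end

def specific_rescue_first
  raise Errno::EBADF
rescue Errno::EBADF => e
  "specific: #{e.class}"
rescue SystemCallError => e
  "generic: #{e.class}"
end

puts generic_rescue_first   # => generic: Errno::EBADF
puts specific_rescue_first  # => specific: Errno::EBADF
```

Moving the `Errno::EBADF` clause above the `SystemCallError` one would make it reachable; removing it entirely leaves only the generic path with the same behaviour as today.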