Bad File Descriptor error on rolling restart of RabbitMQ #715
Replies: 3 comments 2 replies
-
@shashankmehra I don't see how concurrent connection recovery by independent connections can be the root cause. Those connections do not share any state. Please submit a PR that would move the handlers around.
-
I don't see how the more generic handler being called is problematic in this case. #716 removes the dedicated `Errno::EBADF` handler.
-
Apologies in advance if this is actually a different issue. I'm having a similar problem in our Kubernetes environment: Sunday server maintenance does an upgrade of the k8s nodes running RabbitMQ and the Ruby apps, which is about the same as a rolling restart. It's not the exact same error, but the same result: an infinite loop of our message worker trying to connect, and it never does. If I replace the subscriber worker (restart its container), it connects fine.

That's got me wondering whether there's some stale state in the app or the container itself causing a problem. I've had no luck reproducing this behaviour locally. I'd probably have to run an actual Kubernetes cluster and try to figure out the sequence of events that causes this (RabbitMQ creation, k8s Service creation, worker pod creation, replacement, etc).

If a restart of the container fixes the problem, I'd guess that a workaround, or even an acceptable fix, would be setting a cap on recovery attempts so the process exits and the container gets restarted.

bunny: 2.24.0
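A rough sketch of the kind of workaround I have in mind, assuming Bunny's `recovery_attempts` / `network_recovery_interval` connection options (option names taken from the Bunny docs as I remember them; the URL, queue name, and thresholds below are placeholders, so treat this as untested):

```ruby
require "bunny"

# Untested sketch: cap automatic recovery so a worker that can never reconnect
# eventually stops retrying, then exit the process so Kubernetes restarts the
# container with fresh state instead of looping forever.
connection = Bunny.new(
  ENV.fetch("AMQP_URL", "amqp://guest:guest@rabbitmq:5672"), # placeholder URL
  automatically_recover:     true,
  network_recovery_interval: 5,  # seconds between recovery attempts
  recovery_attempts:         10  # stop recovering after 10 failed attempts
)
connection.start

channel = connection.create_channel
queue   = channel.queue("jobs", durable: true) # placeholder queue name

# subscribe is non-blocking; the consumer runs on Bunny's own threads, leaving
# the main thread free to act as a liveness watchdog
queue.subscribe do |_delivery_info, _properties, payload|
  puts "received #{payload}"
end

# If recovery gives up, the connection stays closed; exit non-zero so the
# container gets restarted rather than spinning forever.
loop do
  sleep 30
  exit(1) unless connection.open?
end
```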
-
Environment
Issue Description
When performing a rolling restart of RabbitMQ containers, the worker process attempts to recover but gets stuck in an infinite loop of Bad File Descriptor (Errno::EBADF) errors. The application never successfully reconnects, even after all RabbitMQ nodes are back online.
Logs
The complete set of logs is attached here.
logs.txt
Analysis
Looking at the logs, multiple threads (at least three) are trying to recover separate sessions simultaneously. This suggests a race condition in the network recovery mechanism.
I've found a specific issue in the code: in `ReaderLoop#run_loop`, the exception handling for `Errno::EBADF` appears after the exception handling for `SystemCallError`. Since `Errno::EBADF` inherits from `SystemCallError`, the more specific handler is never reached.
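For illustration, a minimal standalone Ruby sketch of that behaviour (not Bunny's actual `ReaderLoop` code): rescue clauses are tried top to bottom and the first matching class wins, so a `SystemCallError` clause listed first also captures `Errno::EBADF`.

```ruby
# Minimal sketch of Ruby rescue-clause ordering; not Bunny's actual code.
def generic_rescue_first
  raise Errno::EBADF
rescue SystemCallError => e
  # Errno::EBADF is a subclass of SystemCallError, so this clause matches first
  "generic: #{e.class}"
rescue Errno::EBADF => e
  # never reached: the clause above already captured the exception
  "specific: #{e.class}"
end

def specific_rescue_first
  raise Errno::EBADF
rescue Errno::EBADF => e
  "specific: #{e.class}"
rescue SystemCallError => e
  "generic: #{e.class}"
end

puts generic_rescue_first   # => generic: Errno::EBADF
puts specific_rescue_first  # => specific: Errno::EBADF
```

Moving the `Errno::EBADF` clause above the `SystemCallError` one would make it reachable; removing it entirely leaves only the generic path with the same behaviour as today.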