Multi-threaded topology recovery #370


Merged · 7 commits · May 28, 2018

Conversation

@vikinghawk (Contributor) commented May 21, 2018

Proposed Changes

Decrease recovery time for connections that have lots of queues and bindings by recovering entities (exchanges, queues, bindings, and consumers) in parallel using multiple threads. One of our applications has hundreds of channels and thousands of queues and bindings on a single connection. This PR makes multi-threaded recovery configurable and opt-in; the default is still single-threaded recovery on the main connection thread. I can add Javadoc and clean it up a bit if the team is interested in pulling this in.
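
A minimal sketch of the approach described above (not the PR's actual code; RecordedEntity and the method names are hypothetical stand-ins): entities are grouped by owning channel so each channel recovers its entities sequentially while different channels run in parallel, and only one type of entity is recovered at a time.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;

// Hypothetical stand-in for the client's recorded-entity classes.
interface RecordedEntity {
    int channelNumber();
    void recover() throws Exception;
}

public class ParallelRecoverySketch {

    // Group entities by owning channel: a channel is single-threaded, so each
    // channel's entities are recovered sequentially within one task, while
    // different channels are recovered in parallel.
    static List<Callable<Void>> groupByChannel(Collection<? extends RecordedEntity> entities) {
        Map<Integer, List<RecordedEntity>> byChannel = new HashMap<>();
        for (RecordedEntity entity : entities) {
            byChannel.computeIfAbsent(entity.channelNumber(), k -> new ArrayList<>()).add(entity);
        }
        List<Callable<Void>> tasks = new ArrayList<>();
        for (List<RecordedEntity> group : byChannel.values()) {
            tasks.add(() -> {
                for (RecordedEntity entity : group) {
                    entity.recover();
                }
                return null;
            });
        }
        return tasks;
    }

    // Recover one entity type at a time: a binding on one channel may reference
    // a queue still being recovered on another channel, so types are not interleaved.
    static void recoverTopology(Collection<? extends RecordedEntity> exchanges,
                                Collection<? extends RecordedEntity> queues,
                                Collection<? extends RecordedEntity> bindings,
                                Collection<? extends RecordedEntity> consumers,
                                ExecutorService executor) throws InterruptedException {
        executor.invokeAll(groupByChannel(exchanges)); // invokeAll blocks until all tasks finish
        executor.invokeAll(groupByChannel(queues));
        executor.invokeAll(groupByChannel(bindings));
        executor.invokeAll(groupByChannel(consumers));
    }
}
```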

In my testing with our workload I have found recovery time to improve by about 5x using 8 threads vs. 1. More than 8 threads seems to have diminishing returns (which could be related to the VM I was testing from, which only had 2 cores). In one case recovery time went from 26 seconds to 5 seconds, and in another from 119 seconds to 25.

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation (correction or otherwise)
  • Cosmetics (whitespace, appearance)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CLA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

Further Comments

@michaelklishin (Contributor)

This is pretty interesting, and I like the way you avoid natural race conditions here. It would be interesting to see what kind of tricky scenarios we can add to the test suite before we put it through a few hundred runs ;)

@michaelklishin (Contributor)

@vikinghawk @acogoluegnes and I discussed this and we'd like to add a couple of things:

  • Make sure ConnectionFactory's thread factory is used when the client has to instantiate an executor service
  • Add an option to use a user-provided executor

Also, can you share any background on this, or numbers on how much quicker parallel recovery finishes? Thank you!

// Support recovering entities in parallel for connections that have a lot of queues, bindings, & consumers
// A channel is single threaded, so group things by channel and recover 1 entity at a time per channel
// We still need to recover 1 type of entity at a time in case channel1 has a binding to a queue that is currently owned and being recovered by channel2 for example
final ExecutorService executor = Executors.newFixedThreadPool(recoveryThreads, delegate.getThreadFactory());
Contributor:

Creating a new pool of threads can be a concern in some environments, so maybe we should just add (yet) another executor service to the connection factory. Parallel topology recovery would occur only if this executor service is not null (so we could get rid of the recoveryThreads property).

Contributor Author:

done
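
For reference, a minimal sketch of the resulting opt-in configuration, assuming the ConnectionFactory setter ended up being named setTopologyRecoveryExecutor and that getThreadFactory exposes the factory's thread factory; the pool size of 8 is just illustrative:

```java
import com.rabbitmq.client.ConnectionFactory;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelRecoveryConfigSketch {

    public static ConnectionFactory configure() {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setAutomaticRecoveryEnabled(true);
        factory.setTopologyRecoveryEnabled(true);

        // Parallel topology recovery is opt-in: it only kicks in when an executor is supplied.
        ExecutorService recoveryExecutor =
                Executors.newFixedThreadPool(8, factory.getThreadFactory());
        factory.setTopologyRecoveryExecutor(recoveryExecutor);

        // The client does not shut this executor down, so the application should
        // do so once the connection is closed for good.
        return factory;
    }
}
```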

TopologyRecoveryException e = new TopologyRecoveryException(message, cause);
this.getExceptionHandler().handleTopologyRecoveryException(delegate, x.getDelegateChannel(), e);
// invokeAll will block until all callables are completed
executor.invokeAll(groupEntitiesByChannel(Utility.copy(recordedExchanges).values()));
Contributor:

Maybe we could get the Futures back, check that there are as many as there are inputs and that they are all done, and log a warning if not.
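
A sketch of the suggested check, with a placeholder slf4j logger; because invokeAll blocks until every task has completed or been cancelled, a future that is not done here would point at an interrupted recovery run:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class RecoveryChecks {

    private static final Logger LOGGER = LoggerFactory.getLogger(RecoveryChecks.class);

    // invokeAll blocks until every task has completed or been cancelled, so a
    // future that is not done here indicates an interrupted recovery run.
    static <T> void invokeAllAndWarn(ExecutorService executor, List<Callable<T>> tasks)
            throws InterruptedException {
        List<Future<T>> futures = executor.invokeAll(tasks);
        if (futures.size() != tasks.size()) {
            LOGGER.warn("Expected {} topology recovery results, got {}", tasks.size(), futures.size());
        }
        for (Future<T> future : futures) {
            if (!future.isDone()) {
                LOGGER.warn("A topology recovery task did not complete");
            }
        }
    }
}
```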

executor.invokeAll(groupEntitiesByChannel(Utility.copy(recordedBindings)));
executor.invokeAll(groupEntitiesByChannel(Utility.copy(consumers).values()));
} finally {
executor.shutdownNow();
Contributor:

Check it returns an empty list and log a warning if not?

Contributor Author:

No longer calling shutdown, since the executor is now passed in and needs to be reused on subsequent recoveries.
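
For completeness, a sketch of the check suggested above (placeholder slf4j logger); it only applied to the earlier revision that owned and shut down its executor, since the merged version keeps the user-provided executor alive across recoveries:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class RecoveryShutdownCheck {

    private static final Logger LOGGER = LoggerFactory.getLogger(RecoveryShutdownCheck.class);

    // shutdownNow() returns the tasks that were queued but never started;
    // an empty list means every submitted recovery task at least began running.
    static void shutdownAndWarn(ExecutorService executor) {
        List<Runnable> neverStarted = executor.shutdownNow();
        if (!neverStarted.isEmpty()) {
            LOGGER.warn("{} topology recovery tasks never started before shutdown", neverStarted.size());
        }
    }
}
```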

@acogoluegnes (Contributor)

@vikinghawk Thanks for this contribution. I added a few comments that are worth addressing. The CI build also failed:

Tests run: 403, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1,243.408 sec <<< FAILURE! - in com.rabbitmq.client.test.server.HATests
recoveryWithMultipleThreads(com.rabbitmq.client.test.functional.ConnectionRecovery)  Time elapsed: 11.958 sec  <<< ERROR!
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at com.rabbitmq.client.test.functional.ConnectionRecovery.recoveryWithMultipleThreads(ConnectionRecovery.java:849)

I haven't run it yet; we're kind of busy this week with a team meeting.

@vikinghawk (Contributor Author) commented May 24, 2018

Changes made as requested. I believe I fixed the test as well, but I haven't had time to set up my new machine yet to run the tests.

My testing scenario:
2-node cluster with roughly 600 connections, 4.5k channels, 10.5k queues & bindings, 50k consumers, and 300 msgs/second of traffic.

Testing on a single connection with 190 channels, 1500 non-durable/auto-delete queues, and 200 durable/mirrored queues; 1 topic exchange binding and 1 consumer per queue.

Threads | Total Recovery Time | Channel Time (secs) | Queue Time (secs) | Binding Time (secs) | Consumer Time (secs)
------- | ------------------- | ------------------- | ----------------- | ------------------- | --------------------
1       | 26 secs             | 1                   | 12                | 11                  | 2
8       | 5 secs              | 1                   | 2                 | 2                   | 0.5

Testing on a single connection with 190 channels and 1700 durable/mirrored queues; 1 topic exchange binding and 1 consumer per queue.

Note: this client was on a small VM with 2 vCPUs, so it's possible that more than 8 threads could be beneficial on a bigger machine.

Threads | Total Recovery Time | Channel Time (secs) | Queue Time (secs) | Binding Time (secs) | Consumer Time (secs)
------- | ------------------- | ------------------- | ----------------- | ------------------- | --------------------
1       | 119 secs            | 12                  | 82                | 23                  | 2
4       | 41 secs             | 12                  | 22                | 6                   | 1
8       | 26 secs             | 11                  | 11                | 4                   | 1
16      | 23 secs             | 12                  | 6                 | 4                   | 1

I did try to recover channels in parallel as well at one point, but it did not appear to be any faster, so I backed it out. I assume there is some synchronization point on the connection, either client-side or server-side.

@michaelklishin (Contributor)

Thanks, those numbers suggest the difference is pretty significant (which is expected, but it is always nice to see some benchmark evidence). We will continue with QA.

@acogoluegnes acogoluegnes self-assigned this May 28, 2018
@acogoluegnes acogoluegnes added this to the 4.7.0 milestone May 28, 2018
@acogoluegnes acogoluegnes merged commit f892aaa into rabbitmq:4.x.x-stable May 28, 2018
@michaelklishin (Contributor)

@vikinghawk thank you for the (as always) quality contribution!

acogoluegnes added a commit that referenced this pull request May 28, 2018
References #370
@acogoluegnes (Contributor)

@vikinghawk Merged, thanks! This will be in 4.7.0, we'll produce a release candidate in the next few days.
