-
Notifications
You must be signed in to change notification settings - Fork 56
Adopt a linear-time replay algorithm #305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Do we need to make any documentation updates to account for the mentioned breaking changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review feedback. I still have more to review.
The current state of the code should have addressed all PR feedback with the exception of two things:
@cgillum: I'd appreciate if you could make another pass over this PR, resolving your old comments and questions if they've been addressed. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more things (and a couple old ones too).
Co-authored-by: Chris Gillum <cgillum@microsoft.com>
…e-functions-durable-js into dajusto/linear-replay
My latest PR responds to the most recent round of feedback. My bad for missing a few of your previous comments, there was a lot to address :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with this iteration. Remaining comments are optional/non-blocking.
Wohoo, we're finally ready to merge 🎉 I'll add the copyright headers in a different PR, to avoid cluttering this one |
Follow-up to: Azure/azure-functions-durable-python#302
Problem Statement
A DF SDK's primary function is to rehydrate an orchestrator's state by processing its internal History log. We call this: "replaying an orchestrator". Previously, our approach to replay had a quadratic complexity cost in the number of History events.
This PR makes our replay algorithm linear-time instead. We do this by iterating over the orchestration History array only once, which is possible by making better use of the data within each History event. Based on our experience with this algorithm in the DF Python SDK, this change should give us unprecedented performance, correctness, and scalability improvements for JS/TS DF users.
Other contributions
This PR is quite intrusive to our Node.JS SDK: most, if not all, the "hot paths" of our SDK have been re-worked. As a result, this was a good opportunity to evaluate dropping support for legacy behavior and interfaces/types that we were supporting mostly for backwards compatibility's sake and that newer SDKs such as Python does not include.
As a result, my goal after merging this PR is to release a new major version of this SDK.
Outline of changes
A new replay driver
The
TaskOrchestrationExecutor
is the new driver of orchestration replay. Previously, each DF API was responsible for inferring it's corresponding Task status by searching through the History log. Now, this responsibility is centralized in theTaskOrchestrationExecutor
. This class interleaves iterating through the History log with resuming the execution of the user-defined generator (the orchestrator code). While iterating through the History, it seeks to update the state of currently-scheduled ("open") tasks which, when resolved ("closed"), allow us to resume running the generator.Support for the Extension-level OOProc replay schema V2
The DF extension "recently" introduced a new replay protocol for OOProc orchestrations. By subscribing to this new extension-level replay protocol, OOProc SDKs enjoy improved correctness when dealing with
WhenAny
andWhenAll
tasks. This PR adds support to this new protocol.To do this, it changes its representation of user actions depending on the
upperSchemaVersion
field sent by the extension.New task state machines
Tasks are now represented as state machines that can be signaled to update their state. This is particularly helpful in the case of WhenAny/WhenAll tasks, where the task needs to determine its current status by listening to state changes in its child-tasks.
Breaking Changes
There are at least 3 breaking changes introduced in this PR.
We no longer support returning a Task from an orchestrator to be valid. Previously, a statement such as
return context.CallActivity(...)
would be treated asreturn yield context.CallActivity(...)
. This is no longer the case.We no longer support yielding the
ContinueAsNew
API, and doing so would throw an exception. This API is supposed to be fire-and-forget and so it doesn't make sense to support "awaiting" it.All user-facing Task types now inherit from single identifier :
Task
. Additionally, the user-facing Task` types have been simplified and no longer expose various property that were for framework-internal use only. Our users have been asking for a more streamlined experience with Task-types so this is long overdue as well.To my knowledge, this is the full extent of breaking changes, but I feel that the changes in this PR are disruptive enough that there might be one other minor behavior I could be missing. In any case, the behavior of the Node SDK after this PR should be more consistent with the Python SDK, which I believe is a good thing.
Follow-ups
This PR will need to be benchmarked and the changes extensively tested before a release.
Note:
Some files with major changes (like the orchestration context) are showing up as minimized due to large diffs. Don't miss them while reviewing :)
Thanks!