-
Notifications
You must be signed in to change notification settings - Fork 89
Log error for old invocation status table #3585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this warning for our users @slinkydeveloper. Did you check whether it adds a noticeable start up penalty when the partition store is a bit larger? If that's the case, then we could remember if this check was executed before and had a negative outcome to avoid it in the future. Apart from this +1 for merging.
error!( | ||
"Detected an invocation with id {invocation_id} stored using the deprecated format. Cancel or kill the invocation.\nTHIS WILL BECOME AN HARD ERROR IN 1.6" | ||
) | ||
} else { | ||
error!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this a warn since Restate will still work after seeing these log messages.
pub fn verify_deprecations(&self, key_range: RangeInclusive<PartitionKey>) { | ||
// Verify invocation status v1 is unused | ||
let Ok(iter) = self.iterator_from::<InvocationStatusKeyV1>( | ||
TableScan::FullScanPartitionKeyRange(key_range), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you check whether this full table scan has a noticeable effect when starting Restate with a larger db?
@@ -606,6 +609,33 @@ impl PartitionStore { | |||
pub fn partition(&self) -> &Arc<Partition> { | |||
self.db.partition() | |||
} | |||
|
|||
pub fn verify_deprecations(&self, key_range: RangeInclusive<PartitionKey>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key range should be available already in self.partition_key_range()
, or do you have a scenario where partial range scan is needed?
error!( | ||
"Detected an invocation with id {invocation_id} stored using the deprecated format. Cancel or kill the invocation.\nTHIS WILL BECOME AN HARD ERROR IN 1.6" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1- Isn't this a big ask for our users? is there a way to auto migrate those invocations instead?
2- What if the server operator is not he owner of the invocation, e.g. restate cloud or a platform team?
3- Is the intention here to warn about a single invocation_id even if the table has many entries?
Side nit: no need for shouting in log messages.
@@ -150,6 +150,8 @@ impl SpawnPartitionProcessorTask { | |||
Err(e) => Err(ProcessorError::from(e)), | |||
}?; | |||
|
|||
partition_store.verify_deprecations(partition.key_range); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should do operations like these on startup. If this is something that can happen lazily (potentially after a partition becomes leader) then we should do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you would move this in Leadership state?
We already lazily execute the migration since 1.2, this is gonna trigger for invocations that have made no progress since 1.1 (not even transitioned to another state, etc). I don't think this is a big ask, the only realistic case where an invocation is blocked since 1.1 is someone doing a sleep of 1 year or similar (very unlikely). |
@AhmedSoliman Also check #1852 for the list of what we've already done in the context of this deprecation/removal. If you have alternative ideas for for removing this piece of code, that would be appreciated. |
@slinkydeveloper is it possible to auto migrate the remaining invocations? Imagine if a user is running on v1.1 and successively upgrades to v1.2, v1.3, then v1.4. They might not have enough time for all their invocations to change states. |
Part of #1852