-
Notifications
You must be signed in to change notification settings - Fork 596
commit_builder: allow recreating commit, don't create cycle in parent… #6950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… mapping When we stop including predecessors in the Commit object, we're going to sometimes end up creating commits with the same id. Commit f438ac2 made that an error. However, I don't think we can really prevent these cases, so we need to handle them better instead. This patch therefore removes the error and instead makes sure that we don't create direct cycles in the predecessor and parent mappings. There's still a risk of creating longer cycles, but I don't think we have any cases where that can happen, so this should be good enough in practice for now at least. We ran into this in one of our backends at Google. That backend uses the Git hashing scheme, so it doesn't include the predecessors and it has second-level timestamp resolution. It would therefore somewhat frequently end up rewriting a commit with the same id as already existed.
0990b63
to
6cd0f61
Compare
if mut_repo.is_backed_by_default_index() && mut_repo.index().has_id(commit.id()) { | ||
// Recording existing commit as new would create cycle in | ||
// predecessors/parent mappings within the current transaction, and | ||
// in predecessors graph globally. | ||
return Err(BackendError::Other( | ||
format!("Newly-created commit {id} already exists", id = commit.id()).into(), | ||
)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you leave this error and add special case for commit.id() == rewrite_source.id()
?
We don't handle indirect cycles within/across operations. evolog
won't detect cycles across operations, but op diff
can notice cycles in accumulated predecessors graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So instead of filtering out the call to mut_repo.set_rewritten_commit()
we make it an error here?
The case we ran into at Google was code like this:
let mut tx = repo.start_transaction();
tx.repo_mut().rewrite_commit(&commit).set_description(description).write()?;
tx.commit("set description")?;
If the description is unchanged and this code is run within the same second as the commit was created, then we end up with the new commit matching the old commit here. I kind of feel like it should be allowed to write such code without checking for the previous description, but I also see value in erroring out so we learn about these cases (I've since updated the code at Google to avoid creating any commits or operations). So I think I'm saying that I think I'm okay with the error for now as I think you're suggesting, but I also wonder if we should allow such cases in the long term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, creating an existing commit should be an error in general, because we don't know how such commit should be recorded as rewritten. I also think it's okay to accept no-op rewrite (as a special case) because that's common and relatively easy to work around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. I think it should be allowed to create an existing commit because it's a strange requirement to put on callers to first check if the commit already exists, and it's going to be a racy check too. So I think we will need to allow callers to create an existing commit. I think even requiring that it's not a visible commit is an unnecessary restriction. What is the caller supposed to do? As I said, I think we should even allow code like the snippet above (i.e. predecessor cycles), but I don't mind if we're more restrictive to start with. In my mind, checking that the description is different in that code is mostly an optimization; the caller should be allowed to have the naive code if they prefer.
I'm okay with either allowing predecessor cycles or rejecting them for now. I don't feel strongly either way, but I do feel strongly that we should allow creating existing commits. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the caller needs to check the precondition. In practice, commit.id() == rewrite_source.id()
would be the only situation where the existing commit can reappear. The user could set fake timestamp to trigger the error, but I think it's okay to just return an error in that case.
I'm okay with either allowing predecessor cycles or rejecting them for now. I don't feel strongly either way, but I do feel strongly that we should allow creating existing commits. Does that make sense?
Since we don't have a good way to handle indirect predecessors/rewrite cycles, I would just unblock the common case commit.id() == rewrite_source.id()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about cases like where someone wants to import history from some other VCS.
My assumption is that we won't use CommitBuilder
when importing history losslessly. It wouldn't make much sense to generate predecessors/rewrite history for the import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant if someone writes a script for importing history.
I suspect we're going to have to remove the error at some point, but if you feel strongly about it, I can leave the error for now. What do you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we're going to have to remove the error at some point, but if you feel strongly about it, I can leave the error for now. What do you prefer?
I would leave the error because the current implementation can't handle cycles (and will either panic or error out.) If we find a better way to deal with that, I think it's good to remove the restriction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant if someone writes a script for importing history.
Is the script the thing that interfaces with git-fast-import or the fast import backend itself, which has options for date formats?
Also, just a reminder that git commit
has a flag to supply a date, which I expected jj
to have as well. Breezy has one also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just talking about a hypothetical script. I've written such scripts for importing from ClearCase to Git in the past, for example. Anyway, it was just an example. I think we need to handle recreating commits regardless of why someone happens to create them.
mut_repo.add_head(&commit)?; | ||
mut_repo.set_predecessors(commit.id().clone(), predecessors); | ||
if !predecessors.contains(commit.id()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would use commit.id() != rewrite_source.id()
for consistency. If the caller set weird predecessors, I think panicking is good.
You likely already considered this, and I mainly looked at the commit description, but perhaps it's worth mentioning, and in any case I'm curious: does our current trick of subtracting a second from the committer time-stamp to de-duplicate commit ids not help here? One worry is that, in the worst case, you could keep subtracting for a while. Update Ah, I guess the main problem is that you might need to write the commits concurrently on different machines, and there isn't a good way to reserve the commit id. If this is the problem, mentioning concurrency in the commit description might help. |
We (both Yuya and I, I think) hope to get rid of that hack once we no longer store any commit-linked metadata outside of the Git commit object. @yuja : Perhaps what I'm missing is what the problem with cycles is. Is it the kind of cycle you would get if you rewrote commit A to commit B in operation X and then rewrote commit B to commit A in operation Y? If that's it, what problem does that kind of cycle cause? |
I see, the old problem of duplicating a commit and then updating both copies in the same second (to make their committer dates the same) will no longer result in commits with the same hash if the change id is stored inside the commit. Though, if somebody (say on a remote) rebased both commits with some tool that strips the change id header, they might have some trouble. (This is of course out of scope for this pr) |
+1
An immediate problem is that |
Yeah, I guessed that might work so long as the graph renderer doesn't keep track of emitted nodes to detect data inconsistency.
No. Just curious, why do you want to remove the error checking? I know you have a real problem caused by the length-1 cycles, but I don't see why we also need to unblock the other unusual cases without addressing the underlying problems. |
Only because I think it's the right thing to do. Since we've fixed our server at Google to not create operations that do no-op updates of commit descriptions, it doesn't matter to us anymore AFAIK.
I do want to address the underlying problems. Preventing intra-operation cycles is just the only problem I'm aware of, so I thought that if we prevent that, then we can allow recreating existing commits. What would make you comfortable with removing the error? To be clear, there's no rush from my side. |
I see. As I said, inter-operation cycles can also cause problems atm, so I don't want to remove the error. That's the point. I don't know if it's practical to accept cycles in the predecessors graph. It might be just simpler to store predecessors (or random bytes, incremented timestamp, etc.) in the commit object to prevent this kind of problems. |
Ah, you're talking about the cycle-detection in |
Yes, and some other potential problems on graph filtering/grouping. iirc, |
Checklist
If applicable:
CHANGELOG.md
README.md
,docs/
,demos/
)cli/src/config-schema.json
)