Skip to content

commit_builder: allow recreating commit, don't create cycle in parent… #6950

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

martinvonz
Copy link
Member

Checklist

If applicable:

  • I have updated CHANGELOG.md
  • I have updated the documentation (README.md, docs/, demos/)
  • I have updated the config schema (cli/src/config-schema.json)
  • I have added/updated tests to cover my changes

@martinvonz martinvonz requested a review from a team as a code owner July 10, 2025 02:04
… mapping

When we stop including predecessors in the Commit object, we're going
to sometimes end up creating commits with the same id. Commit
f438ac2 made that an error. However, I don't think we can really
prevent these cases, so we need to handle them better instead. This
patch therefore removes the error and instead makes sure that we don't
create direct cycles in the predecessor and parent mappings. There's
still a risk of creating longer cycles, but I don't think we have any
cases where that can happen, so this should be good enough in practice
for now at least.

We ran into this in one of our backends at Google. That backend uses
the Git hashing scheme, so it doesn't include the predecessors and it
has second-level timestamp resolution. It would therefore somewhat
frequently end up rewriting a commit with the same id as already
existed.
Comment on lines -356 to -363
if mut_repo.is_backed_by_default_index() && mut_repo.index().has_id(commit.id()) {
// Recording existing commit as new would create cycle in
// predecessors/parent mappings within the current transaction, and
// in predecessors graph globally.
return Err(BackendError::Other(
format!("Newly-created commit {id} already exists", id = commit.id()).into(),
));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you leave this error and add special case for commit.id() == rewrite_source.id()?

We don't handle indirect cycles within/across operations. evolog won't detect cycles across operations, but op diff can notice cycles in accumulated predecessors graph.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So instead of filtering out the call to mut_repo.set_rewritten_commit() we make it an error here?

The case we ran into at Google was code like this:

let mut tx = repo.start_transaction();
tx.repo_mut().rewrite_commit(&commit).set_description(description).write()?;
tx.commit("set description")?;

If the description is unchanged and this code is run within the same second as the commit was created, then we end up with the new commit matching the old commit here. I kind of feel like it should be allowed to write such code without checking for the previous description, but I also see value in erroring out so we learn about these cases (I've since updated the code at Google to avoid creating any commits or operations). So I think I'm saying that I think I'm okay with the error for now as I think you're suggesting, but I also wonder if we should allow such cases in the long term.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, creating an existing commit should be an error in general, because we don't know how such commit should be recorded as rewritten. I also think it's okay to accept no-op rewrite (as a special case) because that's common and relatively easy to work around.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. I think it should be allowed to create an existing commit because it's a strange requirement to put on callers to first check if the commit already exists, and it's going to be a racy check too. So I think we will need to allow callers to create an existing commit. I think even requiring that it's not a visible commit is an unnecessary restriction. What is the caller supposed to do? As I said, I think we should even allow code like the snippet above (i.e. predecessor cycles), but I don't mind if we're more restrictive to start with. In my mind, checking that the description is different in that code is mostly an optimization; the caller should be allowed to have the naive code if they prefer.

I'm okay with either allowing predecessor cycles or rejecting them for now. I don't feel strongly either way, but I do feel strongly that we should allow creating existing commits. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the caller needs to check the precondition. In practice, commit.id() == rewrite_source.id() would be the only situation where the existing commit can reappear. The user could set fake timestamp to trigger the error, but I think it's okay to just return an error in that case.

I'm okay with either allowing predecessor cycles or rejecting them for now. I don't feel strongly either way, but I do feel strongly that we should allow creating existing commits. Does that make sense?

Since we don't have a good way to handle indirect predecessors/rewrite cycles, I would just unblock the common case commit.id() == rewrite_source.id().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about cases like where someone wants to import history from some other VCS.

My assumption is that we won't use CommitBuilder when importing history losslessly. It wouldn't make much sense to generate predecessors/rewrite history for the import.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant if someone writes a script for importing history.

I suspect we're going to have to remove the error at some point, but if you feel strongly about it, I can leave the error for now. What do you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we're going to have to remove the error at some point, but if you feel strongly about it, I can leave the error for now. What do you prefer?

I would leave the error because the current implementation can't handle cycles (and will either panic or error out.) If we find a better way to deal with that, I think it's good to remove the restriction.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant if someone writes a script for importing history.

Is the script the thing that interfaces with git-fast-import or the fast import backend itself, which has options for date formats?
Also, just a reminder that git commit has a flag to supply a date, which I expected jj to have as well. Breezy has one also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just talking about a hypothetical script. I've written such scripts for importing from ClearCase to Git in the past, for example. Anyway, it was just an example. I think we need to handle recreating commits regardless of why someone happens to create them.

mut_repo.add_head(&commit)?;
mut_repo.set_predecessors(commit.id().clone(), predecessors);
if !predecessors.contains(commit.id()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would use commit.id() != rewrite_source.id() for consistency. If the caller set weird predecessors, I think panicking is good.

@ilyagr
Copy link
Contributor

ilyagr commented Jul 15, 2025

You likely already considered this, and I mainly looked at the commit description, but perhaps it's worth mentioning, and in any case I'm curious: does our current trick of subtracting a second from the committer time-stamp to de-duplicate commit ids not help here?

One worry is that, in the worst case, you could keep subtracting for a while.

Update Ah, I guess the main problem is that you might need to write the commits concurrently on different machines, and there isn't a good way to reserve the commit id.

If this is the problem, mentioning concurrency in the commit description might help.

@martinvonz
Copy link
Member Author

You likely already considered this, and I mainly looked at the commit description, but perhaps it's worth mentioning, and in any case I'm curious: does our current trick of subtracting a second from the committer time-stamp to de-duplicate commit ids not help here?

We (both Yuya and I, I think) hope to get rid of that hack once we no longer store any commit-linked metadata outside of the Git commit object.

@yuja : Perhaps what I'm missing is what the problem with cycles is. Is it the kind of cycle you would get if you rewrote commit A to commit B in operation X and then rewrote commit B to commit A in operation Y? If that's it, what problem does that kind of cycle cause?

@ilyagr
Copy link
Contributor

ilyagr commented Jul 15, 2025

I see, the old problem of duplicating a commit and then updating both copies in the same second (to make their committer dates the same) will no longer result in commits with the same hash if the change id is stored inside the commit.

Though, if somebody (say on a remote) rebased both commits with some tool that strips the change id header, they might have some trouble.

(This is of course out of scope for this pr)

@yuja
Copy link
Contributor

yuja commented Jul 15, 2025

We (both Yuya and I, I think) hope to get rid of that hack once we no longer store any commit-linked metadata outside of the Git commit object.

+1

Perhaps what I'm missing is what the problem with cycles is. Is it the kind of cycle you would get if you rewrote commit A to commit B in operation X and then rewrote commit B to commit A in operation Y? If that's it, what problem does that kind of cycle cause?

An immediate problem is that MutableRepo would panic if cycle existed within the same operation. I'm not pretty sure if inter-operation cycles should be banned, but the existence of cycles will mean that we can't simply apply DAG algorithms. For example, I don't know if Sapling's graph renderer can deal with cyclic graph.

@martinvonz
Copy link
Member Author

An immediate problem is that MutableRepo would panic if cycle existed within the same operation. I'm not pretty sure if inter-operation cycles should be banned, but the existence of cycles will mean that we can't simply apply DAG algorithms. For example, I don't know if Sapling's graph renderer can deal with cyclic graph.

That seems to behave reasonably at least in this simple case (created by a hacked jj binary):

Screenshot 2025-07-16 at 07 54 12

It looks funny with the two working-copy nodes but I think that's the right thing to do. Other than evolog, I don't think we have anything that walks the predecessors, right?

Regarding intra-operation cycles, I can add a stricter check in addition to filtering out length-1 cycles as this PR currently does. Would that make you comfortable with removing the error for recreating commits?

@yuja
Copy link
Contributor

yuja commented Jul 17, 2025

An immediate problem is that MutableRepo would panic if cycle existed within the same operation. I'm not pretty sure if inter-operation cycles should be banned, but the existence of cycles will mean that we can't simply apply DAG algorithms. For example, I don't know if Sapling's graph renderer can deal with cyclic graph.

That seems to behave reasonably at least in this simple case (created by a hacked jj binary):

Yeah, I guessed that might work so long as the graph renderer doesn't keep track of emitted nodes to detect data inconsistency.

It looks funny with the two working-copy nodes but I think that's the right thing to do. Other than evolog, I don't think we have anything that walks the predecessors, right?

jj op diff will do, and transitive predecessors will be resolved there unlike evolog. Another potential problem is that we might add some revset-like filtering to evolog, and cycles could break topological sorting of the filtered graph.

Regarding intra-operation cycles, I can add a stricter check in addition to filtering out length-1 cycles as this PR currently does. Would that make you comfortable with removing the error for recreating commits?

No. Just curious, why do you want to remove the error checking? I know you have a real problem caused by the length-1 cycles, but I don't see why we also need to unblock the other unusual cases without addressing the underlying problems.

@martinvonz
Copy link
Member Author

Just curious, why do you want to remove the error checking?

Only because I think it's the right thing to do. Since we've fixed our server at Google to not create operations that do no-op updates of commit descriptions, it doesn't matter to us anymore AFAIK.

I know you have a real problem caused by the length-1 cycles, but I don't see why we also need to unblock the other unusual cases without addressing the underlying problems.

I do want to address the underlying problems. Preventing intra-operation cycles is just the only problem I'm aware of, so I thought that if we prevent that, then we can allow recreating existing commits.

What would make you comfortable with removing the error? To be clear, there's no rush from my side.

@yuja
Copy link
Contributor

yuja commented Jul 17, 2025

I do want to address the underlying problems. Preventing intra-operation cycles is just the only problem I'm aware of, so I thought that if we prevent that, then we can allow recreating existing commits.

What would make you comfortable with removing the error? To be clear, there's no rush from my side.

I see. As I said, inter-operation cycles can also cause problems atm, so I don't want to remove the error. That's the point.

I don't know if it's practical to accept cycles in the predecessors graph. It might be just simpler to store predecessors (or random bytes, incremented timestamp, etc.) in the commit object to prevent this kind of problems.

@martinvonz
Copy link
Member Author

I see. As I said, inter-operation cycles can also cause problems atm, so I don't want to remove the error. That's the point.

Ah, you're talking about the cycle-detection in jj op diff you mentioned earlier, right? Sorry, I missed that until now. I think we'll need to try to fix that somehow. I agree that this PR should wait until we've fixed that. Will convert this to a draft until then.

@martinvonz martinvonz marked this pull request as draft July 20, 2025 18:30
@yuja
Copy link
Contributor

yuja commented Jul 21, 2025

Ah, you're talking about the cycle-detection in jj op diff you mentioned earlier, right?

Yes, and some other potential problems on graph filtering/grouping. iirc, TopoGroupedGraphIterator doesn't work if multiple items of the same identity were queued.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants