commit_builder: allow recreating commit, don't create cycle in parent… #6950

martinvonz · 2025-07-10T02:04:28Z

Checklist

If applicable:

I have updated CHANGELOG.md
I have updated the documentation (README.md, docs/, demos/)
I have updated the config schema (cli/src/config-schema.json)
I have added/updated tests to cover my changes

… mapping When we stop including predecessors in the Commit object, we're going to sometimes end up creating commits with the same id. Commit f438ac2 made that an error. However, I don't think we can really prevent these cases, so we need to handle them better instead. This patch therefore removes the error and instead makes sure that we don't create direct cycles in the predecessor and parent mappings. There's still a risk of creating longer cycles, but I don't think we have any cases where that can happen, so this should be good enough in practice for now at least. We ran into this in one of our backends at Google. That backend uses the Git hashing scheme, so it doesn't include the predecessors and it has second-level timestamp resolution. It would therefore somewhat frequently end up rewriting a commit with the same id as already existed.

yuja · 2025-07-10T08:32:06Z

lib/src/commit_builder.rs

-        if mut_repo.is_backed_by_default_index() && mut_repo.index().has_id(commit.id()) {
-            // Recording existing commit as new would create cycle in
-            // predecessors/parent mappings within the current transaction, and
-            // in predecessors graph globally.
-            return Err(BackendError::Other(
-                format!("Newly-created commit {id} already exists", id = commit.id()).into(),
-            ));
-        }


Can you leave this error and add special case for commit.id() == rewrite_source.id()?

We don't handle indirect cycles within/across operations. evolog won't detect cycles across operations, but op diff can notice cycles in accumulated predecessors graph.

So instead of filtering out the call to mut_repo.set_rewritten_commit() we make it an error here?

The case we ran into at Google was code like this:

let mut tx = repo.start_transaction(); tx.repo_mut().rewrite_commit(&commit).set_description(description).write()?; tx.commit("set description")?;

If the description is unchanged and this code is run within the same second as the commit was created, then we end up with the new commit matching the old commit here. I kind of feel like it should be allowed to write such code without checking for the previous description, but I also see value in erroring out so we learn about these cases (I've since updated the code at Google to avoid creating any commits or operations). So I think I'm saying that I think I'm okay with the error for now as I think you're suggesting, but I also wonder if we should allow such cases in the long term.

IMHO, creating an existing commit should be an error in general, because we don't know how such commit should be recorded as rewritten. I also think it's okay to accept no-op rewrite (as a special case) because that's common and relatively easy to work around.

I disagree. I think it should be allowed to create an existing commit because it's a strange requirement to put on callers to first check if the commit already exists, and it's going to be a racy check too. So I think we will need to allow callers to create an existing commit. I think even requiring that it's not a visible commit is an unnecessary restriction. What is the caller supposed to do? As I said, I think we should even allow code like the snippet above (i.e. predecessor cycles), but I don't mind if we're more restrictive to start with. In my mind, checking that the description is different in that code is mostly an optimization; the caller should be allowed to have the naive code if they prefer.

I'm okay with either allowing predecessor cycles or rejecting them for now. I don't feel strongly either way, but I do feel strongly that we should allow creating existing commits. Does that make sense?

I don't think the caller needs to check the precondition. In practice, commit.id() == rewrite_source.id() would be the only situation where the existing commit can reappear. The user could set fake timestamp to trigger the error, but I think it's okay to just return an error in that case.

I'm okay with either allowing predecessor cycles or rejecting them for now. I don't feel strongly either way, but I do feel strongly that we should allow creating existing commits. Does that make sense?

Since we don't have a good way to handle indirect predecessors/rewrite cycles, I would just unblock the common case commit.id() == rewrite_source.id().

I was thinking about cases like where someone wants to import history from some other VCS.

My assumption is that we won't use CommitBuilder when importing history losslessly. It wouldn't make much sense to generate predecessors/rewrite history for the import.

I meant if someone writes a script for importing history.

I suspect we're going to have to remove the error at some point, but if you feel strongly about it, I can leave the error for now. What do you prefer?

I suspect we're going to have to remove the error at some point, but if you feel strongly about it, I can leave the error for now. What do you prefer?

I would leave the error because the current implementation can't handle cycles (and will either panic or error out.) If we find a better way to deal with that, I think it's good to remove the restriction.

I meant if someone writes a script for importing history.

Is the script the thing that interfaces with git-fast-import or the fast import backend itself, which has options for date formats?
Also, just a reminder that git commit has a flag to supply a date, which I expected jj to have as well. Breezy has one also.

I was just talking about a hypothetical script. I've written such scripts for importing from ClearCase to Git in the past, for example. Anyway, it was just an example. I think we need to handle recreating commits regardless of why someone happens to create them.

yuja · 2025-07-10T08:34:30Z

lib/src/commit_builder.rs

        mut_repo.add_head(&commit)?;
-        mut_repo.set_predecessors(commit.id().clone(), predecessors);
+        if !predecessors.contains(commit.id()) {


nit: I would use commit.id() != rewrite_source.id() for consistency. If the caller set weird predecessors, I think panicking is good.

ilyagr · 2025-07-15T06:16:26Z

You likely already considered this, and I mainly looked at the commit description, but perhaps it's worth mentioning, and in any case I'm curious: does our current trick of subtracting a second from the committer time-stamp to de-duplicate commit ids not help here?

One worry is that, in the worst case, you could keep subtracting for a while.

Update Ah, I guess the main problem is that you might need to write the commits concurrently on different machines, and there isn't a good way to reserve the commit id.

If this is the problem, mentioning concurrency in the commit description might help.

martinvonz · 2025-07-15T06:23:41Z

You likely already considered this, and I mainly looked at the commit description, but perhaps it's worth mentioning, and in any case I'm curious: does our current trick of subtracting a second from the committer time-stamp to de-duplicate commit ids not help here?

We (both Yuya and I, I think) hope to get rid of that hack once we no longer store any commit-linked metadata outside of the Git commit object.

@yuja : Perhaps what I'm missing is what the problem with cycles is. Is it the kind of cycle you would get if you rewrote commit A to commit B in operation X and then rewrote commit B to commit A in operation Y? If that's it, what problem does that kind of cycle cause?

ilyagr · 2025-07-15T06:38:45Z

I see, the old problem of duplicating a commit and then updating both copies in the same second (to make their committer dates the same) will no longer result in commits with the same hash if the change id is stored inside the commit.

Though, if somebody (say on a remote) rebased both commits with some tool that strips the change id header, they might have some trouble.

(This is of course out of scope for this pr)

yuja · 2025-07-15T12:20:57Z

We (both Yuya and I, I think) hope to get rid of that hack once we no longer store any commit-linked metadata outside of the Git commit object.

+1

Perhaps what I'm missing is what the problem with cycles is. Is it the kind of cycle you would get if you rewrote commit A to commit B in operation X and then rewrote commit B to commit A in operation Y? If that's it, what problem does that kind of cycle cause?

An immediate problem is that MutableRepo would panic if cycle existed within the same operation. I'm not pretty sure if inter-operation cycles should be banned, but the existence of cycles will mean that we can't simply apply DAG algorithms. For example, I don't know if Sapling's graph renderer can deal with cyclic graph.

martinvonz · 2025-07-16T15:02:07Z

An immediate problem is that MutableRepo would panic if cycle existed within the same operation. I'm not pretty sure if inter-operation cycles should be banned, but the existence of cycles will mean that we can't simply apply DAG algorithms. For example, I don't know if Sapling's graph renderer can deal with cyclic graph.

That seems to behave reasonably at least in this simple case (created by a hacked jj binary):

It looks funny with the two working-copy nodes but I think that's the right thing to do. Other than evolog, I don't think we have anything that walks the predecessors, right?

Regarding intra-operation cycles, I can add a stricter check in addition to filtering out length-1 cycles as this PR currently does. Would that make you comfortable with removing the error for recreating commits?

yuja · 2025-07-17T01:50:10Z

An immediate problem is that MutableRepo would panic if cycle existed within the same operation. I'm not pretty sure if inter-operation cycles should be banned, but the existence of cycles will mean that we can't simply apply DAG algorithms. For example, I don't know if Sapling's graph renderer can deal with cyclic graph.

That seems to behave reasonably at least in this simple case (created by a hacked jj binary):

Yeah, I guessed that might work so long as the graph renderer doesn't keep track of emitted nodes to detect data inconsistency.

It looks funny with the two working-copy nodes but I think that's the right thing to do. Other than evolog, I don't think we have anything that walks the predecessors, right?

jj op diff will do, and transitive predecessors will be resolved there unlike evolog. Another potential problem is that we might add some revset-like filtering to evolog, and cycles could break topological sorting of the filtered graph.

Regarding intra-operation cycles, I can add a stricter check in addition to filtering out length-1 cycles as this PR currently does. Would that make you comfortable with removing the error for recreating commits?

No. Just curious, why do you want to remove the error checking? I know you have a real problem caused by the length-1 cycles, but I don't see why we also need to unblock the other unusual cases without addressing the underlying problems.

martinvonz · 2025-07-17T02:02:29Z

Just curious, why do you want to remove the error checking?

Only because I think it's the right thing to do. Since we've fixed our server at Google to not create operations that do no-op updates of commit descriptions, it doesn't matter to us anymore AFAIK.

I know you have a real problem caused by the length-1 cycles, but I don't see why we also need to unblock the other unusual cases without addressing the underlying problems.

I do want to address the underlying problems. Preventing intra-operation cycles is just the only problem I'm aware of, so I thought that if we prevent that, then we can allow recreating existing commits.

What would make you comfortable with removing the error? To be clear, there's no rush from my side.

yuja · 2025-07-17T02:27:22Z

I do want to address the underlying problems. Preventing intra-operation cycles is just the only problem I'm aware of, so I thought that if we prevent that, then we can allow recreating existing commits.

What would make you comfortable with removing the error? To be clear, there's no rush from my side.

I see. As I said, inter-operation cycles can also cause problems atm, so I don't want to remove the error. That's the point.

I don't know if it's practical to accept cycles in the predecessors graph. It might be just simpler to store predecessors (or random bytes, incremented timestamp, etc.) in the commit object to prevent this kind of problems.

martinvonz · 2025-07-20T18:30:19Z

I see. As I said, inter-operation cycles can also cause problems atm, so I don't want to remove the error. That's the point.

Ah, you're talking about the cycle-detection in jj op diff you mentioned earlier, right? Sorry, I missed that until now. I think we'll need to try to fix that somehow. I agree that this PR should wait until we've fixed that. Will convert this to a draft until then.

yuja · 2025-07-21T02:39:18Z

Ah, you're talking about the cycle-detection in jj op diff you mentioned earlier, right?

Yes, and some other potential problems on graph filtering/grouping. iirc, TopoGroupedGraphIterator doesn't work if multiple items of the same identity were queued.

martinvonz requested a review from a team as a code owner July 10, 2025 02:04

martinvonz force-pushed the push-wslopvrrvzpq branch from 0990b63 to 6cd0f61 Compare July 10, 2025 03:09

yuja reviewed Jul 10, 2025

View reviewed changes

martinvonz marked this pull request as draft July 20, 2025 18:30

commit_builder: allow recreating commit, don't create cycle in parent… #6950

Are you sure you want to change the base?

commit_builder: allow recreating commit, don't create cycle in parent… #6950

Uh oh!

Conversation

martinvonz commented Jul 10, 2025

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ilyagr commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinvonz commented Jul 15, 2025

Uh oh!

ilyagr commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuja commented Jul 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martinvonz commented Jul 16, 2025

Uh oh!

yuja commented Jul 17, 2025

Uh oh!

martinvonz commented Jul 17, 2025

Uh oh!

yuja commented Jul 17, 2025

Uh oh!

martinvonz commented Jul 20, 2025

Uh oh!

yuja commented Jul 21, 2025

Uh oh!

Uh oh!

ilyagr commented Jul 15, 2025 •

edited

Loading

ilyagr commented Jul 15, 2025 •

edited

Loading

yuja commented Jul 15, 2025 •

edited

Loading