
Define a ref mapping naming scheme for all Mercurial "pointers" (heads, closed heads, bookmarks, tip)
Open, High, Public

Description

Mercurial's branching model is more complex than Git's; it allows for multiple heads per branch, closed heads and bookmarks. Since the current snapshot model is Git-centric and would require a large re-work (that we may not have the necessary perspective for yet), we need to define a naming scheme to map Mercurial "pointers" to Git refs.

I propose the following [updated 2021-06-03]:

  • HEAD [required] either the node pointed to by the @ bookmark or the tip of the default branch
  • branch-tip/<branch-name> [required] the tipmost head of each open branch
  • bookmarks/<bookmark_name> [optional] hold the bookmark mapping if any
  • branch-heads/<branch_name>/0..n [optional] for any branch with multiple open heads, list all open heads
  • branch-closed-heads/<branch_name>/0..n [optional] for any branch with at least one closed head, list all closed heads
  • tags/<tag-name> [optional] record tags

The format is not ambiguous regardless of branch name since we know it ends with a /<index>, as long as we have a stable sorting of the heads.
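To illustrate why the trailing index keeps the format unambiguous even when a branch name itself contains slashes, here is a minimal sketch in Python. The helper name `split_indexed_ref` is hypothetical and not part of any loader, and the sketch assumes that only the `branch-heads/` and `branch-closed-heads/` namespaces carry an index.

```python
from typing import Tuple

# Namespaces whose refs end with "/<index>" (assumption based on the proposal above).
INDEXED_PREFIXES = ("branch-heads/", "branch-closed-heads/")


def split_indexed_ref(ref: str) -> Tuple[str, str, int]:
    """Split an indexed ref into (namespace, branch_name, index).

    The index is always the last path component, so everything between
    the namespace prefix and the last "/" is the branch name, slashes included.
    """
    for prefix in INDEXED_PREFIXES:
        if ref.startswith(prefix):
            rest = ref[len(prefix):]
            branch_name, _, index = rest.rpartition("/")
            if not branch_name or not index.isdecimal():
                raise ValueError(f"malformed indexed ref: {ref!r}")
            return prefix.rstrip("/"), branch_name, int(index)
    raise ValueError(f"not an indexed ref: {ref!r}")
```

Splitting from the right is the design choice that matters here: a branch called `feature/login` cannot be confused with an index, because only the final component is ever read as one.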
There may be some overlap between the refs, but it's simpler not to try to figure out de-duplication (and frankly I don't really see the point of trying). However, we want to optimize for simplicity in the most common case: only tip is required, since it's always present in a repository. We will not store branch heads for repositories with only one branch (since its head is tip), and bookmarks or closed heads will of course be populated only if they exist within that repository.

Event Timeline

Alphare triaged this task as High priority. May 31 2021, 3:40 PM
Alphare created this task.
Alphare created this object in space S1 Public.

I would go: refs/hg/branch-tip/ instead of just tip

I would also favor - instead of _ so branch-heads (from branch_heads)

I would also use branch in the part about closed heads. So something like refs/hg/branch-closed-heads/<branch>/0

Agreed! That would look like:

  • refs/hg/branch-tip (required)
  • refs/hg/bookmarks/<bookmark_name> (optional)
  • refs/hg/branch-heads/<branch_name>/0..n (optional, branch heads, not topological heads)
  • refs/hg/branch-closed-heads/<branch_name>/0..n (optional)

@olasd What do you think?

I'm pinging @zack as I think his feedback on this naming scheme would be valuable.

Some stuff I'm thinking about:

  • Snapshots already have a "default" branch called HEAD, which is what gets shown by default when browsing an origin in the Web UI. We can either replace your proposed refs/hg/tip with a branch named HEAD, or make HEAD an alias to refs/hg/tip (I think I'm slightly in favor of the first option).
  • I'm not sure about the hg/ infix sprinkled everywhere. I'd propose the following color for the bikeshed, which would quack a bit more like what we're generating for git.
    • HEAD (required)
    • refs/bookmarks/<bookmark_name> (bookmarks)
    • refs/heads/<branch_name>/0..n (branch heads)
    • refs/closed/<branch_name>/0..n (closed branch heads)

Overall, as long as the mapping from hg concepts to SWH snapshot branch names is explicit, documented and unambiguous, I think we'll be fine.

@Alphare, could you showcase what the snapshot structure would generate for some currently active hg repos, to see whether it looks sensible? I'm not sure if my understanding of the branches / bookmarks concept actually matches reality

@Alphare's proposal is missing the branch-name part for branch-tip. I would adjust it as such:

  • HEAD [required] either the node pointed to by the @ bookmark or the tip of the default branch
  • refs/hg/branch-tip/<branch-name> [required] the tipmost head of each open branch
  • refs/hg/bookmarks/<bookmark_name> [optional] hold the bookmark mapping if any
  • refs/hg/branch-heads/<branch_name>/0..n [optional] for any branch with multiple open heads, list all open heads
  • refs/hg/branch-closed-heads/<branch_name>/0..n [optional] for any branch with at least one closed head, list all closed heads
  • refs/hg/tags/<tag-name> [optional] record tags

The refs/hg/ prefix is here to clarify that this is a mapping from Mercurial. It will allow us to painlessly change our mapping strategy without ambiguity in the future.

I would also rather keep the branch- part, as it makes things explicit now that new namespaces can appear (e.g. we already have tags, we will want to track topics eventually, etc).

Is the ability to recognize that a snapshot comes from Mercurial an actual goal here? I don't think we care about "clashes" between snapshot created from different VCS, but maybe I'm missing something.

If we agree that's not a goal (again: to be discussed :-)), we can drop the refs/hg/ prefix entirely.

HEAD is also a name that is Git-specific, but I understand that we want that notion and probably that name is as good as any. (Unless there is a Mercurial generic name for something similar.)

In T3352#65655, @zack wrote:

Is the ability to recognize that a snapshot comes from Mercurial an actual goal here? I don't think we care about "clashes" between snapshot created from different VCS, but maybe I'm missing something.

My point is more about the fact that we are not taking a pristine picture of the repository here. We decide to map Mercurial data to swh/git data. That mapping makes choices, and because the data in Mercurial might evolve, the mapping might need to evolve too. If we have some clear versioning of this mapping (e.g. refs/hg/ vs refs/hg2/), it will be easier to adapt in the future.

Understood. To explain my thinking here, the refs/... structure is something we picked to represent git branch names as faithfully as possible, adding as little as possible on top of it. In trying to represent branch names from another VCS, as a first approximation I'd rather reuse the same *approach* than a *result* that is similar, if that makes sense. So, to pivot the question around, what is the minimal (also in the sense that it is shorter / has less cruft) naming scheme that would allow us to represent without ambiguity all the Mercurial naming aspects that you want to capture?

Regarding migrations, note that we didn't care in the past about changing the structure of git snapshots (I think it concerned what HEAD can point to, if I'm not mistaken), so that would not be a big deal for Mercurial either. I think it's fair that loaders can evolve over time and change the way they represent things, without feeling bound by backward compatibility/uniformity.
(If we want to include version information, which makes intuitive sense, we should probably find an explicit place where to put that. Either a separate field in the snapshots, or maybe visit metadata.)

In T3352#65657, @zack wrote:

So, to pivot the question around, what is the minimal (also in the sense that it is shorter / has less cruft) naming scheme that would allow us to represent without ambiguity all the Mercurial naming aspects that you want to capture?

Same, without the refs/hg/ prefix (though I would rather keep some prefix for versioning of the mapping).

I have implemented @marmoute's version of the mapping at D5816. Since the exact naming scheme is just one search-and-replace away, we can still change it easily. Implementing this has highlighted a flaw in the handling of multiple open heads, which is now fixed.

In T3352#65655, @zack wrote:

HEAD is also a name that is Git-specific, but I understand that we want that notion and probably that name is as good as any. (Unless there is a Mercurial generic name for something similar.)

While it's true that the name of our HEAD branch comes from git, currently all loaders that generate a snapshot with an unambiguous default branch use HEAD to point at that object. The webapp uses that as the default branch to display when browsing an origin, so we've de facto standardized on it.

I'm happy with the HEAD branch being an alias to a name that's more representative of the corresponding mercurial concept (which would be tip, I guess).

I agree that the refs/hg prefix doesn't add much value to the meaning of what we put in the snapshots, so I'm in favor of dropping it. We already have a somewhat wide range of snapshot shapes, so this shouldn't pose a problem in the webapp, for instance.

As for the versioning argument, if we have to change the branch structure for an update of mercurial, existing snapshots (with their immutable ids) and visits (with their dates) will not be changed, so all references to the old branch structure will stay valid. The main concern would be "url hacking" of sorts, on origins which are still actively being archived, and I believe that's something that we shouldn't generally support, as long as the changes in branch structure are documented (We have an archive changelog for these sorts of things, although we'll probably want a separate document keeping track of historical snapshot branch structures)

In T3352#65749, @olasd wrote:
In T3352#65655, @zack wrote:

I'm happy with the HEAD branch being an alias to a name that's more representative of the corresponding mercurial concept (which would be tip, I guess).

Nah, do not use tip. Take all the knowledge you have about tip and throw it in a volcano. The best approximation of HEAD (i.e. what you get after a clone) is the @ bookmark if it exists, or otherwise the head of the default branch (the head with the highest revision number in default if there are multiple open heads).
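The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Head` record and the `pick_head` function are hypothetical names, not the loader's actual API, and returning `None` when there is neither an @ bookmark nor an open head on default is a choice made for this sketch.

```python
from typing import Dict, List, NamedTuple, Optional


class Head(NamedTuple):
    """Hypothetical record describing one branch head."""
    rev: int      # local revision number
    node: str     # hex node id
    branch: str   # named branch the head belongs to
    closed: bool  # True if the head closes its branch


def pick_head(bookmarks: Dict[str, str], heads: List[Head]) -> Optional[str]:
    """Return the node HEAD should point to, or None if nothing qualifies."""
    # The @ bookmark wins whenever it exists.
    if "@" in bookmarks:
        return bookmarks["@"]
    # Otherwise use the open head of "default" with the highest revision number.
    default_heads = [h for h in heads if h.branch == "default" and not h.closed]
    if not default_heads:
        return None
    return max(default_heads, key=lambda h: h.rev).node
```

Note that the repository-wide tip plays no role in this selection: only the @ bookmark and the open heads of default are consulted.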

As for the versioning argument, if we have to change the branch structure for an update of mercurial, existing snapshots (with their immutable ids) and visits (with their dates) will not be changed, so all references to the old branch structure will stay valid.

My point here is for users looking at the structure to easily distinguish between the different mapping formats. Something based on the "visit data" and associated documentation seems quite fragile.

But a version number for the mapping format will be completely meaningless for a user. It's too Software Heritage-specific. The more I think of it the more I'm convinced we should just offer names that are as natural and self-explanatory as possible. Our notion of what is self-explanatory might change in the future (also based on user feedback), but so be it. It is not going to be a new problem in the archive, as @olasd pointed out, so we will not be making it any worse.

In practical terms here I concur that that means dropping the "refs/hg" prefix.

Looking at the rest, I also wonder about "branch_heads" and "closed_heads". Are they both going to include "heads"? If so, shouldn't we go for something like "heads/{branch,closed}"?

In T3352#65752, @zack wrote:

My point here is for users looking at the structure to easily distinguish between the different mapping formats. Something based on the "visit data" and associated documentation seems quite fragile.

But a version number for the mapping format will be completely meaningless for a user. It's too Software Heritage-specific. The more I think of it the more I'm convinced we should just offer names that are as natural and self-explanatory as possible. Our notion of what is self-explanatory might change in the future (also based on user feedback), but so be it. It is not going to be a new problem in the archive, as @olasd pointed out, so we will not be making it any worse.

In practical terms here I concur that that means dropping the "refs/hg" prefix.

I guess we are running in a loop here and this is your archive. So if you don't want some versioning, let's not have one.

Looking at the rest, I also wonder about "branch_heads" and "closed_heads". Are they both going to include "heads"? If so, shouldn't we go for something like "heads/{branch,closed}"?

They are all branch heads (Git "branches" are about heads too, and so are bookmarks), so a heads/ prefix does not bring much.

I realized just now I was looking at an old version of the namespace on this point, because what I had in mind is isomorphic to a later proposal.
Can someone post an up-to-date and complete namespace proposal here for approval?

Here's what I gather to be the most up-to-date version:

  • HEAD [required] either the node pointed to by the @ bookmark or the tip of the default branch
  • branch-tip/<branch-name> [required] the tipmost head of each open branch
  • bookmarks/<bookmark_name> [optional] hold the bookmark mapping if any
  • branch-heads/<branch_name>/0..n [optional] for any branch with multiple open heads, list all open heads
  • branch-closed-heads/<branch_name>/0..n [optional] for any branch with at least one closed head, list all closed heads
  • tags/<tag-name> [optional] record tags

Note that the current patch sent in D5816 still has the refs/hg/ prefix, does not use the @ bookmark by default, and does not have a namespace for tags. I'll be integrating these things in the next update to the patch.
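To make the namespace above concrete, here is a minimal Python sketch that builds the ref names from already-extracted repository data. It is not the code from D5816: the `Head` record and `build_refs` function are hypothetical, refs map to plain node ids rather than snapshot branch objects, and ordering heads by revision number is an assumption (the proposal only asks for a stable sort, and revision numbers are local to a clone, so a real loader may prefer another key).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Head(NamedTuple):
    """Hypothetical record describing one branch head."""
    rev: int      # local revision number
    node: str     # hex node id
    branch: str   # named branch the head belongs to
    closed: bool  # True if the head closes its branch


def build_refs(
    heads: List[Head], bookmarks: Dict[str, str], tags: Dict[str, str]
) -> Dict[str, str]:
    """Map Mercurial pointers to ref names following the proposed scheme."""
    refs: Dict[str, str] = {}
    open_heads = defaultdict(list)
    closed_heads = defaultdict(list)
    # Sort once so that indices are deterministic (assumed key: revision number).
    for head in sorted(heads, key=lambda h: h.rev):
        target = closed_heads if head.closed else open_heads
        target[head.branch].append(head)

    for branch, branch_heads in open_heads.items():
        # Required: the tipmost head of each open branch.
        refs[f"branch-tip/{branch}"] = branch_heads[-1].node
        # Optional: list every open head only when there are several.
        if len(branch_heads) > 1:
            for index, head in enumerate(branch_heads):
                refs[f"branch-heads/{branch}/{index}"] = head.node

    for branch, branch_heads in closed_heads.items():
        for index, head in enumerate(branch_heads):
            refs[f"branch-closed-heads/{branch}/{index}"] = head.node

    for name, node in bookmarks.items():
        refs[f"bookmarks/{name}"] = node
    for name, node in tags.items():
        refs[f"tags/{name}"] = node

    # HEAD: the @ bookmark if present, else the tipmost open head of default.
    if "@" in bookmarks:
        refs["HEAD"] = bookmarks["@"]
    elif "default" in open_heads:
        refs["HEAD"] = open_heads["default"][-1].node
    return refs
```

A repository with a single open head per branch and no bookmarks, tags, or closed heads yields only `HEAD` and one `branch-tip/<branch-name>` entry per branch, which keeps the common case small.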

Thanks @Alphare.

My remaining question then is: how about, instead of branch-{tip,heads,closed-heads}/name we use branches/{heads,closed,tip}/name ?

Note that there are two separate changes there:

  1. grouping things below branches/ instead of using - as a separator. The rationale is that we already group things with / elsewhere in the namespace proposal, so this looks more consistent
  2. switching from singular branch to plural branches, for consistency with tags

Both are debatable and I've no strong opinion on either of these, I'm raising these points just in case they have been overlooked.

I think @marmoute's intention was to more closely convey the semantics of Mercurial's branching system. A branch tip or head is not a branch itself, so it would be "wrong" to put them under branches/. Likewise, since the plural of "a branch head" is "branch heads", I don't feel like the second change would be appropriate either.

But then again, you have the final say, and I won't die on this hill. :)

That explains it, and it's good enough for me, thanks :)

I'm fine with this version of the proposal (which I've also put in the task description, with a timestamp).

@olasd: ?

Given the context of the snapshot (which records heads/single revisions), I am fine with putting them under branches/

However the current proposal works for me.

👍 from me.