From the staging webapp, we identify a revision [1]
Jul 29 2021
by the way ^
Shipped the following modules to solve that problem:
- swh.model v2.7.0
- swh.storage v0.35.0
- swh.loader.mercurial v2.1.0
Jul 28 2021
Jul 23 2021
For now I went the simplest way I could think of, which is:
Jul 1 2021
(3) should ideally be implemented in a way that guarantees that extids that were resolvable in previous versions of the mapping will always be resolvable in future versions
I don't understand. Option 3 is to remove relations between extids and SWHIDs, so they won't be resolvable anymore.
Jun 30 2021
So if that mapping changes but always gives back an object in the archive (pointed to by a SWHID)
I have the feeling that option (1) will lead, in the long run, to an explosion in the size of the mapping, which will eventually make us converge (slowly) toward option (3).
and it would probably be kind of a mess from a kafka perspective
The "mapping version field" is the most fleshed out proposal as it would be my preference. My rationale for it against changing extid_type for backwards incompatible changes is that the extid_type is a property of the external artifact, while the mapping version is a property of our archiving infrastructure.
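To make the distinction concrete, here is a minimal sketch of what a "mapping version field" could look like. The names `ExtIDRecord`, `mapping_version` and `resolve` are hypothetical and chosen for illustration only; they are not claimed to match the actual swh.model API. The idea is that extid_type stays tied to the external artifact, while the version only tracks changes on our side.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ExtIDRecord:
    """Hypothetical record; field names are assumptions for illustration."""

    extid_type: str  # property of the external artifact, e.g. "hg-nodeid"
    extid: bytes  # the external identifier itself
    target: str  # SWHID of the archived object
    mapping_version: int = 0  # property of our archiving infrastructure


def resolve(
    records: Iterable[ExtIDRecord], extid_type: str, extid: bytes
) -> Optional[ExtIDRecord]:
    """Return the record with the highest mapping version, or None.

    Older versions stay in the mapping, so an extid that was resolvable
    before a mapping change remains resolvable afterwards.
    """
    candidates = [
        r for r in records if r.extid_type == extid_type and r.extid == extid
    ]
    return max(candidates, key=lambda r: r.mapping_version, default=None)
```

With this shape, a backwards-incompatible change to how the loader computes targets would bump `mapping_version` and leave `extid_type` untouched.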
Jun 24 2021
It no longer does...
We have an end-to-end checkpoint for the mercurial loading and it's green now.
Jun 21 2021
Now that the branch structure has landed, I've deployed this latest version. After some cleanup of the duplicate extids left over from an earlier deployment, everything seems to be fine and the loader is ready for production.
Jun 3 2021
In T3352#65755, @Alphare wrote: Here's what I gather to be the most up-to-date version:
- HEAD [required] either the node pointed by the @ bookmark or the tip of default branch
- branch-tip/<branch-name> [required] the tipmost head of each open branch
- bookmarks/<bookmark_name> [optional] hold the bookmark mapping if any
- branch-heads/<branch_name>/0..n [optional] for any branch with multiple open heads, list all open heads
- branch-closed-heads/<branch_name>/0..n [optional] for any branch with at least one closed head, list all closed heads
- tags/<tag-name> [optional] record tags
Note that the current patch sent in D5816 still has the refs/hg/ prefix, does not use the @ bookmark by default nor does it have a namespace for tags. I'll be integrating these things in the next update to the patch.
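As a rough illustration of the naming scheme listed above, the following sketch builds the snapshot branch names from plain dictionaries. The function name `build_branch_names` and its input shapes are assumptions made for this example; it is not the code from D5816 and does not use the loader's real data structures.

```python
from typing import Dict, List, Optional


def build_branch_names(
    branch_tips: Dict[str, str],
    bookmarks: Optional[Dict[str, str]] = None,
    open_heads: Optional[Dict[str, List[str]]] = None,
    closed_heads: Optional[Dict[str, List[str]]] = None,
    tags: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map snapshot branch names to node ids (hypothetical helper)."""
    bookmarks = bookmarks or {}
    open_heads = open_heads or {}
    closed_heads = closed_heads or {}
    tags = tags or {}
    names: Dict[str, str] = {}

    # HEAD: the "@" bookmark if present, else the tip of the default branch.
    # If neither exists, this sketch simply leaves HEAD out.
    if "@" in bookmarks:
        names["HEAD"] = bookmarks["@"]
    elif "default" in branch_tips:
        names["HEAD"] = branch_tips["default"]

    # branch-tip/<branch-name>: tipmost head of each open branch.
    for branch, node in branch_tips.items():
        names[f"branch-tip/{branch}"] = node

    # bookmarks/<bookmark_name>: bookmark mapping, if any.
    for name, node in bookmarks.items():
        names[f"bookmarks/{name}"] = node

    # branch-heads/<branch_name>/0..n: only for branches with several open heads.
    for branch, heads in open_heads.items():
        if len(heads) > 1:
            for index, node in enumerate(heads):
                names[f"branch-heads/{branch}/{index}"] = node

    # branch-closed-heads/<branch_name>/0..n: every closed head.
    for branch, heads in closed_heads.items():
        for index, node in enumerate(heads):
            names[f"branch-closed-heads/{branch}/{index}"] = node

    # tags/<tag-name>: recorded tags.
    for tag, node in tags.items():
        names[f"tags/{tag}"] = node

    return names
```

A branch with a single open head only gets its branch-tip/ entry here, which keeps the optional namespaces empty for the common case.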
In T3352#65758, @Alphare wrote: I think @marmoute's intention was to more closely convey the semantics of Mercurial's branching system. A branch tip or head is not a branch itself, so it would be "wrong" to put them under branches/. Thus, since the plural of "a branch head" is "branch heads", I don't feel like the second change would be appropriate either.
But then again, you have the final say, and I won't die on this hill. :)
That explains it, and it's good enough for me, thanks :)
I think @marmoute's intention was to more closely convey the semantics of Mercurial's branching system. A branch tip or head is not a branch itself, so it would be "wrong" to put them under branches/. Thus, since the plural of "a branch head" is "branch heads", I don't feel like the second change would be appropriate either.
My remaining question then is: how about, instead of branch-{tip,heads,closed-heads}/name we use branches/{heads,closed,tip}/name ?
Here's what I gather to be the most up-to-date version:
In T3352#65753, @marmoute wrote: They are all branch heads (git "branches" are about heads too, bookmarks too), so a heads/ prefix does not bring much.
In T3352#65752, @zack wrote: My point here is for users looking at the structure to easily distinguish between the different mapping formats. Something based on the "visit data" and associated documentation seems quite fragile.
But a version number for the mapping format will be completely meaningless for a user. It's too Software Heritage-specific. The more I think of it the more I'm convinced we should just offer names that are as natural and self-explanatory as possible. Our notion of what is self-explanatory might change in the future (also based on user feedback), but so be it. It is not going to be a new problem in the archive, as @olasd pointed out, so we will not be making it any worse.
In practical terms here I concur that that means dropping the "refs/hg" prefix.
My point here is for users looking at the structure to easily distinguish between the different mapping formats. Something based on the "visit data" and associated documentation seems quite fragile.
In T3352#65749, @olasd wrote: In T3352#65655, @zack wrote: I'm happy with the HEAD branch being an alias to a name that's more representative of the corresponding mercurial concept (which would be tip, I guess).
In T3352#65655, @zack wrote: HEAD is also a name that is Git-specific, but I understand that we want that notion and probably that name is as good as any. (Unless there is a Mercurial generic name for something similar.)
Jun 2 2021
Jun 1 2021
In T3352#65657, @zack wrote: So, to pivot the question around, what is the minimal (also in the sense that it is shorter / has less cruft) naming scheme that would allow us to represent without ambiguity all the Mercurial naming aspects that you want to capture?
May 31 2021
Understood. To explain my thinking here, the refs/... structure is something we picked to represent git branch names as faithfully as possible, adding as little as possible on top of it. In trying to represent branch names from another VCS, as a first approximation I'd rather reuse the same *approach* than a *result* that is similar, if that makes sense. So, to pivot the question around, what is the minimal (also in the sense that it is shorter / has less cruft) naming scheme that would allow us to represent without ambiguity all the Mercurial naming aspects that you want to capture?
In T3352#65655, @zack wrote: Is the ability to recognize that a snapshot comes from Mercurial an actual goal here? I don't think we care about "clashes" between snapshots created from different VCS, but maybe I'm missing something.
Is the ability to recognize that a snapshot comes from Mercurial an actual goal here? I don't think we care about "clashes" between snapshots created from different VCS, but maybe I'm missing something.
As mentioned in T3336, we've now passed 3000 repos loaded successfully in staging. We've had two failures due to attempting to add two identical objects concurrently, which is something my simple test script wouldn't catch, but would be handled properly by an actual worker process.
@Alphare's proposal is missing the branch-name part for branch-tip. I would adjust it as such:
I'm pinging @zack as I think his feedback on this naming scheme would be valuable.
Agreed! That would look like:
I would go: refs/hg/branch-tip/ instead of just tip
The run from this weekend, detailed in T3336, appears to have worked fine. (Just making sure it's obvious from this task as well.)
Great news! Let me know if I can help in any way.
After the weekend, the loader ran a few thousand loading tasks (out of 235k total). Out of those, only 2 failed for already known concurrency reasons. We should be good to go to production on this loader.
May 28 2021
base_dir=/srv/storage/space/mirrors/boatbucket
tail -n +10000 $base_dir/mapping-to-repos.txt | head -10000 | while read dir url; do
  repo_dir="$base_dir/$dir"
  visit_date=`stat -c %z $repo_dir/.hg/blackbox.log | sed -E 's/ \+0000/+0000/'`
  SWH_CONFIG_FILENAME=/etc/softwareheritage/loader_mercurial.yml swh --log-level=DEBUG loader run mercurial_from_disk $url directory=$repo_dir visit_date="\"$visit_date\""
done 2>&1 | tee -a bitbucket-archive.2.log
After packaging swh.loader.mercurial 1.1 with @Alphare's changes, all seems well on the staging environment (at least the inconsistencies I had noticed are not there anymore).
That's awesome news, thanks for the heads up \o/.
For posterity, I have tested that all corrupted and "verify failed" repositories in the archive load correctly, as well as the humongous Mozilla-unified, PyPy and a few thousand other random ones from the archive. Aside from the incremental loading issues detailed in T3336 (that should be fixed in today's run), everything seems fine.
May 27 2021
May 24 2021
Reproduction for the duplicate nodeids in the extid table:
So that was done on Friday, and things seem to work in general, but there are a bunch of issues:
May 21 2021
The mapping file is located (on the boatbucket machine) at /srv/boatbucket/mapping-to-repos.txt. It does *not* contain the (very few) outright corrupted repositories, I might have to do some digging and even bother the BB team again to get the URL for those.
So, this was done by way of D5628, and is now released and ready to deploy.
May 19 2021
Mar 4 2021
Dec 10 2020
Nov 2 2020
Oct 26 2020
We don't have easily accessible backups going back to (before) August 2018, so I don't think we'll be able to recover this data.