changelog: Reference first completion of sourceforge git/svn origins
Closed, Public

Authored by ardumont on Jul 1 2021, 9:26 AM.

Diff Detail

Repository: rDDOC Development documentation
Branch: sourceforge
Lint: No Linters Available
Unit: No Unit Test Coverage
Build Status: Buildable 22706
Build 35411: arc lint + arc unit

Event Timeline

zack requested changes to this revision. Jul 1 2021, 9:33 AM
This revision now requires changes to proceed. Jul 1 2021, 9:33 AM

Thanks a lot for this!

I think we should detail this more though. To make a concrete change proposal, I need some info:

  • which kinds of VCS repositories from SF have already been archived (or even just listed) and which haven't?
  • do we have an ETA for when most of them will be archived? (if it's not too far in the future, maybe this entry could become similar to the guix one, i.e., "completed first archival of, and added to regular crawling [...]")

Thanks!

Yes, I was counting on your feedback to improve this a tad ;). My initial intent was
just to mention that the ingestion had started, then give more details once at least
git and svn are done.

I don't really have an ETA yet [1]. We are roughly 67% done for git and 84.6% for svn
[2]. For mercurial, ingestion has not started, as other blocking points are being worked
on. Bazaar and cvs origins are listed, but we don't have any loader on that front yet.

[1] The task keeps the history of status updates, so an ETA could be computed from it, though.

[2] Related to T3374

Thanks, this is super helpful.

Given this, my proposal is to:

  • wait until both git and svn are "reasonably complete" (i.e., 95+%). By the look of it, that should happen in 1-2 weeks max, shouldn't it?
  • add an entry to the archive changelog that only talks about these two VCSs. Here's a concrete proposal: "completed first archival of SourceForge Git and Subversion repositories; regular crawling for those repositories enabled"

Can you confirm that the "regular crawling" part of the second point above is correct?

If you agree, I'd also love it if you could keep an eye on when "reasonably complete" is reached and ping this diff when that's the case.

> Given this, my proposal is to:
>
>   • wait until both git and svn are "reasonably complete" (i.e., 95+%).

ok

> By the look of it, that should happen in 1-2 weeks max, shouldn't it?

I concur.

>   • add an entry to the archive changelog that only talks about these two VCSs. Here's a
>     concrete proposal: "completed first archival of SourceForge Git and Subversion
>     repositories; regular crawling for those repositories enabled"

Yes, that sounds like a great summary, thanks.

> Can you confirm that the "regular crawling" part of the second point above is correct?

We do have one daily incremental listing (so I guess it can be called regular crawling
as well) [1]. Newly detected origins are or will be ingested along the way by the current
dedicated worker (worker17) [2].

[1] Hence the slight increase in git, svn and, to a lesser extent, mercurial origins (cvs
and bzr have been mostly stale) in the table described in the associated task.

[2] At some point, we'll have to decide whether the standard workers can simply keep up
with the sourceforge ingestion or not. For now, we have had to dedicate one new worker
to respect the sourceforge admin's wish that we not exceed 8 ingestions in parallel: 2
for the listers (1 full listing every 64 days or something, 1 daily incremental), and
the remaining 6 for the worker.
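
The slot split described in [2] can be sketched as a small budget check. This is only an
illustration: the names `MAX_PARALLEL`, `LISTER_SLOTS` and `loader_slots` are hypothetical
and do not come from the Software Heritage codebase. Only the numbers (8 in total, 1 slot
per lister) come from the comment above.

```python
# Hypothetical sketch of the SourceForge concurrency budget described above.
# None of these names exist in the actual swh codebase.
MAX_PARALLEL = 8  # ceiling requested by the SourceForge admin

# One slot per lister: the full listing and the daily incremental one.
LISTER_SLOTS = {"full": 1, "incremental": 1}


def loader_slots(max_parallel: int, lister_slots: dict[str, int]) -> int:
    """Return how many parallel slots remain for the dedicated loader worker."""
    used = sum(lister_slots.values())
    if used > max_parallel:
        raise ValueError("listers alone exceed the parallelism ceiling")
    return max_parallel - used


print(loader_slots(MAX_PARALLEL, LISTER_SLOTS))  # 6
```

Keeping the ceiling and the lister slots as separate values makes it easy to see how many
loader slots would be left if the ceiling were renegotiated.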

> If you agree, I'd also love it if you could keep an eye on when "reasonably
> complete" is reached and ping this diff when that's the case.

I'm currently doing that, so yeah, I'll ping here when it's time ;)

Cheers,

Heads up, svn origins are populated at 96% and git at 68.35% [1]. Git will definitely go
faster once the svn origins are done, so it will probably finish next week while I'm on
vacation. I'll post an update when I get back.

[1] See the table in the T3374 description.
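
As a rough illustration of the "reasonably complete" criterion agreed on earlier in this
thread (95+% for both git and svn), here is a minimal sketch. The function names are
hypothetical, and the percentages are simply the ones quoted in the heads-up above rather
than values read from the T3374 table.

```python
# Hypothetical sketch: check the "reasonably complete" criterion (95+%).
THRESHOLD = 95.0  # percent, from the proposal in this thread


def reasonably_complete(progress: dict[str, float], threshold: float = THRESHOLD) -> bool:
    """Return True only if every listed VCS has reached the threshold."""
    return all(pct >= threshold for pct in progress.values())


def lagging(progress: dict[str, float], threshold: float = THRESHOLD) -> list[str]:
    """Return the VCS names still below the threshold, sorted by name."""
    return sorted(vcs for vcs, pct in progress.items() if pct < threshold)


status = {"git": 68.35, "svn": 96.0}  # figures quoted in the heads-up above
print(reasonably_complete(status))  # False
print(lagging(status))  # ['git']
```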

Far out! We exceeded the 95% threshold on both git and svn, at about 99% on average for both. ;)

ardumont retitled this revision from "archive-changelog: Reference the start of sourceforge ingestion" to "changelog: Reference first completion of sourceforge git/svn origins". Jul 22 2021, 10:13 AM
This revision is now accepted and ready to land. Jul 22 2021, 10:33 AM