
Ingest sourceforge repositories (origins of type git, svn, hg)
Closed, Migrated, Edits Locked

Description

The lister is deployed and ingestion has started for the svn and git repositories [1].

The mercurial loader needs some more work before we can start the ingestion of the
mercurial repositories.

This task tracks both the ingestion monitoring and the remaining actions needed to
trigger the hg ingestion.

Once this is reasonably well under way, for example when the git and svn
repositories are done, we need to update the logo on the main archive page [2] and add
an entry about it to the archive changelog [3].

[1] Summary of "status done" counts out of the total seen (computed from [4] and [5], for the curious):

|------------+-------------------------------+-------------+--------+------------------------------+-----------+---------|
| Visit type | Status done (now)             | Status done |  Total | Total (now)                  |    % Done | Remains |
|------------+-------------------------------+-------------+--------+------------------------------+-----------+---------|
| git        | 2021-07-30 08:06:15.578567+00 |      181658 | 181646 | 2021-07-30 08:06:26.37034+00 | 100.00661 |     -12 |
| svn        | 2021-08-03 08:16:40.75639+00  |      101940 | 101894 | 2021-07-30 08:06:26.37034+00 | 100.04514 |     -46 |
| cvs        | x                             |           x |  28622 | 2021-07-30 08:06:26.37034+00 |         x |       x |
| hg         | 2021-08-03 08:14:43.364316+00 |       27630 |  27660 | 2021-07-30 08:06:26.37034+00 | 99.891540 |      30 |
| bzr        | x                             |           x |    290 | 2021-07-30 08:06:26.37034+00 |         x |       x |
|------------+-------------------------------+-------------+--------+------------------------------+-----------+---------|
#+TBLFM: @2$6=(100.0 * @2$3) / @2$4::@3$6=(100.0 * @3$3) / @3$4::@2$7=@2$4 - @2$3::@3$7=@3$4 - @3$3::@5$6=(100.0 * @5$3) / @5$4::@5$7=@5$4 - @5$3
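
The "% Done" and "Remains" columns follow directly from the two count columns, as the #+TBLFM line specifies. A minimal Python sketch that reproduces those figures from the counts in the table above (the helper name `progress` is only illustrative):

```python
from typing import Tuple


def progress(done: int, total: int) -> Tuple[float, int]:
    """Return (% done, remaining) the same way the table formula computes them."""
    percent_done = 100.0 * done / total
    remains = total - done
    return percent_done, remains


# Counts copied from the table: (status done, total listed).
counts = {
    "git": (181658, 181646),
    "svn": (101940, 101894),
    "hg": (27630, 27660),
}

for visit_type, (done, total) in counts.items():
    percent_done, remains = progress(done, total)
    print(f"{visit_type}: {percent_done:.5f}% done, {remains} remaining")
```

A negative "remaining" value simply means that more origins are marked done than are currently listed.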

[2] https://archive.softwareheritage.org (integrated in D6004 already)

[3] https://docs.softwareheritage.org/devel/archive-changelog.html (integrated in D5952)

[4] count the listed origins:

softwareheritage-scheduler=> select now(), visit_type, count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where l.name='sourceforge' group by visit_type order by count(*) desc;

[5] count the origins:

softwareheritage=> select now(), count(*) from origin where url like 'https://git.code.sf.net%';  -- replace `git` with `svn` for the svn origins

Limited to those origin types, as we don't have any cvs or bazaar loader implementations.

Event Timeline

Note: when this is (reasonably) done, we should document the addition of SourceForge to the archive coverage page at archive.s.o and also to the archive changelog.

ardumont updated the task description.

Note: when this is (reasonably) done, we should document the addition of SourceForge to the archive coverage page at archive.s.o and also to the archive changelog.

Right, thanks. I amended the main description with it ^.

ardumont changed the task status from Open to Work in Progress. Jun 24 2021, 4:48 PM
ardumont updated the task description.

A short status report.

So far, 99% of the svn origins have been ingested. However, a combination of rabbitmq
behavior and swh internal details [1] means the git origins are not ingested as fast
(still at 68% [2]): the svn origins are ingested first, and they take 10 min each on
average [3].

So, to improve the current situation, the svn ingestion was put on standby to let the
git ingestion progress as well.

[1] An internal detail of our scheduler model: the svn queue is limited to 1000 whereas
the git queue is limited to 10000. In effect, this somehow makes rabbitmq give priority
to the smaller queue.

[2] The last 2 edits of the task description (~10 days apart) show that the percentage
for git origins remained at 68%.

[3] P1095#7318. In that sample log [06/07; 19/07], loading a sourceforge svn origin
took 10 min (~600s) on average.

So, to improve the current situation, the svn ingestion was put on standby to let the
git ingestion progress as well.

This worked. Overnight [1], the git ingestion went from 68.8% to 74.8%.

[1] Between the change, around 5:30pm yesterday, and 9am this morning.
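
As a rough back-of-the-envelope check of that overnight gain, the sketch below converts the percentage change into an approximate number of git origins per hour. It assumes the git total of 181646 from the task description table (the actual total at the time of this comment may have differed) and a window running from 5:30pm to 9am the next day.

```python
# Assumption: total git origins taken from the task description table;
# the real total at the time of this comment may have been different.
TOTAL_GIT_ORIGINS = 181646

# Window described in the comment: ~5:30pm one day to ~9am the next.
elapsed_hours = (24 - 17.5) + 9

# Percentage points gained overnight, as reported in the comment.
gained_points = 74.8 - 68.8

origins_loaded = TOTAL_GIT_ORIGINS * gained_points / 100
rate_per_hour = origins_loaded / elapsed_hours

print(f"{elapsed_hours:.1f} h elapsed")
print(f"~{origins_loaded:.0f} git origins loaded")
print(f"~{rate_per_hour:.0f} git origins per hour")
```

This is only an order-of-magnitude estimate, since both the total and the time window are approximate.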

The 'git' ingestion has caught up [1], so let's now make the svn origins finish [2]. In
effect, this makes the loader run as before: svn queue consumption is activated again,
with only a few origins left to consume.

[1] The scheduler runner instance dedicated to git currently sees nothing to schedule
(some origins might still get scheduled during the day, when a lister run sees new
ones).

10000 slots available in celery queue
0 visits to send to celery
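
These two log lines suggest the runner's basic bookkeeping: it sends at most as many visits as there are free slots in the queue. A simplified sketch of that idea follows, using the 10000 limit mentioned earlier for the git queue. The function and its parameters are hypothetical and are not the actual swh-scheduler API.

```python
from typing import Tuple


def visits_to_send(queue_limit: int, queue_length: int, pending_visits: int) -> Tuple[int, int]:
    """Return (free slots, number of visits to send) for one queue.

    Hypothetical helper: it illustrates the idea, not the swh-scheduler code.
    """
    free_slots = max(queue_limit - queue_length, 0)
    return free_slots, min(free_slots, pending_visits)


# The git queue is empty and no visit is waiting to be scheduled.
slots, to_send = visits_to_send(queue_limit=10000, queue_length=0, pending_visits=0)
print(f"{slots} slots available in celery queue")
print(f"{to_send} visits to send to celery")
```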

[2] The last edit of the task description shows it. The negative "Remains" values may be due to save code now requests (for origins that have since disappeared).

zack mentioned this in Unknown Object (Maniphest Task). Jul 22 2021, 10:38 AM

The 'hg' ingestion has started now that the latest mercurial loader is deployed.

ardumont renamed this task from Ingest sourceforge repositories to Ingest sourceforge repositories (origins of type git, svn, hg). Jul 29 2021, 5:45 PM
ardumont updated the task description.
ardumont moved this task from Backlog to in-progress on the System administration board.

Heads up on this task: I'm actually waiting for the bitbucket ingestion (which is going faster now) to finish,
so that we can re-use our worker17 to make one last run on all the mercurial origins.

As a bug was detected and fixed in the mercurial loader (about a missing snapshot), running the loader again
on all the previous hg origins makes sense, so that they get a decent snapshot (that will also go very fast
since we already made a run on them).

And then, we need to re-activate those origins the normal way so that workers do their regular crawling [1].

[1] T3470

I've triggered another run for the mercurial and git origins (it's done).
So it should now have caught up with any lag.
Those are regularly crawled.

I've triggered the same for svn; it's currently ongoing, and those origins will then be regularly crawled.
Once that's done, I'll be able to enable the sourceforge origins (they are not enabled yet, for
reasons already explained here).

So let's close this.