
staging: Analyze result of the maven listing and ingestion
Closed, ResolvedPublic

Description

Once the deployment and listing are triggered and done, analyze the results and/or errors.

Current errors:

  • T3746#83075, D7572: lister: currently the lister stops when it does not find a url (404); it should log the issue (warning) and continue listing the rest [1]
  • T3746#83079, D7573: loader: currently fails to load what the lister lists; the inconsistency between lister output and loader input needs fixing [2]
  • D7584: argument of type 'NoneType' is not iterable [3]
  • D7879: Canonicalize github origins during listing (deployed swh.lister v2.9.3)
  • T3874#86084: ongoing analysis for ^

[1] https://sentry.softwareheritage.org/share/issue/e2da55065b524c568c7a442d653b40c6/

[2] https://sentry.softwareheritage.org/share/issue/c48ad72d74d348e4a4dde0959f373674/

[3] https://sentry.softwareheritage.org/share/issue/ce84069a7d4b4ddeaa19f4b524d89b8e/
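
The lister fix described in the first bullet (warn on a 404 and keep listing) can be sketched roughly as below. This is only an illustration of the intended behavior, not the actual swh.lister code: `list_origins`, `fetch` and `NotFound` are hypothetical names introduced for the example.

```python
import logging
from typing import Callable, Iterable, Iterator, Tuple

logger = logging.getLogger(__name__)


class NotFound(Exception):
    """Hypothetical error raised by `fetch` when the server answers 404."""


def list_origins(
    urls: Iterable[str], fetch: Callable[[str], str]
) -> Iterator[Tuple[str, str]]:
    """Yield (url, content) pairs, skipping urls that cannot be found."""
    for url in urls:
        try:
            content = fetch(url)
        except NotFound:
            # Log the problem and carry on instead of aborting the whole listing.
            logger.warning("Not found (404), skipping: %s", url)
            continue
        yield url, content
```

Injecting `fetch` keeps the sketch independent of any HTTP library; a real lister would map the 404 response of its HTTP client to the same "warn and continue" branch.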

Event Timeline

ardumont triaged this task as Normal priority.Jan 24 2022, 9:28 AM
ardumont created this task.
ardumont renamed this task from staging: Analyze result of the maven ingestion to staging: Analyze result of the maven listing and ingestion.Apr 13 2022, 5:02 PM

Another round of deployment occurred with swh.lister v2.8.1.
The clojars repository got listed again (ongoing) and the lister is no longer crashing for that one.

ardumont changed the task status from Open to Work in Progress.Apr 15 2022, 5:12 PM

Old maven behavior results in origins like git://github.com, ... [1]
The new maven lister behavior should now result in canonical github urls http://github.com/user/repo.
Analysis is ongoing; the report will follow this comment.

[1] P1369

Plan:

  • P1369: Listing status after first round listing
  • Clean up maven github origins listing [1]
  • Trigger maven full run [2]
  • Wait for listing to finish
  • Listing status after new maven lister round of listing
  • Ping in mailing list discussion with data!

[1]

14:43:40 *swh-scheduler@db1:5432=> with maven_lister_ids as (
swh-scheduler(>     select id from listers where name='maven'
swh-scheduler(> ) delete from
swh-scheduler->     listed_origins lo1
swh-scheduler->   where
swh-scheduler->     lister_id in (select id from maven_lister_ids)
swh-scheduler->     and visit_type = 'git'
swh-scheduler->     and url like '%github.com%'
swh-scheduler->     and not exists (
swh-scheduler(>       select 1 from listed_origins lo2
swh-scheduler(>       where
swh-scheduler(>         lo1.visit_type = lo2.visit_type
swh-scheduler(>         and lo1.url = lo2.url
swh-scheduler(>         and lo2.lister_id not in (select id from maven_lister_ids)
swh-scheduler(>     );

DELETE 28067
Time: 61233.213 ms (01:01.233)
14:44:42 *swh-scheduler@db1:5432=>
14:44:42 *swh-scheduler@db1:5432=> commit;
COMMIT
Time: 258.813 ms

[2]

15:04:45 swh-scheduler@db1:5432=>  update task set status='next_run_not_scheduled', next_run=now() where id=31171944;
UPDATE 1
Time: 215.620 ms
15:05:01 swh-scheduler@db1:5432=> select * from task where id=31171944;
+-[ RECORD 1 ]-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id               | 31171944                                                                                                                                                        |
| type             | list-maven-full                                                                                                                                                 |
| arguments        | {"args": [], "kwargs": {"url": "https://repo1.maven.org/maven2/", "index_url": "https://maven-exporter.internal.staging.swh.network/export-maven-central.fld"}} |
| next_run         | 2022-06-01 13:05:01.466561+00                                                                                                                                   |
| current_interval | 90 days                                                                                                                                                         |
| status           | next_run_scheduled                                                                                                                                              |
| policy           | recurring                                                                                                                                                       |
| retries_left     | 0                                                                                                                                                               |
| priority         | (null)                                                                                                                                                          |
+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

Time: 5.767 ms

The full listing is not finished yet, but there remain origins with exotic url prefixes which are not canonicalized.
I'd say the issue lies with the canonicalization code in swh.core, which only deals with https:// and git:// urls.
So some improvements are needed there.

[1] P1371
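
To make the gap concrete, here is a rough sketch of a canonicalization that also covers the more exotic spellings (ssh, scp-like `git@github.com:user/repo`, Maven's `scm:git:` prefix). It is not the swh.core implementation: `canonicalize_github_url` is a hypothetical helper, the list of handled forms is a guess at what "exotic" covers, and the sketch assumes the canonical form is `https://github.com/<user>/<repo>`.

```python
import re
from typing import Optional

# Hypothetical pattern: matches common spellings of a github.com repository.
_GITHUB_RE = re.compile(
    r"""
    ^(?:scm:git:)?                       # optional Maven SCM prefix
    (?:(?:git|https?|ssh|git\+ssh)://)?  # optional scheme
    (?:[^@/]+@)?                         # optional user part, e.g. git@
    (?:www\.)?github\.com[:/]            # host followed by "/" or ":"
    (?P<user>[^/\s]+)/(?P<repo>[^/\s]+?)
    (?:\.git)?/?$                        # optional ".git" suffix and slash
    """,
    re.VERBOSE | re.IGNORECASE,
)


def canonicalize_github_url(url: str) -> Optional[str]:
    """Return the assumed canonical url, or None if the url is not recognized."""
    match = _GITHUB_RE.match(url.strip())
    if match is None:
        return None
    return f"https://github.com/{match['user']}/{match['repo']}"
```

Urls that carry an extra path after the repository name, or that point to another forge, are deliberately left unrecognized (`None`) so the caller can decide whether to keep them as-is.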

Yesterday, I fixed, diffed, released and pushed the diff [1] fixing the
canonicalization of the remaining exotic urls, cleaned up the 'git' origins (out of a
maven listing) and triggered a listing again. Today, checking back on those origins
(staging scheduler), there was still noise which should no longer have been there...

And then I found some dangling listing processes... So I gather what happened is that
some old processes (running the previous version of the code) actually consumed the
maven listing (and created such noise). So I killed them, stopped the services, cleaned
up those origins ^ yet again and triggered a full-maven listing again.

Status: Waiting yet again for the full maven listing to finish (and hopefully without
exotic non-canonicalized github origins this time)

Prior to actually pinging the mailing list discussion ^.

[1] D7946

status: triggered 2 full-maven lister runs on maven central and jboss [1]
And no more exotic github urls are popping up [2].

So I guess it's fixed. That is, all listed maven origins that end up being github ones (canonicalized ones) are already present in listed origins thanks to other listers [3]

[1]

root@pergamon:~# clush -b -w @staging-workers 'systemctl status swh-worker@lister' | grep maven | grep succeeded
Jun 03 09:08:00 worker0 python3[4170026]: [2022-06-03 09:08:00,712: INFO/ForkPoolWorker-4] Task swh.lister.maven.tasks.FullMavenLister[8d1d52b4-f0d0-4f3d-817b-3d02ad48eb0a] succeeded in 1171.5895328279585s: {'pages': 9941, 'origins': 9935}
Jun 03 08:08:43 worker3 python3[3724078]: [2022-06-03 08:08:43,751: INFO/ForkPoolWorker-4] Task swh.lister.maven.tasks.FullMavenLister[c9a18b0b-fcdd-492b-ab5a-5eca07896b2f] succeeded in 2246.291659256909s: {'pages': 30853, 'origins': 30840}

[2]

13:52:37 swh-scheduler@db1:5432=> with maven_lister_ids as (
    select id from listers where name='maven'
) select now(), visit_type, url
  from
    listed_origins lo1
  where
    lister_id in (select id from maven_lister_ids)
    and visit_type = 'git'
    and url like '%github.com%'
    and not exists (
      select 1 from listed_origins lo2
      where
        lo1.visit_type = lo2.visit_type
        and lo1.url = lo2.url
        and lo2.lister_id not in (select id from maven_lister_ids)
    )
;
+-----+------------+-----+
| now | visit_type | url |
+-----+------------+-----+
+-----+------------+-----+
(0 rows)

Time: 16936.595 ms (00:16.937)

[3] subset of the actual archive but still a few

16:43:29 swh-scheduler@db1:5432=> select now(), count(*) from listed_origins where visit_type='git' and url like '%github.com%';
+-------------------------------+---------+
|              now              |  count  |
+-------------------------------+---------+
| 2022-06-03 14:43:31.116227+00 | 2213849 |
+-------------------------------+---------+
(1 row)

Time: 4361.559 ms (00:04.362)
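
For readers unfamiliar with the check used in [2], the query is an anti-join: it keeps github 'git' origins listed by maven for which no other lister has the same (visit_type, url) pair. The sketch below replays that logic on an in-memory SQLite database with made-up rows. The two-table schema is a simplified assumption for illustration, not the real swh-scheduler schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema and data (assumed, heavily simplified).
conn.executescript(
    """
    create table listers (id integer primary key, name text);
    create table listed_origins (lister_id integer, visit_type text, url text);
    insert into listers values (1, 'maven'), (2, 'github');
    insert into listed_origins values
      (1, 'git', 'https://github.com/user/shared'),
      (2, 'git', 'https://github.com/user/shared'),
      (1, 'git', 'git://github.com/user/maven-only'),
      (1, 'maven', 'https://repo1.maven.org/maven2/some/artifact');
    """
)

MAVEN_ONLY_GITHUB = """
with maven_lister_ids as (
    select id from listers where name = 'maven'
)
select visit_type, url
from listed_origins lo1
where lister_id in (select id from maven_lister_ids)
  and visit_type = 'git'
  and url like '%github.com%'
  and not exists (
    -- same origin also listed by a non-maven lister
    select 1 from listed_origins lo2
    where lo1.visit_type = lo2.visit_type
      and lo1.url = lo2.url
      and lo2.lister_id not in (select id from maven_lister_ids)
  )
"""

rows = conn.execute(MAVEN_ONLY_GITHUB).fetchall()
print(rows)
```

Only the non-canonical `git://` origin comes back here, since the canonical one is also known through the github lister; an empty result, as in [2], means every github origin listed by maven is shared with another lister.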

There remain git and other dvcs-typed origins [1] listed by maven, but no github ones [2].

[1]

16:02:46 swh-scheduler@db1:5432=> with maven_lister_ids as (
    select id from listers where name='maven'
) select now(), visit_type, count(*)
  from
    listed_origins lo1
  where
    lister_id in (select id from maven_lister_ids)
    and visit_type != 'maven'
    and not exists (
      select 1 from listed_origins lo2
      where
        lo1.visit_type = lo2.visit_type
        and lo1.url = lo2.url
        and lo2.lister_id not in (select id from maven_lister_ids)
    )
  group by visit_type;
+-------------------------------+------------+-------+
|              now              | visit_type | count |
+-------------------------------+------------+-------+
| 2022-06-03 14:02:57.631052+00 | svn        | 14015 |
| 2022-06-03 14:02:57.631052+00 | hg         |   255 |
| 2022-06-03 14:02:57.631052+00 | cvs        |    64 |
| 2022-06-03 14:02:57.631052+00 | git        |  1887 |
+-------------------------------+------------+-------+
(4 rows)

Time: 59889.690 ms (00:59.890)

[2] P1377