Investigate why cran origins are marked as non-visited in the scheduler metrics
Closed, Resolved (Public)

Description

Those origins have been ingested [1]. They reference a snapshot [2]. But they are marked
as never visited in the metrics for unknown reasons [3].

[1]

15:22:12 softwareheritage-scheduler@belvedere:5432=> select count(url) from listed_origins o inner join listers l on l.id=o.lister_id where visit_type='tar' and l.name='CRAN';
+-------+
| count |
+-------+
| 18896 |
+-------+
(1 row)

Time: 3608.017 ms (00:03.608)
15:22:21 softwareheritage-scheduler@belvedere:5432=> select url from listed_origins o inner join listers l on l.id=o.lister_id where visit_type='tar' and l.name='CRAN' limit 10;
+-------------------------------------------------+
|                       url                       |
+-------------------------------------------------+
| https://cran.r-project.org/package=A3           |
| https://cran.r-project.org/package=AATtools     |
| https://cran.r-project.org/package=ABACUS       |
| https://cran.r-project.org/package=ABC.RAP      |
| https://cran.r-project.org/package=ABCanalysis  |
| https://cran.r-project.org/package=ABCoptim     |
| https://cran.r-project.org/package=ABCp2        |
| https://cran.r-project.org/package=ABHgenotypeR |
| https://cran.r-project.org/package=ABPS         |
| https://cran.r-project.org/package=ACA          |
+-------------------------------------------------+
(10 rows)

Time: 5.533 ms

[2]

15:22:57 softwareheritage-scheduler@belvedere:5432=> select count(url) from origin_visit_stats where visit_type='tar' and url like 'https://cran.r-project.org/%' and last_snapshot is null and last_visit is not null;
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)

Time: 672009.331 ms (11:12.009)

[3]

15:37:02 softwareheritage-scheduler@belvedere:5432=> select l.name, l.instance_name, sm.* from scheduler_metrics sm inner join listers l on sm.lister_id=l.id where l.name='CRAN' ;
+------+---------------+--------------------------------------+------------+-------------------------------+---------------+-----------------+-----------------------+------------------------------+
| name | instance_name |              lister_id               | visit_type |          last_update          | origins_known | origins_enabled | origins_never_visited | origins_with_pending_changes |
+------+---------------+--------------------------------------+------------+-------------------------------+---------------+-----------------+-----------------------+------------------------------+
| CRAN | cran          | 0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4 | tar        | 2021-10-20 12:38:35.110398+00 |         18896 |           18896 |                 18896 |                            0 |
+------+---------------+--------------------------------------+------------+-------------------------------+---------------+-----------------+-----------------------+------------------------------+
(1 row)

Time: 13.436 ms

Event Timeline

ardumont triaged this task as Normal priority. Oct 20 2021, 3:37 PM
ardumont created this task.
ardumont added a project: Scheduling utilities.
ardumont updated the task description.
ardumont changed the task status from Open to Work in Progress. Oct 22 2021, 2:48 PM
ardumont raised the priority of this task from Normal to Unbreak Now!.

Ah, @anlambert found the issue: the CRAN origins are listed with the tar visit type
instead of cran. The results of the following queries should be the opposite:
no tar origins, only cran ones.

14:53:34 *softwareheritage-scheduler@belvedere:5432=> select count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where l.name='CRAN' and visit_type='tar';
+-------+
| count |
+-------+
| 18904 |
+-------+
(1 row)

Time: 1236.045 ms (00:01.236)
14:53:39 *softwareheritage-scheduler@belvedere:5432=> select count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where l.name='CRAN' and visit_type='cran';
+-------+
| count |
+-------+
|     0 |
+-------+
(1 row)

Time: 17.059 ms
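
A minimal sketch of why a visit type mismatch can make every origin look never visited, run against an in-memory SQLite database rather than the scheduler's PostgreSQL. The two-table schema, the placeholder snapshot value and the never-visited query are simplified assumptions, not the scheduler's actual metrics computation. The sketch also assumes that visits were recorded under the cran type.

```python
import sqlite3

# Simplified, hypothetical stand-ins for the scheduler tables; the real
# schema has more columns and the real metrics query may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table listed_origins (url text, visit_type text);
    create table origin_visit_stats (url text, visit_type text, last_snapshot text);
""")

urls = [
    "https://cran.r-project.org/package=A3",
    "https://cran.r-project.org/package=ABACUS",
]
# The lister recorded the origins with the 'tar' visit type...
conn.executemany("insert into listed_origins values (?, 'tar')", [(u,) for u in urls])
# ...while, in this sketch, visit stats are recorded under 'cran'.
conn.executemany(
    "insert into origin_visit_stats values (?, 'cran', 'placeholder-snapshot')",
    [(u,) for u in urls],
)

# Assumed definition: an origin is "never visited" when no stats row
# matches it on both url and visit_type.
NEVER_VISITED = """
    select count(*)
    from listed_origins lo
    left join origin_visit_stats ovs
      on ovs.url = lo.url and ovs.visit_type = lo.visit_type
    where ovs.last_snapshot is null
"""


def never_visited(conn):
    return conn.execute(NEVER_VISITED).fetchone()[0]


before = never_visited(conn)  # types disagree: nothing matches
conn.execute("update listed_origins set visit_type = 'cran' where visit_type = 'tar'")
after = never_visited(conn)   # types agree: every origin has stats
print(before, after)  # prints: 2 0
```

Under these assumptions, the origins and their snapshots exist all along; only the join key differs, which would explain a metric that counts all of them as never visited.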

Plan to fix this:

  • land diff D6539
  • package and deploy it (within swh.lister v2.2.0)
  • update the origins in the scheduler backend (tables listed_origins, scheduler_metrics, origin_visit_stats) [1] [2]
  • ensure cran origins are scheduled (scheduler runner in place)
  • ensure the swh-worker@loader_cran service does its job
  • schedule a fresh listing once all fixes are in place
  • D6544: drop the computation done in the archive coverage part for those origins.

That should be it. Both the staging and production environments got fixed.

[1]

softwareheritage-scheduler=# update listed_origins set visit_type='cran' where lister_id='0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4' and visit_type='tar';
UPDATE 18904
softwareheritage-scheduler=# update scheduler_metrics set visit_type='cran' where lister_id='0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4' and visit_type='tar';
UPDATE 1
softwareheritage-scheduler=# update origin_visit_stats set visit_type='cran' where visit_type='tar' and url like 'https://cran.r-project.org/%';
UPDATE 0

[2]

softwareheritage-scheduler=# select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where l.name='CRAN' and visit_type='tar';
              now              | count
-------------------------------+-------
 2021-10-22 13:46:31.520123+00 |     0
(1 row)

softwareheritage-scheduler=# select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where l.name='CRAN' and visit_type='cran';
              now              | count
-------------------------------+-------
 2021-10-22 13:46:36.055949+00 | 18904
(1 row)
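
The updates in [1] and the checks in [2] can be sketched as one atomic step followed by a per-type count. This is an illustration on an in-memory SQLite database with a simplified, hypothetical schema, not the procedure actually run on the scheduler database. The helper name `retag` is made up for the example, and origin_visit_stats is left out since its update touched 0 rows.

```python
import sqlite3

# CRAN lister id, copied from the update statements above.
LISTER_ID = "0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4"

# Simplified, hypothetical stand-ins for the scheduler tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table listed_origins (lister_id text, url text, visit_type text);
    create table scheduler_metrics (lister_id text, visit_type text);
""")
conn.executemany(
    "insert into listed_origins values (?, ?, 'tar')",
    [
        (LISTER_ID, "https://cran.r-project.org/package=A3"),
        (LISTER_ID, "https://cran.r-project.org/package=ACA"),
    ],
)
conn.execute("insert into scheduler_metrics values (?, 'tar')", (LISTER_ID,))
conn.commit()


def retag(conn, lister_id, old="tar", new="cran"):
    """Retag both tables in one transaction; return rows updated per table."""
    updated = {}
    with conn:  # commits on success, rolls back if an exception is raised
        for table in ("listed_origins", "scheduler_metrics"):
            cur = conn.execute(
                f"update {table} set visit_type = ? "
                "where lister_id = ? and visit_type = ?",
                (new, lister_id, old),
            )
            updated[table] = cur.rowcount
    return updated


updated = retag(conn, LISTER_ID)

# Same check as [2], but both visit types in a single grouped query.
by_type = dict(conn.execute(
    "select visit_type, count(*) from listed_origins "
    "where lister_id = ? group by visit_type",
    (LISTER_ID,),
))
print(updated, by_type)
```

Running the updates in one transaction avoids a window where listed_origins and scheduler_metrics disagree on the visit type, and the grouped count replaces the two separate tar and cran queries.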

drop the computation done in the archive coverage part for those origins.

Once CRAN origin loading has been fixed in production and all origins have been processed,
this piece of code can be removed from the swh-web coverage widget implementation.

The CRAN loader worked hard and fast: the metrics got updated and are now consistent with reality [1].
So we can land D6544, deploy it, and be done with this issue.

[1]

17:28:02 softwareheritage-scheduler@belvedere:5432=> select * from scheduler_metrics lo inner join listers l on lo.lister_id=l.id where l.name='CRAN' and visit_type='cran';
+--------------------------------------+------------+-------------------------------+---------------+-----------------+-----------------------+------------------------------+--------------------------------------+------+---------------+-------------------------------+---------------+-------------------------------+
|              lister_id               | visit_type |          last_update          | origins_known | origins_enabled | origins_never_visited | origins_with_pending_changes |                  id                  | name | instance_name |            created            | current_state |            updated            |
+--------------------------------------+------------+-------------------------------+---------------+-----------------+-----------------------+------------------------------+--------------------------------------+------+---------------+-------------------------------+---------------+-------------------------------+
| 0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4 | cran       | 2021-10-22 15:12:20.890003+00 |         18912 |           18912 |                     0 |                            0 | 0bac0a61-1ee1-45ad-b37e-13a38a0fb8f4 | CRAN | cran          | 2021-02-04 18:59:31.657968+00 | {}            | 2021-02-04 18:59:31.657968+00 |
+--------------------------------------+------------+-------------------------------+---------------+-----------------+-----------------------+------------------------------+--------------------------------------+------+---------------+-------------------------------+---------------+-------------------------------+
(1 row)

Time: 10.056 ms

The remaining coverage fix is deployed.

The archive now properly shows the 18912 cran origins [1].

[1] https://archive.softwareheritage.org/

ardumont claimed this task.