Investigate CRAN revisions without an origin
Closed, MigratedEdits Locked
Actions

Assigned To

Authored By

	vlorentz
	Aug 26 2020, 1:53 PM

Description

19:58 <vlorentz> we have a revision 00a867beb2ad8e203f242e9843d2e88de0856cda
19:58 <vlorentz> which was obviously created by the CRAN loader, from its content
19:59 <vlorentz> and the origin should be https://cran.r-project.org/package=mljar
19:59 <vlorentz> but that origin doesn't exist
19:59 <vlorentz> additionally, the revision is referenced by a snapshot (f7ce27eb87dc82e02b6178db3c9d9dbc9fa055d6) which isn't reference by any visit status
19:59 <vlorentz> I could understand an orphan snapshot, but not the missing origin
20:01 <vlorentz> any idea what happened?
20:01 <vlorentz> the revision was created on 2020-01-09T13:14:43.438615+00:00, btw
20:02 <+olasd> one could fish out the cran_loader worker logs for that date, see what happened
20:03 <+olasd> (using kibana)
20:03 <+olasd> but for such old logs you'd need to reopen the elasticsearch indexes first

but I can't access the logs.

I opened the index, but Kibana doesn't see it. (and it's half unusable anyway, T2534), and ES doesn't return any document when searching in the index:

$ curl "esnode1.internal.softwareheritage.org:9200/swh_workers-2020.01.09/_search" -X GET -H "Content-Type: application/json" -d '{"query": {"match_all": {}}}'
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0.0,"hits":[]}}

List of "grand-orphan" revisions I found so far:

00a867beb2ad8e203f242e9843d2e88de0856cda (origin: https://cran.r-project.org/package=mljar )
024bee9f941d1cc9f11daebc72b42332a7af9f31 (origin: https://cran.r-project.org/package=ROI.plugin.ipop )
028e9890a9287b35851c48ca351641743542d030 (origin: https://cran.r-project.org/package=PhaseType )
08dd80bdaf5cf36f940865e8d7e0556cad6d881d (origin: https://cran.r-project.org/package=SchemaOnRead )
09e34df7727658fd5db298bb3d4dd862a68768e7 (origin: https://cran.r-project.org/package=multilevelMatching )
0b17032b3021e24424f8d5d2694b3cf207df6854 ( https://cran.r-project.org/package=MetamapsDB )
0d9502792997d32416f9b2e0582d6eb80185723b https://cran.r-project.org/package=spartan
243056d1bfb23e22844b477f8dbbd09e5bf5ebf6 https://cran.r-project.org/package=ActivityIndex
2431636ecb54fb0d0e340eb82e0abaf6ef3897d7 https://cran.r-project.org/package=DZEXPM
27254839c516a21d72de8ebdfe80ee559d2affe7 https://cran.r-project.org/package=rinat
2c478e2d48e70b2bd5dbf53c271f9727511ab8b3 https://cran.r-project.org/package=ReliabilityTheory
2db27bd5b2907bd8a71c437a972016e1480dc2d9 https://cran.r-project.org/package=IPtoCountry
2e8bbe5ab4f49c6e9f73c995d5685d9b78d36b1e https://cran.r-project.org/package=refGenome
3662f0d1e781eaaa6796c4a639111d7b9a787163 https://cran.r-project.org/package=JuniperKernel
38b3070edfc018308a0bc3d9604d19d90c430c7c https://cran.r-project.org/package=BayHap
38e4aacb06cb218ee7c2fa38ddb16611e426d7bb https://cran.r-project.org/package=aws.sns
3a54793134512d29a9ce7c55834ddf1513d7f5ef https://cran.r-project.org/package=bclust
3f3d73c171248c7a618c7fd1e7e00f27aa6dde56 https://cran.r-project.org/package=cec2005benchmark

Related Objects

Mentioned In: P1157 Revisions with non-empty metadata field and missing raw extrinsic metadata
T2548: Restore CRAN visits deleted in january 2020 from backups
Mentioned Here: T2548: Restore CRAN visits deleted in january 2020 from backups
T2549: Restore Mercurial visits deleted in august 2018 from backups
P585 clean up obsolete cran visits and origins
T2029: cran lister: Align lister to output list of tarballs per origin
D2524: cran.lister: Use CRAN's canonical url as origin url
rDLS476177363182: cran.lister: Use cran's canonical url for origin url
T2534: Migrate/restore Kibana's config

Event Timeline

vlorentz triaged this task as High priority.Aug 26 2020, 1:53 PM

vlorentz created this task.

vlorentz updated the task description. (Show Details)

vlorentz renamed this task from Investigate origin https://cran.r-project.org/package=mljar to Investigate CRAN revisions without an origin.Aug 26 2020, 6:01 PM

vlorentz updated the task description. (Show Details)

vlorentz updated the task description. (Show Details)Aug 26 2020, 6:14 PM

vlorentz updated the task description. (Show Details)Aug 26 2020, 9:26 PM

vlorentz updated the task description. (Show Details)Aug 27 2020, 12:08 AM

vlorentz updated the task description. (Show Details)Aug 27 2020, 9:49 AM

vlorentz updated the task description. (Show Details)Aug 27 2020, 10:59 AM

vlorentz updated the task description. (Show Details)Aug 27 2020, 7:46 PM

vlorentz updated the task description. (Show Details)Aug 27 2020, 9:18 PM

vlorentz updated the task description. (Show Details)Aug 27 2020, 11:35 PM

vlorentz updated the task description. (Show Details)Aug 28 2020, 8:12 AM

vlorentz updated the task description. (Show Details)Aug 28 2020, 11:46 AM

So far, i'm confused by this.

I started looking into this but i keep on finding unexpected things.

The origins, as showed do not exist (checked prod and mirror).

elasticsearch

Indeed, the index is opened.
It seems to have data (as it takes space) somehow.

Stats on that index.

{ -
  "_shards": { -
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "_all": { -
    "primaries": { -
      "docs": { -
        "count": 3975277,
        "deleted": 0
      },
      "store": { -
        "size": "2.7gb",
        "size_in_bytes": 2953638609
      },

Note: using cerebro gui to click my way into elasticsearch information (ikr!).

Although the remaining part of the stats displays only 0 which could explain why the query, as showed by val, returns nothing.
(That's an issue in itself...)

"indexing": { -
        "index_total": 0,
        "index_time": "0s",
        "index_time_in_millis": 0,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time": "0s",
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time": "0s",
        "throttle_time_in_millis": 0
      },
      "get": { -
        "total": 0,
        "getTime": "0s",
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time": "0s",
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time": "0s",
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": { -
        "open_contexts": 0,
        "query_total": 0,
        "query_time": "0s",
        "query_time_in_millis": 0,
        "query_current": 0,
        "fetch_total": 0,
        "fetch_time": "0s",
        "fetch_time_in_millis": 0,
        "fetch_current": 0,
        "scroll_total": 0,
        "scroll_time": "0s",
        "scroll_time_in_millis": 0,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time": "0s",
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },

systemd logs

I don't know onto which nodes started looking into (don't know what worker did
those ingestion).

But I don't think we keep any of those logs for "so long" (~ 7 months). For
disk reasons, our workers have little disk space (~40Gib).

My understanding is that we push the logs to elasticsearch for that purpose...

scheduler

Since looking logs is a no go, I tried to reproduce the issue locally (docker).
I needed the arguments reference so I check the scheduler but here, again, those references don't show up:

softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=mljar';
 task | type | arguments
------+------+-----------
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=ROI.plugin.ipop';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=PhaseType';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=SchemaOnRead';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=multilevelMatching';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=MetamapsDB';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=spartan';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=ActivityIndex';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=DZEXPM';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=rinat';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=ReliabilityTheory';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=IPToCountry';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=IPtoCountry';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=refGenome';
(0 rows)

Note: the query works :D, see below for an example

softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=wk';
task                                                                                                 $
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
 (337043429,load-cran,"{""args"": [], ""kwargs"": {""url"": ""https://cran.r-project.org/package=wk"", ""artifacts"": [{""url"": ""https://cran.r-project.org/src/contrib/wk_0.3.2.tar.gz"", ""version"": ""0.3.2""}]}}","2020-08-27 15:30:16.$
(1 row)

vlorentz updated the task description. (Show Details)Aug 28 2020, 12:33 PM

I don't think it matters but still. I checked all the packages listed (up to bclust) and they are no longer listed over the html listing page reached from the r-project site [1].

[1] https://cloud.r-project.org/

Hmmmm... so it could mean that earlier versions of the CRAN lister/loader didn't create origins?

I'm not checking all the revisions I see are referenced by a visit, only that their origin exists

Hmmmm... so it could mean that earlier versions of the CRAN lister/loader didn't create origins?

yes, somehow, but that does not make sense in my mind.
Initially the lister and loader created both origins (idempotently).

Now the lister no longer does. Only the loader does.
And that's by what it starts (then origin visit) prior to ingest any other dag related objects...

softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' like '%mljar%';
                                                                                                      task                                                                                                       |   type    |                                                    arguments                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------------------------------------------------------------------------------------------------------------
 (251148056,load-cran,"{""args"": [], ""kwargs"": {""url"": ""https://cran.r-project.org/src/contrib/mljar_0.1.1.tar.gz"", ""version"": ""0.1.1""}}","2020-01-09 12:45:08.56447+00","1 day",disabled,oneshot,3,) | load-cran | {"args": [], "kwargs": {"url": "https://cran.r-project.org/src/contrib/mljar_0.1.1.tar.gz", "version": "0.1.1"}}

Nice! Thanks. One confusing point down, cool.

Looks like a previous version of the cran loader which did not yet unify the origins, most likely.

Looks like a previous version of the cran loader which did not yet unify the origins, most likely.

well, almost there, it's about the lister (not loader) [1] but you got the gist ;)

[1] rDLS4761773631826cadf545f34e2142ffcc0bd99d3b

I don't find data to back me up right now but i would think something like that happened:

previous version of the cran lister created origins with the form https://cran.r-project.org/src/contrib/<package>_<version>.tar.gz (<- i don't find origins matching that form but i'm pretty sure it happened, thus the missing data to back me up). [1]
we migrated the lister code and the loader to deal with the new canonical cran urls (you and me had a discussion about it in a diff [2])
most probably, a migration from old format origin urls to the new format happened but some origins got lost in migration

[1]

softwareheritage=> explain select * from origin where url like 'https://cran.r-project.org/src/contrib/%';
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on origin  (cost=2165.48..28612.26 rows=13868 width=52)
   Recheck Cond: (url ~~ 'https://cran.r-project.org/src/contrib/%'::text)
   ->  Bitmap Index Scan on origin_url_idx  (cost=0.00..2162.01 rows=13868 width=0)
         Index Cond: (url ~~ 'https://cran.r-project.org/src/contrib/%'::text)
(4 rows)

softwareheritage=> select * from origin where url like 'https://cran.r-project.org/src/contrib/%';
 id | url
----+-----
(0 rows)

softwareheritage=> select * from origin where url like 'https://cran.r-project.org/src/%';
 id | url
----+-----
(0 rows)
softwareheritage=> select * from origin where url like 'http://cran.r-project.org/src/contrib%';
 id | url
----+-----
(0 rows)

[2] https://forge.softwareheritage.org/D2524#60369

I don't find data to back me up right now but i would think something like that happened:
...

This comment seems to back up my hypothesis on what happened [1]

[1] https://forge.softwareheritage.org/T2029#40520

This comment seems to back up my hypothesis on what happened [1]

Follow the white rabbit in P585 from that previous link.

I found those origins back, they were cleaned up:

postgres@belvedere:~$ for package in mljar ROI.plugin.ipop PhaseType SchemaOnRead multilevelMatching MetamapsDB spartan ActivityIndex DZEXPM rinat ReliabilityTheory IPtoCountry refGenome JuniperKernel BayHap aws.sns bclust; do
>   grep "${package}" cran-origins-to-cleanup.txt
> done
91403380        https://cran.r-project.org/src/contrib/mljar_0.1.1.tar.gz
91407154        https://cran.r-project.org/src/contrib/ROI.plugin.ipop_0.2-5.tar.gz
91405069        https://cran.r-project.org/src/contrib/PhaseType_0.1.3.tar.gz
91407767        https://cran.r-project.org/src/contrib/SchemaOnRead_1.0.2.tar.gz
91403774        https://cran.r-project.org/src/contrib/multilevelMatching_1.0.0.tar.gz
91403076        https://cran.r-project.org/src/contrib/MetamapsDB_0.0.2.tar.gz
91408633        https://cran.r-project.org/src/contrib/spartan_3.0.2.tar.gz
91395500        https://cran.r-project.org/src/contrib/ActivityIndex_0.3.6.tar.gz
91398797        https://cran.r-project.org/src/contrib/DZEXPM_2.0.tar.gz
91406844        https://cran.r-project.org/src/contrib/rinat_0.1.5.tar.gz
91406599        https://cran.r-project.org/src/contrib/ReliabilityTheory_0.1.5.tar.gz
91401705        https://cran.r-project.org/src/contrib/IPtoCountry_0.0.1.tar.gz
91406548        https://cran.r-project.org/src/contrib/refGenome_1.7.7.tar.gz
91401939        https://cran.r-project.org/src/contrib/JuniperKernel_1.4.1.0.tar.gz
91396249        https://cran.r-project.org/src/contrib/BayHap_1.0.1.tar.gz
91396040        https://cran.r-project.org/src/contrib/aws.sns_0.1.7.tar.gz
91396292        https://cran.r-project.org/src/contrib/bclust_1.5.tar.gz
91402616        https://cran.r-project.org/src/contrib/lsbclust_1.1.tar.gz
91402889        https://cran.r-project.org/src/contrib/mbclusterwise_1.0.tar.gz

and what happened to the visits?

and what happened to the visits?

They were lost. As the paste [1] suggests.

Most possibly, the thought process was to trigger a run next to the migration...
But the hole in the reasoning was that origins could disappear in between migrating and the new listing. Which is apparently what happened [2].

One of the many reasons of why now, we decided to stop doing those potentially lossy migrations.
And indeed, we no longer do those. 1. because it's time consuming and 2. it can be lossy as demonstrated here.

[1] https://forge.softwareheritage.org/P585$1

[2] https://forge.softwareheritage.org/T2536#47694

vlorentz updated the task description. (Show Details)Aug 28 2020, 3:18 PM

vlorentz mentioned this in T2548: Restore CRAN visits deleted in january 2020 from backups.Aug 28 2020, 4:06 PM

closing in favor of T2548 (and T2549)

vlorentz mentioned this in P1157 Revisions with non-empty metadata field and missing raw extrinsic metadata.Sep 10 2021, 7:30 PM

This task has been migrated to GitLab.

Investigate CRAN revisions without an originClosed, MigratedEdits LockedActions