Page MenuHomeSoftware Heritage

Investigate CRAN revisions without an origin
Closed, ResolvedPublic

Description

19:58 <vlorentz> we have a revision 00a867beb2ad8e203f242e9843d2e88de0856cda
19:58 <vlorentz> which was obviously created by the CRAN loader, from its content
19:59 <vlorentz> and the origin should be https://cran.r-project.org/package=mljar
19:59 <vlorentz> but that origin doesn't exist
19:59 <vlorentz> additionally, the revision is referenced by a snapshot (f7ce27eb87dc82e02b6178db3c9d9dbc9fa055d6) which isn't reference by any visit status
19:59 <vlorentz> I could understand an orphan snapshot, but not the missing origin
20:01 <vlorentz> any idea what happened?
20:01 <vlorentz> the revision was created on 2020-01-09T13:14:43.438615+00:00, btw
20:02 <+olasd> one could fish out the cran_loader worker logs for that date, see what happened
20:03 <+olasd> (using kibana)
20:03 <+olasd> but for such old logs you'd need to reopen the elasticsearch indexes first

but I can't access the logs.

I opened the index, but Kibana doesn't see it. (and it's half unusable anyway, T2534), and ES doesn't return any document when searching in the index:

$ curl "esnode1.internal.softwareheritage.org:9200/swh_workers-2020.01.09/_search" -X GET -H "Content-Type: application/json" -d '{"query": {"match_all": {}}}'
{"took":0,"timed_out":false,"_shards":{"total":0,"successful":0,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":0.0,"hits":[]}}

List of "grand-orphan" revisions I found so far:

Event Timeline

vlorentz triaged this task as High priority.Aug 26 2020, 1:53 PM
vlorentz created this task.
vlorentz updated the task description. (Show Details)
vlorentz renamed this task from Investigate origin https://cran.r-project.org/package=mljar to Investigate CRAN revisions without an origin.Aug 26 2020, 6:01 PM
vlorentz updated the task description. (Show Details)
vlorentz updated the task description. (Show Details)
vlorentz updated the task description. (Show Details)Aug 26 2020, 6:14 PM
vlorentz updated the task description. (Show Details)Aug 26 2020, 9:26 PM
vlorentz updated the task description. (Show Details)Aug 27 2020, 12:08 AM
vlorentz updated the task description. (Show Details)Aug 27 2020, 9:49 AM
vlorentz updated the task description. (Show Details)Aug 27 2020, 10:59 AM
vlorentz updated the task description. (Show Details)Aug 27 2020, 7:46 PM
vlorentz updated the task description. (Show Details)Aug 27 2020, 9:18 PM
vlorentz updated the task description. (Show Details)Aug 27 2020, 11:35 PM
vlorentz updated the task description. (Show Details)Aug 28 2020, 8:12 AM
vlorentz updated the task description. (Show Details)Aug 28 2020, 11:46 AM

So far, i'm confused by this.

I started looking into this but i keep on finding unexpected things.

The origins, as showed do not exist (checked prod and mirror).

elasticsearch

Indeed, the index is opened.
It seems to have data (as it takes space) somehow.

Stats on that index.

{ -
  "_shards": { -
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "_all": { -
    "primaries": { -
      "docs": { -
        "count": 3975277,
        "deleted": 0
      },
      "store": { -
        "size": "2.7gb",
        "size_in_bytes": 2953638609
      },

Note: using cerebro gui to click my way into elasticsearch information (ikr!).

Although the remaining part of the stats displays only 0 which could explain why the query, as showed by val, returns nothing.
(That's an issue in itself...)

"indexing": { -
        "index_total": 0,
        "index_time": "0s",
        "index_time_in_millis": 0,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time": "0s",
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time": "0s",
        "throttle_time_in_millis": 0
      },
      "get": { -
        "total": 0,
        "getTime": "0s",
        "time_in_millis": 0,
        "exists_total": 0,
        "exists_time": "0s",
        "exists_time_in_millis": 0,
        "missing_total": 0,
        "missing_time": "0s",
        "missing_time_in_millis": 0,
        "current": 0
      },
      "search": { -
        "open_contexts": 0,
        "query_total": 0,
        "query_time": "0s",
        "query_time_in_millis": 0,
        "query_current": 0,
        "fetch_total": 0,
        "fetch_time": "0s",
        "fetch_time_in_millis": 0,
        "fetch_current": 0,
        "scroll_total": 0,
        "scroll_time": "0s",
        "scroll_time_in_millis": 0,
        "scroll_current": 0,
        "suggest_total": 0,
        "suggest_time": "0s",
        "suggest_time_in_millis": 0,
        "suggest_current": 0
      },

systemd logs

I don't know onto which nodes started looking into (don't know what worker did
those ingestion).

But I don't think we keep any of those logs for "so long" (~ 7 months). For
disk reasons, our workers have little disk space (~40Gib).

My understanding is that we push the logs to elasticsearch for that purpose...

scheduler

Since looking logs is a no go, I tried to reproduce the issue locally (docker).
I needed the arguments reference so I check the scheduler but here, again, those references don't show up:

softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=mljar';
 task | type | arguments
------+------+-----------
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=ROI.plugin.ipop';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=PhaseType';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=SchemaOnRead';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=multilevelMatching';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=MetamapsDB';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=spartan';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=ActivityIndex';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=DZEXPM';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=rinat';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=ReliabilityTheory';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=IPToCountry';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=IPtoCountry';
(0 rows)
softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=refGenome';
(0 rows)

Note: the query works :D, see below for an example

softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' = 'https://cran.r-project.org/package=wk';
task                                                                                                 $
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
 (337043429,load-cran,"{""args"": [], ""kwargs"": {""url"": ""https://cran.r-project.org/package=wk"", ""artifacts"": [{""url"": ""https://cran.r-project.org/src/contrib/wk_0.3.2.tar.gz"", ""version"": ""0.3.2""}]}}","2020-08-27 15:30:16.$
(1 row)
vlorentz updated the task description. (Show Details)Aug 28 2020, 12:33 PM

I don't think it matters but still. I checked all the packages listed (up to bclust) and they are no longer listed over the html listing page reached from the r-project site [1].

[1] https://cloud.r-project.org/

Hmmmm... so it could mean that earlier versions of the CRAN lister/loader didn't create origins?

I'm not checking all the revisions I see are referenced by a visit, only that their origin exists

ardumont added a comment.EditedAug 28 2020, 12:50 PM

Hmmmm... so it could mean that earlier versions of the CRAN lister/loader didn't create origins?

yes, somehow, but that does not make sense in my mind.
Initially the lister and loader created both origins (idempotently).

Now the lister no longer does. Only the loader does.
And that's by what it starts (then origin visit) prior to ingest any other dag related objects...

softwareheritage-scheduler=> select task, type, arguments from task where type='load-cran' and arguments->'kwargs'->>'url' like '%mljar%';
                                                                                                      task                                                                                                       |   type    |                                                    arguments                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------------------------------------------------------------------------------------------------------------
 (251148056,load-cran,"{""args"": [], ""kwargs"": {""url"": ""https://cran.r-project.org/src/contrib/mljar_0.1.1.tar.gz"", ""version"": ""0.1.1""}}","2020-01-09 12:45:08.56447+00","1 day",disabled,oneshot,3,) | load-cran | {"args": [], "kwargs": {"url": "https://cran.r-project.org/src/contrib/mljar_0.1.1.tar.gz", "version": "0.1.1"}}

Nice! Thanks. One confusing point down, cool.

Looks like a previous version of the cran loader which did not yet unify the origins, most likely.

Looks like a previous version of the cran loader which did not yet unify the origins, most likely.

well, almost there, it's about the lister (not loader) [1] but you got the gist ;)

[1] rDLS4761773631826cadf545f34e2142ffcc0bd99d3b

ardumont added a comment.EditedAug 28 2020, 2:32 PM

I don't find data to back me up right now but i would think something like that happened:

  • previous version of the cran lister created origins with the form https://cran.r-project.org/src/contrib/<package>_<version>.tar.gz (<- i don't find origins matching that form but i'm pretty sure it happened, thus the missing data to back me up). [1]
  • we migrated the lister code and the loader to deal with the new canonical cran urls (you and me had a discussion about it in a diff [2])
  • most probably, a migration from old format origin urls to the new format happened but some origins got lost in migration

[1]

softwareheritage=> explain select * from origin where url like 'https://cran.r-project.org/src/contrib/%';
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Bitmap Heap Scan on origin  (cost=2165.48..28612.26 rows=13868 width=52)
   Recheck Cond: (url ~~ 'https://cran.r-project.org/src/contrib/%'::text)
   ->  Bitmap Index Scan on origin_url_idx  (cost=0.00..2162.01 rows=13868 width=0)
         Index Cond: (url ~~ 'https://cran.r-project.org/src/contrib/%'::text)
(4 rows)

softwareheritage=> select * from origin where url like 'https://cran.r-project.org/src/contrib/%';
 id | url
----+-----
(0 rows)

softwareheritage=> select * from origin where url like 'https://cran.r-project.org/src/%';
 id | url
----+-----
(0 rows)
softwareheritage=> select * from origin where url like 'http://cran.r-project.org/src/contrib%';
 id | url
----+-----
(0 rows)

[2] https://forge.softwareheritage.org/D2524#60369

ardumont added a comment.EditedAug 28 2020, 2:40 PM

I don't find data to back me up right now but i would think something like that happened:
...

This comment seems to back up my hypothesis on what happened [1]

[1] https://forge.softwareheritage.org/T2029#40520

ardumont added a comment.EditedAug 28 2020, 2:46 PM

This comment seems to back up my hypothesis on what happened [1]

Follow the white rabbit in P585 from that previous link.

I found those origins back, they were cleaned up:

postgres@belvedere:~$ for package in mljar ROI.plugin.ipop PhaseType SchemaOnRead multilevelMatching MetamapsDB spartan ActivityIndex DZEXPM rinat ReliabilityTheory IPtoCountry refGenome JuniperKernel BayHap aws.sns bclust; do
>   grep "${package}" cran-origins-to-cleanup.txt
> done
91403380        https://cran.r-project.org/src/contrib/mljar_0.1.1.tar.gz
91407154        https://cran.r-project.org/src/contrib/ROI.plugin.ipop_0.2-5.tar.gz
91405069        https://cran.r-project.org/src/contrib/PhaseType_0.1.3.tar.gz
91407767        https://cran.r-project.org/src/contrib/SchemaOnRead_1.0.2.tar.gz
91403774        https://cran.r-project.org/src/contrib/multilevelMatching_1.0.0.tar.gz
91403076        https://cran.r-project.org/src/contrib/MetamapsDB_0.0.2.tar.gz
91408633        https://cran.r-project.org/src/contrib/spartan_3.0.2.tar.gz
91395500        https://cran.r-project.org/src/contrib/ActivityIndex_0.3.6.tar.gz
91398797        https://cran.r-project.org/src/contrib/DZEXPM_2.0.tar.gz
91406844        https://cran.r-project.org/src/contrib/rinat_0.1.5.tar.gz
91406599        https://cran.r-project.org/src/contrib/ReliabilityTheory_0.1.5.tar.gz
91401705        https://cran.r-project.org/src/contrib/IPtoCountry_0.0.1.tar.gz
91406548        https://cran.r-project.org/src/contrib/refGenome_1.7.7.tar.gz
91401939        https://cran.r-project.org/src/contrib/JuniperKernel_1.4.1.0.tar.gz
91396249        https://cran.r-project.org/src/contrib/BayHap_1.0.1.tar.gz
91396040        https://cran.r-project.org/src/contrib/aws.sns_0.1.7.tar.gz
91396292        https://cran.r-project.org/src/contrib/bclust_1.5.tar.gz
91402616        https://cran.r-project.org/src/contrib/lsbclust_1.1.tar.gz
91402889        https://cran.r-project.org/src/contrib/mbclusterwise_1.0.tar.gz

and what happened to the visits?

ardumont added a comment.EditedAug 28 2020, 2:53 PM

and what happened to the visits?

They were lost. As the paste [1] suggests.

Most possibly, the thought process was to trigger a run next to the migration...
But the hole in the reasoning was that origins could disappear in between migrating and the new listing. Which is apparently what happened [2].

One of the many reasons of why now, we decided to stop doing those potentially lossy migrations.
And indeed, we no longer do those. 1. because it's time consuming and 2. it can be lossy as demonstrated here.

[1] https://forge.softwareheritage.org/P585$1

[2] https://forge.softwareheritage.org/T2536#47694

vlorentz updated the task description. (Show Details)Aug 28 2020, 3:18 PM
vlorentz closed this task as Resolved.Aug 31 2020, 4:37 PM
vlorentz claimed this task.

closing in favor of T2548 (and T2549)