Page MenuHomeSoftware Heritage

loader.npm: If no upload time provided, use artifact's mtime if provided
ClosedPublic

Authored by ardumont on Jan 29 2020, 11:35 AM.

Details

Summary

Instead of failing the ingestion, fallback to use the artifact's mtime (if
provided).

Note that most origins have their upload 'time' provided. In the error
referenced [1], the origins sampled there show that they have their 'mtime'
referenced instead of the 'time'.

Also, if no upload time is provided at all, skip the artifacts altogether. This
allows ingestion of further artifacts for the same origin instead of failing
completely the origin ingestion.

[1] https://sentry.softwareheritage.org/share/issue/769f9b19f2f54e939cc7f163844bd95c/

Test Plan

tox

Diff Detail

Repository
rDLDBASE Generic VCS/Package Loader
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 10338
Build 15363: tox-on-jenkinsJenkins
Build 15362: arc lint + arc unit

Event Timeline

I don't think it's a good idea to silently skip artifacts like this.

What about updating the data model to allow missing dates?

This revision now requires changes to proceed.Jan 29 2020, 12:24 PM

I don't think it's a good idea to silently skip artifacts like this.

That's what we do for other issues we had so far when a good reason justifies it.
That good reason is that we fail completely the origin ingestion (the other potentially good artifacts for that origin are never seen if that fails early for that package).

Also it's not entirely silent, there is a warning log here (which could be improved upon).

What about updating the data model to allow missing dates?

In revision?
That date is used to create the revision because we have nothing else in that case.

Ideally we would have liked to have that data within the intrinsic metadata but that's not possible.

Improve warning log message in case of no time is provided

What about updating the data model to allow missing dates?

In revision?

Yes

What about updating the data model to allow missing dates?

In revision?

Yes

I'm not sure we want to go over that road yet (impacts in swh-model, then down to the storage...)

Let's see if the issue arise first (monitoring for the log warning to appear).
I'm not sure it happens yet.
I added the conditional because the if elseif without an else feels awkward.

Also, there are actually 2 commits in that diff:

  • one to deal with the actual problem, mtime instead of time
  • another to deal with the case where no time at all is referenced (that i wanted to deal with to be exhaustive).

I can split it if that really annoys you.

Looks good regarding the scope of that diff !

I don't think it's a good idea to silently skip artifacts like this.

What about updating the data model to allow missing dates?

I think in last resort we could extract the missing date from the creation date of the top-level directory
contained in the downloaded tarball. This directory is created when uploading a package to the npm
registry when calling npm publish.

For instance in you test example ([[ https://replicate.npmjs.com/jammit-express | jammit-express ]]), the mtime field for version 0.0.1
is equal to 2010-11-09T23:27:28Z. If we look at the content of the tarball, we get the following:

$ tar -ztvf jammit-express-0.0.1.tgz
drwxr-xr-x isaacs/_lpoperator 0 2010-11-10 00:27 1289345247862-0.009397033136337996/
-rw-r--r-- isaacs/_lpoperator 5 2010-09-25 01:38 1289345247862-0.009397033136337996/.npmignore
-rwxr--r-- isaacs/_lpoperator 184 2010-09-16 10:17 1289345247862-0.009397033136337996/._assets_json.rb
-rwxr--r-- isaacs/_lpoperator 382 2010-09-16 10:17 1289345247862-0.009397033136337996/assets_json.rb
-rw-r--r-- isaacs/_lpoperator 184 2010-09-15 22:08 1289345247862-0.009397033136337996/._init.js
-rw-r--r-- isaacs/_lpoperator 1329 2010-09-15 22:08 1289345247862-0.009397033136337996/init.js
-rw-r--r-- isaacs/_lpoperator  187 2010-09-25 01:36 1289345247862-0.009397033136337996/._package.json
-rw-r--r-- isaacs/_lpoperator  705 2010-09-25 01:36 1289345247862-0.009397033136337996/package.json
-rw-r--r-- isaacs/_lpoperator  186 2010-09-16 10:19 1289345247862-0.009397033136337996/._README.md
-rw-r--r-- isaacs/_lpoperator  362 2010-09-16 10:19 1289345247862-0.009397033136337996/README.md

Apart the timezone difference, dates look the same.

But let's see if the problem appears in production first.

Looks good regarding the scope of that diff !

cool

I don't think it's a good idea to silently skip artifacts like this.

What about updating the data model to allow missing dates?

I think in last resort we could extract the missing date from the creation date of the top-level directory
contained in the downloaded tarball. This directory is created when uploading a package to the npm
registry when calling npm publish.

yes, sounds like a better plan.

Furthermore, that'd me more "intrinsic" somehow.

But let's see if the problem appears in production first.

indeed.

swh/loader/package/npm/loader.py
106

I'll raise a value error with that message here.
So that it stands out in sentry if it even happens (simpler than to parse logs to detect the log pattern now).

I think in last resort we could extract the missing date from the creation date of the top-level directory

contained in the downloaded tarball. This directory is created when uploading a package to the npm
registry when calling npm publish.

I have a local branch ready with that behavior.

btw, I tried to have only that behavior for the mtime date extraction to use in revision (just to try and be totally consistent).
But other tests failed because hash divergences. I did not check further than that though (just a heads up)

Fail ingestion with a proper message if at least 1 artifact has no upload time