Page MenuHomeSoftware Heritage

lister.gnu: Standardize arguments to pass to the loader tar
ClosedPublic

Authored by ardumont on Oct 16 2019, 1:31 PM.

Details

Summary

This:

  • Moves version parsing logic from loader-gnu to lister
  • Parses the timestamps provided by gnu tree to isoformat dates (to provide to loader)
  • (Aligns the timestamps stored in the lister model to be also a date in the right format to clarify reading from it)

This allows to have a generic "tar" loader.
That way, the loader expects some required arguments in the right type already.
No need for it to try and do specific computations depending on the origin of the archives.

Related D2145

Note:
This actually implies a schema migration (timestamp model migration)
As the loader gnu never ran though, we might as well drop the existing data and trigger a new run when time comes.
So it's not a blocker.

Test Plan

tox

Diff Detail

Repository
rDLS Listers
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 8771
Build 12785: tox-on-jenkinsJenkins
Build 12784: arc lint + arc unit

Event Timeline

Fix using set instead of list

  • gnu.tree: Move helper utilities function to the bottom of the file
ardumont edited the summary of this revision. (Show Details)

Rebase to latest master

ardumont retitled this revision from lister.gnu: Move version parsing logic from loader-gnu to lister to lister.gnu: Standardize arguments to pass to the loader.Oct 18 2019, 3:23 AM
ardumont edited the summary of this revision. (Show Details)
ardumont retitled this revision from lister.gnu: Standardize arguments to pass to the loader to lister.gnu: Standardize arguments to pass to the loader tar.
  • gnu.lister: Format timestamp to isoformat string for the tar loader
  • gnu.lister: Unify timestamp formats to isoformat date in model
  • gnu.tree: Drop debug statement
  • gnu.lister: Update docstring samples

Build has FAILED

Yes, i did not parse the timestamp into an isoformat datestring correctly.
I'm fixing that.

Build has FAILED

Yes, it now depends on D2158

vlorentz added a subscriber: vlorentz.

I don't think this regexp is the right solution, it's quite fragile in case a package name uses one of the keywords. A possible solution is for get_version to only accept URLs, and split the path to find the package name.

swh/lister/gnu/tests/test_tree.py
151

I don't understand this sentence

swh/lister/gnu/tree.py
187–195
233

shouldn't the final + be non-greedy as well? (Don't forget to add a test for this, if relevant)

I'm guessing you'll need it for .7z.

241–244
.format(
    extensions=...
    vkeywords=...
This revision now requires changes to proceed.Oct 18 2019, 2:25 PM
swh/lister/gnu/tests/test_tree.py
151
"""Parsing version from url should yield some form of "sensible" version

Given the dataset, it's not a simple task to extract correctly version.

"""

is it better?

swh/lister/gnu/tests/test_tree.py
151

yes

I don't think this regexp is the right solution, it's quite fragile in case a package name uses one of the keywords.

Yes, it's fragile and we don't see any better solution right now.
That's how we did it initially, we improved the current version to be a tad better though.
And also it's tested even if it's limited.

A possible solution is for get_version to only accept URLs, and split the path to find the package name.

I'm not getting the difference with the current implem.
filename is extracted from the uri in any case...

Do you have an example in mind, that'd help, i think.

TIA

swh/lister/gnu/tree.py
187–195

thx!

swh/lister/gnu/tree.py
233

I added a test with a .7z and in current form, that's enough.

I don't think this regexp is the right solution, it's quite fragile in case a package name uses one of the keywords.

Yes, it's fragile and we don't see any better solution right now.
That's how we did it initially, we improved the current version to be a tad better though.
And also it's tested even if it's limited.

A possible solution is for get_version to only accept URLs, and split the path to find the package name.

I'm not getting the difference with the current implem.
filename is extracted from the uri in any case...

I meant splitting on / so you get the package name in the folder name. But I just realized it's not a reliable method, so nvm

swh/lister/gnu/tree.py
241–244

Well, i tried and it does not work.
That fails the tests somehow...

swh/lister/gnu/tree.py
241–244

pattern1 is the current implem' (i renamed the global variable to use upper case in their name):

collecting ... ('pattern1: \n'
 '^\n'
 '(?:\n'
 '    # We have a software name and a release number, separated with a\n'
 '    # -, _ or dot.\n'
 '    (?P<software_name1>.+?[-_.])\n'
 '    '
 '(?P<release_number>(cygwin_me[-]?|w32[-]?|win32[-]?|nt[-]?|cygwin[-]?|mingw[-]?|latest[-]?|alpha[-]?|beta[-]?|release[-]?|stable[-]?|hppa[-]?|solaris[-]?|sunos[-]?|sun4u[-]?|sparc[-]?|sun[-]?|aix[-]?|ibm[-]?|rs6000[-]?|i386[-]?|i686[-]?|linux[-]?|redhat[-]?|linuxlibc[-]?|mips[-]?|powerpc[-]?|macos[-]?|apple[-]?|darwin[-]?|macosx[-]?|powermacintosh[-]?|unknown[-]?|netbsd[-]?|freebsd[-]?|sgi[-]?|irix[-]?|[0-9][0-9a-zA-Z_.+:~-]*?)+)\n'
 '|\n'
 "    # We couldn't match a release number, put everything in the\n"
 '    # software name.\n'
 '    (?P<software_name2>.+?)\n'
 ')\n'

pattern2 is the format call:

PATTERN2 = r'''
^
(?:
    # We have a software name and a release number, separated with a
    # -, _ or dot.
    (?P<software_name1>.+?[-_.])
    (?P<release_number>(%(vkeywords)s|[0-9][0-9a-zA-Z_.+:~-]*?)+)
|
    # We couldn't match a release number, put everything in the
    # software name.
    (?P<software_name2>.+?)
)
(?P<extension>(?:\.(?:%(extensions)s))+)
$
'''.format(
    extensions='|'.join(EXTENSIONS),
    vkeywords='|'.join('%s[-]?' % k for k in VERSION_KEYWORDS)
)

pprint(PATTERN2)

gives:

 '(?P<extension>(?:\\.(?:zip|tar|gz|tgz|bz2|bzip2|lzma|lz|xz|Z|7z))+)\n'
 '$\n')
('pattern2: \n'
 '^\n'
 '(?:\n'
 '    # We have a software name and a release number, separated with a\n'
 '    # -, _ or dot.\n'
 '    (?P<software_name1>.+?[-_.])\n'
 '    (?P<release_number>(%(vkeywords)s|[0-9][0-9a-zA-Z_.+:~-]*?)+)\n'
 '|\n'
 "    # We couldn't match a release number, put everything in the\n"
 '    # software name.\n'
 '    (?P<software_name2>.+?)\n'
 ')\n'
 '(?P<extension>(?:\\.(?:%(extensions)s))+)\n'
 '$\n')

it seems it does not work as expected...

Adapt according to what's possible from the review

  • gnu.tree: Add missing 7z extension
  • gnu.test: Improve test docstring sentence
  • gnu.tree: Use upper case for naming global variable
swh/lister/gnu/tree.py
241–244

because you need to use "{extensions}" in the string to be formatted

swh/lister/gnu/tree.py
241–244

right, thx!

  • gnu.tree: Use .format build
This revision is now accepted and ready to land.Oct 20 2019, 8:56 PM

Rebase to latest master

(which should fix the ci)

Rebase and plug to latest master

Fix type annotations and squash some commits