Page MenuHomeSoftware Heritage

lister.gnu: Standardize arguments to pass to the loader tar
ClosedPublic

Authored by ardumont on Oct 16 2019, 1:31 PM.

Details

Summary

This:

  • Moves version parsing logic from loader-gnu to lister
  • Parses the timestamps provided by gnu tree to isoformat dates (to provide to loader)
  • (Aligns the timestamps stored in the lister model to be also a date in the right format to clarify reading from it)

This allows to have a generic "tar" loader.
That way, the loader expects some required arguments in the right type already.
No need for it to try and do specific computations depending on the origin of the archives.

Related D2145

Note:
This actually implies a schema migration (timestamp model migration)
As the loader gnu never ran though, we might as well drop the existing data and trigger a new run when time comes.
So it's not a blocker.

Test Plan

tox

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ardumont created this revision.Oct 16 2019, 1:31 PM
ardumont updated this revision to Diff 7265.Oct 16 2019, 2:34 PM

Fix using set instead of list

ardumont updated this revision to Diff 7266.Oct 16 2019, 2:45 PM
  • gnu.tree: Move helper utilities function to the bottom of the file
ardumont edited the summary of this revision. (Show Details)Oct 17 2019, 9:41 AM
ardumont added projects: Lister, Origin-GNU.
ardumont updated this revision to Diff 7307.Oct 17 2019, 2:17 PM
ardumont edited the summary of this revision. (Show Details)

Rebase to latest master

ardumont edited the summary of this revision. (Show Details)Oct 17 2019, 6:14 PM
ardumont retitled this revision from lister.gnu: Move version parsing logic from loader-gnu to lister to lister.gnu: Standardize arguments to pass to the loader.Oct 18 2019, 3:23 AM
ardumont edited the summary of this revision. (Show Details)
ardumont retitled this revision from lister.gnu: Standardize arguments to pass to the loader to lister.gnu: Standardize arguments to pass to the loader tar.
ardumont edited the summary of this revision. (Show Details)Oct 18 2019, 3:29 AM
ardumont updated this revision to Diff 7332.Oct 18 2019, 3:31 AM
  • gnu.lister: Format timestamp to isoformat string for the tar loader
  • gnu.lister: Unify timestamp formats to isoformat date in model
  • gnu.tree: Drop debug statement
  • gnu.lister: Update docstring samples
ardumont edited the summary of this revision. (Show Details)Oct 18 2019, 3:32 AM

Build has FAILED

Yes, i did not parse the timestamp into an isoformat datestring correctly.
I'm fixing that.

ardumont updated this revision to Diff 7342.Oct 18 2019, 1:06 PM

Fix parsing timestamps

Build has FAILED

Yes, it now depends on D2158

vlorentz requested changes to this revision.Oct 18 2019, 2:25 PM
vlorentz added a subscriber: vlorentz.

I don't think this regexp is the right solution, it's quite fragile in case a package name uses one of the keywords. A possible solution is for get_version to only accept URLs, and split the path to find the package name.

swh/lister/gnu/tests/test_tree.py
152

I don't understand this sentence

swh/lister/gnu/tree.py
304–312
350

shouldn't the final + be non-greedy as well? (Don't forget to add a test for this, if relevant)

I'm guessing you'll need it for .7z.

358–361
.format(
    extensions=...
    vkeywords=...
This revision now requires changes to proceed.Oct 18 2019, 2:25 PM
ardumont added inline comments.Oct 18 2019, 2:44 PM
swh/lister/gnu/tests/test_tree.py
152
"""Parsing version from url should yield some form of "sensible" version

Given the dataset, it's not a simple task to extract correctly version.

"""

is it better?

vlorentz added inline comments.Oct 18 2019, 2:45 PM
swh/lister/gnu/tests/test_tree.py
152

yes

I don't think this regexp is the right solution, it's quite fragile in case a package name uses one of the keywords.

Yes, it's fragile and we don't see any better solution right now.
That's how we did it initially, we improved the current version to be a tad better though.
And also it's tested even if it's limited.

A possible solution is for get_version to only accept URLs, and split the path to find the package name.

I'm not getting the difference with the current implem.
filename is extracted from the uri in any case...

Do you have an example in mind, that'd help, i think.

TIA

swh/lister/gnu/tree.py
304–312

thx!

ardumont added inline comments.Oct 18 2019, 3:00 PM
swh/lister/gnu/tree.py
350

I added a test with a .7z and in current form, that's enough.

I don't think this regexp is the right solution, it's quite fragile in case a package name uses one of the keywords.

Yes, it's fragile and we don't see any better solution right now.
That's how we did it initially, we improved the current version to be a tad better though.
And also it's tested even if it's limited.

A possible solution is for get_version to only accept URLs, and split the path to find the package name.

I'm not getting the difference with the current implem.
filename is extracted from the uri in any case...

I meant splitting on / so you get the package name in the folder name. But I just realized it's not a reliable method, so nvm

ardumont added inline comments.Oct 18 2019, 3:03 PM
swh/lister/gnu/tree.py
358–361

Well, i tried and it does not work.
That fails the tests somehow...

ardumont added inline comments.Oct 18 2019, 3:06 PM
swh/lister/gnu/tree.py
358–361

pattern1 is the current implem' (i renamed the global variable to use upper case in their name):

collecting ... ('pattern1: \n'
 '^\n'
 '(?:\n'
 '    # We have a software name and a release number, separated with a\n'
 '    # -, _ or dot.\n'
 '    (?P<software_name1>.+?[-_.])\n'
 '    '
 '(?P<release_number>(cygwin_me[-]?|w32[-]?|win32[-]?|nt[-]?|cygwin[-]?|mingw[-]?|latest[-]?|alpha[-]?|beta[-]?|release[-]?|stable[-]?|hppa[-]?|solaris[-]?|sunos[-]?|sun4u[-]?|sparc[-]?|sun[-]?|aix[-]?|ibm[-]?|rs6000[-]?|i386[-]?|i686[-]?|linux[-]?|redhat[-]?|linuxlibc[-]?|mips[-]?|powerpc[-]?|macos[-]?|apple[-]?|darwin[-]?|macosx[-]?|powermacintosh[-]?|unknown[-]?|netbsd[-]?|freebsd[-]?|sgi[-]?|irix[-]?|[0-9][0-9a-zA-Z_.+:~-]*?)+)\n'
 '|\n'
 "    # We couldn't match a release number, put everything in the\n"
 '    # software name.\n'
 '    (?P<software_name2>.+?)\n'
 ')\n'

pattern2 is the format call:

PATTERN2 = r'''
^
(?:
    # We have a software name and a release number, separated with a
    # -, _ or dot.
    (?P<software_name1>.+?[-_.])
    (?P<release_number>(%(vkeywords)s|[0-9][0-9a-zA-Z_.+:~-]*?)+)
|
    # We couldn't match a release number, put everything in the
    # software name.
    (?P<software_name2>.+?)
)
(?P<extension>(?:\.(?:%(extensions)s))+)
$
'''.format(
    extensions='|'.join(EXTENSIONS),
    vkeywords='|'.join('%s[-]?' % k for k in VERSION_KEYWORDS)
)

pprint(PATTERN2)

gives:

 '(?P<extension>(?:\\.(?:zip|tar|gz|tgz|bz2|bzip2|lzma|lz|xz|Z|7z))+)\n'
 '$\n')
('pattern2: \n'
 '^\n'
 '(?:\n'
 '    # We have a software name and a release number, separated with a\n'
 '    # -, _ or dot.\n'
 '    (?P<software_name1>.+?[-_.])\n'
 '    (?P<release_number>(%(vkeywords)s|[0-9][0-9a-zA-Z_.+:~-]*?)+)\n'
 '|\n'
 "    # We couldn't match a release number, put everything in the\n"
 '    # software name.\n'
 '    (?P<software_name2>.+?)\n'
 ')\n'
 '(?P<extension>(?:\\.(?:%(extensions)s))+)\n'
 '$\n')

it seems it does not work as expected...

ardumont updated this revision to Diff 7347.Oct 18 2019, 3:07 PM

Adapt according to what's possible from the review

  • gnu.tree: Add missing 7z extension
  • gnu.test: Improve test docstring sentence
  • gnu.tree: Use upper case for naming global variable
douardda added inline comments.Oct 18 2019, 4:56 PM
swh/lister/gnu/tree.py
358–361

because you need to use "{extensions}" in the string to be formatted

douardda accepted this revision.Oct 18 2019, 4:58 PM
ardumont added inline comments.Oct 18 2019, 6:45 PM
swh/lister/gnu/tree.py
358–361

right, thx!

ardumont updated this revision to Diff 7357.Oct 18 2019, 7:04 PM
  • gnu.tree: Use .format build
vlorentz accepted this revision.Oct 20 2019, 8:56 PM
This revision is now accepted and ready to land.Oct 20 2019, 8:56 PM
ardumont updated this revision to Diff 7406.Oct 23 2019, 1:06 PM

Rebase to latest master

(which should fix the ci)

ardumont updated this revision to Diff 7607.Nov 4 2019, 10:14 AM

Rebase and plug to latest master

ardumont updated this revision to Diff 7609.Nov 4 2019, 11:07 AM

Fix type annotations and squash some commits