
Add a lister of all available packages in the npm registry
Closed, Public

Authored by anlambert on Nov 23 2018, 4:46 PM.

Details

Summary

This diff adds a new lister in order to discover all packages available
in the npm registry (https://replicate.npmjs.com/) and create loading tasks
to ingest them into the archive.

At the time of writing, 839234 packages are registered in it.
Next step: write an npm loader.

Related T1378
Closes T1380

Diff Detail

Repository
rDLS Listers
Branch
npm-lister
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2596
Build 3225: tox-on-jenkins (Jenkins)
Build 3224: arc lint + arc unit

Event Timeline

swh/lister/npm/lister.py
63

As per the latest discussion on other reviews and IRC, it's better to keep only one return statement per method.
Passing along information ;)

Proposal:

return repos[-1]['id'] if len(repos) == self.per_page else None
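
A minimal sketch of what the single-return form does, written as a standalone function so it can be run on its own. The name `next_target` and the sample rows are hypothetical; only the conditional expression comes from the proposal above.

```python
def next_target(repos, per_page):
    """Return the id of the last row when the page is full, else None."""
    # A full page suggests more rows may follow, so the last id becomes the
    # next index; a short (or empty) page means the listing is finished.
    return repos[-1]['id'] if len(repos) == per_page else None


# Hypothetical sample pages, each row shaped like {'id': <package name>}.
full_page = [{'id': 'jquery'}, {'id': 'lodash'}]
short_page = [{'id': 'zlib'}]

print(next_target(full_page, per_page=2))   # lodash
print(next_target(short_page, per_page=2))  # None
```

The conditional expression is only evaluated on the left side when the length check passes, so an empty page does not raise an IndexError.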
96

Is that why the class does not declare an API_URL_INDEX_RE?

Sounds like one more full lister to me \m/.

swh/lister/npm/lister.py
23

why the +1?

This revision is now accepted and ready to land. Nov 23 2018, 5:17 PM
anlambert added inline comments.
swh/lister/npm/lister.py
63

ack

96

API_URL_INDEX_RE is only declared and used in the Github lister.

For npm, there is no need to parse an url to get the next index as
it corresponds to the last entry in the current package list.
I used the recommended pagination method explained here [1].

Nevertheless, I do not like this override, but I could not find another way
to make the tests pass. I will check again whether I can find something better
so that the override can be removed.

[1] http://docs.couchdb.org/en/stable/ddocs/views/pagination.html#paging-alternate-method
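
A rough sketch of the pagination idea from [1], as I understand it: ask for one row more than the page size, keep the page, and reuse the extra row as the start key of the next request. Everything here is illustrative: `fetch_page` is a hypothetical in-memory stand-in for the registry query, and `list_all` is not the lister's actual code.

```python
def fetch_page(all_ids, startkey, limit):
    """Stand-in for a sorted view query: ids >= startkey, at most `limit` rows."""
    rows = [i for i in sorted(all_ids) if startkey is None or i >= startkey]
    return rows[:limit]


def list_all(all_ids, per_page):
    """Walk every id by requesting per_page + 1 rows per call."""
    seen, startkey = [], None
    while True:
        rows = fetch_page(all_ids, startkey, per_page + 1)
        if len(rows) == per_page + 1:
            # The extra row is not part of this page: it only marks
            # where the next page starts.
            seen.extend(rows[:-1])
            startkey = rows[-1]
        else:
            # Short page: nothing follows, so keep every row and stop.
            seen.extend(rows)
            return seen


packages = ['express', 'jquery', 'lodash', 'react', 'vue']
print(list_all(packages, per_page=2))
```

Because the next start key is taken from the rows themselves, no URL has to be parsed to find the next index.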

anlambert added inline comments.
swh/lister/npm/lister.py
96

For the record, this is what I got when running the tests without overriding that method:

self = <test_npm_lister.NpmListerTester testMethod=test_fetch_one_nodb>, http_mocker = <requests_mock.mocker.Mocker object at 0x7f73e5f84780>

    def test_fetch_one_nodb(self, http_mocker):
        http_mocker.get(self.test_re, text=self.mock_response)
        fl = self.get_fl()
    
        self.disable_storage_and_scheduler(fl)
        self.disable_db(fl)
    
>       fl.run(min_bound=self.first_index, max_bound=self.first_index)

swh/lister/core/tests/test_lister.py:191: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
swh/lister/core/indexing_lister.py:176: in run
    response, injected_repos = self.ingest_data(index)
swh/lister/core/lister_base.py:511: in ingest_data
    models_list = self.filter_before_inject(models_list)
swh/lister/core/indexing_lister.py:68: in filter_before_inject
    m for m in models_list
swh/lister/core/indexing_lister.py:69: in <listcomp>
    if self.is_within_bounds(m['indexable'], None, self.max_index)
swh/lister/core/lister_base.py:198: in is_within_bounds
    self.string_pattern_check(inner, lower, upper)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <swh.lister.npm.lister.NpmLister object at 0x7f73deeade48>, a = 'jquery-1.8', b = None, c = 'jquery'

    def string_pattern_check(self, a, b, c=None):
        """When comparing indexable types in is_within_bounds, complex strings
                may not be allowed to differ in basic structure. If they do, it
                could be a sign of not understanding the data well. For instance,
                an ISO 8601 time string cannot be compared against its urlencoded
                equivalent, but this is an easy mistake to accidentally make. This
                method acts as a friendly sanity check.
    
            Args:
                a (string): inner component of the is_within_bounds method
                b (string): lower component of the is_within_bounds method
                c (string): upper component of the is_within_bounds method
            Returns:
                nothing
            Raises:
                TypeError if strings a, b, and c don't conform to the same basic
                pattern.
            """
        if isinstance(a, str):
            a_pattern = re.sub('[a-zA-Z0-9]',
                               '[a-zA-Z0-9]',
                               re.escape(a))
            if (isinstance(b, str) and (re.match(a_pattern, b) is None)
               or isinstance(c, str) and (re.match(a_pattern, c) is None)):
                logging.debug(a_pattern)
>               raise TypeError('incomparable string patterns detected')
E               TypeError: incomparable string patterns detected

swh/lister/core/lister_base.py:437: TypeError
------------------------------------------------------------------------------------------------------------------------------ Captured log call ------------------------------------------------------------------------------------------------------------------------------
lister_base.py             203 ERROR    incomparable string patterns detected: inner=<class 'str'>jquery-1.8, lower=<class 'NoneType'>None, upper=<class 'str'>jquery
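
To make the failure above easier to follow, here is a small standalone reproduction of the pattern comparison shown in the traceback. The helper name `patterns_compatible` is hypothetical; the regular expression logic is copied from `string_pattern_check` as displayed above.

```python
import re


def patterns_compatible(inner, bound):
    """Build a shape pattern from `inner` and match `bound` against it."""
    # Every alphanumeric character becomes a one-character class, while
    # punctuation such as '-' and '.' stays literal (escaped by re.escape).
    shape = re.sub('[a-zA-Z0-9]', '[a-zA-Z0-9]', re.escape(inner))
    return re.match(shape, bound) is not None


# 'jquery-1.8' requires a literal '-' after six characters, which the
# bound 'jquery' does not have, hence the TypeError in the test run.
print(patterns_compatible('jquery-1.8', 'jquery'))  # False

# Two names made of the same number of alphanumeric characters match.
print(patterns_compatible('jquery', 'lodash'))      # True
```

npm package names freely mix letters, digits, dashes and dots, so two names rarely share the same shape, which is why the check trips on this data.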

Update: keep only one return statement in get_next_target_from_response

swh/lister/npm/lister.py
23

lol, ok that's what i kinda did for the content_get_range endpoint [1]

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/storage.py$278

96

API_URL_INDEX_RE is only declared and used in the Github lister.

ok, thanks. No biggie, just wondering ;)

declared and used in Github lister.

Huh, right.
I should have checked. I thought it was used more than once!

This revision was automatically updated to reflect the committed changes.