Add a lister of all available packages in the npm registry
ClosedPublic

Authored by anlambert on Fri, Nov 23, 4:46 PM.

Details

Summary

This diff adds a new lister in order to discover all packages available
in the npm registry (https://replicate.npmjs.com/) and create loading tasks
to ingest them into the archive.

At the time of writing, the registry lists 839234 packages.
Next step: write an npm loader.

Related T1378
Closes T1380

Diff Detail

Repository
rDLS Listers
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

anlambert created this revision. Fri, Nov 23, 4:46 PM
anlambert added a project: Origin-npm.
ardumont added inline comments. Fri, Nov 23, 5:02 PM
swh/lister/npm/lister.py
64

As per the latest discussion on other reviews and on IRC, it's better to keep only one return statement per method.
Passing along information ;)

Proposal:

return repos[-1]['id'] if len(repos) == self.per_page else None
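
For illustration, here is a minimal sketch of the proposal next to a two-return variant. The function names, the `rows` argument and the two-return form are assumptions made for this example; the reviewed code itself is not reproduced in this thread.

```python
def next_index_two_returns(rows, per_page):
    # Hypothetical "before" shape: one return statement per branch.
    if len(rows) == per_page:
        return rows[-1]['id']
    return None


def next_index_single_return(rows, per_page):
    # Proposed shape: one conditional expression, hence a single return.
    # rows[-1] is only evaluated when the length check passes.
    return rows[-1]['id'] if len(rows) == per_page else None
```

Both forms behave the same; the second only changes how the result is expressed.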
97

Is that why the class does not declare an API_URL_INDEX_RE?

ardumont accepted this revision as: ardumont. Fri, Nov 23, 5:17 PM

Sounds like one more full lister to me \m/.

swh/lister/npm/lister.py
24

why the +1?

This revision is now accepted and ready to land. Fri, Nov 23, 5:17 PM
anlambert marked 2 inline comments as done. Fri, Nov 23, 5:20 PM
anlambert added inline comments.
swh/lister/npm/lister.py
64

ack

97

API_URL_INDEX_RE is only declared and used in the Github lister.

For npm, there is no need to parse a URL to get the next index, as
it corresponds to the last entry in the current package list.
I used the recommended pagination method explained here [1].

Nevertheless, I do not like this override, but I could not find another way
to make the tests pass. I will check again whether I can find something better
in order to remove it.

[1] http://docs.couchdb.org/en/stable/ddocs/views/pagination.html#paging-alternate-method
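
The pagination approach described above can be sketched as follows. This is a minimal illustration, not the lister's code: `fetch_rows` is a hypothetical stand-in for a `_all_docs?startkey=...&limit=...` request, backed here by an in-memory sorted list, and asking for one extra row per page is an assumption about how the next start key is obtained.

```python
from bisect import bisect_left


def fetch_rows(sorted_ids, startkey, limit):
    """Hypothetical stand-in for a CouchDB view query: return up to
    `limit` rows whose id is >= startkey, in key order."""
    start = bisect_left(sorted_ids, startkey)
    return [{'id': i} for i in sorted_ids[start:start + limit]]


def list_all(sorted_ids, page_size):
    """Walk the whole index page by page, yielding each id exactly once."""
    startkey = ''
    while startkey is not None:
        # Ask for one row more than the page size.
        rows = fetch_rows(sorted_ids, startkey, page_size + 1)
        if len(rows) == page_size + 1:
            # The extra row is not part of this page: its id becomes
            # the start key of the next request.
            startkey = rows[-1]['id']
            rows = rows[:-1]
        else:
            # Fewer rows than requested: this was the last page.
            startkey = None
        for row in rows:
            yield row['id']
```

Because the next start key is read from the current response, no URL has to be parsed to find the next index.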

anlambert marked an inline comment as done. Fri, Nov 23, 5:22 PM
anlambert added inline comments.
anlambert marked an inline comment as done. Fri, Nov 23, 5:27 PM
anlambert added inline comments.
swh/lister/npm/lister.py
97

For the record, this is what I got when running the tests without overriding that method:

self = <test_npm_lister.NpmListerTester testMethod=test_fetch_one_nodb>, http_mocker = <requests_mock.mocker.Mocker object at 0x7f73e5f84780>

    def test_fetch_one_nodb(self, http_mocker):
        http_mocker.get(self.test_re, text=self.mock_response)
        fl = self.get_fl()
    
        self.disable_storage_and_scheduler(fl)
        self.disable_db(fl)
    
>       fl.run(min_bound=self.first_index, max_bound=self.first_index)

swh/lister/core/tests/test_lister.py:191: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
swh/lister/core/indexing_lister.py:176: in run
    response, injected_repos = self.ingest_data(index)
swh/lister/core/lister_base.py:511: in ingest_data
    models_list = self.filter_before_inject(models_list)
swh/lister/core/indexing_lister.py:68: in filter_before_inject
    m for m in models_list
swh/lister/core/indexing_lister.py:69: in <listcomp>
    if self.is_within_bounds(m['indexable'], None, self.max_index)
swh/lister/core/lister_base.py:198: in is_within_bounds
    self.string_pattern_check(inner, lower, upper)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <swh.lister.npm.lister.NpmLister object at 0x7f73deeade48>, a = 'jquery-1.8', b = None, c = 'jquery'

    def string_pattern_check(self, a, b, c=None):
        """When comparing indexable types in is_within_bounds, complex strings
                may not be allowed to differ in basic structure. If they do, it
                could be a sign of not understanding the data well. For instance,
                an ISO 8601 time string cannot be compared against its urlencoded
                equivalent, but this is an easy mistake to accidentally make. This
                method acts as a friendly sanity check.
    
            Args:
                a (string): inner component of the is_within_bounds method
                b (string): lower component of the is_within_bounds method
                c (string): upper component of the is_within_bounds method
            Returns:
                nothing
            Raises:
                TypeError if strings a, b, and c don't conform to the same basic
                pattern.
            """
        if isinstance(a, str):
            a_pattern = re.sub('[a-zA-Z0-9]',
                               '[a-zA-Z0-9]',
                               re.escape(a))
            if (isinstance(b, str) and (re.match(a_pattern, b) is None)
               or isinstance(c, str) and (re.match(a_pattern, c) is None)):
                logging.debug(a_pattern)
>               raise TypeError('incomparable string patterns detected')
E               TypeError: incomparable string patterns detected

swh/lister/core/lister_base.py:437: TypeError
------------------------------------------------------------------------------------------------------------------------------ Captured log call ------------------------------------------------------------------------------------------------------------------------------
lister_base.py             203 ERROR    incomparable string patterns detected: inner=<class 'str'>jquery-1.8, lower=<class 'NoneType'>None, upper=<class 'str'>jquery
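
The failure above can be reproduced in isolation. The sketch below mirrors the pattern construction shown in the traceback; the helper name `same_pattern` is hypothetical and only exists for this example.

```python
import re


def same_pattern(a, other):
    """Return True if `other` matches the pattern built from `a`, where
    every alphanumeric character of `a` is replaced by a character class
    and every other character must match literally."""
    a_pattern = re.sub('[a-zA-Z0-9]', '[a-zA-Z0-9]', re.escape(a))
    # re.match only anchors at the start of the string.
    return re.match(a_pattern, other) is not None
```

With `a = 'jquery-1.8'` and `c = 'jquery'`, the pattern requires a literal `-` after six alphanumeric characters, which `'jquery'` does not have, so the check fails. Package names that differ in length or punctuation can therefore be flagged as incomparable even though they sort fine as plain strings.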
anlambert updated this revision to Diff 2231. Fri, Nov 23, 5:47 PM

Update: keep only one return statement in get_next_target_from_response

ardumont added inline comments. Mon, Nov 26, 10:39 AM
swh/lister/npm/lister.py
24

lol, ok that's what i kinda did for the content_get_range endpoint [1]

[1] https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/storage.py$278

97

> API_URL_INDEX_RE is only declared and used in the Github lister.

ok, thanks. No biggie, just wondering ;)

> declared and used in Github lister.

Huh, right.
I should have checked. I thought it was used more than once!

ardumont accepted this revision. Mon, Nov 26, 10:39 AM
This revision was automatically updated to reflect the committed changes.