Page MenuHomeSoftware Heritage

swh.lister.cgit
ClosedPublic

Authored by nahimilega on Jun 19 2019, 1:18 PM.

Diff Detail

Repository
rDLS Listers
Branch
cgit
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 6328
Build 8765: tox-on-jenkinsJenkins
Build 8764: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
This revision now requires changes to proceed.Jun 26 2019, 10:13 AM

self.get_url finds the url of the repo page
This method could work for https://git.kernel.org/ and http://git.savannah.gnu.org/cgit/ (maybe some more) but

Well, yeah but at least, we could have a heads start beginning with those. Thus
what i was speaking earlier about deploying asap. To explicit this, i strongly
believe in the basic stanza Deploy early, deploy often That's a good way
which never fails at making you discover things early to fix (independently of
what the real code does already).

Anyway.

This approach would not work for repos like https://cgit.kde.org/ and https://cgit.freedesktop.org/ and https://gitweb.torproject.org/ and (maybe more)

Ok, so for those, the idea of having patterns to fill in seems simple.

We could have the following pattern to initialize on an instance basic
(__init__):

<url>/<main-pattern><sub-pattern>

With this, we could simplify the other instances.
Without failing to deal with the first one which are already ok.

Because -
for https://cgit.kde.org/
the clone link is in format https://anongit.kde.org/<name_of_repo> whereas the repo page link is https://cgit.kde.org/s/<name> (it is the case for almost all repos)

For https://cgit.freedesktop.org/
the clone link has no specific format

For https://gitweb.torproject.org/
The clone link is in the format https://git.torproject.org/<name> whereas repo page link is https://gitweb.torproject.org/ <name> (it is the case for almost all repos)

So that'd give (X means None or empty):

|--------------------------------+--------------+--------------|
| url                            | main-pattern | sub-pattern  |
|--------------------------------+--------------+--------------|
| https://git.kernel.org         | X            | X            |
| http://git.savannah.gnu.org    | cgit/        | X            |
| https://cgit.kde.org/          | name-of-repo | X            |
| https://cgit.freedesktop.org/  | s/           | name-of-repo |
| https://gitweb.torproject.org/ | name-of-repo | X            |
|--------------------------------+--------------+--------------|

Can you please give this a spin?

nahimilega marked 11 inline comments as done.
  • swh.lister.cgit: Remove repo page visit step
nahimilega marked 3 inline comments as done.
  • swh.lister.cgit: Remove repo page visit step
swh/lister/cgit/lister.py
66–70

I still don't think indentation is correct here, can you please show me what would be the correct indentation here

swh/lister/cgit/lister.py
22

I am not sure what you meant here, shall I put a comment in init() telling that base_url means api_base_url.
Althought we no more need to check '/' in base_url after we switched to urllib.parse . But there is a need to check for '/' in origin_url_prefix.

nahimilega marked an inline comment as not done.Jun 26 2019, 2:04 PM
swh/lister/cgit/lister.py
178–200

These are just one line function and not much of logic is present in these functions, they are just simple HTML parsing. Do we need tests for these functions too?

Ok, so for those, the idea of having patterns to fill in seems simple.

We could have the following pattern to initialize on an instance basic
(init):

<url>/<main-pattern><sub-pattern>

With this, we could simplify the other instances.
Without failing to deal with the first one which are already ok.

Here i have broken the origin url into two parts
<origin_url_prefix>/<repo_name>

<origin_url_prefix> is same for all the repos in the instance

Reason to choose this over <url>/<main-pattern><sub-pattern>

Some instance like https://cgit.kde.org/ have repo in the format https://anongit.kde.org/<repo_name>
Here you can see there is no significant relation between base_url of page and origin_url.
Same case is for https://gitweb.torproject.org/

Moreover, this pattern will only require one more parameter (ie origin_url_prefix) and also keep the things simple

swh/lister/cgit/lister.py
22

I meant (i think because that's an old comment) to name that url url.

Please explicit the argument in the constructor (__init__) docstring.

Please keep the instance as second parameter to have something consistent with other listers.

That'd give:

def __init__(self, url, instance, url_prefix...

I did not read the reasoning behind the url_prefix below yet.

66–70

In doubt, look at how generally the source code is written first in this very repository (and in others if not enough).
There is the answer in your current module.

Just align with the beginning of the line (at current indentation level):

repos_details.append({
    'name': repo_name,
     'time': time,
     'origin_url': origin_url, 
})
178–200

Yes, to avoid decreasing too much the coverage.
If they are not much logic, they are simple to test as well.

nahimilega marked an inline comment as done.
  • Add testcases and changed variable base_url to url and origin_url_prefix to url_prefix
  • rebased on master
README.md
216

Objectively, the sample is confusing.

If both url and url_prefix are the same, what's the point?
I think you want to change this to either:

  • remove url_prefix from the sample. -> As it's the same as url, no point of exposing it.
  • choose another cgit instance which do need it. -> That has the benefit to demo how to deal with some special cgit instances
  • expose both example, one with only url, another one with both url and url_prefix. -> Benefits of both prior points ;)
swh/lister/cgit/lister.py
28

Try to explicit a bit why you need that url_prefix.
We know now but we will forget.
When that happens that docstring will help first.

36

self.url_prefix = url if url_prefix is None else url_prefix

41

self.url_netloc = find_netloc(urlparse(self.PAGE))

47

I forgot to ask before, why do you need to call this like that?

Why not super().__init__...? (i think i saw why but still, can you explicit?)

Also, why isn't it in the first part of the constructor?

54

Drop the all the expression you use everywhere.
Simplify this to something like:

Find repositories metadata by parsing the html page (response's raw content).
If there are links in the html page, retrieve those repositories metadata from those pages as well.
Return the repositories as list of dictionaries.
55
Args:
    response (Response): http api request response

Returns:
    repository origin urls (as dict) included in the response

ymmv but that's the gist of it, i think.

62

Do we have roughly an idea of the number of pages we can encounter in different cgit instances?

ardumont added inline comments.
swh/lister/cgit/lister.py
62

Following the logic presented in comments below, here you want to exclude the first page so:

repos.extend(self.get_repos(pages[1:])

Note:

  • get_all_pages is renamed to get_repos (see below).
  • pages[1:] as you want to exclude the first page
103

You are not retrieving pages here, you are fetching repositories from pages (all the pages you parsed in the first html page).

So either:

  • get_repos
  • get_repos_from_pages
107

Request the available reposfrom the pages. This yields the available repositories found as beautiful object representation.

114

Yields (if you follow the snippet proposed below).

125

You can use a generator here:

for page in pages:
    response = requests.get(page)
    if not response.ok:  # deal with error as warning without impeding the listing to finish
        logger.warn('Failed to retrieve repositories from page %s', page)
        continue      
    yield from get_repo_list(response.text)

Also, remove the trick about removing the first page.
Just pass the list of pages you effectively want to parse data from when you call the method.
That simplifies the method and the docstring.
Just add a comment when you call the method if you feel it's not explicit enough.

188

Find repositories (as beautifulsoup object) available within the server response.

191

Drop this, the first line proposed is enough (i think).

197

List all the repositories as beautifulsoup object within the response.

206

Instantiates a beautiful soup object from the response object.

swh/lister/cgit/tests/test_tasks.py
27

Same as before.
If url_prefix is not really used (as it's the same as url), drop it here.

It's best to test the other cases though.
So add other tests (dubbed test_lister_no_url_prefix, test_lister_url_and_url_prefix or something...) with different instance initializations.

Possible scenarios i see are:

  • one with only url
  • one with url and url_prefix (different)
  • add other combinations i don't know about (if any ;)
This revision now requires changes to proceed.Jun 28 2019, 11:26 AM
nahimilega marked 12 inline comments as done.
  • Made recommended changes
README.md
216

url and url_prefix are not the same in this case, one ends with git and other with cgit, but I will add more examples where difference is significant.

swh/lister/cgit/lister.py
54

Sorry, now I realise all the is redundant in the docstring.

62

I think its 275, in http://hdiff.luite.com/cgit/

nahimilega added inline comments.
swh/lister/cgit/lister.py
28

I have also mentioned it in commit message

41

I don't think this a good idea because urllib object of self.PAGE is also used in line 50 (and also in line 47 for which the comment originally was)

README.md
216

oh, i misread that.

Thanks for making it more apparent!

swh/lister/cgit/lister.py
154

comment here is not necessarily used, i added it in the diff comment to explicit it to you ;)

Also, note that @anlambert did some tests on the cgit instance he found [1] with the lister.
It failed to complete for that repo (other runs worked fine so nice job btw ;).

I think it's because, you load in memory all the repositories.
And then, when the repositories listing is done, this ends up doing one big transaction to write that listing into the db.
I think from the top of my head that you probably need to change a bit the simple base lister to flush more frequently.

[1] http://hdiff.luite.com/cgit/

swh/lister/cgit/lister.py
62

Sorry, i was unclear, I meant the average number of pages.
The instance you mentioned is so far the worst case ;)

Heads up, i was checking the code to make sure my hypothesis was correct.

I think it's because, you load in memory all the repositories.
And then, when the repositories listing is done, this ends up doing one big transaction to write that listing into the db.
I think from the top of my head that you probably need to change a bit the simple base lister to flush more frequently.

In the mean time, @anlambert confirmed it by modifying the simple lister would do the trick [1]

[1] https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/core/simple_lister.py$48

swh/lister/cgit/lister.py
62

Most of the cgit instances don't have a pagination.
I think the average would be around 2-3 pages

I confirm @ardumont analysis regarding the issue with the listing of http://hdiff.luite.com/cgit/, the db transaction was too large.

Changing the n parameter here to the value of 1000 is sufficient to fix the issue.

@nahimilega , I will commit the fix. You will just have to rebase your diff before landing.

  • swh.lister.core: Increase flush frequency in simple lister
README.md
225

Add this within the same previous block comment.
No need for an extra one.

# simple cgit instance
cgit_lister(url='https://git.kernel.org/')
# cgit instance whose listed repositories differ from the base url
cgit_lister(url='https://cgit.kde.org/', url_prefix='https://anongit.kde.org/')
swh/lister/cgit/lister.py
150

Remove extra line. Start directly from the last docstring comment.
We do always have the extra line in the docstring.

@nahimilega , I will commit the fix. You will just have to rebase your diff before landing.

I made a commit in this diff itself(i saw your comment now), shall I remove it

I just committed the fix so you can remove your commit from that diff.

swh/lister/core/simple_lister.py
48 ↗(On Diff #5555)

D1660 ;)
@anlambert did as it's kinda out of the scope of this diff
When he'll land it, you will have to rebase and deal with the conflict ;)

sorry folks, the volume is just too high for me to follow here. i'll assume you can ping me if you need help with testing or a final review or something. :) thanks for the hard work!

swh/lister/cgit/lister.py
41

I don't understand your reply but not too worry.
You can keep the code as is.

75

One last modification, extract this in a method.
To avoid the list wrapping.

def _yield_repo_from_responses(self, response):
    """Yield repositories from request responses...
    
    Args:
        ...

    Yields:
        ...
    """
    html = response.text
    yield from get_repo_list(html)
    pages = self.get_pages(make_soup(html))
    if len(pages) > 1:
        yield from self.get_repos_from_pages(pages[1:])

def list_packages(self, response):
    """... <existing docstring> """
    for repo in self._list_repo_from_response(response):
       ...

sorry folks, the volume is just too high for me to follow here. i'll assume you can ping me if you need help with testing or a final review or something. :) thanks for the hard work!

No worries there.
By the way, we are almost done ;)

nahimilega marked 3 inline comments as done.
  • add method to avoid list wrapping
This revision is now accepted and ready to land.Jun 28 2019, 5:20 PM
This revision was automatically updated to reflect the committed changes.