⚙ D1610 swh.lister.cgit

ardumont requested changes to this revision.Jun 26 2019, 10:13 AM

This revision now requires changes to proceed.Jun 26 2019, 10:13 AM

self.get_url finds the url of the repo page
This method could work for https://git.kernel.org/ and http://git.savannah.gnu.org/cgit/ (maybe some more) but

Well, yeah but at least, we could have a heads start beginning with those. Thus
what i was speaking earlier about deploying asap. To explicit this, i strongly
believe in the basic stanza Deploy early, deploy often That's a good way
which never fails at making you discover things early to fix (independently of
what the real code does already).

Anyway.

This approach would not work for repos like https://cgit.kde.org/ and https://cgit.freedesktop.org/ and https://gitweb.torproject.org/ and (maybe more)

Ok, so for those, the idea of having patterns to fill in seems simple.

We could have the following pattern to initialize on an instance basic
(__init__):

With this, we could simplify the other instances.
Without failing to deal with the first one which are already ok.

Because -
for https://cgit.kde.org/
the clone link is in format https://anongit.kde.org/<name_of_repo> whereas the repo page link is https://cgit.kde.org/s/<name> (it is the case for almost all repos)

For https://cgit.freedesktop.org/
the clone link has no specific format

For https://gitweb.torproject.org/
The clone link is in the format https://git.torproject.org/<name> whereas repo page link is https://gitweb.torproject.org/ <name> (it is the case for almost all repos)

So that'd give (X means None or empty):

|--------------------------------+--------------+--------------|
| url                            | main-pattern | sub-pattern  |
|--------------------------------+--------------+--------------|
| https://git.kernel.org         | X            | X            |
| http://git.savannah.gnu.org    | cgit/        | X            |
| https://cgit.kde.org/          | name-of-repo | X            |
| https://cgit.freedesktop.org/  | s/           | name-of-repo |
| https://gitweb.torproject.org/ | name-of-repo | X            |
|--------------------------------+--------------+--------------|

Can you please give this a spin?

swh.lister.cgit: Remove repo page visit step

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/272/ for more details.

Harbormaster completed remote builds in B6500: Diff 5507.Jun 26 2019, 1:54 PM

swh.lister.cgit: Remove repo page visit step

swh/lister/cgit/lister.py
66–70	I still don't think indentation is correct here, can you please show me what would be the correct indentation here

Harbormaster completed remote builds in B6501: Diff 5508.Jun 26 2019, 2:03 PM

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/274/ for more details.

nahimilega added inline comments.Jun 26 2019, 2:03 PM

swh/lister/cgit/lister.py
22	I am not sure what you meant here, shall I put a comment in init() telling that base_url means api_base_url. Althought we no more need to check '/' in base_url after we switched to `urllib.parse` . But there is a need to check for '/' in origin_url_prefix.

nahimilega marked an inline comment as not done.Jun 26 2019, 2:04 PM

nahimilega added inline comments.Jun 26 2019, 2:07 PM

swh/lister/cgit/lister.py
178–200	These are just one line function and not much of logic is present in these functions, they are just simple HTML parsing. Do we need tests for these functions too?

Ok, so for those, the idea of having patterns to fill in seems simple.

We could have the following pattern to initialize on an instance basic
(init):

<url>/<main-pattern><sub-pattern>

With this, we could simplify the other instances.
Without failing to deal with the first one which are already ok.

Here i have broken the origin url into two parts
<origin_url_prefix>/<repo_name>

<origin_url_prefix> is same for all the repos in the instance

Reason to choose this over <url>/<main-pattern><sub-pattern>

Some instance like https://cgit.kde.org/ have repo in the format https://anongit.kde.org/<repo_name>
Here you can see there is no significant relation between base_url of page and origin_url.
Same case is for https://gitweb.torproject.org/

Moreover, this pattern will only require one more parameter (ie origin_url_prefix) and also keep the things simple

ardumont added inline comments.Jun 26 2019, 2:27 PM

swh/lister/cgit/lister.py
22	I meant (i think because that's an old comment) to name that url `url`. Please explicit the argument in the constructor (`__init__`) docstring. Please keep the instance as second parameter to have something consistent with other listers. That'd give: def __init__(self, url, instance, url_prefix... I did not read the reasoning behind the `url_prefix` below yet.
66–70	In doubt, look at how generally the source code is written first in this very repository (and in others if not enough). There is the answer in your current module. Just align with the beginning of the line (at current indentation level): repos_details.append({ 'name': repo_name, 'time': time, 'origin_url': origin_url, })
178–200	Yes, to avoid decreasing too much the coverage. If they are not much logic, they are simple to test as well.

Add testcases and changed variable base_url to url and origin_url_prefix to url_prefix
rebased on master

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/279/ for more details.

Harbormaster completed remote builds in B6510: Diff 5517.Jun 26 2019, 9:12 PM

nahimilega marked 5 inline comments as done.Jun 27 2019, 7:07 AM

ardumont added inline comments.Jun 28 2019, 10:19 AM

README.md
216	Objectively, the sample is confusing. If both `url` and `url_prefix` are the same, what's the point? I think you want to change this to either: remove `url_prefix` from the sample. -> As it's the same as url, no point of exposing it. choose another cgit instance which do need it. -> That has the benefit to demo how to deal with some special cgit instances expose both example, one with only url, another one with both url and url_prefix. -> Benefits of both prior points ;)
swh/lister/cgit/lister.py
28	Try to explicit a bit why you need that `url_prefix`. We know now but we will forget. When that happens that docstring will help first.
36	self.url_prefix = url if url_prefix is None else url_prefix
41	self.url_netloc = find_netloc(urlparse(self.PAGE))
47	I forgot to ask before, why do you need to call this like that? Why not `super().__init__...`? (i think i saw why but still, can you explicit?) Also, why isn't it in the first part of the constructor?
54	Drop the `all the` expression you use everywhere. Simplify this to something like: Find repositories metadata by parsing the html page (response's raw content). If there are links in the html page, retrieve those repositories metadata from those pages as well. Return the repositories as list of dictionaries.
55	Args: response (Response): http api request response Returns: repository origin urls (as dict) included in the response ymmv but that's the gist of it, i think.
62	Do we have roughly an idea of the number of pages we can encounter in different cgit instances?

ardumont requested changes to this revision.Jun 28 2019, 11:26 AM

ardumont added inline comments.

swh/lister/cgit/lister.py
62	Following the logic presented in comments below, here you want to exclude the first page so: repos.extend(self.get_repos(pages[1:]) Note: `get_all_pages` is renamed to `get_repos` (see below). pages[1:] as you want to exclude the first page
103	You are not retrieving pages here, you are fetching repositories from pages (all the pages you parsed in the first html page). So either: get_repos get_repos_from_pages
107	Request the `available repos`from the pages. This yields the available repositories found as beautiful object representation.
114	Yields (if you follow the snippet proposed below).
125	You can use a generator here: for page in pages: response = requests.get(page) if not response.ok: # deal with error as warning without impeding the listing to finish logger.warn('Failed to retrieve repositories from page %s', page) continue yield from get_repo_list(response.text) Also, remove the trick about removing the first page. Just pass the list of pages you effectively want to parse data from when you call the method. That simplifies the method and the docstring. Just add a comment when you call the method if you feel it's not explicit enough.
188	Find repositories (as beautifulsoup object) available within the server response.
191	Drop this, the first line proposed is enough (i think).
197	List all the repositories as beautifulsoup object within the response.
206	Instantiates a beautiful soup object from the response object.
swh/lister/cgit/tests/test_tasks.py
27	Same as before. If url_prefix is not really used (as it's the same as url), drop it here. It's best to test the other cases though. So add other tests (dubbed `test_lister_no_url_prefix`, `test_lister_url_and_url_prefix` or something...) with different instance initializations. Possible scenarios i see are: one with only url one with url and url_prefix (different) add other combinations i don't know about (if any ;)

This revision now requires changes to proceed.Jun 28 2019, 11:26 AM

Made recommended changes

README.md
216	`url` and `url_prefix` are not the same in this case, one ends with `git` and other with `cgit`, but I will add more examples where difference is significant.
swh/lister/cgit/lister.py
54	Sorry, now I realise `all the` is redundant in the docstring.
62	I think its 275, in http://hdiff.luite.com/cgit/

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/282/ for more details.

Harbormaster completed remote builds in B6550: Diff 5550.Jun 28 2019, 2:53 PM

nahimilega marked 3 inline comments as done.Jun 28 2019, 2:53 PM

nahimilega added inline comments.

swh/lister/cgit/lister.py
28	I have also mentioned it in commit message
41	I don't think this a good idea because urllib object of `self.PAGE` is also used in line 50 (and also in line 47 for which the comment originally was)

nahimilega marked 2 inline comments as done.Jun 28 2019, 2:53 PM

ardumont added inline comments.Jun 28 2019, 3:22 PM

README.md
216	oh, i misread that. Thanks for making it more apparent!
swh/lister/cgit/lister.py
154	comment here is not necessarily used, i added it in the diff comment to explicit it to you ;)

Also, note that @anlambert did some tests on the cgit instance he found [1] with the lister.
It failed to complete for that repo (other runs worked fine so nice job btw ;).

I think it's because, you load in memory all the repositories.
And then, when the repositories listing is done, this ends up doing one big transaction to write that listing into the db.
I think from the top of my head that you probably need to change a bit the simple base lister to flush more frequently.

[1] http://hdiff.luite.com/cgit/

ardumont added inline comments.Jun 28 2019, 3:29 PM

swh/lister/cgit/lister.py
62	Sorry, i was unclear, I meant the average number of pages. The instance you mentioned is so far the worst case ;)

Heads up, i was checking the code to make sure my hypothesis was correct.

I think it's because, you load in memory all the repositories.
And then, when the repositories listing is done, this ends up doing one big transaction to write that listing into the db.
I think from the top of my head that you probably need to change a bit the simple base lister to flush more frequently.

In the mean time, @anlambert confirmed it by modifying the simple lister would do the trick [1]

[1] https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/core/simple_lister.py$48

nahimilega added inline comments.Jun 28 2019, 3:35 PM

swh/lister/cgit/lister.py
62	Most of the cgit instances don't have a pagination. I think the average would be around 2-3 pages

I confirm @ardumont analysis regarding the issue with the listing of http://hdiff.luite.com/cgit/, the db transaction was too large.

Changing the n parameter here to the value of 1000 is sufficient to fix the issue.

@nahimilega , I will commit the fix. You will just have to rebase your diff before landing.

swh.lister.core: Increase flush frequency in simple lister

ardumont added inline comments.Jun 28 2019, 3:48 PM

README.md
225	Add this within the same previous block comment. No need for an extra one. # simple cgit instance cgit_lister(url='https://git.kernel.org/') # cgit instance whose listed repositories differ from the base url cgit_lister(url='https://cgit.kde.org/', url_prefix='https://anongit.kde.org/')
swh/lister/cgit/lister.py
150	Remove extra line. Start directly from the last docstring comment. We do always have the extra line in the docstring.

@nahimilega , I will commit the fix. You will just have to rebase your diff before landing.

I made a commit in this diff itself(i saw your comment now), shall I remove it

I just committed the fix so you can remove your commit from that diff.

ardumont added inline comments.Jun 28 2019, 3:52 PM

swh/lister/core/simple_lister.py
48 ↗	(On Diff #5555)	D1660 ;) @anlambert did as it's kinda out of the scope of this diff When he'll land it, you will have to rebase and deal with the conflict ;)

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/283/ for more details.

Harbormaster completed remote builds in B6556: Diff 5555.Jun 28 2019, 3:53 PM

rebase on latest master

nahimilega marked 2 inline comments as done.Jun 28 2019, 4:00 PM

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/285/ for more details.

Harbormaster completed remote builds in B6559: Diff 5558.Jun 28 2019, 4:03 PM

sorry folks, the volume is just too high for me to follow here. i'll assume you can ping me if you need help with testing or a final review or something. :) thanks for the hard work!

ardumont added inline comments.Jun 28 2019, 4:08 PM

swh/lister/cgit/lister.py

41

I don't understand your reply but not too worry.
You can keep the code as is.

75

One last modification, extract this in a method.
To avoid the list wrapping.

def _yield_repo_from_responses(self, response):
    """Yield repositories from request responses...
    
    Args:
        ...

    Yields:
        ...
    """
    html = response.text
    yield from get_repo_list(html)
    pages = self.get_pages(make_soup(html))
    if len(pages) > 1:
        yield from self.get_repos_from_pages(pages[1:])

def list_packages(self, response):
    """... <existing docstring> """
    for repo in self._list_repo_from_response(response):
       ...

sorry folks, the volume is just too high for me to follow here. i'll assume you can ping me if you need help with testing or a final review or something. :) thanks for the hard work!

No worries there.
By the way, we are almost done ;)

add method to avoid list wrapping

Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tox/286/ for more details.

Harbormaster completed remote builds in B6560: Diff 5559.Jun 28 2019, 4:41 PM

ardumont accepted this revision.Jun 28 2019, 5:20 PM

This revision is now accepted and ready to land.Jun 28 2019, 5:20 PM

Closed by commit rDLSb972a2a88d25: swh.lister.cgit (authored by nahimilega). · Explain WhyJun 28 2019, 5:22 PM

This revision was automatically updated to reflect the committed changes.

ardumont mentioned this in T1861: cgit lister: Adapt lister to deal with inconsistent `git clone uri` pattern.Jun 29 2019, 9:10 AM

swh.lister.cgit
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 5348

README.md

requirements.txt

swh/lister/cgit/init.py

swh/lister/cgit/lister.py

swh/lister/cgit/models.py

swh/lister/cgit/tasks.py

swh/lister/cgit/tests/init.py

swh/lister/cgit/tests/api_response.html

swh/lister/cgit/tests/conftest.py

swh/lister/cgit/tests/test_lister.py

swh/lister/cgit/tests/test_tasks.py

swh/lister/cli.py

swh/lister/core/tests/conftest.py

swh.lister.cgitClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 5348

README.md

requirements.txt

swh/lister/cgit/__init__.py

swh/lister/cgit/lister.py

swh/lister/cgit/models.py

swh/lister/cgit/tasks.py

swh/lister/cgit/tests/__init__.py

swh/lister/cgit/tests/api_response.html

swh/lister/cgit/tests/conftest.py

swh/lister/cgit/tests/test_lister.py

swh/lister/cgit/tests/test_tasks.py

swh/lister/cli.py

swh/lister/core/tests/conftest.py

swh.lister.cgit
ClosedPublic
Actions

Revision Contents
Changeset List

swh/lister/cgit/init.py

swh/lister/cgit/tests/init.py