Changeset View
Standalone View
swh/lister/cgit/lister.py
- This file was added.
# Copyright (C) 2019 the Software Heritage developers | |||||
# License: GNU General Public License version 3, or any later version | |||||
# See top-level LICENSE file for more information | |||||
import random | |||||
from bs4 import BeautifulSoup | |||||
from collections import defaultdict | |||||
import requests | |||||
import urllib.parse | |||||
from .models import CGitModel | |||||
from swh.lister.core.simple_lister import SimpleLister | |||||
from swh.lister.core.lister_transports import ListerOnePageApiTransport | |||||
class CGitLister(ListerOnePageApiTransport, SimpleLister): | |||||
ardumont: Don't know if you saw, but for pagination purposes, IIRC, the `PageByPageLister`.
Gitlab lister… | |||||
Done Inline ActionsI checked PageByPageLister (used in gitlab) and SWHIndexingHttpLister (used in github). They can be used here but :
2)Current technique which I used fits quite well in the code, if we change to `PageByPageLister` then it would require a lot of revamping of code
FWIW, I don't think using PageByPageLister would serve any help. I would like to know you opinion on this. Btw why do we have both PageByPageLister and SWHIndexingHttpLister , they serve quite similar purpose. nahimilega: I checked `PageByPageLister` (used in gitlab) and `SWHIndexingHttpLister` (used in github). | |||||
Done Inline Actions
I'll take a closer look at your diff.
(*urk* at the name, we should drop the SWH prefix as it's redundant within swh modules ;) ardumont: > FWIW, I don't think using PageByPageLister would serve any help. I would like to know you… | |||||
MODEL = CGitModel | |||||
LISTER_NAME = 'cgit' | |||||
Done Inline ActionsWhy not None? ardumont: Why not None? | |||||
def __init__(self, base_url, instance=None, override_config=None): | |||||
if not base_url.endswith('/'): | |||||
Done Inline ActionsWe should stop using different names around. That way, we can push the following check systematically done here (about the trailing /). ardumont: We should stop using different names around.
Latest other lister named that api_baseurl.
Maybe… | |||||
Done Inline ActionsNow I took help of urllib.parse, so there is no need of that check anymore nahimilega: Now I took help of `urllib.parse`, so there is no need of that check anymore | |||||
Done Inline ActionsI am not sure what you meant here, shall I put a comment in init() telling that base_url means api_base_url. nahimilega: I am not sure what you meant here, shall I put a comment in __init__() telling that base_url… | |||||
Done Inline ActionsI meant (i think because that's an old comment) to name that url url. Please explicit the argument in the constructor (__init__) docstring. Please keep the instance as second parameter to have something consistent with other listers. That'd give: def __init__(self, url, instance, url_prefix... I did not read the reasoning behind the url_prefix below yet. ardumont: I meant (i think because that's an old comment) to name that url `url`.
Please explicit the… | |||||
base_url = base_url+'/' | |||||
zackUnsubmitted Done Inline ActionsWhy is this needed? If it's not to work around some common gotcha with cgit, we should not disturb configuration data. If it's for the sake of uniformity, that uniformity should be enforced on the cgit lister configuration, not here. zack: Why is this needed?
If it's not to work around some common gotcha with cgit, we should not… | |||||
nahimilegaAuthorUnsubmitted Done Inline ActionsThis is for if someone enters the base url as http://abc.com . nahimilega: This is for if someone enters the base url as http://abc.com .
It will convert it to http… | |||||
nahimilegaAuthorUnsubmitted Done Inline ActionsThis is also done in phabricator lister, anlambert recommended me to do that in pharicator lister. So I used it here as well. nahimilega: This is also done in phabricator lister, anlambert recommended me to do that in pharicator… | |||||
self.base_url = base_url | |||||
# This part removes any suffix from the base url and stores it in | |||||
# next_url. For example for base_url = https://git.kernel.org/pub/scm/ | |||||
# it will convert it into https://git.kernel.org and then attach | |||||
Done Inline ActionsTry to explicit a bit why you need that url_prefix. ardumont: Try to explicit a bit why you need that `url_prefix`.
We know now but we will forget.
When… | |||||
Done Inline ActionsI have also mentioned it in commit message nahimilega: I have also mentioned it in commit message | |||||
# the suffix | |||||
(part1, part2, next_url) = self.base_url.split('/', 2) | |||||
self.next_url = next_url | |||||
Done Inline ActionsI did not get that part. Can't urllib.parse.urlparse help a bit? ardumont: I did not get that part.
Are you sure it works for multiple instances?
Can't `urllib.parse. | |||||
Done Inline ActionsAlso, i saw the previous exchange about this, still not clear to me what this does. That will have the advantage to somehow document it? (I'm still interesting by the explanations though ;) ardumont: Also, i saw the previous exchange about this, still not clear to me what this does.
Can you… | |||||
Done Inline ActionsNow I have extracted that part into a function find_netloc() and also written test cases for it. nahimilega: Now I have extracted that part into a function `find_netloc()` and also written test cases for… | |||||
if not instance: | |||||
instance = urllib.parse.urlparse(base_url).hostname | |||||
self.instance = instance | |||||
ListerOnePageApiTransport .__init__(self) | |||||
Done Inline Actionsself.url_prefix = url if url_prefix is None else url_prefix ardumont: self.url_prefix = url if url_prefix is None else url_prefix | |||||
SimpleLister.__init__(self, override_config=override_config) | |||||
def list_packages(self, response): | |||||
Done Inline Actionsdictionaries. ardumont: dictionaries. | |||||
"""List the actual cgit instance origins from the response. | |||||
Done Inline ActionsA wee bit more detail of the algo used to parse the output would be welcome. ardumont: A wee bit more detail of the algo used to parse the output would be welcome. | |||||
Not Done Inline Actionsself.url_netloc = find_netloc(urlparse(self.PAGE)) ardumont: self.url_netloc = find_netloc(urlparse(self.PAGE)) | |||||
Not Done Inline ActionsI don't think this a good idea because urllib object of self.PAGE is also used in line 50 (and also in line 47 for which the comment originally was) nahimilega: I don't think this a good idea because urllib object of `self.PAGE` is also used in line 50… | |||||
Not Done Inline ActionsI don't understand your reply but not too worry. ardumont: I don't understand your reply but not too worry.
You can keep the code as is.
| |||||
""" | |||||
repos_details = [] | |||||
soup = BeautifulSoup( | |||||
response.text, | |||||
Done Inline ActionsAs this snippet BeautifulSoup(...) used quite a lot, i'd extract this in a function and use that function call instead (including in tests). So can you please refactor to something like: repo_soup = make_repo_soup(response.text) ardumont: As this snippet `BeautifulSoup(...)` used quite a lot, i'd extract this in a function and use… | |||||
features="html.parser").find('div', {"class": "content"}) | |||||
zackUnsubmitted Done Inline Actionsthis way of indenting looks odd, wouldn't: BeautifulSoup(response.text, features="html.parser") \ .find('div', {"class": "content"}) be more customary? (Note: I haven't checked what PEP8 has to say about either version.) zack: this way of indenting looks odd, wouldn't:
```
BeautifulSoup(response.text, features="html. | |||||
repos = soup.find_all("tr", {"class": ""}) | |||||
Not Done Inline ActionsI forgot to ask before, why do you need to call this like that? Why not super().__init__...? (i think i saw why but still, can you explicit?) Also, why isn't it in the first part of the constructor? ardumont: I forgot to ask before, why do you need to call this like that?
Why not `super().__init__...`? | |||||
for repo in repos: | |||||
repo_name = repo.a.text | |||||
repo_url = self.get_url(repo_name, repo) | |||||
origin_url = find_origin_url(repo_url) | |||||
try: | |||||
time = repo.span['title'] | |||||
Done Inline ActionsDrop the all the expression you use everywhere. Find repositories metadata by parsing the html page (response's raw content). If there are links in the html page, retrieve those repositories metadata from those pages as well. Return the repositories as list of dictionaries. ardumont: Drop the `all the` expression you use everywhere.
Simplify this to something like:
```
Find… | |||||
Done Inline ActionsSorry, now I realise all the is redundant in the docstring. nahimilega: Sorry, now I realise `all the` is redundant in the docstring. | |||||
except Exception: | |||||
Done Inline ActionsArgs: response (Response): http api request response Returns: repository origin urls (as dict) included in the response ymmv but that's the gist of it, i think. ardumont: ```
Args:
response (Response): http api request response
Returns:
repository origin… | |||||
time = None | |||||
Done Inline Actionsitems in the dict should be only one extra indent level vlorentz: items in the dict should be only one extra indent level | |||||
Done Inline ActionsOk, talking a bit between us, we are wondering whether this is needed. We could have a more lightweight approach first. That'd have the benefit to:
ardumont: Ok, talking a bit between us, we are wondering whether this is needed.
Going that way (priority… | |||||
Done Inline ActionsBefore choosing this method of visiting every repo page, I did some research, here is the result: For https://cgit.freedesktop.org/ It has some clone link as https://gitlab.freedesktop.org/xdg/shared-mime-info and some as https://anongit.freedesktop.org/git/pulseaudio/pavucontrol.git.bup. There is no comman pattern which could be exploited to reconstruct the clone link. Neither any info related to clone link is available on the base_url page. Hence I decided to take the approach to visit every page nahimilega: Before choosing this method of visiting every repo page, I did some research, here is the… | |||||
Done Inline ActionsI just now thought of a clever approach which is lightweight and will do the work in a matter of seconds( unlike ~1 hour). The repositories in cgit instance are divided into groups with a group heading (in pale white colour as in https://cgit.freedesktop.org/) For all the members of the group the clone link is in format <some_url_comman_to_all members_of_a_group>/<name>.git To find the <some_url_comman_to_all members_of_a_group> we need to visit the page of only one repository for a group. This would drastically decrease the no. of requests and make the code faster. (although would increase the quantity of code) I will test this method and inform you about the result. Does this method sound good? nahimilega: I just now thought of a clever approach which is lightweight and will do the work in a matter… | |||||
Done Inline Actions
which then will be out of scope for that lister.
This repo does not list anything so that's it.
That's the loader's concern, not the lister's. Some listers currently list private repositories for example. It's not an issue.
Well, as a first approximation, i'd say let's go towards removing the visit step anyway (and compute basic git clone url, self.get_url sounds good enough IIRC). It's good to start incrementally and not overthink everything. Mostly, aside my remarks on the diff (that i'd like you to take into accounts nonetheless), the lister is almost ready.
Which is a reasonable choice.
I'd say yes. ardumont: > For https://cgit.freedesktop.org/
> It has some clone link as https://gitlab.freedesktop. | |||||
if origin_url is not None: | |||||
repos_details.append({ | |||||
'name': repo_name, | |||||
'time': time, | |||||
'origin_url': origin_url | |||||
zackUnsubmitted Done Inline Actionsadd a trailing comma after this line, it will make diff that in the future might add an additional line nicer/shorter zack: add a trailing comma after this line, it will make diff that in the future might add an… | |||||
Done Inline ActionsDo we have roughly an idea of the number of pages we can encounter in different cgit instances? ardumont: Do we have roughly an idea of the number of pages we can encounter in different cgit instances? | |||||
Not Done Inline ActionsI think its 275, in http://hdiff.luite.com/cgit/ nahimilega: I think its 275, in http://hdiff.luite.com/cgit/ | |||||
Not Done Inline ActionsSorry, i was unclear, I meant the average number of pages. ardumont: Sorry, i was unclear, I meant the average number of pages.
The instance you mentioned is so far… | |||||
Done Inline ActionsMost of the cgit instances don't have a pagination. nahimilega: Most of the cgit instances don't have a pagination.
I think the average would be around 2-3… | |||||
Done Inline ActionsFollowing the logic presented in comments below, here you want to exclude the first page so: repos.extend(self.get_repos(pages[1:]) Note:
ardumont: Following the logic presented in comments below, here you want to exclude the first page so… | |||||
}) | |||||
Done Inline ActionsIndentation is still off to me. ardumont: Indentation is still off to me. | |||||
random.shuffle(repos_details) | |||||
return repos_details | |||||
def get_url(self, repo): | |||||
Done Inline Actionsindentation is off. ardumont: indentation is off.
| |||||
""" | |||||
Finds the url of a repo page by parsing over the html of the row of | |||||
Done Inline ActionsI still don't think indentation is correct here, can you please show me what would be the correct indentation here nahimilega: I still don't think indentation is correct here, can you please show me what would be the… | |||||
Done Inline ActionsIn doubt, look at how generally the source code is written first in this very repository (and in others if not enough). Just align with the beginning of the line (at current indentation level): repos_details.append({ 'name': repo_name, 'time': time, 'origin_url': origin_url, }) ardumont: In doubt, look at how generally the source code is written first in this very repository (and… | |||||
that repo present in the base url. | |||||
Done Inline ActionsWhy the need to shuffle? ardumont: Why the need to shuffle? | |||||
Done Inline ActionsI first saw this shuffle in pypi lister, so to maintain uniformity I also used it in gnu and cran lister and here too. Although I don't really know its importance. nahimilega: I first saw this shuffle in pypi lister, so to maintain uniformity I also used it in gnu and… | |||||
Done Inline ActionsOk. I don't remember the reasons though, too bad. ardumont: Ok.
I don't remember the reasons though, too bad. | |||||
The url of the repo present is present under href attribute of the name | |||||
of url for a particular repo. | |||||
Done Inline ActionsOne last modification, extract this in a method. def _yield_repo_from_responses(self, response): """Yield repositories from request responses... Args: ... Yields: ... """ html = response.text yield from get_repo_list(html) pages = self.get_pages(make_soup(html)) if len(pages) > 1: yield from self.get_repos_from_pages(pages[1:]) def list_packages(self, response): """... <existing docstring> """ for repo in self._list_repo_from_response(response): ... ardumont: One last modification, extract this in a method.
To avoid the `list` wrapping.
```
def… | |||||
Args: | |||||
repo - a beautifulsoup object of the html code of the repo row | |||||
present in base url | |||||
Returns: | |||||
url of a repo | |||||
zackUnsubmitted Done Inline Actionsthis way of typesetting args/return is not consistent with our python coding guidelines, see, e.g., https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html zack: this way of typesetting args/return is not consistent with our python coding guidelines, see, e. | |||||
""" | |||||
Done Inline ActionsHere I don't think this is the best way to do it. It looks too messy but I can't use s.find_all('/') because find_all is method present in bs4. Can you please recommend some smart approach to do it nahimilega: Here I don't think this is the best way to do it. It looks too messy but I can't use s.find_all… | |||||
Done Inline Actions(part1, part2, next_url) = self.base_url.split('/', 2) vlorentz: `(part1, part2, next_url) = self.base_url.split('/', 2)` | |||||
Done Inline ActionsThanks, @vlorentz, This is really neat of doing it nahimilega: Thanks, @vlorentz, This is really neat of doing it | |||||
Done Inline Actions*neat way of doing it nahimilega: *neat way of doing it | |||||
suffix = repo.a['href'] | |||||
return self.next_url + suffix | |||||
def get_model_from_repo(self, repo): | |||||
"""Transform from repository representation to model | |||||
""" | |||||
Done Inline Actionsif not pages: ardumont: `if not pages:` | |||||
return { | |||||
'uid': self.base_url + repo['name'], | |||||
'name': repo['name'], | |||||
'full_name': repo['name'], | |||||
Done Inline ActionsWhat's the format of a['href']? Is that a href without the absolute url? ardumont: What's the format of a['href']?
Is that a href without the absolute url?
Is that true for all… | |||||
Done Inline Actionsa['href'] is of string format. It is without absolute url, it is in relative to the url_netloc I computed in my code. For example for http://git.savannah.gnu.org/cgit/, all the URL are respect to http://git.savannah.gnu.org
It is true for all the instance I saw so far(all mentioned in T1835 nahimilega: a['href'] is of string format.
It is without absolute url, it is in relative to the… | |||||
Done Inline Actionsgreat, thanks. ardumont: great, thanks. | |||||
'html_url': repo['origin_url'], | |||||
Done Inline ActionsAlso why not using a list comprehension here? return [self.get_url(page) for page in pages] ardumont: Also why not using a list comprehension here?
```
return [self.get_url(page) for page in… | |||||
'origin_url': repo['origin_url'], | |||||
'origin_type': 'git', | |||||
'time_updated': repo['time'], | |||||
Done Inline Actionspages ([str]): .. ardumont: pages ([str]): .. | |||||
} | |||||
def transport_response_simplified(self, response): | |||||
Done Inline Actionspages (except (missing space) ardumont: `pages (except` (missing space) | |||||
"""Transform response to list for model manipulation | |||||
Done Inline ActionsReturns: List of beautifulsoup of all the repositories (url) row present in all the pages (except the first one). Drop the of the html code of, we mention it's a beautifulsoup object, which as far as i got it is an xml/html parser. ardumont: ```
Returns:
List of beautifulsoup of all the repositories (url) row present in all the… | |||||
""" | |||||
Done Inline ActionsYou are not retrieving pages here, you are fetching repositories from pages (all the pages you parsed in the first html page). So either:
ardumont: You are not retrieving pages here, you are fetching repositories from pages (all the pages you… | |||||
return [self.get_model_from_repo(repo) for repo in response] | |||||
Done Inline Actionsinstance ardumont: instance | |||||
def find_origin_url(repo_url): | |||||
Done Inline ActionsRequest the available reposfrom the pages. This yields the available repositories found as beautiful object representation. ardumont: Request the `available repos`from the pages. This yields the available repositories found as… | |||||
""" | |||||
Finds the origin url for a particular repo by parsing over the page of | |||||
that repo | |||||
Args: | |||||
repo_url - url of the repo | |||||
Done Inline ActionsYields (if you follow the snippet proposed below). ardumont: Yields (if you follow the snippet proposed below). | |||||
Returns: | |||||
origin url for the repo | |||||
zackUnsubmitted Done Inline Actionsas before: this doesn't conform with Napoleon style zack: as before: this doesn't conform with Napoleon style | |||||
Done Inline Actionsthis looks familiar... ardumont: this looks familiar...
Please refactor this within a method or a function. | |||||
Done Inline Actionsthis looks familiar... ardumont: this looks familiar...
Please refactor this within a method or a function. | |||||
example: | |||||
Done Inline ActionsHeads up, code review stopped here! ardumont: Heads up, code review stopped here! | |||||
>>> find_origin_url( | |||||
'http://git.savannah.gnu.org/cgit/fbvbconv-py.git/') | |||||
'https://git.savannah.gnu.org/git/fbvbconv-py.git' | |||||
Done Inline Actionsdoctests should be written like this: >>> find_origin_url('http://git.savannah.gnu.org/cgit/fbvbconv-py.git/') 'https://git.savannah.gnu.org/git/fbvbconv-py.git' with no indent vlorentz: doctests should be written like this:
```
>>> find_origin_url('http://git.savannah.gnu. | |||||
""" | |||||
Done Inline ActionsYou can use a generator here: for page in pages: response = requests.get(page) if not response.ok: # deal with error as warning without impeding the listing to finish logger.warn('Failed to retrieve repositories from page %s', page) continue yield from get_repo_list(response.text) Also, remove the trick about removing the first page. ardumont: You can use a generator here:
```
for page in pages:
response = requests.get(page)
if… | |||||
response = requests.get(repo_url) | |||||
soup = BeautifulSoup(response.text, features="html.parser") | |||||
origin_urls = find_all_origin_url(soup) | |||||
return priority_origin_url(origin_urls) | |||||
def find_all_origin_url(soup): | |||||
""" | |||||
Finds all the origin url for a particular repo by parsing over the html of | |||||
repo page | |||||
Args: | |||||
soup - a beautifulsoup object of the html code of the repo | |||||
Returns: | |||||
dictionary of all possible origin urls with their protocol as key | |||||
zackUnsubmitted Done Inline Actionssame: Napoleon mismatch zack: same: Napoleon mismatch | |||||
example | |||||
if soup is beautifulsoup object of the html code at | |||||
http://git.savannah.gnu.org/cgit/fbvbconv-py.git/ | |||||
>>> find_all_origin_url(soup) | |||||
{ 'https': 'https://git.savannah.gnu.org/git/fbvbconv-py.git', | |||||
'ssh': 'ssh://git.savannah.gnu.org/srv/git/fbvbconv-py.git', | |||||
Done Inline ActionsRemove extra line. Start directly from the last docstring comment. ardumont: Remove extra line. Start directly from the last docstring comment.
We do always have the extra… | |||||
'git': 'git://git.savannah.gnu.org/fbvbconv-py.git'} | |||||
Done Inline Actionssame vlorentz: same | |||||
zackUnsubmitted Done Inline ActionsI think this is indented too much (at least the first line, not sure about the rest w.r.t. what doctest expects) zack: I think this is indented too much (at least the first line, not sure about the rest w.r.t. what… | |||||
""" | |||||
origin_urls = defaultdict(dict) | |||||
found_clone_word = False | |||||
Done Inline Actionscomment here is not necessarily used, i added it in the diff comment to explicit it to you ;) ardumont: comment here is not necessarily used, i added it in the diff comment to explicit it to you ;) | |||||
for i in soup.find_all('tr'): | |||||
Done Inline ActionsHow come response is already in the right format for this lister? ardumont: How come `response` is already in the right format for this lister? | |||||
if found_clone_word: | |||||
link = i.text | |||||
protocol = link[:link.find(':')] | |||||
origin_urls[protocol] = link | |||||
if i.text == 'Clone': | |||||
found_clone_word = True | |||||
return origin_urls | |||||
def priority_origin_url(origin_url): | |||||
""" | |||||
Finds the highest priority link from all the origin urls for a particular | |||||
repo | |||||
Priority order is https>http>git>ssh | |||||
Args: | |||||
origin_urls - a dictionary of origin links with their protocol as key | |||||
Returns : | |||||
url the highest priority | |||||
zackUnsubmitted Done Inline ActionsNapoleon zack: Napoleon | |||||
""" | |||||
for protocol in ['https', 'http', 'git', 'ssh']: | |||||
if protocol in origin_url: | |||||
return origin_url[protocol] | |||||
Done Inline Actionsorigin_urls (Dict): All possible origin urls for a repository (key 'protocol', value the associated url) ardumont: origin_urls (Dict): All possible origin urls for a repository (key 'protocol', value the… | |||||
Done Inline ActionsUrl (str) with the highest... ardumont: Url (str) with the highest... | |||||
Done Inline Actions"""Find all origin urls per repository by... ardumont: """Find all origin urls per repository by... | |||||
Done Inline Actionsa beautiful soup object repo representation ardumont: `a beautiful soup object repo representation` | |||||
Done Inline ActionsAll possible origin urls for a repository (dict with key 'protocol', value the associated url). ardumont: All possible origin urls for a repository (dict with key 'protocol', value the associated url). | |||||
Done Inline ActionsWhat's repo_url's type? please, mention it in the docstring. Args: repo_url (<TYPE>): ... Please, do so in the current diff everywhere it's missing. In case the type is from a tiers-party module, it's fine to use that as the type (for example response (Response): ...). ardumont: What's repo_url's type?
please, mention it in the docstring.
```
Args:
repo_url (<TYPE>)… | |||||
Done Inline ActionsI've already mentioned i'd prefer this named repo_soup or plain repo (as long as we mentioned in the docstring what the type is expected). ardumont: I've already mentioned i'd prefer this named repo_soup or plain repo (as long as we mentioned… | |||||
Done Inline ActionsThese are just one line function and not much of logic is present in these functions, they are just simple HTML parsing. Do we need tests for these functions too? nahimilega: These are just one line function and not much of logic is present in these functions, they are… | |||||
Done Inline ActionsYes, to avoid decreasing too much the coverage. ardumont: Yes, to avoid decreasing too much the coverage.
If they are not much logic, they are simple to… | |||||
Done Inline ActionsFind repositories (as beautifulsoup object) available within the server response. ardumont: Find repositories (as beautifulsoup object) available within the server response.
| |||||
Done Inline ActionsDrop this, the first line proposed is enough (i think). ardumont: Drop this, the first line proposed is enough (i think). | |||||
Done Inline ActionsList all the repositories as beautifulsoup object within the response. ardumont: List all the repositories as beautifulsoup object within the response. | |||||
Done Inline ActionsInstantiates a beautiful soup object from the response object. ardumont: Instantiates a beautiful soup object from the response object. |
Don't know if you saw, but for pagination purposes, IIRC, the PageByPageLister.
Gitlab lister uses it.
Maybe this one could?
I did not check the rest yet (just saw the diff update with the pagination support).