Differential D1610 Diff 5550 swh/lister/cgit/lister.py

swh/lister/cgit/lister.py

This file was added.

				# Copyright (C) 2019 the Software Heritage developers
				# License: GNU General Public License version 3, or any later version
				# See top-level LICENSE file for more information

				import random
				import logging
				from bs4 import BeautifulSoup
				import requests
				from urllib.parse import urlparse

				from .models import CGitModel

				from swh.lister.core.simple_lister import SimpleLister
				from swh.lister.core.lister_transports import ListerOnePageApiTransport


				class CGitLister(ListerOnePageApiTransport, SimpleLister):
				ardumont (Done): I don't know if you saw it but, IIRC, there is the `PageByPageLister` for pagination purposes. The Gitlab lister uses it. Maybe this one could too? I did not check the rest yet (I just saw the diff update with the pagination support).
				nahimilega (Author, Done): I checked `PageByPageLister` (used in gitlab) and `SWHIndexingHttpLister` (used in github). They can be used here, but:
				1) Here we are doing HTML parsing, and in cgit we get the links to all the pages at once (unlike the github api, where we only get the next url), so if we use this we need to parse the pagination part again and again. We can avoid this with some smart tricks, but that might make the code difficult to understand.
				2) The current technique fits quite well in the code; if we change to `PageByPageLister`, it would require a lot of revamping of the code.
				3) Most cgit instances don't even have a pagination system, and all of the biggest cgit instances currently listed in T1835 can be listed perfectly with the current approach.
				FWIW, I don't think using `PageByPageLister` would help. I would like to know your opinion on this.
				Btw, why do we have both `PageByPageLister` and `SWHIndexingHttpLister`? They serve quite similar purposes.
				ardumont (Done):
				> FWIW, I don't think using PageByPageLister would serve any help. I would like to know you opinion on this.
				I'll take a closer look at your diff. I'll get back to you on this to answer ;)
				> Btw why do we have both PageByPageLister and SWHIndexingHttpLister , they serve quite similar purpose.
				(urk at the name, we should drop the SWH prefix as it's redundant within swh modules ;) IIRC they serve the same purpose but the api underneath does not work the same way. And I think they could not be reconciled within the same api... (somehow, details are fuzzy).
				MODEL = CGitModel
				LISTER_NAME = 'cgit'
				PAGE = None
				ardumont (Done): Why not None?
				url_prefix_present = True

				ardumont (Done): We should stop using different names around. The latest other lister named that `api_baseurl`. Maybe use `url`, making explicit in the docstring what the url actually is (api base url, url to start from, etc.). That way, the following check (about the trailing /) can be done systematically here.
				nahimilega (Author, Done): I now use `urllib.parse`, so that check is no longer needed.
				nahimilega (Author, Done): I am not sure what you meant here. Shall I put a comment in `__init__()` saying that base_url means api_base_url? We no longer need to check for '/' in base_url since we switched to `urllib.parse`, but there is still a need to check for '/' in origin_url_prefix.
				ardumont (Done): I meant (I think, because that's an old comment) to name that url `url`. Please make the argument explicit in the constructor (`__init__`) docstring. Please keep the instance as second parameter to stay consistent with other listers. That'd give: `def __init__(self, url, instance, url_prefix...` I did not read the reasoning behind the `url_prefix` below yet.
				def __init__(self, url, instance=None, url_prefix=None,
				zack (Done): Why is this needed? If it's not to work around some common gotcha with cgit, we should not disturb configuration data. If it's for the sake of uniformity, that uniformity should be enforced on the cgit lister configuration, not here.
				nahimilega (Author, Done): This is for the case where someone enters the base url as http://abc.com . It will be converted to http://abc.com/ , which eases further computation.
				nahimilega (Author, Done): This is also done in the phabricator lister; anlambert recommended it to me there, so I used it here as well.
				override_config=None):
				"""Inits Class with PAGE url and origin url prefix.

				Args:
				url (str): URL of the CGit instance.
				ardumont (Done): Try to make explicit why you need that `url_prefix`. We know now, but we will forget. When that happens, the docstring is the first thing that will help.
				nahimilega (Author, Done): I have also mentioned it in the commit message.
				instance (str): Name of cgit instance.
				url_prefix (str): Prefix of the origin_url. On some instances the
				origin link of a repository does not match the URL
				of the repository page; there, origin URLs have
				ardumont (Done): I did not get that part. Are you sure it works for multiple instances? Can't `urllib.parse.urlparse` help a bit?
				ardumont (Done): Also, I saw the previous exchange about this, and it is still not clear to me what this does. Can you please also extract that into a function and test it? That will have the advantage of documenting it somehow. (I'm still interested in the explanations though ;)
				nahimilega (Author, Done): I have now extracted that part into a function `find_netloc()` and also written test cases for it.
				the format <url_prefix>/<repo_name>.

				"""

				ardumont (Done): `self.url_prefix = url if url_prefix is None else url_prefix`
				self.PAGE = url
				if url_prefix is None:
				self.url_prefix = url
				ardumont (Done): dictionaries.
				self.url_prefix_present = False
				else:
				ardumont (Done): A wee bit more detail on the algorithm used to parse the output would be welcome.
				ardumont (Not Done): `self.url_netloc = find_netloc(urlparse(self.PAGE))`
				nahimilega (Author, Not Done): I don't think this is a good idea, because the urllib object of `self.PAGE` is also used in line 50 (and also in line 47, which the comment was originally about).
				ardumont (Not Done): I don't understand your reply, but not to worry. You can keep the code as is.
				self.url_prefix = url_prefix

				if not self.url_prefix.endswith('/'):
				self.url_prefix += '/'
				ardumont (Done): As this `BeautifulSoup(...)` snippet is used quite a lot, I'd extract it into a function and use that function call instead (including in tests). `soup` is not very telling; `repo_soup` is not that good either, but at least we know we are dealing with some repository representation. It's also consistent with the `repo_url` you already use. So can you please refactor to something like: `repo_soup = make_repo_soup(response.text)`
				url = urlparse(self.PAGE)
				zack (Done): This way of indenting looks odd. Wouldn't:
				```
				BeautifulSoup(response.text, features="html.parser") \
				    .find('div', {"class": "content"})
				```
				be more customary? (Note: I haven't checked what PEP8 has to say about either version.)
				self.url_netloc = find_netloc(url)
				ardumont (Not Done): I forgot to ask before: why do you need to call this like that? Why not `super().__init__...`? (I think I saw why, but still, can you make it explicit?) Also, why isn't it in the first part of the constructor?

				if not instance:
				instance = url.hostname
				self.instance = instance

				ListerOnePageApiTransport.__init__(self)
				SimpleLister.__init__(self, override_config=override_config)
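A minimal sketch of the defaulting rules the constructor applies, pulled out as a standalone function so they can be tried without the lister base classes. The function name `normalize_cgit_config` and the returned dictionary are illustrative assumptions, not part of the diff; only the rules themselves (prefix defaults to the URL, trailing slash is enforced, instance defaults to the hostname) mirror the code above.

```python
from urllib.parse import urlparse


def normalize_cgit_config(url, instance=None, url_prefix=None):
    """Hypothetical helper mirroring the defaults of CGitLister.__init__."""
    # Remember whether the caller supplied an explicit prefix.
    url_prefix_present = url_prefix is not None
    prefix = url_prefix if url_prefix_present else url
    # Origin URLs are built as <prefix>/<repo_name>, so force a trailing slash.
    if not prefix.endswith('/'):
        prefix += '/'
    parsed = urlparse(url)
    return {
        'url_prefix': prefix,
        'url_prefix_present': url_prefix_present,
        'url_netloc': '%s://%s' % (parsed.scheme, parsed.netloc),
        # Fall back to the hostname when no instance name is given.
        'instance': instance or parsed.hostname,
    }


config = normalize_cgit_config('https://git.kernel.org/pub/scm')
print(config['url_prefix'])  # https://git.kernel.org/pub/scm/
print(config['instance'])    # git.kernel.org
```

Keeping these rules in a pure function is one way to unit-test them without instantiating the lister; whether that fits the project's structure is a design call for the authors.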
				ardumont (Done): Drop the `all the` expression you use everywhere. Simplify this to something like:
				```
				Find repositories metadata by parsing the html page (response's raw
				content). If there are links in the html page, retrieve those
				repositories metadata from those pages as well. Return the
				repositories as list of dictionaries.
				```
				nahimilega (Author, Done): Sorry, I now realise `all the` is redundant in the docstring.

				ardumont (Done):
				```
				Args:
				    response (Response): http api request response

				Returns:
				    repository origin urls (as dict) included in the response
				```
				ymmv but that's the gist of it, I think.
				def list_packages(self, response):
				vlorentz (Done): Items in the dict should be only one extra indent level.
				ardumont (Done): Ok, talking a bit between us, we are wondering whether this is needed. Going that way (priority url detection and all that) could be heavy on the server we are listing (as many requests as there are pagination pages, plus as many requests as there are repositories). We could have a more lightweight approach first: the lister computes the canonical way of exposing a repository (that'd need checking, amongst different cgit instances, that this computation is indeed shared between them). That'd have the benefit to:
				- decrease the quantity of code, thus less to maintain;
				- allow performance comparison with your first go on the git-savannah instance (which was roughly ~1 hour IIRC).
				nahimilega (Author, Done): Before choosing this method of visiting every repo page, I did some research. Here is the result: the extraction of the clone link from the page where the repos are listed (base_url) cannot be done in a stable way. Here is an example, for https://cgit.freedesktop.org/ :
				- some clone links look like https://gitlab.freedesktop.org/xdg/shared-mime-info and some like https://anongit.freedesktop.org/git/pulseaudio/pavucontrol.git.bup ;
				- some pages, like https://cgit.freedesktop.org/~cworth/piglit-www/ , are empty.
				There is no common pattern which could be exploited to reconstruct the clone link, nor is any info related to the clone link available on the base_url page. Hence I decided to take the approach of visiting every page.
				nahimilega (Author, Done): I just now thought of a clever approach which is lightweight and will do the work in a matter of seconds (unlike ~1 hour). The repositories in a cgit instance are divided into groups with a group heading (in pale white colour, as in https://cgit.freedesktop.org/). For all the members of a group, the clone link is in the format <some_url_common_to_all_members_of_a_group>/<name>.git . To find <some_url_common_to_all_members_of_a_group>, we need to visit the page of only one repository per group. This would drastically decrease the number of requests and make the code faster (although it would increase the quantity of code). I will test this method and inform you about the result. Does this method sound good?
				ardumont (Done):
				> For https://cgit.freedesktop.org/
				> It has some clone link as https://gitlab.freedesktop.org/xdg/shared-mime-info and some as
				which then will be out of scope for that lister. The gitlab lister instance for freedesktop will pick up those repositories (probably done already).
				> https://anongit.freedesktop.org/git/pulseaudio/pavucontrol.git.bup.
				This repo does not list anything so that's it.
				> Some pages like https://cgit.freedesktop.org/~cworth/piglit-www/ are empty
				That's the loader's concern, not the lister's. Given enough failures, a loader's associated task will be disabled by the scheduler. Some listers currently list private repositories, for example, possibly because they were public at the time; then, when they are loaded, they fail (401) and their task gets disabled. It's not an issue.
				> There is no comman pattern which could be exploited to reconstruct the clone link. Neither any info related to clone link is available on the base_url page.
				Well, as a first approximation, I'd say let's go towards removing the visit step anyway (and compute a basic git clone url; `self.get_url` sounds good enough IIRC). It's good to start incrementally and not overthink everything. Let's start small, deploy the thing, see where it goes, and adapt when it fails. Mostly, aside from my remarks on the diff (which I'd like you to take into account nonetheless), the lister is almost ready. So, if we could deploy this soon, that'd be great. And what's proposed is to simplify some code, which tends towards less to maintain ;)
				> Hence I decided to take the approach to visit every page
				Which is a reasonable choice. That we are currently challenging for the good cause ;)
				> I just now thought of a clever approach... I will test this method and inform you about the result. Does this method sound good?
				I'd say yes. Please, first amend the current diff with my remarks (and I also have some questions btw). Then you could try this in another branch (starting from this diff, and eventually open another diff depending on this). (After all, if that becomes really necessary, we could also imagine having different listing policies for the same lister within different instances. We have currently identified 3 distinct policies ;)
				"""List the actual cgit instance origins from the response.

				Find repositories metadata by parsing the html page (response's raw
				content). If there are links in the html page, retrieve those
				repositories metadata from those pages as well. Return the
				repositories as list of dictionaries.
				zack (Done): Add a trailing comma after this line; it will make future diffs that add an additional line nicer/shorter.
				ardumont (Done): Do we have a rough idea of the number of pages we can encounter in different cgit instances?
				nahimilega (Author, Not Done): I think it's 275, in http://hdiff.luite.com/cgit/
				ardumont (Not Done): Sorry, I was unclear, I meant the average number of pages. The instance you mentioned is so far the worst case ;)
				nahimilega (Author, Done): Most of the cgit instances don't have pagination. I think the average would be around 2-3 pages.
				ardumont (Done): Following the logic presented in comments below, here you want to exclude the first page, so: `repos.extend(self.get_repos(pages[1:]))` Note: `get_all_pages` is renamed to `get_repos` (see below).

				Args:
				ardumont (Done): Indentation is still off to me.
				response (Response): http api request response.

				Returns:
				List of repository origin urls (as dict) included in the response.
				ardumont (Done): Indentation is off.

				"""
				nahimilega (Author, Done): I still don't think the indentation is correct here. Can you please show me what the correct indentation would be?
				ardumont (Done): When in doubt, first look at how the source code is generally written in this very repository (and in others if that's not enough). The answer is in your current module. Just align with the beginning of the line (at the current indentation level):
				    repos_details.append({
				        'name': repo_name,
				        'time': time,
				        'origin_url': origin_url,
				    })
				repos_details = []
				ardumont (Done): Why the need to shuffle?
				nahimilega (Author, Done): I first saw this shuffle in the pypi lister, so to maintain uniformity I also used it in the gnu and cran listers and here too, although I don't really know its importance.
				ardumont (Done): Ok. I don't remember the reasons though, too bad.
				repos = get_repo_list(response.text)
				url_soup = make_soup(response.text)
				pages = self.get_pages(url_soup)
				if len(pages) > 1:
				ardumont (Done): One last modification: extract this into a method, to avoid the `list` wrapping.
				```
				def _yield_repo_from_responses(self, response):
				    """Yield repositories from request responses...

				    Args:
				        ...

				    Yields:
				        ...

				    """
				    html = response.text
				    yield from get_repo_list(html)
				    pages = self.get_pages(make_soup(html))
				    if len(pages) > 1:
				        yield from self.get_repos_from_pages(pages[1:])

				def list_packages(self, response):
				    """... <existing docstring>

				    """
				    for repo in self._yield_repo_from_responses(response):
				        ...
				```
				repos.extend(list(self.get_repos_from_pages(pages[1:])))

				for repo in repos:
				repo_name = repo.a.text
				origin_url = self.find_origin_url(repo, repo_name)

				zack (Done): This way of typesetting args/return is not consistent with our python coding guidelines; see, e.g., https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
				try:
				nahimilega (Author, Done): I don't think this is the best way to do it. It looks too messy, but I can't use s.find_all('/') because find_all is a method of bs4. Can you please recommend a smarter approach?
				vlorentz (Done): `(part1, part2, next_url) = self.base_url.split('/', 2)`
				nahimilega (Author, Done): Thanks, @vlorentz, This is really neat of doing it
				nahimilega (Author, Done): *neat way of doing it
				time = repo.span['title']
				except Exception:
				time = None

				if origin_url is not None:
				repos_details.append({
				'name': repo_name,
				ardumont (Done): `if not pages:`
				'time': time,
				'origin_url': origin_url,
				})

				ardumont (Done): What's the format of a['href']? Is that an href without the absolute url? Is that true for all cgit instances?
				nahimilega (Author, Done): a['href'] is a string. It is not an absolute url; it is relative to the `url_netloc` I computed in my code. For example, for http://git.savannah.gnu.org/cgit/, all the URLs are relative to http://git.savannah.gnu.org
				> Is that true for all cgit instances?
				It is true for all the instances I have seen so far (all mentioned in T1835).
				ardumont (Done): Great, thanks.
				random.shuffle(repos_details)
				ardumont (Done): Also, why not use a list comprehension here?
				```
				return [self.get_url(page) for page in pages]
				```
				return repos_details

				def find_origin_url(self, repo, repo_name):
				ardumont (Done): `pages ([str]): ..`
				"""Finds the origin url for a repository

				Args:
				ardumont (Done): `pages (except` (missing space)
				repo (Beautifulsoup): Beautifulsoup object of the repository
				ardumont (Done):
				```
				Returns:
				    List of beautifulsoup of all the repositories (url) row present in
				    all the pages (except the first one).
				```
				Drop the `of the html code of`: we mention it's a beautifulsoup object, which, as far as I understand, is an xml/html parser. So it is implicit that we deal with html, which I find reasonable in that case.
				row present in base url.
				repo_name (str): Repository name.
				ardumont (Done): You are not retrieving pages here, you are fetching repositories from pages (all the pages you parsed in the first html page). So either `get_repos` or `get_repos_from_pages`.

				Returns:
				string: origin url.
				ardumont (Done): instance

				ardumont (Done): Request the `available repos` from the pages. This yields the available repositories found as beautiful object representation.
				"""

				if self.url_prefix_present:
				return self.url_prefix + repo_name

				return self.get_url(repo)
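The choice made by `find_origin_url` can be sketched as a standalone function. This is an illustration under assumptions: `choose_origin_url` is a hypothetical name, the `href` argument stands in for `repo.a['href']`, and every URL below is a placeholder rather than a real cgit instance.

```python
def choose_origin_url(repo_name, href, url_netloc, url_prefix=None):
    """Hypothetical standalone version of the origin URL choice."""
    if url_prefix is not None:
        # Instance whose clone URLs follow <url_prefix>/<repo_name>.
        if not url_prefix.endswith('/'):
            url_prefix += '/'
        return url_prefix + repo_name
    # Otherwise fall back to the repository page URL: the href of the
    # row is relative to the scheme and network location of the base URL.
    return url_netloc + href


# All URLs below are placeholders, not real cgit instances.
print(choose_origin_url('foo.git', '/cgit/foo.git/', 'https://cgit.example.org'))
# https://cgit.example.org/cgit/foo.git/
print(choose_origin_url('foo.git', '/cgit/foo.git/', 'https://cgit.example.org',
                        url_prefix='https://git.example.org/git'))
# https://git.example.org/git/foo.git
```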

				ardumont (Done): Yields (if you follow the snippet proposed below).
				def get_pages(self, url_soup):
				"""Find URL of all pages.
				zack (Done): As before: this doesn't conform to Napoleon style.
				ardumont (Done): This looks familiar... Please refactor this within a method or a function.

				ardumont (Done): This looks familiar... Please refactor this within a method or a function.
				Finds URL of pages that are present by parsing over the HTML of
				ardumont (Done): Heads up, code review stopped here!
				pagination present at the end of the page.

				Args:
				url_soup (Beautifulsoup): a beautifulsoup object of base URL
				vlorentz (Done): Doctests should be written like this, with no indent:
				```
				>>> find_origin_url('http://git.savannah.gnu.org/cgit/fbvbconv-py.git/')
				'https://git.savannah.gnu.org/git/fbvbconv-py.git'
				```

				Returns:
				list: URL of pages present for a cgit instance
				ardumont (Done): You can use a generator here:
				```
				for page in pages:
				    response = requests.get(page)
				    if not response.ok:
				        # deal with error as warning without impeding the listing to finish
				        logger.warn('Failed to retrieve repositories from page %s', page)
				        continue
				    yield from get_repo_list(response.text)
				```
				Also, remove the trick about removing the first page. Just pass the list of pages you effectively want to parse data from when you call the method. That simplifies the method and the docstring. Just add a comment when you call the method if you feel it's not explicit enough.

				"""
				pages = url_soup.find('div', {"class": "content"}).find_all('li')

				if not pages:
				return [self.PAGE]

				return [self.get_url(page) for page in pages]

				def get_repos_from_pages(self, pages):
				"""Find repos from all pages.

				Request the available repos from the pages. This yields the
				available repositories found, as beautifulsoup objects.

				Args:
				pages ([str]): list of urls of all pages present for a
				zack (Done): Same: Napoleon mismatch.
				particular cgit instance.

				Yields:
				List of beautifulsoup object of repository (url) rows
				present in pages (except first).

				"""

				ardumont (Done): Remove the extra line. Start directly after the last docstring comment. We do always have the extra line in the docstring.
				for page in pages:
				vlorentz (Done): Same.
				zack (Done): I think this is indented too much (at least the first line; not sure about the rest w.r.t. what doctest expects).
				response = requests.get(page)
				if not response.ok:
				# Log the failure as a warning so that the listing can still finish.
				ardumont (Done): The comment here is not necessarily needed; I added it in the diff comment to make it explicit to you ;)
				logging.warning('Failed to retrieve repositories from page %s',
				page)
				ardumont (Done): How come `response` is already in the right format for this lister?
				continue

				yield from get_repo_list(response.text)
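The skip-and-continue behaviour of `get_repos_from_pages` can be exercised without any network access. The sketch below is a hedged illustration: `repos_from_pages`, `fetch` and `parse` are hypothetical names, with `fetch` standing in for `requests.get` and `parse` for `get_repo_list`, and the page data is made up.

```python
import logging

logger = logging.getLogger(__name__)


def repos_from_pages(pages, fetch, parse):
    """Yield repositories from each page, skipping pages that fail to load.

    `fetch(page)` returns an (ok, text) pair and `parse(text)` returns an
    iterable of repositories; both are stand-ins injected for illustration.
    """
    for page in pages:
        ok, text = fetch(page)
        if not ok:
            # A failed page is logged and skipped, so the listing still finishes.
            logger.warning('Failed to retrieve repositories from page %s', page)
            continue
        yield from parse(text)


# Fake three-page instance: the third page is unreachable.
fake_pages = {'p2': (True, 'repo-c repo-d'), 'p3': (False, '')}
all_pages = ['p1', 'p2', 'p3']
# The first page was already parsed from the initial response, hence [1:].
repos = list(repos_from_pages(all_pages[1:], fake_pages.get, str.split))
print(repos)  # ['repo-c', 'repo-d']
```

Passing only the pages that still need fetching (`all_pages[1:]`) keeps the generator free of any special case for the first page, which is the simplification requested in the review.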

				def get_url(self, repo):
				"""Finds url of a repo page.

				Finds the url of a repo page by parsing over the html of the row of
				that repo present in the base url.

				Args:
				repo (Beautifulsoup): a beautifulsoup object of the repository
				row present in base url.

				Returns:
				string: The url of a repo.

				"""
				suffix = repo.a['href']
				return self.url_netloc + suffix

				def get_model_from_repo(self, repo):
				zack (Done): Napoleon
				"""Transform from repository representation to model.

				"""
				return {
				'uid': self.PAGE + repo['name'],
				'name': repo['name'],
				'full_name': repo['name'],
				'html_url': repo['origin_url'],
				'origin_url': repo['origin_url'],
				'origin_type': 'git',
				ardumont (Done): `"""Find all origin urls per repository by...`
				ardumont (Done): Find repositories (as beautifulsoup object) available within the server response.
				'time_updated': repo['time'],
				'instance': self.instance,
				}
				ardumont (Done): Drop this, the first line proposed is enough (I think).

				ardumont (Done): `a beautiful soup object repo representation`
				def transport_response_simplified(self, repos_details):
				"""Transform response to list for model manipulation.

				ardumont (Done): All possible origin urls for a repository (dict with key 'protocol', value the associated url).
				"""
				return [self.get_model_from_repo(repo) for repo in repos_details]
				ardumont (Done): List all the repositories as beautifulsoup object within the response.


				def find_netloc(url):
				nahimilega (Author, Done): These are just one-line functions without much logic; they are just simple HTML parsing. Do we need tests for these functions too?
				ardumont (Done): Yes, to avoid decreasing the coverage too much. If there is not much logic in them, they are simple to test as well.
				"""Finds the network location from the url.

				URLs in the repository rows are relative to the network location
				part of the base URL, so we need to compute it to reconstruct URLs.

				ardumont (Done): What's repo_url's type? Please mention it in the docstring.
				```
				Args:
				    repo_url (<TYPE>): ...
				```
				Please do so everywhere it's missing in the current diff. (Even if those are currently missing elsewhere, that's a gap we are trying to fill in.) If the type comes from a third-party module, it's fine to use that as the type (for example `response (Response):` ...).
				Args:
				ardumont (Done): Instantiates a beautiful soup object from the response object.
				url (urllib.parse.ParseResult): parsed URL, as returned by urlparse.

				Returns:
				string: Scheme and Network location part in the base URL.

				Example:
				>>> from urllib.parse import urlparse
				>>> find_netloc(urlparse('https://git.kernel.org/pub/scm/'))
				'https://git.kernel.org'

				"""
				return '%s://%s' % (url.scheme, url.netloc)
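A short, self-contained run of `find_netloc`, showing how the result combines with a row's relative href as `get_url` does. The function body is copied from the diff; the `href` value is a made-up example, not taken from a real page.

```python
from urllib.parse import urlparse


def find_netloc(url):
    # Same expression as in the diff: scheme plus network location.
    return '%s://%s' % (url.scheme, url.netloc)


base = urlparse('https://git.kernel.org/pub/scm/')
netloc = find_netloc(base)
# A repository row's href is relative to the network location, not to the
# full base URL. The href below is a made-up example.
href = '/pub/scm/example/repo.git/'
print(netloc)         # https://git.kernel.org
print(netloc + href)  # https://git.kernel.org/pub/scm/example/repo.git/
```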


				def get_repo_list(response):
				"""Find repositories (as beautifulsoup object) available within the server
				response.

				Args:
				response (str): text (HTML) of the server response
				ardumont (Done): origin_urls (Dict): All possible origin urls for a repository (key 'protocol', value the associated url)

				Returns:
				List of all repositories (as beautifulsoup objects) within the response.
				ardumont (Done): Url (str) with the highest...

				"""
				repo_soup = make_soup(response)
				ardumont (Done): I've already mentioned I'd prefer this named repo_soup or plain repo (as long as we mention in the docstring what the expected type is).
				return repo_soup \
				.find('div', {"class": "content"}).find_all("tr", {"class": ""})


				def make_soup(response):
				"""Instantiates a beautiful soup object from the response object.

				"""
				return BeautifulSoup(response, features="html.parser")