Changeset View
Standalone View
swh/lister/gnu/lister.py
- This file was added.
# Copyright (C) 2019 the Software Heritage developers | |||||
ardumont: copyright header needs an update.
Since it's inspired from the tar-loader, it'd be 2018-2019. | |||||
# License: GNU General Public License version 3, or any later version | |||||
# See top-level LICENSE file for more information | |||||
import random | |||||
import gzip | |||||
import json | |||||
import requests | |||||
from pathlib import Path | |||||
from collections import defaultdict | |||||
from .models import GNUModel | |||||
from swh.scheduler import utils | |||||
from swh.lister.core.simple_lister import SimpleLister | |||||
class GNULister(SimpleLister): | |||||
MODEL = GNUModel | |||||
LISTER_NAME = 'gnu' | |||||
TREE_URL = 'https://ftp.gnu.org/tree.json.gz' | |||||
BASE_URL = 'https://ftp.gnu.org' | |||||
instance = 'gnu' | |||||
tarballs = defaultdict(dict) | |||||
ardumontUnsubmitted Done Inline ActionsSpecify the variable's goal. Something like, please improve as you see fit: Dict of key the project name (/id/url?), value the associated list of tarballs to ingest from the gnu mirror ardumont: Specify the variable's goal.
Something like, please improve as you see fit:
```
Dict of key… | |||||
def task_dict(self, origin_type, origin_url, **kwargs): | |||||
""" | |||||
Return task format dict | |||||
Done Inline ActionsThis is going to iter over bytes, not chunks; and it will be very inefficient. Use this instead: with open(self.path, 'rb') as f: while True: chunk = f.read(chunk_size) if not chunk: break yield chunk vlorentz: This is going to iter over bytes, not chunks; and it will be very inefficient.
Use this… | |||||
Done Inline ActionsI copied this class from loader-tar, hence I assume we also change this in the loader-tar class to make it more efficient. nahimilega: I copied this class from loader-tar, hence I assume we also change this in the loader-tar class… | |||||
Done Inline ActionsI'll update the loader-tar with this. ardumont: I'll update the loader-tar with this. | |||||
This is overridden from the lister_base as more information is | |||||
needed for the ingestion task creation. | |||||
""" | |||||
return utils.create_task_dict( | |||||
Done Inline Actionstarballs here as well. ardumont: `tarballs` here as well.
That ends up in the `future` loader part. | |||||
Done Inline Actions? ardumont: ? | |||||
Done Inline ActionsThe list of tarballs is stored as the json string format in the model and needs to be converted back to list of dictionary before using it to create a loading task nahimilega: The list of tarballs is stored as the json string format in the model and needs to be converted… | |||||
Done Inline ActionsBut why do we need the list of tarballs in the model? I'd expect we compute it and pass it along to the loader. ardumont: But why do we need the list of tarballs in the model?
I'd expect we compute it and pass it… | |||||
Done Inline ActionsPlease correct me if I am wrong somewhere - To pass the list of tarballs in the loading task we need it so that it can be accessed in the task_dict() function. To access the list of tarballs in the task_dict() we need to ensure it is present in the model. nahimilega: Please correct me if I am wrong somewhere -
To pass the list of tarballs to the loader we need… | |||||
Done Inline Actions
I'm missing the end of the sentence.
I'm not sure yet. I'd need to dig in the code to make sure. All I know is that it should not be a requisite to store the computed data (needed for the loader) in the lister model. ardumont: > Please correct me if I am wrong somewhere -
> To pass the list of tarballs to the loader we… | |||||
Done Inline Actions
Sorry for that, it is "To pass the list of tarballs to the loader we need to pass it in the loading task created."
I tried to do what I thought would work fine, and I am not able to come up with another way to send the tarballs of loader. Can you please look up suggest some better way. nahimilega: > >Please correct me if I am wrong somewhere -
>> To pass the list of tarballs to the loader we… | |||||
Done Inline ActionsOne way to avoid including tarballs in model is to make a variable instance of class named tarballs (like LISTER_NAME or TREE_URL), which would countain all the tarballs of each package and can be accessed from task_dict() function nahimilega: One way to avoid including tarballs in model is to make a variable instance of class named… | |||||
Done Inline Actions
I mean that through the scheduler (so that's what the task_dict method here is all about). ardumont: > I'd expect we compute it and pass it along to the loader.
I mean that through the scheduler… | |||||
'load-%s' % origin_type, 'recurring', kwargs.get('name'), | |||||
origin_url, tarballs=self.tarballs[kwargs.get('name')]) | |||||
def get_file(self): | |||||
''' | |||||
Downloads and unzip tree.json.gz file and returns its content | |||||
in JSON format | |||||
Returns | |||||
File content in JSON format | |||||
''' | |||||
response = requests.get(self.TREE_URL, | |||||
allow_redirects=True) | |||||
Done Inline ActionsIt's not the tar loader client, Software Heritage Lister ardumont: It's not the tar loader client, `Software Heritage Lister` | |||||
uncompressed_content = gzip.decompress(response.content) | |||||
Done Inline ActionsWhy use a string substitution here? vlorentz: Why use a string substitution here? | |||||
return json.loads(uncompressed_content.decode('utf-8')) | |||||
def safely_issue_request(self, identifier): | |||||
''' | |||||
Make network request with to download the file which | |||||
has file structure of the GNU website. | |||||
Args: | |||||
identifier: resource identifier | |||||
Returns: | |||||
Done Inline ActionsDrop the intermediary variable. ardumont: Drop the intermediary variable.
Also, if the `get_file` is only used here, you can inline… | |||||
Done Inline ActionsI did it this way because if I remove safely_issue_request() method then I have to override the whole run() method from SimpleLister class just to change one line(ie the top line where safely_issue_request() is called), unnecessary increase in code. Shall I keep it the way it is now or override the whole run() method? nahimilega: I did it this way because if I remove safely_issue_request() method then I have to override the… | |||||
Done Inline ActionsOk then. I don't like doing that, altering code for tests, but i acknowledge we need to for now. ardumont: Ok then.
Let's keep that.
I don't like doing that, altering code for tests, but i acknowledge… | |||||
server response | |||||
''' | |||||
return self.get_file() | |||||
def list_packages(self, response): | |||||
""" | |||||
List the actual gnu origins with their names and | |||||
time last updated from the response. | |||||
Args: | |||||
response : File structure of the website | |||||
in JSON format | |||||
Returns: | |||||
a list of all the packages with their names, url of their root | |||||
directory and the tarballs present for the particular package. | |||||
[ | |||||
Done Inline Actions'tarballs' is enough. ardumont: 'tarballs' is enough. | |||||
{'name': '3dldf', 'url': 'https://ftp.gnu.org/gnu/3dldf/', | |||||
'tarballs': | |||||
[ | |||||
{'archive': | |||||
'https://ftp.gnu.org/gnu/3dldf/3DLDF-1.1.3.tar.gz', | |||||
'date': '1071002600'}, | |||||
{'archive': | |||||
'https://ftp.gnu.org/gnu/3dldf/3DLDF-1.1.4.tar.gz', | |||||
'date': '1071078759'}} | |||||
] | |||||
}, | |||||
{'name': '8sync', 'url': 'https://ftp.gnu.org/gnu/8sync/', | |||||
'tarballs': | |||||
[ | |||||
{'archive': | |||||
'https://ftp.gnu.org/gnu/8sync/8sync-0.1.0.tar.gz', | |||||
Done Inline ActionsAlso, i guess, we could push this code from the loader-tar to the core if that needs to be reused. ardumont: Also, i guess, we could push this code from the loader-tar to the core if that needs to be… | |||||
Done Inline ActionsIt would be great if loader-tar can be shifted to the core. That would reduce the code to less than half nahimilega: It would be great if loader-tar can be shifted to the core. That would reduce the code to less… | |||||
Done Inline ActionsIn the end, i think it'd be best to use simply use request to download that file (which is already a module dependency). There has been discussion to refactor the distribution/package loader and that code will probably disappear (so no point in pushing it to core). ardumont: In the end, i think it'd be best to use simply use `request` to download that file (which is… | |||||
'date': '1461357336'}, | |||||
{'archive': | |||||
'https://ftp.gnu.org/gnu/8sync/8sync-0.2.0.tar.gz', | |||||
'date': '1480991830'} | |||||
] | |||||
] | |||||
Done Inline ActionsThat's an URL, not a file. And the constant's name should tell what is at that URL. (eg. TREE_URL) vlorentz: That's an URL, not a file. And the constant's name should tell what is at that URL. (eg. | |||||
""" | |||||
response = filter_directories(response) | |||||
packages = [] | |||||
for directory in response: | |||||
content = directory['contents'] | |||||
for repo in content: | |||||
if repo['type'] == 'directory': | |||||
package_url = '%s/%s/%s/' % (self.BASE_URL, | |||||
directory['name'], | |||||
Done Inline Actionstarballs = ardumont: `tarballs =` | |||||
repo['name']) | |||||
package_tarballs = find_tarballs( | |||||
repo['contents'], package_url) | |||||
if package_tarballs != []: | |||||
repo_details = { | |||||
Done Inline ActionsFYI, we are currently renaming the scheduler task names to something more rational (see T1508). _type = 'load-%s' % origin_type anlambert: FYI, we are currently renaming the scheduler task names to something more rational (see T1508). | |||||
'name': repo['name'], | |||||
'url': package_url, | |||||
'time_modified': repo['time'], | |||||
} | |||||
self.tarballs[repo['name']] = package_tarballs | |||||
packages.append(repo_details) | |||||
Done Inline ActionsThis needs to be changed to match what needs the loader-tar to ingest the tarball [1] [1] https://forge.softwareheritage.org/source/swh-loader-tar/browse/master/swh/loader/tar/tasks.py$12 ardumont: This needs to be changed to match what needs the loader-tar to ingest the tarball [1]
[1]… | |||||
Done Inline ActionsAs this lister is listing all the repos in gnu, mirror and old-gnu directory, there are some files in the listed repos that are not tarball and hence are not required to be loaded, hence I thought of creating a separate gnu loader which would remove useless files and then load. nahimilega: As this lister is listing all the repos in gnu, mirror and old-gnu directory, there are some… | |||||
Done Inline ActionsSo I wish to make a new lister then also I have to match this according to loader-tar? nahimilega: So I wish to make a new lister then also I have to match this according to loader-tar? | |||||
Done Inline Actions
Discussing with @anlambert (no task yet), we entertained the idea of having a dedicated gnu loader. So that'd go in the direction of a dedicated loader as you mentioned. For the files that are missing or files that are not tarballs, the lister simply needs to not list them (thus not passing them along to the loader).
I have trouble understanding the sentence. ardumont: > As this lister is listing all the repos in gnu, mirror and old-gnu directory, there are some… | |||||
Done Inline ActionsSorry, there was a typo in this sentence, it should be nahimilega: Sorry, there was a typo in this sentence, it should be
So I wish to make a new loader then… | |||||
Done Inline ActionsHave a look at [1], this should be simplified to create a task whose arguments are a list of dict. Dict of key values:
Later, the tarball loader should be adapted (again according to [1]) to ingest those tarballs and make the appropriate snapshot (result of the one loading visit). [1] T1389#31790 ardumont: Have a look at [1], this should be simplified to create a task whose arguments are a list of… | |||||
random.shuffle(packages) | |||||
return packages | |||||
def get_model_from_repo(self, repo): | |||||
"""Transform from repository representation to model | |||||
""" | |||||
return { | |||||
Done Inline ActionsPlease, remove the print statement ardumont: Please, remove the print statement | |||||
'uid': repo['name'], | |||||
'name': repo['name'], | |||||
'full_name': repo['name'], | |||||
Done Inline ActionsLike i said earlier, change the download implementation to use the requeststool. ardumont: Like i said earlier, change the download implementation to use the `requests`tool.
Inline… | |||||
'html_url': repo['url'], | |||||
'origin_url': repo['url'], | |||||
'time_last_updated': repo['time_modified'], | |||||
'origin_type': 'gnu', | |||||
'description': None, | |||||
Done Inline ActionsAre you sure you need to do that? ardumont: Are you sure you need to do that? | |||||
Done Inline ActionsThere is no datatype to store a list of dictionary in Postgres, hence either we make a separate table for this, or as it is a list of dictionary ,we can convert it into json string and store it in string datatype in database, and convert it back to list of dictionary when needed. I used the JSON method over other database as it is easy to write and understand. nahimilega: There is no datatype to store a list of dictionary in Postgres, hence either we make a separate… | |||||
Done Inline ActionsSo checking the base code of the simple lister class (ingest_data). ardumont: So checking the base code of the simple lister class (`ingest_data`).
The `tarball` key should… | |||||
} | |||||
def transport_response_simplified(self, response): | |||||
"""Transform response to list for model manipulation | |||||
""" | |||||
return [self.get_model_from_repo(repo) for repo in response] | |||||
def find_tarballs(package_file_structure, url): | |||||
''' | |||||
Recursively lists all the tarball present in the folder and | |||||
subfolders for a particular package url. | |||||
Done Inline ActionsAre those override really necessary (i don't remember)? ardumont: Are those override really necessary (i don't remember)? | |||||
Done Inline ActionsYes, they are necessary, they are mentioned as abc.abstract attribute in the simple lister class nahimilega: Yes, they are necessary, they are mentioned as abc.abstract attribute in the simple lister… | |||||
Done Inline ActionsYou should remove those decorators from the base class. You should then remove those unused definitions (that might also simplify the cran lister diff). ardumont: You should remove those decorators from the base class.
It's not used here and makes the code… | |||||
Done Inline ActionsI created D1566 to address this issue nahimilega: I created D1566 to address this issue | |||||
Args | |||||
Done Inline ActionsSimplify the name to find_tarballs. Improve the docstring to clarify this is done per origin/package url. ardumont: Simplify the name to `find_tarballs`.
Improve the docstring to clarify this is done per… | |||||
package_file_structure : File structure of the package root directory | |||||
url : URL of the corresponding package | |||||
Done Inline Actionstarball`s` ardumont: tarball`s` | |||||
Returns | |||||
List of all the tarball urls and the last their time of update | |||||
example- | |||||
Done Inline Actionscorresponding ardumont: corresponding | |||||
For a package called 3dldf | |||||
[ | |||||
{'archive': 'https://ftp.gnu.org/gnu/3dldf/3DLDF-1.1.3.tar.gz', | |||||
'date': '1071002600'} | |||||
Done Inline ActionsWhy the underscore prefix? vlorentz: Why the underscore prefix? | |||||
Done Inline ActionsI took inspiration from PyPI lister, there it had underscore prefix hence I opted to follow the same style. Shall I change it? nahimilega: I took inspiration from PyPI lister, there it had underscore prefix hence I opted to follow the… | |||||
Done Inline ActionsYou can drop the _, it was to be read as a private variable. It seems to be unneeded, so simply drop it ;) ardumont: You can drop the `_`, it was to be read as a private variable. It seems to be unneeded, so… | |||||
{'archive': 'https://ftp.gnu.org/gnu/3dldf/3DLDF-1.1.4.tar.gz', | |||||
'date': '1071078759'} | |||||
{'archive': 'https://ftp.gnu.org/gnu/3dldf/3DLDF-1.1.5.1.tar.gz', | |||||
'date': '1074278633'} | |||||
... | |||||
] | |||||
''' | |||||
tarballs = [] | |||||
for single_file in package_file_structure: | |||||
file_type = single_file['type'] | |||||
file_name = single_file['name'] | |||||
if file_type == 'file': | |||||
if file_extension_check(file_name): | |||||
Done Inline Actionstarballs ardumont: `tarballs` | |||||
Done Inline Actionsplease, rename to a simpler name. ardumont: please, rename to a simpler name. | |||||
tarballs .append({ | |||||
"archive": url + file_name, | |||||
Done Inline Actionsurl, not urls. And the name should tell what it actually does. (eg. _get_project_url) vlorentz: `url`, not `urls`. And the name should tell what it actually does. (eg. `_get_project_url`) | |||||
Done Inline ActionsDeclare variable for the filetype and filename to reuse within the block. ardumont: Declare variable for the filetype and filename to reuse within the block.
That's clearer and… | |||||
"date": single_file['time'] | |||||
Done Inline ActionsI am not sure this is a good method to find if the file is a tarball, can you please suggest some improvements nahimilega: I am not sure this is a good method to find if the file is a tarball, can you please suggest… | |||||
Done Inline ActionsThat'd be simpler to read. filename = single_file['name'] ".tar" in filename or ".zip" in filename ardumont: That'd be simpler to read.
```
filename = single_file['name']
".tar" in filename or ".zip" in… | |||||
Done Inline ActionsIf we do this, then it will also include files with extension tar.gz.sig. I don't think we need to add those files as they do not contain any code. nahimilega: If we do this, then it will also include files with extension tar.gz.sig. I don't think we need… | |||||
Done Inline ActionsIndeed, that's inconvenient. I guess gnu's manifest (tree.json.gz) is not moving a lot so the way forward would be to analyze and compulse the list of possible extension files from that manifest (take a snapshot of today would be enough). Could you try and investigate in that direction? Thanks ardumont: Indeed, that's inconvenient.
I guess gnu's manifest (tree.json.gz) is not moving a lot so the… | |||||
Done Inline ActionsIn my view, the best method to check for tarball would be to break the filename on " . " and check if the word between last and second last " . " is "tar" or not?. If it is "tar" then the file is useful, else the file does not have source code. nahimilega:
In my view, the best method to check for tarball would be to break the filename on " . " and… | |||||
}) | |||||
# It will recursively check for tarballs in all sub-folders | |||||
elif file_type == 'directory': | |||||
Done Inline ActionsI have used the function here, sorry in the previous revision I forgot to call the function. And sorry for the silly mistake with the spelling of extension, I will make sure to check for spelling before submitting the diff. nahimilega: I have used the function here, sorry in the previous revision I forgot to call the function. | |||||
Done Inline ActionsNo worries, that happens. ardumont: No worries, that happens. | |||||
tarballs_in_dir = find_tarballs( | |||||
single_file['contents'], | |||||
url + file_name + '/') | |||||
tarballs .extend(tarballs_in_dir) | |||||
return tarballs | |||||
def filter_directories(response): | |||||
''' | |||||
Done Inline ActionsYou should add unit tests on that function. ardumont: You should add unit tests on that function. | |||||
Keep only gnu and old-gnu folders from JSON | |||||
''' | |||||
final_response = [] | |||||
file_system = response[0]['contents'] | |||||
Done Inline Actionsfilter_directories sounds clearer enough. ardumont: `filter_directories` sounds clearer enough. | |||||
for directory in file_system: | |||||
if directory['name'] in ('gnu', 'old-gnu'): | |||||
Done Inline ActionsThat's not really what happens here. ardumont: That's not really what happens here.
You filter to some initial folders with names (gnu… | |||||
Done Inline Actionsdocstring update to: ardumont: docstring update to:
"Keep only `gnu` and `old-gnu` folders from JSON." | |||||
final_response.append(directory) | |||||
return final_response | |||||
def file_extension_check(file_name): | |||||
''' | |||||
Check for the extention of the the file, if the file is of zip format of | |||||
Done Inline ActionsSame clean_up_response should be:
ardumont: Same `clean_up_response` should be:
- renamed to explicit what it does
- unit tested | |||||
ardumontUnsubmitted Done Inline Actionsextension here as well. Please can you check the typo is fixed everywhere? ardumont: extension here as well.
Please can you check the typo is fixed everywhere? | |||||
.tar.x format where x could be anything then returns true. | |||||
Args: | |||||
file_name : name of the file for which the extentions is needs to | |||||
be checked. | |||||
Done Inline Actionsnitpick: this is equivalent to: if directory['name'] in ('gnu', 'mirrors', 'old-gnu'): vlorentz: nitpick: this is equivalent to:
```
if directory['name'] in ('gnu', 'mirrors', 'old-gnu'):
``` | |||||
Done Inline Actionsextension Although, i do not see where that function is used. ardumont: `extension`
Although, i do not see where that function is used.
Did you use it to analyze the… | |||||
Returns: | |||||
True or False | |||||
Done Inline ActionsCheck for file extension. Accepts file extension with .zip or .tar.x (x could be anything). ardumont: Check for file extension. Accepts file extension with `.zip` or `.tar.x` (x could be anything). | |||||
example | |||||
file_extension_check('abc.zip') will return True | |||||
file_extension_check('abc.tar.gz') will return True | |||||
Done Inline Actionsfile_name: filename whose extension needs to be checked. ardumont: file_name: filename whose extension needs to be checked.
| |||||
file_extension_check('abc.tar.gz.sig') will return False | |||||
''' | |||||
file_suffixes = Path(file_name).suffixes | |||||
if len(file_suffixes) == 1 and file_suffixes[-1] == '.zip': | |||||
return True | |||||
elif len(file_suffixes) > 1: | |||||
if file_suffixes[-1] == '.zip' or file_suffixes[-2] == '.tar': | |||||
return True | |||||
Done Inline Actionsexten`s`ion ardumont: exten`s`ion | |||||
return False |
copyright header needs an update.
Since it's inspired from the tar-loader, it'd be 2018-2019.