I looked into the data dumps provided on rubygems.org.
A bash script (link to the script) is provided by RubyGems that downloads the most recent weekly dump listed on https://rubygems.org/pages/data and loads it into a PostgreSQL database.
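Once that script has loaded the dump, the package list could be pulled out with a simple query. The sketch below is illustrative only: the database, table, and column names are assumptions that would need to be checked against the dump's actual schema.

import psycopg2

# Hypothetical sketch: read the package names out of the PostgreSQL database
# populated by the RubyGems dump script. The database/table/column names are
# assumptions, not taken from the actual dump schema.
with psycopg2.connect(dbname='rubygems') as db:
    with db.cursor() as cur:
        cur.execute('SELECT name FROM rubygems ORDER BY name')
        package_names = [row[0] for row in cur.fetchall()]

print(len(package_names), 'packages found in the dump')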
Jun 12 2019
In D1561#35329, @vlorentz wrote: Output of git remote -v?
I am not able to push this change.
Jun 11 2019
- rebased the branch on master
In D1482#35196, @ardumont wrote: If you do need to rebase, update the diff nonetheless (prior to push) so that Phabricator sees the commits and closes the diff itself.
- Cleaned up api_response.json, file_structure.json
Jun 9 2019
- Checked for typos and added a comment for the tarball variable
Jun 8 2019
@douardda Can you please take a look at it and confirm that I can merge it?
I ran the lister in Docker to test it; it ran without any problems.
(The content of tcran.py is the same 4 lines that are present in the README of the lister.)
(swh) archit@work-pc:~/swh-environment/swh-lister$ python tcran.py
DEBUG:swh.lister.core.lister_base:Loading config from lister_cran
INFO:swh.core.config:Loading config file /home/archit/.config/swh/lister_cran.yml
DEBUG:swh.lister.core.lister_base:<swh.lister.cran.lister.CRANLister object at 0x7f03ea55c518> CONFIG={'content_size_limit': 104857600, 'log_db': 'dbname=softwareheritage-log', 'storage': {'cls': 'remote', 'args': {'url': 'http://localhost:5002/'}}, 'scheduler': {'cls': 'remote', 'args': {'url': 'http://localhost:5008/'}}, 'lister': {'cls': 'local', 'args': {'db': 'postgresql:///lister-cran'}}, 'credentials': [], 'cache_responses': True, 'cache_dir': '/home/archit/.cache/swh/lister/cran/'}
DEBUG:root:models: 10000
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5002
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:5008
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1
DEBUG:root:models: 4372
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5002 "POST /origin/add_multi HTTP/1.1" 200 1
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool:http://localhost:5008 "POST /create_tasks HTTP/1.1" 200 1
Here is one of the tasks that was created by the lister:
https://forge.softwareheritage.org/P424
- squashed git commits
- rebased the branch on master
- removed methods that were not in use but were written because of abstractattribute
- Ran the lister in docker
I tested the lister with the new changes in the Docker container; it worked fine. Here is one of the loader tasks it created.
- Added a class variable instance to store tarballs and use it in task_dict()
Jun 7 2019
One way to avoid including tarballs in the model is to make a class variable named tarballs (like LISTER_NAME or TREE_URL), which would contain all the tarballs of each package and could be accessed from the task_dict() function.
This would eliminate the need to add tarballs to the model.
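A minimal sketch of that idea (illustrative only, not the actual swh.lister.gnu code; only the args/kwargs task layout mirrors the scheduler task shown elsewhere in this thread):

# Illustrative sketch: keep tarball metadata in a class-level attribute,
# filled during listing, so that task_dict() can read it instead of the
# tarballs being stored in the model.
class GNUListerSketch:
    LISTER_NAME = 'gnu'
    tarballs = {}  # package name -> list of {'archive': ..., 'date': ...} dicts

    def register_package(self, name, archives):
        # called while walking the GNU tree listing
        self.tarballs[name] = archives

    def task_dict(self, origin_type, origin_url, **kwargs):
        name = kwargs['name']
        return {
            'type': 'load-%s' % origin_type,
            'policy': 'recurring',
            'arguments': {
                'args': [name, origin_url],
                'kwargs': {'tarballs': self.tarballs.get(name, [])},
            },
        }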
@ardumont I checked it in Docker; it is working fine now.
Task 765
Next run: in 3 months (2019-09-05 11:18:21+00:00)
Interval: 90 days, 0:00:00
Type: load-gnu
Policy: recurring
Status: next_run_not_scheduled
Priority:
Args: 'libiconv' 'https://ftp.gnu.org/old-gnu/libiconv/'
Keyword args: tarballs: [{'date': '985114279', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.6.1.tar.gz'}, {'date': '1054061763', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.1.bin.woe32.zip'}, {'date': '1053376580', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.bin.woe32.zip'}, {'date': '1053376846', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.tar.gz'}]
- swh.lister.gnu: Add tarball column in model
I am not able to understand why this tarballs keyword arg is None; is there some error in the code?
For the problem mentioned on IRC, I replied to you there; here is my take on this:
20:05 <+ardumont> in general, don't only rely on documentation as this can go out of sync
20:05 <+ardumont> take a look also at the code
20:05 <+ardumont> for the docker-dev, that'd the docker-compose file
20:05 <+ardumont> archit_agrawal[m: ^
20:10 <archit_agrawal[m> ardumont: I will surely take a look at docker-compose file
20:12 <archit_agrawal[m> ardumont: as you told, I amended conf/lister.yml with gnu lister, now how shall I proceed further to sucessfully run the lister
20:28 <archit_agrawal[m> ardumont: Do I have to run the way mentioned in readme of swh-lister ?
21:45 <archit_agrawal[m> I am trying to run gnu lister in docker . I am getting ModuleNotFoundError: No module named 'psycopg2.errors' error, can anyone please help me.
21:45 <archit_agrawal[m> https://forge.softwareheritage.org/P419
21:45 -- Notice(swhbot): P419 (author: nahimilega): request 400 from scheduler <https://forge.softwareheritage.org/P419>
22:18 <kalpitk[m]> I think 'pip install psycopg2' inside virtual env will be enough
22:34 <archit_agrawal[m> kalpitk: It is already installed in virtual env
22:36 <+pinkieval> archit_agrawal[m: is the scheduler running in the venv?
22:38 <archit_agrawal[m> pinkieval: yes
22:39 <+pinkieval> can you paste its logs?
22:40 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P420 docker-compose ps outpur
22:40 -- Notice(swhbot): P420 (author: nahimilega): docker-compose ps output <https://forge.softwareheritage.org/P420>
22:42 <+pinkieval> if it's running in docker, then it's not running in the venv
22:42 <+pinkieval> and that's not its logs
22:43 <+pinkieval> "docker-compose logs swh-scheduler-api"
22:43 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P421
22:43 -- Notice(swhbot): P421 (author: nahimilega): scheduler api logs <https://forge.softwareheritage.org/P421>
22:44 <archit_agrawal[m> >and that's not its logs, I sent the previous message before I received this message
22:47 <+pinkieval> hmm, it has no issue referring to psycopg2
22:47 <archit_agrawal[m> pinkieval: >can you paste its logs? :I sent the previous message before I received this message
22:47 <+pinkieval> so the error is coming from the unpickling
22:48 <+pinkieval> python -c "import psycopg2.errors"
22:48 <+pinkieval> does this work?
22:49 <archit_agrawal[m> ModuleNotFoundError: No module named 'psycopg2.errors'
22:49 <archit_agrawal[m> No
----
09:38 <+ardumont> archit_agrawal[m: pinkieval: there might be 2 errors involved, one triggering the other
09:39 <+ardumont> the first one being there is probably no scheduler task-type gnu-lister referenced in the scheduler
09:39 <+ardumont> thus, when the lister asks for creating that kind of task, it's not happy about it
09:39 <+ardumont> and then the error we see here about psycopg2.error module not found
09:43 <+ardumont> archit_agrawal[m: prior to triggering your gnu lister task in your docker-env, you need to add the associated task-type
09:43 <+ardumont> swh scheduler task-type add --help
Jun 6 2019
- Fixed a typo with time_updated
In D1482#34783, @ardumont wrote: To be clear, I'm fine with the diff now.
Like @douardda mentioned for the CRAN lister, I'm waiting for confirmation that the code runs sensibly well in the Docker environment as well.
Thanks @ardumont. As you mentioned, I did follow those steps to run it in Docker, but something went wrong with the Docker container and I had to reinstall Docker entirely on my PC, so I have not been able to test this lister yet. I think I will fix the Docker issues on my PC by the end of the day, and then I can try to run this.
Jun 5 2019
Maybe you could update D1441 with such information, what do you think?
Added an important note in lister tutorial
In D1441#34698, @vlorentz wrote: In D1441#34697, @swh-public-ci wrote: Build has FAILED
You need to rebase your change on master
Wrapped the line within 80 characters
On further investigation, I found that there are data dumps provided on rubygems.org:
https://rubygems.org/pages/data
This could be used to get the list of all the packages.
Amending the conf/lister.yml file to add the entries:
celery:
  task_broker: amqp://guest:guest@amqp//
  task_modules:
    ...
    - swh.lister.gnu.tasks
  task_queues:
    ...
    - swh.lister.gnu.tasks.GNUListerTask
Thanks @ardumont. This part is not present in any documentation. I guess we can add a section on how to run a lister in the Docker environment under the lister tutorial section.
Jun 4 2019
OK. Out of curiosity, would you be able to look at the location of dists instead? Just to get a sense of how much overlap there is with archives from GitHub.
- Changed list_of_tarballs to tarballs.
We have provisions in the scheduler API to deduplicate tasks, so that creating a new git loading task for an existing origin wouldn't duplicate the task but rather just reference the existing one.
In T1776#32892, @olasd wrote: In T1776#32739, @nahimilega wrote: There is a total of 220570 packages.
I made a short script to analyse the VCS and code hosting platforms for the packages; here is the result.
Packages iterated - 15126 (~6% of total packages)
VCS found - git, hg
Number of packages that were not hosted on GitHub, Bitbucket, or GitLab: 51
Extrapolating proportionally from the above data (51 / 15126 × 220570 ≈ 744), around 744 packages in total are expected not to be hosted on GitHub, Bitbucket, or GitLab.
In your analysis, did you look at the source key (which seems to represent the upstream version control repository) or the dist key (which points at the tarball/zipfile that's actually downloaded by the package manager when installing that version of the package)?
In D1482#34418, @ardumont wrote: In D1482#34411, @nahimilega wrote: In the mean time, have you tried running this through the swh-docker-dev environment already?
I haven't tried it in the swh-docker-dev environment yet; I am still struggling with some overrides to make the new lister show up in the task list. I will still try to figure out how to run this.
If that's blocking, do not hesitate to ask questions on IRC in that regard.
- Arranged conftest.py in core and README in alphabetical order
Jun 3 2019
One of the biggest challenges in the implementation of this lister is to deduplicate the links it lists, as most of the packages are hosted on GitHub, Bitbucket, or GitLab, and a large portion of those sites is already listed and ingested.
Hence, while listing all the packages on Packagist, we need to filter out the packages that are already listed by another lister.
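A minimal sketch of such a filter, assuming the upstream repository URL is already known for each package (the forge list and URL handling here are illustrative, not the lister's actual code):

from urllib.parse import urlparse

# Forges assumed to be covered already by other listers (illustrative list).
ALREADY_LISTED_FORGES = {'github.com', 'gitlab.com', 'bitbucket.org'}

def needs_listing(repository_url):
    """Return True if the package's repository is not already covered elsewhere."""
    host = urlparse(repository_url).netloc.lower()
    host = host[4:] if host.startswith('www.') else host  # drop a 'www.' prefix
    return host not in ALREADY_LISTED_FORGES

print(needs_listing('https://github.com/symfony/symfony'))   # False: already covered
print(needs_listing('https://git.example.org/foo/bar.git'))  # True: needs listing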
- Arranged conftest.py in core in alphabetical order
- Change docstring and made conftest.py in core ordered alphabetically
- swh.lister.gnu: Add function to check for file extension.
- swh.lister.gnu: Add function to check for file extension.
Jun 2 2019
I renamed the issue title to "Maven Central repository Lister" if the intent is to focus on this repository https://maven.apache.org/repository/index.html
May 31 2019
Creating Your Own Mirror
The size of the central repository is increasing steadily. To save us bandwidth and you time, mirroring the entire central repository is not allowed. (Doing so will get you automatically banned.) Instead, we suggest you set up a repository manager as a proxy.
It is not recommended that you scrape or rsync:// a full copy of central as there is a large amount of data there and doing so will get you banned. You can use a program such as those described on the Repository Management page to run your internal repository's server, download from the internet as required, and then hold the artifacts in your internal repository for faster downloading later.
In my view, the best method to check for a tarball would be to split the filename on "." and check whether the token between the last and second-to-last "." is "tar". If it is "tar", the file is useful; otherwise the file does not contain source code.
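A minimal sketch of that check (illustrative, not the actual swh.lister.gnu implementation):

def looks_like_tarball(filename):
    # Split on '.' and look at the token before the last one,
    # e.g. 'libiconv-1.6.1.tar.gz' -> ['libiconv-1', '6', '1', 'tar', 'gz'].
    parts = filename.split('.')
    return len(parts) >= 3 and parts[-2] == 'tar'

assert looks_like_tarball('libiconv-1.6.1.tar.gz')
assert not looks_like_tarball('libiconv-1.9.1.bin.woe32.zip')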
There is a total of 220570 packages.
I made a short script to analyse the VCS and code hosting platforms for the packages; here is the result.
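The script itself was not posted here; a rough sketch of what it could look like follows. The Packagist endpoints and field names are assumptions based on the public Packagist API and may need adjustment, and it approximates the analysis by looking at the repository host for a sample of packages.

from collections import Counter
from urllib.parse import urlparse

import requests

KNOWN_FORGES = {'github.com', 'bitbucket.org', 'gitlab.com'}
SAMPLE_SIZE = 1000  # the real run iterated over ~15000 packages

# Assumed endpoint: full list of package names on Packagist.
names = requests.get('https://packagist.org/packages/list.json').json()['packageNames']

hosts, outside = Counter(), 0
for name in names[:SAMPLE_SIZE]:
    # Assumed endpoint: per-package metadata, including its repository URL.
    meta = requests.get('https://packagist.org/packages/%s.json' % name).json()
    repo_url = meta.get('package', {}).get('repository') or ''
    host = urlparse(repo_url).netloc.lower()
    hosts[host] += 1
    if host and host not in KNOWN_FORGES:
        outside += 1

print('packages sampled:', SAMPLE_SIZE)
print('not hosted on a known forge:', outside)
print(hosts.most_common(10))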
May 30 2019
- Add instance variable and rebased it on latest master
Here are the extensions which have tar in their name
Here is the list of all the different extensions that are present on the GNU website, with a link to one example for each. The only way I found to discover the extensions is the approach mentioned here: https://stackoverflow.com/a/35188296/10424705 . However, as GNU uses "." in many filenames to denote version numbers, there is no good way to uniquely identify all the extensions. Although I have optimised the approach to reduce redundancy, you may still find some extensions appearing more than once.
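To illustrate why version numbers get in the way, here is a naive version of such an extension collector (illustrative only, not the script actually used):

from collections import Counter

def naive_extension(filename):
    # Treat whatever follows the last '.' as the extension; version
    # components such as '1.6.1' make this heuristic ambiguous.
    return filename.rsplit('.', 1)[-1] if '.' in filename else '(none)'

sample = ['libiconv-1.6.1.tar.gz', 'libiconv-1.9.bin.woe32.zip', 'README']
print(Counter(naive_extension(f) for f in sample))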
Can you please:
- try to reduce the samples, though. There is too much data (painful to review and to maintain).
Here I am assuming that "samples" means the JSON content in the api_response.json file. Am I correct?
May 29 2019
- Change "enable tox" to "register celery task"
- Add instance parameter according to unified credentials structure between listers.
- Add test cases for find_tarball and remove_unnecessary_directories function
- Add instance parameter according to unified credentials structure between listers
@zack I agree that archiving https://www.x.org/releases/individual/ is virtually not required because it is a git repo. However, I was concerned about archiving tarballs of other projects which are only present on https://www.x.org/releases/, like x.org/releases/X11R6.8.0/.
However, as you mention about
In my view, we can use the best of both options to build the lister.
We can use the bare API to list the projects and then use launchpadlib to get all the branches of a project.
In this way, we could combine the indexing quality of the bare API with the simplicity of launchpadlib.
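A very rough sketch of the launchpadlib half of that combination is below. The call names (login_anonymously, projects, getBranches, web_link) are my assumptions based on the Launchpad API documentation and need to be verified, and the project names here stand in for the list that the bare API would provide.

from launchpadlib.launchpad import Launchpad

# Hypothetical sketch: anonymous read-only access to the production Launchpad API.
lp = Launchpad.login_anonymously('swh-lister-sketch', 'production', version='devel')

# Placeholder: in the real lister this list would come from the bare API.
project_names = ['inkscape']

for name in project_names:
    project = lp.projects[name]
    # getBranches() is assumed to be exported on project entries.
    for branch in project.getBranches():
        print(name, branch.web_link)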