Page MenuHomeSoftware Heritage

GNU Lister
ClosedPublic

Authored by nahimilega on May 17 2019, 12:30 PM.

Details

Summary

It is the implementation of GNU Lister, this lister download the tree.json.gz file from https://ftp.gnu.org/tree.json.gz, reades its json content and returns the origin of repos py parsing over the json data.

Related T1722

Diff Detail

Repository
rDLS Listers
Branch
gnu (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 5950
Build 8156: tox-on-jenkinsJenkins
Build 8155: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ardumont added inline comments.Jun 4 2019, 3:00 PM
swh/lister/gnu/lister.py
194

docstring update to:
"Keep only gnu and old-gnu folders from JSON."

In the mean time, have you tried running this through the swh-docker-dev environment already?

I haven't tried it on swh-docker-dev environment yet, I am still struggling with some overrides to make the now lister show up in the task list. Although I will still try to figure out, the method to run this.

If that's blocking, do not hesitate to ask question in irc in that regards.

Other can help in that matter.
And that can help us disambiguate stuff in documentation if there remains some black spot there.
In the end, everybody wins ;)

nahimilega added inline comments.Jun 4 2019, 3:03 PM
swh/lister/gnu/lister.py
146

Yes, they are necessary, they are mentioned as abc.abstract attribute in the simple lister class

ardumont added inline comments.Jun 4 2019, 3:03 PM
swh/lister/gnu/lister.py
34

tarballs here as well.
That ends up in the future loader part.

In the mean time, have you tried running this through the swh-docker-dev environment already?

I haven't tried it on swh-docker-dev environment yet, I am still struggling with some overrides to make the now lister show up in the task list. Although I will still try to figure out, the method to run this.

If that's blocking, do not hesitate to ask question in irc in that regards.

Ok @ardumont I will surely do it.

nahimilega updated this revision to Diff 5062.Jun 4 2019, 3:27 PM
nahimilega marked 7 inline comments as done.
  • Chnaged list_of_tarballs to tarballs.
ardumont added a comment.EditedJun 5 2019, 10:26 AM

In the mean time, have you tried running this through the swh-docker-dev environment already?

I haven't tried it on swh-docker-dev environment yet, I am still struggling with some overrides to make the now lister show up in the task list. Although I will still try to figure out, the method to run this.

If that's blocking, do not hesitate to ask question in irc in that regards.

Ok @ardumont I will surely do it.

For the docker setup, P414 should be enough to run the gnu lister (providing your repository is in the branch with the gnu lister code).
For the lister setup, you'd then need to add in the lister configuration the new task_modules and task_queues entry for the gnu lister.

Amending the conf/lister.yml file to add the entries:

celery:
  task_broker: amqp://guest:guest@amqp//
  task_modules:
    ...
    - swh.lister.gnu.tasks
  task_queues:
    ...
    - swh.lister.gnu.tasks.GNUListerTask

Ameding the conf/lister.yml file to add the entries:

celery:
  task_broker: amqp://guest:guest@amqp//
  task_modules:
    ...
    - swh.lister.gnu.tasks
  task_queues:
    ...
    - swh.lister.gnu.tasks.GNUListerTask

Thanks @ardumont. This part is not present in any documentation. I guess we can add a section on how to run lister in docker environmnet under lister tutorial section.

ardumont added a comment.EditedJun 5 2019, 4:21 PM

Thanks @ardumont. This part is not present in any documentation. I guess we can add a section on how to run lister in docker environmnet under lister tutorial section.

It's not.

Indeed, adding a tutorial on how to add a new lister is a good idea.

I'd:

  • include how to mount your local development modifications (docker-compose.override.yml demo-ed earlier).
  • titled it "How to run a *new* lister" for instance.

Maybe you could update D1441 with such information, what do you think?

Maybe you could update D1441 with such information, what do you think?

It would be a good idea

To be clear, i'm fine with the diff now.
Like @douardda mentioned in the cran lister, i'm waiting for a confirmation that the code runs sensibly well in the docker env as well.

Cheers,

To be clear, i'm fine with the diff now.
Like @douardda mentioned in the cran lister, i'm waiting for a confirmation that the code runs sensibly well in the docker env as well.

Thanks @ardumont, as you mentioned I did follow those steps to run it in docker, although something went wrong with the docker container and I have to reinstall whole of the docker in my pc, hence I was not able to test this lister yet, I think I will fix the docker issues in my pc by the end of the day, and then I can try to run this.

nahimilega updated this revision to Diff 5124.Jun 6 2019, 9:53 PM
  • Fixed a typo with time_updated
ardumont added a comment.EditedJun 7 2019, 10:38 AM

To be clear, i'm fine with the diff now.
Like @douardda mentioned in the cran lister, i'm waiting for a confirmation that the code runs sensibly well in the docker env as well.

Thanks @ardumont, as you mentioned I did follow those steps to run it in docker, although something went wrong with the docker container and I have to reinstall whole of the docker in my pc, hence I was not able to test this lister yet, I think I will fix the docker issues in my pc by the end of the day, and then I can try to run this.

Sure thing.
Thanks for the heads up.

For the problem mentionned in irc, i replied to you there, here is my take on this:

20:05 <+ardumont> in general, don't only rely on documentation as this can go out of sync
20:05 <+ardumont> take a look also at the code
20:05 <+ardumont> for the docker-dev, that'd the docker-compose file
20:05 <+ardumont> archit_agrawal[m: ^
20:10 <archit_agrawal[m> ardumont: I will surely take a look at docker-compose file
20:12 <archit_agrawal[m> ardumont: as you told, I amended conf/lister.yml with gnu lister, now how shall I proceed further to sucessfully run the lister
20:28 <archit_agrawal[m> ardumont:  Do I have to run the way mentioned in readme of swh-lister ?
21:45 <archit_agrawal[m> I am trying to run gnu lister in docker . I am getting ModuleNotFoundError: No module named 'psycopg2.errors' error, can anyone please help me.
21:45 <archit_agrawal[m> https://forge.softwareheritage.org/P419
21:45 -- Notice(swhbot): P419 (author: nahimilega): request 400 from scheduler <https://forge.softwareheritage.org/P419>
22:18 <kalpitk[m]> I think 'pip install psycopg2' inside virtual env will be enough
22:34 <archit_agrawal[m> kalpitk: It is already installed in virtual env 
22:36 <+pinkieval> archit_agrawal[m: is the scheduler running in the venv?
22:38 <archit_agrawal[m> pinkieval: yes
22:39 <+pinkieval> can you paste its logs?
22:40 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P420   docker-compose ps outpur
22:40 -- Notice(swhbot): P420 (author: nahimilega): docker-compose ps output <https://forge.softwareheritage.org/P420>
22:42 <+pinkieval> if it's running in docker, then it's not running in the venv
22:42 <+pinkieval> and that's not its logs
22:43 <+pinkieval> "docker-compose logs swh-scheduler-api"
22:43 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P421
22:43 -- Notice(swhbot): P421 (author: nahimilega): scheduler api logs <https://forge.softwareheritage.org/P421>
22:44 <archit_agrawal[m> >and that's not its logs, I sent the previous message before I received this message
22:47 <+pinkieval> hmm, it has no issue referring to psycopg2
22:47 <archit_agrawal[m> pinkieval:  >can you paste its logs?  :I sent the previous message before I received this message
22:47 <+pinkieval> so the error is coming from the unpickling
22:48 <+pinkieval> python -c "import psycopg2.errors"
22:48 <+pinkieval> does this work?
22:49 <archit_agrawal[m> ModuleNotFoundError: No module named 'psycopg2.errors'
22:49 <archit_agrawal[m> No
----
09:38 <+ardumont> archit_agrawal[m: pinkieval: there might be 2 errors involved, one triggering the other
09:39 <+ardumont> the first one being there is probably no scheduler task-type gnu-lister referenced in the scheduler
09:39 <+ardumont> thus, when the lister asks for creating that kind of task, it's not happy about it
09:39 <+ardumont> and then the error we see here about psycopg2.error module not found
09:43 <+ardumont> archit_agrawal[m: prior to triggering your gnu lister task in your docker-env, you need to add the associated task-type
09:43 <+ardumont> swh scheduler task-type add --help

For the problem mentionned in irc, i replied to you there, here is my take on this:

20:05 <+ardumont> in general, don't only rely on documentation as this can go out of sync
20:05 <+ardumont> take a look also at the code
20:05 <+ardumont> for the docker-dev, that'd the docker-compose file
20:05 <+ardumont> archit_agrawal[m: ^
20:10 <archit_agrawal[m> ardumont: I will surely take a look at docker-compose file
20:12 <archit_agrawal[m> ardumont: as you told, I amended conf/lister.yml with gnu lister, now how shall I proceed further to sucessfully run the lister
20:28 <archit_agrawal[m> ardumont:  Do I have to run the way mentioned in readme of swh-lister ?
21:45 <archit_agrawal[m> I am trying to run gnu lister in docker . I am getting ModuleNotFoundError: No module named 'psycopg2.errors' error, can anyone please help me.
21:45 <archit_agrawal[m> https://forge.softwareheritage.org/P419
21:45 -- Notice(swhbot): P419 (author: nahimilega): request 400 from scheduler <https://forge.softwareheritage.org/P419>
22:18 <kalpitk[m]> I think 'pip install psycopg2' inside virtual env will be enough
22:34 <archit_agrawal[m> kalpitk: It is already installed in virtual env 
22:36 <+pinkieval> archit_agrawal[m: is the scheduler running in the venv?
22:38 <archit_agrawal[m> pinkieval: yes
22:39 <+pinkieval> can you paste its logs?
22:40 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P420   docker-compose ps outpur
22:40 -- Notice(swhbot): P420 (author: nahimilega): docker-compose ps output <https://forge.softwareheritage.org/P420>
22:42 <+pinkieval> if it's running in docker, then it's not running in the venv
22:42 <+pinkieval> and that's not its logs
22:43 <+pinkieval> "docker-compose logs swh-scheduler-api"
22:43 <archit_agrawal[m> pinkieval: https://forge.softwareheritage.org/P421
22:43 -- Notice(swhbot): P421 (author: nahimilega): scheduler api logs <https://forge.softwareheritage.org/P421>
22:44 <archit_agrawal[m> >and that's not its logs, I sent the previous message before I received this message
22:47 <+pinkieval> hmm, it has no issue referring to psycopg2
22:47 <archit_agrawal[m> pinkieval:  >can you paste its logs?  :I sent the previous message before I received this message
22:47 <+pinkieval> so the error is coming from the unpickling
22:48 <+pinkieval> python -c "import psycopg2.errors"
22:48 <+pinkieval> does this work?
22:49 <archit_agrawal[m> ModuleNotFoundError: No module named 'psycopg2.errors'
22:49 <archit_agrawal[m> No
----
09:38 <+ardumont> archit_agrawal[m: pinkieval: there might be 2 errors involved, one triggering the other
09:39 <+ardumont> the first one being there is probably no scheduler task-type gnu-lister referenced in the scheduler
09:39 <+ardumont> thus, when the lister asks for creating that kind of task, it's not happy about it
09:39 <+ardumont> and then the error we see here about psycopg2.error module not found
09:43 <+ardumont> archit_agrawal[m: prior to triggering your gnu lister task in your docker-env, you need to add the associated task-type
09:43 <+ardumont> swh scheduler task-type add --help

@ardumont Thanks for your help, I ran the lister in docker, and it created scheduler task as

Task 51
  Next run: in 3 months (2019-09-05 09:46:26+00:00)
  Interval: 90 days, 0:00:00
  Type: load-gnu
  Policy: recurring
  Status: next_run_not_scheduled
  Priority: 
  Args:
    'apl'
    'https://ftp.gnu.org/gnu/apl/'
  Keyword args:
    tarballs: None

I am not able to get why this tarballs Keyword args: is none, it there some error in the code?

I am not able to get why this tarballs Keyword args: is none, it there some error in the code?

I got the error

nahimilega updated this revision to Diff 5129.Jun 7 2019, 1:28 PM
  • swh.lister.gnu: Add tarball column in model

@ardumont I checked it in docker, now it is working fine.

Task 765
  Next run: in 3 months (2019-09-05 11:18:21+00:00)
  Interval: 90 days, 0:00:00
  Type: load-gnu
  Policy: recurring
  Status: next_run_not_scheduled
  Priority: 
  Args:
    'libiconv'
    'https://ftp.gnu.org/old-gnu/libiconv/'
  Keyword args:
    tarballs: [{'date': '985114279', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.6.1.tar.gz'}, {'date': '1054061763', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.1.bin.woe32.zip'}, {'date': '1053376580', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.bin.woe32.zip'}, {'date': '1053376846', 'archive': 'https://ftp.gnu.org/old-gnu/libiconv/libiconv-1.9.tar.gz'}]
ardumont added inline comments.Jun 7 2019, 1:47 PM
swh/lister/gnu/lister.py
34

?

125

Please, remove the print statement

133

Are you sure you need to do that?

swh/lister/gnu/models.py
19

Why do you need to add that information here?

@ardumont I checked it in docker, now it is working fine.

Nice work on making it work!

Just a couple of questions, see before this comment.

nahimilega added inline comments.Jun 7 2019, 1:53 PM
swh/lister/gnu/lister.py
133

There is no datatype to store a list of dictionary in Postgres, hence either we make a separate table for this, or as it is a list of dictionary ,we can convert it into json string and store it in string datatype in database, and convert it back to list of dictionary when needed. I used the JSON method over other database as it is easy to write and understand.

nahimilega added inline comments.Jun 7 2019, 1:57 PM
swh/lister/gnu/models.py
19

As it will be a now column in the database hence I guess we need to add it in the model.
And without adding list of tarball in the model of the package I will not be able to access it in the task_dict() function.

nahimilega marked an inline comment as not done.Jun 7 2019, 1:59 PM
nahimilega marked an inline comment as not done.
nahimilega added inline comments.Jun 7 2019, 2:04 PM
swh/lister/gnu/lister.py
34

The list of tarballs is stored as the json string format in the model and needs to be converted back to list of dictionary before using it to create a loading task

nahimilega updated this revision to Diff 5133.Jun 7 2019, 2:11 PM
  • removed print statements
nahimilega marked an inline comment as done.Jun 7 2019, 2:15 PM
ardumont added inline comments.Jun 7 2019, 2:42 PM
swh/lister/gnu/lister.py
34

But why do we need the list of tarballs in the model?

I'd expect we compute it and pass it along to the loader.
But we do not need to keep it in the lister's model.

nahimilega added inline comments.Jun 7 2019, 2:53 PM
swh/lister/gnu/lister.py
34

Please correct me if I am wrong somewhere -
To pass the list of tarballs to the loader we need to pass it in the loading task created by the.

To pass the list of tarballs in the loading task we need it so that it can be accessed in the task_dict() function.

To access the list of tarballs in the task_dict() we need to ensure it is present in the model.

ardumont added inline comments.Jun 7 2019, 3:02 PM
swh/lister/gnu/lister.py
34

I'd expect we compute it and pass it along to the loader.

I mean that through the scheduler (so that's what the task_dict method here is all about).

34

Please correct me if I am wrong somewhere -
To pass the list of tarballs to the loader we need to pass it in the loading task created by the.

I'm missing the end of the sentence.

To pass the list of tarballs in the loading task we need it so that it can be accessed in the task_dict() function.
To access the list of tarballs in the task_dict() we need to ensure it is present in the model.

I'm not sure yet. I'd need to dig in the code to make sure.

All I know is that it should not be a requisite to store the computed data (needed for the loader) in the lister model.
So if the actual code requires it, it feels wrong to me.

nahimilega added inline comments.Jun 7 2019, 3:13 PM
swh/lister/gnu/lister.py
34

Please correct me if I am wrong somewhere -
To pass the list of tarballs to the loader we need to pass it in the loading task created by the.

I'm missing the end of the sentence.

Sorry for that, it is "To pass the list of tarballs to the loader we need to pass it in the loading task created."

I'm not sure yet. I'd need to dig in the code to make sure.
All I know is that it should not be a requisite to store the computed data (needed for the loader) in the lister model.

I tried to do what I thought would work fine, and I am not able to come up with another way to send the tarballs of loader. Can you please look up suggest some better way.

nahimilega added inline comments.Jun 7 2019, 3:23 PM
swh/lister/gnu/lister.py
34

One way to avoid including tarballs in model is to make a variable instance of class named tarballs (like LISTER_NAME or TREE_URL), which would countain all the tarballs of each package and can be accessed from task_dict() function
Hence will eliminate the need of adding tarballs in model.

ardumont added inline comments.Jun 7 2019, 4:08 PM
swh/lister/gnu/lister.py
146

You should remove those decorators from the base class.
It's not used here and makes the code immediatly harder to read and will make it harder to maintain later.

You should then remove those unused definitions (that might also simplify the cran lister diff).

ardumont requested changes to this revision.Jun 7 2019, 5:43 PM
ardumont added inline comments.
swh/lister/gnu/lister.py
133

So checking the base code of the simple lister class (ingest_data).
The tarball key should be present at the moment of the task_dict creation for the scheduler.
So i stand by my initial statement, we should not have to transform back and forth in json the tarball dict.
It should stay as is and be out of the lister model.

222

exten`s`ion

This revision now requires changes to proceed.Jun 7 2019, 5:43 PM

Related P422

Status so far: If 'tarballs' removed from model, this explodes.

my take on this:

  • 'tarballs' are needed for the loader, so it should be sent as the scheduler task for the loader to be able to do its job
  • 'tarballs' should not be referenced in the lister as it's not used for anything lister related
  • actual code (in base lister class) needs it though to actually work

One way to avoid including tarballs in model is to make a variable instance of class named tarballs (like LISTER_NAME or TREE_URL), which would countain all the tarballs of each package and can be accessed from task_dict() function
Hence will eliminate the need of adding tarballs in model.

ardumont added a comment.EditedJun 8 2019, 8:10 AM

One way to avoid including tarballs in model is to make a variable instance of class named tarballs (like LISTER_NAME or TREE_URL), which would countain all the tarballs of each package and can be accessed from task_dict() function
Hence will eliminate the need of adding tarballs in model.

Yes, please go that way.

I'm not so keen on that solution because i prefer the code being stateless as much as possible (in that context, that means letting state pass through method/function parameters instead of relying on state variables to do neat tricks).
But here, now:

  • we do not really have a choice (without possibly a big refactoring, and, not right now)
  • we do that elsewhere anyhow (loaders come to mind, especially loader-core IIRC)
  • i don't have a better solution (i mean come to think of it, the gnu lister model's constructor override proposal is less explicit than this solution ;)

Cheers,

nahimilega added inline comments.Jun 8 2019, 1:45 PM
swh/lister/gnu/lister.py
146

I created D1566 to address this issue

nahimilega updated this revision to Diff 5158.Jun 8 2019, 7:07 PM
  • Added a class variable instance to store tarballs and use it in task_dict()
nahimilega marked 6 inline comments as done.Jun 8 2019, 7:17 PM
nahimilega marked 2 inline comments as done.
nahimilega updated this revision to Diff 5159.Jun 8 2019, 7:25 PM
-Added a class variable instance to store tarballs and use it in task_dict()

I tested the lister with new changes in the docker container, it worked fine. Here is one of the loader task it created.

Task 15940
  Next run: seconds ago (2019-06-08 18:02:07+00:00)
  Interval: 90 days, 0:00:00
  Type: load-gnu
  Policy: recurring
  Status: next_run_scheduled
  Priority: 
  Args:
    'java2html'
    'https://ftp.gnu.org/old-gnu/java2html/'
  Keyword args:
    tarballs: [{'date': '944729610', 'archive': 'https://ftp.gnu.org/old-gnu/java2html/java2html-1.3.1.tar.gz'}, {'date': '947003574', 'archive': 'https://ftp.gnu.org/old-gnu/java2html/java2html-1.4.tar.gz'}, {'date': '953974733', 'archive': 'https://ftp.gnu.org/old-gnu/java2html/java2html-1.5.tar.gz'}, {'date': '977303005', 'archive': 'https://ftp.gnu.org/old-gnu/java2html/java2html-1.6.tar.gz'}, {'date': '979403803', 'archive': 'https://ftp.gnu.org/old-gnu/java2html/java2html-1.7.tar.gz'}]
nahimilega marked an inline comment as done.Jun 8 2019, 8:07 PM

Awesome.

Almost there.
Just some typos to fix, a docstring to add, and i think this is good to go.

Also, note that this is the full gnu lister.
We'll need some incremental one at some point.
Not a blocker for this diff though.

swh/lister/gnu/lister.py
25

Specify the variable's goal.

Something like, please improve as you see fit:

Dict of key the project name (/id/url?), value the associated list of tarballs to ingest from the gnu mirror
201

extension here as well.

Please can you check the typo is fixed everywhere?

nahimilega updated this revision to Diff 5162.Jun 9 2019, 12:24 PM
nahimilega marked 2 inline comments as done.
  • Checked for typo and added comment for tarball variable

So i'm mostly good with this.

Real awesome that you made it work with the docker-env, i'm looking forward for the update on D1441 with what you had to do.

Prior to merging this though, please try to clean up the test samples, keep them to a reasonable minimum (api_response.json, file_structure.json, etc...).
It's currently too noisy.

There is no need to keep all extra files (the ones which are filtered out in the end: .sig, .ogg, ogv, ...).
Some samples (2 or 3) of it are enough to explicit the filtering.

Cheers,

nahimilega updated this revision to Diff 5167.Jun 11 2019, 11:45 AM
  • Cleaned up api_response.json, file_structure.json
ardumont accepted this revision.Jun 11 2019, 12:03 PM

Thanks!

Please rebase on latest master (if needed) and land it.

This revision is now accepted and ready to land.Jun 11 2019, 12:03 PM

If you do need to rebase, update the diff nonetheless (prior to push) so that phabricator sees the commits and close the diff itself.

If you do need to rebase, update the diff nonetheless (prior to push) so that phabricator sees the commits and close the diff itself.

I have already rebased it on the latest master :)

This revision was automatically updated to reflect the committed changes.