Feed All Stories

Jan 26 2021

swh-public-ci added a comment to D4915: simulator: record visit metrics alongside scheduler metrics.

Build is green

Jan 26 2021, 1:39 PM
swh-public-ci added a comment to D4914: simulator: stop using the database as a cache for origin data.

Build is green

Jan 26 2021, 1:36 PM
swh-public-ci added a comment to D4912: grab_next_visits: don't re-schedule visits too fast.

Build is green

Jan 26 2021, 1:33 PM
vlorentz planned changes to D4928: [wip] add materialized view origins_to_schedule and use it in grab_next_visits..
Jan 26 2021, 1:33 PM
vlorentz updated the diff for D4928: [wip] add materialized view origins_to_schedule and use it in grab_next_visits..

rebase

Jan 26 2021, 1:33 PM
vlorentz updated the diff for D4917: simulator: stop validating the scheduling policy in the CLI.

rebase

Jan 26 2021, 1:33 PM
vlorentz updated the diff for D4916: Run simulator tests on all known scheduling policies.

rebase

Jan 26 2021, 1:33 PM
vlorentz updated the diff for D4915: simulator: record visit metrics alongside scheduler metrics.

rebase

Jan 26 2021, 1:33 PM
vlorentz updated the diff for D4914: simulator: stop using the database as a cache for origin data.
  • rebase (including documentation of RAM usage)
  • document _visit_times
  • rename get_current_snapshot to get_current_snapshot_id
Jan 26 2021, 1:32 PM
swh-public-ci added a comment to D4911: Allow overriding the timestamp of grab_next_visits.

Build is green

Jan 26 2021, 1:31 PM
vlorentz added a comment to D4914: simulator: stop using the database as a cache for origin data.

Something I don't understand: why do you need to keep both _visit_times and latest_snapshots in "caches" when a snapshot is derived from this visit time (and visit type and origin)?

Jan 26 2021, 1:30 PM
swh-public-ci added a comment to D4910: Construct grab_next_visits query arguments incrementally.

Build is green

Jan 26 2021, 1:28 PM
swh-public-ci added a comment to D4909: simulator: add lister simulation.

Build is green

Jan 26 2021, 1:25 PM
vlorentz updated the diff for D4912: grab_next_visits: don't re-schedule visits too fast.

rebase

Jan 26 2021, 1:23 PM
vlorentz updated the diff for D4911: Allow overriding the timestamp of grab_next_visits.

rebase

Jan 26 2021, 1:23 PM
vlorentz updated the diff for D4910: Construct grab_next_visits query arguments incrementally.

rebase

Jan 26 2021, 1:23 PM
vlorentz updated the diff for D4909: simulator: add lister simulation.

add doc on the origin model

Jan 26 2021, 1:22 PM
ardumont added a comment to D4940: gitlab: Support authentication.

If I'm implementing this nonetheless, I'll go with the header (bearer) implementation.

Jan 26 2021, 12:59 PM
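
The header (bearer) approach mentioned in the D4940 comment above could look roughly like the sketch below. It assumes a GitLab access token passed in an OAuth-style `Authorization: Bearer` header; the helper name `gitlab_auth_headers` and the token value are hypothetical and are not taken from the diff.

```python
from typing import Dict, Optional


def gitlab_auth_headers(token: Optional[str]) -> Dict[str, str]:
    """Build request headers for GitLab API calls (hypothetical helper)."""
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: the instance accepts the access token as a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Anonymous listing sends no Authorization header at all.
anonymous = gitlab_auth_headers(None)
authenticated = gitlab_auth_headers("example-token")
```

Keeping the anonymous case header-free means the same code path works for public instances that need no credentials.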
ardumont changed the status of T2993: Deploy visit-stats journal client on production, a subtask of T2444: Implement the scheduling policy for the recurrent visit scheduler, from Open to Work in Progress.
Jan 26 2021, 12:55 PM · Sprint 2021 01, Scheduling utilities
ardumont changed the status of T2993: Deploy visit-stats journal client on production from Open to Work in Progress.
Jan 26 2021, 12:55 PM · System administration, Scheduling utilities
ardumont updated the task description for T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 12:55 PM · System administration, Scheduling utilities
ardumont updated the task description for T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 12:54 PM · System administration, Scheduling utilities
ardumont added a comment to T2993: Deploy visit-stats journal client on production.
  • stop scheduler runner to avoid blocking queries:
Jan 26 2021, 12:52 PM · System administration, Scheduling utilities
ardumont updated the task description for T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 12:49 PM · System administration, Scheduling utilities
ardumont closed D4946: Install scheduler journal client to saatchi.
Jan 26 2021, 12:46 PM
ardumont committed rSPSITEfc34778071b7: Install scheduler journal client to saatchi (authored by ardumont).
Install scheduler journal client to saatchi
Jan 26 2021, 12:46 PM
vsellier accepted D4946: Install scheduler journal client to saatchi.

LGTM

Jan 26 2021, 12:45 PM
DanSeraf requested review of D4947: scanner-benchmark: algo_min fixed, retry mechanism on request error.
Jan 26 2021, 12:43 PM
tenma added inline comments to D4925: debian: Reimplement lister using new Lister API.
Jan 26 2021, 12:31 PM
ardumont requested review of D4946: Install scheduler journal client to saatchi.
Jan 26 2021, 12:24 PM
ardumont added a revision to T2993: Deploy visit-stats journal client on production: D4946: Install scheduler journal client to saatchi.
Jan 26 2021, 12:24 PM · System administration, Scheduling utilities
ardumont updated the task description for T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 12:21 PM · System administration, Scheduling utilities
ardumont accepted D4945: cran: Reimplement lister using new Lister API.

some suggestions inline.

Jan 26 2021, 12:07 PM
anlambert added inline comments to D4925: debian: Reimplement lister using new Lister API.
Jan 26 2021, 12:04 PM
anlambert added inline comments to D4925: debian: Reimplement lister using new Lister API.
Jan 26 2021, 12:00 PM
vlorentz added a comment to T2625: create and publish xml schema for the specific swh-deposit metadata.

https://forge.softwareheritage.org/source/swh-deposit/browse/master/docs/specs/swh.xsd

Jan 26 2021, 12:00 PM · SWORD deposit, Scientific Community Building
tenma added a comment to D4925: debian: Reimplement lister using new Lister API.

In any case, bravo for the effort on this tough lister!

Jan 26 2021, 11:59 AM
moranegg added a comment to T2625: create and publish xml schema for the specific swh-deposit metadata.

A specification is enough to reference, rather than a .xsd file.

Jan 26 2021, 11:59 AM · SWORD deposit, Scientific Community Building
moranegg added a parent task for T2779: Put information (client, collection and deposit-id) inside metadata for metadata-only deposit: T2540: support the loading of metadata-only deposits in the metadata storage.
Jan 26 2021, 11:56 AM · Metadata workflow, SWORD deposit
moranegg added a subtask for T2540: support the loading of metadata-only deposits in the metadata storage: T2779: Put information (client, collection and deposit-id) inside metadata for metadata-only deposit.
Jan 26 2021, 11:56 AM · Roadmap 2020, SWORD deposit, Scientific Community Building
moranegg assigned T2942: Update deposit: `swhid` MUST exist in archive for a metadata-only deposit to vlorentz.
Jan 26 2021, 11:54 AM · SWORD deposit
moranegg moved T2937: Provide credentials to InvenioRDM for staging from Backlog to Deployed on the SWORD deposit board.
Jan 26 2021, 11:52 AM · System administration, Roadmap 2020, SWORD deposit, Scientific Community Building
moranegg closed T2937: Provide credentials to InvenioRDM for staging, a subtask of T2344: Build a connector for software deposit via Zenodo/InvenioRDM, as Resolved.
Jan 26 2021, 11:52 AM · meta-task, Roadmap 2022, Roadmap 2020, SWORD deposit, Scientific Community Building
moranegg closed T2937: Provide credentials to InvenioRDM for staging as Resolved.
Jan 26 2021, 11:51 AM · System administration, Roadmap 2020, SWORD deposit, Scientific Community Building
moranegg triaged T2997: Test metadata-only deposit with cli and via SWORD as Normal priority.
Jan 26 2021, 11:50 AM · SWORD deposit
moranegg triaged T2996: Add possibility to fetch a list of deposits on the deposit cli as Normal priority.
Jan 26 2021, 11:43 AM · SWORD deposit
moranegg triaged T2995: Add flag `metadata-only` in deposit storage as Normal priority.
Jan 26 2021, 11:40 AM · SWORD deposit
moranegg changed the status of T2894: Restructure deposit documentation with a clearer strategy from Open to Work in Progress.
Jan 26 2021, 11:34 AM · SWORD deposit, Documentation
moranegg changed the status of T2894: Restructure deposit documentation with a clearer strategy, a subtask of T2624: Create strategy for documentation with a map or a full table of content, from Open to Work in Progress.
Jan 26 2021, 11:34 AM · Roadmap 2021, meta-task, Documentation
tenma added inline comments to D4925: debian: Reimplement lister using new Lister API.
Jan 26 2021, 11:31 AM
anlambert added a comment to T2994: Use keyset pagination in Gitlab lister.

On a related note, I found this list of Community-Hosted GitLab Instances. Most of them have public access and could be added to the set of Gitlab instances listed by Software Heritage.

Jan 26 2021, 11:19 AM · Origin-GitLab, Lister
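
For context on T2994, keyset pagination in the GitLab API is driven by the `Link` response header rather than by page numbers. The sketch below shows one possible way to extract the next page URL; the helper name `next_page_url` and the host `gitlab.example.org` are placeholders, and this is not the lister's actual code.

```python
import re
from typing import Optional

# Matches one `<url>; rel="name"` element of an HTTP Link header.
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL tagged rel="next", or None when there is no next page."""
    if not link_header:
        return None
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    return None


# First request of a keyset-paginated project listing (placeholder host).
first_page = (
    "https://gitlab.example.org/api/v4/projects"
    "?pagination=keyset&per_page=100&order_by=id&sort=asc"
)
```

A lister loop would request `first_page`, then keep following `next_page_url(response.headers.get("Link"))` until it returns `None`.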
anlambert triaged T2994: Use keyset pagination in Gitlab lister as Normal priority.
Jan 26 2021, 11:08 AM · Origin-GitLab, Lister
ardumont updated the task description for T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 11:04 AM · System administration, Scheduling utilities
ardumont added a comment to T2993: Deploy visit-stats journal client on production.
  • Stop the workers:
$ clush -b -w @swh-workers 'puppet agent --disable "Deploy new storage version"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
Jan 26 2021, 11:04 AM · System administration, Scheduling utilities
ardumont updated the task description for T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 10:55 AM · System administration, Scheduling utilities
vsellier added a comment to T2944: Deploy swh-search v0.4.1.

Updating the index configuration to speed up indexing:

% cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
% export ES_SERVER=192.168.100.81:9200
% export INDEX=origin
% curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @/tmp/config.json
{"acknowledged":true}%
Jan 26 2021, 10:31 AM · System administration, Journal, Archive search
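
The same settings update can be prepared from Python with only the standard library. This is a sketch of the curl call in the comment above, not a tool used in the deployment; the function name `build_settings_request` is hypothetical, and the request is built but deliberately not sent.

```python
import json
import urllib.request

# Settings copied from the comment above: sync the translog asynchronously
# and refresh less often while the index is being rebuilt.
BULK_INDEXING_SETTINGS = {
    "index": {
        "translog.sync_interval": "60s",
        "translog.durability": "async",
        "refresh_interval": "60s",
    }
}


def build_settings_request(
    server: str, index: str, settings: dict
) -> urllib.request.Request:
    """Prepare (but do not send) a PUT to the index _settings endpoint."""
    return urllib.request.Request(
        url=f"http://{server}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_settings_request(
    "192.168.100.81:9200", "origin", BULK_INDEXING_SETTINGS
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```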
ardumont updated the task description for T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 10:19 AM · System administration, Scheduling utilities
ardumont moved T2993: Deploy visit-stats journal client on production from Backlog to Weekly backlog on the System administration board.
Jan 26 2021, 10:02 AM · System administration, Scheduling utilities
ardumont updated subscribers of T2993: Deploy visit-stats journal client on production.
Jan 26 2021, 10:02 AM · System administration, Scheduling utilities
ardumont added a project to T2993: Deploy visit-stats journal client on production: System administration.
Jan 26 2021, 10:01 AM · System administration, Scheduling utilities
ardumont triaged T2993: Deploy visit-stats journal client on production as High priority.
Jan 26 2021, 10:00 AM · System administration, Scheduling utilities
ardumont closed T2984: Port cgit lister to the new Lister API, a subtask of T2442: Provide a unified API for listers to interact with the scheduler, as Resolved.
Jan 26 2021, 9:57 AM · Sprint 2021 01, Scheduling utilities
ardumont closed T2984: Port cgit lister to the new Lister API as Resolved.
Jan 26 2021, 9:57 AM · Lister, CGit lister, Sprint 2021 01
ardumont planned changes to D4936: deposit.cli: Warn users when missing origin tags are detected.
Jan 26 2021, 9:54 AM
ardumont planned changes to D4940: gitlab: Support authentication.
Jan 26 2021, 9:54 AM
ardumont added a comment to D4940: gitlab: Support authentication.

Bravo, I did not check. I trusted the old code.
It's doing what was done in the earlier version [1]
I'll check now.

Jan 26 2021, 9:53 AM
douardda added a comment to D4914: simulator: stop using the database as a cache for origin data.

And once again, this "cache" behavior makes the simulator unable to run "forever" (it will eat RAM). Maybe it's an assumed design choice, but please document it somewhere.

Jan 26 2021, 9:50 AM
douardda added a comment to D4914: simulator: stop using the database as a cache for origin data.

Something I don't understand: why do you need to keep both _visit_times and latest_snapshots in "caches" when a snapshot is derived from this visit time (and visit type and origin)?

Jan 26 2021, 9:48 AM
vsellier added a comment to T2944: Deploy swh-search v0.4.1.

Production

  • puppet disabled
  • Services stopped :
root@search1:~# systemctl stop swh-search-journal-client@objects.service 
root@search1:~# systemctl stop gunicorn-swh-search
  • Index deleted and recreated
% export ES_SERVER=search-esnode1.internal.softwareheritage.org:9200
% curl -s http://$ES_SERVER/_cat/indices\?v 
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1  151716299     38861934    260.8gb          131gb
% curl -XDELETE http://$ES_SERVER/origin
{"acknowledged":true}%    
% swh search --config-file /etc/softwareheritage/search/server.yml  initialize
INFO:elasticsearch:PUT http://search-esnode1.internal.softwareheritage.org:9200/origin [status:200 request:2.216s]
INFO:elasticsearch:PUT http://search-esnode3.internal.softwareheritage.org:9200/origin/_mapping [status:200 request:0.151s]
Done.
% curl -s http://$ES_SERVER/_cat/indices\?v                                        
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1          0            0     36.5kb         18.2kb
  • journal client's consumer group deleted:
% export SERVER=kafka1.internal.softwareheritage.org:9092  
% ./kafka-consumer-groups.sh --bootstrap-server ${SERVER} --delete --group swh.search.journal_client
Deletion of requested consumer groups ('swh.search.journal_client') was successful.
  • journal client restarted
  • puppet enabled
Jan 26 2021, 9:39 AM · System administration, Journal, Archive search
douardda added a comment to D4909: simulator: add lister simulation.

Isn't there some inherent limitation with this lister_process (gradually eating RAM) that should be documented (maybe)?

Jan 26 2021, 9:33 AM
douardda added a comment to D4909: simulator: add lister simulation.

Note that I still think there should be something in docs/simulator.rst also...

Jan 26 2021, 9:29 AM
douardda accepted D4909: simulator: add lister simulation.

We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.

Jan 26 2021, 9:27 AM
ardumont added a comment to D4940: gitlab: Support authentication.

I do not see any section regarding basic HTTP authentication for API requests in the Gitlab API doc. Are you sure it is working?

Jan 26 2021, 9:26 AM
vsellier added a comment to T2944: Deploy swh-search v0.4.1.

The filter on visited origins is working correctly on staging. The has_visit flag looks good.
For example for the https://www.npmjs.com/package/@ehmicky/dev-tasks origin

{
  "_index" : "origin",
  "_type" : "_doc",
  "_id" : "019bd314416108304165e82dd92e00bc9ea85a53",
  "_score" : 60.56421,
  "_source" : {
    "url" : "https://www.npmjs.com/package/@ehmicky/dev-tasks",
    "sha1" : "019bd314416108304165e82dd92e00bc9ea85a53"
  },
  "sort" : [
    60.56421,
    "019bd314416108304165e82dd92e00bc9ea85a53"
  ]
}
swh=> select * from origin join origin_visit_status on id=origin where id=469380;
   id   |                       url                        | origin | visit |             date              | status  | metadata |                  snapshot                  | type 
--------+--------------------------------------------------+--------+-------+-------------------------------+---------+----------+--------------------------------------------+------
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:30:47.221937+00 | created |          |                                            | npm
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:41:59.435579+00 | partial |          | \xe3f24413d81fd3e9c309686fcfb6c8f5eb549acf | npm
Jan 26 2021, 9:16 AM · System administration, Journal, Archive search

Jan 25 2021

rdicosmo assigned T376: ingest git.eclipse.org repositories to ardumont.
Jan 25 2021, 9:03 PM · Archive coverage
rdicosmo raised the priority of T376: ingest git.eclipse.org repositories from Low to High.

Now that we have a cgit lister, this should be a no-brainer.
If that's the case, we need it up and running quickly.

Jan 25 2021, 9:03 PM · Archive coverage
anlambert requested review of D4945: cran: Reimplement lister using new Lister API.
Jan 25 2021, 7:51 PM
anlambert added a revision to T2989: Port CRAN lister to the new Lister API: D4945: cran: Reimplement lister using new Lister API.
Jan 25 2021, 7:48 PM · Lister
anlambert requested changes to D4940: gitlab: Support authentication.

Sorry, I accepted the diff before seeing you used Basic Auth, which I think is not working.

Jan 25 2021, 7:40 PM
anlambert accepted D4940: gitlab: Support authentication.

I do not see any section regarding basic HTTP authentication for API requests in the Gitlab API doc. Are you sure it is working?

Jan 25 2021, 7:38 PM
anlambert accepted D4944: gitlab: Add support for last_update information during listing.

You should add a test to check that the last_update field value in the scheduler database is not None.

Jan 25 2021, 7:28 PM
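
The kind of check requested in the D4944 review above might be sketched as follows. This is only an illustration: it assumes the lister derives `last_update` from the `last_activity_at` field of a GitLab project entry, and `parse_last_update` is a hypothetical helper, not a function from the diff or from swh-scheduler.

```python
from datetime import datetime
from typing import Optional


def parse_last_update(project: dict) -> Optional[datetime]:
    """Convert a project's last_activity_at string to a datetime (hypothetical)."""
    raw = project.get("last_activity_at")
    if raw is None:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def test_last_update_is_set() -> None:
    # Sample entry shaped like a GitLab API project; the values are made up.
    project = {"id": 1, "last_activity_at": "2021-01-25T19:11:00Z"}
    last_update = parse_last_update(project)
    assert last_update is not None
    assert last_update.tzinfo is not None


test_last_update_is_set()
```

A real test would instead read the listed origins back from the scheduler and assert on their `last_update` values.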
anlambert requested review of D4925: debian: Reimplement lister using new Lister API.
Jan 25 2021, 7:16 PM
ardumont requested review of D4944: gitlab: Add support for last_update information during listing.
Jan 25 2021, 7:11 PM
ardumont added inline comments to D4940: gitlab: Support authentication.
Jan 25 2021, 7:11 PM
ardumont added inline comments to D4940: gitlab: Support authentication.
Jan 25 2021, 7:10 PM
swh-public-ci added a comment to D4940: gitlab: Support authentication.

Build is green

Jan 25 2021, 7:07 PM
ardumont updated the diff for D4940: gitlab: Support authentication.

Rebase

Jan 25 2021, 7:04 PM
ardumont committed rDLSbea9d6d147e4: gitlab: make url mandatory and add type (authored by ardumont).
gitlab: make url mandatory and add type
Jan 25 2021, 7:04 PM
vsellier closed D4943: cgit lister: Add missing types on the init method.
Jan 25 2021, 6:59 PM
vsellier committed rDLSd62e77c1b495: cgit lister: Add missing types on the init method (authored by vsellier).
cgit lister: Add missing types on the init method
Jan 25 2021, 6:59 PM
ardumont accepted D4943: cgit lister: Add missing types on the init method.

lgtm

Jan 25 2021, 6:58 PM
vsellier requested review of D4943: cgit lister: Add missing types on the init method.
Jan 25 2021, 6:58 PM
swh-public-ci added a comment to D4940: gitlab: Support authentication.

Build is green

Jan 25 2021, 6:42 PM
ardumont updated the diff for D4940: gitlab: Support authentication.

Update only the diff with authentication code

Jan 25 2021, 6:36 PM
vsellier added a revision to T2984: Port cgit lister to the new Lister API: D4943: cgit lister: Add missing types on the init method.
Jan 25 2021, 6:33 PM · Lister, CGit lister, Sprint 2021 01
Harbormaster failed remote builds in B18731: Diff 17593 for D4940: gitlab: Support authentication!
Jan 25 2021, 6:32 PM
swh-public-ci added a comment to D4940: gitlab: Support authentication.

Build has FAILED

Jan 25 2021, 6:32 PM
ardumont updated the diff for D4940: gitlab: Support authentication.

Rebase

Jan 25 2021, 6:29 PM
anlambert closed D4942: tests: Fix errors after swh-scheduler API update.
Jan 25 2021, 6:27 PM
anlambert committed rDLSea8ecee54185: tests: Fix errors after swh-scheduler API update (authored by anlambert).
tests: Fix errors after swh-scheduler API update
Jan 25 2021, 6:27 PM