Build is green
Jan 26 2021
Build is green
Build is green
rebase
rebase
rebase
rebase
- rebase (including documentation of RAM usage)
- document _visit_times
- rename get_current_snapshot to get_current_snapshot_id
Build is green
In D4914#124243, @douardda wrote:
> Something I don't understand: why do you need to keep both _visit_times and latest_snapshots in "caches" when a snapshot is derived from this visit time (and visit type and origin)?
Build is green
Build is green
rebase
rebase
rebase
add doc on the origin model
If I'm implementing this nonetheless, I'll go with the header (bearer) implementation.
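A minimal sketch of the header (bearer) approach, assuming a personal or OAuth token; the helper name and the example instance URL are illustrative, not part of the actual lister code. GitLab's API accepts `Authorization: Bearer <token>` (for OAuth tokens) or the `PRIVATE-TOKEN` header (for personal access tokens), which is why the header approach is preferable to Basic auth here:

```python
def gitlab_auth_headers(token: str, use_private_token: bool = False) -> dict:
    """Build auth headers for a GitLab API request.

    OAuth tokens go in the Authorization header as a bearer token;
    personal access tokens can alternatively use PRIVATE-TOKEN.
    """
    if use_private_token:
        return {"PRIVATE-TOKEN": token}
    return {"Authorization": f"Bearer {token}"}


# Hypothetical usage with `requests` against an example instance:
#
#   import requests
#   resp = requests.get(
#       "https://gitlab.example.org/api/v4/projects",
#       headers=gitlab_auth_headers("my-token"),
#   )
```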
- stop the scheduler runner to avoid blocking queries
some suggestions inline.
In any case, bravo for the effort on this tough lister!
Referencing the specification is enough, rather than a .xsd file.
On a related note, I found this list of Community-Hosted GitLab Instances. Most of them have public access and could be added to the set of Gitlab instances listed by Software Heritage.
- Stop the workers:
$ clush -b -w @swh-workers 'puppet agent --disable "Deploy new storage version"; cd /etc/systemd/system/multi-user.target.wants; for unit in swh-worker@*; do systemctl disable $unit; done; systemctl stop swh-worker@*'
Upgrading the index configuration to speed up indexing:
% cat >/tmp/config.json <<EOF
{
  "index" : {
    "translog.sync_interval" : "60s",
    "translog.durability": "async",
    "refresh_interval": "60s"
  }
}
EOF
% export ES_SERVER=192.168.100.81:9200
% export INDEX=origin
% curl -s -H "Content-Type: application/json" -XPUT http://${ES_SERVER}/${INDEX}/_settings -d @/tmp/config.json
{"acknowledged":true}
Bravo, I did not check. I trusted the old code.
It's doing what was done in the earlier version [1]
I'll check now.
And once again, this "cache" behavior makes the simulator unable to run "forever" (it will eat RAM). Maybe it's an assumed design choice, but please document it somewhere.
Something I don't understand: why do you need to keep both _visit_times and latest_snapshots in "caches" when a snapshot is derived from this visit time (and visit type and origin)?
Production
- puppet disabled
- Services stopped:
root@search1:~# systemctl stop swh-search-journal-client@objects.service
root@search1:~# systemctl stop gunicorn-swh-search
- Index deleted and recreated
% export ES_SERVER=search-esnode1.internal.softwareheritage.org:9200
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin Mq8dnlpuRXO4yYoC6CTuQw  90   1  151716299     38861934    260.8gb          131gb
% curl -XDELETE http://$ES_SERVER/origin
{"acknowledged":true}
% swh search --config-file /etc/softwareheritage/search/server.yml initialize
INFO:elasticsearch:PUT http://search-esnode1.internal.softwareheritage.org:9200/origin [status:200 request:2.216s]
INFO:elasticsearch:PUT http://search-esnode3.internal.softwareheritage.org:9200/origin/_mapping [status:200 request:0.151s]
Done.
% curl -s http://$ES_SERVER/_cat/indices\?v
health status index  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   origin yFaqPPCnRFCnc5AA6Ah8lw  90   1          0            0     36.5kb         18.2kb
- Journal client's consumer group deleted:
% export SERVER=kafka1.internal.softwareheritage.org:9092
% ./kafka-consumer-groups.sh --bootstrap-server ${SERVER} --delete --group swh.search.journal_client
Deletion of requested consumer groups ('swh.search.journal_client') was successful.
- journal client restarted
- puppet enabled
Isn't there some inherent limitation with this lister_process (gradually eating RAM) that should be documented (maybe)?
Note that I still think there should be something in docs/simulator.rst also...
In D4909#123949, @vlorentz wrote:
> We're not claiming this is a realistic model. We only tried to do something that isn't completely naive, and exercises simple edge cases. Making it realistic is hard, and will probably be most of @olasd's work this week.
I do not see any section regarding basic HTTP authentication for API requests in the GitLab API doc. Are you sure it is working?
The filter on visited origins is working correctly on staging. The has_visit flag looks good.
For example, for the https://www.npmjs.com/package/@ehmicky/dev-tasks origin:
{
  "_index" : "origin",
  "_type" : "_doc",
  "_id" : "019bd314416108304165e82dd92e00bc9ea85a53",
  "_score" : 60.56421,
  "_source" : {
    "url" : "https://www.npmjs.com/package/@ehmicky/dev-tasks",
    "sha1" : "019bd314416108304165e82dd92e00bc9ea85a53"
  },
  "sort" : [
    60.56421,
    "019bd314416108304165e82dd92e00bc9ea85a53"
  ]
}
swh=> select * from origin join origin_visit_status on id=origin where id=469380;
   id   |                       url                        | origin | visit |             date              | status  | metadata |                  snapshot                  | type
--------+--------------------------------------------------+--------+-------+-------------------------------+---------+----------+--------------------------------------------+------
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:30:47.221937+00 | created |          |                                            | npm
 469380 | https://www.npmjs.com/package/@ehmicky/dev-tasks | 469380 |     1 | 2021-01-25 13:41:59.435579+00 | partial |          | \xe3f24413d81fd3e9c309686fcfb6c8f5eb549acf | npm
Jan 25 2021
Now that we have a cgit lister, this should be a no-brainer.
If that's the case, we need it up and running quickly.
Sorry, I accepted the diff before seeing you used Basic Auth, which I think is not working.
I do not see any section regarding basic HTTP authentication for API requests in the GitLab API doc. Are you sure it is working?
You should add a test checking that the last_update field value in the scheduler database is not None.
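A sketch of what such a check could look like. The helper below is self-contained; the commented pytest fixture names (`swh_scheduler`, `lister`) follow the usual swh test conventions but are assumptions here, not the actual test suite:

```python
def check_last_update(listed_origins) -> None:
    """Assert every listed origin carries a non-None last_update."""
    for origin in listed_origins:
        assert origin.last_update is not None, origin.url


# In a pytest test this might look like (hypothetical fixtures):
#
#   def test_lister_sets_last_update(swh_scheduler, lister):
#       lister.run()
#       origins = swh_scheduler.get_listed_origins(
#           lister.lister_obj.id).results
#       check_last_update(origins)
```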
Build is green
Build is green
Update only the diff with authentication code
Build has FAILED