In D8386#229890, @anlambert wrote:

In D8386#229882, @olasd wrote:

In D8386#229677, @KShivendu wrote:

@anlambert

I noticed that https://archive.softwareheritage.org/browse/origin/directory/?origin_url=deb://Ubuntu/packages/nginx has duplicate branch names, which is very confusing. In fact, even the default branch is repeated twice and I see two check marks. If we use branch names like 0.3.9-15.fc26, won't the same happen with Fedora listers? It doesn't seem to differentiate between the editions. (or does it?)

This seems like a misfeature in the webapp:

https://archive.softwareheritage.org/api/1/snapshot/158a3f36b0bd3da461fb7458de44cfa2c94e4270/

The snapshot has multiple branches, with the same version suffix, pointing at the same objects (because the exact same version of the package is present in multiple Ubuntu suites).

I'm not 100% sure how we should be fixing that, but that bug shouldn't prevent you from giving the fedora snapshots the "semantically correct" structure.

I also noticed that yesterday evening and I was also wondering what is the best way to fix that. I see two possible options:

We change the names of the keys in snapshot branches dictionary in order to use the intrinsic version of a debian package instead of its extrinsic one (meaning releases/bionic-security/main/1.14.0-0ubuntu1.10 should rather be releases/1.14.0-0ubuntu1.10)

We update the webapp to filter duplicated releases before display as the release name is used instead of the snapshot branches key associated to the release

I would rather go for the second one as keeping the debian/ubuntu suites and components in keys of snapshot branches dictionary seems of interest.
We could do the same for the fedora case: based on my tests, it is quite common for extrinsic versions of the form [0-9].[0-9].[0-9]-[0-9].fc[0-9]+ to target the same intrinsic version [0-9].[0-9].[0-9]-[0-9].
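
A minimal sketch of the second option (filtering duplicated releases before display), not the actual webapp code. It assumes the snapshot branches are available as a plain dict from branch key to target id; the function name `dedupe_releases_for_display`, the `rel-a`/`rel-b` target ids, and the sample branch keys other than the one quoted above are made up for illustration.

```python
def dedupe_releases_for_display(branches):
    """Group branches that share both a version suffix and a target.

    `branches` maps a branch key such as
    "releases/bionic-security/main/1.14.0-0ubuntu1.10" to a target id.
    The result maps (version, target) to the list of branch keys it came
    from, so the suite/component information stays available.
    """
    grouped = {}
    for key, target in branches.items():
        # Assumption: the last path component of the key is the version.
        version = key.rsplit("/", 1)[-1]
        grouped.setdefault((version, target), []).append(key)
    return grouped


# Hypothetical sample data: the same release present in two suites.
branches = {
    "releases/bionic-security/main/1.14.0-0ubuntu1.10": "rel-a",
    "releases/bionic-updates/main/1.14.0-0ubuntu1.10": "rel-a",
    "releases/focal/main/1.17.10-0ubuntu1": "rel-b",
}

display = dedupe_releases_for_display(branches)
for (version, target), keys in display.items():
    print(version, target, len(keys))
```

Grouping on the (version, target) pair rather than on the version alone means two branches with the same version string are only merged when they really point at the same object.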

Nov 15 2022, 11:02 AM

swh-public-ci added a comment to D8832: luigi: Add DownloadFromS3 task.

Build is green

Nov 15 2022, 11:02 AM

anlambert accepted D8844: origin-search: Only request 'url' field.

Nice catch !

Nov 15 2022, 11:02 AM

vlorentz added inline comments to D8832: luigi: Add DownloadFromS3 task.

Nov 15 2022, 11:02 AM

ardumont added inline comments to D8832: luigi: Add DownloadFromS3 task.

Nov 15 2022, 11:01 AM

vlorentz updated the diff for D8832: luigi: Add DownloadFromS3 task.

make the right parameter significant

Nov 15 2022, 11:00 AM

ardumont accepted D8842: metadata-search: Return swh-search even when missing from idx_storage.origin_intrinsic_metadata.

Nov 15 2022, 11:00 AM

anlambert accepted D8386: feat(fedora): Introduce fedora lister.

Looks good to me, thanks !

Nov 15 2022, 10:59 AM

vlorentz requested review of D8844: origin-search: Only request 'url' field.

Nov 15 2022, 10:59 AM

anlambert added a comment to D8386: feat(fedora): Introduce fedora lister.

In D8386#229882, @olasd wrote:

In D8386#229677, @KShivendu wrote:

@anlambert

I noticed that https://archive.softwareheritage.org/browse/origin/directory/?origin_url=deb://Ubuntu/packages/nginx has duplicate branch names, which is very confusing. In fact, even the default branch is repeated twice and I see two check marks. If we use branch names like 0.3.9-15.fc26, won't the same happen with Fedora listers? It doesn't seem to differentiate between the editions. (or does it?)

This seems like a misfeature in the webapp:

https://archive.softwareheritage.org/api/1/snapshot/158a3f36b0bd3da461fb7458de44cfa2c94e4270/

The snapshot has multiple branches, with the same version suffix, pointing at the same objects (because the exact same version of the package is present in multiple Ubuntu suites).

I'm not 100% sure how we should be fixing that, but that bug shouldn't prevent you from giving the fedora snapshots the "semantically correct" structure.

Nov 15 2022, 10:52 AM

olasd added a comment to D8386: feat(fedora): Introduce fedora lister.

In D8386#229677, @KShivendu wrote:

@anlambert

I noticed that https://archive.softwareheritage.org/browse/origin/directory/?origin_url=deb://Ubuntu/packages/nginx has duplicate branch names, which is very confusing. In fact, even the default branch is repeated twice and I see two check marks. If we use branch names like 0.3.9-15.fc26, won't the same happen with Fedora listers? It doesn't seem to differentiate between the editions. (or does it?)

Nov 15 2022, 10:35 AM

vlorentz requested review of D8843: metadata-search: Skip query to swh-indexer when its results would be discarded.

Nov 15 2022, 10:02 AM

swh-public-ci added a comment to D8663: Hackage: Implement incremental mode.

Build is green

Nov 15 2022, 9:59 AM

franckbret updated the diff for D8663: Hackage: Implement incremental mode.

Improve test for incremental listing, ensure the http searchQuery/lastUpload value is a date

Nov 15 2022, 9:53 AM

vlorentz requested review of D8842: metadata-search: Return swh-search even when missing from idx_storage.origin_intrinsic_metadata.

Nov 15 2022, 9:40 AM

vlorentz added a revision to T4599: Github descriptions are not used to search origins: D8842: metadata-search: Return swh-search even when missing from idx_storage.origin_intrinsic_metadata.

Nov 15 2022, 9:31 AM · Metadata workflow, Archive search

swh-public-ci added a comment to D8753: feat: Introduce RPM loader.

Build is green

Nov 15 2022, 9:26 AM

KShivendu updated the diff for D8753: feat: Introduce RPM loader.

Minor fixes in the loader docstrings

Nov 15 2022, 9:22 AM

swh-public-ci added a comment to D8386: feat(fedora): Introduce fedora lister.

Build is green

Nov 15 2022, 9:15 AM

KShivendu updated the diff for D8386: feat(fedora): Introduce fedora lister.

Add tests for handling of HTTP errors and sha1 checksum (increase test coverage)

Nov 15 2022, 9:10 AM

swh-public-ci added a comment to D8753: feat: Introduce RPM loader.

Build is green

Nov 15 2022, 8:04 AM

KShivendu updated the diff for D8753: feat: Introduce RPM loader.

Extract .tar.gz as a separate branch (and other suggestions made by @anlambert)
Remove .tar.gz extraction logic from extract_rpm_package function. Previously, I was just replacing .tar.gz with its extracted folder but now we are creating a separate branch as well.
Updating relevant tests for the same

Nov 15 2022, 8:00 AM

Nov 14 2022

lunar updated the summary of D8838: Use a volatile resource lock for host port 5080.

Nov 14 2022, 6:08 PM

lunar updated the diff for D8838: Use a volatile resource lock for host port 5080.

Fix the issue by adding a level of indirection in the yaml (replacing the job
by an identical job-template, and instantiating it through a project).
It seems jinja2 templates aren't actually supported in direct job definitions,
only in job templates. Thanks to olasd for finding this out and suggesting a fix.

Nov 14 2022, 6:06 PM

ardumont accepted D8839: maven: Simplify tests with requests_mock_datadir fixture.

Nov 14 2022, 5:44 PM

ardumont accepted D8840: maven: Add support for md5 checkums to check download integrity.

Nov 14 2022, 5:43 PM

swh-public-ci added a comment to D6380: Allow partial snapshot creation during ingestion.

Build is green

Nov 14 2022, 5:43 PM

ardumont updated the diff for D6380: Allow partial snapshot creation during ingestion.

Fix mistyped signature

Nov 14 2022, 5:39 PM

anlambert requested review of D8841: save_code_now: Allow request creation if previous for origin is running.

Nov 14 2022, 5:35 PM

anlambert added a revision to T4548: Add a public API endpoint and documentation to trigger Save Code Now from webhook: D8841: save_code_now: Allow request creation if previous for origin is running.

Nov 14 2022, 5:27 PM · Web app

Harbormaster failed remote builds in B32791: Diff 31855 for D6380: Allow partial snapshot creation during ingestion!

Nov 14 2022, 5:23 PM

swh-public-ci added a comment to D6380: Allow partial snapshot creation during ingestion.

Build has FAILED

Nov 14 2022, 5:23 PM

ardumont updated the diff for D6380: Allow partial snapshot creation during ingestion.

Add coverage (which is a bit convoluted but we are in loader-core so no real loader to
check that actual behavior beyond what i propose).

Nov 14 2022, 5:22 PM

zack added a parent task for T4683: license dataset: use a consistent file format for CSV-like files: T4685: license dataset: add logic to convert/import dataset into a SQL database.

Nov 14 2022, 4:50 PM · Datasets

zack added a subtask for T4685: license dataset: add logic to convert/import dataset into a SQL database: T4683: license dataset: use a consistent file format for CSV-like files.

Nov 14 2022, 4:50 PM · Datasets

zack triaged T4685: license dataset: add logic to convert/import dataset into a SQL database as Low priority.

Nov 14 2022, 4:49 PM · Datasets

zack changed the edit policy for P1529 import the license dataset into sqlite.

Nov 14 2022, 4:47 PM · Datasets

zack created P1529 import the license dataset into sqlite.

Nov 14 2022, 4:47 PM · Datasets

olasd triaged T4680: GitLab lister: allow ignoring origins contained in given namespace prefixes as Normal priority.

Nov 14 2022, 4:42 PM · Easy hack, Lister

olasd created P1528 Command-Line Input.

Nov 14 2022, 4:36 PM

anlambert updated the summary of D8840: maven: Add support for md5 checkums to check download integrity.

Nov 14 2022, 4:33 PM

anlambert requested review of D8840: maven: Add support for md5 checkums to check download integrity.

Nov 14 2022, 4:30 PM

vlorentz added a project to T4684: Publish scrubber metrics and create grafana dashboard: Datastore Scrubber.

Nov 14 2022, 4:22 PM · Datastore Scrubber

vlorentz claimed T4684: Publish scrubber metrics and create grafana dashboard.

Nov 14 2022, 4:22 PM · Datastore Scrubber

vlorentz triaged T4684: Publish scrubber metrics and create grafana dashboard as High priority.

Nov 14 2022, 4:22 PM · Datastore Scrubber

anlambert requested review of D8839: maven: Simplify tests with requests_mock_datadir fixture.

Nov 14 2022, 4:17 PM

lunar updated the summary of D8838: Use a volatile resource lock for host port 5080.

Nov 14 2022, 4:03 PM

lunar requested review of D8838: Use a volatile resource lock for host port 5080.

Nov 14 2022, 4:02 PM

swh-public-ci added a comment to D6380: Allow partial snapshot creation during ingestion.

Build is green

Nov 14 2022, 3:51 PM

ardumont retitled D6380: Allow partial snapshot creation during ingestion from Improve store_data implem to allow multiple calls with partial visit to Allow partial snapshot creation during ingestion.

Nov 14 2022, 3:51 PM

ardumont added a comment to D6380: Allow partial snapshot creation during ingestion.

Only, more_data_to_fetch/create_snapshot is renamed create_partial_visit though as
that makes more sense now.

Nov 14 2022, 3:50 PM

ardumont updated the diff for D6380: Allow partial snapshot creation during ingestion.

Rebase
reword commit and diff description
adapt parameter according to review suggestion from @vlorentz

Nov 14 2022, 3:48 PM

bchauvet updated the task description for T4678: Automation of add forge now workflow.

Nov 14 2022, 3:20 PM · Add Forge Now

zack added a project to T4683: license dataset: use a consistent file format for CSV-like files: Datasets.

Nov 14 2022, 3:09 PM · Datasets

vlorentz added a comment to T4682: license dataset: missing java stuff from the replication package.

the replication/05-earliest-revision.sh script in the replication package mentions the swh-graph version it uses, and the fully qualified class name, so it can be found in the swh-graph code.

Nov 14 2022, 3:08 PM · Datasets

zack triaged T4683: license dataset: use a consistent file format for CSV-like files as Low priority.

Nov 14 2022, 3:05 PM · Datasets

zack triaged T4682: license dataset: missing java stuff from the replication package as Low priority.

Nov 14 2022, 2:45 PM · Datasets

anlambert requested review of D8837: api: Improve HTML documentation.

Nov 14 2022, 2:42 PM

anlambert closed D8836: browse: Use django FileResponse in browse-content-raw view.

Nov 14 2022, 2:27 PM

anlambert committed rDWAPPSad8558c69d88: browse: Use django FileResponse in browse-content-raw view (authored by anlambert).

browse: Use django FileResponse in browse-content-raw view

Nov 14 2022, 2:27 PM

vlorentz accepted D8836: browse: Use django FileResponse in browse-content-raw view.

Nov 14 2022, 2:26 PM

vlorentz added a task to D8836: browse: Use django FileResponse in browse-content-raw view: Unknown Object (Maniphest Task).

Nov 14 2022, 2:23 PM

zack closed D8835: changelog: document recent git loader speed improvements.

merged in abbcf03b7bb2f1425db154dbe6e43e10c647354c

Nov 14 2022, 2:08 PM

zack committed rDDOCabbcf03b7bb2: changelog: document recent git loader speed improvements (authored by zack).

changelog: document recent git loader speed improvements

Nov 14 2022, 2:07 PM

ardumont accepted D8832: luigi: Add DownloadFromS3 task.

one question inline.

Nov 14 2022, 2:02 PM

anlambert requested review of D8836: browse: Use django FileResponse in browse-content-raw view.

Nov 14 2022, 1:52 PM

olasd added a project to T4681: Add throttling/backoff to origin visit scheduler respawn logic: Easy hack.

Nov 14 2022, 1:49 PM · Easy hack, Scheduling utilities

olasd triaged T4681: Add throttling/backoff to origin visit scheduler respawn logic as Normal priority.

Nov 14 2022, 1:49 PM · Easy hack, Scheduling utilities

olasd placed T4681: Add throttling/backoff to origin visit scheduler respawn logic up for grabs.

Nov 14 2022, 1:49 PM · Easy hack, Scheduling utilities

swh-sentry-integration claimed T4681: Add throttling/backoff to origin visit scheduler respawn logic.

Nov 14 2022, 1:48 PM · Easy hack, Scheduling utilities

vlorentz added a comment to D8663: Hackage: Implement incremental mode.

One last thing: could you make tests check the request body is as expected? See https://requests-mock.readthedocs.io/en/latest/history.html

Nov 14 2022, 1:35 PM

olasd accepted D8835: changelog: document recent git loader speed improvements.

Thanks!

Nov 14 2022, 1:34 PM

vlorentz closed D8834: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks.

Nov 14 2022, 1:08 PM

vlorentz committed rDDOC072eeb4f771e: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks (authored by vlorentz).

roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks

Nov 14 2022, 1:08 PM

vlorentz added a comment to T4599: Github descriptions are not used to search origins.

swh-web uses swh-search as a glorified postgresql index: for every result returned by swh-search, it pulls the corresponding row from origin_intrinsic_metadata in the indexer database; which means it ignores extrinsic metadata.

Nov 14 2022, 1:07 PM · Metadata workflow, Archive search

zack requested review of D8835: changelog: document recent git loader speed improvements.

Nov 14 2022, 12:43 PM

vlorentz created P1527 (An Untitled Masterwork).

Nov 14 2022, 12:38 PM

ardumont accepted D8834: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks.

Nov 14 2022, 11:38 AM

vlorentz requested review of D8834: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks.

Nov 14 2022, 11:37 AM

olasd added a revision to T4657: Allow object removal from journal: D8833: Add base functionality to support object deletion.

Nov 14 2022, 11:12 AM · Journal

olasd renamed T4680: GitLab lister: allow ignoring origins contained in given namespace prefixes from GitLab lister: allow ignoring origins contained in a given namespace to GitLab lister: allow ignoring origins contained in given namespace prefixes.

Nov 14 2022, 11:04 AM · Easy hack, Lister

olasd created T4680: GitLab lister: allow ignoring origins contained in given namespace prefixes.

Nov 14 2022, 11:03 AM · Easy hack, Lister

swh-public-ci added a comment to D8663: Hackage: Implement incremental mode.

Build is green

Nov 14 2022, 10:53 AM

franckbret added a comment to D8663: Hackage: Implement incremental mode.

In D8663#229574, @vlorentz wrote:

buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.

Also, you should apply .astimezone(tz=timezone.utc) before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.

(Sorry for the back-and-forth; hopefully I'm done now.)
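
The two points in the quoted review can be sketched as follows. This is an illustration only, not the code of the diff: the helper name `incremental_cutoff` is hypothetical, and it assumes the stored timestamp is timezone-aware (on a naive datetime, `astimezone` would assume the system's local time).

```python
from datetime import datetime, timedelta, timezone


def incremental_cutoff(last_listing: datetime):
    """Return the date to use as lower bound for the next listing.

    The timestamp is converted to UTC before taking the date, then one day
    is subtracted so that a strict "greater than" filter still catches
    uploads made later on the same day as the previous run.
    """
    as_utc = last_listing.astimezone(tz=timezone.utc)
    return as_utc.date() - timedelta(days=1)


# A timestamp returned with a +02:00 offset: 00:30 local time on Nov 14
# is 22:30 UTC on Nov 13, so the UTC date differs from the local date.
stored = datetime(2022, 11, 14, 0, 30, tzinfo=timezone(timedelta(hours=2)))
print(incremental_cutoff(stored))
```

With a greater-than-or-equal comparison instead of a strict one, the one-day subtraction would not be needed, but the UTC conversion still would.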

Nov 14 2022, 10:53 AM

franckbret updated the diff for D8663: Hackage: Implement incremental mode.

Use greater than or equal instead of strict comparison when building http api query params for incremental listing

Nov 14 2022, 10:48 AM

franckbret abandoned D8824: Cpan: Implement incremental mode.

Abandon revision because an incremental mode brings no real advantage in this case

Nov 14 2022, 10:06 AM

franckbret added a comment to D8824: Cpan: Implement incremental mode.

In D8824#229544, @anlambert wrote:

@franckbret, as explained in my inline comment we cannot use the date filtering on the release index of CPAN elasticsearch.

The only incremental mode we can implement here is to filter the ListedOrigin instances sent to the scheduler according to the last_update value: if it is greater than the date from the lister state, we can yield it.

Nevertheless, I am not sure if it is worth it as a full listing takes around 10 minutes, which is pretty fast.
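
For reference, the filtering described in the quoted comment could look roughly like the sketch below. The `Origin` dataclass is a hypothetical stand-in, not the real ListedOrigin model, and `updated_since` is an invented name. Yielding origins that have no last_update value is an assumption made here to stay on the safe side.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass
class Origin:
    """Hypothetical stand-in for a listed origin (not the real swh class)."""

    url: str
    last_update: Optional[datetime]


def updated_since(
    origins: Iterable[Origin], state_date: Optional[datetime]
) -> Iterator[Origin]:
    """Yield only the origins updated after the date kept in the lister state."""
    for origin in origins:
        # Assumption: with no previous state, or no last_update on the
        # origin, yield it rather than risk skipping a real update.
        if state_date is None or origin.last_update is None:
            yield origin
        elif origin.last_update > state_date:
            yield origin


# Placeholder URLs and dates, for illustration only.
state = datetime(2022, 11, 1, tzinfo=timezone.utc)
listed = [
    Origin("https://example.org/old", datetime(2022, 10, 1, tzinfo=timezone.utc)),
    Origin("https://example.org/new", datetime(2022, 11, 10, tzinfo=timezone.utc)),
]
kept = [origin.url for origin in updated_since(listed, state)]
print(kept)
```

Note that this only reduces what is sent to the scheduler; the full index still has to be fetched, which is why the gain over a 10-minute full listing is limited.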