Feed All Stories

Nov 15 2022

vlorentz committed rDWAPPS6531a3653102: origin-search: Only request 'url' field (authored by vlorentz).
origin-search: Only request 'url' field
Nov 15 2022, 11:05 AM
vlorentz committed rDWAPPSf59acd618560: metadata-search: Skip query to swh-indexer when its results would be discarded (authored by vlorentz).
metadata-search: Skip query to swh-indexer when its results would be discarded
Nov 15 2022, 11:05 AM
vlorentz closed D8842: metadata-search: Return swh-search even when missing from idx_storage.origin_intrinsic_metadata.
Nov 15 2022, 11:05 AM
vlorentz committed rDWAPPS76c64ea4dc9e: metadata-search: Return swh-search even when missing from idx_storage. (authored by vlorentz).
metadata-search: Return swh-search even when missing from idx_storage.
Nov 15 2022, 11:05 AM
vlorentz added a comment to D8844: origin-search: Only request 'url' field.

not really a nice catch as it wasn't a very useful optimization before D8843, which I only noticed when the useless query caused issues ;)

Nov 15 2022, 11:05 AM
ardumont accepted D8843: metadata-search: Skip query to swh-indexer when its results would be discarded.
Nov 15 2022, 11:04 AM
vlorentz added inline comments to D8832: luigi: Add DownloadFromS3 task.
Nov 15 2022, 11:04 AM
olasd added a comment to D8386: feat(fedora): Introduce fedora lister.
In D8386#229882, @olasd wrote:

@anlambert

I noticed that https://archive.softwareheritage.org/browse/origin/directory/?origin_url=deb://Ubuntu/packages/nginx has duplicate branch names, which is very confusing. In fact, even the default branch is repeated twice and I see two check marks. If we use branch names like 0.3.9-15.fc26, won't the same happen with Fedora listers? It doesn't seem to differentiate between the editions. (or does it?)

This seems like a misfeature in the webapp:

https://archive.softwareheritage.org/api/1/snapshot/158a3f36b0bd3da461fb7458de44cfa2c94e4270/

The snapshot has multiple branches, with the same version suffix, pointing at the same objects (because the exact same version of the package is present in multiple Ubuntu suites).

I'm not 100% sure how we should be fixing that, but that bug shouldn't prevent you from giving the fedora snapshots the "semantically correct" structure.

I also noticed that yesterday evening and I was also wondering what is the best way to fix that. I see two possible options:

  1. We change the names of the keys in snapshot branches dictionary in order to use the intrinsic version of a debian package instead of its extrinsic one (meaning releases/bionic-security/main/1.14.0-0ubuntu1.10 should rather be releases/1.14.0-0ubuntu1.10)
  2. We update the webapp to filter duplicated releases before display as the release name is used instead of the snapshot branches key associated to the release

I would rather go for the second one as keeping the debian/ubuntu suites and components in keys of snapshot branches dictionary seems of interest.
We could do the same for the fedora case as based on my tests it is quite common that extrinsic versions in the form [0-9].[0-9].[0-9]-[0-9].fc[0-9]+
target the same intrinsic version [0-9].[0-9].[0-9]-[0-9].
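
A minimal sketch of the second option (filtering duplicated releases before display), not the webapp's actual code. It assumes the snapshot branches are available as a plain mapping from branch key to target identifier, and it uses the last path component of the key as a stand-in for the release name. The helper `unique_releases`, the extra branch keys, and the `rel-...` targets are hypothetical.

```python
from typing import Dict, List, Tuple


def unique_releases(branches: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (display_name, target) pairs, keeping one entry per pair.

    Branch keys are visited in sorted order so the result is deterministic.
    """
    seen = set()
    result = []
    for branch_name in sorted(branches):
        target = branches[branch_name]
        # Assumption: the displayed name is the last component of the key.
        display_name = branch_name.rsplit("/", 1)[-1]
        key = (display_name, target)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical snapshot: the same package version appears in two suites.
branches = {
    "releases/bionic-security/main/1.14.0-0ubuntu1.10": "rel-aaa",
    "releases/bionic-updates/main/1.14.0-0ubuntu1.10": "rel-aaa",
    "releases/focal/main/1.17.10-0ubuntu1": "rel-bbb",
}
print(unique_releases(branches))
```

Deduplicating on the (name, target) pair rather than on the name alone keeps two entries if the same version string ever points at different objects, and the suite and component information stays untouched in the snapshot itself.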

Nov 15 2022, 11:02 AM
swh-public-ci added a comment to D8832: luigi: Add DownloadFromS3 task.

Build is green

Nov 15 2022, 11:02 AM
anlambert accepted D8844: origin-search: Only request 'url' field.

Nice catch !

Nov 15 2022, 11:02 AM
vlorentz added inline comments to D8832: luigi: Add DownloadFromS3 task.
Nov 15 2022, 11:02 AM
ardumont added inline comments to D8832: luigi: Add DownloadFromS3 task.
Nov 15 2022, 11:01 AM
vlorentz updated the diff for D8832: luigi: Add DownloadFromS3 task.

make the right parameter significant

Nov 15 2022, 11:00 AM
ardumont accepted D8842: metadata-search: Return swh-search even when missing from idx_storage.origin_intrinsic_metadata.
Nov 15 2022, 11:00 AM
anlambert accepted D8386: feat(fedora): Introduce fedora lister.

Looks good to me, thanks !

Nov 15 2022, 10:59 AM
vlorentz requested review of D8844: origin-search: Only request 'url' field.
Nov 15 2022, 10:59 AM
anlambert added a comment to D8386: feat(fedora): Introduce fedora lister.
In D8386#229882, @olasd wrote:

@anlambert

I noticed that https://archive.softwareheritage.org/browse/origin/directory/?origin_url=deb://Ubuntu/packages/nginx has duplicate branch names, which is very confusing. In fact, even the default branch is repeated twice and I see two check marks. If we use branch names like 0.3.9-15.fc26, won't the same happen with Fedora listers? It doesn't seem to differentiate between the editions. (or does it?)

This seems like a misfeature in the webapp:

https://archive.softwareheritage.org/api/1/snapshot/158a3f36b0bd3da461fb7458de44cfa2c94e4270/

The snapshot has multiple branches, with the same version suffix, pointing at the same objects (because the exact same version of the package is present in multiple Ubuntu suites).

I'm not 100% sure how we should be fixing that, but that bug shouldn't prevent you from giving the fedora snapshots the "semantically correct" structure.

Nov 15 2022, 10:52 AM
olasd added a comment to D8386: feat(fedora): Introduce fedora lister.

@anlambert

I noticed that https://archive.softwareheritage.org/browse/origin/directory/?origin_url=deb://Ubuntu/packages/nginx has duplicate branch names, which is very confusing. In fact, even the default branch is repeated twice and I see two check marks. If we use branch names like 0.3.9-15.fc26, won't the same happen with Fedora listers? It doesn't seem to differentiate between the editions. (or does it?)

Nov 15 2022, 10:35 AM
vlorentz requested review of D8843: metadata-search: Skip query to swh-indexer when its results would be discarded.
Nov 15 2022, 10:02 AM
swh-public-ci added a comment to D8663: Hackage: Implement incremental mode.

Build is green

Nov 15 2022, 9:59 AM
franckbret updated the diff for D8663: Hackage: Implement incremental mode.

Improve test for incremental listing, ensure the http searchQuery/lastUpload value is a date

Nov 15 2022, 9:53 AM
vlorentz requested review of D8842: metadata-search: Return swh-search even when missing from idx_storage.origin_intrinsic_metadata.
Nov 15 2022, 9:40 AM
vlorentz added a revision to T4599: Github descriptions are not used to search origins: D8842: metadata-search: Return swh-search even when missing from idx_storage.origin_intrinsic_metadata.
Nov 15 2022, 9:31 AM · Metadata workflow, Archive search
swh-public-ci added a comment to D8753: feat: Introduce RPM loader.

Build is green

Nov 15 2022, 9:26 AM
KShivendu updated the diff for D8753: feat: Introduce RPM loader.

Minor fixes in the loader docstrings

Nov 15 2022, 9:22 AM
swh-public-ci added a comment to D8386: feat(fedora): Introduce fedora lister.

Build is green

Nov 15 2022, 9:15 AM
KShivendu updated the diff for D8386: feat(fedora): Introduce fedora lister.
  • Add tests for handling of HTTP errors and sha1 checksum (increase test coverage)
Nov 15 2022, 9:10 AM
swh-public-ci added a comment to D8753: feat: Introduce RPM loader.

Build is green

Nov 15 2022, 8:04 AM
KShivendu updated the diff for D8753: feat: Introduce RPM loader.
  • Extract .tar.gz as a separate branch (and other suggestions made by @anlambert)
  • Remove .tar.gz extraction logic from extract_rpm_package function. Previously, I was just replacing .tar.gz with its extracted folder but now we are creating a separate branch as well.
  • Updating relevant tests for the same
Nov 15 2022, 8:00 AM

Nov 14 2022

lunar updated the summary of D8838: Use a volatile resource lock for host port 5080.
Nov 14 2022, 6:08 PM
lunar updated the diff for D8838: Use a volatile resource lock for host port 5080.

Fix the issue by adding a level of indirection in the yaml (replacing the job
by an identical job-template, and instantiating it through a project).
It seems jinja2 templates aren't actually supported in direct job definitions,
only in job templates. Thanks to olasd for finding this out and suggesting a fix.

Nov 14 2022, 6:06 PM
ardumont accepted D8839: maven: Simplify tests with requests_mock_datadir fixture.
Nov 14 2022, 5:44 PM
ardumont accepted D8840: maven: Add support for md5 checkums to check download integrity.
Nov 14 2022, 5:43 PM
swh-public-ci added a comment to D6380: Allow partial snapshot creation during ingestion.

Build is green

Nov 14 2022, 5:43 PM
ardumont updated the diff for D6380: Allow partial snapshot creation during ingestion.

Fix mistyped signature

Nov 14 2022, 5:39 PM
anlambert requested review of D8841: save_code_now: Allow request creation if previous for origin is running.
Nov 14 2022, 5:35 PM
anlambert added a revision to T4548: Add a public API endpoint and documentation to trigger Save Code Now from webhook: D8841: save_code_now: Allow request creation if previous for origin is running.
Nov 14 2022, 5:27 PM · Web app
Harbormaster failed remote builds in B32791: Diff 31855 for D6380: Allow partial snapshot creation during ingestion!
Nov 14 2022, 5:23 PM
swh-public-ci added a comment to D6380: Allow partial snapshot creation during ingestion.

Build has FAILED

Nov 14 2022, 5:23 PM
ardumont updated the diff for D6380: Allow partial snapshot creation during ingestion.

Add coverage (which is a bit convoluted, but we are in loader-core, so there is no real loader to
check the actual behavior beyond what I propose).

Nov 14 2022, 5:22 PM
zack added a parent task for T4683: license dataset: use a consistent file format for CSV-like files: T4685: license dataset: add logic to convert/import dataset into a SQL database.
Nov 14 2022, 4:50 PM · Datasets
zack added a subtask for T4685: license dataset: add logic to convert/import dataset into a SQL database: T4683: license dataset: use a consistent file format for CSV-like files.
Nov 14 2022, 4:50 PM · Datasets
zack triaged T4685: license dataset: add logic to convert/import dataset into a SQL database as Low priority.
Nov 14 2022, 4:49 PM · Datasets
zack changed the edit policy for P1529 import the license dataset into sqlite.
Nov 14 2022, 4:47 PM · Datasets
zack created P1529 import the license dataset into sqlite.
Nov 14 2022, 4:47 PM · Datasets
olasd triaged T4680: GitLab lister: allow ignoring origins contained in given namespace prefixes as Normal priority.
Nov 14 2022, 4:42 PM · Easy hack, Lister
olasd created P1528 Command-Line Input.
Nov 14 2022, 4:36 PM
anlambert updated the summary of D8840: maven: Add support for md5 checkums to check download integrity.
Nov 14 2022, 4:33 PM
anlambert requested review of D8840: maven: Add support for md5 checkums to check download integrity.
Nov 14 2022, 4:30 PM
vlorentz added a project to T4684: Publish scrubber metrics and create grafana dashboard: Datastore Scrubber.
Nov 14 2022, 4:22 PM · Datastore Scrubber
vlorentz claimed T4684: Publish scrubber metrics and create grafana dashboard.
Nov 14 2022, 4:22 PM · Datastore Scrubber
vlorentz triaged T4684: Publish scrubber metrics and create grafana dashboard as High priority.
Nov 14 2022, 4:22 PM · Datastore Scrubber
anlambert requested review of D8839: maven: Simplify tests with requests_mock_datadir fixture.
Nov 14 2022, 4:17 PM
lunar updated the summary of D8838: Use a volatile resource lock for host port 5080.
Nov 14 2022, 4:03 PM
lunar requested review of D8838: Use a volatile resource lock for host port 5080.
Nov 14 2022, 4:02 PM
swh-public-ci added a comment to D6380: Allow partial snapshot creation during ingestion.

Build is green

Nov 14 2022, 3:51 PM
ardumont retitled D6380: Allow partial snapshot creation during ingestion from Improve store_data implem to allow multiple calls with partial visit to Allow partial snapshot creation during ingestion.
Nov 14 2022, 3:51 PM
ardumont added a comment to D6380: Allow partial snapshot creation during ingestion.

Only more_data_to_fetch/create_snapshot is renamed, to create_partial_visit, as
that makes more sense now.

Nov 14 2022, 3:50 PM
ardumont updated the diff for D6380: Allow partial snapshot creation during ingestion.
  • Rebase
  • reword commit and diff description
  • adapt parameter according to review suggestion from @vlorentz
Nov 14 2022, 3:48 PM
bchauvet updated the task description for T4678: Automation of add forge now workflow.
Nov 14 2022, 3:20 PM · Add Forge Now
zack added a project to T4683: license dataset: use a consistent file format for CSV-like files: Datasets.
Nov 14 2022, 3:09 PM · Datasets
vlorentz added a comment to T4682: license dataset: missing java stuff from the replication package.

the replication/05-earliest-revision.sh script in the replication package mentions the swh-graph version it uses, and the fully qualified class name, so it can be found in the swh-graph code.

Nov 14 2022, 3:08 PM · Datasets
zack triaged T4683: license dataset: use a consistent file format for CSV-like files as Low priority.
Nov 14 2022, 3:05 PM · Datasets
zack triaged T4682: license dataset: missing java stuff from the replication package as Low priority.
Nov 14 2022, 2:45 PM · Datasets
anlambert requested review of D8837: api: Improve HTML documentation.
Nov 14 2022, 2:42 PM
anlambert closed D8836: browse: Use django FileResponse in browse-content-raw view.
Nov 14 2022, 2:27 PM
anlambert committed rDWAPPSad8558c69d88: browse: Use django FileResponse in browse-content-raw view (authored by anlambert).
browse: Use django FileResponse in browse-content-raw view
Nov 14 2022, 2:27 PM
vlorentz accepted D8836: browse: Use django FileResponse in browse-content-raw view.
Nov 14 2022, 2:26 PM
vlorentz added a task to D8836: browse: Use django FileResponse in browse-content-raw view: Unknown Object (Maniphest Task).
Nov 14 2022, 2:23 PM
zack closed D8835: changelog: document recent git loader speed improvements.

merged in abbcf03b7bb2f1425db154dbe6e43e10c647354c

Nov 14 2022, 2:08 PM
zack committed rDDOCabbcf03b7bb2: changelog: document recent git loader speed improvements (authored by zack).
changelog: document recent git loader speed improvements
Nov 14 2022, 2:07 PM
ardumont accepted D8832: luigi: Add DownloadFromS3 task.

one question inline.

Nov 14 2022, 2:02 PM
anlambert requested review of D8836: browse: Use django FileResponse in browse-content-raw view.
Nov 14 2022, 1:52 PM
olasd added a project to T4681: Add throttling/backoff to origin visit scheduler respawn logic: Easy hack.
Nov 14 2022, 1:49 PM · Easy hack, Scheduling utilities
olasd triaged T4681: Add throttling/backoff to origin visit scheduler respawn logic as Normal priority.
Nov 14 2022, 1:49 PM · Easy hack, Scheduling utilities
olasd placed T4681: Add throttling/backoff to origin visit scheduler respawn logic up for grabs.
Nov 14 2022, 1:49 PM · Easy hack, Scheduling utilities
swh-sentry-integration claimed T4681: Add throttling/backoff to origin visit scheduler respawn logic.
Nov 14 2022, 1:48 PM · Easy hack, Scheduling utilities
vlorentz added a comment to D8663: Hackage: Implement incremental mode.

One last thing: could you make tests check the request body is as expected? See https://requests-mock.readthedocs.io/en/latest/history.html

Nov 14 2022, 1:35 PM
olasd accepted D8835: changelog: document recent git loader speed improvements.

Thanks!

Nov 14 2022, 1:34 PM
vlorentz closed D8834: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks.
Nov 14 2022, 1:08 PM
vlorentz committed rDDOC072eeb4f771e: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks (authored by vlorentz).
roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks
Nov 14 2022, 1:08 PM
vlorentz added a comment to T4599: Github descriptions are not used to search origins.

swh-web uses swh-search as a glorified postgresql index: for every result returned by swh-search, it pulls the corresponding row from origin_intrinsic_metadata in the indexer database; which means it ignores extrinsic metadata.
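
The lookup described in this comment can be illustrated with a small sketch. It is not the swh-web implementation: search hits are modelled as a list of origin URLs, the indexer table as a dictionary keyed by URL, and both function names are hypothetical. The first function drops hits that have no intrinsic-metadata row, the second keeps them with empty metadata, which is one possible alternative.

```python
from typing import Dict, List, Optional


def join_strict(search_urls: List[str], intrinsic: Dict[str, dict]) -> List[dict]:
    """Keep only the hits that have a row in the intrinsic-metadata table."""
    return [
        {"url": url, "metadata": intrinsic[url]}
        for url in search_urls
        if url in intrinsic
    ]


def join_lenient(search_urls: List[str], intrinsic: Dict[str, dict]) -> List[dict]:
    """Keep every hit; metadata is None when the table has no row for it."""
    results = []
    for url in search_urls:
        metadata: Optional[dict] = intrinsic.get(url)
        results.append({"url": url, "metadata": metadata})
    return results


# Hypothetical data: two search hits, only one with intrinsic metadata.
hits = ["https://example.org/a", "https://example.org/b"]
rows = {"https://example.org/a": {"description": "a library"}}

print(len(join_strict(hits, rows)))
print(len(join_lenient(hits, rows)))
```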

Nov 14 2022, 1:07 PM · Metadata workflow, Archive search
zack requested review of D8835: changelog: document recent git loader speed improvements.
Nov 14 2022, 12:43 PM
vlorentz created P1527 (An Untitled Masterwork).
Nov 14 2022, 12:38 PM
ardumont accepted D8834: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks.
Nov 14 2022, 11:38 AM
vlorentz requested review of D8834: roadmap-2022: Replace seirl with vlorentz as lead on dataset/graph tasks.
Nov 14 2022, 11:37 AM
olasd added a revision to T4657: Allow object removal from journal: D8833: Add base functionality to support object deletion.
Nov 14 2022, 11:12 AM · Journal
olasd renamed T4680: GitLab lister: allow ignoring origins contained in given namespace prefixes from GitLab lister: allow ignoring origins contained in a given namespace to GitLab lister: allow ignoring origins contained in given namespace prefixes.
Nov 14 2022, 11:04 AM · Easy hack, Lister
olasd created T4680: GitLab lister: allow ignoring origins contained in given namespace prefixes.
Nov 14 2022, 11:03 AM · Easy hack, Lister
swh-public-ci added a comment to D8663: Hackage: Implement incremental mode.

Build is green

Nov 14 2022, 10:53 AM
franckbret added a comment to D8663: Hackage: Implement incremental mode.

buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.

Also, you should apply .astimezone(tz=timezone.utc) before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.

(Sorry for the back-and-forth; hopefully I'm done now.)
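
The two adjustments suggested here can be sketched as follows, assuming the lister state holds a timezone-aware datetime of the previous run and the remote API filters with a strict "greater than" on a date. The function name `incremental_lower_bound` and the sample timestamp are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def incremental_lower_bound(last_listing: datetime) -> date:
    """Compute the date to send with a strict 'uploaded after' filter.

    The timestamp is normalized to UTC before truncating to a date, then
    one day is subtracted so uploads made later on the same day as the
    previous run are not missed.
    """
    # Assumption: last_listing is timezone-aware; a naive datetime would be
    # interpreted as local time by astimezone().
    utc_timestamp = last_listing.astimezone(tz=timezone.utc)
    return utc_timestamp.date() - timedelta(days=1)


# Hypothetical value returned by the database with a +02:00 offset.
plus_two = timezone(timedelta(hours=2))
last_listing = datetime(2022, 11, 14, 1, 30, tzinfo=plus_two)
print(incremental_lower_bound(last_listing))  # 2022-11-12
```

Without the UTC conversion, the sample timestamp would truncate to 2022-11-14 instead of 2022-11-13, shifting the bound by a day.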

Nov 14 2022, 10:53 AM
franckbret updated the diff for D8663: Hackage: Implement incremental mode.

Use greater than or equal instead of strict comparison when building HTTP API query params for incremental listing

Nov 14 2022, 10:48 AM
franckbret abandoned D8824: Cpan: Implement incremental mode.

Abandon revision because in this case we cannot really benefit from an incremental mode

Nov 14 2022, 10:06 AM
franckbret added a comment to D8824: Cpan: Implement incremental mode.

@franckbret, as explained in my inline comment we cannot use the date filtering on the release index of CPAN elasticsearch.

The only incremental mode we can implement here is to filter the ListedOrigin instances sent to the scheduler according to the
last_update value: if it is greater than the date from the lister state, we can yield it.

Nevertheless, I am not sure if it is worth it as a full listing takes around 10 minutes, which is pretty fast.
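
A minimal sketch of the client-side filtering described above, assuming each listed origin carries a last_update timestamp and the lister state stores the date of the previous run. The `ListedOrigin` dataclass below is a simplified stand-in for the scheduler model, and `updated_since` and the sample URLs are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


@dataclass
class ListedOrigin:
    """Simplified stand-in for the scheduler model (assumption)."""

    url: str
    last_update: Optional[datetime]


def updated_since(
    origins: Iterable[ListedOrigin], last_listing: Optional[datetime]
) -> Iterator[ListedOrigin]:
    """Yield origins updated after the previous listing.

    Origins are always yielded on a first run (no state) or when their
    last_update is unknown, so nothing is silently skipped.
    """
    for origin in origins:
        if (
            last_listing is None
            or origin.last_update is None
            or origin.last_update > last_listing
        ):
            yield origin


# Hypothetical data for illustration.
state_date = datetime(2022, 11, 10, tzinfo=timezone.utc)
origins = [
    ListedOrigin("https://example.org/dist/Old", datetime(2022, 11, 1, tzinfo=timezone.utc)),
    ListedOrigin("https://example.org/dist/New", datetime(2022, 11, 14, tzinfo=timezone.utc)),
]
print([origin.url for origin in updated_since(origins, state_date)])
```

The full index is still fetched in this approach; only the number of origins sent to the scheduler shrinks, which is why the gain may be small when a full listing is already fast.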

Nov 14 2022, 10:04 AM
swh-public-ci added a comment to D8748: Nuget: Implement incremental listing.

Build is green

Nov 14 2022, 9:39 AM
franckbret closed D8748: Nuget: Implement incremental listing.
Nov 14 2022, 9:33 AM
franckbret committed rDLSea146ce297d5: Nuget: Implement incremental listing (authored by franckbret).
Nuget: Implement incremental listing
Nov 14 2022, 9:33 AM
franckbret updated the diff for D8748: Nuget: Implement incremental listing.

Rebase

Nov 14 2022, 9:33 AM

Nov 13 2022

zack committed rMSLD1c9d37a84694: biennale talk: last touches (authored by zack).
biennale talk: last touches
Nov 13 2022, 2:47 PM
bchauvet updated the task description for T4678: Automation of add forge now workflow.
Nov 13 2022, 1:17 PM · Add Forge Now