Oups, sorry, didn't mean to accept this, only to remove myself from reviewers.
I'll let @anlambert finish the actual review.
Mar 28 2019
Mar 26 2019
In D1295#27649, @zack wrote:
> or, actually, we can just also add a fulltext index to URLs and be done with it https://www.postgresql.org/docs/11/textsearch-intro.html#TEXTSEARCH-MATCHING
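For reference, the kind of full-text matching that suggestion points to could be sketched as follows. This is only an illustration: the `origin` table and `url` column names are assumptions, and the index would need to be created beforehand.

```sql
-- Hypothetical sketch of full-text search on origin URLs.
-- Assumes a table origin(id, url); uses the "simple" text search
-- configuration so URL components are kept as-is (no stemming).
CREATE INDEX IF NOT EXISTS origin_url_fulltext_idx
    ON origin USING gin (to_tsvector('simple', url));

SELECT url
FROM origin
WHERE to_tsvector('simple', url) @@ plainto_tsquery('simple', 'linux kernel');
```

URLs tokenize on punctuation under the `simple` configuration, so a query like the above matches origins whose URL contains both words anywhere in the path.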
In D1295#27648, @zack wrote:
> @anlambert given we have a trigram index on origin URLs, have you ever tried to use the various similarity operators documented at https://www.postgresql.org/docs/11/pgtrgm.html instead of generating all possible permutations for regexes?
> I'm assuming (probably too naively) that you can just do a big SELECT on the URLs, sorting by similarity and possibly filtering on a threshold to return meaningful results. But it's not like I've actually tested it…
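A sketch of what such a similarity-based query might look like, assuming `pg_trgm` is installed and a trigram index exists on the URL column (the `origin` table and `url` column names are illustrative):

```sql
-- Hypothetical sketch: rank origin URLs by trigram similarity to a search term.
-- Requires: CREATE EXTENSION pg_trgm;
-- and e.g.: CREATE INDEX origin_url_trgm_idx ON origin USING gin (url gin_trgm_ops);
SET pg_trgm.similarity_threshold = 0.3;  -- threshold used by the % operator

SELECT url, similarity(url, 'github.com/torvalds/linux') AS sim
FROM origin
WHERE url % 'github.com/torvalds/linux'   -- true when similarity exceeds the threshold
ORDER BY sim DESC
LIMIT 20;
```

The `%` operator lets the trigram index prune candidates before sorting, which is what makes this approach cheaper than scanning with generated regex permutations.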
Mar 25 2019
- swh-monthly-report: helper script to draft monthly activity team reports
- swh-monthly-report: filter on committer date
- swh-weekly-report: filter on committer date
Mar 24 2019
Sure, just go ahead: there is no need to "reserve" tasks as a prerequisite to work on them. Just submit a diff against the lister repo when you have something ready to review :-)
Mar 22 2019
- swhphab.py: do not crash when printing summary of repo-less diffs
- swhphab.py: include status when printing task summaries
- swh-weekly-report: further refactoring/clean-up against swhphab.py
- swh-weekly-report: split generic code to swhphab.py
Mar 20 2019
Mar 18 2019
Mar 17 2019
here's an old mail of mine to -devel with additional context:
Mar 15 2019
@vlorentz: lather, rinse, repeat.
softwareheritage-indexer=# DELETE FROM revision_intrinsic_metadata WHERE metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;
ERROR:  deadlock detected
DETAIL:  Process 23900 waits for ShareLock on transaction 212164862; blocked by process 20175.
Process 20175 waits for ShareLock on transaction 212164381; blocked by process 23900.
HINT:  See server log for query details.
CONTEXT:  while deleting tuple (772424,55) in relation "revision_intrinsic_metadata"
Time: 33048,828 ms (00:33,049)
(this just happened, after the indexers were restarted with D1218 included)
In D1248#26555, @vlorentz wrote:
> This function outputs JSON-LD arrays, which are unordered.
I don't think it's useful to deduplicate, as these keywords are written by a human, so duplicates would be intentional.
Mar 12 2019
Unless I'm missing something, this was completed a while ago (if not, please reopen, ideally adding the relevant open sub-task).
Mar 11 2019
Contact information is available on our GSoC wiki page (which is in turn linked from the GSoC portal).
Mar 9 2019
In T1349#29267, @Sowmya wrote:
> can I get this task assigned by the administrator?
Mar 8 2019
Mar 6 2019
Thanks for this doc refactoring too!
Great, thanks for this doc refactoring!
Mar 4 2019
the job offer is now completely gone from the English page :-(
In T1549#29103, @vlorentz wrote:
> Once this is landed and deployed, ordering your DELETEs by revision_metadata.id will acquire locks in the same order as the idx_storage, solving the deadlock issue.
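One way to sketch the pattern being described here (table and column names taken from the queries elsewhere in this thread; the exact form is an assumption, not the deployed fix) is to lock and delete the matching rows in a deterministic id order, so concurrent transactions cannot acquire the same row locks in opposite orders:

```sql
-- Sketch: delete matching rows in ascending id order so all transactions
-- take row locks in the same order, avoiding ABBA deadlocks.
DELETE FROM revision_metadata
WHERE id IN (
    SELECT id
    FROM revision_metadata
    WHERE translated_metadata =
          '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb
    ORDER BY id
    FOR UPDATE
);
```

The inner `SELECT ... ORDER BY id FOR UPDATE` is what enforces the lock acquisition order; a bare `DELETE` gives no ordering guarantee on its own.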
Mar 2 2019
The update completed, but a first attempt at the second DELETE failed with a deadlock (?!):
softwareheritage-indexer=# DELETE FROM revision_metadata WHERE translated_metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;
ERROR:  deadlock detected
DETAIL:  Process 10966 waits for ShareLock on transaction 197265813; blocked by process 11754.
Process 11754 waits for ShareLock on transaction 197264487; blocked by process 10966.
HINT:  See server log for query details.
CONTEXT:  while deleting tuple (1380733,15) in relation "revision_metadata"
Time: 170864,091 ms (02:50,864)
The following fix for the above (suggested by @vlorentz ) is now running:
UPDATE revision_metadata
SET translated_metadata = origin_intrinsic_metadata.metadata
FROM origin_intrinsic_metadata
WHERE revision_metadata.id = origin_intrinsic_metadata.from_revision
  AND revision_metadata.translated_metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'
  AND origin_intrinsic_metadata.metadata != '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}';
Mar 1 2019
As discussed on IRC, even after cleaning up origin_intrinsic_metadata, the DELETE on revision_metadata fails with:
softwareheritage-indexer=# DELETE FROM revision_metadata
softwareheritage-indexer-# WHERE translated_metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;
I've started the first of the following queries on somerset (in a screen session under my user):
DELETE FROM origin_intrinsic_metadata WHERE metadata = '{"@context": "https://doi.org/10.5063/schema/codemeta-2.0"}'::jsonb ;
LGTM, please just add a comment on the test case (as discussed in the review) before landing
Feb 28 2019
great, thanks!
@anlambert: did you follow up on-list and/or with @singpolyma about the solution you've adopted?
Feb 26 2019
I don't understand your comment. What are the remaining arguments for using NULL instead of just deleting rows?
Yes :-)
So, do we agree that the right fix for this task is just to get rid of the empty-ish rows? Or are there other arguments we haven't considered yet?
In T1549#28882, @vlorentz wrote:
> What is the provenance map?
My tentative proposal is to delete all table entries for which no metadata has been found.
The invariant will be: if an origin/revision has metadata, there will be an entry in the table(s); if not, the origin/revision will not appear.
(but of course if you want to have a CLI tool to generate the info, sure; I just wanted to highlight here that the end goal is the doc)
More than a CLI tool, I'd like to have documentation about how to use the CodeMeta metadata that we extract, sort of "typing information" for the content of the various intrinsic metadata tables.
It might be something as simple as:
@anlambert recently added a list origins method to the Web API. I'm pinging him here to make sure there is no overlap and/or that there is code to be reused/refactored related to this proposed change.
Feb 25 2019
That's the impression I got from testing. Either way, the current UI and semantics are bad; the proposed ones would be much better.
Feb 23 2019
I've added an item to the above list (metadata-only search); I think the ideal UI would be a single form with two checkboxes under it, one enabling URL-based search (enabled by default), one enabling metadata-based search (disabled by default).