Page MenuHomeSoftware Heritage

Origin metadata pipeline.
AbandonedPublic

Authored by vlorentz on Oct 15 2018, 6:32 PM.

Details

Reviewers
moranegg
Group Reviewers
Reviewers

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
origin-metadata
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1661
Build 2005: arc lint + arc unit

Event Timeline

  • Fix race condition in tests (???????)
  • Make the OriginMetadataIndexer work.
  • Copy revision-metadata-related IndexerStorage and SQL stuff to work with origin-metadata as well.
vlorentz retitled this revision from [WIP] Origin metadata pipeline. to Origin metadata pipeline..Oct 18 2018, 5:41 PM
vlorentz edited the summary of this revision. (Show Details)

Did you change the swh-schema.sql, I can't see it in this diff..?

sql/json/origin_metadata.translated_metadata.json
37 ↗(On Diff #1700)

type was deleted from the revision metadata set so it shouldn't be here either.

sql/swh-indexes.sql
36 ↗(On Diff #1700)

didn't we call the table origin_intrinsic_metadata so it won't have ambiguous connotation with the main db table origin_metadata?

swh/indexer/storage/__init__.py
510 ↗(On Diff #1700)

is it necessary to know that the metadata for an origin is missing?
as we need to compute every time there is a new most_recent_HEAD_revision the metadata associated

516 ↗(On Diff #1700)

an origin can't be identified by sha1
is it the origin_id?

556 ↗(On Diff #1700)

sha1_git is not the origin identifier

moranegg added a subscriber: ardumont.

The last comment was submitted rapidly, sorry..
It looks good, but I can't review tasks.py and other technical changes to the content indexers, so i'm pinging @ardumont to do a review as well.

when running the tests I have errors, one of which looks like a configuration on my side:

Unrecoverable error: ImproperlyConfigured('You need to install the redis library in order to use the Redis result store backend.'

If have a clue how to fix this, I'll be happy to hear.

Regarding the naming of the origin_metadata table, I don't mind using a different name than origin_intrinsic_metadata, but it shouldn't be origin_metadata.
I do think that origin_intrinsic_metadata is better than origin_metadata_translation.

And of course the diff on the swh-schema;sql is missing from this diff and all naming according to the table created should be adapted to the table name.

This revision now requires changes to proceed.Oct 19 2018, 11:06 AM

Another question, did we choose to have the revision_id as a column of the origin_intrinsic_metadata table? or is it in the translated_metadata?

Did you change the swh-schema.sql, I can't see it in this diff..?

I didn't have to, there's already a origin_metadata_translation table. Looks like @ardumont foresaw our need ^^

Did you change the swh-schema.sql, I can't see it in this diff..?

I didn't have to, there's already a origin_metadata_translation table. Looks like @ardumont foresaw our need ^^

I added this table, when we wanted to translate the origin_metadata we got when listing/loading origins or when getting metadata from catalogs and registries about an origin.
This is why the origin_metadata name is so ambiguous, because origin_metadata is the table where we want to store all raw metadata we capture (for now only metadata we get with deposits)
For more info see T833, T795 & T737.

If you prefer using the name origin_metadata_translation, you should change all naming in this diff to origin_metadata_translation.

Furthermore, I think we specified with @zack different columns than the ones already created.

The last comment was submitted rapidly, sorry..

(just in case: you can edit commits after submission)

Regarding the naming of the origin_metadata table, I don't mind using a different name than origin_intrinsic_metadata, but it shouldn't be origin_metadata.
I do think that origin_intrinsic_metadata is better than origin_metadata_translation.

Yep. Also, that's the name we agreed upon in our meeting, isn't it? I'm not sure why we should change it now.
(And "translation" is something specific to the way we obtain the information, which makes for bad names, as it is meaningless for the consumers of the information stored in the table.)

And of course the diff on the swh-schema;sql is missing from this diff and all naming according to the table created should be adapted to the table name.

Hear, hear. I was waiting for a schema diff to have this discussion.

  • s/origin_metadata/origin_intrinsic_metadata/g

when running the tests I have errors, one of which looks like a configuration on my side:

Unrecoverable error: ImproperlyConfigured('You need to install the redis library in order to use the Redis result store backend.'

If have a clue how to fix this, I'll be happy to hear.

I had to make the tests use Redis. I don't know why, but they don't work with the cache+memory:// backend. Ping @douardda

Regarding the naming of the origin_metadata table, I don't mind using a different name than origin_intrinsic_metadata, but it shouldn't be origin_metadata.
I do think that origin_intrinsic_metadata is better than origin_metadata_translation.

Indeed. Done.

And of course the diff on the swh-schema;sql is missing from this diff and all naming according to the table created should be adapted to the table name.

Done.

vlorentz marked 4 inline comments as done.
  • Apply changes suggested by @moranegg.
  • s/origin_intrinsic_metadata_translation/origin_intrinsic_metadata/g
swh/indexer/storage/__init__.py
510 ↗(On Diff #1700)

I don't know, but it can't hurt to have this endpoint available

sql/json/origin_metadata.translated_metadata.json
37 ↗(On Diff #1700)

reping :-)
type isn't a property we need/should list in the metadata.

sql/swh-func.sql
365 ↗(On Diff #1704)

this might change with the refactoring of the origin_intrinsic_metadata table

sql/swh-indexes.sql
36 ↗(On Diff #1700)

cool you changed the name.
Now the question is about the PKEY, is it only the origin_id or not?
Do we want to keep different origin_intrinsic_metadata for different revisions/tools?
Or just one most recent copy for the most recent tool used?

pinging @zack on this question.

sql/swh-schema.sql
129 ↗(On Diff #1704)

you should delete all translation mentions

130 ↗(On Diff #1704)

It might be more useful if more explicit
id -> origin_id

*Note that with the origin_metadata_translation the id was the object_id of an origin_metadata entry

131 ↗(On Diff #1704)

suggestion: result-> revision_metadata ?

132 ↗(On Diff #1704)

the long lasting question, is it a indexer_configuration_tool for origins or the one used on the revision?
if it's the one used in the revision, maybe revision_tool_id is more explicit

136 ↗(On Diff #1704)

view comment in 130

137 ↗(On Diff #1704)

not specific to origin_metadata
suggestion :

copy of the most recent revision intrinsic_metadata found with tool
138 ↗(On Diff #1704)

depends on comment in 132
is it the tool used for translation or for the detection of the most recent revision

sql/upgrades/121.sql
1 ↗(On Diff #1704)

Not commenting , but this should be changed with the other changes on sql files

swh/indexer/indexer.py
378

ping @ardumont
I remember that there was a reason why there is no return to run, but maybe I'm wrong

swh/indexer/metadata.py
273

Can you add explicit docs string about the arguments:

  • revisions_metadata
  • origin_head_pairs
280

seems a bit weird that the ids in this current run method are revision_ids
and the ones sent to super() are the origin_ids

more human comments about this would be nice :-)

swh/indexer/tests/test_origin_metadata.py
29

I'm not sure what tool should be used for the origin_intrinsic _metadata
the tool used for the revision is a file detector and a metadata translator to CodeMeta
in the future we might want to use a different vocabulary, a different minimal set and maybe add files to the file detection procedure
so a revision can be indexed with different tool and have different results

That's why we said it is important to have the exact revision_metadata entry identifierd in the origin_intrinsic_metadata table.
I'm not sure if this should be with the tuple <revision_id, revision_tool_id> or create a revision_metadata object_id because at the moment there is no identifier on a revision_metadata entry.

77

I think it's important to specify what type of object this metadata describes:
content_metadata
revision_metadata
origin_intrinsic_metadata

If this result is the expected result of an origin_intrinsic_metadata, I think there is missing information.

113

here both entries have the same result but shouldn't

swh/indexer/tests/test_utils.py
41 ↗(On Diff #1704)

i'd like to see a test with one of the origins here.
But you might prefer to choose for your test a real url where there is in the root directory a package.json file
for example : https://github.com/librariesio/yarn-parser

swh/indexer/indexer.py
378

Previously next_step did not return anything (triggered a new delayed task - mimetype indexer towards orchestrator text, and that was all IIRC).

I think that changes in that diff (yes, take a look at the OriginHeadIndexer's implementation below).

To sum up, as usual, it's ok to adapt if it's needed.

vlorentz added inline comments.
sql/json/origin_metadata.translated_metadata.json
37 ↗(On Diff #1700)

Hmm? I did remove it

sql/swh-indexes.sql
36 ↗(On Diff #1700)

Now the question is about the PKEY, is it only the origin_id or not?

Yes, it is.

Do we want to keep different origin_intrinsic_metadata for different revisions/tools?

We decided we wouldn't

sql/swh-schema.sql
129 ↗(On Diff #1704)

you should delete all translation mentions

Already done

swh/indexer/metadata.py
280

the ids in this current run method are revision_ids

They are not revision ids, they are the metadata itself. That's because they come from a previous Celery task, and Celery forces it to be a first argument... :/

more human comments about this would be nice :-)

Sure, will do

swh/indexer/tests/test_utils.py
41 ↗(On Diff #1704)

i'd like to see a test with one of the origins here.

What do you mean?

sql/json/origin_metadata.translated_metadata.json
37 ↗(On Diff #1700)

cool, lag between your new commit and my submition

swh/indexer/tests/test_utils.py
41 ↗(On Diff #1704)

here you have a list of origins in the MockStorage for tests.
I wasn't sure you used on of them for the OriginMetadataIndexer.
But now I see that you have.

I proposed also a real life origin url that should result with metadata with the entire pipeline.

vlorentz marked an inline comment as done.

Abandoning in favor of D557 and D558. This Diff was getting too big and was full of bugs.