Diff Detail
- Repository
- rDCIDX Metadata indexer
- Branch
- origin-metadata
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 1689 Build 2033: arc lint + arc unit
Event Timeline
- Copy revision-metadata-related IndexerStorage and SQL stuff to work with origin-metadata as well.
Did you change the swh-schema.sql, I can't see it in this diff..?
sql/json/origin_metadata.translated_metadata.json | ||
---|---|---|
37 | type was deleted from the revision metadata set so it shouldn't be here either. | |
sql/swh-indexes.sql | ||
35 | didn't we call the table origin_intrinsic_metadata so it won't have ambiguous connotation with the main db table origin_metadata? | |
swh/indexer/storage/__init__.py | ||
509 | is it necessary to know that the metadata for an origin is missing? | |
515 | an origin can't be identified by sha1 | |
555 | sha1_git is not the origin identifier |
The last comment was submitted rapidly, sorry..
It looks good, but I can't review tasks.py and other technical changes to the content indexers, so i'm pinging @ardumont to do a review as well.
when running the tests I have errors, one of which looks like a configuration on my side:
Unrecoverable error: ImproperlyConfigured('You need to install the redis library in order to use the Redis result store backend.'
If have a clue how to fix this, I'll be happy to hear.
Regarding the naming of the origin_metadata table, I don't mind using a different name than origin_intrinsic_metadata, but it shouldn't be origin_metadata.
I do think that origin_intrinsic_metadata is better than origin_metadata_translation.
And of course the diff on the swh-schema;sql is missing from this diff and all naming according to the table created should be adapted to the table name.
Another question, did we choose to have the revision_id as a column of the origin_intrinsic_metadata table? or is it in the translated_metadata?
I didn't have to, there's already a origin_metadata_translation table. Looks like @ardumont foresaw our need ^^
I added this table, when we wanted to translate the origin_metadata we got when listing/loading origins or when getting metadata from catalogs and registries about an origin.
This is why the origin_metadata name is so ambiguous, because origin_metadata is the table where we want to store all raw metadata we capture (for now only metadata we get with deposits)
For more info see T833, T795 & T737.
If you prefer using the name origin_metadata_translation, you should change all naming in this diff to origin_metadata_translation.
Furthermore, I think we specified with @zack different columns than the ones already created.
(just in case: you can edit commits after submission)
Regarding the naming of the origin_metadata table, I don't mind using a different name than origin_intrinsic_metadata, but it shouldn't be origin_metadata.
I do think that origin_intrinsic_metadata is better than origin_metadata_translation.
Yep. Also, that's the name we agreed upon in our meeting, isn't it? I'm not sure why we should change it now.
(And "translation" is something specific to the way we obtain the information, which makes for bad names, as it is meaningless for the consumers of the information stored in the table.)
And of course the diff on the swh-schema;sql is missing from this diff and all naming according to the table created should be adapted to the table name.
Hear, hear. I was waiting for a schema diff to have this discussion.
I had to make the tests use Redis. I don't know why, but they don't work with the cache+memory:// backend. Ping @douardda
Regarding the naming of the origin_metadata table, I don't mind using a different name than origin_intrinsic_metadata, but it shouldn't be origin_metadata.
I do think that origin_intrinsic_metadata is better than origin_metadata_translation.
Indeed. Done.
And of course the diff on the swh-schema;sql is missing from this diff and all naming according to the table created should be adapted to the table name.
Done.
- Apply changes suggested by @moranegg.
- s/origin_intrinsic_metadata_translation/origin_intrinsic_metadata/g
swh/indexer/storage/__init__.py | ||
---|---|---|
509 | I don't know, but it can't hurt to have this endpoint available |
sql/json/origin_metadata.translated_metadata.json | ||
---|---|---|
37 | reping :-) | |
sql/swh-func.sql | ||
364 | this might change with the refactoring of the origin_intrinsic_metadata table | |
sql/swh-indexes.sql | ||
35 | cool you changed the name. pinging @zack on this question. | |
sql/swh-schema.sql | ||
129 | It might be more useful if more explicit *Note that with the origin_metadata_translation the id was the object_id of an origin_metadata entry | |
129–130 | you should delete all translation mentions | |
131 | suggestion: result-> revision_metadata ? | |
132 | the long lasting question, is it a indexer_configuration_tool for origins or the one used on the revision? | |
135 | view comment in 130 | |
136 | not specific to origin_metadata copy of the most recent revision intrinsic_metadata found with tool | |
137 | depends on comment in 132 | |
sql/upgrades/121.sql | ||
1 | Not commenting , but this should be changed with the other changes on sql files | |
swh/indexer/indexer.py | ||
379 | ping @ardumont | |
swh/indexer/metadata.py | ||
273 | Can you add explicit docs string about the arguments:
| |
280 | seems a bit weird that the ids in this current run method are revision_ids more human comments about this would be nice :-) | |
swh/indexer/tests/test_origin_metadata.py | ||
29 | I'm not sure what tool should be used for the origin_intrinsic _metadata That's why we said it is important to have the exact revision_metadata entry identifierd in the origin_intrinsic_metadata table. | |
77 | I think it's important to specify what type of object this metadata describes: If this result is the expected result of an origin_intrinsic_metadata, I think there is missing information. | |
113 | here both entries have the same result but shouldn't | |
swh/indexer/tests/test_utils.py | ||
40 | i'd like to see a test with one of the origins here. |
swh/indexer/indexer.py | ||
---|---|---|
379 | Previously next_step did not return anything (triggered a new delayed task - mimetype indexer towards orchestrator text, and that was all IIRC). I think that changes in that diff (yes, take a look at the OriginHeadIndexer's implementation below). To sum up, as usual, it's ok to adapt if it's needed. |
sql/json/origin_metadata.translated_metadata.json | ||
---|---|---|
37 | Hmm? I did remove it | |
sql/swh-indexes.sql | ||
35 |
Yes, it is.
We decided we wouldn't | |
sql/swh-schema.sql | ||
129–130 |
Already done | |
swh/indexer/metadata.py | ||
280 |
They are not revision ids, they are the metadata itself. That's because they come from a previous Celery task, and Celery forces it to be a first argument... :/
Sure, will do | |
swh/indexer/tests/test_utils.py | ||
40 |
What do you mean? |
sql/json/origin_metadata.translated_metadata.json | ||
---|---|---|
37 | cool, lag between your new commit and my submition | |
swh/indexer/tests/test_utils.py | ||
40 | here you have a list of origins in the MockStorage for tests. I proposed also a real life origin url that should result with metadata with the entire pipeline. |