The context is loosely based on SWHID qualifiers, but not a direct
translation (revision + release instead of anchor, visit actually
references a visit, and extra snapshot key).
Details
Diff Detail
- Repository
- rDSTO Storage manager
- Branch
- impl-origin-metadata-spec
- Lint
No Linters Available - Unit
No Unit Test Coverage - Build Status
Buildable 12415 Build 18838: Phabricator diff pipeline on jenkins Jenkins console · Jenkins Build 18837: arc lint + arc unit
Event Timeline
Build is green
Patch application report for D3154 (id=11200)
Rebasing onto 306aa69b26...
Current branch diff-target is up to date.
Changes applied before test
commit 35284ff11b80e9f2f2a2196be8c4fae1ad59c997 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu May 14 14:50:46 2020 +0200 Add artifact metadata to the extrinsic metadata storage specification.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/172/ for more details.
Question: can we support both context-full and context-less metadata for arbitrary artifacts?
There are metadata that will be independent of the context, but that we still don't want to store in the main storage. If we don't support that here I'm not sure where we will store those.
I don't know if the proposal is as simple as supporting a NULL/None value for context here, or if we should rather duplicate the methods you're proposing (one set for context-full, one for context-less). Either approach would be fine with me.
I don't think it makes sense to make this difference, because if a metadata only has origin in its context, it means it's context-free on all "axis" but the origin already. Having also no origin context is just one more axis it would be context-free on.
This is a great idea to add context.
I always thought the multiplicity will be from having a lot of authorities saying different things about artifacts.
Haven't realized that the same object in a different context can have in itself different metadata.
The question will remain, about how will we show the metadata.
I've added some changes to clarify the need of context without excluding the option to add metadata without context.
docs/extrinsic-metadata-specification.rst | ||
---|---|---|
194 | I would prefer saying that:
and add the note:
| |
197 | I don't understand that.. which authorship? of the file? I would change this example because it opens question on authorship that are not in the scope of this documentation. If we have a HAL deposit on an artifact developed in GitHub, the deposit metadata is enhancing the information we might have on the repository (it might not be in contradiction to the one on GitHub). So I would say,
| |
200 | I think that Therefore is not needed. | |
212 | do we get all metadata on the identified artifact, if the context is empty? | |
222 | typo a URL instead of an URL |
Thanks for sharing this: we're definitely going forward!
Comments are in the text, but overall:
- agree that we need a way to store and retrieve metadata that is contextless (or... valid in all contexts)
- for the context, we need to use the SHWIDs themselves, not the sha1_git that is bound to version 1 of SWHIDS
All the rest LGTM.
docs/extrinsic-metadata-specification.rst | ||
---|---|---|
194 | I agree that we want to be able to store metadata that does not depend on a context. Using an empty context field, or a special distinguished value for designating "all contexts" fills the bill. | |
197 | Agree with this clarification of the text. This is why it is important to qualify the metadata with the complete context for which it is inteded. | |
212 | I agree it is important to be able to formulate a query that allows to request the metadata available for all contexts. | |
222 | For the snapshot key in the dictionary, use the snapshot core SWHID, not the visit integer (that is an internal implementation detail) | |
223–228 | Here also, we do not want to commit to using sha1_git, as in the near future we might need to store metadata for SWHIDs that use different hashes. Let's use "(a release SWHID)", "(a revision SWHID)" etc. |
for the context, we need to use the SHWIDs themselves, not the sha1_git that is bound to version 1 of SWHIDS
What's the difference? When we'll use something other than sha1_git we'll just migrate this data with the rest.
docs/extrinsic-metadata-specification.rst | ||
---|---|---|
212 | Sorry, that's a typo. I didn't mean to add this context argument. |
docs/extrinsic-metadata-specification.rst | ||
---|---|---|
222 | What do you mean? Make snapshots have a snapshot context? |
I agree we should use SWHID everywhere. We should've used them in the beginning everywhere in fact, but now we should organically migrate to them.
There are many reasons for this, but one specific to this task is that it will make exports of this DB self-contained and meaningful, only depending on a publicly available spec, while using sub-parts of SWHID will lose that property.
Build is green
Patch application report for D3154 (id=11226)
Rebasing onto 6d24ed721a...
Current branch diff-target is up to date.
Changes applied before test
commit ab45de192802e92ec24845ac95c87464475a0b91 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu Apr 9 11:46:21 2020 +0200 Implement extrinsic origin metadata specification. https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/176/ for more details.
Notice also that if/when we introduce new versions of SWHIDs with other
hashing algorithms, we need to maintain backward compatibility.
Metadata introduced with a given context must not be "migrated" to the new
SWHIDs, but "duplicated" for the new SWHIDs.
Better think right now of a schema that allows to "share" metadata payloads
among multiple versions of SWHIDs without actually copying them over.
Le sam. 16 mai 2020 à 16:24, zack (Stefano Zacchiroli) <
forge@softwareheritage.org> a écrit :
zack added a comment. View Revision
https://forge.softwareheritage.org/D3154In D3154#76782 https://forge.softwareheritage.org/D3154#76782, @vlorentz
https://forge.softwareheritage.org/p/vlorentz/ wrote:for the context, we need to use the SHWIDs themselves, not the sha1_git
that is bound to version 1 of SWHIDSWhat's the difference? When we'll use something other than sha1_git we'll
just migrate this data with the rest.I agree we should use SWHID everywhere. We should've used them in the
beginning everywhere in fact, but now we should organically migrate to them.
There are many reasons for this, but one specific to *this* task is that
it will make exports of this DB self-contained and meaningful, only
depending on a publicly available spec, while using sub-parts of SWHID will
lose that property.*REPOSITORY*
rDSTO Storage manager*CHANGES SINCE LAST ACTION*
https://forge.softwareheritage.org/D3154/new/*REVISION DETAIL*
https://forge.softwareheritage.org/D3154*EMAIL PREFERENCES*
https://forge.softwareheritage.org/settings/panel/emailpreferences/*To: *vlorentz, Reviewers, moranegg, rdicosmo
*Cc: *zack
Build is green
Patch application report for D3154 (id=11228)
Rebasing onto 6d24ed721a...
Current branch diff-target is up to date.
Changes applied before test
commit ab45de192802e92ec24845ac95c87464475a0b91 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu Apr 9 11:46:21 2020 +0200 Implement extrinsic origin metadata specification. https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/178/ for more details.
I applied most of your comments, except:
For the snapshot key in the dictionary, use the snapshot core SWHID, not the visit integer (that is an internal implementation detail)
because I don't understand it.
Also, should the first argument of <X>_metadata_add and <X>_metadata_get be a core SWHID?
Build is green
Patch application report for D3154 (id=11229)
Rebasing onto 6d24ed721a...
First, rewinding head to replay your work on top of it... Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit bb531dda9c4707a41fdb828a3eb4c427b49e17fc Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu May 14 14:50:46 2020 +0200 Add artifact metadata to the extrinsic metadata storage specification.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/179/ for more details.
Build is green
Patch application report for D3154 (id=11230)
Rebasing onto 6d24ed721a...
First, rewinding head to replay your work on top of it... Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit 00b0b8c8e5a061c8699b407026107ea8a252c4e1 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu May 14 14:50:46 2020 +0200 Add artifact metadata to the extrinsic metadata storage specification.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/180/ for more details.
Great!
except:
For the snapshot key in the dictionary, use the snapshot core SWHID, not the visit integer (that is an internal implementation detail)
because I don't understand it.
I removed that comment, it was incorrect: sorry for the noise.
Also, should the first argument of <X>_metadata_add and <X>_metadata_get be a core SWHID?
Yes, please!
Build is green
Patch application report for D3154 (id=11235)
Rebasing onto 6d24ed721a...
First, rewinding head to replay your work on top of it... Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit a38509ce2449e1d2f2bef0012fae95741dab43e5 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu May 14 14:50:46 2020 +0200 Add artifact metadata to the extrinsic metadata storage specification.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/182/ for more details.
Looks good to me (1 typo in the comments above).
Note: I understand the functional reason behind wanting to store SWHID.
Just mentioning that techically, that might make things harder if we want to
join (sql speaking) between tables. That means those "join" will have to take
place python side. That might be what we want in the end as we have other
storage backends where join might not be possible (e.g cassandra comes to
mind).
docs/extrinsic-metadata-specification.rst | ||
---|---|---|
235 | SWHID of a directory |
Build is green
Patch application report for D3154 (id=11265)
Rebasing onto f1b51a7a2b...
First, rewinding head to replay your work on top of it... Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit db1ecad241db08be6924bd69769e324562574241 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu May 14 14:50:46 2020 +0200 Add artifact metadata to the extrinsic metadata storage specification.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/188/ for more details.
Adding a last comment after a last reread
docs/extrinsic-metadata-specification.rst | ||
---|---|---|
228 | It seems that the XXXX in "(the core SWHID of a XXXX)" are off by one in lines 228, 230, 232: in 228 XXXX should be snapshot, not release, etc. |
Build is green
Patch application report for D3154 (id=11293)
Rebasing onto f1b51a7a2b...
First, rewinding head to replay your work on top of it... Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit 2e574d33b74fa7dff700757f9b72f32a435a47b1 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu May 14 14:50:46 2020 +0200 Add artifact metadata to the extrinsic metadata storage specification.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/189/ for more details.
Looks good!
I added a comment about the an URL typo which we discussed on IRC (and decided it's a URL because we trust Wikipedia) :-)
docs/extrinsic-metadata-specification.rst | ||
---|---|---|
226 | typo: an URL should be a URL |
Build is green
Patch application report for D3154 (id=11302)
Rebasing onto f1b51a7a2b...
First, rewinding head to replay your work on top of it... Applying: Add artifact metadata to the extrinsic metadata storage specification.
Changes applied before test
commit ffea7a4d2801ce1b93a26d360f9bbc3704840b64 Author: Valentin Lorentz <vlorentz@softwareheritage.org> Date: Thu May 14 14:50:46 2020 +0200 Add artifact metadata to the extrinsic metadata storage specification.
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/190/ for more details.
Build is green
Patch application report for D3154 (id=11303)
Rebasing onto 213f1b1239...
First, rewinding head to replay your work on top of it... Fast-forwarded diff-target to base-revision-191-D3154.
Changes applied before test
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/191/ for more details.