I.e., along with the CodeMeta translation, store either the type of file(s) the metadata were extracted from (package.json, pom.xml, PKG-INFO, ...) or the mapping used for the translation.
Description
Status | Assigned | Task
---|---|---
Migrated | gitlab-migration | T1485 Show stats on extracted metadata
Migrated | gitlab-migration | T1484 Provide stats on extracted metadata in the indexer storage api
Migrated | gitlab-migration | T1483 Store the type of intrinsic metadata that were extracted
Event Timeline
As some revisions/origins may have more than one metadata file (in which case we merge them), there is a many-to-many (m2m) relation between revision/origin metadata rows and mappings. I see three ways to do it:
- 1. A table storing mapping names, and a table to store the m2m relation for each of {content,revision,origin_intrinsic}_metadata
- 2. On each row, store an array of strings, each of which is the name of a mapping
- 3a. A table storing mapping names, and on each row of the metadata table, store an array of ids in this table
- 3b. Same as 3a, but exposing mapping ids in the API instead of joining
Issues with each of these, for existing queries (add + get):
- 1. Code complexity (handling an m2m relation table) and storage space (with the current schema, the m2m table would need to store a sha1)
- 2. Storage space (~16 bytes per string, not counting the array overhead)
- 3a. Requires joining on array, which might be very inefficient
- 3b. Not nice for consumers of the API.
Obviously we also want to run queries on these. Right now, the only one I can imagine is counting metadata from each mapping (T1484). Even with an index, I think all of these options would be equally slow (full scan of all matching rows).
(2) seems the best option to me.
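To make option (2) and the counting query from T1484 concrete, here is a minimal sketch in plain Python. It is not the indexer storage code: the row layout, the `mappings` key, the mapping names and the `count_by_mapping` helper are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical rows of a metadata table under option (2): each row holds the
# merged metadata plus an array of the names of the mappings that produced it.
rows = [
    {"id": "rev1", "metadata": {"name": "foo"}, "mappings": ["npm"]},
    {"id": "rev2", "metadata": {"name": "bar"}, "mappings": ["maven", "pkg-info"]},
    {"id": "rev3", "metadata": {"name": "baz"}, "mappings": ["npm", "pkg-info"]},
]


def count_by_mapping(rows):
    """Count rows per mapping name by scanning every row once."""
    counts = Counter()
    for row in rows:
        # set() guards against the same mapping being listed twice on a row.
        counts.update(set(row["mappings"]))
    return counts


counts = count_by_mapping(rows)
```

The count visits every row, which matches the full-scan cost mentioned above; a row merged from two metadata files contributes to two mappings.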
(Assuming I'm reading it right. Before reading your proposal, I was thinking that the best option would be to store the list of files that have been consulted/parsed to obtain a result in the same table that holds the results of metadata extraction. This is what (2) is about, right? If so, I confirm my "vote" :-))
You are correct, except I will store mapping names, not file names (e.g. because gemspec files are usually named project_name.gemspec, which is harder to query).
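As a rough illustration of why a mapping name is easier to query than a file name, here is a sketch that resolves file names to mapping names. The mapping names, the patterns and the `mapping_for` helper are hypothetical and not taken from the indexer.

```python
import fnmatch

# Hypothetical mapping names and the file name patterns they would handle.
FILENAME_PATTERNS = {
    "npm": ["package.json"],
    "maven": ["pom.xml"],
    "pkg-info": ["PKG-INFO"],
    "gemspec": ["*.gemspec"],  # file name varies per project
}


def mapping_for(filename):
    """Return the name of the first mapping whose pattern matches, else None."""
    for name, patterns in FILENAME_PATTERNS.items():
        if any(fnmatch.fnmatchcase(filename, p) for p in patterns):
            return name
    return None
```

With names like these, every gemspec-derived row carries the same string regardless of the project name, so counting them is an equality test and not a pattern match.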