Store the type of intrinsic metadata that were extracted
Closed, Migrated

Description

I.e., along with the CodeMeta translation, store either the type of file(s) the metadata was extracted from (package.json, pom.xml, PKG-INFO, ...) or the name of the mapping used for the translation.

Event Timeline

vlorentz triaged this task as Normal priority. Jan 21 2019, 11:37 AM
vlorentz created this task.

As some revisions/origins may have more than one metadata file (in which case we merge them), there is a many-to-many (m2m) relation between revision/origin metadata rows and mappings. I see three ways to do it:

  • 1. A table storing mapping names, and a table to store the m2m relation for each of {content,revision,origin_intrinsic}_metadata
  • 2. On each row, store an array of strings, each of which is the name of a mapping
  • 3a. A table storing mapping names, and on each row of the metadata table, store an array of ids in this table
  • 3b. Same as 3a, but exposing mapping ids in the API instead of joining

Issues with each of these, for existing queries (add + get):

  • 1. Code complexity (handling an m2m relation table) and storage space (with the current schema, the m2m table would need to store a sha1 per row)
  • 2. Storage space (~16 bytes per string, not counting the array overhead)
  • 3a. Requires joining on an array, which might be very inefficient
  • 3b. Not nice for consumers of the API, who would get mapping ids instead of names
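
To make the trade-off between options 1 and 2 more concrete, here is a minimal in-memory sketch in Python. It is not the actual schema: the class name, the `revision_id` key and the mapping names (`npm`, `maven`) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Option 2: each metadata row carries the names of its mappings directly.
@dataclass
class RevisionMetadataRow:
    revision_id: str                      # hypothetical key (a sha1 in practice)
    metadata: dict
    mappings: List[str] = field(default_factory=list)

# Option 1: a table of mapping names plus a separate m2m relation table.
mapping_table: Dict[int, str] = {1: "npm", 2: "maven"}           # illustrative names
revision_mapping_m2m: List[Tuple[str, int]] = [("rev-a", 1), ("rev-a", 2)]

def mappings_via_m2m(revision_id: str) -> List[str]:
    """Resolve mapping names for a revision by walking the relation table."""
    return sorted(
        mapping_table[mapping_id]
        for rev, mapping_id in revision_mapping_m2m
        if rev == revision_id
    )

# The same information, stored the option 2 way: no extra table to maintain.
row = RevisionMetadataRow("rev-a", {"name": "example"}, ["maven", "npm"])
```

Both shapes hold the same information; option 1 needs an extra lookup (a join in SQL) on every read, while option 2 repeats the mapping name in each row.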

Obviously we also want to run queries on these. Right now, the only one I can imagine is counting metadata from each mapping (T1484). Even with an index, I think all of these would be equally slow (full scan of all matching rows).
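
As a rough illustration of that counting query under option 2, the sketch below walks every row once and tallies the mapping names it finds. The row layout and mapping names are assumptions for the example, not the real table.

```python
from collections import Counter

# Hypothetical rows: each one stores an array of mapping names (option 2).
rows = [
    {"id": "rev-a", "mappings": ["npm"]},
    {"id": "rev-b", "mappings": ["maven", "npm"]},
    {"id": "rev-c", "mappings": ["pkg-info"]},
]

def count_per_mapping(rows):
    """Count how many rows used each mapping; this visits every row once."""
    counts = Counter()
    for row in rows:
        # set() so that a row is counted at most once per mapping
        counts.update(set(row["mappings"]))
    return counts

counts = count_per_mapping(rows)
```

Whatever the storage layout, the count has to touch every matching row, which is why the options are expected to perform similarly here.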

@zack @olasd Any insight?

(2) seems the best option to me.

(Assuming I'm reading it right. Before reading your proposal, I was thinking that the best option would be to store in the same table containing the results of metadata extraction the list of files that have been consulted/parsed to obtain that result. This is what (2) is about, right? If so, I confirm my "vote" :-))

You are correct, except I will store mapping names, not file names (e.g. because gemspec files are usually named project_name.gemspec, which is harder to query).
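
A small sketch of why mapping names are easier to query than file names, assuming hypothetical rows that record both: selecting by mapping is a plain membership test, while selecting by file name needs a pattern because each project names its gemspec differently.

```python
from fnmatch import fnmatchcase

# Hypothetical rows; the "files" field only exists here for comparison.
rows = [
    {"id": "rev-a", "files": ["foo.gemspec"], "mappings": ["gemspec"]},
    {"id": "rev-b", "files": ["bar.gemspec"], "mappings": ["gemspec"]},
    {"id": "rev-c", "files": ["package.json"], "mappings": ["npm"]},
]

def by_mapping(rows, name):
    """Exact match on the mapping name."""
    return [r["id"] for r in rows if name in r["mappings"]]

def by_filename_pattern(rows, pattern):
    """Pattern match on file names, needed when names vary per project."""
    return [
        r["id"]
        for r in rows
        if any(fnmatchcase(f, pattern) for f in r["files"])
    ]
```

Here `by_mapping(rows, "gemspec")` and `by_filename_pattern(rows, "*.gemspec")` select the same rows, but only the first is an equality lookup.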

*nod*