Store the type of intrinsic metadata that were extracted
Closed, Migrated; Edits Locked

Authored By

	vlorentz
	Jan 21 2019, 11:37 AM

Description

I.e., along with the CodeMeta translation, store either the type of the file(s) the metadata were extracted from (package.json, pom.xml, PKG-INFO, ...) or the mapping used for the translation.

Revisions and Commits

rDCIDX Metadata indexer
	Closed	D1010 Make metadata indexers store the mappings used to translate metadata.

Related Objects

Status	Assigned	Task
Migrated	gitlab-migration	T1485 Show stats on extracted metadata
Migrated	gitlab-migration	T1484 Provide stats on extracted metadata in the indexer storage api
Migrated	gitlab-migration	T1483 Store the type of intrinsic metadata that were extracted

Event Timeline

vlorentz triaged this task as Normal priority. (Jan 21 2019, 11:37 AM)

vlorentz created this task.

vlorentz added a parent task: T1484: Provide stats on extracted metadata in the indexer storage api. (Jan 21 2019, 11:39 AM)

zack added a project: Metadata workflow. (Jan 21 2019, 11:45 AM)

As some revisions/origins may have more than one metadata file (in which case we merge them), there is a many-to-many (m2m) relation between revision/origin metadata rows and mappings. I see three ways to store it:

1. A table storing mapping names, and a table to store the m2m relation for each of {content,revision,origin_intrinsic}_metadata
2. On each row, store an array of strings, each of which is the name of a mapping
3a. A table storing mapping names, and on each row of the metadata table, store an array of ids in this table
3b. Same as 3a, but exposing mapping ids in the API instead of joining
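
To make the difference between options 1 and 2 concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the indexer's actual schema: the table and column names, the "rev-a"/"rev-b" ids, and the "npm"/"maven" mapping names are assumptions made for illustration, and a JSON-encoded text column stands in for an array of strings because SQLite has no array type.

```python
import json
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE mapping (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE revision_metadata (
    id TEXT PRIMARY KEY,          -- stands in for the revision sha1
    translated_metadata TEXT,     -- used by both options
    mappings TEXT                 -- option 2 only: JSON array of mapping names
);
CREATE TABLE revision_metadata_mapping (  -- option 1 only: the m2m relation
    revision_id TEXT NOT NULL REFERENCES revision_metadata(id),
    mapping_id INTEGER NOT NULL REFERENCES mapping(id),
    PRIMARY KEY (revision_id, mapping_id)
);
"""


def add_option1(conn, rev_id, metadata, names):
    """Option 1: one metadata row plus one relation row per mapping."""
    conn.execute(
        "INSERT INTO revision_metadata (id, translated_metadata) VALUES (?, ?)",
        (rev_id, json.dumps(metadata)),
    )
    for name in names:
        conn.execute("INSERT OR IGNORE INTO mapping (name) VALUES (?)", (name,))
        conn.execute(
            "INSERT INTO revision_metadata_mapping (revision_id, mapping_id) "
            "SELECT ?, id FROM mapping WHERE name = ?",
            (rev_id, name),
        )


def get_option1(conn, rev_id):
    """Option 1: reading the mapping names back requires a join."""
    rows = conn.execute(
        "SELECT m.name FROM revision_metadata_mapping r "
        "JOIN mapping m ON m.id = r.mapping_id "
        "WHERE r.revision_id = ? ORDER BY m.name",
        (rev_id,),
    )
    return [name for (name,) in rows]


def add_option2(conn, rev_id, metadata, names):
    """Option 2: the mapping names live on the metadata row itself."""
    conn.execute(
        "INSERT INTO revision_metadata (id, translated_metadata, mappings) "
        "VALUES (?, ?, ?)",
        (rev_id, json.dumps(metadata), json.dumps(sorted(names))),
    )


def get_option2(conn, rev_id):
    """Option 2: a single-row lookup, no join."""
    (mappings,) = conn.execute(
        "SELECT mappings FROM revision_metadata WHERE id = ?", (rev_id,)
    ).fetchone()
    return json.loads(mappings)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    add_option1(conn, "rev-a", {"name": "demo"}, ["npm", "maven"])
    add_option2(conn, "rev-b", {"name": "demo"}, ["npm", "maven"])
    return get_option1(conn, "rev-a"), get_option2(conn, "rev-b")
```

Both getters return the same list of names; what differs is the number of tables and statements that add and get have to touch.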

Issues with each of these, for existing queries (add + get):

1. Code complexity (handling an m2m relation table) and storage space (with the current schema, the m2m table would need to store a sha1)
2. Storage space (~16 bytes per string, not counting the array overhead)
3a. Requires joining on an array, which might be very inefficient
3b. Not nice for consumers of the API

Obviously we also want to run queries on these. Right now, the only one I can imagine is counting metadata from each mapping (T1484). Even with an index, I think all of these would be equally slow (full scan of all matching rows).
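
As a rough illustration of that counting query under option 2, the sketch below scans in-memory rows that each carry a list of mapping names. The row layout and the mapping names ("npm", "maven", "pkg-info") are assumptions for the example, not the indexer storage API.

```python
from collections import Counter


def count_by_mapping(rows):
    """Count how many metadata rows used each mapping (one full scan)."""
    counts = Counter()
    for row in rows:
        # A row merged from several files counts once per distinct mapping.
        counts.update(set(row["mappings"]))
    return dict(counts)


# Hypothetical rows, shaped like option 2 (array of mapping names per row).
rows = [
    {"id": "rev-a", "mappings": ["npm"]},
    {"id": "rev-b", "mappings": ["npm", "maven"]},
    {"id": "rev-c", "mappings": ["pkg-info"]},
]
stats = count_by_mapping(rows)
```

Every row is visited once, so the cost grows linearly with the number of matching rows, whichever storage option is chosen.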

@zack @olasd Any insight?

(2) seems the best option to me.

(Assuming I'm reading it right. Before reading your proposal, I was thinking that the best option would be to store in the same table containing the results of metadata extraction the list of files that have been consulted/parsed to obtain that result. This is what (2) is about, right? If so, I confirm my "vote" :-))

You are correct, except I will store mapping names, not file names (eg. because gemspec files are usually named project_name.gemspec, which is harder to query).
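
A small sketch of why mapping names are easier to query than file names: fixed file names can be looked up directly, while gemspec files are only recognizable by their suffix. The lookup table and the mapping names below are hypothetical and not taken from the indexer's code.

```python
import posixpath

# Hypothetical table of fixed file names and the mapping that handles each.
EXACT_NAMES = {
    "package.json": "npm",
    "pom.xml": "maven",
    "PKG-INFO": "pkg-info",
}


def mapping_for_file(path):
    """Return the (hypothetical) mapping name for a metadata file, or None."""
    filename = posixpath.basename(path)
    if filename in EXACT_NAMES:
        return EXACT_NAMES[filename]
    # gemspec files are named after the project, so only the suffix is stable.
    if filename.endswith(".gemspec"):
        return "gemspec"
    return None
```

Storing "gemspec" lets a stats query use an equality test, whereas storing project_name.gemspec would require a suffix match on every row.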

In T1483#27359, @vlorentz wrote:

You are correct, except I will store mapping names, not file names (eg. because gemspec files are usually named project_name.gemspec, which is harder to query).

*nod*

vlorentz added a revision: D1010: Make metadata indexers store the mappings used to translate metadata. (Jan 25 2019, 3:38 PM)

Resolved by D1010.

This task has been migrated to GitLab.