
Normalize data sent from clients to the storage
Open, Normal, Public

Description

Currently, data normalization in the graph storage happens at the point where we write the data out to PostgreSQL.

This is becoming an issue now that we're considering pushing data to the journal directly when it gets inserted, since journal clients would then need to normalize the data themselves before consuming it.

This also makes the "normalized data schema" dependent on the PostgreSQL implementation, instead of having a proper specification, which is problematic when considering "post-Postgres" graph storage backends.

This is not entirely a problem *now*, as the journal consumers are fully controlled by us and either process only a few bits of the data or just write back to PostgreSQL, but it is going to be a problem in the near future.

We should:

  • make sure the normalized data schema is (more cleanly) specified
  • make loaders normalize their data before sending it to the storage backend
  • make the journal backfiller normalize its fetched data before sending it to the journal
  • (plausibly) make the storage backend check data normalization before accepting it for storage
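
The split proposed above (clients normalize, the storage only checks) could look roughly like the following sketch. The field names and the normalization rules here are hypothetical, made up for illustration; they are not the actual schema, which is precisely what still needs to be specified.

```python
from typing import Any, Dict

# Hypothetical normalized form for a revision-like object. These rules are
# illustrative assumptions, not the actual Software Heritage schema.
REQUIRED_TYPES = {"id": bytes, "message": bytes}
OPTIONAL_DEFAULTS = {"metadata": None}


def normalize(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Loader/backfiller side: return a normalized copy of obj."""
    out = dict(obj)
    # Assumed rule: text payloads are stored as UTF-8 encoded bytes.
    if isinstance(out.get("message"), str):
        out["message"] = out["message"].encode("utf-8")
    # Assumed rule: optional fields are always present, with a default.
    for key, default in OPTIONAL_DEFAULTS.items():
        out.setdefault(key, default)
    return out


def is_normalized(obj: Dict[str, Any]) -> bool:
    """Storage side: check the object without modifying it."""
    has_required = all(
        isinstance(obj.get(key), expected)
        for key, expected in REQUIRED_TYPES.items()
    )
    has_optional = all(key in obj for key in OPTIONAL_DEFAULTS)
    return has_required and has_optional


raw = {"id": b"\x01\x02", "message": "initial commit"}
clean = normalize(raw)
```

With rules like these living outside the PostgreSQL write path, the same `normalize` could be shared by loaders and the journal backfiller, and the storage backend would only need to call `is_normalized` and reject objects that fail the check.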

Event Timeline

olasd triaged this task as Normal priority. Apr 4 2019, 12:31 PM
olasd created this task.
olasd updated the task description.