swh/provenance/postgresql/provenancedb_base.py
class ProvenanceDBBase:
    # … 27 lines not shown …
    @property
    def with_path(self) -> bool:
        return self.flavor == "with-path"

    def commit(self, data: Dict[str, Any], raise_on_commit: bool = False) -> bool:
        try:
            # First insert entities
            for entity in ("content", "directory", "revision"):
                self.insert_entity(
                    entity,
                    {
                        sha1: data[entity]["data"][sha1]
                        for sha1 in data[entity]["added"]
                    },
                )
douardda: why clearing the cache? (It somewhat defeats the purpose of a cache).
aeviso: This was originally done to avoid trying to insert again what has already succeeded in the case of a commit failure. It doesn't defeat the purpose of the cache, since we are precisely in the commit method. This cache was designed to be cleared after every commit, so that the worker gets the same value for a given entry if it is queried more than once during the same batch of revisions (even if another worker has modified that entry in parallel). But after that, we want the worker to see the updated value for the next batch of revisions.
douardda: I do discuss the reason for clearing the write cache (now, the cache["added"] only), but I don't see the purpose of clearing the data cache, i.e. (potentially) forcing the same data to be queried from the db over and over. I know that the big clear_cache called from provenance.py will also clear all these caches, so it actually makes no difference, but what I don't understand is why we don't keep these caches as long as possible (obviously, at some point we may need proper LRU handling).
aeviso: We don't keep them as long as possible because that would shield the worker from modifications made concurrently by other workers. Hence the decision to make the cache only last for the current processing batch. It makes the processing of the batch consistent if some data is queried several times, and it improves performance, but you don't want to keep outdated data forever.
douardda: Oh yes, you are right. Now I wonder how much, practically, these worker-local caches hurt the algorithm when running with massive concurrency... but that's another story for sure.
douardda: I did not do this partial clear because of the global clear_cache() being called from ProvenanceBackend.commit. But in any case, this is unrelated to either of the 2 git revisions in this diff (it's not an origin-revision layer thing, nor is it a code-styling issue).
aeviso: If everything goes well during a commit, the final call to clear_cache is indeed not required; it was done just for sanity (i.e. to be sure that all caches were cleared after a successful commit). Also, I agree it was not supposed to be part of this commit. All this cache logic will actually be reworked as part of the backend refactoring, since we need to find a way to guarantee that a revision can be safely reprocessed in case of a critical db error. Anyway, I also think this shouldn't have been changed in the first place as, IMHO, it is better to keep the same semantics as before for now.
douardda: Ok, but I'd much rather have code that does not need a "sanity cache clearing", because if it's needed, then there is something wrong that we don't detect at runtime. Ok.
aeviso: It's not really needed, but it was there before these other cleanups were introduced. What's the harm of keeping it? I don't get it.
aeviso: Also, this clearing of the structure is to somehow keep track of what was already processed; the fact that this is clearing the outer cache is simply because we forward the actual cache of the upper class to these methods. But we may want to forward a deep copy of it instead, to make things clearer. Hence, this class just clears what was already processed, and the upper class clears its cache independently.
douardda: The harm will not be at processing/execution time; it's more of a human-reader concern: it's a bit confusing. What's the intention of the developer when doing this global clear cache in addition to step-by-step clearing? What does it say about how the code is supposed to work? etc. After this discussion, I'm more in favor of keeping the cache clearings piece by piece as we move along the insertion process and killing the global clear_cache (which I find confusing as to the intentions of the code). I don't think we should copy the structures at all, actually; it makes a lot of copies but I believe is in fact not needed. I see no harm in having these caches filled in provenance.py and consumed by the low-level (db) part of the code.
aeviso: Well, I think the proper thing to do is to forward only the write cache to the lower class, have it cleared as inserts succeed, but let the upper class decide whether to clear the read cache or not. What I don't like about the current state of the solution is that we are forwarding the complete cache to the db class, instead of having a clear separation between write and read caches. This needs to be reworked as part of the backend refactoring.
douardda: We currently have only one cache; the "write cache" is not a cache per se, it's the list (well, the set) of elements in the cache that need to be written to the db. Having 2 caches gives no real advantage but instead obliges us to look in 2 caches (read cache, then write cache) when looking for an element. I don't see the drawback of the current solution (keeping only one cache + a list of cache entries that need to be written to the db).
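The single-cache design discussed above can be sketched as follows. This is an illustrative model only, under the assumption of one cache per entity plus an "added" set of pending keys; the class and method names (EntityCache, pending) are not from the actual codebase.

```python
from datetime import datetime
from typing import Dict, Set


class EntityCache:
    """One cache per entity: a read cache ("data") plus the set of
    keys still pending insertion ("added"). Illustrative sketch."""

    def __init__(self) -> None:
        self.data: Dict[bytes, datetime] = {}  # read cache
        self.added: Set[bytes] = set()         # keys still to be flushed

    def set(self, sha1: bytes, date: datetime) -> None:
        self.data[sha1] = date
        self.added.add(sha1)

    def pending(self) -> Dict[bytes, datetime]:
        # exactly what commit() builds and hands to insert_entity()
        return {sha1: self.data[sha1] for sha1 in self.added}


cache = EntityCache()
cache.set(b"\x01", datetime(2021, 1, 1))
assert cache.pending() == {b"\x01": datetime(2021, 1, 1)}
cache.added.clear()           # flush succeeded: clear only the write set
assert cache.pending() == {}
assert cache.data             # the read cache keeps serving this batch
```

Note how clearing only the "added" set after a successful insert keeps reads consistent within a batch while avoiding double writes, which is the compromise both reviewers converge on.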
            # Relations should come after ids for entities were resolved
-            for rel_table in (
+            for relation in (
                "content_in_revision",
                "content_in_directory",
                "directory_in_revision",
            ):
-                self.insert_relation(rel_table, data[rel_table])
+                self.insert_relation(relation, data[relation])
            # TODO: this should be updated when origin-revision layer gets properly
            # updated.
            # if data["revision_before_revision"]:
            #     psycopg2.extras.execute_values(
            #         self.cursor,
            #         """
            #         LOCK TABLE ONLY revision_before_revision;
            # … 9 lines not shown …
            #         self.cursor,
            #         """
            #         LOCK TABLE ONLY revision_in_origin;
            #         INSERT INTO revision_in_origin VALUES %s
            #         ON CONFLICT DO NOTHING
            #         """,
            #         data["revision_in_origin"],
            #     )
            # data["revision_in_origin"].clear()
douardda: nitpick: I'd rather put these insertion snippets in dedicated methods instead of having a "giant" commit() method (aka one for each type of table/relation, as is now the case for the content-to-revision layer).
aeviso: Yeah, I just got the feeling that the queries for the origin-revision layer are still too ad hoc to be properly abstracted. All this class definitely requires some reworking. Will do anyway.
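The refactoring douardda suggests could take roughly the shape below. This is a toy stand-in, not the real class: the helper names (insert_entities, insert_relations, insert_origin_layer) are illustrative assumptions, and the helpers here only record their invocation where the real ones would run the SQL currently inlined in commit().

```python
import logging
from typing import Any, Dict, List


class ProvenanceDBSketch:
    """Sketch of a commit() split into one helper per table/relation."""

    def __init__(self) -> None:
        self.calls: List[str] = []  # records helper invocations (toy only)

    def insert_entities(self, data: Dict[str, Any]) -> None:
        self.calls.append("entities")

    def insert_relations(self, data: Dict[str, Any]) -> None:
        self.calls.append("relations")

    def insert_origin_layer(self, data: Dict[str, Any]) -> None:
        self.calls.append("origin-layer")  # revision_before_revision, etc.

    def commit(self, data: Dict[str, Any], raise_on_commit: bool = False) -> bool:
        try:
            self.insert_entities(data)
            self.insert_relations(data)    # after entity ids are resolved
            self.insert_origin_layer(data)
            return True
        except Exception:
            logging.exception("Unexpected error")
            if raise_on_commit:
                raise
            return False


db = ProvenanceDBSketch()
assert db.commit({}) is True
assert db.calls == ["entities", "relations", "origin-layer"]
```

The ordering constraint (entities before relations) stays visible in commit() while each insertion strategy lives in its own method, which is the readability win the nitpick is after.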
            return True
        except:  # noqa: E722
            # Unexpected error occurred, rollback all changes and log message
            logging.exception("Unexpected error")
            if raise_on_commit:
                raise
            return False
    # … 11 lines not shown …
    def insert_entity(self, entity: str, data: Dict[bytes, datetime]):
        if data:
            psycopg2.extras.execute_values(
                self.cursor,
                f"""
                LOCK TABLE ONLY {entity};
                INSERT INTO {entity}(sha1, date) VALUES %s
                ON CONFLICT (sha1) DO
                UPDATE SET date=LEAST(EXCLUDED.date,{entity}.date)
                """,
                data.items(),
            )
            # XXX: not sure if Python takes a reference or a copy.
            #      This might be useless!
            data.clear()
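On the XXX comment above: Python passes the dict argument by reference (object sharing), so data.clear() inside the method does empty the caller's cache as well; the clear() is not useless. A minimal demonstration, with the database call stubbed out:

```python
# Toy stand-in for insert_entity: the SQL call is omitted, only the
# argument-passing behaviour is demonstrated.
def insert_entity(entity, data):
    # (imagine the execute_values() call here)
    data.clear()  # mutates the very dict object the caller passed in


cache = {b"aa": "2021-01-01", b"bb": "2021-01-02"}
insert_entity("content", cache)
assert cache == {}  # the caller's dict was emptied too
```

This is also why the review thread treats these clear() calls as clearing the upper class's cache: both layers hold a reference to the same dict object.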
    def insert_relation(self, relation: str, data: Set[Tuple[bytes, bytes, bytes]]):
    # … 10 lines not shown …
    ) -> Generator[Tuple[bytes, bytes, datetime, bytes], None, None]:
        ...
    def origin_get_id(self, url: str) -> int:
        # Insert origin in the DB and return the assigned id
        self.cursor.execute(
            """
            LOCK TABLE ONLY origin;
            INSERT INTO origin(url) VALUES (%s)
            ON CONFLICT DO NOTHING
            RETURNING id
            """,
            (url,),
        )
        return self.cursor.fetchone()[0]
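One caveat worth flagging here: with ON CONFLICT DO NOTHING, PostgreSQL's RETURNING clause emits no row when the url already exists, so fetchone() returns None and fetchone()[0] would raise. The usual pattern is an insert-or-select fallback. A dict-based simulation of that logic (the function name mirrors the method above, but the dict standing in for the origin table and the id assignment are illustrative assumptions):

```python
def origin_get_id(origins, url):
    """Insert-or-get: origins maps url -> id, simulating the origin table."""
    if url not in origins:              # INSERT succeeds: new row, new id
        origins[url] = len(origins) + 1  # plays the role of RETURNING id
        return origins[url]
    return origins[url]                 # conflict: fall back to a SELECT


table = {}
assert origin_get_id(table, "https://example.org/repo") == 1
assert origin_get_id(table, "https://example.org/repo") == 1  # idempotent
```

In SQL terms the fallback branch would be a plain `SELECT id FROM origin WHERE url = %s` executed when the INSERT returns no row.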
    def revision_get_preferred_origin(self, revision: bytes) -> int:
        self.cursor.execute(
            """SELECT COALESCE(origin, 0) FROM revision WHERE sha1=%s""", (revision,)
        # … 36 lines not shown …