swh/provenance/postgresql/provenancedb_base.py
class ProvenanceDBBase:
    # … 27 lines not shown …
    @property
    def with_path(self) -> bool:
        return self.flavor == "with-path"

    def commit(self, data: Dict[str, Any], raise_on_commit: bool = False) -> bool:
        try:
            # First insert entities
            for entity in ("content", "directory", "revision"):
                self.insert_entity(
                    entity,
                    {
                        sha1: data[entity]["data"][sha1]
                        for sha1 in data[entity]["added"]
                    },
                )
douardda: why clearing the cache? (It somewhat defeats the purpose of a cache).
aeviso: This was originally done to avoid trying to insert again what has already succeeded in the case of a commit failure. It doesn't defeat the purpose of the cache, since we are precisely in the commit method. This cache was designed to be cleared after every commit, so that the worker gets the same value for a given entry if it is queried more than once during the same batch of revisions (even if another worker has modified that entry in parallel). But after that, we want the worker to see the updated value for the next batch of revisions.
douardda: I do discuss the reason for clearing the write cache (now, the cache["added"] only), but I don't see the purpose of clearing the data cache, i.e. (potentially) forcing the same data to be queried from the db over and over. I know that the big clear_cache called from provenance.py will also clear all these caches, so it actually makes no difference, but what I don't understand is why we don't keep these caches as long as possible (obviously, at some point we may need proper LRU handling).
aeviso: We don't keep them as long as possible because that would shield the worker from modifications made concurrently by other workers. Hence the decision to make the cache only last for the current processing batch. It makes the processing of the batch consistent if some data is queried several times, and it improves performance, but you don't want to keep outdated data forever.
douardda: Oh yes, you are right. Now I wonder how much, practically, these worker-local caches hurt the algorithm when running with massive concurrency... but that's another story for sure.
douardda: I did not do this partial clear because of the global clear_cache() being called from ProvenanceBackend.commit. But in any case, this is unrelated to either of the 2 git revisions in this diff (it's not an origin-revision layer thing, nor is it a code-styling issue).
aeviso: If everything goes well during a commit, the final call to clear_cache is indeed not required; it was done just for sanity (i.e. to be sure that all caches were cleared after a successful commit). Also, I agree it was not supposed to be part of this commit. All this cache logic will actually be reworked as part of the backend refactoring, since we need to find a way to guarantee that a revision can be safely reprocessed in case of a critical db error. Anyway, I also think this shouldn't have been changed in the first place as, IMHO, it is better to keep the same semantics as before for now.
douardda: Ok, but I'd much rather have code that does not need a "sanity cache clearing", because if it's needed, then there is something wrong that we don't detect at runtime. Ok.
aeviso: It's not really needed, but it was there before these other cleanups were introduced. What's the harm of keeping it? I don't get it.
aeviso: Also, this clearing of the structure is to somehow keep track of what was already processed; the fact that this is clearing the outer cache is simply because we forward the actual cache of the upper class to these methods. But we may want to forward a deep copy of it instead, to make things clearer. Hence, this class just clears what was already processed, and the upper class clears its cache independently.
douardda: The harm will not be at processing/execution time; it's more of a human-reader concern: it's a bit confusing. What's the intention of the developer when doing this global clear cache in addition to step-by-step clearing? What does it say about how the code is supposed to work? etc. After this discussion, I'm more in favor of keeping the cache clearings piece by piece as we move along the insertion process and killing the global clear_cache (which I find confusing as to the intentions of the code). I don't think we should copy the structures at all, actually; it makes a lot of copies but I believe is in fact not needed. I see no harm in having these caches filled in provenance.py and consumed by the low-level (db) part of the code.
aeviso: Well, I think the proper thing to do is to forward only the write cache to the lower class, have it cleared as inserts succeed, but let the upper class decide whether to clear the read cache or not. What I don't like about the current state of the solution is that we are forwarding the complete cache to the db class, instead of having a clear separation between write and read caches. This needs to be reworked as part of the backend refactoring.
douardda: We currently have only one cache; the "write cache" is not a cache per se, it's the list (well, the set) of elements in the cache that need to be written to the db. Having 2 caches gives no real advantage but instead obliges us to look in 2 caches (read cache, then write cache) when looking for an element. I don't see the drawback of the current solution (keeping only one cache + a list of cache entries that need to be written to the db).
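The single-cache design discussed above can be sketched as follows. This is an illustrative model only, under the assumption of one cache per entity plus an "added" set of pending keys; the class and method names (EntityCache, pending) are not from the actual codebase.

```python
from datetime import datetime
from typing import Dict, Set


class EntityCache:
    """One cache per entity: a read cache ("data") plus the set of
    keys still pending insertion ("added"). Illustrative sketch."""

    def __init__(self) -> None:
        self.data: Dict[bytes, datetime] = {}  # read cache
        self.added: Set[bytes] = set()         # keys still to be flushed

    def set(self, sha1: bytes, date: datetime) -> None:
        self.data[sha1] = date
        self.added.add(sha1)

    def pending(self) -> Dict[bytes, datetime]:
        # exactly what commit() builds and hands to insert_entity()
        return {sha1: self.data[sha1] for sha1 in self.added}


cache = EntityCache()
cache.set(b"\x01", datetime(2021, 1, 1))
assert cache.pending() == {b"\x01": datetime(2021, 1, 1)}
cache.added.clear()           # flush succeeded: clear only the write set
assert cache.pending() == {}
assert cache.data             # the read cache keeps serving this batch
```

Note how clearing only the "added" set after a successful insert keeps reads consistent within a batch while avoiding double writes, which is the compromise both reviewers converge on.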
            # Relations should come after ids for entities were resolved
-            for rel_table in (
+            for relation in (
                "content_in_revision",
                "content_in_directory",
                "directory_in_revision",
            ):
-                self.insert_relation(rel_table, data[rel_table])
+                self.insert_relation(relation, data[relation])
            # TODO: this should be updated when origin-revision layer gets properly
            # updated.
            # if data["revision_before_revision"]:
            #     psycopg2.extras.execute_values(
            #         self.cursor,
            #         """
            #         LOCK TABLE ONLY revision_before_revision;
            # … 9 lines not shown …
            #         self.cursor,
            #         """
            #         LOCK TABLE ONLY revision_in_origin;
            #         INSERT INTO revision_in_origin VALUES %s
            #         ON CONFLICT DO NOTHING
            #         """,
            #         data["revision_in_origin"],
            #     )
            # data["revision_in_origin"].clear()
douardda: nitpick: I'd rather put these insertion snippets in dedicated methods instead of having a "giant" commit() method (aka one for each type of table/relation, as is now the case for the content-to-revision layer).
aeviso: Yeah, I just got the feeling that the queries for the origin-revision layer are still too ad hoc to be properly abstracted. All this class definitely requires some reworking. Will do anyway.
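The refactoring douardda suggests could take roughly the shape below. This is a toy stand-in, not the real class: the helper names (insert_entities, insert_relations, insert_origin_layer) are illustrative assumptions, and the helpers here only record their invocation where the real ones would run the SQL currently inlined in commit().

```python
import logging
from typing import Any, Dict, List


class ProvenanceDBSketch:
    """Sketch of a commit() split into one helper per table/relation."""

    def __init__(self) -> None:
        self.calls: List[str] = []  # records helper invocations (toy only)

    def insert_entities(self, data: Dict[str, Any]) -> None:
        self.calls.append("entities")

    def insert_relations(self, data: Dict[str, Any]) -> None:
        self.calls.append("relations")

    def insert_origin_layer(self, data: Dict[str, Any]) -> None:
        self.calls.append("origin-layer")  # revision_before_revision, etc.

    def commit(self, data: Dict[str, Any], raise_on_commit: bool = False) -> bool:
        try:
            self.insert_entities(data)
            self.insert_relations(data)    # after entity ids are resolved
            self.insert_origin_layer(data)
            return True
        except Exception:
            logging.exception("Unexpected error")
            if raise_on_commit:
                raise
            return False


db = ProvenanceDBSketch()
assert db.commit({}) is True
assert db.calls == ["entities", "relations", "origin-layer"]
```

The ordering constraint (entities before relations) stays visible in commit() while each insertion strategy lives in its own method, which is the readability win the nitpick is after.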
            return True
        except:  # noqa: E722
            # Unexpected error occurred, rollback all changes and log message
            logging.exception("Unexpected error")
            if raise_on_commit:
                raise
            return False
    # … 11 lines not shown …
    def insert_entity(self, entity: str, data: Dict[bytes, datetime]):
        if data:
            psycopg2.extras.execute_values(
                self.cursor,
                f"""
                LOCK TABLE ONLY {entity};
                INSERT INTO {entity}(sha1, date) VALUES %s
                ON CONFLICT (sha1) DO
                UPDATE SET date=LEAST(EXCLUDED.date,{entity}.date)
                """,
                data.items(),
            )
            # XXX: not sure if Python takes a reference or a copy.
            #      This might be useless!
            data.clear()
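On the XXX comment above: Python passes the dict argument by reference (object sharing), so data.clear() inside the method does empty the caller's cache as well; the clear() is not useless. A minimal demonstration, with the database call stubbed out:

```python
# Toy stand-in for insert_entity: the SQL call is omitted, only the
# argument-passing behaviour is demonstrated.
def insert_entity(entity, data):
    # (imagine the execute_values() call here)
    data.clear()  # mutates the very dict object the caller passed in


cache = {b"aa": "2021-01-01", b"bb": "2021-01-02"}
insert_entity("content", cache)
assert cache == {}  # the caller's dict was emptied too
```

This is also why the review thread treats these clear() calls as clearing the upper class's cache: both layers hold a reference to the same dict object.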
    def insert_relation(self, relation: str, data: Set[Tuple[bytes, bytes, bytes]]):
    # … 10 lines not shown …
    ) -> Generator[Tuple[bytes, bytes, datetime, bytes], None, None]:
        ...
    def origin_get_id(self, url: str) -> int:
        # Insert origin in the DB and return the assigned id
        self.cursor.execute(
            """
            LOCK TABLE ONLY origin;
            INSERT INTO origin(url) VALUES (%s)
            ON CONFLICT DO NOTHING
            RETURNING id
            """,
            (url,),
        )
        return self.cursor.fetchone()[0]
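One caveat worth flagging here: with ON CONFLICT DO NOTHING, PostgreSQL's RETURNING clause emits no row when the url already exists, so fetchone() returns None and fetchone()[0] would raise. The usual pattern is an insert-or-select fallback. A dict-based simulation of that logic (the function name mirrors the method above, but the dict standing in for the origin table and the id assignment are illustrative assumptions):

```python
def origin_get_id(origins, url):
    """Insert-or-get: origins maps url -> id, simulating the origin table."""
    if url not in origins:              # INSERT succeeds: new row, new id
        origins[url] = len(origins) + 1  # plays the role of RETURNING id
        return origins[url]
    return origins[url]                 # conflict: fall back to a SELECT


table = {}
assert origin_get_id(table, "https://example.org/repo") == 1
assert origin_get_id(table, "https://example.org/repo") == 1  # idempotent
```

In SQL terms the fallback branch would be a plain `SELECT id FROM origin WHERE url = %s` executed when the INSERT returns no row.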
    def revision_get_preferred_origin(self, revision: bytes) -> int:
        self.cursor.execute(
            """SELECT COALESCE(origin, 0) FROM revision WHERE sha1=%s""", (revision,)
        # … 36 lines not shown …