- Group Reviewers
- Maniphest Tasks
- T1231: Create an origin indexer that lists the most recent revision in HEAD branch
- rDCIDXb5efc94035e4: Add OriginIndexer + OriginHeadIndexer.
only minor issues for me, looks good otherwise
should be "implements origin indexing"
More generally in this file (not related to this diff but might be a chance to fix it) we use "indexation" which means something else in English. We should use "indexing" consistently
We shouldn't have a default here, as there is no sane default.
If you want the list somewhere for ease of testing please put them in a REAME or use a local script.
if you are keeping prints for testing, don't mind this comment.
i would rename method or comment with an explicit comment saying
but i'm being picky..
Here the result is without the metadata and will be kept in the DB as a pointer to the last seen revision.
It seems a good compartmentalization.
I'm not a real python programmer, so this is not my place to say
I think it's only me, so you can completely ignore this comment.
is 'alias' and 'revision' the only target_type possible?
I used a default list with the RevisionMetatadaIndexer to test the component outside the MockStorage
psql does not agree:
softwareheritage=> select distinct target_type from snapshot_branch; target_type ------------- content release revision alias directory
I'm not sure what to do with content and directory though. I'll create an issue later.
Now that I think about it, it may affect the search between the time the OriginHead indexer runs and updates the reference, and the time the revision metadata is copied.
We should probably discuss this further, outside a Diff inline comments
This looks good to me. Testing is also great for later iterations.
With the origin_head.py main function I have an error because origin_head_add isn't defined in my code, which is normal, waiting for the diff on the indexer storage.
I played a bit with the main function origin_head.py trying to launch the RevisionMetadataIndexer, commenting line 442 in indexer.py
which makes me more keen on keeping just the revision_id for an origin and not copy the metadata to the origin level.
I'm voting to not copy the metadata and keep only the reference (which might change quickly)
I don't mind keeping the main method with the defaults here for a few iterations before deleting it.
you should probably use a different id (even if this is a mock storage)
you can remove the parenthesis mention now.
Expected a (type...
Maybe worth mentioning the origin in question in the error log.
The default configuration should be targeting a remote storage , defaulting to a local one as per  for example.
For the snapshot, possible target_type per loader:
thumbs up for the detailed type annotations here
As this is going to be persisted in the DB, it's time to define some decent naming for the indexer tools.
Bottom line: let's go for "metadata-origin".
as per T1257 this is no longer git-specific, so this should be _try_get_head
… for the same reason this should now be removed
I'm pretty sure this regex should not be here. It's a naming convention that, I guess, is also implemented by the tarball loader, right?
same for this method. (And hence maybe we do not need to expose the regex, but rather a filename parsing method from the tarball loader?)
should be updated according to the name change suggested above
As far as I understand, the tarball loader does not try to parse the file name; it only downloads a tarball from an URL, extracts it, and forwards it to the dir loader (which doesn't seem to parse names either).