Page MenuHomeSoftware Heritage

Add OriginIndexer + OriginHeadIndexer.
ClosedPublic

Authored by vlorentz on Oct 4 2018, 6:26 PM.

Diff Detail

Repository
rDCIDX Metadata indexer
Branch
origin-head-indexer
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1527
Build 1871: arc lint + arc unit

Event Timeline

zack requested changes to this revision.Oct 5 2018, 10:54 AM

only minor issues for me, looks good otherwise

swh/indexer/indexer.py
389

should be "implements origin indexing"

More generally in this file (not related to this diff but might be a chance to fix it) we use "indexation" which means something else in English. We should use "indexing" consistently

swh/indexer/origin_head.py
111

We shouldn't have a default here, as there is no sane default.

If you want the list somewhere for ease of testing please put them in a REAME or use a local script.

This revision now requires changes to proceed.Oct 5 2018, 10:54 AM
swh/indexer/origin_head.py
44

if you are keeping prints for testing, don't mind this comment.
If not, don't forget erasing them (I forget them a lot).

46

i would rename method or comment with an explicit comment saying
that each origin type requires a different HEAD retrieval method

but i'm being picky..

57

Here the result is without the metadata and will be kept in the DB as a pointer to the last seen revision.
do you want to elaborate this by launching and copying the revision_metadata? or will it be another component?

It seems a good compartmentalization.
Maybe an orchestrator who uses the OriginIndexer's results to launch the RevisionIndexer with the list of revision ids?

81

I'm not a real python programmer, so this is not my place to say
but the else after the except looks kinda misplaced (esthetically of course).

I think it's only me, so you can completely ignore this comment.

91

is 'alias' and 'revision' the only target_type possible?

111

I used a default list with the RevisionMetatadaIndexer to test the component outside the MockStorage
and didn't think to delete it.
A script is a good alternative and adding a test to this or the next diff would be nice.

vlorentz added inline comments.
swh/indexer/origin_head.py
57

+1 for using another component

81

It's a block that is run if the try: block did not raise an exception. I can drop the else: if you find it less readable.

91

afaik, yes.

111

I'll turn these into tests.

vlorentz added inline comments.
swh/indexer/origin_head.py
91

psql does not agree:

softwareheritage=> select distinct target_type from snapshot_branch;
 target_type 
-------------
 
 content
 release
 revision
 alias
 directory

I'm not sure what to do with content and directory though. I'll create an issue later.

vlorentz added inline comments.
swh/indexer/origin_head.py
57

Now that I think about it, it may affect the search between the time the OriginHead indexer runs and updates the reference, and the time the revision metadata is copied.

We should probably discuss this further, outside a Diff inline comments

  • Add placeholders for other target types.
  • Start adding tests.
  • More tests.

This looks good to me. Testing is also great for later iterations.


With the origin_head.py main function I have an error because origin_head_add isn't defined in my code, which is normal, waiting for the diff on the indexer storage.

I played a bit with the main function origin_head.py trying to launch the RevisionMetadataIndexer, commenting line 442 in indexer.py
which makes me more keen on keeping just the revision_id for an origin and not copy the metadata to the origin level.

swh/indexer/origin_head.py
57

I agree.

I'm voting to not copy the metadata and keep only the reference (which might change quickly)
The intrinsic metadata found in a revision is a more permanent object than the one we can associate with an origin.

161

I don't mind keeping the main method with the defaults here for a few iterations before deleting it.

swh/indexer/tests/test_utils.py
148 ↗(On Diff #1539)

you should probably use a different id (even if this is a mock storage)

vlorentz marked 3 inline comments as done.
  • Deduplicate tool id.
  • Remove default -i arguments to indexers.
ardumont added inline comments.
swh/indexer/indexer.py
89

you can remove the parenthesis mention now.

423

Expected a (type...

441

Maybe worth mentioning the origin in question in the error log.

swh/indexer/origin_head.py
27

The default configuration should be targeting a remote storage [1], defaulting to a local one as per [2] for example.

[1] D372#inline-1828

[2] https://docs.softwareheritage.org/devel/getting-started.html#step-4-ingest-repositories

73

For the snapshot, possible target_type per loader:

  • mercurial: release, revision [1]
  • pypi: release, revision, alias (pointer to another release/revision mentioned in the snapshot) [2]
  • svn: revision [3]

[1] https://forge.softwareheritage.org/source/swh-loader-mercurial/browse/master/swh/loader/mercurial/loader.py$475-482

[2] https://forge.softwareheritage.org/source/swh-loader-pypi/browse/master/swh/loader/pypi/loader.py$239-271

[3] https://forge.softwareheritage.org/source/swh-loader-svn/browse/master/swh/loader/svn/loader.py$34-53

vlorentz added inline comments.
swh/indexer/origin_head.py
27

Sorry, I didn't mean to commit that.

73

What would the name of the "head" "branch" in mercurial be? "tip"?

vlorentz marked an inline comment as done.
  • Small fixes.
  • Remove config that wasn't supposed to be committed.
  • As per yesterday's chat, the result of OriginHeadIndexer should not be persistant.
vlorentz retitled this revision from Add OriginIndexer + draft of OriginHeadIndexer. to Add OriginIndexer + OriginHeadIndexer..Oct 10 2018, 10:04 AM
swh/indexer/origin_head.py
73

default is the head branch.
tip is what they use to define it, i think.

zack requested changes to this revision.Oct 15 2018, 10:06 AM
zack removed a reviewer: Reviewers.
zack added inline comments.
swh/indexer/indexer.py
410

thumbs up for the detailed type annotations here

swh/indexer/origin_head.py
23

As this is going to be persisted in the DB, it's time to define some decent naming for the indexer tools.
This is the "metadata indexer" so we should use it somewhere.
Also, we should go hierarchically: metadata → {origin,content,revision,etc.}
The fact we take the "head" revision is an implementation detail, it's just how the "origin metadata indexer" currently works.
"indexer" can remain implicit, as it's what all the tools do.

Bottom line: let's go for "metadata-origin".

56

as per T1257 this is no longer git-specific, so this should be _try_get_head

66

… for the same reason this should now be removed

70

I'm pretty sure this regex should not be here. It's a naming convention that, I guess, is also implemented by the tarball loader, right?
If so it should be exposed as a class attribute by it and used from here, to avoid duplicating it around.
(Unless you've very valid reasons to duplicate it here; if so they should be documented in a comment here.)

79

same for this method. (And hence maybe we do not need to expose the regex, but rather a filename parsing method from the tarball loader?)

swh/indexer/tests/test_utils.py
146 ↗(On Diff #1572)

should be updated according to the name change suggested above

This revision now requires changes to proceed.Oct 15 2018, 10:06 AM
vlorentz marked 3 inline comments as done.
  • Better tool name (thx @zack).
vlorentz added inline comments.
swh/indexer/origin_head.py
70

As far as I understand, the tarball loader does not try to parse the file name; it only downloads a tarball from an URL, extracts it, and forwards it to the dir loader (which doesn't seem to parse names either).

swh/indexer/origin_head.py
70

Fair enough. Thanks for checking.

  • Unify head detection for git and svn (T1257).
This revision is now accepted and ready to land.Oct 15 2018, 10:43 AM
This revision was automatically updated to reflect the committed changes.