
loading some svn origins while ignoring history sometimes raises
Closed, Migrated

Description

Recreated from T3695, submitted by @ardumont, as the associated diffs do not fix the original issue (my bad, I misread the task details).

The loader should allow starting the svn ingestion while ignoring the result of any previous ingestion.
It currently raises an exception when it should not.

The following stack trace (KeyError) [1] shows that some assertions were wrong.
This needs investigation and a fix.

[1]

swhworker@worker17:~$ swh loader run svn https://profs.scienze.univr.it/posenato/svn/sw/CSTNU start_from_scratch=True
INFO:swh.loader.svn.SvnLoader:Load origin 'https://profs.scienze.univr.it/posenato/svn/sw/CSTNU' with type 'svn'
INFO:swh.loader.svn.SvnLoader:Processing revisions [1-619] for {'swh-origin': 'https://profs.scienze.univr.it/posenato/svn/sw/CSTNU', 'remote_url': 'https://profs.scienze.univr.it/posenato/svn/sw/CSTNU', 'local_url': b'/tmp/swh.loader.svn.qhb2pmup-99331/CSTNU', 'uuid': b'782453a1-1937-45d1-8845-2a6fcc2839b7'}
ERROR:swh.loader.svn.SvnLoader:90
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/svn/loader.py", line 457, in fetch_data
    data = next(self.swh_revision_gen)
  File "/usr/lib/python3/dist-packages/swh/loader/svn/loader.py", line 364, in process_svn_revisions
    rev, commit, dir_id, revision_parents[rev]
KeyError: 90
ERROR:swh.loader.svn.SvnLoader:Loading failure, updating to `failed` status
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/loader/core/loader.py", line 339, in load
    self.store_data()
  File "/usr/lib/python3/dist-packages/swh/loader/svn/loader.py", line 489, in store_data
    revision=self._last_revision, snapshot=self._snapshot
  File "/usr/lib/python3/dist-packages/swh/loader/svn/loader.py", line 525, in generate_and_load_snapshot
    "generate_and_load_snapshot called with null revision and snapshot!"
ValueError: generate_and_load_snapshot called with null revision and snapshot!
{'status': 'failed'}

Note: the start-from-scratch parameter should probably be renamed ignore-history along the way.
That would align it with the equivalent parameter in loader-git.

Event Timeline

anlambert triaged this task as Normal priority. Nov 3 2021, 4:20 PM
anlambert created this task.

Original comment posted in T3695 by @ardumont

Ah, it does not affect all origins though...
I tried other origins which exhibit the same issue [1], and they did not fail...

swhworker@worker17:~$ swh loader run svn https://svn.code.sf.net/p/unimacro/code start_from_scratch=True
INFO:swh.loader.svn.SvnLoader:Load origin 'https://svn.code.sf.net/p/unimacro/code' with type 'svn'
INFO:swh.loader.svn.SvnLoader:Processing revisions [1-614] for {'swh-origin': 'https://svn.code.sf.net/p/unimacro/code', 'remote_url': 'https://svn.code.sf.net/p/unimacro/code', 'local_url': b'/tmp/swh.loader.svn.a4fjpaqs-105156/code', 'uuid': b'df0dbeab-7b48-0410-a972-c90e96de496b'}
{'status': 'eventful'}
swhworker@worker17:~$ swh loader run svn https://svn.code.sf.net/p/open-chord/code start_from_scratch=True
INFO:swh.loader.svn.SvnLoader:Load origin 'https://svn.code.sf.net/p/open-chord/code' with type 'svn'
INFO:swh.loader.svn.SvnLoader:Processing revisions [1-424] for {'swh-origin': 'https://svn.code.sf.net/p/open-chord/code', 'remote_url': 'https://svn.code.sf.net/p/open-chord/code', 'local_url': b'/tmp/swh.loader.svn.0r3xuoo0-107295/code', 'uuid': b'5a381aae-974a-0410-8b17-85268456560c'}
{'status': 'eventful'}

[1] https://sentry.softwareheritage.org/share/issue/84433e1cd9974eb293f0ba3a9ee44fd1/

The issue stems from a difference in the revision data fetched, depending on whether we use the swh.loader.svn.SvnLoader class (fetching revisions
one at a time through a ping-pong with the svn server) or the swh.loader.svn.SvnLoaderFromRemoteDump class (fetching all revisions
to a dump file in one operation).

The swh loader run svn command uses the first class, while the second one is used in production by the celery tasks.

Using swh.loader.svn.SvnLoader is like using the official svn client. This is the log we obtain after a checkout of the repository located
at https://profs.scienze.univr.it/posenato/svn/sw/CSTNU.

(swh) anlambert@carnavalet:/tmp/CSTNU$ svn log
------------------------------------------------------------------------
r619 | posenato | 2021-10-22 15:39:58 +0200 (ven., 22 oct. 2021) | 1 line

Revision 4.3
------------------------------------------------------------------------
r618 | posenato | 2021-10-22 15:37:37 +0200 (ven., 22 oct. 2021) | 2 lines

Another improvement before final release.e

------------------------------------------------------------------------
r617 | posenato | 2021-10-22 15:36:19 +0200 (ven., 22 oct. 2021) | 1 line

Improved message for the user before to overwrite the current graph in the editor.
------------------------------------------------------------------------
...
...
...
------------------------------------------------------------------------
r92 | posenato | 2014-12-07 08:49:51 +0100 (dim., 07 déc. 2014) | 4 lines

Improved overall structure. 
It was a big error to maintain value and node set together. 
It is best to separate them in order to guarantee an effective label minimization.
Version 92 is a good release for CSTN.
------------------------------------------------------------------------
r91 | posenato | 2014-12-07 06:35:27 +0100 (dim., 07 déc. 2014) | 2 lines

Version 91 is a stable and sound version.
It manages node name set together with labeled value. 
------------------------------------------------------------------------
r90 | posenato | 2014-12-04 05:47:01 +0100 (jeu., 04 déc. 2014) | 2 lines

Import iniziale

------------------------------------------------------------------------

We can see that the first revision number is 90.
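
As a quick way to check this on other origins, the lowest revision number can be extracted from the `svn log` output. The following is only a minimal sketch: the helper name `first_logged_revision` is hypothetical and not part of swh-loader-svn, and it assumes the default `svn log` output format shown above.

```python
import re

# Matches the header line of each `svn log` entry, for example:
# "r90 | posenato | 2014-12-04 05:47:01 +0100 (jeu., 04 déc. 2014) | 2 lines"
LOG_HEADER = re.compile(r"^r(\d+) \|", re.MULTILINE)


def first_logged_revision(svn_log_output: str) -> int:
    """Return the lowest revision number found in `svn log` output.

    Hypothetical helper, for illustration only.
    """
    revisions = [int(number) for number in LOG_HEADER.findall(svn_log_output)]
    if not revisions:
        raise ValueError("no revision header found in svn log output")
    return min(revisions)


# Excerpt of the log shown above (commit messages shortened).
sample_log = """\
------------------------------------------------------------------------
r92 | posenato | 2014-12-07 08:49:51 +0100 (dim., 07 déc. 2014) | 4 lines

Improved overall structure.
------------------------------------------------------------------------
r91 | posenato | 2014-12-07 06:35:27 +0100 (dim., 07 déc. 2014) | 2 lines

Version 91 is a stable and sound version.
------------------------------------------------------------------------
r90 | posenato | 2014-12-04 05:47:01 +0100 (jeu., 04 déc. 2014) | 2 lines

Import iniziale
------------------------------------------------------------------------
"""

print(first_logged_revision(sample_log))  # prints 90
```

A result greater than 1 would suggest the origin falls in the same situation as CSTNU.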

In contrast, generating a dump file with svnrdump fetches all revisions, starting from revision 0.

(swh) anlambert@carnavalet:/tmp$ svnrdump dump https://profs.scienze.univr.it/posenato/svn/sw/CSTNU/ > CSTNU.dump
* Dumped revision 0.
* Dumped revision 1.
* Dumped revision 2.
* Dumped revision 3.
* Dumped revision 4.
* Dumped revision 5.
* Dumped revision 6.
* Dumped revision 7.
...
...
...
* Dumped revision 615.
* Dumped revision 616.
* Dumped revision 617.
* Dumped revision 618.
* Dumped revision 619.

My guess is that the svn server is configured to serve revision 90 as the first revision when the log is requested, but that this setting is not taken
into account when a dump file is requested.
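
To illustrate how such a gap can lead to the KeyError above, here is a toy model. It is not the actual swh-loader-svn code: the functions `chain_naive` and `chain_from_first_logged` are hypothetical, and the model simply assumes that parents are looked up in a dict seeded at revision 1 and then filled one revision ahead.

```python
from typing import Dict, List, Optional, Tuple

Parents = Dict[int, Tuple[int, ...]]


def chain_naive(logged_revs: List[int], start_rev: int = 1) -> Parents:
    """Toy model: seed the mapping at start_rev, link each revision to rev + 1."""
    parents: Parents = {start_rev: ()}
    for rev in logged_revs:
        _ = parents[rev]  # raises KeyError if rev was never seeded nor linked
        parents[rev + 1] = (rev,)
    return parents


def chain_from_first_logged(logged_revs: List[int]) -> Parents:
    """Toy alternative: seed from whatever revision the server returns first."""
    parents: Parents = {}
    previous: Optional[int] = None
    for rev in logged_revs:
        parents[rev] = () if previous is None else (previous,)
        previous = rev
    return parents


# The log of the CSTNU origin starts at revision 90, not 1.
logged = [90, 91, 92]

try:
    chain_naive(logged)
except KeyError as exc:
    print("KeyError:", exc)  # prints "KeyError: 90"

print(chain_from_first_logged(logged))  # {90: (), 91: (90,), 92: (91,)}
```

The second function only shows one possible direction (not assuming that the first served revision is 1); the fix that actually landed may differ.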

ardumont claimed this task.

Diffs landed, packaged and deployed.
Closing.

ardumont renamed this task from loading some svn origin while ignoring history sometimes raises to loading some svn origins while ignoring history sometimes raises. Nov 9 2021, 10:21 AM