
Enable save code now of software source code archives for specific users
Closed, Resolved (Public)

Description

Currently, Save Code Now only allows fetching code available from a VCS. The rationale behind this was to avoid the service being abused by random users.

Now that we have authentication and authorization in place, and Software Heritage ambassadors are coming on board, we can relax this constraint and allow specific users to trigger "save code now" also for .tar and .zip archives, packages, etc.

This should be a simple modification to the current interface (add the new origin types in the drop-down menu, and adapt the logic).

Revisions and Commits

rDWAPPS Web applications
D5801
D5792
D5738
D5725
D5724
D5676
D5696
D5685
D5661
D5660
D5653
rDENV Development environment
D5732
D5728
D5678
rSPSITE puppet-swh-site
D5622
rDAUTH Common authentication libraries
D5578

Event Timeline

rdicosmo triaged this task as Normal priority.
rdicosmo created this task.

Now that we have authentication and authorization in place, and Software Heritage ambassadors are coming on board, we can relax this constraint and allow specific users to trigger "save code now" also for .tar and .zip archives, packages, etc.

As a first step, we could immediately accept a save code now request submitted by an ambassador (no pending state for manual review).
This would require adding a dedicated permission for ambassadors in our authentication provider.
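The permission gate described here could look roughly like the following minimal sketch. Both the permission string and the function name are assumptions for illustration, not the actual names used in the authentication provider or webapp:

```python
# Minimal sketch of the ambassador fast path described above.
# The permission string is hypothetical, not the deployed name.
AMBASSADOR_PERMISSION = "swh.web.api.save_code_now.ambassador"


def initial_request_status(user_permissions):
    """Ambassador requests skip the pending state for manual review."""
    if AMBASSADOR_PERMISSION in user_permissions:
        return "accepted"
    return "pending"
```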

For the support of other origin visit types, @ardumont should know better than me how this could be integrated in the scheduler.

For the support of other origin visit types, @ardumont should know better than me how this could be integrated in the scheduler.

There is not much to it. Update the configuration snippet [1] with the correct tasks (for our supported loaders), prefixed
with save_code_now:, then deploy and publish the manifest. The scheduler runner will do the rest.

[1] https://forge.softwareheritage.org/source/puppet-swh-site/browse/production/data/common/common.yaml$2292-2303
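For illustration, the shape of that snippet might look like the following, rendered here as Python since the actual file [1] is YAML. The task names are assumptions, not the deployed values:

```python
# Schematic sketch of the scheduler configuration referenced in [1].
# The real file is YAML in puppet-swh-site; the task names below are
# illustrative assumptions, not the deployed configuration.
task_types = [
    "save_code_now:load-git",
    "save_code_now:load-hg",
    "save_code_now:load-svn",
    # new entry for the tarball/archive loader
    "save_code_now:load-archive-files",
]


def save_code_now_tasks(types, prefix="save_code_now:"):
    """The scheduler runner picks up every task type with this prefix."""
    return [t[len(prefix):] for t in types if t.startswith(prefix)]
```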

Thanks @ardumont ... so it appears that adapting the logic is easy... may you do it?
@anlambert may you look into the needed modification of the UI, to enable the new type of save code now payloads for selected authenticated users?

Thanks @ardumont ... so it appears that adapting the logic is easy... may you do it?
@anlambert may you look into the needed modification of the UI, to enable the new type of save code now payloads for selected authenticated users?

Sure, I planned to work on it tomorrow.

Thanks @ardumont ... so it appears that adapting the logic is easy... may you do it?

Well, if the bundle format is .zip, .tar, etc., I suppose the archive loader we use for GNU is enough.
There is still some glue to add in the webapp so it sends the correct message to the scheduler.
So yes, I can look into it.

I stand by what I said regarding the scheduling logic, it's as simple as I described
earlier... But...

Regarding the archive loader in its current state, it won't be possible to just add the
new type in the webapp.

Currently, users only provide a URL in save code now, while the loader expects a bit more
[1] (recall that it is the lister which actually provides those).

The loader expects to be provided with a list of artifacts (there could be only one in our
case). Such artifacts are described through the following:

  • artifact url
  • time
  • length (could be derived from the url when talking to the server, but not all servers provide it...)
  • version (could be derived from the url with heuristics as well, but that's regexp-hell-ish and prone to error)
  • filename (could be derived from the url without too much risk, I think...)
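To illustrate why deriving the version from the url is fragile, here is a naive sketch of such a heuristic (hypothetical code, not the lister's actual implementation):

```python
import re

# Naive heuristic (illustration only): grab the first dotted number
# sequence found in the artifact filename and call it the version.
VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def guess_version(url):
    filename = url.rstrip("/").rsplit("/", 1)[-1]
    match = VERSION_RE.search(filename)
    return match.group(0) if match else None


guess_version("https://ftp.gnu.org/old-gnu/emacs/elib-1.0.tar.gz")
# -> "1.0", fine.
guess_version("https://www.openssl.org/source/openssl-1.1.1k.tar.gz")
# -> "1.1.1", silently dropping the "k" patch suffix.
```

Each upstream uses its own naming convention, so any single regexp mishandles some of them; that is the "regexp-hell" referred to above.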

I gather the save code now UI could be enriched (and displayed according to the chosen visit
type), but that becomes more involved for people in general.

Another road would be to make some of those properties optional...

(Also, *shocker* the original need is not as simple as described in the task ;)

Thoughts?

[1]

 "url": "https://ftp.gnu.org/old-gnu/emacs/",
 "artifacts": [{"url": "https://ftp.gnu.org/old-gnu/emacs/elib-1.0.tar.gz",
                "time": "1995-12-12T08:00:00+00:00",
                "length": 58335,
                "version": "1.0",
                "filename": "elib-1.0.tar.gz",
                },
                ...
               ]
...


From what I see, this loader is designed to be part of a pipeline where some work has already been done by a previous phase... in this example, that's the GNU lister I guess, that embodies heuristics to extract the version number from the file name following the convention used on the GNU ftp server.

For save code now, this previous phase is the user, and it makes sense to ask the user for some guidance, but here are a few questions/remarks before moving forward:

  • why do we need the length of the file? this can be computed when it is downloaded, right?
  • for filename, I agree that it can be safely extracted from the url
  • how is the version information used in the pipeline? Is this just a piece of metadata stored somewhere, or does it lead to creating synthetic commits?

(submitted too early)

Thanks for the questions, that helps.

From what I see, this loader is designed to be part of a pipeline where some work has already been done by a previous phase... in this example, that's the GNU lister I guess, that embodies heuristics to extract the version number from the file name following the convention used on the GNU ftp server.

Yes, the lister is in charge of extracting those.

For save code now, this previous phase is the user, and it makes sense to ask the user for some guidance, but here are a few questions/remarks before moving forward:

  • why do we need the length of the file? this can be computed when it is downloaded, right?

I recall it's part of creating a primary key (of sorts) composed of all the properties mentioned
above (when the artifact does not already provide some hashes).
This is to avoid fetching again things that were already fetched.

[1] https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/archive/loader.py$45-46

  • for filename, I agree that it can be safely extracted from the url

yes. After reading [1] again, this one looks optional.

  • how is the version information used in the pipeline? Is this just a piece of metadata stored somewhere, or does it lead to creating synthetic commits?

I recall it's a piece of metadata, but also part of an implementation detail to speed up loading if already fetched. I'll double-check.

I recall it's part of creating a primary key (of sorts) composed of all the properties mentioned
above (when the artifact does not already provide some hashes).
This is to avoid fetching again things that were already fetched.

If I understand well, url+time+length+filename+version are used in a heuristic to avoid (down)loading over and over again something that is already ingested (I am still a bit puzzled by version and filename though: what is the corner case that they can help catch better than just url+time+length?).
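The deduplication idea under discussion can be sketched as follows (illustrative only; the real key lives in swh-loader-core's archive loader, see [1] above):

```python
# Sketch of the "primary key" heuristic discussed above: skip artifacts
# whose identifying properties were already seen in a previous visit.
def artifact_key(artifact):
    return (
        artifact["url"],
        artifact["time"],
        artifact["length"],
        artifact["version"],
    )


def artifacts_to_fetch(artifacts, known_keys):
    """Keep only the artifacts not already ingested."""
    return [a for a in artifacts if artifact_key(a) not in known_keys]
```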

For a "save code now", we should be able to get the same information (except the version) with a simple HEAD HTTP request performed when the user submits the request (this is by the way a very useful early validation step in any case). E.g.:

$ curl -I https://ftp.gnu.org/gnu/a2ps/a2ps-4.12.tar.gz
HTTP/1.1 200 OK
Date: Sat, 24 Apr 2021 11:09:18 GMT
Server: Apache/2.4.18 (Trisquel_GNU/Linux)
Strict-Transport-Security: max-age=63072000
Last-Modified: Tue, 23 Feb 1999 00:20:34 GMT
ETag: "1bf381-3447d26483880"
Accept-Ranges: bytes
Content-Length: 1831809
Content-Security-Policy: default-src 'self'; img-src 'self' https://static.fsf.org https://gnu.org; object-src 'none'; frame-ancestors 'none'
X-Frame-Options: DENY
X-Content-Type-Options: nosniff
Content-Type: application/x-gzip

Here Last-Modified provides time, and Content-Length provides length.
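Building the artifact description from such a response could be sketched like this (a hypothetical helper, not the webapp's actual code; in practice the headers would come from `requests.head(url)`):

```python
from email.utils import parsedate_to_datetime


def artifact_from_head(url, headers):
    """Derive time, length and filename from HEAD response headers."""
    last_modified = headers.get("Last-Modified")
    length = headers.get("Content-Length")
    return {
        "url": url,
        # Parse the RFC 2822 date from Last-Modified into ISO format
        "time": parsedate_to_datetime(last_modified).isoformat()
        if last_modified
        else None,
        "length": int(length) if length else None,
        "filename": url.rstrip("/").rsplit("/", 1)[-1],
    }


# Using the headers from the curl example above:
artifact_from_head(
    "https://ftp.gnu.org/gnu/a2ps/a2ps-4.12.tar.gz",
    {"Last-Modified": "Tue, 23 Feb 1999 00:20:34 GMT", "Content-Length": "1831809"},
)
# -> {"url": ..., "time": "1999-02-23T00:20:34+00:00",
#     "length": 1831809, "filename": "a2ps-4.12.tar.gz"}
```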

What do you think?

If I understand well, url+time+length+filename+version are used in a heuristic to
avoid (down)loading over and over again something that is already ingested

yes (minus filename, see below)

(I am still a bit puzzled by version and filename though: what is the corner case
that they can help catch better than just url+time+length?).

I don't think there is any reason indeed. Note that I misled the discussion with
filename; it is an optional field (not part of that key, probably for this very reason).

Including the version in that key is apparently a hiccup (that we should probably
simplify). Well, assuming that the version is already present in the url field ;)

Note that I think having the version as a parameter was initially meant to avoid heuristic
parsing computations on the package loader side. It's also consistent with what other
package listers do (CRAN, PyPI, npm...): they provide, among other things, the artifact
version directly to their associated package loaders.

That said, I realize now that since we are introducing a new kind of client for loaders
(not only listers now), this part may need revisiting. If we don't want to replicate
those heuristics, we need to move them back into the loader (at least for the archive
loader, where that makes sense right now).

For a "save code now", we should be able to get the same information (except the
version) with a simple HEAD HTTP request performed when the user submits the request
(this is by the way a very useful early validation step in any case). E.g.:

...
Here `Last-Modified` provides `time`, and `Content-Length` provides `length`.

What do you think?

Agreed! Thanks.

[1] The same goes for the cran loader, which uses both the url and the version... for its use case.

[2] https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/gnu/tree.py$259-307

One or two concerns remain about this before actually acting on it.

What's the expected origin for a bundle in save code now?

The first version of the archive loader used the tarball url as origin. Then we
moved away from that (and we did some cleanup in the archive back then, iirc). Now,
instead, the lister provides the origin url and the list of artifacts to actually ingest
for said origin. That avoids a one-to-one origin-bundle association in the archive.

Asking because, beyond the loader implementation, webapp-wise we only have one input
field in the UI for the origin url, so more inputs may be needed.

Either 1. we automatically set the base url as the origin url.

Or 2. we provide 2 inputs, and the origin is built out of the artifact url. The user
fills in the artifact url input field and, with some UI js magic, the origin url input is
filled in with the origin (as in 1.). Only now the user can fix whatever needs fixing in that
dedicated field.

Or 3. we propose a more involved UI where the user can submit the origin url and as many
software bundle artifact urls as they want.
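The url derivation in options 1. and 2. could be sketched as follows (illustrative, not the webapp's actual code): drop the last path component of the artifact url and keep the enclosing "directory" as the default origin.

```python
from urllib.parse import urlsplit, urlunsplit


def default_origin_url(artifact_url):
    """Derive a default origin url by dropping the artifact filename."""
    parts = urlsplit(artifact_url)
    base_path = parts.path.rsplit("/", 1)[0] + "/"
    return urlunsplit((parts.scheme, parts.netloc, base_path, "", ""))


default_origin_url("https://ftp.gnu.org/old-gnu/emacs/elib-1.0.tar.gz")
# -> "https://ftp.gnu.org/old-gnu/emacs/"
```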

The last concern may not be one, depending on the above. It's about the resulting
snapshot. The snapshot built out of a visit from the archive loader has as many
branches as there are artifacts to load (from its input). So currently the snapshot
will be built out of 1 branch in the naive versions (1. or 2.), or as many as the
lister already produces (3.).

After discussion with @anlambert and @rdicosmo, we agreed on the following as a first
iteration of the ui for the new bundle type.

For API-connected users (ambassadors) only: add a new bundle (or tar) type in the
selection list.

When selected, the UI refreshes and proposes the main url input as currently, plus 3
other input fields (origin, filename, version). As a first iteration, the inputs are not
yet prefilled; the user fills in the required fields themselves. (Given some more
feedback from users, we can always improve that later.)

Once the user is satisfied with their inputs, they submit the form. This triggers a
validation check [1]. If some problem occurs during the check, the user is notified
with a notification alert, as is done currently. The notification should be as
meaningful as possible so the user knows what to fix, if possible. Otherwise, when the
checks pass, this triggers the scheduling of a new ingestion task of type
'load-archive-files' for the new origin, with a list of one artifact filled in.

The rest of the ui continues to work as it currently does.

[1] Implementation-wise, an asynchronous check done by the server through a HEAD request,
to verify that the url to ingest is ok and eventually fetch the missing information
(size and last-modified date of the artifact).
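Put together, the scheduling step might look roughly like this. Names and the payload shape are hypothetical, sketched from the example artifact payload earlier in this discussion, not the actual webapp code:

```python
# Hypothetical sketch: after the asynchronous HEAD check succeeds,
# schedule a 'load-archive-files' task for the origin with a single
# artifact. Field names follow the example payload shown earlier.
def build_task(origin_url, artifact_url, version, time, length):
    return {
        "type": "load-archive-files",
        "arguments": {
            "url": origin_url,
            "artifacts": [
                {
                    "url": artifact_url,
                    "version": version,
                    "time": time,
                    "length": length,
                    "filename": artifact_url.rsplit("/", 1)[-1],
                }
            ],
        },
    }
```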

anlambert renamed this task from "Enable save code now of software bundles for specific users" to "Enable save code now of software source code archives for specific users". May 27 2021, 5:15 PM

The feature has been implemented and looks ready for production use.

I just tested it using the Web API and the docker environment for a real world example: the Kermit Software Source Code Archive.
I used the script below:

archive_kermit.py
import os

from bs4 import BeautifulSoup
import requests

origin_url = "https://www.kermitproject.org/archive.html"

# Fetch the archive page and extract all links to .tar.gz tarballs
response = requests.get(origin_url)

page = BeautifulSoup(response.content, features="html.parser")

archive_links = [
    a["href"] for a in page.find_all("a", href=True) if a["href"].endswith(".tar.gz")
]

archives_data = []

for archive_link in archive_links:
    # Skip relative links and URLs that are no longer reachable
    if not archive_link.startswith("http"):
        continue
    if not requests.head(archive_link).ok:
        continue
    # Derive an artifact version from the tarball filename
    artifact_version = archive_link.split("/")[-1].split(".")[0]
    archives_data.append(
        {"artifact_url": archive_link, "artifact_version": artifact_version}
    )

# Submit a save code now request for all collected artifacts at once
save_code_now_url = (
    f"http://localhost:5004/api/1/origin/save/archives/url/{origin_url}/"
)

headers = {"Authorization": f"Bearer {os.environ['SWH_TOKEN']}"}

print(
    requests.post(
        save_code_now_url, json={"archives_data": archives_data}, headers=headers
    ).json()
)

Apart from a corrupted tarball that could not be uncompressed, the loading went fine and 166 tarballs were loaded into the archive.

swh-loader_1                    | [2021-05-28 10:09:08,405: INFO/MainProcess] Task swh.loader.package.archive.tasks.LoadArchive[2c5f2fcc-94ed-4991-ac16-bd395ad6269e] received
swh-loader_1                    | [2021-05-28 10:13:58,931: ERROR/ForkPoolWorker-1] Failed loading branch releases/cpm80 for https://www.kermitproject.org/archive.html
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 401, in _decode
swh-loader_1                    |     data = self._decoder.decompress(data)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 88, in decompress
swh-loader_1                    |     ret += self._obj.decompress(data)
swh-loader_1                    | zlib.error: Error -3 while decompressing data: incorrect header check
swh-loader_1                    | 
swh-loader_1                    | During handling of the above exception, another exception occurred:
swh-loader_1                    | 
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 753, in generate
swh-loader_1                    |     for chunk in self.raw.stream(chunk_size, decode_content=True):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 576, in stream
swh-loader_1                    |     data = self.read(amt=amt, decode_content=decode_content)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 548, in read
swh-loader_1                    |     data = self._decode(data, decode_content, flush_decoder)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 407, in _decode
swh-loader_1                    |     e,
swh-loader_1                    | urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
swh-loader_1                    | 
swh-loader_1                    | During handling of the above exception, another exception occurred:
swh-loader_1                    | 
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/loader.py", line 576, in load
swh-loader_1                    |     res = self._load_revision(p_info, origin)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/loader.py", line 713, in _load_revision
swh-loader_1                    |     dl_artifacts = self.download_package(p_info, tmpdir)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/loader.py", line 364, in download_package
swh-loader_1                    |     return [download(p_info.url, dest=tmpdir, filename=p_info.filename)]
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/package/utils.py", line 90, in download
swh-loader_1                    |     for chunk in response.iter_content(chunk_size=HASH_BLOCK_SIZE):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 758, in generate
swh-loader_1                    |     raise ContentDecodingError(e)
swh-loader_1                    | requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
swh-loader_1                    | [2021-05-28 10:26:18,035: WARNING/ForkPoolWorker-1] 1 failed branches
swh-loader_1                    | [2021-05-28 10:26:18,035: WARNING/ForkPoolWorker-1] Failed branches: releases/cpm80
swh-loader_1                    | [2021-05-28 10:26:18,050: INFO/ForkPoolWorker-1] Task swh.loader.package.archive.tasks.LoadArchive[2c5f2fcc-94ed-4991-ac16-bd395ad6269e] succeeded in 1029.6283570680534s: {'status': 'eventful', 'snapshot_id': '60f27b38f9808b0d718387ff5fb27faf47e4624d'}

The feature has been implemented and looks ready for production use.

I just tested it using the Web API and the docker environment for a real world example: the Kermit Software Source Code Archive.

Great to see this progress (and kermit archived :-))

Great to see this progress (and kermit archived :-))

This still needs to be deployed to production; I only loaded kermit into my docker environment.
But I will reuse the script above to do the real loading once that is done.

@rdicosmo That's the issue ^ (unrelated to T3361 in the end): the workers were not subscribed to consume from that queue yet.

(Although, even if the workers had been subscribed, they would have been stuck the same way nonetheless.)

It's deployed now.

Of course, now they won't ingest your inputs...
I'll investigate in a dedicated task [1].

[1] T3365

The issue mentioned ^ has been fixed and deployed.
Everything is deployed.