Use intrinsic info in browse URLs
Closed, Migrated

Description

Currently, the browse web application uses non-intrinsic info in URLs to reference origins and their visits.
For instance, the root directory as found in the latest visit of the Linux kernel source tree mirror on GitHub
can be browsed through the URL: /browse/origin/2/visit/28/directory/.
The problem with this URL is that it uses internal database ids to
reference the origin and its visit.

In order to use intrinsic info, that URL should be replaced by
/browse/origin/git/url/https://github.com/torvalds/linux/visit/2017-11-21T19:37:00.000Z/directory

This means that an origin should always be referred to by its type and its URL, and each of
its visits by a date (fetching the closest visit when the provided date does not exactly match a visit).

This scheme must be used across all browse URLs.
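
The scheme described above can be sketched in a few lines of Python. This is only an illustration, not the swh-web implementation: the helper names browse_directory_url and closest_visit_date are hypothetical, and the sketch assumes visit dates are timezone-aware datetime objects kept in a sorted list.

```python
from bisect import bisect_left
from datetime import datetime, timezone


def browse_directory_url(origin_type, origin_url, visit_date):
    """Build a browse URL from intrinsic info only (hypothetical helper).

    visit_date is assumed to be a timezone-aware datetime.
    """
    # ISO 8601 in UTC with millisecond precision and a "Z" suffix,
    # mirroring the example URL in the description.
    utc = visit_date.astimezone(timezone.utc)
    stamp = utc.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return f"/browse/origin/{origin_type}/url/{origin_url}/visit/{stamp}/directory"


def closest_visit_date(visit_dates, requested):
    """Return the visit date closest to `requested`.

    visit_dates must be sorted and non-empty; on a tie the earlier date wins.
    """
    i = bisect_left(visit_dates, requested)
    # Only the neighbours around the insertion point can be the closest.
    candidates = visit_dates[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda d: abs(d - requested))


visit = datetime(2017, 11, 21, 19, 37, tzinfo=timezone.utc)
print(browse_directory_url("git", "https://github.com/torvalds/linux", visit))
```

A requested date that falls between two visits resolves to whichever is nearer, so a URL written with an approximate date still leads to a visit.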

Event Timeline

anlambert changed the task status from Open to Work in Progress. Dec 11 2017, 11:45 AM

The only issue with this approach is that there is no guarantee that a timestamp corresponds to a single visit: we can make several passes over an origin with the same (virtual) visit timestamp, for instance:

  • when we have downloaded the contents of an origin for further processing, the visit timestamp is the time we downloaded the data; if we fail to import for some reason, and we try again, two visits get the same timestamp.
  • when we have listed the Debian archive, the visit timestamp is the time we fetched the indices; we can then process the packages in several (partial) passes before succeeding.

Although the latest incremental visit for a given timestamp is probably the most interesting, in the general case we will need a way to disambiguate those visits.
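
To make the ambiguity concrete, here is a minimal Python sketch that finds timestamps shared by several visits. The record layout (a "visit" id and a "date" string per visit) and the function name ambiguous_timestamps are assumptions made for illustration, and the sample data is invented, not taken from the archive.

```python
from collections import defaultdict


def ambiguous_timestamps(visits):
    """Map each timestamp shared by several visits to the sorted ids of those visits.

    Each visit is assumed to be a dict with "visit" (id) and "date" keys.
    """
    by_date = defaultdict(list)
    for visit in visits:
        by_date[visit["date"]].append(visit["visit"])
    # Keep only the timestamps that do not identify a single visit.
    return {date: sorted(ids) for date, ids in by_date.items() if len(ids) > 1}


# Invented sample: two passes share a timestamp, a third visit does not.
sample = [
    {"visit": 1, "date": "2017-11-21T19:37:00Z"},
    {"visit": 2, "date": "2017-11-21T19:37:00Z"},
    {"visit": 3, "date": "2017-12-05T08:00:00Z"},
]
print(ambiguous_timestamps(sample))
```

Any timestamp present in the result cannot be resolved to a visit by date alone and needs an extra discriminator.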

As requested, here is an example of an origin where we struggle to make the full visit on the first pass, and therefore have a bunch of visits with the same timestamp:

https://archive.softwareheritage.org/api/1/origin/65623577/

A new swh-web version using intrinsic info is now deployed to moma.

In order to disambiguate origin visits with the same date, I added a visit_id query parameter
to retrieve the right visit in such cases (see https://archive.softwareheritage.org/browse/origin/deb/url/deb://Debian/packages/linux/).
For a given date, the default behaviour is to fetch the latest visit with that date.
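
The lookup rule described here could look roughly like the following Python sketch. It is not the swh-web code: resolve_visit is a hypothetical name, visits are assumed to be dicts with "visit" and "date" keys, and "latest visit" is assumed to mean the highest visit id among those sharing the date.

```python
def resolve_visit(visits, date, visit_id=None):
    """Pick the visit matching `date`, optionally disambiguated by `visit_id`.

    Assumption: visit ids grow over time, so the highest id is the latest visit.
    """
    matching = [v for v in visits if v["date"] == date]
    if not matching:
        raise LookupError(f"no visit with date {date}")
    if visit_id is None:
        # Default behaviour: the latest visit with that date.
        return max(matching, key=lambda v: v["visit"])
    for v in matching:
        if v["visit"] == visit_id:
            return v
    raise LookupError(f"no visit {visit_id} with date {date}")


# Invented sample data: two visits share the same date.
visits = [
    {"visit": 1, "date": "2017-11-21T19:37:00Z"},
    {"visit": 2, "date": "2017-11-21T19:37:00Z"},
]
latest = resolve_visit(visits, "2017-11-21T19:37:00Z")
first = resolve_visit(visits, "2017-11-21T19:37:00Z", visit_id=1)
```

Keeping visit_id optional means URLs stay purely intrinsic in the common case, and the internal id only appears when a date is genuinely ambiguous.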