diff --git a/docs/roadmap/roadmap-2022.rst b/docs/roadmap/roadmap-2022.rst new file mode 100644 --- /dev/null +++ b/docs/roadmap/roadmap-2022.rst @@ -0,0 +1,890 @@ +.. _roadmap-2022: + +Roadmap 2022 +============ + +(Version 1.0, last modified 28/03/2022) + +This document provides an overview of the technical roadmap of Software Heritage for +2022. + +The `Kanban board `_ +is available on our forge. + +.. contents:: + :depth: 3 +.. + +Collect +------- + +Extend archive coverage (2+2 loaders/listers) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + +- lead: ardumont +- tags: coverage +- task: `T4079 `__ +- effort: variable, depending on the chosen listers/loaders (4 PM?) +- priority: Medium + +Deploy at least 2 additional loaders (of currently unsupported +VCS/package formats) and 2 additional listers (of currently unsupported +hosting platforms), expanding the coverage of the Software Heritage +archive. Listers and loaders can be developed in house or contributed by +external partners, e.g., via dedicated grants. + +KPIs: + +- number of new loaders/listers deployed +- number of origins archived/listed + +Minimize archival lag w.r.t. upstream code hosting platforms +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: olasd +- tags: performance, coverage +- task: `T4080 `__ +- effort: 3 PM +- priority: High + +Includes work: + +- Quantify and monitor the lag in real time, especially for major + platforms (GitHub, GitLab.com, etc.) 
+- Improve ingestion efficiency (optimize loaders, especially the Git + loader, optimize scheduling policies) - + `T2207 `__ +- Make lag monitoring dashboards easy to find (for decision makers) + +KPIs: + +- number of out-of-date repos (absolute and per platform) +- total archive lag (e.g., in days) + +Add forge now +^^^^^^^^^^^^^ + +- lead: ardumont +- tags: coverage +- task: `T1538 `__ +- effort: 3 PM +- priority: High + +Includes work: + +Make it user-driven, simple, and efficient to fully and recurrently +archive a new instance of an already supported code hosting platform. + +- User-facing web form allowing any user to *propose* the archival of a + new forge instance, and moderation web UI to validate archival + requests before ingestion. `T4047 `__ +- Admin tooling and UI to deal with received submissions. `T4058 `__ +- Include a free-form suggestion box for forges that are not + supported yet (to replace the currently poorly maintained `wiki + page `__). + Possibly to be integrated with the user support system elsewhere in + the roadmap. + +KPIs: + +- number of forges/instances added + +Integrate deposit with InvenioRDM +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: moranegg +- tags: 2021, coverage, deposit +- task: `T2344 `__ +- effort: 1-2 PM +- priority: Medium + +Includes work: + +Deploy in production support for receiving source code deposits from +InvenioRDM instances, and in particular the Zenodo instance. + +- Extend CodeMeta vocabulary to qualify author relationships - `T2329 `__ +- Generalize usage of SWHID for referencing SWH archive objects - `T3034 `__ +- Analyze deposit-client compatibility with InvenioRDM - `T3549 `__ + + +KPIs: + +- support deployed in production +- number of Zenodo deposits + + +Admin tooling for takedown notices +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: douardda +- tags: 2021, legal +- task: `T3087 `__ +- effort: 3 PM +- priority: High + +Includes work: + +Admin interface, private and public journal of operations. 
+ +- Low-level support for blacklisting specified contents (not only URLs, + also SWHIDs), with support for regexps (Txxx) +- Admin interface to add/remove entries from the blacklist (Txxx) +- A journal of these operations (what was added to or removed from the + blacklist, when and why) (Txxx) +- A public webpage that maintains the list of accepted takedown notices + (Txxx) + + +KPIs: + +- Takedown tools deployed in production +- Number of processed takedown notices + +Preserve +-------- + +Continuous data validation of all the data stores of SWH +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: vlorentz +- tags: integrity, monitoring +- task: `T3841 `__ +- effort: 2 PM +- priority: Medium + +Includes work: + +- Set up background jobs to regularly check data validity in all SWH + data stores. +- This includes both blobs (swh-objstorage) and other graph objects + (swh-storage) on all the copies (in-house, Kafka, Azure, upcoming + mirrors, etc.). +- Estimate ETA for scrubbing of the entire archive. + +KPIs: + +- scrubbers deployed in production +- monitoring tools deployed in production +- % of the archive scrubbed + +Support archiving repositories containing SHA1 hash conflicts on blobs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: olasd +- tags: crypto +- task: `T3775 `__ +- effort: 1.5 PM +- priority: High + +Includes work: + +This involves getting rid of the limitations imposed by having SHA1 as a +primary key for the object storage internally. + +KPIs: + +- Ability to archive Git repos that contain sample SHAttered + collision blobs (they are currently detected and refused) + +Up-to-date anonymized archive copy on Amazon S3 (except blobs) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: seirl +- tags: 2021, archivecopy +- task: `T3085 `__ +- effort: 3 PM +- priority: Low + +Includes work: + +Periodic dumps of the (anonymized) Merkle graph on the Amazon public +cloud. 
+ +- fully automate export of the graph dataset - `T1847 `__ +- document how to export the graph edge dataset - `T2431 `__ +- define the export periodicity + +KPIs: + +- automatic exports scheduled +- S3 copy up to date with the last scheduled export + +Archive cold-copy at CINES via Vitam +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: douardda +- tags: 2021, archivecopy +- task: `T3414 `__ +- effort: 2 PM +- priority: Medium + +Includes work: + +Perform a first complete copy of the archive, stored in Vitam at CINES. +Keep the copy up to date periodically (on a period TBD). + +- Specify the Vitam archiving format - `T3415 `__ +- Implement the replayer service for Vitam - `T3416 `__ +- Define the update schedule + +KPIs: + +- first copy stored in Vitam +- update calendar defined + +Mirrors +^^^^^^^ + +- lead: douardda +- tags: 2021, mirror +- task: `T3116 `__ +- effort: 2 PM +- priority: High + +Includes work: + +Deploy in production at least 2 mirrors. + +- finalize ENEA Mirror deployment (Txxx) +- launch Snyk mirror project (Txxx) +- handle takedown notice synchronization? +- add feature flags on web UI + + +KPIs: + +- ENEA Mirror in production +- Snyk mirror in production + +Publicly available standard for SWHID version 1 +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: zack +- tags: 2021, standard, swhid +- task: `T3960 `__ +- effort: 1 PM +- priority: High + +Includes work: + +Publish a stable version of the SWHID version 1 specification, approved +by a standards body. + +KPIs: + +- published standard for SWHID version 1 + +SWHID version 2 +^^^^^^^^^^^^^^^ + +- lead: zack +- tags: 2021, swhid, crypto +- task: `T3134 `__ +- effort: 4 PM +- priority: Low + +Includes work: + +Complete on-paper specification for SWHID version 2, including migrating +to a stronger hash than SHA1. 
+ +- complete on paper spec +- aligned with work done on new git hashes +- migration plan from/cohabitation with v1 (N.B.: we need to maintain SWHID v1 support forever anyway) +- understand impact on internal microservice architecture (related to `T1805 `__, in particular use SWHIDs everywhere (core SWHIDs, without qualifiers)) +- keep correspondence with v1 (there may be multiple v2 for one v1) +- reviewed by crypto experts + + +KPIs: + +- written SWHID version 2 specification + +Share +----- + +Show metadata on Web UI +^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: vlorentz +- tags: share, present, webui +- task: `T4081 `__ +- effort: 3 PM +- priority: Low + +Includes work: + +Layer 1: show intrinsic and extrinsic metadata for artifact on web UI +(design, implementation and deployment) Layer 2: add linked data +capabilities (Semantic Web solutions) + +- design metadata view for Web UI +- allow export of metadata (in multiple formats - APA/ BibTeX/ + CodeMeta/ CFF) +- assistance and contribution to CodeMeta + +KPIs: + +- amount of metadata accessible on Web UI + +Provide a state-of-the-art UX for web search +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: jayesh +- tags: search +- task: `T3952 `__ +- effort: 3 PM +- priority: Medium + +Includes work: + +- Make the textual search language of archive.s.o a first-class + citizen, including: +- Simplify syntax +- Conduct UX audits and user-testing of the web search UI +- Note: this does *not* include extending the type of data currently + indexed and used for search (e.g., no filenames, no file content, + etc.; they can come later/separately). 
+ + +KPIs: + +- SWH search using QL available in production +- Default user experience for archive.s.o textual searches + +Self-host Software Stories software stack +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: moranegg +- tags: communication, wikidata, docs +- task: `T3954 `__ +- effort: 1 PM +- priority: Medium + +Includes work: + +- Deploy `stories instance `__ in production on the SWH infrastructure. + +KPIs: + +- software stories app deployed in production on SWH infra +- content of current stories migrated to SWH instance + +Webhook-based notification for long-running user tasks +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: anlambert +- tags: deposit, vault, savecodenow +- task: `T3955 `__ +- effort: 1-3 PM +- priority: High + +Includes work: + +- create a reusable webhook architecture +- Add support for webhook-based notifications of long-running user + tasks, including: + + - deposit + - vault cooking + - save code now + - add forge now + - origin visit + +KPIs: + +- number of services that support webhook-based notifications + +Collect and index forge metadata +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: vlorentz +- tags: 2021 +- task: `T2202 `__ +- effort: 9 PM +- priority: High + +Includes work: + +- collect extrinsic metadata from at least 1 forge (e.g., GitHub or + GitLab project metadata) +- index them into a sensible and searchable ontology/data model (could be codemeta, if suitable, or something else if needed) +- cross-reference them to archived objects via SWHID +- enable searches based on indexed metadata + +KPIs: + +- number of forges supported +- metadata fields collected + +Prior art detection +^^^^^^^^^^^^^^^^^^^ + +- lead: zack +- tags: 2021 +- task: `T3136 `__ +- effort: 5 PM +- priority: Medium + +Includes work: + +Provide a full-circle user toolchain for prior-art detection in the +realm of software source code. 
+ +- `revamp swh-scanner result dashboard `__ +- integrate with swh-provenance +- integrate with swh-graph + +KPIs: + +- release and announce a beta version of swh-scanner + +Documentation +------------- + +docs.s.o: provide a landing page, dispatching to devel/user/sysadmin/mirrors +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: bchauvet +- tags: docs, sys-admin +- task: `T3867 `__ +- effort: 0.5 PM +- priority: Medium + +Includes work: + +- Provide a nice landing page for all documentation at docs.s.o, + dispatching by user type. +- Drop the redirection docs.s.o -> docs.s.o/devel. +- Depends on populating the /sysadm, /user and /mirrors parts. + +KPIs: + +- landing page in production (https://docs.softwareheritage.org) + +docs.s.o/sysadm: improve sysadmin documentation website +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: vsellier +- tags: docs, sys-admin +- task: `T4082 `__ +- effort: 1 PM +- priority: Medium + +Includes work: + +- General goal: onboarding material + transparency about how we run the archive. +- Target user: team member, partners (e.g. mirror operators), or contributor who needs a clear view of the infrastructure architecture. + +This task will be completed when it: +- Documents the configuration system of each component. +- Documents hardware architecture. +- Documents CI architecture (and other major services currently not documented). + +KPIs: + +- list of minimum documented items +- number of available documented items + +docs.s.o/user: bootstrap user documentation website +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: moranegg +- tags: docs, user +- task: `T3972 `__ +- effort: 2 PM +- priority: Medium + +Includes work: + +The currently available user documentation only provides a FAQ. 
It should contain at least: +- an overall non-technical description of the archive and the core elements of its architecture +- a set of howto/getting started pages on main subjects (search, browse, push code in the archive, retrieve code and artifacts from the archive, metadata) +- link to existing documentation on the main w.s.o. site as appropriate. + +KPIs: + +- list of minimum documented items +- number of available documented items + +High-level overview of available listers/loaders +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: anlambert +- tags: 2021, docs, sys-admin +- task: `T3117 `__ +- effort: 0.5 PM +- priority: High + +Includes work: + +Publish a web page (under docs.s.o somewhere) providing a high-level +overview of which listers/loaders are available (implemented, deployed, +running, etc.) with pointers to the corresponding +modules/implementations. + +KPIs: + +- online web page + +Technical Debt +-------------- + +Refactor swh-web code +^^^^^^^^^^^^^^^^^^^^^ + +- lead: anlambert +- tags: webapp, refactoring +- task: `T3949 `__ +- effort: 3 PM +- priority: Medium + +Includes work: + +Have a smaller, more modular code base + +- Split the public API code from the frontend code base +- Reduce code duplication (eg. 
between API and frontend) +- Move conversion utilities to swh-core + +KPIs: + +- separate repositories for frontend and web API + +New public API (GraphQL + thin layer) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: jayesh +- tags: api, refactoring +- task: `T4083 `__ +- effort: 4 PM +- priority: Medium + +Includes work: + +Provide a common unified (GraphQL-based) public API + +- Create a GraphQL-based API +- Integrate the existing API on top of GraphQL + +KPIs: + +- GraphQL API in production + +Organize 4+ short peer programming code-audit sprints +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: bchauvet +- tags: refactoring +- task: `T3956 `__ +- effort: 2.5 PM +- priority: n/a (one 2-day sprint every 2 months) + +Includes work: + +- Go through the entire codebase and identify dead code and changes + that should be made +- Correct identified issues or, failing that, document them with + dedicated tasks +- Identify one theme per sprint + +KPIs: + +- sprints done + +Organize 4+ sentry-cleaning sprints +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: bchauvet +- tags: project-management, monitoring +- task: `T3957 `__ +- effort: 2.5 PM +- priority: n/a (one 2-day sprint every 2 months) + +Includes work: + +We currently have a lot of `open Sentry issues `__, but this is very raw data that isn’t very usable or visible. They should be cleaned up so that under normal conditions, the number of reported issues stays “minimal”. 
+ +KPIs: + +- sprints done +- number of sentry issues (before/after) + +Tooling and Infrastructure +-------------------------- + +GitLab migration +^^^^^^^^^^^^^^^^ + +- lead: olasd +- tags: 2021 +- task: `T2225 `__ +- effort: 3 PM +- priority: Medium + +Includes work: + +- Review the current workflow for the migration +- Prepare new team workflows for some “sample” projects +- Drive the migration to completion + + - sysadmin projects migration (iteration #1) + - remaining projects migration (iteration #2) + +KPIs: + +- number of migrated projects +- phabricator switched to read-only + +Polish developer-facing CI automation +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: olasd +- tags: development environment, CI +- task: `T4084 `__ +- effort: 3 PM +- priority: Low + +Includes work: + +- more automation to keep all linting / testing tools (black, flake8, + tox, …) up to date and consistent +- CI support for multiple python versions (and possibly some dependency + versions) +- faster CI for diffs (e.g., consider use of + `testmon `__ to only run tests affected by + changes) +- investigation of more linters or flake8 plugins +- cypress performance (parallel testing) + +KPIs: + +- to be defined w/ task leader + +Continuous Deployment +^^^^^^^^^^^^^^^^^^^^^ + +- lead: vsellier +- task: `T2231 `__ +- tags: CI, CD, packaging +- effort: 6 PM +- priority: Low + +Includes work: + +Improve bug detection Validate the future elastic infrastructure +components + +- Migrate away from Debian packaging for deployment +- Build a docker image per deployable service +- Build the deployment tooling +- Reset and redeploy the stack after commits +- Execute acceptance tests + +KPIs: + +- operational CD platform +- CD integrated to gitlab + +Continuous Integration for sysadmin tools +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: vsellier +- tags: sysadmin, CI, tooling +- task: `T3834 `__ +- effort: 2 PM +- priority: Low + +Includes work: + +Add CI for sysadmin tasks: + +- puppet 
configuration +- vagrant projects +- terraform plans +- container (docker) image production + +Create sustainable plan for hardware provisioning/rotation +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: olasd +- tags: sysadmin, hardware +- task: `T3959 `__ +- effort: 0.5 PM +- priority: High + +Write a policy for hardware procurement with the following in mind: +- Make sure that we properly track our current pool of hardware, and its warranty status +- Make sure we don’t get surprised by lapsing warranties +- Make sure that we don’t end up having to renew a bunch of machines *all at once* +- Allow better budget forecasting + +KPIs: + +- shared documented policy + +Elastic loaders and listers +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: ardumont +- tags: sysadmin, performance, elasticity +- task: `T3592 `__ +- effort: 3 PM +- priority: High + +Includes work: + +- Deploy the listers and loaders in containers +- Deploy on a couple of bare metal servers (?) +- Easily adapt the load to the resources and the waiting tasks + +KPIs: + +- running elastic infrastructure in production for loaders and listers +- cluster / elastic workers monitoring (number of running workers, statsd, …) + + +Cassandra in production as primary storage +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: vsellier +- tags: 2021, storage, sysadmin +- task: `T2214 `__ +- effort: 3 PM +- priority: High + +Includes work: + +- Have the Cassandra storage in production as primary storage +- Set up equivalent MVP in staging + +KPIs: + +- Cassandra primary storage in production + +Scale-out objstorage in production as primary objstorage +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: olasd +- tags: 2021, objstorage, sysadmin +- task: `T3054 `__ +- effort: 2 PM +- priority: High + +Includes work: + +- Have the Ceph-based objstorage in production as primary storage +- Set up equivalent MVP in staging (maybe use the same Ceph cluster for this) + +KPIs: + +- Ceph-based 
objstorage in production + +Provenance in production +^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: douardda +- tags: 2021, provenance +- task: `T3112 `__ +- effort: 3 PM +- priority: High + +Includes work: + +Have the provenance index in production with less than a month of lag. +Set up equivalent MVP in staging. + +- produce documentation +- finalize revisions layer processing + - investigate/solve revisions performance issues +- process origins layer +- flatten directories +- production setup (deployment / scripts) +- implement a querying API + +KPIs: + +- revisions processed per second +- % of archive covered +- published documentation + +Graph compression in production +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: seirl +- tags: 2021, graph compression +- task: `T2220 `__ +- effort: 2 PM +- priority: High + +Includes work: + +- Have the graph compression pipeline running in production with less than a month of lag +- Deployment, hosting and pipeline tooling +- Handle the situation for staging + +KPIs: + +- graph compression pipeline in production +- last update date / number of updates per year + +Mirror tooling in production +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: douardda +- tags: 2021, mirror +- task: `T4085 `__ +- effort: 2 PM +- priority: High + +Includes work: + +- Document the setup, the administration and the maintenance of a mirror (sprint + maintenance) +- Handle the situation for staging +- Organize the mirror operators community + +KPIs: + +- mirror in staging +- organized community + +User support ticket system and process +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: bchauvet +- tags: support, user +- task: `T3730 `__ +- effort: 1 PM +- priority: Medium + +Includes work: + +- Create a user-facing ticket system to support user requests and bug reports + - e.g., a support@ address that automatically creates support tasks that we can triage and follow +- Define the process to: + - ensure some basic quality of service (e.g., time to first answer) + - make sure pending 
tasks are not forgotten. + +KPIs: + +- user support feature available on web ui + +Reliable user-level monitoring of services +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +- lead: vsellier +- tags: 2021, support, user +- task: `T3129 `__ +- effort: 1 PM +- priority: High + +Includes work: + +High-level view of which services are running or not, and integration +with status.softwareheritage.org + +KPIs: + +- Services dashboard in production