diff --git a/README.md b/README.md index f4f2481..56e255b 100644 --- a/README.md +++ b/README.md @@ -1,49 +1,42 @@ swh-indexer ============ Tools to compute multiple indexes on SWH's raw contents: - content: - mimetype - - ctags - - language - fossology-license - metadata -- revision: - - metadata +- origin: + - metadata (intrinsic, using the content indexer; and extrinsic) An indexer is in charge of: - looking up objects - extracting information from those objects - store those information in the swh-indexer db There are multiple indexers working on different object types: - content indexer: works with content sha1 hashes - revision indexer: works with revision sha1 hashes - origin indexer: works with origin identifiers Indexation procedure: - receive batch of ids - retrieve the associated data depending on object type - compute for that object some index - store the result to swh's storage Current content indexers: - mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype -- language (queue swh_indexer_content_language): detect the - programming language - -- ctags (queue swh_indexer_content_ctags): compute tags information - - fossology-license (queue swh_indexer_fossology_license): compute the license -- metadata: translate file into translated_metadata dict +- metadata: translate file from an ecosystem-specific formats to JSON-LD + (using schema.org/CodeMeta vocabulary) -Current revision indexers: +Current origin indexers: -- metadata: detects files containing metadata and retrieves translated_metadata - in content_metadata table in storage or run content indexer to translate - files. +- metadata: translate file from an ecosystem-specific formats to JSON-LD + (using schema.org/CodeMeta and ForgeFed vocabularies)