The swh.indexer project is in charge of this.
As explained through email:
- A producer sends batches of contents (read from the Azure storages), fetches the corresponding filenames from the swh storage, and sends each content (sha1, one random filename) to the next queue
- The 1st queue reads the raw content for each sha1, enriches the content with it, and sends it to the next queue
- The 2nd queue computes the mimetype and encoding, enriches the content, saves the result in storage, and sends the enriched content to the next queue
- The 3rd queue computes the programming language, enriches the content, saves the result, and sends it to the next queue
- The 4th queue computes the ctags, enriches the content, saves the result, and stops (for now)
- etc...
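To make the chaining above concrete, here is a minimal in-memory sketch. It is not the swh.indexer code: every name in it (OBJECTS, RESULTS, add_object, produce, the stage functions, run) is hypothetical, plain dicts stand in for the object storage and the indexer storage, and the mimetype, language and ctags computations are toy stand-ins for whatever tools the real workers call.

```python
import hashlib
import mimetypes
import os
import random

# Hypothetical in-memory stand-ins, not the real storages.
OBJECTS = {}  # sha1 -> raw bytes (plays the role of the object storage)
RESULTS = {}  # sha1 -> indexed metadata (plays the role of the indexer storage)


def add_object(data):
    """Register raw bytes in the fake object storage and return their sha1."""
    sha1 = hashlib.sha1(data).hexdigest()
    OBJECTS[sha1] = data
    return sha1


def save(content):
    """Save the enriched content, leaving out the raw data."""
    RESULTS[content["sha1"]] = {k: v for k, v in content.items() if k != "data"}


def produce(sha1s, filenames):
    """Producer: pair each sha1 with one randomly chosen filename."""
    return [{"sha1": s, "name": random.choice(filenames[s])} for s in sha1s]


def read_content(content):
    """1st queue: enrich the content with its raw bytes."""
    return {**content, "data": OBJECTS[content["sha1"]]}


def compute_mimetype(content):
    """2nd queue: guess mimetype and encoding (toy heuristics), then save."""
    mimetype, _ = mimetypes.guess_type(content["name"])
    try:
        content["data"].decode("utf-8")
        encoding = "utf-8"
    except UnicodeDecodeError:
        encoding = "binary"
    enriched = {**content, "mimetype": mimetype, "encoding": encoding}
    save(enriched)
    return enriched


# Toy extension table; a real worker would use a proper language detector.
LANGUAGES = {".py": "python", ".c": "c"}


def compute_language(content):
    """3rd queue: guess the programming language, then save."""
    extension = os.path.splitext(content["name"])[1]
    enriched = {**content, "language": LANGUAGES.get(extension)}
    save(enriched)
    return enriched


def compute_ctags(content):
    """4th queue: toy ctags (top-level Python 'def' lines only), then save."""
    tags = []
    if content["encoding"] != "binary":
        text = content["data"].decode("utf-8")
        for number, line in enumerate(text.splitlines(), 1):
            if line.startswith("def "):
                tags.append({"name": line[4:].split("(")[0], "line": number})
    enriched = {**content, "ctags": tags}
    save(enriched)
    return enriched


STAGES = [read_content, compute_mimetype, compute_language, compute_ctags]


def run(batch):
    """Push a batch through every stage in order, as the queues would."""
    for stage in STAGES:
        batch = [stage(content) for content in batch]
    return batch
```

Each stage takes the enriched content from the previous one and returns a new dict, so adding a 5th queue only means appending one more function to STAGES. For example, `run(produce([sha1], {sha1: ["hello.py"]}))` after `sha1 = add_object(b"def hello(): pass")` leaves an entry in RESULTS with the mimetype, encoding, language and ctags keys but no raw data.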
A proof of concept (POC) has been developed and deployed on worker01.euwest.azure.softwareheritage.org
Note:
- The storage I keep referring to is MongoDB, and it runs on the same Azure node I run the tests on (so I need to deploy it elsewhere)
- The raw content is not stored in the storage
- The storage is write-only for now, but at some point it will also be used for reading (typically for full-text search on ctags, from the webapp through a new API endpoint)