#+title: POC azure
#+author: ardumont
* Goal
Allow indexing our objstorage for further analysis (sloccount, mimetypes, languages, full-text search, snippets, etc.)
We want to:
- show the world the ideas behind swh -> iterate over the current datasets
- avoid stressing our current infrastructure (which is already busy with injection right now)
** Status
Data is currently stored in Azure through:
- 16 different accounts, each holding 1 blob storage (1 container)
This works around the 500 TB limit on a single blob storage per account.
Roughly 6M contents are stored so far.
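The routing of contents across the 16 accounts can be sketched as below. Note that the routing rule (first hex nibble of the content's sha1) and the account naming scheme are illustrative assumptions, not the actual configuration:

```python
# Hypothetical sharding of contents across the 16 Azure accounts.
# Both the routing rule and the account names are assumptions for
# illustration; the real deployment may differ.

NUM_ACCOUNTS = 16

def account_for(content_sha1: str, num_accounts: int = NUM_ACCOUNTS) -> str:
    """Map a content's hex sha1 to one of the blob storage accounts.

    With 16 accounts, the first hex nibble (0-f) selects the account
    directly, spreading contents uniformly if hashes are uniform.
    """
    shard = int(content_sha1[0], 16) % num_accounts
    return 'swhstorage{:02d}'.format(shard)  # hypothetical account name
```

Since sha1 output is uniformly distributed, each account should end up holding roughly 1/16th of the contents.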
** Storage
What can we use as storage?
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| Technology | Pros | Cons | Note |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.postgresql.org/][PostgreSQL]] | - team knowledge | - must be fluent in index optimization for | |
| | - well documented | write/read operations | |
| | - debian packaged | | |
| | - python bindings | | |
| | - PostgreSQL Licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://lucene.apache.org/solr/][Apache Solr]] | - python bindings (debian packaged) | More text search oriented | Solr is the popular, blazing-fast, open source enterprise search platform |
| | - server part debian packaged | | built on Apache Lucene. |
| | - Apache 2 licence | | |
| | - Community | | |
| | - SolrCloud for scalability (based on Apache Zookeeper) | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.elastic.co][Elastic Search]] | - analytical tool | - debian package seems off | Elasticsearch is a distributed, open source search and analytics |
| | - python bindings (debian packaged) | - java (-> oracle's jre?) | engine, designed for horizontal scalability, reliability, and easy management. |
| | - Apache 2 licence | - not for complex computations | |
|                | - More monitoring tools (like what?)                    | - Anyone can contribute but only employees can push |                                                                                |
| | | - Split brain issue | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://spark.apache.org/][Apache Spark]]   | - python bindings (pip3)                                | - no debian package                                | Apache Spark is a fast and general-purpose cluster computing system.           |
| | - complex computations | - java (-> oracle's jre?) | It provides high-level APIs in Java, Scala and Python, and an |
|                | - connectors for multiple backends (mongodb, ...)       |                                                    | optimized engine that supports general execution graphs.                       |
| | - Apache 2 licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.mongodb.com/][MongoDB]] | - debian packaged | | MongoDB is an open-source document database that provides high performance, |
| | - python bindings (python3-pymongo) | | high availability, and automatic scaling. |
| | - well documented (getting started per language, | | |
| | doc per job - developer, admin - etc...) | | |
| | - AGPL 3.0 / debian,rpm files, are apache 2 | | |
| | - connectors (apache spark) | | |
| | - horizontal scalability | | |
|                | - just works as advertised                              |                                                    |                                                                                |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
Sources:
- the projects' respective sites
- https://www.datanami.com/2015/01/22/solr-elasticsearch-question/
* Indexer
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Indexer | Tools | Note | Alternative |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Mimetype | file --mime-type | | |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Language | [[https://github.com/github/linguist][ruby-github-linguist]] | No detection without filename (not in stable)  | Implement a bayesian filter and make it learn languages (we should be able to use    |
|          | [[https://github.com/blackducksoftware/ohcount][ohcount]]     | No detection without filename                  | github's data for that) -> https://github.com/glaslos/langdog                        |
|          | [[http://pygments.org/][python3-pygments]]                    | No detection without filename                  |                                                                                      |
| | [[https://github.com/isagalaev/highlight.js][highlight-js]] | javascript library to learn how they do it | |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Ctags    | ctags-exuberant      | Does not detect without a filename (output     | use the `--language-force` flag once we have                                         |
|          |                      | varies with the filename across runs)          | something that can determine the language (imposes an order in the pipeline)         |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Snippet | ... | | |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
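The mimetype indexer can work on raw bytes alone, which matters here since the objstorage knows contents only by hash, never by filename. A minimal sketch piping content to ~file --mime-type~ follows; the helper name is ours, not existing swh code:

```python
import subprocess

def detect_mimetype(raw_content: bytes) -> str:
    """Return the mimetype of raw bytes by piping them to file(1).

    --brief drops the filename prefix from the output; '-' makes
    file read stdin, so detection is purely content-based and no
    filename is needed.
    """
    result = subprocess.run(
        ['file', '--brief', '--mime-type', '-'],
        input=raw_content,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode('utf-8').strip()
```

For example, ~detect_mimetype(b'hello, world\n')~ returns ~text/plain~. This makes mimetype a natural first stage of the pipeline: its output can gate the filename-dependent tools (linguist, ohcount, ctags) listed above.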