#+title: POC azure
#+author: ardumont

* Goal

Allow the indexing of our objstorage for further analysis (sloccount,
mimetypes, languages, full-text search, snippets, etc.).

We want to:
- show the world the ideas behind swh -> iterate over the current datasets
- avoid stressing our current infrastructure (which is already working
  pretty well on injection right now)

** Status

Data is currently stored in Azure through 16 different accounts, each
with 1 blob storage (1 container). This works around the existing limit
of 500TB of storage per blob storage per account.

Roughly 6M contents are stored.

** Storage

What can we use as storage?

|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| Technology     | Pros                                               | Cons                                         | Note                                                           |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[https://www.postgresql.org/][PostgreSQL]]     | - team knowledge                                   | - must be fluent in index optimization       |                                                                |
|                | - well documented                                  |   for write/read operations                  |                                                                |
|                | - debian packaged                                  |                                              |                                                                |
|                | - python bindings                                  |                                              |                                                                |
|                | - PostgreSQL License                               |                                              |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[http://lucene.apache.org/solr/][Apache Solr]]    | - python bindings (debian packaged)                | - more text-search oriented                  | "Solr is the popular, blazing-fast, open source enterprise     |
|                | - server part debian packaged                      |                                              | search platform built on Apache Lucene."                       |
|                | - Apache 2 license                                 |                                              |                                                                |
|                | - community                                        |                                              |                                                                |
|                | - SolrCloud for scalability (based on              |                                              |                                                                |
|                |   Apache Zookeeper)                                |                                              |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[https://www.elastic.co][Elastic Search]] | - analytical tool                                  | - debian package seems off                   | "Elasticsearch is a distributed, open source search and        |
|                | - python bindings (debian packaged)                | - java (-> oracle's jre?)                    | analytics engine, designed for horizontal scalability,         |
|                | - Apache 2 license                                 | - not for complex computations               | reliability, and easy management."                             |
|                | - more monitoring tools (like what?)               | - anyone can contribute but only             |                                                                |
|                |                                                    |   employees can push                         |                                                                |
|                |                                                    | - split-brain issue                          |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[http://spark.apache.org/][Apache Spark]]   | - python bindings (pip3)                           | - no debian package                          | "Apache Spark is a fast and general-purpose cluster computing  |
|                | - complex computations                             | - java (-> oracle's jre?)                    | system. It provides high-level APIs in Java, Scala and Python, |
|                | - connectors for multiple backends (mongodb, ...)  |                                              | and an optimized engine that supports general execution        |
|                | - Apache 2 license                                 |                                              | graphs."                                                       |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[https://www.mongodb.com/][MongoDB]]        | - debian packaged                                  | - atomicity per document only                | "MongoDB is an open-source document database that provides     |
|                | - python bindings (python3-pymongo)                | - 2-phase commit as a pattern (not enforced) | high performance, high availability, and automatic scaling."   |
|                | - well documented (getting started per language,   |                                              |                                                                |
|                |   doc per job - developer, admin - etc.)           |                                              |                                                                |
|                | - AGPL 3.0 (debian/rpm packaging is Apache 2)      |                                              |                                                                |
|                | - connectors (apache spark)                        |                                              |                                                                |
|                | - horizontal scalability                           |                                              |                                                                |
|                | - just works as advertised                         |                                              |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|

Sources: the respective project sites and
https://www.datanami.com/2015/01/22/solr-elasticsearch-question/

* Indexer

|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Indexer  | Tools                | Note                                                | Alternative                                               |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Mimetype | file --mime-type     | Lots of content is detected as text/plain           |                                                           |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Language | [[https://github.com/github/linguist][ruby-github-linguist]] | does not detect without a filename (also not        | Implement a bayesian filter and make it learn languages   |
|          |                      | in stable)                                          | (we should be able to use swh's content data for that)    |
|          | [[https://github.com/blackducksoftware/ohcount][ohcount]]              | does not detect without a filename                  | -> https://github.com/glaslos/langdog                     |
|          | [[http://pygments.org/][python3-pygments]]     | does not detect without a filename                  |                                                           |
|          | [[https://github.com/isagalaev/highlight.js][highlight-js]]         | javascript library, to learn how they do it         |                                                           |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Ctags    | ctags-exuberant      | does not detect without a filename (multiple runs   | use the `--language-force` flag with the language         |
|          |                      | change the output depending on the filename's       | detected at the previous step (-> imposes an order in     |
|          |                      | extension)                                          | the pipeline; see the sketch below)                       |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Snippet  | ...                  |                                                     |                                                           |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
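A rough sketch, in python, of what that imposed ordering could look like,
shelling out to file(1) and ctags-exuberant. The =language= function is a
stand-in for whichever detector we settle on, and the path is hypothetical:

#+BEGIN_SRC python
import subprocess

def mimetype(path):
    """Detect the mimetype of a file without relying on its name."""
    out = subprocess.check_output(['file', '--brief', '--mime-type', path])
    return out.decode().strip()

def language(path):
    """Stand-in detector: plug in pygments, linguist or the bayesian
    filter here; it always answers Python for the example's sake."""
    return 'Python'

def ctags(path, lang):
    """Run ctags with the language forced, so that the filename's
    extension no longer influences the output; '-f -' prints the
    tags on stdout."""
    out = subprocess.check_output(
        ['ctags-exuberant', '--language-force=' + lang, '-f', '-', path])
    return out.decode()

path = 'some-content'  # hypothetical blob dumped to disk
if mimetype(path).startswith('text/'):
    print(ctags(path, language(path)))
#+END_SRC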
* MongoDB

- A document is a record (~ a row)
- A collection is a set of documents (~ a table)

** Atomicity

- Atomicity is guaranteed per document (which can itself contain other
  documents)
- The $isolated operator gives transaction-like behavior over multiple
  documents:
  - it does not work on sharded clusters
  - it is not an all-or-nothing commit: if the operation fails mid-way,
    the documents already written are not rolled back
- Two-phase commits, to avoid concurrency issues:
  - create an index with a unique key
  - indicate the original value in the predicate of the write query
  - use two-phase commits (see the sketch below)
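A minimal sketch of the unique-index and conditional-write parts of that
pattern with python3-pymongo; the =swh.contents= collection and its fields
are made up for the example:

#+BEGIN_SRC python
from pymongo import MongoClient, errors

client = MongoClient('localhost', 27017)
contents = client.swh.contents  # hypothetical database and collection

# A unique index makes concurrent inserts of the same content fail
# loudly instead of duplicating it.
contents.create_index('sha1', unique=True)

sha1 = '34973274ccef6ab4dfaaf86599792fa9c3fe4689'
try:
    contents.insert_one({'sha1': sha1, 'mimetype': None})
except errors.DuplicateKeyError:
    pass  # another worker inserted it first

# Conditional write: the predicate states the value we expect to
# overwrite, so a concurrent modification turns the update into a
# no-op (matched_count == 0) instead of silently clobbering it.
result = contents.update_one({'sha1': sha1, 'mimetype': None},
                             {'$set': {'mimetype': 'text/plain'}})
if result.matched_count == 0:
    print('lost the race; retry or skip')
#+END_SRC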
** Properties

Read uncommitted -> other clients can see the result of a write in
progress, even before it is committed.

** Indexes

* Bayesian filter

List of keywords per language:
http://www.ultraedit.com/downloads/extras/wordfiles.html

Repository: https://github.com/IDMComputerSolutions/wordfiles (MIT license)

Used to teach the language detector its languages.

** Data set

#+BEGIN_SRC sh
$ git clone https://github.com/IDMComputerSolutions/wordfiles
$ cd wordfiles
# remove all lines starting with a / (sponge comes from moreutils)
$ for f in *; do echo $f; grep -v '^/' $f | sponge $f; done
#+END_SRC

Now we have a playground to learn languages with.

** Status

The parsing in the current implementation is too basic; it needs a better
tokenizer, since some languages (e.g. lisp) are wrongly parsed.

For now, python3-pygments is used instead.
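For reference, a minimal sketch of filename-less detection with
python3-pygments; the sample content is made up:

#+BEGIN_SRC python
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

content = '(defun fact (n) (if (< n 2) 1 (* n (fact (- n 1)))))'

try:
    # guess_lexer scores the content against every known lexer,
    # no filename needed
    print(guess_lexer(content).name)  # e.g. "Common Lisp"
except ClassNotFound:
    print('unknown language')
#+END_SRC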