#+title: POC azure
#+author: ardumont

* Goal

Allow the indexing of our objstorage for further analysis (sloccount,
mimetypes, languages, full-text search, snippets, etc.).

We want to:
- show the world the ideas behind swh -> iterate over the current datasets
- avoid stressing our current infrastructure (which is already working
  pretty well on injection right now)

** Status

Data is currently stored in Azure through 16 different accounts, each
with 1 blob storage (1 container). This works around the existing limit
of 500TB of storage per blob storage per account.

Roughly 6M contents are stored.

** Storage

What can we use as storage?

|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| Technology     | Pros                                               | Cons                                         | Note                                                           |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[https://www.postgresql.org/][PostgreSQL]]     | - team knowledge                                   | - must be fluent in index optimization       |                                                                |
|                | - well documented                                  |   for write/read operations                  |                                                                |
|                | - debian packaged                                  |                                              |                                                                |
|                | - python bindings                                  |                                              |                                                                |
|                | - PostgreSQL License                               |                                              |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[http://lucene.apache.org/solr/][Apache Solr]]    | - python bindings (debian packaged)                | - more text-search oriented                  | "Solr is the popular, blazing-fast, open source enterprise     |
|                | - server part debian packaged                      |                                              | search platform built on Apache Lucene."                       |
|                | - Apache 2 license                                 |                                              |                                                                |
|                | - community                                        |                                              |                                                                |
|                | - SolrCloud for scalability (based on              |                                              |                                                                |
|                |   Apache Zookeeper)                                |                                              |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[https://www.elastic.co][Elastic Search]] | - analytical tool                                  | - debian package seems off                   | "Elasticsearch is a distributed, open source search and        |
|                | - python bindings (debian packaged)                | - java (-> oracle's jre?)                    | analytics engine, designed for horizontal scalability,         |
|                | - Apache 2 license                                 | - not for complex computations               | reliability, and easy management."                             |
|                | - more monitoring tools (like what?)               | - anyone can contribute but only             |                                                                |
|                |                                                    |   employees can push                         |                                                                |
|                |                                                    | - split-brain issue                          |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[http://spark.apache.org/][Apache Spark]]   | - python bindings (pip3)                           | - no debian package                          | "Apache Spark is a fast and general-purpose cluster computing  |
|                | - complex computations                             | - java (-> oracle's jre?)                    | system. It provides high-level APIs in Java, Scala and Python, |
|                | - connectors for multiple backends (mongodb, ...)  |                                              | and an optimized engine that supports general execution        |
|                | - Apache 2 license                                 |                                              | graphs."                                                       |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|
| [[https://www.mongodb.com/][MongoDB]]        | - debian packaged                                  | - atomicity per document only                | "MongoDB is an open-source document database that provides     |
|                | - python bindings (python3-pymongo)                | - 2-phase commit as a pattern (not enforced) | high performance, high availability, and automatic scaling."   |
|                | - well documented (getting started per language,   |                                              |                                                                |
|                |   doc per job - developer, admin - etc.)           |                                              |                                                                |
|                | - AGPL 3.0 (debian/rpm packaging is Apache 2)      |                                              |                                                                |
|                | - connectors (apache spark)                        |                                              |                                                                |
|                | - horizontal scalability                           |                                              |                                                                |
|                | - just works as advertised                         |                                              |                                                                |
|----------------+----------------------------------------------------+----------------------------------------------+----------------------------------------------------------------|

Sources: the respective project sites and
https://www.datanami.com/2015/01/22/solr-elasticsearch-question/

* Indexer

|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Indexer  | Tools                | Note                                                | Alternative                                               |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Mimetype | file --mime-type     | Lots of content is detected as text/plain           |                                                           |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Language | [[https://github.com/github/linguist][ruby-github-linguist]] | does not detect without a filename (also not        | Implement a bayesian filter and make it learn languages   |
|          |                      | in stable)                                          | (we should be able to use swh's content data for that)    |
|          | [[https://github.com/blackducksoftware/ohcount][ohcount]]              | does not detect without a filename                  | -> https://github.com/glaslos/langdog                     |
|          | [[http://pygments.org/][python3-pygments]]     | does not detect without a filename                  |                                                           |
|          | [[https://github.com/isagalaev/highlight.js][highlight-js]]         | javascript library, to learn how they do it         |                                                           |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Ctags    | ctags-exuberant      | does not detect without a filename (multiple runs   | use the `--language-force` flag with the language         |
|          |                      | change the output depending on the filename's       | detected at the previous step (-> imposes an order in     |
|          |                      | extension)                                          | the pipeline; see the sketch below)                       |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
| Snippet  | ...                  |                                                     |                                                           |
|----------+----------------------+-----------------------------------------------------+-----------------------------------------------------------|
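A rough sketch, in python, of what that imposed ordering could look like,
shelling out to file(1) and ctags-exuberant. The =language= function is a
stand-in for whichever detector we settle on, and the path is hypothetical:

#+BEGIN_SRC python
import subprocess

def mimetype(path):
    """Detect the mimetype of a file without relying on its name."""
    out = subprocess.check_output(['file', '--brief', '--mime-type', path])
    return out.decode().strip()

def language(path):
    """Stand-in detector: plug in pygments, linguist or the bayesian
    filter here; it always answers Python for the example's sake."""
    return 'Python'

def ctags(path, lang):
    """Run ctags with the language forced, so that the filename's
    extension no longer influences the output; '-f -' prints the
    tags on stdout."""
    out = subprocess.check_output(
        ['ctags-exuberant', '--language-force=' + lang, '-f', '-', path])
    return out.decode()

path = 'some-content'  # hypothetical blob dumped to disk
if mimetype(path).startswith('text/'):
    print(ctags(path, language(path)))
#+END_SRC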
* MongoDB

- A document is a record (~ a row)
- A collection is a set of documents (~ a table)

** Atomicity

- Atomicity is guaranteed per document (which can itself contain other
  documents)
- The $isolated operator gives transaction-like behavior over multiple
  documents:
  - it does not work on sharded clusters
  - it is not an all-or-nothing commit: if the operation fails mid-way,
    the documents already written are not rolled back
- Two-phase commits, to avoid concurrency issues:
  - create an index with a unique key
  - indicate the original value in the predicate of the write query
  - use two-phase commits (see the sketch below)
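A minimal sketch of the unique-index and conditional-write parts of that
pattern with python3-pymongo; the =swh.contents= collection and its fields
are made up for the example:

#+BEGIN_SRC python
from pymongo import MongoClient, errors

client = MongoClient('localhost', 27017)
contents = client.swh.contents  # hypothetical database and collection

# A unique index makes concurrent inserts of the same content fail
# loudly instead of duplicating it.
contents.create_index('sha1', unique=True)

sha1 = '34973274ccef6ab4dfaaf86599792fa9c3fe4689'
try:
    contents.insert_one({'sha1': sha1, 'mimetype': None})
except errors.DuplicateKeyError:
    pass  # another worker inserted it first

# Conditional write: the predicate states the value we expect to
# overwrite, so a concurrent modification turns the update into a
# no-op (matched_count == 0) instead of silently clobbering it.
result = contents.update_one({'sha1': sha1, 'mimetype': None},
                             {'$set': {'mimetype': 'text/plain'}})
if result.matched_count == 0:
    print('lost the race; retry or skip')
#+END_SRC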
** Properties

Read uncommitted -> other clients can see the result of a write in
progress, even before it is committed.

** Indexes

* Bayesian filter

List of keywords per language:
http://www.ultraedit.com/downloads/extras/wordfiles.html

Repository: https://github.com/IDMComputerSolutions/wordfiles (MIT license)

Used to teach the language detector its languages.

** Data set

#+BEGIN_SRC sh
$ git clone https://github.com/IDMComputerSolutions/wordfiles
$ cd wordfiles
# remove all lines starting with a / (sponge comes from moreutils)
$ for f in *; do echo $f; grep -v '^/' $f | sponge $f; done
#+END_SRC

Now we have a playground to learn languages with.

** Status

The parsing in the current implementation is too basic; it needs a better
tokenizer, since some languages (e.g. lisp) are wrongly parsed.

For now, python3-pygments is used instead.
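For reference, a minimal sketch of filename-less detection with
python3-pygments; the sample content is made up:

#+BEGIN_SRC python
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

content = '(defun fact (n) (if (< n 2) 1 (* n (fact (- n 1)))))'

try:
    # guess_lexer scores the content against every known lexer,
    # no filename needed
    print(guess_lexer(content).name)  # e.g. "Common Lisp"
except ClassNotFound:
    print('unknown language')
#+END_SRC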