#+title: POC azure
#+author: ardumont
* Goal
Allow indexing our objstorage for further analysis (sloccount, mimetypes, languages, full-text search, snippets, etc.)
We want to:
- show the world the ideas behind swh -> iterate over the current datasets
- avoid stressing our current infrastructure (which is already busy with injection right now)
** Status
Data is currently stored in Azure through:
- 16 different accounts, each holding 1 blob storage (1 container)
This works around the 500 TB limit on a single blob storage per account.
Roughly 6M contents are stored so far.
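The routing of contents across the 16 accounts can be sketched as below. Note that the routing rule (first hex nibble of the content's sha1) and the account naming scheme are illustrative assumptions, not the actual configuration:

```python
# Hypothetical sharding of contents across the 16 Azure accounts.
# Both the routing rule and the account names are assumptions for
# illustration; the real deployment may differ.

NUM_ACCOUNTS = 16

def account_for(content_sha1: str, num_accounts: int = NUM_ACCOUNTS) -> str:
    """Map a content's hex sha1 to one of the blob storage accounts.

    With 16 accounts, the first hex nibble (0-f) selects the account
    directly, spreading contents uniformly if hashes are uniform.
    """
    shard = int(content_sha1[0], 16) % num_accounts
    return 'swhstorage{:02d}'.format(shard)  # hypothetical account name
```

Since sha1 output is uniformly distributed, each account should end up holding roughly 1/16th of the contents.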
** Storage
What can we use as storage?
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| Technology | Pros | Cons | Note |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.postgresql.org/][PostgreSQL]] | - team knowledge | - must be fluent in index optimization for | |
| | - well documented | write/read operations | |
| | - debian packaged | | |
| | - python bindings | | |
| | - PostgreSQL Licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://lucene.apache.org/solr/][Apache Solr]] | - python bindings (debian packaged) | More text search oriented | Solr is the popular, blazing-fast, open source enterprise search platform |
| | - server part debian packaged | | built on Apache Lucene. |
| | - Apache 2 licence | | |
| | - Community | | |
| | - SolrCloud for scalability (based on Apache Zookeeper) | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.elastic.co][Elastic Search]] | - analytical tool | - debian package seems off | Elasticsearch is a distributed, open source search and analytics |
| | - python bindings (debian packaged) | - java (-> oracle's jre?) | engine, designed for horizontal scalability, reliability, and easy management. |
| | - Apache 2 licence | - not for complex computations | |
|                | - More monitoring tools (like what?)                    | - Anyone can contribute but only employees can push |                                                                                |
| | | - Split brain issue | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://spark.apache.org/][Apache Spark]]   | - python bindings (pip3)                                | - no debian package                                | Apache Spark is a fast and general-purpose cluster computing system.           |
| | - complex computations | - java (-> oracle's jre?) | It provides high-level APIs in Java, Scala and Python, and an |
|                | - connectors for multiple backends (mongodb, ...)       |                                                    | optimized engine that supports general execution graphs.                       |
| | - Apache 2 licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.mongodb.com/][MongoDB]] | - debian packaged | | MongoDB is an open-source document database that provides high performance, |
| | - python bindings (python3-pymongo) | | high availability, and automatic scaling. |
| | - well documented (getting started per language, | | |
| | doc per job - developer, admin - etc...) | | |
| | - AGPL 3.0 / debian,rpm files, are apache 2 | | |
| | - connectors (apache spark) | | |
| | - horizontal scalability | | |
|                | - just works as advertised                              |                                                    |                                                                                |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
Sources:
- the projects' respective sites
- https://www.datanami.com/2015/01/22/solr-elasticsearch-question/
* Indexer
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Indexer | Tools | Note | Alternative |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Mimetype | file --mime-type | | |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Language | [[https://github.com/github/linguist][ruby-github-linguist]] | No detection without filename (not in stable)  | Implement a bayesian filter and make it learn languages (we should be able to use    |
|          | [[https://github.com/blackducksoftware/ohcount][ohcount]]     | No detection without filename                  | github's data for that) -> https://github.com/glaslos/langdog                        |
|          | [[http://pygments.org/][python3-pygments]]                    | No detection without filename                  |                                                                                      |
| | [[https://github.com/isagalaev/highlight.js][highlight-js]] | javascript library to learn how they do it | |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Ctags    | ctags-exuberant      | Does not detect without a filename (output     | use the `--language-force` flag once we have                                         |
|          |                      | varies with the filename across runs)          | something that can determine the language (imposes an order in the pipeline)         |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
| Snippet | ... | | |
|----------+----------------------+------------------------------------------------+--------------------------------------------------------------------------------------|
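The mimetype indexer can work on raw bytes alone, which matters here since the objstorage knows contents only by hash, never by filename. A minimal sketch piping content to ~file --mime-type~ follows; the helper name is ours, not existing swh code:

```python
import subprocess

def detect_mimetype(raw_content: bytes) -> str:
    """Return the mimetype of raw bytes by piping them to file(1).

    --brief drops the filename prefix from the output; '-' makes
    file read stdin, so detection is purely content-based and no
    filename is needed.
    """
    result = subprocess.run(
        ['file', '--brief', '--mime-type', '-'],
        input=raw_content,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode('utf-8').strip()
```

For example, ~detect_mimetype(b'hello, world\n')~ returns ~text/plain~. This makes mimetype a natural first stage of the pipeline: its output can gate the filename-dependent tools (linguist, ohcount, ctags) listed above.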