Paste P111

poc azure - Notes on technology

Authored by ardumont on Sep 27 2016, 12:41 AM.
#+title: POC azure
#+author: ardumont
* Goal
Permit the indexing of our objstorage for further analysis (sloccount, mime types, languages, full-text search, snippets, etc.).
We want to:
- show the world the ideas behind swh -> iterate over the current datasets
- avoid stressing our current infrastructure (which is already working pretty well on injection right now)
** Status
Data is currently stored in Azure storage through:
- 16 different accounts, each with 1 blob storage (1 container)
This works around the existing 500 TB limit on a single blob storage for one account.
Roughly 6M contents stored.
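The notes do not say how a given content is routed to one of the 16 accounts. A minimal sketch of one such routing, assuming we partition on the first hex digit of the content's sha1 (the scheme and names here are illustrative, not the actual swh-storage code):

#+BEGIN_SRC python
import hashlib

NUM_ACCOUNTS = 16  # one blob storage (one container) per account

def account_for(content: bytes) -> int:
    """Return the index of the account holding this content: the
    first hex digit of its sha1 gives a uniform 16-way partition."""
    first_hex_digit = hashlib.sha1(content).hexdigest()[0]
    return int(first_hex_digit, 16)  # 0..15
#+END_SRC

The mapping is deterministic, so no lookup table is needed: recomputing the hash is enough to find the right account.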
** Storage
What can we use as storage?
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| Technology | Pros | Cons | Note |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.postgresql.org/][PostgreSQL]] | - team knowledge | - must be fluent in index optimization for | |
| | - well documented | write/read operations | |
| | - debian packaged | | |
| | - python bindings | | |
| | - PostgreSQL Licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://lucene.apache.org/solr/][Apache Solr]] | - python bindings (debian packaged) | More text search oriented | Solr is the popular, blazing-fast, open source enterprise search platform |
| | - server part debian packaged | | built on Apache Lucene. |
| | - Apache 2 licence | | |
| | - Community | | |
| | - SolrCloud for scalability (based on Apache Zookeeper) | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.elastic.co][Elastic Search]] | - analytical tool | - debian package seems off | Elasticsearch is a distributed, open source search and analytics |
| | - python bindings (debian packaged) | - java (-> oracle's jre?) | engine, designed for horizontal scalability, reliability, and easy management. |
| | - Apache 2 licence | - not for complex computations | |
|                | - More monitoring tools (like what?)                    | - Anyone can contribute but only employees can push |                                                                                |
| | | - Split brain issue | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://spark.apache.org/][Apache Spark]] | - python bindings (pip3)                                | - no debian package                                | Apache Spark is a fast and general-purpose cluster computing system.           |
| | - complex computations | - java (-> oracle's jre?) | It provides high-level APIs in Java, Scala and Python, and an |
| | - connector for multiple backend (mongodb, ...) | | optimized engine that supports general execution graphs. |
| | - Apache 2 licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.mongodb.com/][MongoDB]] | - debian packaged | Atomicity per document | MongoDB is an open-source document database that provides high performance, |
|                | - python bindings (python3-pymongo)                     | 2-Phase commit as pattern (not enforced)           | high availability, and automatic scaling.                                      |
| | - well documented (getting started per language, | | |
| | doc per job - developer, admin - etc...) | | |
| | - AGPL 3.0 / debian,rpm files, are apache 2 | | |
| | - connectors (apache spark) | | |
| | - horizontal scalability | | |
|                | - just works as advertised                              |                                                    |                                                                                |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
sources:
Respective sites
https://www.datanami.com/2015/01/22/solr-elasticsearch-question/
* Indexer
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Indexer  | Tools                | Note                                                    | Alternative                                                                       |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Mimetype | file --mime-type     | Lots of stuff seems to be detected as text/plain        |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Language | [[https://github.com/github/linguist][ruby-github-linguist]] | Does not detect without a filename (also not in stable) | Implement a bayesian filter and make it learn languages (we should be able to use |
|          | [[https://github.com/blackducksoftware/ohcount][ohcount]]              | Does not detect without a filename                      | swh's content data for that) -> https://github.com/glaslos/langdog                |
|          | [[http://pygments.org/][python3-pygments]]     | Does not detect without a filename                      |                                                                                   |
|          | [[https://github.com/isagalaev/highlight.js][highlight-js]]         | javascript library to learn how they do it              |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Ctags    | ctags-exuberant      | Does not detect without a filename (multiple runs       | use flag `--language-force` with the language detected at the previous step       |
|          |                      | change the output depending on the filename's           | (-> imposes an order in the pipeline)                                             |
|          |                      | extension)                                              |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Snippet  | ...                  |                                                         |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
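The table implies an ordering constraint: ctags needs the detected language (via --language-force), and language detection only makes sense on text contents, so mimetype must come first. A sketch of the resulting pipeline, with the detector callables left as stand-ins for file(1), pygments and ctags-exuberant (names are illustrative, not swh code):

#+BEGIN_SRC python
def index_content(raw, detect_mimetype, detect_language, run_ctags):
    """Run the indexers in dependency order: mimetype first, then
    language (text contents only), then ctags forced to that
    language (--language-force)."""
    result = {'mimetype': detect_mimetype(raw)}
    if result['mimetype'].startswith('text/'):
        result['language'] = detect_language(raw)
        result['ctags'] = run_ctags(raw, language=result['language'])
    return result
#+END_SRC

Non-text contents short-circuit the pipeline, which avoids running the costly detectors on binaries.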
* MongoDB
- A document is a record (~ row)
- A collection is a set of documents (~ table)
** Atomicity
- Atomicity is per document (which can contain other documents)
- Use the $isolated operator for transaction-like semantics (on multiple documents)
- Does not work on sharded clusters.
- Not an all-or-nothing commit: a rollback won't roll back everything (???)
- 2-phase commits
- To avoid concurrency issues:
- create an index with a unique key
- indicate the original value in the predicate of the write query
- use 2-phase commits
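A minimal sketch of the 2-phase commit pattern from the MongoDB docs, with plain dicts standing in for the accounts and transactions collections (illustrative only; a real implementation must advance the state with compare-and-swap writes so a recovery job can resume or undo a crashed transfer):

#+BEGIN_SRC python
def transfer(accounts, txns, txn_id, src, dst, amount):
    # Phase 0: record the intent in the transactions collection.
    txns[txn_id] = {'src': src, 'dst': dst, 'amount': amount,
                    'state': 'initial'}
    # Phase 1: mark pending, then apply the writes; a recovery job
    # finding a 'pending' transaction knows which documents to check.
    txns[txn_id]['state'] = 'pending'
    accounts[src] -= amount
    accounts[dst] += amount
    # Phase 2: mark applied, then done; from 'applied' onward the
    # transfer is considered committed.
    txns[txn_id]['state'] = 'applied'
    txns[txn_id]['state'] = 'done'
#+END_SRC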
** Properties
Read uncommitted -> other clients can see the results of an in-progress write, even before it is committed.
** Indexes
* Bayesian filter
List of keywords per language: http://www.ultraedit.com/downloads/extras/wordfiles.html
Repository: https://github.com/IDMComputerSolutions/wordfiles under MIT license
Used to make the Language detector learn languages
** Data set
#+BEGIN_SRC sh
$ git clone https://github.com/IDMComputerSolutions/wordfiles
$ cd wordfiles
# remove all lines starting with /
$ for f in *; do echo "$f"; grep -v '^/' "$f" | sponge "$f"; done
#+END_SRC
Now we have a playground for learning languages.
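A minimal naive-Bayes sketch of the detector this section proposes: learn token counts per language from the wordfiles, then classify a content by summing smoothed log-probabilities of its tokens (the class and tokenizer here are illustrative, not the actual implementation):

#+BEGIN_SRC python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

class LanguageClassifier:
    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> token counts
        self.totals = Counter()             # language -> total nb of tokens

    def learn(self, language, text):
        """Feed one wordfile (or known content) for a language."""
        tokens = TOKEN_RE.findall(text)
        self.counts[language].update(tokens)
        self.totals[language] += len(tokens)

    def classify(self, text):
        """Return the language with the highest summed log-probability,
        with Laplace smoothing so unseen tokens don't zero a score."""
        tokens = TOKEN_RE.findall(text)
        best, best_score = None, float('-inf')
        for lang in self.counts:
            vocab = len(self.counts[lang])
            score = sum(
                math.log((self.counts[lang][t] + 1) /
                         (self.totals[lang] + vocab + 1))
                for t in tokens)
            if score > best_score:
                best, best_score = lang, score
        return best
#+END_SRC

Training on swh's own contents (with the filename-derived language as label) could then refine the keyword-only priors.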
** Status
Parsing is too basic in the current implementation; we need to work on a better tokenizer, since some languages (e.g. lisp) are wrongly parsed.
For now, using python3-pygments instead of this.
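On the tokenizer issue: lisp breaks a word-character tokenizer because symbols like string->list or let* contain punctuation. One possible fix is to treat any run of non-whitespace, non-bracket characters as a single token, so lisp symbols survive while parentheses still delimit (a sketch, not the current implementation):

#+BEGIN_SRC python
import re

# One token = either a run of chars that are not whitespace, brackets
# or quotes, or a single bracket character.
LISPY_TOKEN_RE = re.compile(r"[^\s()\[\]{}'\"]+|[()\[\]{}]")

def tokenize(text):
    return LISPY_TOKEN_RE.findall(text)
#+END_SRC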
