Paste P111

poc azure - Notes on technology

Authored by ardumont on Sep 27 2016, 12:41 AM.
#+title: POC azure
#+author: ardumont
* Goal
Permit the indexing of our objstorage for further analysis (sloccount, mime types, languages, full-text search, snippets, etc.).
We want to:
- show the world the ideas behind swh -> iterate over the current datasets
- avoid stressing our current infrastructure (which is already working pretty well on injection right now)
** Status
Data is currently stored in Azure storage through:
- 16 different accounts, each with 1 blob storage (1 container)
This works around the existing 500 TB limit on a single blob storage for one account.
Roughly 6M contents stored.
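The notes do not say how a given content is routed to one of the 16 accounts. A minimal sketch of one such routing, assuming we partition on the first hex digit of the content's sha1 (the scheme and names here are illustrative, not the actual swh-storage code):

#+BEGIN_SRC python
import hashlib

NUM_ACCOUNTS = 16  # one blob storage (one container) per account

def account_for(content: bytes) -> int:
    """Return the index of the account holding this content: the
    first hex digit of its sha1 gives a uniform 16-way partition."""
    first_hex_digit = hashlib.sha1(content).hexdigest()[0]
    return int(first_hex_digit, 16)  # 0..15
#+END_SRC

The mapping is deterministic, so no lookup table is needed: recomputing the hash is enough to find the right account.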
** Storage
What can we use as storage?
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| Technology | Pros | Cons | Note |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.postgresql.org/][PostgreSQL]] | - team knowledge | - must be fluent in index optimization for | |
| | - well documented | write/read operations | |
| | - debian packaged | | |
| | - python bindings | | |
| | - PostgreSQL Licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://lucene.apache.org/solr/][Apache Solr]] | - python bindings (debian packaged) | More text search oriented | Solr is the popular, blazing-fast, open source enterprise search platform |
| | - server part debian packaged | | built on Apache Lucene. |
| | - Apache 2 licence | | |
| | - Community | | |
| | - SolrCloud for scalability (based on Apache Zookeeper) | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.elastic.co][Elastic Search]] | - analytical tool | - debian package seems off | Elasticsearch is a distributed, open source search and analytics |
| | - python bindings (debian packaged) | - java (-> oracle's jre?) | engine, designed for horizontal scalability, reliability, and easy management. |
| | - Apache 2 licence | - not for complex computations | |
|                | - More monitoring tools (like what?)                    | - Anyone can contribute but only employees can push |                                                                                |
| | | - Split brain issue | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[http://spark.apache.org/][Apache Spark]] | - python bindings (pip3)                                | - no debian package                                | Apache Spark is a fast and general-purpose cluster computing system.           |
| | - complex computations | - java (-> oracle's jre?) | It provides high-level APIs in Java, Scala and Python, and an |
| | - connector for multiple backend (mongodb, ...) | | optimized engine that supports general execution graphs. |
| | - Apache 2 licence | | |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
| [[https://www.mongodb.com/][MongoDB]] | - debian packaged | Atomicity per document | MongoDB is an open-source document database that provides high performance, |
|                | - python bindings (python3-pymongo)                     | 2-Phase commit as pattern (not enforced)           | high availability, and automatic scaling.                                      |
| | - well documented (getting started per language, | | |
| | doc per job - developer, admin - etc...) | | |
| | - AGPL 3.0 / debian,rpm files, are apache 2 | | |
| | - connectors (apache spark) | | |
| | - horizontal scalability | | |
|                | - just works as advertised                              |                                                    |                                                                                |
|----------------+---------------------------------------------------------+----------------------------------------------------+--------------------------------------------------------------------------------|
sources:
Respective sites
https://www.datanami.com/2015/01/22/solr-elasticsearch-question/
* Indexer
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Indexer  | Tools                | Note                                                    | Alternative                                                                       |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Mimetype | file --mime-type     | Lots of stuff seems to be detected as text/plain        |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Language | [[https://github.com/github/linguist][ruby-github-linguist]] | Does not detect without a filename (also not in stable) | Implement a bayesian filter and make it learn languages (we should be able to use |
|          | [[https://github.com/blackducksoftware/ohcount][ohcount]]              | Does not detect without a filename                      | swh's content data for that) -> https://github.com/glaslos/langdog                |
|          | [[http://pygments.org/][python3-pygments]]     | Does not detect without a filename                      |                                                                                   |
|          | [[https://github.com/isagalaev/highlight.js][highlight-js]]         | javascript library to learn how they do it              |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Ctags    | ctags-exuberant      | Does not detect without a filename (multiple runs       | use flag `--language-force` with the language detected at the previous step       |
|          |                      | change the output depending on the filename's           | (-> imposes an order in the pipeline)                                             |
|          |                      | extension)                                              |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
| Snippet  | ...                  |                                                         |                                                                                   |
|----------+----------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------|
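The table implies an ordering constraint: ctags needs the detected language (via --language-force), and language detection only makes sense on text contents, so mimetype must come first. A sketch of the resulting pipeline, with the detector callables left as stand-ins for file(1), pygments and ctags-exuberant (names are illustrative, not swh code):

#+BEGIN_SRC python
def index_content(raw, detect_mimetype, detect_language, run_ctags):
    """Run the indexers in dependency order: mimetype first, then
    language (text contents only), then ctags forced to that
    language (--language-force)."""
    result = {'mimetype': detect_mimetype(raw)}
    if result['mimetype'].startswith('text/'):
        result['language'] = detect_language(raw)
        result['ctags'] = run_ctags(raw, language=result['language'])
    return result
#+END_SRC

Non-text contents short-circuit the pipeline, which avoids running the costly detectors on binaries.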
* MongoDB
- A document is a record (~ row)
- A collection is a set of documents (~ table)
** Atomicity
- Atomicity is per document (which can contain other documents)
- Use the $isolated operator for transaction-like semantics (on multiple documents)
- Does not work on sharded clusters.
- Not an all-or-nothing commit: a rollback won't roll back everything (???)
- 2-phase commits
- To avoid concurrency issues:
- create an index with a unique key
- indicate the original value in the predicate of the write query
- use 2-phase commits
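A minimal sketch of the 2-phase commit pattern from the MongoDB docs, with plain dicts standing in for the accounts and transactions collections (illustrative only; a real implementation must advance the state with compare-and-swap writes so a recovery job can resume or undo a crashed transfer):

#+BEGIN_SRC python
def transfer(accounts, txns, txn_id, src, dst, amount):
    # Phase 0: record the intent in the transactions collection.
    txns[txn_id] = {'src': src, 'dst': dst, 'amount': amount,
                    'state': 'initial'}
    # Phase 1: mark pending, then apply the writes; a recovery job
    # finding a 'pending' transaction knows which documents to check.
    txns[txn_id]['state'] = 'pending'
    accounts[src] -= amount
    accounts[dst] += amount
    # Phase 2: mark applied, then done; from 'applied' onward the
    # transfer is considered committed.
    txns[txn_id]['state'] = 'applied'
    txns[txn_id]['state'] = 'done'
#+END_SRC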
** Properties
Read uncommitted -> other clients can see the results of an in-progress write, even before it is committed.
** Indexes
* Bayesian filter
List of keywords per language: http://www.ultraedit.com/downloads/extras/wordfiles.html
Repository: https://github.com/IDMComputerSolutions/wordfiles under MIT license
Used to make the Language detector learn languages
** Data set
#+BEGIN_SRC sh
$ git clone https://github.com/IDMComputerSolutions/wordfiles
$ cd wordfiles
# remove all lines starting with /
$ for f in *; do echo "$f"; grep -v '^/' "$f" | sponge "$f"; done
#+END_SRC
Now we have a playground for learning languages.
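A minimal naive-Bayes sketch of the detector this section proposes: learn token counts per language from the wordfiles, then classify a content by summing smoothed log-probabilities of its tokens (the class and tokenizer here are illustrative, not the actual implementation):

#+BEGIN_SRC python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

class LanguageClassifier:
    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> token counts
        self.totals = Counter()             # language -> total nb of tokens

    def learn(self, language, text):
        """Feed one wordfile (or known content) for a language."""
        tokens = TOKEN_RE.findall(text)
        self.counts[language].update(tokens)
        self.totals[language] += len(tokens)

    def classify(self, text):
        """Return the language with the highest summed log-probability,
        with Laplace smoothing so unseen tokens don't zero a score."""
        tokens = TOKEN_RE.findall(text)
        best, best_score = None, float('-inf')
        for lang in self.counts:
            vocab = len(self.counts[lang])
            score = sum(
                math.log((self.counts[lang][t] + 1) /
                         (self.totals[lang] + vocab + 1))
                for t in tokens)
            if score > best_score:
                best, best_score = lang, score
        return best
#+END_SRC

Training on swh's own contents (with the filename-derived language as label) could then refine the keyword-only priors.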
** Status
Parsing is too basic in the current implementation; we need to work on a better tokenizer, since some languages (e.g. lisp) are wrongly parsed.
For now, using python3-pygments instead of this.
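On the tokenizer issue: lisp breaks a word-character tokenizer because symbols like string->list or let* contain punctuation. One possible fix is to treat any run of non-whitespace, non-bracket characters as a single token, so lisp symbols survive while parentheses still delimit (a sketch, not the current implementation):

#+BEGIN_SRC python
import re

# One token = either a run of chars that are not whitespace, brackets
# or quotes, or a single bracket character.
LISPY_TOKEN_RE = re.compile(r"[^\s()\[\]{}'\"]+|[()\[\]{}]")

def tokenize(text):
    return LISPY_TOKEN_RE.findall(text)
#+END_SRC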
