Commit 64efc41f authored by Daniele Venzano

Documentation update

parent 97743326
#!/usr/bin/env bash
PYTHONPATH=. sphinx-build -nW -b html -d docs/_build/doctrees docs/ docs/_build/html
Zoe - Container-based Analytics as a Service
============================================
Zoe uses `Docker Swarm <https://docs.docker.com/swarm/>`_ to run Analytics as a Service applications.
Zoe is user-facing software that hides the complexities of managing resources and of configuring and deploying complex distributed applications on private clouds. The main focus is data analysis applications, such as `Spark <http://spark.apache.org/>`_, but Zoe has a very flexible application description format that lets you easily describe any kind of application.
Zoe is fast: it can create a fully-functional Spark cluster of 20 nodes in less than five seconds.
Zoe uses containerization technology to provide fast startup times and process isolation. A smart scheduler is able to prioritize executions according to several policies, maximising the use of the available capacity and maintaining a queue of executions that are ready to run.
Zoe is easy to use: just a few clicks on a web interface is all that is needed to configure and start a variety of data-intensive applications. Applications are flexible compositions of Frameworks: for example Jupyter and Spark can be composed to form a Zoe application (a ZApp!).
Zoe is open: applications are described by a JSON file, and anything that can run in a Docker container can be run within Zoe (but we concentrate on data-intensive applications).
Zoe is smart: not everyone has infinite resources like Amazon or Google. Zoe is built for small clouds, physical or virtual, and is designed to maximize the use of available capacity.
Zoe can use a Docker Swarm located anywhere, on Amazon or in your own private cloud, and does not need exclusive access to it, meaning your Swarm could also be running other services: Zoe will not interfere with them. Zoe is meant as a private service, adding data-analytics capabilities to existing, or new, Docker clusters.
Zoe currently supports Docker Swarm as the container backend. It can be located anywhere, on Amazon or in your own private cloud, and Zoe does not need exclusive access to it, meaning your Swarm could also be running other services: Zoe will not interfere with them. Zoe is meant as a private service, adding data-analytics capabilities to new or existing clusters.
The core components of Zoe are application-independent and users are free to create and execute application descriptions for any kind of service combination. Zoe targets analytics services in particular: we offer a number of tested sample ZApps and Frameworks that can be used as starting examples.
......@@ -28,7 +22,18 @@ A number of predefined applications for testing and customization can be found a
Have a look at the :ref:`vision` and at the :ref:`roadmap` to see what we are currently planning and feel free to `contact us <daniele.venzano@eurecom.fr>`_ via email or through the `GitHub issue tracker <https://github.com/DistributedSystemsGroup/zoe/issues>`_ to pose questions or suggest ideas and new features.
Contents:
A note on terminology (needs to be updated)
-------------------------------------------
We are spending a lot of effort to use consistent naming throughout the documentation, the software, the website and all the other resources associated with Zoe. Check the :ref:`architecture` document for the details, but here is a quick reference:
* Zoe Components: the Zoe processes, the Master, the API and the service monitor
* Zoe Applications: a composition of Zoe Frameworks, the highest-level entry in the application descriptions that the user submits to Zoe; can be abbreviated as ZApp(s)
* Zoe Frameworks: a composition of Zoe Services, used to describe re-usable pieces of Zoe Applications, like a Spark cluster
* Zoe Services: one to one with a Docker container, describes a single service/process tree running in an isolated container
Contents
--------
.. toctree::
:maxdepth: 2
......@@ -38,30 +43,48 @@ Contents:
logging
monitoring
architecture
howto_zapp
rest-api
vision
roadmap
contributing
Zoe applications
----------------
A note on terminology
---------------------
:ref:`modindex`
We are spending a lot of effort to use consistent naming throughout the documentation, the software, the website and all the other resources associated with Zoe. Check the :ref:`architecture` document for the details, but here is a quick reference:
.. toctree::
:maxdepth: 2
* Zoe Components: the Zoe processes, the Master, the API and the service monitor
* Zoe Applications: a composition of Zoe Frameworks, the highest-level entry in the application descriptions that the user submits to Zoe; can be abbreviated as ZApp(s)
* Zoe Frameworks: a composition of Zoe Services, used to describe re-usable pieces of Zoe Applications, like a Spark cluster
* Zoe Services: one to one with a Docker container, describes a single service/process tree running in an isolated container
zapps/classification
zapps/howto_zapp
zapps/zapp_format
zapps/contributing
Developer documentation
-----------------------
:ref:`modindex`
.. toctree::
:maxdepth: 2
developer/introduction
developer/rest-api
developer/auth
developer/api-endpoint
developer/master-api
developer/scheduler
Contacts
========
`Zoe website <http://zoe-analytics.eu>`_
Zoe is developed as part of the research activities of the `Distributed Systems Group <http://distsysgroup.wordpress.com>`_ at `Eurecom <http://www.eurecom.fr>`_, in
Sophia Antipolis, France.
`Zoe mailing list <http://www.freelists.org/list/zoe>`_
About us
========
Zoe is developed as part of the research activities of the `Distributed Systems Group <http://distsysgroup.wordpress.com>`_ at `Eurecom <http://www.eurecom.fr>`_, in Sophia Antipolis, France.
The main discussion area for issues, questions and feature requests is the `GitHub issue tracker <https://github.com/DistributedSystemsGroup/zoe/issues>`_.
......@@ -11,7 +11,7 @@ To setup a more convenient loggin solution, Zoe provides two alternatives:
2. Using the ``service-log-path`` option: logs will be stored in the specified directory when the execution terminates. The directory can be exposed via HTTP or NFS to give access to users. On the other hand, if the logs are too big, Zoe will spend a large amount of time saving the data, and resources will not be freed until the copying process has finished.
In our experience, web interfaces like Kibana or Graylog are not useful to the Zoe users: they want to quickly dig through logs of their executions to find an error or an interesting number to correlate to some other number in some other log. The web interfaces (option 1) are slow and cluttered compared to using grep on a text file (option 2).
Which alternative is good for you depends on the usage pattern of your users, your log storage/auditing requirements, etc.
Which alternative is good for you depends on the usage pattern of your users, your log auditing requirements, etc.
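As a rough illustration of the grep-style workflow that option 2 enables, the following Python sketch searches every file under a log directory for a regular expression. It assumes the logs are plain text files; the function name ``grep_logs`` and the directory layout are hypothetical and not part of Zoe.

```python
import re
from pathlib import Path


def grep_logs(log_dir, pattern):
    """Yield (file name, line number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    # Walk the directory tree in a stable order so results are reproducible.
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the search going on badly encoded log lines.
        with path.open(errors="replace") as log_file:
            for number, line in enumerate(log_file, start=1):
                if regex.search(line):
                    yield path.name, number, line.rstrip("\n")
```

A user would call it with the directory configured in ``service-log-path``, for example ``list(grep_logs("/path/to/logs", r"ERROR"))``, where the path is a placeholder.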
Optional Kafka support
----------------------
......
.. _zapp_classification:
Classification
==============
Zoe runs processes inside containers and the Zoe application description is very generic, allowing any kind of application to be described in Zoe and submitted for execution. While the main focus of Zoe is so-called "analytic applications", there are many other tools that can be run on the same cluster, for monitoring, storage, log management, history servers, etc. These applications can be described in Zoe and executed, but they have quite different scheduling constraints.
- Long running: potentially will never terminate
  - Non elastic
    - Storage: need to have access to non-container storage (volumes or disk partitions)
      - HDFS
      - Cassandra
      - ElasticSearch
    - Interactive: need to expose web interfaces to the end user
      - Jupyter
      - Spark, Hadoop, Tensorflow, etc history servers
      - Kibana
      - Graylog (web interface only)
    - Streaming:
      - Logstash
    - User access
      - Proxies and SSH gateways
  - Elastic
    - Streaming:
      - Spark streaming user jobs
      - Storm
      - Flink streaming
- Batch: will eventually finish by themselves
  - Elastic:
    - Spark classic batch jobs
    - Hadoop MapReduce
    - Flink
  - Non elastic:
    - MPI
    - Tensorflow
All the applications in the **long-running** category need to be deployed, managed, upgraded and monitored since they are part of the cluster infrastructure. The Jupyter notebook may at first glance seem out of place, but in fact it is an interface to access different computing systems and languages, sometimes integrated in Jupyter itself, but also distributed on other nodes, with Spark or Tensorflow backends. As an interface, the user may expect it to always be there, making it part of the infrastructure.
The **elastic, long-running** applications have an additional degree of flexibility that Zoe can take into account. They have all the same needs as the non-elastic applications, but they can also be scaled according to many criteria (priority, latency, data volume).
The applications in the **batch** category, instead, need to be scheduled according to policies that give more or less priority to different jobs, taking also into account the elasticity of some of these computing engines.
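The two axes of this classification (lifetime and elasticity) can be sketched as a small Python lookup table. This is only an illustration: the type and function names are hypothetical, nothing here is part of Zoe's scheduler, and only a few entries from the list above are included.

```python
from enum import Enum
from typing import NamedTuple


class Lifetime(Enum):
    LONG_RUNNING = "long-running"
    BATCH = "batch"


class Scaling(Enum):
    ELASTIC = "elastic"
    NON_ELASTIC = "non-elastic"


class AppClass(NamedTuple):
    lifetime: Lifetime
    scaling: Scaling


# A few entries taken from the classification list (hypothetical encoding).
CLASSIFICATION = {
    "HDFS": AppClass(Lifetime.LONG_RUNNING, Scaling.NON_ELASTIC),
    "Jupyter": AppClass(Lifetime.LONG_RUNNING, Scaling.NON_ELASTIC),
    "Storm": AppClass(Lifetime.LONG_RUNNING, Scaling.ELASTIC),
    "Hadoop MapReduce": AppClass(Lifetime.BATCH, Scaling.ELASTIC),
    "MPI": AppClass(Lifetime.BATCH, Scaling.NON_ELASTIC),
}


def is_infrastructure(app):
    """Long-running applications are treated as part of the infrastructure."""
    return CLASSIFICATION[app].lifetime is Lifetime.LONG_RUNNING


def can_be_rescaled(app):
    """Elastic applications can be scaled while they are running."""
    return CLASSIFICATION[app].scaling is Scaling.ELASTIC
```

With this encoding, a scheduling policy could branch on the two properties independently, for example treating ``Storm`` as infrastructure that can also be rescaled.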
.. _contributing:
How to contribute
=================
Zoe applications
----------------
Contributing ZApps
------------------
Zoe applications are maintained in the `zoe-applications <https://github.com/DistributedSystemsGroup/zoe-applications>`_ repository. Feel free to fork it and open pull requests for new applications, frameworks and services.
Check also the :ref:`howto_zapp` document for help on building ZApps from the already-available services and the :ref:`zapp_format`.
Developer documentation
-----------------------
:ref:`modindex`
.. toctree::
:maxdepth: 2
developer/introduction
developer/zapp_format
developer/rest-api
developer/auth
developer/api-endpoint
developer/master-api
developer/scheduler
......@@ -24,7 +24,7 @@ In this guide we are going to use Python because it is a very easy language to u
We are planning graphical tools and a packaging system for ZApps, so stay tuned for updates! In the `Zoe Applications repository <https://github.com/DistributedSystemsGroup/zoe-applications>`_ there is already a very simple web interface we use internally for our users.
.. image:: figures/zapp_structure.png
.. image:: /figures/zapp_structure.png
A ZApp is a tree of nested dictionaries (other languages call them maps or hashmaps). The actual JSON tree is flattened because Zoe does not need to know about Frameworks: they are a logical subdivision that helps the user.
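As a minimal sketch of this flattening, the Python code below composes two frameworks, each a plain list of service dictionaries, into a single ZApp dictionary and serializes it to JSON. All field names (``name``, ``image``, ``count``, ``services``) and image names are hypothetical placeholders, not the actual ZApp format; refer to :ref:`zapp_format` for the real fields.

```python
import json


def spark_framework(workers):
    """Return a hypothetical Spark framework as a list of service dictionaries."""
    master = {"name": "spark-master", "image": "example/spark-master"}
    worker = {"name": "spark-worker", "image": "example/spark-worker", "count": workers}
    return [master, worker]


def jupyter_framework():
    """Return a hypothetical single-service framework."""
    return [{"name": "jupyter", "image": "example/jupyter"}]


def build_zapp(name, frameworks):
    """Merge the services of all frameworks into one flat list."""
    services = [service for framework in frameworks for service in framework]
    return {"name": name, "services": services}


zapp = build_zapp("notebook-with-spark", [spark_framework(workers=2), jupyter_framework()])
zapp_json = json.dumps(zapp, indent=4, sort_keys=True)
print(zapp_json)
```

The frameworks exist only as Python helper functions; the resulting JSON contains a flat list of services and no trace of the framework grouping.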
......