Commit 9b0e81be authored by unknown

Keep updating the documentation

parent 14d89e6a
......@@ -21,6 +21,8 @@ import os
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))
from zoe_lib.version import ZOE_VERSION
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
......@@ -58,7 +60,7 @@ author = 'Daniele Venzano'
# built documents.
#
# The short X.Y version.
version = '0.8.92'
version = ZOE_VERSION
# The full version, including alpha/beta/rc tags.
release = version
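The diff above replaces the hard-coded version string with a constant imported from ``zoe_lib.version``, so the docs and the package share one source of truth. That module's contents are not shown in this commit; a minimal sketch of what it might contain, reusing the version string the old conf.py carried:

```python
# zoe_lib/version.py -- hypothetical sketch of the module conf.py imports.
# Keeping the version string in a single place means Sphinx docs and the
# package itself can never drift apart.
ZOE_VERSION = '0.8.92'  # value previously hard-coded in conf.py
```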
......
......@@ -45,6 +45,7 @@ Contents
architecture
rest-api
vision
motivations
roadmap
Zoe applications
......
.. _motivation:
The motivation behind Zoe
=========================
The fundamental idea of Zoe is that a user who wants to run data analytics applications should not be bothered by system details, such as how to configure the amount of RAM a Spark Executor should use, how many cores are available in the system, or even how many worker nodes should be used to meet an execution deadline.
Moreover, final users require a lot of flexibility: they want to test new analytics systems and algorithms as soon as possible, without having to wait for an approval procedure to go through the IT department. Zoe proposes a flexible model for application descriptions: their management can be left entirely to the final user, but they can also be prepared, in whole or in part, by an IT department, which is often more knowledgeable about resource limits and environment variables. We also plan to offer a number of building blocks (Zoe Frameworks) that can be composed to make Zoe Applications.
Finally, we feel that there is a lack of solutions in the field of private clouds, where resources are not infinite and data layers (data-sets) may be shared between different users. All the current Open Source solutions we are aware of target the public cloud use case and try, more or less, to mimic what Amazon and other big names are doing in their data centers.
Zoe strives to satisfy the following requirements:
* easy to use for the end-user
* easy to manage for the system administrator, easy to integrate in existing data-centers/clouds/VM deployments
* short (a few seconds) reaction times to user requests or other system events
* smart queuing and scheduling of applications when resources are critical
Kubernetes, OpenStack Sahara, Mesos and YARN are the projects that, each in its own way, come close to Zoe, yet none of them solves the needs we have.
Kubernetes (Borg)
-----------------
Kubernetes is a very complex system, both to deploy and to use. It takes some of its architectural principles from Google Borg and targets data centers with vast amounts of resources. We feel that while Kubernetes can certainly run analytic services in containers, it does so at a very high complexity cost for smaller setups. Moreover, in our opinion, certain scheduler choices in how preemption is managed do not apply well to environments with a limited set of users and compute resources, causing less than optimal resource usage.
OpenStack Sahara
----------------
We know `OpenStack Sahara <https://wiki.openstack.org/wiki/Sahara>`_ well, as we have been using it since 2013 and have contributed the Spark plugin. We feel that Sahara has limitations in:
* software support: Sahara plugins support a limited set of data-intensive frameworks; adding a new one means writing a new Sahara plugin, and even adding support for a new version requires going through a one-to-two week (on average) review process.
* lack of scheduling: Sahara assumes that you have infinite resources. When you try to launch a new cluster and not enough resources are available, the request fails and the user is left doing application and resource scheduling by hand.
* usability: setting up everything that is needed to run an EDP job is cumbersome and error-prone. The user has to provide too many details in too many different places.
Moreover, changes to Sahara need to go through a lengthy review process that, on one hand, tries to ensure high quality, but on the other slows down development, especially of major architectural changes like the ones needed to address the concerns listed above.
Mesos
-----
Mesos markets itself as a data-center operating system. Zoe has no such high-profile objective: while Zoe schedules distributed applications, it has no knowledge of the applications it is scheduling and, even more importantly, does not require any change to the applications themselves for them to run in Zoe.
Mesos requires that each application provides two Mesos-specific components: a scheduler and an executor. Zoe has no such requirements and runs applications unmodified.
YARN
----
YARN, from our point of view, has many similarities with Mesos: it requires application support. Moreover, it is integrated in the Hadoop distribution and, while recent efforts are pushing toward making YARN stand on its own, it is currently tailored for Hadoop applications.
.. _vision:
Vision for Zoe
==============
Zoe's focus is data analytics. This focus helps define a clear set of objectives and priorities for the project and avoids the risk of competing directly with generic infrastructure managers like Kubernetes or Swarm. Zoe instead sits on top of these "cloud operating systems" to provide a simpler interface to end users who have no interest in the intricacies of container infrastructures.
Data analytic applications do not work in isolation. They need data, which may be stored or streamed, and they generate logs, which may have to be analyzed for debugging or stored for auditing. Data layers, in turn, need health monitoring. All these tools, frameworks, distributed filesystems and object stores form a galaxy that revolves around the analytic applications. For simplicity we will call them "support applications".
To offer an integrated, modern solution to data analysis, Zoe must support both analytic and support applications, even if they have very different users and requirements.
The users of Zoe
----------------
- users: the real end-users, who submit non-interactive analytic jobs or use interactive applications
- admins: systems administrators who maintain Zoe, the data layers, monitoring and logging for the infrastructure and the apps being run
Deviation from the current ZApp terminology
-------------------------------------------
In the current Zoe implementation (0.10.x), ZApps are self-contained descriptions of a set of cooperating processes. They are submitted once, to request the start-up of an execution. This fits well the model of a single Spark job or of a throw-away Jupyter Notebook.
We need to revise this terminology a bit: ZApps remain the top-level, user-visible entity. The ZApp building blocks, analytic engines or support tools, are called frameworks. A framework, by itself, cannot be run: it lacks configuration or a binary to execute, for example. Each framework is composed of one or more processes.
A few examples:
- A Jupyter Notebook, by itself, is a framework in Zoe terminology. It lacks the configuration that tells it which kernels to enable or on which port to listen. It is a framework composed of just one process.
- A Spark cluster is another framework. By itself it does nothing. It can be connected to a notebook, or it can be given a jar file and some data to process.
To create a ZApp you need to put together one or more frameworks and add some configuration (framework-dependent) that tells the framework(s) how to behave.
A ZApp shop could contain both frameworks (that the user must combine) and full-featured ZApps, whenever possible.
Nothing prevents certain ZApp attributes from being "templated", such as resource requests or elastic process counts.
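To make the framework/ZApp distinction concrete, a ZApp assembled from two frameworks could look like the following sketch. All field names here are illustrative assumptions, not the actual Zoe description schema:

```python
# Hypothetical sketch of a ZApp built from two frameworks: a Jupyter
# Notebook (one process) and a Spark cluster. Field names are invented
# for illustration only.
zapp = {
    "name": "notebook-with-spark",
    "frameworks": [
        {
            "name": "jupyter",
            "processes": ["notebook"],  # a framework of just one process
            # Configuration the framework lacks on its own:
            "config": {"kernels": ["pyspark"], "port": 8888},
        },
        {
            "name": "spark",
            "processes": ["master", "worker"],
            # A "templated" attribute: the worker count could be filled
            # in at submission time instead of being fixed here.
            "config": {"worker_count": 4, "worker_ram_gb": 8},
        },
    ],
}

# A framework alone cannot run; only the assembled ZApp, with its
# framework-dependent configuration, is submitted for execution.
assert all("config" in fw for fw in zapp["frameworks"])
```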
Kinds of applications
---------------------
See :ref:`zapp_classification`.
The concept of application in Zoe is very difficult to define, as it is fluid and encompasses a lot of different tools, frameworks, interfaces, etc.
The focus on analytic applications helps in giving some concrete examples and use cases.
There are two main categories of use cases:
- long-running executions
- data processing workflows
Long-running executions
^^^^^^^^^^^^^^^^^^^^^^^
In this category we have:
- interactive applications started by users (a Jupyter Notebook for example)
- support applications started by admins
- data layers
- monitoring tools
These applications are static in nature. Once deployed they need to be maintained for an indefinite amount of time. A data layer can be expanded with new nodes, or a monitoring pipeline can be scaled up or down, but these are events initiated manually by admins or performed automatically following administrative policies.
Interactive applications (usually web interfaces) can be stand-alone data analysis tools or can be connected to distributed data-intensive frameworks. As a matter of fact, a user may start working on a stand-alone interface and then connect the same interface to bigger and bigger clusters to test their algorithm with more and more data.
Workflows
^^^^^^^^^
A few examples of workflows:
- run a single job (simplest kind of workflow)
- run a job every hour (triggered by time)
- run a set of jobs in series or in parallel (triggered by the state of other jobs)
- run a job whenever the size of a file/directory/bucket on a certain data layer reaches 5GB (more complex trigger)
- combinations of all the above
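The trigger-based workflows listed above could be modeled, in a very simplified form, along these lines. This is purely an illustrative sketch; Zoe does not ship such an API, and all names are invented:

```python
# Minimal sketch of trigger-driven workflow steps, as described above.
# Triggers can be time-based, job-state-based or data-size-based.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    trigger: Callable[[], bool]  # fires when the step should run
    done: bool = False

def run_ready_steps(steps: List[Step]) -> List[str]:
    """Start every not-yet-done step whose trigger fires; return their names."""
    started = []
    for step in steps:
        if not step.done and step.trigger():
            step.done = True  # stand-in for submitting a ZApp execution
            started.append(step.name)
    return started

# Example: a size-triggered step fires once a (fake) bucket reaches 5 GB.
bucket_size_gb = 6
steps = [
    Step("hourly-report", trigger=lambda: True),  # time-trigger stub
    Step("big-data-job", trigger=lambda: bucket_size_gb >= 5),
]
print(run_ready_steps(steps))  # -> ['hourly-report', 'big-data-job']
```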
A complete workflow system for data analytics is very complex and constitutes a whole different project that runs on top of Zoe's core functionality. Zoe-workflow should be implemented incrementally, starting with the basics; when it reaches a certain complexity, it should be spun out into its own project.
At the beginning, workflows can be made up of self-ending applications only. Integrating streaming applications should be done later on.
Commands to the system
----------------------
Zoe manages requests to change the state of a set of resources (a virtual or physical cluster of machines) by starting, terminating or modifying process containers.
Users should be kept as ignorant as possible of the inner workings of these state changes and should be able to express high-level commands, like:
- start this application (creates one or more executions)
- terminate this execution
- attach a new Spark cluster to this Jupyter Notebook
- define a workflow (see the workflow section)
These kinds of commands should be translated automatically into Zoe state changes, which are then applied by the components at the lower levels of the Zoe architecture.
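The translation from a high-level command to a state change could be pictured as follows. This is an assumption-laden sketch, not the real Zoe API; the states and function names are invented for illustration:

```python
# Illustrative mapping from user-level commands to execution state
# changes that lower-level components would then apply. The enum values
# and function are hypothetical, not Zoe's actual interface.
from enum import Enum

class ExecutionState(Enum):
    SUBMITTED = "submitted"    # waiting for the scheduler
    RUNNING = "running"
    TERMINATED = "terminated"

def handle_command(command: str, execution: dict) -> dict:
    """Translate a high-level command into a target execution state."""
    if command == "start":
        execution["state"] = ExecutionState.SUBMITTED
    elif command == "terminate":
        execution["state"] = ExecutionState.TERMINATED
    else:
        raise ValueError(f"unknown command: {command}")
    return execution

execution = {"app": "jupyter-notebook", "state": None}
handle_command("start", execution)
print(execution["state"])  # -> ExecutionState.SUBMITTED
```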
In addition to the commands above, admins should also be able to define operations on long-running executions:
- request rolling or standard upgrades (find all containers using version 1 of a certain image and upgrade them to version 2)
- start and scale long-running applications
- define non-ephemeral storage volumes for data layer applications
- terminate (should be well protected, as it may cause data loss)
......@@ -30,15 +30,16 @@ Zoe runs processes inside containers and the Zoe application description is very
- Proxies and SSH gateways
- Elastic
- Elastic (can be automatically resized)
- Streaming:
- Spark streaming user jobs
- Storm
- Flink streaming
- Kafka
- Batch: will eventually finish by themselves
- Self-ending/Ephemeral: will eventually finish by themselves
- Elastic:
......@@ -55,4 +56,4 @@ All the applications in the **long-running** category need to be deployed, manag
The **elastic, long-running** applications have one more degree of flexibility that Zoe can take into account. They have all the same needs as the non-elastic applications, but they can also be scaled according to many criteria (priority, latency, data volume).
The applications in the **batch** category, instead, need to be scheduled according to policies that give more or less priority to different jobs, taking also into account the elasticity of some of these computing engines.
The applications in the **self-ending** category, instead, need to be scheduled according to policies that give more or less priority to different jobs, taking also into account the elasticity of some of these computing engines.