Commit e8c6c2c2 authored by Daniele Venzano

Add more content to the vision page

parent 9b0e81be
Vision for Zoe
==============
Zoe's focus is data analytics. This focus helps define a clear set of objectives and priorities for the project, and avoids the risk of competing directly with generic infrastructure managers like Kubernetes or Swarm. Zoe instead sits on top of these "cloud managers" to provide a simpler interface to end users who have no interest in the intricacies of container infrastructures.
Data analytics applications do not work in isolation. They need data, which may be stored or streamed, and they generate logs, which may have to be analyzed for debugging or stored for auditing. Data layers, in turn, need health monitoring. All these tools, frameworks, distributed filesystems and object stores form a galaxy that revolves around the analytic applications. For simplicity we will call these "support applications".
Deviation from the current ZApp terminology
-------------------------------------------
In the current Zoe implementation (0.10.x), ZApps are self-contained descriptions of a set of cooperating processes. They get submitted once, to request the start-up of an execution. This fits well with the model of a single Spark job or of a throw-away Jupyter notebook.
We need to revise this terminology a bit: ZApps remain the top-level, user-visible entity. The ZApp building blocks, analytic engines or support tools, are called frameworks. A framework, by itself, cannot be run: it lacks configuration or a user-provided binary to execute, for example. Each framework is composed of one or more processes.
A few examples:
- A Jupyter Notebook, by itself, is a framework in Zoe terminology. It lacks the configuration that tells it which kernels to enable or which port to listen on. It is a framework composed of just one process.
- A Spark cluster is another framework. By itself it does nothing. It can be connected to a notebook, or it can be given a jar file and some data to process.
To create a ZApp you need to put together one or more frameworks and add some framework-dependent configuration that tells them how to behave.
A ZApp shop could contain both frameworks, which the user must combine, and full-featured ZApps, whenever that is possible.
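As a concrete illustration, a ZApp that combines a Jupyter Notebook framework with a Spark framework could be described along the lines below. This is a minimal sketch: every field name is hypothetical and does not reflect Zoe's actual description format.

.. code-block:: python

    # Hypothetical ZApp description: field names are illustrative only,
    # not Zoe's actual description format.
    notebook_with_spark_zapp = {
        "name": "notebook-with-spark",
        "frameworks": [
            {
                "name": "jupyter-notebook",
                "processes": ["notebook"],
                # Framework-dependent configuration that makes it runnable.
                "config": {"kernels": ["pyspark"], "listen_port": 8888},
            },
            {
                "name": "spark-cluster",
                "processes": ["spark-master", "spark-worker"],
                "config": {"worker_count": 4, "worker_memory_gb": 8},
            },
        ],
    }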
Kinds of applications
---------------------
See :ref:`zapp_classification`. The concept of application in Zoe is very difficult to define, as it is very fluid and encompasses a lot of different tools, frameworks, interfaces, etc.
The focus on analytic applications helps in giving some concrete examples and use cases. There are two main categories of use cases:

- long-running executions
- data processing workflows

Long-running executions
^^^^^^^^^^^^^^^^^^^^^^^

In this category we have:

- interactive applications started by users (a Jupyter Notebook, for example)
- support applications started by admins

Other examples could be:

- data layers
- monitoring tools

These applications are static in nature. Once deployed, they need to be maintained for an indefinite amount of time. A data layer can be expanded with new nodes, or a monitoring pipeline can be scaled up or down, but these are events initiated manually by admins or performed automatically following administrative policies.

Interactive applications (usually web interfaces) can be stand-alone data analysis tools or can be connected to distributed data-intensive frameworks. As a matter of fact, a user may start working on a stand-alone interface and then connect the same interface to bigger and bigger clusters to test their algorithm with more and more data.

Architecture
------------

.. image:: /figures/extended_arch.png

This architecture extends the current one by adding a number of pluggable modules that implement additional (and optional) features. The additional modules are, in principle, created as further Zoe frameworks that are instantiated in ZApps and run through the scheduler together with all the other ZApps.

The core of Zoe remains very simple to understand, while opening the door to more capabilities. The additional modules will be able to communicate with the backend, to submit, modify and terminate executions via the scheduler, and to report information back to the user. Their actions will be driven by additional fields written in the ZApp descriptions.

In the figure there are three examples of such modules:

- Zoe-HA: monitors and performs tasks related to high availability, both for Zoe components and for running user executions. This module could take care of maintaining a certain replication factor, making sure a certain service is restarted in case of failure, or updating a load balancer or a DNS entry.
- Zoe rolling upgrades: helps the system administrator perform rolling upgrades of long-running ZApps.
- Workflows (see the section below)

Other examples could be:

- Zoe-storage: for managing volumes and associated constraints

The modules should try to re-use as much as possible the functionality already available in the backends. A simple Zoe installation could run on a single Docker engine available locally and provide a reduced set of features, while a full-fledged install could run on top of Kubernetes and provide all the large-scale deployment features that such a platform already implements.
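To make the module contract more concrete, here is a minimal sketch of how a module such as Zoe-HA could interact with the scheduler. All class and method names are assumptions made for illustration; they are not Zoe's actual API.

.. code-block:: python

    import abc

    class SchedulerClient(abc.ABC):
        """Hypothetical facade that modules use to talk to the Zoe scheduler."""

        @abc.abstractmethod
        def submit(self, zapp_description: dict) -> str:
            """Submit a new execution and return its ID."""

        @abc.abstractmethod
        def modify(self, execution_id: str, changes: dict) -> None:
            """Modify a running execution (e.g. scale one of its frameworks)."""

        @abc.abstractmethod
        def terminate(self, execution_id: str) -> None:
            """Terminate a running execution."""

    class ZoeHAModule:
        """Sketch of a Zoe-HA module that restarts failed services.

        Its behaviour is driven by extra fields in the ZApp description,
        here an assumed ``ha.restart_on_failure`` flag.
        """

        def __init__(self, scheduler: SchedulerClient) -> None:
            self.scheduler = scheduler

        def on_service_failed(self, execution_id: str, zapp: dict) -> None:
            if zapp.get("ha", {}).get("restart_on_failure", False):
                # Terminate the broken execution and resubmit it from its
                # description, going through the scheduler like any other ZApp.
                self.scheduler.terminate(execution_id)
                self.scheduler.submit(zapp)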
Workflows
^^^^^^^^^
A few examples of workflows:
- run a single job (simplest kind of workflow)
- run a job every hour (triggered by time)
- run a set of jobs in series or in parallel (triggered by the state of other jobs)
- run a job whenever the size of a file/directory/bucket on a certain data layer reaches 5GB (a more complex trigger)
- combinations of all the above
A complete workflow system for data analytics is very complex and is a whole different project that runs on top of Zoe's core functionality. Zoe-workflow should be implemented incrementally, starting with the basics; when it reaches a certain complexity, it should be spun out into its own project. There is a lot of theory behind workflow systems and, by itself, such a system is a project of the same size as Zoe itself. An example of a full-featured workflow system is http://oozie.apache.org/.
At the beginning, workflows can be made up of ephemeral applications only; integrating streaming applications should be done later on.
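As a sketch of what those basics could look like, a workflow step can be modelled as a ZApp plus a trigger predicate covering the cases listed above (time-, job-state- or data-size-based). All names below are hypothetical.

.. code-block:: python

    import time
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class WorkflowStep:
        """One step of a hypothetical workflow: a ZApp plus a start trigger."""
        zapp: dict
        trigger: Callable[[], bool]  # time-, job-state- or data-size-based

    def every_hour() -> bool:
        """Time-based trigger: fires at the top of each hour."""
        return time.gmtime().tm_min == 0

    def bucket_reaches(size_fn: Callable[[], int], threshold: int) -> Callable[[], bool]:
        """Data-based trigger: fires when a bucket's size passes a threshold."""
        return lambda: size_fn() >= threshold

    # A step that runs a (hypothetical) aggregation ZApp once a 5 GB
    # threshold is reached on some data layer; the size function is a stub.
    step = WorkflowStep(zapp={"name": "aggregate-logs"},
                        trigger=bucket_reaches(lambda: 0, 5 * 1024 ** 3))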
Commands to the system
----------------------
Zoe manages requests to change the state of a set of resources (a virtual or physical cluster of machines) by starting, terminating or modifying process containers.
Users should be kept as ignorant as possible of the inner workings of these state changes, and should be able to express high-level commands, like:
- start this application (-> creates one or more executions)
- terminate this execution (or set of executions)
- attach a new Spark cluster to this Jupyter notebook
- define a workflow (see the workflow section)
These kinds of commands should be translated automatically into Zoe state changes, which are then applied by the components at the lower levels of the Zoe architecture.
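For instance, the "attach a Spark cluster to this notebook" command could expand into scheduler-level state changes along these lines, reusing the hypothetical ``SchedulerClient`` facade sketched in the architecture section; the change payloads are assumptions, not a real API.

.. code-block:: python

    def attach_spark_cluster(notebook_execution_id: str, spark_framework: dict,
                             scheduler: "SchedulerClient") -> None:
        """Hypothetical translation of a high-level user command into the
        lower-level state changes that Zoe components then apply."""
        # 1. Add the Spark framework to the running execution...
        scheduler.modify(notebook_execution_id,
                         {"add_framework": spark_framework})
        # 2. ...then point the notebook's kernels at the new cluster.
        scheduler.modify(notebook_execution_id,
                         {"config": {"spark_master": "spark://spark-master:7077"}})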
In addition to the commands above, admins should also be able to define operations on long-running executions:
- request rolling or standard upgrades (find all containers using a certain image at version 1 and upgrade them to version 2; see the sketch after this list)
- start and scale long-running applications
- define non-ephemeral storage volumes for data layer applications
- terminate (should be well protected, as it may cause data loss)
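The rolling-upgrade operation, for example, could be sketched as follows. The execution fields and the one-at-a-time policy are assumptions; a real implementation would also have to respect replication constraints.

.. code-block:: python

    def rolling_upgrade(executions: list, old_image: str, new_image: str,
                        scheduler: "SchedulerClient") -> None:
        """Find all executions running ``old_image`` and move them to
        ``new_image`` one at a time, keeping the service available."""
        for execution in executions:
            if execution["image"] != old_image:
                continue
            # Replace one execution at a time: terminate it, rewrite the
            # image in its description and resubmit through the scheduler.
            scheduler.terminate(execution["id"])
            upgraded_zapp = dict(execution["zapp"], image=new_image)
            scheduler.submit(upgraded_zapp)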
Classification
--------------
Zoe runs processes inside containers and the Zoe application description is very generic, allowing any kind of application to be described and submitted for execution. While the main focus of Zoe is so-called "analytic applications", many other tools can be run on the same cluster, for monitoring, storage, log management, history servers, etc. These applications can be described in Zoe and executed, but they have quite different scheduling constraints.
Please note that in this context an "elastic" service is a service that can be automatically resized. HDFS can be resized, but this is an administrative operation that requires setting up partitions and managing network and disk traffic, so it does not qualify as elastic.
- Long running: potentially will never terminate

  - Non elastic
  - Elastic:

    - Flink streaming
    - Kafka

- Ephemeral: will eventually finish by themselves

  - Elastic:
All the applications in the **long-running** category need to be deployed and then managed for an indefinite amount of time.
The **elastic, long-running** applications have an additional degree of flexibility that Zoe can take into account. They have all the same needs as the non-elastic applications, but they can also be scaled according to many criteria (priority, latency, data volume).
The applications in the **ephemeral** category, instead, will eventually terminate by themselves: a batch job is a good example. They need to be scheduled according to policies that give more or less priority to different jobs, also taking into account the elasticity of some of these computing engines.
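As one possible illustration of such a policy, pending executions could be ordered by priority, with elastic ones last among equals, since they can start small and grow when resources free up. The attribute names are assumptions.

.. code-block:: python

    def schedule_order(pending: list) -> list:
        """Order pending executions for scheduling: higher priority first;
        among equal priorities, elastic executions go last because they can
        start with fewer resources and be scaled up later."""
        return sorted(pending, key=lambda e: (-e["priority"], e["elastic"]))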