Commit d559961c authored by Daniele Venzano

Documentation update

parent 8d429a4c
@@ -6,24 +6,25 @@ Architecture
The main Zoe Components are:
* zoe master: the core component that performs application scheduling and talks to Swarm
* zoe observer: listens to events from Swarm and looks for idle resources to free automatically
* zoe api: the Zoe frontend, offering a web interface and a REST API
* zoe: command-line client
The command line client is the main user-facing component of Zoe, while the master and the observer are the back ends.
The Zoe master is the core component of Zoe and communicates with the clients by using an internal ZeroMQ-based protocol. This protocol is designed to be robust, following the best practices from the ZeroMQ documentation. A crash of the API or of the Master process will not leave the other component inoperable: when the faulted process restarts, work resumes where it was left.
In this architecture we moved all the state bookkeeping out to a Postgres database. With Zoe we try very hard not to reinvent the wheel, and the internal state system we had in the previous architecture iteration was starting to show its limits.
Users submit *execution requests*, composed of a name and an *application description*. The frontend process (Zoe web) informs the Zoe Master that a new execution request is available for execution.
Inside the Master, a scheduler keeps track of available resources and execution requests, and applies a
scheduling policy to decide which requests should be satisfied as soon as possible and which ones can be deferred for later.
The master also talks to Docker Swarm to create and destroy containers and to read monitoring information used to schedule applications.
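As an illustration of this internal channel, a client of the Master's ZeroMQ endpoint could look like the sketch below (using pyzmq). The message format shown here is an assumption for illustration only; the real Zoe protocol is internal and may differ::

    import zmq

    # Hypothetical client side of the internal API/Master channel.
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://127.0.0.1:4850")  # the master's api-listen-uri

    # The command name below is made up for illustration.
    socket.send_json({"command": "execution_list"})
    reply = socket.recv_json()
    print(reply)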
Application descriptions
------------------------
Application descriptions are at the core of Zoe. They are likely to evolve in time to satisfy the needs of new distributed analytic engines. The current version is built around several use cases involving MPI, Spark and Jupyter notebooks.
Application descriptions are composed of a set of generic attributes that apply to the whole Zoe Application (abbreviated as ZApp) and a list of Zoe Frameworks. Each Framework is composed of Zoe Services, which describe actual Docker containers. The composition of Frameworks and Services is described by a dependency tree.
Please note that this documentation refers to the full Zoe Application description, which is not yet fully implemented in the actual code.
The Zoe Service descriptions are strictly linked to the Docker images they use, as they specify environment variables and commands to be executed. We successfully used third-party images, demonstrating the generality of Zoe's approach, but in general we prefer to build our own.
You can have a look at example applications by going to the `zoe-applications <https://github.com/DistributedSystemsGroup/zoe-applications>`_ repository.
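As a taste of the structure described above, here is a minimal, hypothetical ZApp description; the field names are illustrative only and do not represent the final schema (refer to the repository above for working examples)::

    {
        "name": "spark-jupyter",
        "priority": 512,
        "frameworks": [
            {
                "name": "spark",
                "services": [
                    {
                        "name": "spark-master",
                        "docker_image": "my-registry:5000/zoerepo/spark-master",
                        "environment": [["SPARK_MASTER_PORT", "7077"]],
                        "required_resources": {"memory": 2147483648}
                    }
                ]
            }
        ]
    }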
@@ -10,36 +10,45 @@ Also the directive ``--write-config <filename>`` is available: it will generate
Zoe config files have a simple format of ``<option name> = <value>``. Dash characters can be used for comments.
Each Zoe component has its own configuration directives, as described in the following sections:
Please note that Zoe uses the database to retain shared state, so the master and the api processes must be configured with the same database.
zoe-master.conf
---------------
* ``debug = <true|false>`` : enable or disable debug log output
* ``swarm = zk://zk1:2181,zk2:2181,zk3:2181`` : connection string to the Swarm API endpoint. Can be expressed by a plain http URL or as a zookeeper node list in case Swarm is configured for HA.
* ``state-dir = /var/lib/zoe`` : Directory where all state and other binary files (execution logs) are saved.
* ``zoeadmin-password = changeme`` : Password for the zoeadmin user
* ``api-listen-uri = tcp://*:4850`` : ZeroMQ server connection string, used for the master listening endpoint
* ``deployment-name = devel`` : name of this Zoe deployment. Can be used to have multiple Zoe deployments using the same Swarm (devel and prod, for example)
* ``influxdb-dbname = zoe`` : Name of the InfluxDB database to use for storing metrics
* ``influxdb-url = http://localhost:8086`` : URL of the InfluxDB service
* ``influxdb-enable = False`` : Enable metric output toward InfluxDB
* ``passlib-rounds = 60000`` : Number of hashing rounds for passwords; this has a severe performance impact on each API call
* ``gelf-address = udp://1.2.3.4:1234`` : Enable Docker GELF log output to this destination
* ``workspace-base-path = /mnt/zoe-workspaces`` : Base directory where user workspaces will be created. This directory should reside on a shared filesystem visible by all Docker hosts.
* ``guest-gateway-image-name`` : Docker image for the guest gateway container (ex.: zoerepo/guest-gateway). The default image contains an ssh-based SOCKS proxy.
* ``user-gateway-image-name`` : Docker image for the user gateway container (ex.: zoerepo/guest-gateway). The default image contains an ssh-based SOCKS proxy.
* ``overlay-network-name = zoe`` : name of the pre-configured Docker overlay network Zoe should use
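A minimal sketch of a ``zoe-master.conf`` follows; the values are illustrative and must be adapted to your Swarm address, deployment name and shared filesystem::

    debug = false
    swarm = zk://zk1:2181,zk2:2181,zk3:2181
    api-listen-uri = tcp://*:4850
    deployment-name = prod
    workspace-base-path = /mnt/zoe-workspaces
    overlay-network-name = zoe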
zoe-observer.conf
-----------------
* ``debug = <true|false>`` : enable or disable debug log output
* ``swarm = zk://zk1:2181,zk2:2181,zk3:2181`` : connection string to the Swarm API endpoint. Can be expressed by a plain http URL or as a zookeeper node list in case Swarm is configured for HA.
* ``zoeadmin-password = changeme`` : Password for the zoeadmin user
* ``master-url = http://<address:port>`` : address of the Zoe Master REST API
* ``spark-activity-timeout = <seconds>`` : number of seconds to wait before an inactive Spark cluster is automatically terminated; this is done only for guest users
* ``loop-time = 300`` : time in seconds between successive checks for idle applications that can be automatically terminated
* ``dbname = zoe`` : DB name
* ``dbuser = zoe`` : DB user
* ``dbpass = zoe`` : DB password
* ``dbhost = localhost`` : DB hostname
* ``dbport = 5432`` : DB port
zoe-api.conf
------------
* ``debug = <true|false>`` : enable or disable debug log output
* ``listen-address`` : address Zoe will use to listen for incoming connections to the web interface
* ``listen-port`` : port Zoe will use to listen for incoming connections to the web interface
* ``master-url = tcp://127.0.0.1:4850`` : address of the Zoe Master ZeroMQ API
* ``deployment-name = devel`` : name of this Zoe deployment. Can be used to have multiple Zoe deployments using the same Swarm (devel and prod, for example)
* ``ldap-server-uri = ldap://localhost`` : LDAP server to use for user authentication
* ``ldap-base-dn = ou=something,dc=any,dc=local`` : LDAP base DN for users
* ``ldap-bind-user = cn=guest,ou=something,dc=any,dc=local`` : LDAP user to bind as for user lookup
* ``ldap-bind-password = notsosecret`` : LDAP bind user password
* ``ldap-admin-gid = 5000`` : LDAP group ID for admins
* ``ldap-user-gid = 5001`` : LDAP group ID for users
* ``ldap-guest-gid = 5002`` : LDAP group ID for guests
* ``dbname = zoe`` : DB name
* ``dbuser = zoe`` : DB user
* ``dbpass = zoe`` : DB password
* ``dbhost = localhost`` : DB hostname
* ``dbport = 5432`` : DB port
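A matching ``zoe-api.conf`` sketch, with illustrative values; note that ``master-url`` must point at the master's ``api-listen-uri`` and that the database settings must be the same as in ``zoe-master.conf``::

    debug = false
    listen-address = 0.0.0.0
    listen-port = 5001
    master-url = tcp://127.0.0.1:4850
    deployment-name = prod
    ldap-server-uri = ldap://localhost
    dbname = zoe
    dbuser = zoe
    dbpass = zoe
    dbhost = localhost
    dbport = 5432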
The zoe_lib package
===================
.. module:: zoe_lib
Users
-----
.. automodule:: zoe_lib.users
:members:
Applications
------------
.. automodule:: zoe_lib.applications
:members:
Executions
----------
.. automodule:: zoe_lib.executions
:members:
Services
--------
.. automodule:: zoe_lib.services
:members:
Exceptions
----------
.. automodule:: zoe_lib.exceptions
:members:
Queries
-------
.. automodule:: zoe_lib.query
:members:
General design decisions
========================
In this architecture we overturned our previous decision of keeping state internal, with periodic checkpointing.
State is kept in Postgres and shared among the different Zoe components. For a distributed system, an external database enormously simplifies many common situations, providing transactions and strong guarantees of consistency.
User management is left out of Zoe as much as possible. User authentication backends provide just a minimum of information for Zoe: a user ID and a role. Zoe does not manage creation, deletion, passwords, etc.
Authentication: HTTP basic auth is used, as it is the simplest reliable mechanism we could think of. It can be easily secured by adding SSL. Server-side, ``passlib`` guarantees a reasonably safe storage of salted password hashes.
There are advantages and disadvantages to this choice, but for now we want to concentrate on different, more core-related features of Zoe.
Zoe is distributed and uses threads to keep the APIs responsive at all times.
Object naming
-------------
Every object in Zoe has a unique name. Zoe uses a notation with a hierarchical structure, left to right, from specific to generic, like the DNS system.
These names are used throughout the API.
A service (one service corresponds to one Docker container) is identified by this name:
<service_name>-<execution_name>-<owner>-<deployment_name>
An execution is identified by:
<execution_name>-<owner>-<deployment_name>
A user is:
<owner>-<deployment_name>
And a Zoe instance is:
<deployment_name>
Where:
* service name: the name of the service as written in the application description
* execution name: name of the execution as passed to the start API call
* owner: user name of the owner of the execution
* deployment name: configured deployment for this Zoe instance
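For example, assuming a deployment named ``prod``, a user ``alice`` and an execution called ``wordcount`` that contains a service named ``spark-master``, the generated names would be::

    spark-master-wordcount-alice-prod    (service)
    wordcount-alice-prod                 (execution)
    alice-prod                           (user)
    prod                                 (Zoe instance)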
Docker hostnames
^^^^^^^^^^^^^^^^
The names described above are used to generate the names and host names in Docker. User networks are also named in the same way. This, among other things, has advantages when using Swarm commands, because it is easy to distinguish Zoe containers, and for monitoring solutions that take data directly from Swarm, preserving all labels and container names. With Telegraf, InfluxDB and Grafana it is possible to build Zoe dashboards that show resource utilization per-user or per-execution.
Database IDs are used to identify executions and services. Container names within Docker Swarm must be unique; we decided to produce names that give some information to the administrator who looks at the output of ``docker ps``, instead of using opaque UUIDs. In addition, these same names are exposed by standard monitoring tools.
@@ -3,6 +3,3 @@ Scheduler classes
.. autoclass:: zoe_master.scheduler.ZoeScheduler
:members:
.. autoclass:: zoe_master.scheduler_policies.base.BaseSchedulerPolicy
:members:
@@ -5,20 +5,15 @@ Zoe uses `Docker Swarm <https://docs.docker.com/swarm/>`_ to run Analytics as a
Zoe is fast: it can create a fully-functional Spark cluster of 20 nodes in less than five seconds.
Zoe is easy to use: just a few clicks on a web interface is all that is needed to configure and start a variety of data-intensive applications. Applications are flexible compositions of Frameworks: for example Jupyter and Spark can be composed to form a Zoe application (a ZApp!).
Zoe is open: applications can be described by a JSON file, anything that can run in a Docker container can be run within Zoe (but we concentrate on data intensive applications).
Zoe is smart: not everyone has infinite resources like Amazon or Google, Zoe is built for small clouds, physical or virtual, and is built to maximize the use of available capacity.
Zoe can use a Docker Swarm located anywhere, on Amazon or in your own private cloud, and does not need exclusive access to it, meaning your Swarm could also be running other services: Zoe will not interfere with them. Zoe is meant as a private service, adding data-analytics capabilities to existing, or new, Docker clusters.
The core components of Zoe are application-independent and users are free to create and execute application descriptions for any kind of service combination. Zoe targets analytics services in particular: we offer a number of tested sample ZApps and Frameworks that can be used as starting examples.
To better understand what we mean by "analytic service", here are a few examples:
@@ -31,8 +26,7 @@ To better understand what we mean by "analytic service", here are a few examples
A number of predefined applications for testing and customization can be found at the `zoe-applications <https://github.com/DistributedSystemsGroup/zoe-applications>`_ repository.
Have a look at the :ref:`vision` and at the :ref:`roadmap` to see what we are currently planning and feel free to `contact us <daniele.venzano@eurecom.fr>`_ via email or through the GitHub issue tracker to pose questions or suggest ideas and new features.
Contents:
@@ -45,6 +39,7 @@ Contents:
monitoring
architecture
vision
roadmap
contributing
@@ -53,8 +48,8 @@ A note on terminology
We are spending a lot of effort to use consistent naming throughout the documentation, the software, the website and all the other resources associated with Zoe. Check the :ref:`architecture` document for the details, but here is a quick reference:
* Zoe Components: the Zoe processes, the Master, the API and the service monitor
* Zoe Applications: a composition of Zoe Frameworks; the highest-level entry in the application descriptions that the user submits to Zoe. Can be abbreviated as ZApp(s).
* Zoe Frameworks: a composition of Zoe Services, used to describe re-usable pieces of Zoe Applications, like a Spark cluster
* Zoe Services: one-to-one with a Docker container, describing a single service/process tree running in an isolated container
@@ -4,32 +4,33 @@ Installing Zoe
Zoe components:
* Master
* Observer
* logger (optional, see https://github.com/DistributedSystemsGroup/zoe-logger)
* API
* command-line client
* Service monitor (not yet implemented)
Zoe is written in Python and uses the ``requirements.txt`` file to list the package dependencies needed for all components of Zoe. Not all of them are needed in all cases, for example you need the ``kazoo`` library only if you use Zookeeper to manage Swarm high availability.
Zoe is a young software project and we foresee it being used in places with wildly different requirements in terms of IT organization (what is below Zoe) and user interaction (what is above Zoe). For this reason we are aiming at providing a solid core of features and a number of basic external components that can be easily customized. For example, user management is delegated as much as possible to external services. For now we support LDAP, but other authentication methods can be easily implemented.
There is an experimental configuration file for Docker Compose, if you want to try it. It will run Zoe and its components inside Docker containers. It needs to be customized with the address of your Swarm master, the port mappings and the location of a shared filesystem.
Overview
--------
ZApps, usually, expose a number of interfaces (web, REST and others) to the user. Docker Swarm does not provide an easy way to manage this situation: the port can be statically allocated, but the IP address is chosen arbitrarily by Swarm and there is no discovery mechanism (DNS) exposed to the outside of Swarm.
There is no good, automated way to solve this problem. Each Swarm deployment is different: some use public addresses, others use overlay networks.
In our deployment we use a standard Swarm configuration, with private and closed overlay networks. We create one network for use by Zoe and spawn two containers attached to it: one is a SOCKS proxy and the other is an SSH gateway. Thanks to LDAP, users can use the SSH gateway to create tunnels and copy files from/to their workspace.
These gateway containers are maintained outside of Zoe, in this GitHub repository: https://github.com/DistributedSystemsGroup/gateway-containers
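For example, assuming the SSH gateway is reachable at the hypothetical address ``gateway.example.com``, a user can open a SOCKS tunnel with the standard OpenSSH client::

    ssh -N -D 1080 alice@gateway.example.com

The browser is then configured to use the SOCKS proxy on ``localhost:1080`` to reach the services run by Zoe executions.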
Zoe requires a shared filesystem, visible from all Docker hosts. Each user has a workspace directory, visible from all of their running ZApps. The workspace is used to save Jupyter notebooks, copy data from/to HDFS, and provide binaries to MPI and Spark applications.
Requirements
------------
* Python 3. Development happens on Python 3.4, but we also test on Python 3.5 via Travis-CI.
* Docker Swarm (we have not yet tested the new distributed swarm-in-docker available in Docker 1.12)
* A shared filesystem, mounted on all hosts that are part of the Swarm (see the example mount below). Internally we use CEPH-FS, but NFS is also a valid solution.
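As an example of the shared filesystem requirement, an NFS export (``fileserver`` is a hypothetical host) would be mounted at the same path on every Swarm host::

    mount -t nfs fileserver:/export/zoe-workspaces /mnt/zoe-workspaces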
Optional:
@@ -44,37 +45,31 @@ Install Docker and the Swarm container:
* https://docs.docker.com/installation/ubuntulinux/
* https://docs.docker.com/swarm/install-manual/
For testing you can use a Swarm with a single Docker instance located on the same host/VM.
Network configuration
^^^^^^^^^^^^^^^^^^^^^
Docker 1.9/Swarm 1.0 multi-host networking can be used in Zoe:
* https://docs.docker.com/engine/userguide/networking/get-started-overlay/
This means that you will also need a key-value store supported by Docker. We use Zookeeper, it is available in Debian and Ubuntu without the need for external package repositories and is very easy to set up.
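As a sketch, each Docker engine participating in the Swarm is started pointing at the key-value store, for example (assuming the Zookeeper ensemble used above)::

    docker daemon --cluster-store=zk://zk1:2181,zk2:2181,zk3:2181 \
                  --cluster-advertise=eth0:2376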
Images: Docker Hub Vs local Docker registry
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A few sample ZApps have their images available on the Docker Hub. We strongly suggest setting up a private registry, containing your customized Zoe Service images. Have a look at the `zoe-applications <https://github.com/DistributedSystemsGroup/zoe-applications>`_ repository for examples of Zoe Applications and Services that can be customized, built and loaded on the Hub or on a local registry.
Zoe
---
Currently this is the recommended procedure, once the initial Swarm setup has been done:
1. Clone the zoe repository
2. Install Python package dependencies: ``pip3 install -r requirements.txt``
3. Create new configuration files for the master and the api processes (:ref:`config_file`)
4. Setup supervisor to manage the Zoe processes: in the ``scripts/supervisor/`` directory you can find the configuration file for supervisor (a minimal sketch is also shown after this list). You need to modify the paths to point to where you cloned Zoe and set the user Zoe will run as (Zoe does not need special privileges).
5. Start running ZApps!
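As a minimal sketch, a supervisor entry for the master process could look like the following, assuming Zoe is cloned in ``/opt/zoe`` and runs as user ``zoe`` (check the files shipped in ``scripts/supervisor/`` for the authoritative versions)::

    [program:zoe-master]
    command=/usr/bin/python3 zoe-master.py
    directory=/opt/zoe
    user=zoe
    autorestart=true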
Docker Compose - demo install
-----------------------------
Container logs
==============
By default Zoe does not involve itself with the output from container processes. The logs can be retrieved with the usual ``docker logs`` command while a container is alive, and then they are lost forever when the container is deleted. This solution, however, does not scale very well: users need to have access to the Docker command-line tools, and when containers produce lots of output Docker will fill up the disk of whatever Docker host the container is running in.
Using the ``gelf-address`` option of the Zoe Master process, Zoe can configure Docker to send the container outputs to an external destination in GELF format. GELF is the richest format supported by Docker and can be ingested by a number of tools such as Graylog and Logstash. When that option is set all containers created by Zoe will send their output (standard output and standard error) to the destination specified.
Docker is instructed to add all Zoe-defined tags to the GELF messages, so that they can be aggregated by Zoe execution, Zoe user, etc.
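For reference, the effect is equivalent to starting a container by hand with Docker's GELF logging driver (the image name below is a placeholder)::

    docker run --log-driver=gelf --log-opt gelf-address=udp://1.2.3.4:1234 some-image

Zoe applies the equivalent settings automatically to every container it creates when ``gelf-address`` is configured.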
Optional Kafka support
----------------------
Zoe also provides a Zoe Logger process, in case you prefer to use Kafka in your log pipeline. Each container output will be sent to its own topic, which Kafka will retain for seven days by default. With Kafka you can also monitor the container output in real time, for example to debug your container images running in Zoe. In this case GELF is converted to a syslog-like format for easier handling.
The logger process is very small and simple; you can modify it to suit your needs and convert the logs to any format and destination you prefer. It lives in its own repository, here: https://github.com/DistributedSystemsGroup/zoe-logger
@@ -18,6 +18,4 @@ service_time
Time in milliseconds taken to service a request. The tags associated with the request will add more details:
* action: get, post, delete, ...
* object: user, execution, service, ...
* user_id: user identifier of the authenticated user that performed the request
.. _roadmap:
Roadmap
=======
We, the main developers of Zoe, are an academic research team. As such we have limited resources, and through collaborations with other universities and private companies our aim is to do research and advance the state of the art. Our roadmap reflects this and pushes more on large-scale topics than on specific features.
The first priority for Zoe is to mature a stable and modular architecture on which advanced features can be built. Most of the work that is going into version 0.10.x is related to this point.
Scheduler architectures and resource allocation
-----------------------------------------------
In parallel to classic, stable and well-known schedulers (FIFO), we plan to design and implement within Zoe novel approaches to application scheduling and resource allocation. This includes:
* Optimistic, pessimistic, distributed or centralized schedulers
Scheduling policies
-------------------
While the FIFO policy is fine for many settings, it is not the most efficient way of managing work that can be done concurrently. Many decades of scheduling literature point in all sorts of directions, some of which can find new applications in analytic systems:
* Appropriate management of batch vs. interactive vs. streaming analytic applications
* Deadline scheduling for streaming frameworks
* Size-based scheduling for better utilization and smaller response times
Dynamic resource allocation
---------------------------
Users are usually bad guessers of how many resources a particular application will need. We all have a tendency to overestimate resource reservations, to make sure there is some headroom for unplanned spikes. This overestimation causes low utilization and inefficient resource usage: with better reservation and allocation mechanisms that can adapt at runtime, more work could be done with the same resources.
* Dynamically resize running applications in terms of the number of services
* Dynamically resize running applications in terms of the memory and cores allocated to each service
Fault tolerance
---------------
Any modern system must be able to cope with faults and failures of any kind. Zoe is currently built around state-of-the-art mechanisms for fault tolerance, but this does not stop us from further investigating fault-tolerance mechanisms, both for Zoe itself and for the applications it runs.
@@ -3,12 +3,11 @@
The motivation behind Zoe
=========================
The fundamental idea of Zoe is that a user who wants to run data analytics applications should not be bothered by systems details, such as how to configure the amount of RAM a Spark Executor should use, how many cores are available in the system, or even how many worker nodes should be used to meet an execution deadline.
Moreover, final users require a lot of flexibility: they want to test new analytics systems and algorithms as soon as possible, without having to wait for some approval procedure to go through the IT department. Zoe proposes a flexible model for application descriptions: their management can be left entirely to the final user, but they can also be prepared, in whole or in part and very quickly, by an IT department, which sometimes is more knowledgeable about resource limits and environment variables. We also plan to offer a number of building blocks (Zoe Frameworks) that can be composed to make Zoe Applications.
Finally we feel that there is a lack of solutions in the field of private clouds, where resources are not infinite and data layers (data-sets) may be shared between different users. All the current Open Source solutions we are aware of target the public cloud use case and try, more or less, to mimic what Amazon and other big names are doing in their data-centers.
Zoe strives to satisfy the following requirements:
@@ -17,37 +16,30 @@ Zoe strives to satisfy the following requirements:
* short (a few seconds) reaction times to user requests or other system events
* smart queuing and scheduling of applications when resources are critical
Kubernetes, OpenStack Sahara, Mesos and YARN are the projects that, each in its own way, come near Zoe, without solving the needs we have.
Kubernetes (Borg)
-----------------
Kubernetes is a very complex system, both to deploy and to use. It takes some of the architectural principles from Google Borg and targets data centers with vast amounts of resources. We feel that while Kubernetes can certainly run analytic services in containers, it does so at a very high complexity cost for smaller setups. Moreover, in our opinion, certain scheduler choices in how preemption is managed do not apply well to environments with a limited set of users and compute resources, causing a less than optimal resource usage.
OpenStack Sahara
----------------
We know well `OpenStack Sahara <https://wiki.openstack.org/wiki/Sahara>`_, as we have been using it since 2013 and we contributed the Spark plugin. We feel that Sahara has limitations in:
* software support: Sahara plugins support a limited set of data-intensive frameworks, adding a new one means writing a new Sahara plugin and even adding support for a new version requires going through a one-two week (on average) review process.
* lack of scheduling: Sahara makes the assumption that you have infinite resources. When you try to launch a new cluster and there are not enough resources available, the request fails and the user is left doing application and resources scheduling by hand.
* usability: setting up everything that is needed to run an EDP job is cumbersome and error-prone. The user has to provide too many details in too many different places.
Moreover, changes to Sahara need to go through a lengthy review process that on one side tries to ensure high quality, but on the other side slows down development, especially of major architectural changes, like the ones needed to address the concerns listed above.
Mesos
-----
Mesos is marketing itself as a data-center operating system. Zoe has no such high profile objective: while Zoe schedules distributed applications, it has no knowledge of the applications it is scheduling and, even more importantly, does not require any change in the applications themselves to be run in Zoe.
Mesos requires that each application provides two Mesos-specific components: a scheduler and an executor. Zoe has no such requirements and runs applications unmodified.
YARN
----
YARN, from our point of view, has many similarities with Mesos. It requires application support. Moreover it is integrated in the Hadoop distribution and, while recent efforts are pushing toward making YARN stand up on its own, it is currently tailored for Hadoop applications.