Commit c58db1f4 authored by Daniele Venzano

Documentation update

parent 11aa3bfe
.. _architecture:
Architecture
============
......@@ -5,10 +7,9 @@ The main Zoe Components are:
* zoe master: the core component that performs application scheduling and talks to Swarm
* zoe observer: listens to events from Swarm and looks for idle resources to free automatically
* zoe-web: the web client interface
* zoe: command-line client
The command line client and the web interface are the user-facing components of Zoe, while the master and the observer are the back ends.
The command line client is the main user-facing component of Zoe, while the master and the observer are the back ends.
The Zoe master is the core component of Zoe and communicates with the clients by using a REST API. It manages users, applications and executions.
Users submit *application descriptions* for execution. Inside the Master, a scheduler keeps track of available resources and execution requests, and applies a
......@@ -26,3 +27,4 @@ These descriptions are strictly linked to the docker images used in the process
Please note that this documentation refers to the full Zoe Application description, which is not yet fully implemented in the actual code.
You can use the ``zoe.py pre-app-list`` and ``zoe.py pre-app-export`` commands to export a JSON-formatted application description to use as a template.
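For illustration only, here is a minimal skeleton matching the top-level fields checked by the client-side validator (``app_validate``, shown further below), written as the Python dictionary the validator receives. The values are made up, the field meanings are inferred from their names, and the real format defines many more fields::

    minimal_app_description = {
        'name': 'example-app',      # illustrative application name
        'version': 1,               # description format version (guessed value)
        'will_end': False,          # presumably: does the execution terminate on its own?
        'priority': 512,            # scheduling priority (guessed value)
        'requires_binary': False,   # presumably: must the user provide a binary?
        # ... plus the services/resources sections defined by the full format
    }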
......@@ -50,7 +50,7 @@ master_doc = 'index'
# General information about the project.
project = 'Zoe'
copyright = '2015, Daniele Venzano'
copyright = '2016, Daniele Venzano'
author = 'Daniele Venzano'
# The version info for the project you're documenting, acts as replacement for
......
......@@ -24,10 +24,10 @@ Executions
:members:
Containers
----------
Services
--------
.. automodule:: zoe_lib.containers
.. automodule:: zoe_lib.services
:members:
......
......@@ -16,18 +16,18 @@ Synchronous API. The Zoe Scheduler is not multi-thread, all requests to the API
Object naming
-------------
Every object in Zoe has a unique name. Zoe uses a dotted notation, with a hierarchical structure, left to right, from specific to generic, like the DNS system.
Every object in Zoe has a unique name. Zoe uses a hierarchical notation that reads left to right, from specific to generic, like the DNS system.
These names are used throughout the API.
A service (one service corresponds to one Docker container) is identified by this name:
<service_name>.<execution_name>.<owner>.<deployment_name>
<service_name>-<execution_name>-<owner>-<deployment_name>
An execution is identified by:
<execution_name>.<owner>.<deployment_name>
<execution_name>-<owner>-<deployment_name>
A user is:
<owner>.<deployment_name>
<owner>-<deployment_name>
And a Zoe instance is:
<deployment_name>
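For illustration, these fully qualified names can be assembled mechanically. The helper below is not part of Zoe's API and the sample values are made up::

    def zoe_full_name(service, execution, owner, deployment):
        """Compose the fully qualified name of a service (one Docker container)."""
        return '-'.join([service, execution, owner, deployment])

    print(zoe_full_name('worker0', 'wordcount', 'alice', 'prod'))
    # worker0-wordcount-alice-prod    <- the service name
    # wordcount-alice-prod            <- the corresponding execution name
    # alice-prod                      <- the user
    # prod                            <- the Zoe instance (deployment name)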
......@@ -40,3 +40,4 @@ Where:
Docker hostnames
^^^^^^^^^^^^^^^^
The names described above are used to generate the names and host names in Docker. User networks are named in the same way. This has advantages when using Swarm commands, because Zoe containers are easy to distinguish, and for monitoring solutions that take data directly from Swarm, since all labels and container names are preserved. With Telegraf, InfluxDB and Grafana it is possible to build Zoe dashboards that show resource utilization per-user or per-execution.
......@@ -40,6 +40,7 @@ Contents:
install
config_file
logging
monitoring
architecture
vision
contributing
......
......@@ -11,12 +11,14 @@ Zoe components:
Zoe is written in Python and uses the ``requirements.txt`` file to list the package dependencies needed for all components of Zoe. Not all of them are needed in all cases, for example you need the ``kazoo`` library only if you use Zookeeper to manage Swarm high availability.
Zoe is a young software project and we foresee it being used in places with wildly different requirements in terms of IT organization (what is below Zoe) and user interaction (what is above Zoe). For this reason we aim to provide a solid core of features and a number of basic external components that can be easily customized. For example, the Spark idle monitoring feature is useful only in certain environments and is implemented as an external service that can be customized or taken as an example to build something different.
Requirements
------------
Zoe is written in Python 3. Development happens on Python 3.4, but we test also for Python 3.5.
Zoe is written in Python 3. Development happens on Python 3.4, but we also test against Python 3.5 on Travis-CI.
* Docker Swarm
To run Zoe you need Docker Swarm and a shared filesystem mounted on all hosts that are part of the Swarm. Internally we use CephFS, but NFS is also a valid solution.
Optional:
......
Container logs
==============
By default Zoe does not involve itself with the output from container processes. The logs can be retrieved with the usual Docker command ``docker logs`` while a container is alive and then they are lost forever.
By default Zoe does not involve itself with the output from container processes. The logs can be retrieved with the usual Docker command ``docker logs`` while a container is alive and then they are lost forever when the container is deleted.
Using the ``gelf-address`` option of the Zoe Master process, Zoe can configure Docker to send container output to an external destination in GELF format. GELF is the richest format supported by Docker and can be ingested by a number of tools such as Graylog and Logstash. When that option is set, all containers created by Zoe send their output (standard output and standard error) to the specified destination.
Docker is instriucted to add all Zoe-defined tags to the GELF messages, so that they can be aggregate by Zoe Application, Zoe user, etc.
Docker is instructed to add all Zoe-defined tags to the GELF messages, so that they can be aggregated by Zoe Execution, Zoe User, etc.
Zoe also provides a Zoe Logger process, in case you prefer to use Kafka in your log pipeline. Each container's output is sent to its own topic, which Kafka retains for seven days by default. With Kafka you can also monitor container output in real time, for example to debug your container images running in Zoe. In this case GELF is converted to a syslog-like format for easier handling.
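To give a rough idea of what Zoe asks Docker to do when ``gelf-address`` is set, the following hypothetical sketch uses the Docker SDK for Python to attach the GELF log driver to a single container; the collector address and label keys are placeholders, not the values Zoe actually uses::

    import docker
    from docker.types import LogConfig

    client = docker.from_env()
    # Forward the container's stdout/stderr in GELF format, together with
    # selected container labels, to an external collector.
    gelf_logging = LogConfig(type=LogConfig.types.GELF, config={
        'gelf-address': 'udp://gelf-collector.example.com:12201',  # placeholder address
        'labels': 'zoe.deployment,zoe.owner',                      # placeholder label keys
    })
    client.containers.run('ubuntu:16.04', 'echo hello', log_config=gelf_logging)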
......
......@@ -3,8 +3,14 @@
Monitoring interface
====================
REST API
--------
Zoe has a built-in metrics generator able to send data to InfluxDB. It is disabled by default and can be enabled with the ``influxdb-*`` options available in the master configuration file. Metrics are generated for a number of internal events, listed below, and can be used to monitor Zoe's performance and liveness.
Please note that Zoe does not involve itself with container metrics: to gather container statistics you need third-party tools able to talk directly to Docker. For example Telegraf, from InfluxData, is able to retain all the labels associated with a container, thus producing very useful per-container metrics.
REST API metrics
----------------
These metrics report the latency measured during all API calls, as seen from the Zoe Master process.
service_time
^^^^^^^^^^^^
......
......@@ -53,6 +53,14 @@ def predefined_app_generate(name):
def app_validate(data):
    """
    Validates an application description, making sure all required fields are present and of the correct type.
    This validation is also performed on the Zoe Master side.
    If the description is not valid, an InvalidApplicationDescription exception is thrown.
    :param data: a dictionary containing an application description
    :return: None if the application description is correct
    """
    required_keys = ['name', 'will_end', 'priority', 'requires_binary', 'version']
    for k in required_keys:
        if k not in data:
......
......@@ -30,18 +30,29 @@ class InfluxDBMetricSender(threading.Thread):
        self._deployment = conf.deployment_name
        self._influxdb_endpoint = conf.influxdb_url + '/write?precision=ms&db=' + conf.influxdb_dbname
        self._queue = queue.Queue()
        self.retries = 5

    def _send_buffer(self):
        error = False
        if self._influxdb_endpoint is not None and len(self._buffer) > 0:
            payload = '\n'.join(self._buffer)
            try:
                r = requests.post(self._influxdb_endpoint, data=payload)
            except:
                log.exception('error writing metrics to influxdb, data thrown away')
                log.exception('error writing metrics to influxdb, will retry {} times'.format(self.retries))
                error = True
            else:
                if r.status_code != 204:
                    log.error('error writing metrics to influxdb, data thrown away')
            self._buffer.clear()
                    log.error('error writing metrics to influxdb, will retry {} times'.format(self.retries))
                    error = True
        if error:
            if self.retries <= 0:
                self.retries = 5
                self._buffer.clear()
            else:
                self.retries -= 1
        else:
            self._buffer.clear()

    def quit(self):
        self._queue.put('quit')
......
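For reference, the sender above accumulates InfluxDB line-protocol strings in a buffer and POSTs them to the ``/write`` endpoint; InfluxDB answers 204 on success. A stripped-down sketch of that path follows, with an illustrative database name, measurement and tags that are not necessarily the ones Zoe emits::

    import time
    import requests

    influxdb_url = 'http://localhost:8086'                    # placeholder
    endpoint = influxdb_url + '/write?precision=ms&db=zoe'    # db name is a placeholder

    # One point in InfluxDB line protocol: measurement,tags fields timestamp (ms)
    point = 'service_time,deployment=prod,api_call=execution_start value=42 {}'.format(
        int(time.time() * 1000))

    r = requests.post(endpoint, data=point)
    if r.status_code != 204:
        print('metric write failed with status', r.status_code)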
......@@ -36,7 +36,7 @@ class ZoeServiceAPI(ZoeAPIBase):
        :return:
        :type container_id: int
        :rtype dict
        :rtype: dict
        """
        c, status_code = self._rest_get('/service/' + str(container_id))
        if status_code == 200:
......
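The wrapper above performs an HTTP GET against the master's REST API and, per its docstring, returns a dictionary describing the service. A bare-bones equivalent with ``requests``, ignoring authentication and assuming a placeholder base URL, would be::

    import requests

    master_api = 'http://zoe-master.example.com:8080'   # placeholder base URL
    r = requests.get(master_api + '/service/42')        # 42: an illustrative service ID
    if r.status_code == 200:
        service = r.json()                              # the service description as a dict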
......@@ -58,7 +58,7 @@ class ZoeUserAPI(ZoeAPIBase):
        :return: the user dictionary, or None
        :type user_name: str
        :rtype dict|None
        :rtype: dict|None
        """
        user, status_code = self._rest_get('/user/' + user_name)
        if status_code == 200:
......