[OPNFV] Architecture Proposal of Doctor Project

Overview

The Doctor project is a framework for fault management and maintenance. The biggest problems at present are the gaps in OpenStack, which was chosen as the VIM in Doctor, and the process of analyzing and choosing monitoring tools that fit Doctor's requirements.

Doctor Functionalities:

Manages the failures of virtualized resources.
Plans the maintenance actions based on the above failures.

These failures may affect a VNF/application, and the Customer should react to them as soon as possible, e.g. by switching to the STBY mode. Even for an application supported by HA, the impact can be serious; how much the network service is affected by the failures depends on how the service is implemented.

OpenStack is the target upstream project for the VIM, where the new functional elements (Controller, Notifier, Monitor and Inspector) are expected to be implemented. Some of these elements may sit outside OpenStack and offer a northbound interface to OpenStack.

General Features in Doctor:

Monitor: Monitors the physical and virtual resources.
Detection: Detects the unavailability of physical resources.
Correlation and Cognition: Correlates faults and identifies the affected virtual resources.
Notification: Notifies the Customers of unavailable virtual resources.
Fencing: Shuts down or isolates the failed virtual resources (VMs, virtual networks, virtual storage).
Recovery: Executes actions for fault recovery and maintenance.

These features are processed in sequence from top to bottom. The time interval from step 1 (Monitor) to step 4 (Notification) is less than 1 s.
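The sequence above, and the one-second budget for steps 1 to 4, can be sketched as a simple ordered pipeline. This is only an illustration: the stage functions, the context dictionary, the sample data and the name NOTIFY_BUDGET_SECONDS are all hypothetical and not part of Doctor.

```python
import time

# Hypothetical stage functions; each one reads and extends a shared context dict.
def monitor(ctx):
    ctx["samples"] = {"host-1": "unreachable", "host-2": "ok"}
    return ctx

def detect(ctx):
    ctx["failed_hosts"] = [h for h, s in ctx["samples"].items() if s != "ok"]
    return ctx

def correlate(ctx):
    # Illustrative resource map: physical host -> virtual machines on it.
    resource_map = {"host-1": ["vm-a", "vm-b"], "host-2": ["vm-c"]}
    ctx["affected_vms"] = [
        vm for host in ctx["failed_hosts"] for vm in resource_map[host]
    ]
    return ctx

def notify(ctx):
    ctx["notified"] = list(ctx["affected_vms"])
    return ctx

def fence(ctx):
    ctx["fenced"] = list(ctx["affected_vms"])
    return ctx

def recover(ctx):
    ctx["recovered"] = True
    return ctx

# Order matches the feature list: steps 1 to 6.
STAGES = [monitor, detect, correlate, notify, fence, recover]
NOTIFY_BUDGET_SECONDS = 1.0  # budget for steps 1 to 4, taken from the text

def run_pipeline():
    ctx = {}
    start = time.monotonic()
    for stage in STAGES:
        ctx = stage(ctx)
        if stage is notify:
            # Elapsed time from the start of step 1 to the end of step 4.
            ctx["time_to_notify"] = time.monotonic() - start
    ctx["within_budget"] = ctx["time_to_notify"] < NOTIFY_BUDGET_SECONDS
    return ctx

result = run_pipeline()
```

Measuring with a monotonic clock keeps the budget check independent of wall-clock adjustments; in a real deployment the stages would run in separate services rather than in one process.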

VIM Interfaces

VIM South-bound interface

There is no south-bound interface that defines a generic format for the data collected from monitoring solutions. Each monitoring solution produces a different data format, especially 3rd party solutions such as Zabbix, Nagios, Munin, etc.

In our current vanilla OpenStack (as a VIM), there is no interface that communicates with 3rd party monitoring solutions.
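To make the gap concrete, the following sketch shows what a generic south-bound data model could look like. Everything here is an assumption made for illustration: the FaultEvent fields, the two source names "tool_a" and "tool_b", and both payload layouts are invented and do not reproduce the real output of Zabbix, Nagios or any other tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical generic format for events coming from any monitor."""
    source: str       # which monitoring solution reported the event
    hostname: str     # physical host the event refers to
    status: str       # normalized to "up" or "down"
    timestamp: float  # seconds since the epoch

# The payload layouts below are invented for this example only.
def from_tool_a(payload):
    status = "down" if payload["severity"] >= 4 else "up"
    return FaultEvent("tool_a", payload["host"], status, float(payload["clock"]))

def from_tool_b(payload):
    status = "down" if payload["state"] == "CRITICAL" else "up"
    return FaultEvent(
        "tool_b", payload["host_name"], status, float(payload["last_check"])
    )

# One converter per monitoring solution, keyed by source name.
CONVERTERS = {"tool_a": from_tool_a, "tool_b": from_tool_b}

def normalize(source, payload):
    """Convert a tool-specific payload into the generic FaultEvent format."""
    return CONVERTERS[source](payload)
```

With such a model, supporting a new monitoring solution would only require one more converter, while everything north of the interface keeps consuming the same event type.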

VIM North-bound Interface

The VIM has to notify the user about the unavailability of resources triggered by the NFVI. The VIM accepts the message from the admin and marks the affected resources. The problem (gap) is that the VIM user cannot get the maintenance notification.

Doctor Design Proposal for Fault Management

General design:

Use SQLAlchemy as the ORM and MySQL/PostgreSQL as the database of each component.
Each service will run as a threaded daemon. To achieve this, it should use cooperative (asynchronous) threading with greenthreads, together with RPC for inter-process communication.
The internal API between the services will be an RPC API; the external API should be a REST API.

Cooperation with OpenStack components:

With this option, Doctor needs to follow the OpenStack architecture so that the other components can interact with each other internally.

Monitor:

Description:

To-be: A module that monitors the resources of NFVI.

Implementation: The implementation varies with the chosen solution (OpenStack Ceilometer, or 3rd party monitoring solutions such as Zabbix and Nagios).

Challenges:

Each tool has advantages and disadvantages (e.g. Ceilometer is not recommended for medium and large deployments), so the choice of monitoring solution should depend on the deployment use case. Thus, it is necessary to support multiple monitoring solutions.
The current OpenStack monitoring solution is Ceilometer, but it has many problems; two of them are high resource consumption and the lack of support for streamed data. In the future, Ceilometer may mitigate these problems with a TSDB such as InfluxDB.
Ideally, the chosen tool has a Python plugin driver that the Inspector can query, since the Doctor source code will be written in Python.
The output format should be pre-defined and readable by the Inspector.
Monasca supports a streaming alarm engine that handles hundreds of thousands of metrics per second. However, handling the alarm dependencies between failures is complex.

Candidates:

Ceilometer, Zabbix, Monasca

Controller:

Description:

To-be: A module that maintains the information database of physical and virtual resources.
Implementation:

It can be the OpenStack services (Nova, Neutron, Cinder). In this case, the Resource Map will be the databases of those OpenStack services.
All actions to find and update the state of affected virtual resources will be executed through the OpenStack APIs. These actions are triggered by the Inspector.
It should maintain its own database that stores the state of resources. This database can be designed in the same way as the OpenStack databases, using SQLAlchemy as the ORM and supporting access through an API.
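A Controller-side state store with a small access API might look like the sketch below. To stay dependency-free it uses the standard-library sqlite3 module with an in-memory database as a stand-in for SQLAlchemy on MySQL/PostgreSQL; the table layout, class name and method names are hypothetical.

```python
import sqlite3

class ResourceStateStore:
    """Hypothetical store for resource state (sqlite3 stands in for an ORM)."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE resources ("
            "id TEXT PRIMARY KEY, host TEXT NOT NULL, state TEXT NOT NULL)"
        )

    def add(self, resource_id, host, state="active"):
        # The connection context manager commits on success.
        with self._db:
            self._db.execute(
                "INSERT INTO resources VALUES (?, ?, ?)",
                (resource_id, host, state),
            )

    def set_state(self, resource_id, state):
        """Update one resource; return the number of rows changed."""
        with self._db:
            cursor = self._db.execute(
                "UPDATE resources SET state = ? WHERE id = ?",
                (state, resource_id),
            )
        return cursor.rowcount

    def on_host(self, host):
        """Return (id, state) tuples for all resources on a physical host."""
        rows = self._db.execute(
            "SELECT id, state FROM resources WHERE host = ? ORDER BY id",
            (host,),
        )
        return rows.fetchall()
```

The Inspector would call methods such as on_host and set_state through the Controller API rather than touching the database directly.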

Challenges:

There is a Nova spec that marks a VM as down when Nova detects a crashed compute host/OS.
In addition, there is a spec for forcing services down and another for updating the server state immediately in Nova, in order to support manual orders from Consumers (see also Neutron, Cinder).

Candidates:

Nova, Neutron, Cinder

Inspector:

Description:

To-be:

Able to receive various failure notifications about physical resources from monitors.
Able to update the affected VMs by querying the Resource Map in the Controller.

Implementation:

It is able to understand the various types of failure notifications produced by monitoring solutions such as 3rd party solutions, Ceilometer, etc. For a 3rd party solution, it can communicate through the Python module designed for that solution (e.g. PyZabbix for Zabbix). For an OpenStack monitoring solution such as Ceilometer, we can query the Ceilometer client to get results. More generally, we can create an abstraction layer between the Inspector and the monitoring solutions. The specific monitoring drivers can be listed in the configuration file. The abstraction layer will talk to the APIs of the monitoring solutions, and the Inspector will talk only to this abstraction layer.

It is able to query the OpenStack service APIs to find the affected VMs.

Updating the state of virtual/physical resources in the Resource Map should amount to updating the DBs of OpenStack services such as Nova, Neutron, Cinder, etc.

It can send the alarms/alerts directly to the Notifier by calling its API.
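The abstraction layer between the Inspector and the monitoring solutions, with drivers selected in a configuration file, could be sketched as follows. The driver classes return canned data instead of calling PyZabbix or the Ceilometer client, and the section name "inspector", the option name "monitor_drivers" and all class names are assumptions made for this example.

```python
import configparser
from abc import ABC, abstractmethod

class MonitorDriver(ABC):
    """Hypothetical interface every monitoring driver has to implement."""

    @abstractmethod
    def get_failed_hosts(self):
        """Return a list of host names currently reported as failed."""

class FakeZabbixDriver(MonitorDriver):
    # A real driver would call the Zabbix API here; this returns canned data.
    def get_failed_hosts(self):
        return ["compute-1"]

class FakeCeilometerDriver(MonitorDriver):
    # A real driver would query the Ceilometer client here.
    def get_failed_hosts(self):
        return ["compute-2"]

# Registry that maps configuration names to driver classes.
DRIVERS = {"zabbix": FakeZabbixDriver, "ceilometer": FakeCeilometerDriver}

def load_drivers(config_text):
    """Instantiate the drivers listed in the (hypothetical) config option."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    names = parser.get("inspector", "monitor_drivers").split(",")
    return [DRIVERS[name.strip()]() for name in names if name.strip()]

class Inspector:
    """Sees only the MonitorDriver interface, never a tool-specific API."""

    def __init__(self, drivers):
        self.drivers = drivers

    def failed_hosts(self):
        hosts = set()
        for driver in self.drivers:
            hosts.update(driver.get_failed_hosts())
        return sorted(hosts)

EXAMPLE_CONFIG = """
[inspector]
monitor_drivers = zabbix, ceilometer
"""

inspector = Inspector(load_drivers(EXAMPLE_CONFIG))
```

Adding another monitoring solution then means writing one more driver class and registering its name, without changing the Inspector.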

Candidates:

Monasca

A central engine that collects data from the Monitor, maps the physical host to the affected VMs in the Controller DB, and triggers the VM state updates on the Controller.
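The core step of such a central engine, mapping a failed physical host to its VMs and triggering the state update on the Controller, can be sketched with an in-memory stand-in. InMemoryController, handle_host_failure and the state names "active" and "error" are hypothetical; a real Controller would be reached through the OpenStack APIs.

```python
class InMemoryController:
    """Stand-in for the Controller and its Resource Map."""

    def __init__(self, resource_map):
        # resource_map: physical host name -> list of VM ids on that host
        self.resource_map = resource_map
        self.vm_state = {
            vm: "active" for vms in resource_map.values() for vm in vms
        }

    def vms_on_host(self, host):
        return list(self.resource_map.get(host, []))

    def set_vm_state(self, vm, state):
        self.vm_state[vm] = state

def handle_host_failure(controller, host):
    """Look up the VMs on a failed host and mark each of them as failed."""
    affected = controller.vms_on_host(host)
    for vm in affected:
        controller.set_vm_state(vm, "error")
    return affected

controller = InMemoryController(
    {"compute-1": ["vm-a", "vm-b"], "compute-2": ["vm-c"]}
)
affected = handle_host_failure(controller, "compute-1")
```

The returned list of affected VMs is what the engine would hand to the Notifier in the next step.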

Notifier:

Description:

To-be:

It should receive the failure notifications from the Controller/Inspector.
It should send the notifications (alarms, alerts) to the Consumers.

Implementation:

One candidate for this module is Ceilometer. It is able to receive the notifications (alarms, alerts) that are sent by the Controller/Inspector. This means that each time the Controller/Inspector wants to send notifications, it triggers a Notifier module, the so-called NotificationSender, to send them to the Notifier's central notification collector.
It needs an API through which Consumers can query the information. Ideally, this API is a REST API. Alternatively, it may provide a CLI for Consumers.
The Notifier should support south-bound interfaces (e.g. REST API, client interface) that other services can query, and a north-bound API (REST API) through which Consumers can query and set the alarm policies.

Challenges:

The format/definition of the notifications that the Notifier sends to the Customer needs to be pre-defined.

Candidates:

Ceilometer
A NotificationEngine that includes a NotificationReceiver to get the notifications from the Controller/Inspector and a NotificationSender that sends the notifications to the Consumer.
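A minimal sketch of such a NotificationEngine is given below. The class names come from the text, but every method name, the queue-based hand-off and the callback-style Consumer subscription are assumptions made for illustration, not a defined Doctor or Ceilometer interface.

```python
import queue

class NotificationReceiver:
    """Collects notifications pushed by the Controller/Inspector."""

    def __init__(self):
        self._queue = queue.Queue()

    def receive(self, notification):
        self._queue.put(notification)

    def drain(self):
        """Return all pending notifications in arrival order."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items

class NotificationSender:
    """Delivers notifications to subscribed Consumers (plain callbacks here)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def send(self, notification):
        for callback in self._subscribers:
            callback(notification)

class NotificationEngine:
    def __init__(self):
        self.receiver = NotificationReceiver()
        self.sender = NotificationSender()

    def dispatch(self):
        """Forward everything received so far; return how many were sent."""
        pending = self.receiver.drain()
        for notification in pending:
            self.sender.send(notification)
        return len(pending)

engine = NotificationEngine()
delivered = []
engine.sender.subscribe(delivered.append)
engine.receiver.receive({"vm": "vm-a", "state": "error"})
engine.receiver.receive({"vm": "vm-b", "state": "error"})
sent_count = engine.dispatch()
```

Separating the receiver from the sender lets the south-bound side accept notifications quickly while delivery to Consumers happens in its own step.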

9/3/2016

VietStack team
