# Design Draft
NOTE: all structs listed on this page are draft pseudo-code and may differ from the real implementation.
There are two major components in YAMF: the Task Scheduler and the Task Executor.

The Task Scheduler parses rules and generates `Task`s at each rule's configured interval. `Task`s are published to an MQ and consumed by Task Executors; there can be many Task Executor instances. A Task Executor receives a `Task` from the MQ, executes it (usually by performing a graphite query and comparing the result with the configured threshold), and emits `Event`s based on the execution results.

There is also a Direct Event Receiver, which listens for and accepts external events (e.g. snmptrap messages), converts them into the unified `Event` format, and emits them the same way the Task Executor does.
As dependencies, an MQ and an SQL server are needed. The SQL server stores the rules.
A `Task` defines what to do.

```go
type Task struct {
	// type of the task; currently only "graphite" is planned
	// * graphite: performs a graphite render query and checks the response data
	Type string
	// if ``Type`` is "graphite", the details are attached as ``GraphiteTask``
	GraphiteTask GraphiteTask
	// when an executor fetches a task, it should check whether the current time is
	// beyond ``Expiration``. If the task is expired, there is no point in executing
	// it anymore (another task has probably been scheduled already), and the
	// executor should skip it.
	Expiration time.Time
	// the maximum time an executor may spend executing this task. The task may get
	// stuck querying the graphite api; the executor should not wait forever, and
	// must abort the task on timeout.
	Timeout time.Duration
	// Metadata passed from ``Rule`` and further passed on to ``Event``
	Metadata map[string]string
	// ID of the ``Rule`` which generated this Task. It is unused during execution,
	// but is passed on in ``Event``, where consumers may find it useful.
	RuleID int
}
```
```go
type GraphiteTask struct {
	// passed as "target=$Query" in the api request
	Query string
	// passed as "from=$From&until=$Until" in the api request
	From  string
	Until string
	// pattern used to extract metadata from the graphite api response.
	// ``MetaPattern`` is a regular expression with named capture groups; a ``target``
	// is ignored if it does not match ``MetaPattern``.
	// if ``MetaPattern`` is empty, no metadata is extracted and all returned
	// ``target``s are checked.
	// Extracted metadata is merged with ``Rule.Metadata`` and passed on to the
	// emitted ``Event``.
	// Example:
	//   pattern: "^(?P<resource_type>[^.]+)\.(?P<host>[^.]+)\.[^.]+\.user$"
	//   * "pm.server1.cpu.user" matches, with: resource_type=pm, host=server1
	//   * "pm.server2.cpu.user" matches, with: resource_type=pm, host=server2
	//   * "pm.server3.cpu.idle" does not match, and is ignored (no further threshold checking)
	MetaPattern string
	// warning and critical thresholds; each must be in one of the following forms:
	// "> 1.0", ">= 1.0", "== 1.0", "<=1.0", "<1.0", "== nil", "!= nil"
	// the last value of a series is used as the left operand.
	// Edge cases:
	// * if the last value is nil but the expression is not nil-related, Unknown is returned
	// Evaluation order:
	// * evaluate the critical expression; if it is not satisfied,
	// * evaluate the warning expression
	CriticalExpr string
	WarningExpr  string
	// --------
	// The following options are not to be implemented in the first release
	// --------
	// graphite api url, allowing different graphite instances to be queried
	EndpointUrl string
}
```
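The `MetaPattern` extraction can be sketched with Go's `regexp` named capture groups, using the example pattern from the comment above (`extractMeta` is a hypothetical helper name):

```go
package main

import (
	"fmt"
	"regexp"
)

// extractMeta applies a MetaPattern to a graphite target name.
// It returns the named capture groups as metadata, or ok=false if
// the target does not match (and should be skipped).
func extractMeta(metaPattern, target string) (map[string]string, bool) {
	re := regexp.MustCompile(metaPattern)
	m := re.FindStringSubmatch(target)
	if m == nil {
		return nil, false
	}
	meta := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if i > 0 && name != "" {
			meta[name] = m[i]
		}
	}
	return meta, true
}

func main() {
	pattern := `^(?P<resource_type>[^.]+)\.(?P<host>[^.]+)\.[^.]+\.user$`
	fmt.Println(extractMeta(pattern, "pm.server1.cpu.user")) // map[host:server1 resource_type:pm] true
	fmt.Println(extractMeta(pattern, "pm.server3.cpu.idle")) // map[] false
}
```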
// Later, we could implement other types of tasks, like `ElasticsearchTask`, `OpenTSDBTask`, etc.
A `Rule` contains a `Task` and defines how to schedule it.

```go
type Rule struct {
	// the following fields are copied to the generated ``Task`` as is
	Type         string
	GraphiteTask GraphiteTask
	// the interval at which tasks are generated
	Interval time.Duration
	// transposed to the ``Expiration`` field of the generated ``Task``
	Timeout time.Duration
	// a Rule can be paused so that no tasks are generated
	Paused bool
	// Metadata to be passed on to ``Task`` and further to ``Event``
	Metadata map[string]string
}
```
A `Task` could yield multiple `Event`s, if the corresponding graphite query returns multiple `target`s.

```go
type Event struct {
	// level of the event; one of: ok, unknown, warning, critical
	Level string
	// Metadata specified in ``Rule`` and produced during ``Task`` execution
	Metadata map[string]string
}
```
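Putting the threshold expressions and the `Event` levels together, the evaluation might look like the sketch below. The parsing is deliberately simplified, `nil` is represented by a nil `*float64`, and the function names `evalExpr`/`level` are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evalExpr evaluates one threshold expression ("> 1.0", "<=1.0", "== nil", ...)
// against the last value of a series (nil when the value is absent).
// ok=false signals the Unknown edge case (nil value, non-nil expression)
// or an unparseable expression.
func evalExpr(last *float64, expr string) (satisfied, ok bool) {
	expr = strings.TrimSpace(expr)
	var op string
	for _, o := range []string{">=", "<=", "==", "!=", ">", "<"} {
		if strings.HasPrefix(expr, o) {
			op = o
			break
		}
	}
	operand := strings.TrimSpace(strings.TrimPrefix(expr, op))
	if operand == "nil" {
		switch op {
		case "==":
			return last == nil, true
		case "!=":
			return last != nil, true
		}
		return false, false
	}
	if last == nil {
		return false, false // Unknown: nil value, non-nil expression
	}
	threshold, err := strconv.ParseFloat(operand, 64)
	if err != nil {
		return false, false
	}
	v := *last
	switch op {
	case ">":
		return v > threshold, true
	case ">=":
		return v >= threshold, true
	case "<":
		return v < threshold, true
	case "<=":
		return v <= threshold, true
	case "==":
		return v == threshold, true
	case "!=":
		return v != threshold, true
	}
	return false, false
}

// level applies the documented order: critical first, then warning.
func level(last *float64, criticalExpr, warningExpr string) string {
	if sat, ok := evalExpr(last, criticalExpr); !ok {
		return "unknown"
	} else if sat {
		return "critical"
	}
	if sat, ok := evalExpr(last, warningExpr); !ok {
		return "unknown"
	} else if sat {
		return "warning"
	}
	return "ok"
}

func main() {
	v := 85.0
	fmt.Println(level(&v, ">80", ">60")) // critical
	v = 70.0
	fmt.Println(level(&v, ">80", ">60")) // warning
	fmt.Println(level(nil, ">80", ">60")) // unknown
}
```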
Metadata is a freely composed dictionary; it is mainly useful during event processing (which is currently out of YAMF's scope). However, the following fields are expected to be present in specific situations.

```
{
  // always specify it while creating `Rule`
  "description": "description or title of this check rule",
  // if Task type is graphite, this is the computed current value
  "current_value": "0.1",
  // and the compare expressions
  "critical_expr": ">80",
  "warning_expr": ">60"
}
```
The Task Scheduler exposes an http api for manipulating rules:

- `GET /rules/` - list all rules
- `POST /rules/` - create a rule
- `GET /rules/(:id)/` - get a rule
- `PUT /rules/(:id)/` - update a rule
- `DELETE /rules/(:id)/` - delete a rule
The Task Executor does not expose any interface. It can support multiple methods of emitting events:

- LogFile: emit events to a log file
- HTTP Post: emit events to an http endpoint
- MQ: publish events to an MQ

The emitting methods and their corresponding configuration are specified in the configuration file.
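As an illustration only, such a configuration file might look like the fragment below; the key names and file format are hypothetical, and only the three emitter kinds come from the design.

```yaml
# sketch of an executor config; all key names are assumptions
emitters:
  - kind: logfile
    path: /var/log/yamf/events.log
  - kind: http_post
    url: http://alert-gateway.example/events
  - kind: mq
    exchange: yamf.events
```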
The Direct Event Receiver is supposed to accept events pushed from outside and normalize them into the unified `Event` struct. It can have multiple listeners, e.g. snmptrap, http, etc.
Because we have to get this framework working in a very short time, and team members have different language tastes, we may mix different languages across components in the first release, and later, at an appropriate time, rewrite all components in golang.
- Task Scheduler: implemented in golang
- Task Executor: first implemented in Java
- Direct Event Receiver: first implemented in Java
- MQ: after discussion, we choose RabbitMQ for the first release; we may switch to another MQ or support multiple MQs later
- DB: we choose postgresql
All code will live in this single project, with each component in its own directory. Later, when we rewrite everything in go, the structure may be reorganized, since components share the same data structs.
```
/--
|- scheduler/      -- task scheduler
|- executor/       -- task executor
|- directreceiver/ -- direct event receiver
|- docs/
```