
Design Draft

Stephen Zhang edited this page May 24, 2017 · 3 revisions

NOTE: all structs listed on this page are draft pseudo-code and may differ from the real implementation.

Components

There are two major components in YAMF, Task Scheduler and Task Executor.

Task Scheduler parses rules and generates Tasks at each rule's configured interval. Tasks are published to an MQ and consumed by Task Executors. There can be many Task Executor instances. A Task Executor receives a Task from the MQ, executes it (usually by performing a graphite query and comparing the result with the configured threshold), and emits Events based on the execution results.

There is also a Direct Event Receiver, which listens for and accepts external events (e.g. snmptrap messages), converts them into the unified Event format, and emits them the same way Task Executor does.

As dependencies, an MQ and an SQL server are needed. The SQL server is used for storing rules.

Glossary

Task

A Task defines what to do.

type Task struct {
    // type of the task, currently only "graphite" is planned
    //   * graphite: performs a graphite render query and check the response data
    Type string

    // if ``Type`` is graphite, then the details is attached as ``GraphiteTask``
    GraphiteTask GraphiteTask

    // when an executor fetches a task, it should check whether the current time is beyond ``Expiration``.
    // An expired task is pointless to execute (another task has probably been scheduled already),
    // so the executor should skip it.
    Expiration time.Time

    // the max time an executor should spend executing this task. The task may get stuck querying
    // the graphite api; the executor should not wait forever, and should abort the task on timeout.
    Timeout time.Duration

    // Metadata passed from ``Rule`` and further passed on to ``Event``
    Metadata map[string]string

    // RuleID of the Rule which generated this Task. It is not used during execution, but is
    // passed on in ``Event``, and ``Event`` consumers may find it useful.
    RuleID int
}

type GraphiteTask struct {
    // passed as "target=$Query" in api request
    Query string

    // passed as "from=$From&until=$Until" in api request
    From  string
    Until string

    // pattern used to extract metadata from the graphite api response.
    // ``MetaPattern`` is a regular expression with named capture groups; a ``target`` is ignored if it does not match MetaPattern.
    // if MetaPattern is empty, no metadata is extracted and all returned ``target``s are checked.
    // Extracted metadata is merged with Rule.Metadata and passed on to the emitted Event.
    // Example:
    //   pattern: "^(?P<resource_type>[^.]+)\.(?P<host>[^.]+)\.[^.]+\.user$"
    //   * "pm.server1.cpu.user" matches, with: resource_type=pm, host=server1
    //   * "pm.server2.cpu.user" matches, with: resource_type=pm, host=server2
    //   * "pm.server3.cpu.idle" does not match and is ignored (no further threshold checking)
    MetaPattern string

    // thresholds for warning and critical, must be in one of the following forms:
    //   "> 1.0", ">= 1.0", "== 1.0", "<=1.0", "<1.0", "== nil", "!= nil"
    // the last value of a series is used as the left operand.
    // Edge cases:
    // * if the last value is nil but the expression is not nil-related, Unknown is returned
    // Evaluation order:
    // * evaluate the critical expression; move on if it is not satisfied
    // * evaluate the warning expression
    CriticalExpr string
    WarningExpr  string

    // --------
    //   The following options are not to be implemented for first release
    // --------

    // specifies the graphite api url, so different graphite instances can be queried.
    EndpointUrl string
}

// Later, we could implement other types of tasks, like `ElasticsearchTask`, `OpenTSDBTask`, etc.

Rule

A Rule contains a Task, and defines how to schedule the task.

type Rule struct {
    // the following fields are copied to the generated ``Task`` as-is.
    Type         string
    GraphiteTask GraphiteTask

    // at which interval tasks should be generated
    Interval time.Duration

    // it is transposed to the ``Expiration`` field in ``Task``.
    Timeout time.Duration

    // a Rule can be paused, so that no tasks are generated
    Paused bool

    // Metadata to be passed on to ``Task`` and further ``Event``
    Metadata map[string]string
}

Event

A Task could yield multiple Events, if the corresponding graphite query returns multiple targets.

type Event struct {
    // level of the event, could be one of: ok, unknown, warning, critical
    Level string

    // Metadata specified in `Rule` and produced during `Task` execution.
    Metadata map[string]string
}

Metadata

Metadata is a freely composed dictionary. It is mainly useful during event processing (which is currently out of YAMF's scope). However, the following fields are expected to be present in specific situations.

{
    // always specify it while creating `Rule`
    "description": "description or title of this check rule",

    // if Task type is graphite, this is the computed current value
    "current_value": "0.1", 
    // and the compare expressions
    "critical_expr": ">80",
    "warning_expr": ">60",
}

Interfaces

Task scheduler

Task scheduler exposes an http api for manipulating rules:

  • GET /rules/ list all rules
  • POST /rules/ create a rule
  • GET /rules/(:id)/ get a rule
  • PUT /rules/(:id)/ update a rule
  • DELETE /rules/(:id)/ delete a rule

Task executor

Task executor does not expose any interface. It can support multiple methods of emitting events.

  • LogFile: emit events to log file
  • HTTP Post: emit events to an http endpoint
  • MQ: publish events to a MQ

The emitting methods and their corresponding configuration are specified in the configuration file.

Direct Event Receiver

This component accepts externally pushed events and normalizes them to the unified Event struct. It can have multiple listeners, e.g. snmptrap, http.

Implementation Language

Because we have to get this framework working in a very short time, and team members have different language tastes, we may mix different languages across components in the first release, and later, at an appropriate time, rewrite all components in golang.

  • Task Scheduler would be implemented in golang

  • Task Executor would first be implemented in Java

  • Direct Event Receiver would first be implemented in Java

  • MQ: after discussion, we choose RabbitMQ for the first release. We may switch to other MQ or support multiple MQ later.

  • DB: we choose postgresql.

All code will be put in this single project, with each component in its own directory. Later, when we rewrite everything in go, the structure may be reorganized, since components share the same data structs.

/--
  |- scheduler/      -- task scheduler
  |- executor/       -- task executor
  |- directreceiver/ -- direct event receiver
  |- docs/