State Engine Plugin

Overview

The State Engine plugin tracks and manages the state of Configuration Items (CIs) and other monitored components in your system. It consolidates multiple alarms into aggregated state messages, helping you understand the overall health status of your infrastructure components at a glance.

Key Features

State Aggregation: Combines multiple alarms for the same CI into a single state message
Automatic State Updates: Continuously monitors incoming alarms and updates state accordingly
Delayed Alarm Processing: Prevents alarm flapping by introducing configurable waiting periods
Persistent Storage: Optional MongoDB storage for state information; without it, state is held in memory only
Automatic Cleanup: Expires old alarms to keep state information current
Periodic Resynchronization: Sends complete state updates at configured intervals
Severity Tracking: Determines overall state based on the highest severity alarm present

How It Works

State Tracking

When alarms arrive for your CIs, the State Engine:

Groups alarms by CI or other configured identifiers
Applies delay rules if specified, holding alarms before they affect state
Tracks the highest severity among all active (non-delayed) alarms for each CI
Creates state messages that include all relevant alarms for that CI
Sends state updates when the overall severity changes or when new alarms are promoted
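
The grouping and severity steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugin's implementation: the function name build_states and the output keys "severity" and "alarms" are hypothetical, while group, suppression_key, and severity are the alarm fields this page lists as required. Only alarms that have already been promoted are expected as input.

```python
from collections import defaultdict


def build_states(alarms):
    """Group promoted alarms by their group value and compute each group's overall severity."""
    grouped = defaultdict(dict)
    for alarm in alarms:
        # A later alarm with the same suppression key replaces the earlier one,
        # so the same alarm is never counted twice.
        grouped[alarm["group"]][alarm["suppression_key"]] = alarm
    return {
        group: {
            "severity": max(a["severity"] for a in by_key.values()),
            "alarms": list(by_key.values()),
        }
        for group, by_key in grouped.items()
    }
```

Keying each group's alarms by suppression key is one simple way to get both deduplication and in-place updates from a single dictionary.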

State Messages

State messages provide a consolidated view of all active alarms for a particular CI. Each state message includes:

The current overall severity (determined by the highest severity alarm)
All active alarms associated with that CI
Timing information about when the state was last updated

State Message Format

Message Structure

State messages are JSON-formatted messages that consolidate all active alarms for a specific group (typically a CI or monitored component).

Core Fields

Message Type

Configurable message type (default: "State")
Used to distinguish state messages from regular alarms

Group Identifier

The group field identifies what the state represents (e.g., a specific CI, server, or component)
All alarms with the same group value are consolidated into a single state message

Severity

The overall state severity, determined by the highest severity among all active alarms in that group
Represents the most critical condition affecting the component

Suppression Key

Used to uniquely identify individual alarms within the state
Prevents duplicate alarms from being counted multiple times

Alarm List

The state message includes a collection of all active alarms for that group. Each alarm in the state contains:

Its individual severity level
Suppression key (unique identifier)
Any custom fields from the original alarm message
Timing information

Example Scenario

If a server has three active alarms:

Alarm 1: Disk space warning (severity 2)
Alarm 2: CPU critical (severity 4)
Alarm 3: Memory warning (severity 2)

The resulting state message would:

Have group = "ServerA" (or whatever identifies this server)
Have overall severity = 4 (the highest severity present)
Include all three alarms in its alarm list

When the CPU critical alarm clears (severity 0):

The state message updates to severity = 2 (now the highest)
If "Omit Severity Zero" is enabled, the cleared CPU alarm won't be included
The two warning alarms remain in the alarm list
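
The scenario above can be reproduced with a short Python sketch. The JSON key names "type" and "alarms" and the function state_message are assumptions made for illustration; the exact layout of a real state message may differ. Only group, severity, and the suppression key are named on this page.

```python
import json


def state_message(group, alarms, omit_severity_zero=True, message_type="State"):
    """Build a state message for one group from its current alarms."""
    kept = [a for a in alarms if not (omit_severity_zero and a["severity"] == 0)]
    return {
        "type": message_type,  # hypothetical key name
        "group": group,
        # Overall severity is the highest severity among the listed alarms.
        "severity": max((a["severity"] for a in kept), default=0),
        "alarms": kept,        # hypothetical key name
    }


alarms = [
    {"suppression_key": "disk", "severity": 2},
    {"suppression_key": "cpu", "severity": 4},
    {"suppression_key": "memory", "severity": 2},
]
before = state_message("ServerA", alarms)  # overall severity 4

# The CPU alarm clears: it arrives again with severity 0.
alarms[1] = {"suppression_key": "cpu", "severity": 0}
after = state_message("ServerA", alarms)   # overall severity 2, CPU alarm omitted

print(json.dumps(after, indent=2))
```

Passing omit_severity_zero=False would keep the cleared CPU alarm in the list while the overall severity still drops to 2.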

Delay Feature

Purpose

The delay feature prevents "alarm flapping" by introducing a waiting period before including new alarms in state messages. This is particularly useful for:

Transient conditions that resolve quickly
Preventing alert fatigue from brief spikes
Stabilizing state messages during system fluctuations

How It Works

Delay Field

Each incoming alarm can include a delay field (specified in seconds)
If no delay is specified, the default is 0 (no delay)

Delayed Processing

Alarm Arrives: When an alarm with a delay value arrives, it is received and tracked
Waiting Period: The alarm is held in a pending state for the specified number of seconds
State Impact: During the delay period, the alarm does not affect the current state
Delay Expiration: Once the delay period passes, the alarm is promoted and included in state calculations
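
The four stages above can be modelled with a small buffer that separates pending alarms from active ones. This is a minimal sketch, not the plugin's code: the class DelayBuffer and its method names are hypothetical, timestamps are plain seconds, and the handling of a repeated alarm during its delay (the original deadline is kept) is an assumption, since this page does not specify it.

```python
class DelayBuffer:
    """Holds delayed alarms until their waiting period has passed."""

    def __init__(self):
        self.pending = {}  # (group, suppression_key) -> (promote_at, alarm)
        self.active = {}   # (group, suppression_key) -> alarm

    def receive(self, alarm, now):
        key = (alarm["group"], alarm["suppression_key"])
        if alarm["severity"] == 0:
            # A clear cancels a pending alarm, so it never reaches the state.
            self.pending.pop(key, None)
            if key in self.active:
                self.active[key] = alarm
        elif key in self.active or alarm.get("delay", 0) <= 0:
            # No delay, or the alarm was already promoted: apply immediately.
            self.active[key] = alarm
        else:
            # Assumption: a repeat during the delay keeps the original deadline.
            if key in self.pending:
                promote_at = self.pending[key][0]
            else:
                promote_at = now + alarm["delay"]
            self.pending[key] = (promote_at, alarm)

    def promote(self, now):
        """Move alarms whose delay has expired into the active set; return their keys."""
        due = [k for k, (at, _) in self.pending.items() if at <= now]
        for k in due:
            self.active[k] = self.pending.pop(k)[1]
        return due


# A CPU warning with a 300-second delay that clears after two minutes.
buf = DelayBuffer()
buf.receive({"group": "ServerA", "suppression_key": "cpu", "severity": 2, "delay": 300}, now=0)
buf.receive({"group": "ServerA", "suppression_key": "cpu", "severity": 0}, now=120)
buf.promote(now=300)  # nothing is promoted: the alarm cleared during its delay
```

Only the active set would feed state calculations; the pending set exists solely to wait out the delay.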

Delay Behavior Examples

Example 1: Brief Spike (Alarm Clears During Delay)

Timeline:

T+0: CPU warning arrives with 300 second (5 minute) delay
T+2min: CPU alarm clears (severity 0)
T+5min: Delay expires, but alarm already cleared

Result: The warning never appears in the state message because it cleared before the delay expired.

Example 2: Sustained Condition

Timeline:

T+0: Disk space warning arrives with 180 second (3 minute) delay
T+3min: Delay expires, disk space warning still active
T+3min: Alarm is promoted and included in state

Result: The alarm appears in the state message only after being sustained for 3 minutes.

Example 3: Escalating Severity

Timeline:

T+0: Memory warning (severity 2) arrives with 300 second delay
T+2min: Memory critical (severity 4) arrives for same component
T+5min: First alarm delay expires

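Result: Each alarm follows its own delay timer. The critical alarm is promoted when its own delay (if any) expires, and the warning is promoted at T+5min if it is still active. The state severity reflects the highest severity among the alarms promoted so far.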

Configuration Considerations

Setting Delays

Delays are set on incoming alarm messages, not in the State Engine configuration. The source system or upstream processing must add the delay field to alarm messages.

Typical Delay Values:

60-180 seconds: For transient conditions (network blips, brief CPU spikes)
300-600 seconds: For conditions that should stabilize before alerting
0 seconds: For critical conditions requiring immediate state updates
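
Because the delay must be added before an alarm reaches the State Engine, an upstream step could tag alarms according to a simple policy. The sketch below assumes such a step exists and can run Python; the mapping DELAY_BY_SEVERITY and the function with_delay are hypothetical, and the values only mirror the typical ranges listed above.

```python
# Hypothetical policy: seconds of delay per severity level
# (2 = warning, 4 = critical, following the example scenario on this page).
DELAY_BY_SEVERITY = {2: 180, 4: 0}


def with_delay(alarm, policy=DELAY_BY_SEVERITY):
    """Return a copy of the alarm with a delay field, unless one is already set."""
    tagged = dict(alarm)
    # Severities not covered by the policy get no delay.
    tagged.setdefault("delay", policy.get(alarm["severity"], 0))
    return tagged
```

Leaving an existing delay untouched lets individual sources override the policy for specific alarms.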

Interaction with Other Features

Delay + Expiration

Alarm expiration timers start when the alarm first arrives, not after delay promotion
An alarm with a very long delay could theoretically expire before being promoted

Delay + Resync

Resync operations send the current active state
Alarms still in delay period are not included in resync messages

Delay + Severity Zero

If an alarm clears (severity 0) during its delay period, it never impacts the state
This naturally filters transient alarms without additional configuration

Delay Use Cases

Network Monitoring: Set 120-second delay on connectivity alarms to avoid alerting during brief network fluctuations
Performance Metrics: Use 300-second delay on threshold violations to ensure sustained performance issues before declaring state change
Service Health Checks: Apply delays to prevent state changes from single failed health check polls
Multi-stage Systems: Use different delays for different severity levels (longer delay for warnings, shorter for critical)

Configuration

Storage Settings

Collection Name

Purpose: Specifies the MongoDB collection name for persistent state storage
Default: Empty (in-memory processing)
When to use:
- Leave empty for temporary, in-memory state tracking
- Provide a collection name when you need state to persist across plugin restarts

Message Handling

State Message Type

Purpose: Defines the message type used for outgoing state messages
Default: "State"
When to change: Modify if your downstream systems expect a different message type

Omit Severity Zero Alarms

Purpose: Controls whether cleared alarms (severity 0) are included in state messages
Default: Enabled (severity 0 alarms are excluded)
When to disable: Turn off if you need to see which alarms have been cleared in your state messages

Maintenance Operations

Expire Alarms After (hours)

Purpose: Automatically removes alarms that haven't been updated within the specified time period
Default: 0 (disabled)
Recommended: Set to a value like 24 or 48 hours to prevent stale alarms from persisting
How it works: Runs an hourly cleanup sweep to remove expired alarms and update affected states
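
One pass of the cleanup sweep could look like the following sketch. It is an illustration only: the function sweep_expired and the per-alarm field last_updated (a timestamp in seconds) are assumptions, not documented names.

```python
def sweep_expired(alarms, now, expire_hours):
    """Drop alarms not updated within expire_hours; return (kept, affected_groups)."""
    if expire_hours <= 0:  # 0 means expiration is disabled
        return list(alarms), set()
    cutoff = now - expire_hours * 3600
    kept, affected = [], set()
    for alarm in alarms:
        if alarm["last_updated"] < cutoff:
            affected.add(alarm["group"])  # this group's state must be recomputed
        else:
            kept.append(alarm)
    return kept, affected
```

Returning the affected groups makes it straightforward to recompute and republish only the states that lost an alarm.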

Resync (hours)

Purpose: Sends complete state messages for all CIs at regular intervals
Default: 0 (disabled)
Recommended: Set to 6-24 hours for periodic full state synchronization
Use cases:
- Ensures downstream systems stay synchronized
- Recovers from any missed state updates
- Provides periodic health snapshots

Usage Scenarios

Scenario 1: Basic State Tracking (In-Memory)

Configuration:

Collection Name: (empty)
Omit Severity Zero: Enabled
Expire Alarms: 24 hours
Resync: 0

Best for: Development, testing, or environments where state doesn't need to persist across restarts.

Scenario 2: Production State Management

Configuration:

Collection Name: "ci_states"
Omit Severity Zero: Enabled
Expire Alarms: 48 hours
Resync: 12 hours

Best for: Production environments requiring persistent state tracking and regular synchronization.

Scenario 3: Complete Alarm History

Configuration:

Collection Name: "full_states"
Omit Severity Zero: Disabled
Expire Alarms: 0
Resync: 6 hours

Best for: Audit scenarios where you need visibility into both active and cleared alarms.

Scenario 4: Flap Prevention with Delays

Configuration:

Collection Name: "ci_states"
Omit Severity Zero: Enabled
Expire Alarms: 24 hours
Resync: 12 hours
Incoming alarms configured with appropriate delay values (e.g., 180 seconds for warnings)

Best for: Environments with intermittent conditions or noisy monitoring where you want to suppress transient alarms.

Operational Behavior

Automatic Processing

The plugin operates continuously with several automatic processes:

Immediate Processing: Incoming alarms are processed immediately, triggering state updates when necessary (unless delayed)
Delay Promotion: Every 10 seconds, alarms whose delay periods have expired are promoted and state changes are published
Hourly Cleanup: If alarm expiration is configured, old alarms are removed every hour
Scheduled Resync: If configured, complete state messages are sent at the specified interval
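
The three timed processes above can be pictured as a single scheduler check. This is a rough sketch rather than the plugin's actual scheduler: the function due_tasks, the task names, and the last_run dictionary are hypothetical. The 10-second and hourly intervals come from this page, and timestamps are seconds counted from plugin start.

```python
def due_tasks(now, last_run, expire_hours=0, resync_hours=0):
    """Return the names of periodic tasks that are due, and record them as run."""
    intervals = {"promote_delayed": 10}            # every 10 seconds
    if expire_hours > 0:
        intervals["cleanup"] = 3600                # hourly sweep, only if expiration is on
    if resync_hours > 0:
        intervals["resync"] = resync_hours * 3600  # configured resync interval
    due = [name for name, every in intervals.items()
           if now - last_run.get(name, 0) >= every]
    for name in due:
        last_run[name] = now
    return due
```

With both expiration and resync left at 0, only delay promotion ever becomes due, which matches the defaults described under Maintenance Operations.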

State Updates

State messages are sent when:

A new alarm is promoted (after delay, if applicable) for a CI
An alarm's severity changes
The overall state severity changes (e.g., a critical alarm clears, leaving only warnings)
A resynchronization interval triggers
An alarm expires and affects the overall state
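
Apart from resynchronization, which is time-driven, the triggers above amount to comparing a group's previous state with its current one. The following sketch shows one way to express that comparison; the function needs_update and the state layout (a "severity" value plus an "alarms" list) are assumptions made for illustration.

```python
def needs_update(previous, current):
    """Decide whether a group's state changed enough to publish a new state message."""
    if previous is None:
        return bool(current["alarms"])  # first alarm promoted for this group
    if previous["severity"] != current["severity"]:
        return True                     # overall severity changed

    def by_key(state):
        return {a["suppression_key"]: a["severity"] for a in state["alarms"]}

    # An alarm was added, removed (for example by expiration), or changed severity.
    return by_key(previous) != by_key(current)
```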

Performance Considerations

In-Memory Mode: Fastest processing but state is lost on restart
MongoDB Mode: Persistent state but requires database connectivity
Resync Interval: More frequent resync provides better synchronization but increases message volume
Expire Alarms: Regular cleanup prevents unbounded growth of state data
Delay Processing: Minimal overhead; delayed alarms are efficiently managed in memory

Troubleshooting

States Not Updating

Verify incoming alarms are reaching the plugin
Check that the CI identifier (group field) is consistent across alarms
Ensure alarms have valid suppression keys
Review MongoDB connectivity if using persistent storage
Check if alarms are in delay period and haven't been promoted yet

Alarms Not Appearing in State

Check whether the alarm is still within its delay period and has not yet been promoted
Check if the alarm cleared (severity 0) during its delay period
Ensure "Omit Severity Zero" setting matches your expectations
Verify alarm validation (group and suppression key must be present)

High Message Volume

Consider increasing the resync interval
Enable "Omit Severity Zero" to reduce cleared alarm messages
Verify alarm expiration is configured to clean up old data
Review delay settings on incoming alarms

Missing State Information After Restart

Ensure MongoDB collection name is configured for persistent storage
Verify database connectivity and permissions
Check that the specified collection exists and is accessible

Transient Alarms Causing Noise

Implement appropriate delay values on incoming alarms
Set delays based on expected duration of transient conditions
Review alarm patterns to tune delay values

Best Practices

General Configuration

Use Persistent Storage in Production: Configure a MongoDB collection for production environments
Set Appropriate Expiration: Configure alarm expiration to match your operational needs (typically 24-48 hours)
Enable Periodic Resync: Set resync to 6-12 hours for reliable state synchronization
Monitor State Changes: Set up downstream processing to act on state severity changes
Test Configuration: Start with in-memory mode during development and testing

Delay Strategy

Match Delay to Symptom: Set delays based on how long a condition should persist before it's significant
Consider Downstream Impact: Remember that delayed alarms won't trigger immediate notifications
Critical Alarms: Use zero delay for conditions requiring immediate attention
Test Delay Values: Monitor actual alarm patterns to tune delay values appropriately
Document Delay Strategy: Ensure operations teams understand which alarms have delays and why
Layered Approach: Use different delays for different severity levels (e.g., longer delays for warnings, shorter for critical alarms)

Required Alarm Fields

For alarms to be processed correctly, ensure incoming messages contain:

group: Identifies the CI or component
suppression_key: Uniquely identifies the alarm
severity: Numeric severity level (0 = cleared, higher = more severe)
delay (optional): Delay period in seconds before alarm affects state
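
A pre-flight check in an upstream step can catch alarms that would fail validation. The sketch below is hypothetical (the function validate_alarm is not part of the plugin) and makes two assumptions beyond the list above: severity is treated as a non-negative integer, and delay as a non-negative number.

```python
def validate_alarm(alarm):
    """Return a list of problems; an empty list means the alarm looks processable."""
    problems = []
    for name in ("group", "suppression_key"):
        if not alarm.get(name):
            problems.append(f"missing {name}")
    severity = alarm.get("severity")
    if isinstance(severity, bool) or not isinstance(severity, int) or severity < 0:
        problems.append("severity must be a non-negative integer")
    delay = alarm.get("delay", 0)  # optional; defaults to no delay
    if isinstance(delay, bool) or not isinstance(delay, (int, float)) or delay < 0:
        problems.append("delay must be a non-negative number of seconds")
    return problems
```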

Summary

The State Engine plugin aggregates alarms into per-CI state messages, giving you a clear view of infrastructure health. By consolidating multiple alarms per CI, filtering transient conditions through delays, and offering either in-memory or MongoDB storage, it reduces alert fatigue while ensuring critical conditions are tracked and communicated. Configure the plugin according to your operational needs, apply delay strategies suited to each alarm type, and use periodic resync to keep downstream systems synchronized.
