State Engine Plugin
Overview
The State Engine plugin tracks and manages the state of Configuration Items (CIs) and other monitored components in your system. It consolidates multiple alarms into aggregated state messages, helping you understand the overall health status of your infrastructure components at a glance.
Key Features
- State Aggregation: Combines multiple alarms for the same CI into a single state message
- Automatic State Updates: Continuously monitors incoming alarms and updates state accordingly
- Delayed Alarm Processing: Prevents alarm flapping by introducing configurable waiting periods
- Persistent Storage: Optional MongoDB storage for state information (or in-memory processing)
- Automatic Cleanup: Expires old alarms to keep state information current
- Periodic Resynchronization: Sends complete state updates at configured intervals
- Severity Tracking: Determines overall state based on the highest severity alarm present
How It Works
State Tracking
When alarms arrive for your CIs, the State Engine:
- Groups alarms by CI or other configured identifiers
- Applies delay rules if specified, holding alarms before they affect state
- Tracks the highest severity among all active (non-delayed) alarms for each CI
- Creates state messages that include all relevant alarms for that CI
- Sends state updates when the overall severity changes or when new alarms are promoted
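The grouping and severity steps above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual implementation: the function name `aggregate_states` and the dictionary layout are assumptions, while `group`, `suppression_key`, and `severity` are the alarm fields this document lists as required. Delay handling is left out here.

```python
from collections import defaultdict


def aggregate_states(alarms):
    """Group alarms by `group` and compute the highest severity per group.

    Hypothetical helper for illustration only. A later alarm with the same
    suppression_key replaces the earlier one within its group.
    """
    by_group = defaultdict(dict)
    for alarm in alarms:
        by_group[alarm["group"]][alarm["suppression_key"]] = alarm

    states = {}
    for group, keyed in by_group.items():
        active = list(keyed.values())
        states[group] = {
            "group": group,
            # Overall severity is the highest severity among the group's alarms.
            "severity": max((a["severity"] for a in active), default=0),
            "alarms": active,
        }
    return states
```

Keying each group's alarms by suppression key is what keeps a repeated alarm from being counted twice.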
State Messages
State messages provide a consolidated view of all active alarms for a particular CI. Each state message includes:
- The current overall severity (determined by the highest severity alarm)
- All active alarms associated with that CI
- Timing information about when the state was last updated
State Message Format
Message Structure
State messages are JSON-formatted messages that consolidate all active alarms for a specific group (typically a CI or monitored component).
Core Fields
Message Type
- Configurable message type (default: "State")
- Used to distinguish state messages from regular alarms
Group Identifier
- The group field identifies what the state represents (e.g., a specific CI, server, or component)
- All alarms with the same group value are consolidated into a single state message
Severity
- The overall state severity, determined by the highest severity among all active alarms in that group
- Represents the most critical condition affecting the component
Suppression Key
- Used to uniquely identify individual alarms within the state
- Prevents duplicate alarms from being counted multiple times
Alarm List
The state message includes a collection of all active alarms for that group. Each alarm in the state contains:
- Its individual severity level
- Suppression key (unique identifier)
- Any custom fields from the original alarm message
- Timing information
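To make the structure concrete, here is one possible shape of a state message, built as a Python dictionary and serialized to JSON. Only `group`, `severity`, and `suppression_key` are field names taken from this document; the keys `type`, `alarms`, `last_updated`, and `summary`, along with all values, are assumptions chosen for illustration and may not match what the plugin actually emits.

```python
import json

# Illustrative only: key names other than group, severity, and
# suppression_key are assumed, not taken from the plugin.
state_message = {
    "type": "State",            # documented default message type; key name assumed
    "group": "ServerA",
    "severity": 4,              # highest severity among the alarms below
    "last_updated": 1700000000, # assumed timing field (epoch seconds)
    "alarms": [
        {"suppression_key": "ServerA:disk", "severity": 2, "summary": "Disk space warning"},
        {"suppression_key": "ServerA:cpu", "severity": 4, "summary": "CPU critical"},
        {"suppression_key": "ServerA:mem", "severity": 2, "summary": "Memory warning"},
    ],
}

encoded = json.dumps(state_message, indent=2)
print(encoded)
```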
Example Scenario
If a server has three active alarms:
- Alarm 1: Disk space warning (severity 2)
- Alarm 2: CPU critical (severity 4)
- Alarm 3: Memory warning (severity 2)
The resulting state message would:
- Have group = "ServerA" (or whatever identifies this server)
- Have overall severity = 4 (the highest severity present)
- Include all three alarms in its alarm list
When the CPU critical alarm clears (severity 0):
- The state message updates to severity = 2 (now the highest)
- If "Omit Severity Zero" is enabled, the cleared CPU alarm won't be included
- The two warning alarms remain in the alarm list
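The scenario above can be worked through in a few lines. This is a sketch under stated assumptions: `build_state` and its `omit_severity_zero` parameter are hypothetical names that mirror the "Omit Severity Zero" setting, not the plugin's API.

```python
def build_state(group, alarms, omit_severity_zero=True):
    """Build a state summary from a group's alarms (hypothetical helper)."""
    alarms = list(alarms)
    # A cleared alarm (severity 0) never raises the overall severity.
    severity = max((a["severity"] for a in alarms), default=0)
    listed = [a for a in alarms if not (omit_severity_zero and a["severity"] == 0)]
    return {"group": group, "severity": severity, "alarms": listed}


alarms = {
    "disk": {"suppression_key": "disk", "severity": 2},
    "cpu": {"suppression_key": "cpu", "severity": 4},
    "mem": {"suppression_key": "mem", "severity": 2},
}

before = build_state("ServerA", alarms.values())   # overall severity 4, three alarms

alarms["cpu"]["severity"] = 0                      # the CPU critical alarm clears
after = build_state("ServerA", alarms.values())    # overall severity 2, two alarms
```

Calling `build_state` with `omit_severity_zero=False` after the clear would keep the cleared CPU alarm in the list while the overall severity still drops to 2.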
Delay Feature
Purpose
The delay feature prevents "alarm flapping" by introducing a waiting period before including new alarms in state messages. This is particularly useful for:
- Transient conditions that resolve quickly
- Preventing alert fatigue from brief spikes
- Stabilizing state messages during system fluctuations
How It Works
Delay Field
- Each incoming alarm can include a delay field (specified in seconds)
- If no delay is specified, the default is 0 (no delay)
Delayed Processing
- Alarm Arrives: When an alarm with a delay value arrives, it is received and tracked
- Waiting Period: The alarm is held in a pending state for the specified number of seconds
- State Impact: During the delay period, the alarm does not affect the current state
- Delay Expiration: Once the delay period passes, the alarm is promoted and included in state calculations
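A minimal sketch of this pending-then-promoted lifecycle is shown below. The class `DelayTracker` and its methods are hypothetical and exist only to illustrate the idea; in particular, how the plugin treats a clear for an alarm it has never promoted, or a delayed update to an already active alarm, is assumed here rather than documented.

```python
class DelayTracker:
    """Holds delayed alarms until their delay has elapsed (illustrative only)."""

    def __init__(self):
        self.pending = {}  # (group, suppression_key) -> (promote_at, alarm)
        self.active = {}   # (group, suppression_key) -> alarm

    def receive(self, alarm, now):
        key = (alarm["group"], alarm["suppression_key"])
        if alarm["severity"] == 0:
            # A clear cancels a pending alarm, so it never reaches the state.
            self.pending.pop(key, None)
            if key in self.active:
                self.active[key] = alarm
            return
        delay = alarm.get("delay", 0)  # default is 0 (no delay)
        if delay > 0:
            self.pending[key] = (now + delay, alarm)
        else:
            self.active[key] = alarm

    def promote(self, now):
        """Move alarms whose delay has expired into the active set."""
        due = [k for k, (promote_at, _) in self.pending.items() if promote_at <= now]
        promoted = []
        for key in due:
            _, alarm = self.pending.pop(key)
            self.active[key] = alarm
            promoted.append(alarm)
        return promoted


tracker = DelayTracker()
tracker.receive({"group": "ServerA", "suppression_key": "cpu", "severity": 2, "delay": 300}, now=0)
tracker.receive({"group": "ServerA", "suppression_key": "cpu", "severity": 0}, now=120)
promoted = tracker.promote(now=300)  # empty: the alarm cleared during its delay
```

Times are passed in explicitly as seconds so that the behavior is easy to test; a real implementation would read a clock and call `promote` on a timer.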
Delay Behavior Examples
Example 1: Brief Spike (Alarm Clears During Delay)
Timeline:
- T+0: CPU warning arrives with 300 second (5 minute) delay
- T+2min: CPU alarm clears (severity 0)
- T+5min: Delay expires, but alarm already cleared
Result: The warning never appears in the state message because it cleared before the delay expired.
Example 2: Sustained Condition
Timeline:
- T+0: Disk space warning arrives with 180 second (3 minute) delay
- T+3min: Delay expires, disk space warning still active
- T+3min: Alarm is promoted and included in state
Result: The alarm appears in the state message only after being sustained for 3 minutes.
Example 3: Escalating Severity
Timeline:
- T+0: Memory warning (severity 2) arrives with 300 second delay
- T+2min: Memory critical (severity 4) arrives for same component
- T+5min: First alarm delay expires
Result: Both alarms may be promoted based on their individual delay timers. The state severity reflects the highest severity among promoted alarms.
Configuration Considerations
Setting Delays
Delays are set on incoming alarm messages, not in the State Engine configuration. The source system or upstream processing must add the delay field to alarm messages.
Typical Delay Values:
- 60-180 seconds: For transient conditions (network blips, brief CPU spikes)
- 300-600 seconds: For conditions that should stabilize before alerting
- 0 seconds: For critical conditions requiring immediate state updates
Interaction with Other Features
Delay + Expiration
- Alarm expiration timers start when the alarm first arrives, not after delay promotion
- An alarm with a very long delay could theoretically expire before being promoted
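The arithmetic behind this interaction is simple enough to check up front. The helper below is a hypothetical sanity check, not part of the plugin; it assumes that both timers start at arrival, as described above, and it ignores the timing of the cleanup sweep, so it only indicates that expiry before promotion is possible.

```python
def may_expire_before_promotion(delay_seconds, expire_after_hours):
    """Return True if the delay is longer than the expiration window.

    Hypothetical check: an expiration setting of 0 means expiration is disabled.
    """
    if expire_after_hours <= 0:
        return False
    return delay_seconds > expire_after_hours * 3600


# A 2-hour delay with a 1-hour expiration window is at risk.
at_risk = may_expire_before_promotion(7200, 1)
```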
Delay + Resync
- Resync operations send the current active state
- Alarms still in delay period are not included in resync messages
Delay + Severity Zero
- If an alarm clears (severity 0) during its delay period, it never impacts the state
- This naturally filters transient alarms without additional configuration
Delay Use Cases
- Network Monitoring: Set 120-second delay on connectivity alarms to avoid alerting during brief network fluctuations
- Performance Metrics: Use 300-second delay on threshold violations to ensure sustained performance issues before declaring state change
- Service Health Checks: Apply delays to prevent state changes from single failed health check polls
- Multi-stage Systems: Use different delays for different severity levels (longer delay for warnings, shorter for critical)
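Because delays are carried on the alarm itself, a layered strategy like the last item above would be applied upstream of the State Engine. The sketch below shows one way an upstream step might do that; the mapping `DELAY_BY_SEVERITY`, its values, and the function `with_delay` are all hypothetical and should be tuned to your own alarm patterns.

```python
# Hypothetical mapping: longer delays for low severities, none for critical.
DELAY_BY_SEVERITY = {1: 300, 2: 180, 3: 60, 4: 0}


def with_delay(alarm, default_delay=0):
    """Return a copy of the alarm with a delay (in seconds) chosen by severity."""
    tagged = dict(alarm)  # copy so the caller's alarm is not modified
    tagged["delay"] = DELAY_BY_SEVERITY.get(alarm["severity"], default_delay)
    return tagged


warning = with_delay({"group": "ServerA", "suppression_key": "cpu", "severity": 2})
critical = with_delay({"group": "ServerA", "suppression_key": "cpu", "severity": 4})
```

Clears (severity 0) are not in the mapping, so they fall back to the default of no delay.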
Configuration
Storage Settings
Collection Name
- Purpose: Specifies the MongoDB collection name for persistent state storage
- Default: Empty (in-memory processing)
- When to use: Leave empty for temporary, in-memory state tracking; provide a collection name when you need state to persist across plugin restarts
Message Handling
State Message Type
- Purpose: Defines the message type used for outgoing state messages
- Default: "State"
- When to change: Modify if your downstream systems expect a different message type
Omit Severity Zero Alarms
- Purpose: Controls whether cleared alarms (severity 0) are included in state messages
- Default: Enabled (severity 0 alarms are excluded)
- When to disable: Turn off if you need to see which alarms have been cleared in your state messages
Maintenance Operations
Expire Alarms After (hours)
- Purpose: Automatically removes alarms that haven't been updated within the specified time period
- Default: 0 (disabled)
- Recommended: Set to a value like 24 or 48 hours to prevent stale alarms from persisting
- How it works: Runs an hourly cleanup sweep to remove expired alarms and update affected states
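A cleanup sweep of this kind might look roughly like the sketch below. The function `expire_alarms` and the `last_updated` field (epoch seconds) are assumptions made for illustration; the plugin's internal bookkeeping is not documented here.

```python
def expire_alarms(alarms, now, expire_after_hours):
    """Remove alarms not updated within the window; return the removed keys.

    `alarms` maps a key to an alarm dict carrying an assumed `last_updated`
    timestamp in epoch seconds. A setting of 0 disables expiration.
    """
    if expire_after_hours <= 0:
        return []
    cutoff = now - expire_after_hours * 3600
    expired = [key for key, alarm in alarms.items() if alarm["last_updated"] < cutoff]
    for key in expired:
        del alarms[key]
    return expired
```

After a sweep, any group that lost an alarm would need its overall severity recalculated, which is why expiration can trigger a state update.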
Resync (hours)
- Purpose: Sends complete state messages for all CIs at regular intervals
- Default: 0 (disabled)
- Recommended: Set to 6-24 hours for periodic full state synchronization
- Use cases: Keeps downstream systems synchronized, recovers from missed state updates, and provides periodic health snapshots
Usage Scenarios
Scenario 1: Basic State Tracking (In-Memory)
Configuration:
- Collection Name: (empty)
- Omit Severity Zero: Enabled
- Expire Alarms: 24 hours
- Resync: 0
Best for: Development, testing, or environments where state doesn't need to persist across restarts.
Scenario 2: Production State Management
Configuration:
- Collection Name: "ci_states"
- Omit Severity Zero: Enabled
- Expire Alarms: 48 hours
- Resync: 12 hours
Best for: Production environments requiring persistent state tracking and regular synchronization.
Scenario 3: Complete Alarm History
Configuration:
- Collection Name: "full_states"
- Omit Severity Zero: Disabled
- Expire Alarms: 0
- Resync: 6 hours
Best for: Audit scenarios where you need visibility into both active and cleared alarms.
Scenario 4: Flap Prevention with Delays
Configuration:
- Collection Name: "ci_states"
- Omit Severity Zero: Enabled
- Expire Alarms: 24 hours
- Resync: 12 hours
- Incoming alarms configured with appropriate delay values (e.g., 180 seconds for warnings)
Best for: Environments with intermittent conditions or noisy monitoring where you want to suppress transient alarms.
Operational Behavior
Automatic Processing
The plugin operates continuously with several automatic processes:
- Immediate Processing: Incoming alarms are processed immediately, triggering state updates when necessary (unless delayed)
- Delay Promotion: Every 10 seconds, alarms whose delay periods have expired are promoted and state changes are published
- Hourly Cleanup: If alarm expiration is configured, old alarms are removed every hour
- Scheduled Resync: If configured, complete state messages are sent at the specified interval
State Updates
State messages are sent when:
- A new alarm is promoted (after delay, if applicable) for a CI
- An alarm's severity changes
- The overall state severity changes (e.g., a critical alarm clears, leaving only warnings)
- A resynchronization interval triggers
- An alarm expires and affects the overall state
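One way to think about these triggers is as a comparison between the previous and current state of a group. The sketch below is an assumption about how such a check could be written, not the plugin's logic; `state_changed` is a hypothetical name, and the resync trigger is deliberately left out because it is time-based rather than change-based.

```python
def state_changed(previous, current):
    """Decide whether a state update should be sent (illustrative only).

    Each state is a dict with `severity` and an `alarms` list whose items
    carry `suppression_key` and `severity`.
    """
    if previous is None:
        return True  # first state seen for this group
    if previous["severity"] != current["severity"]:
        return True  # overall severity changed

    def severities(state):
        return {a["suppression_key"]: a["severity"] for a in state["alarms"]}

    old, new = severities(previous), severities(current)
    # A new alarm has no old severity; a changed alarm has a different one.
    return any(old.get(key) != sev for key, sev in new.items())
```

In this sketch, an alarm that disappears without changing the overall severity does not trigger an update, matching the last item in the list above.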
Performance Considerations
- In-Memory Mode: Fastest processing but state is lost on restart
- MongoDB Mode: Persistent state but requires database connectivity
- Resync Interval: More frequent resync provides better synchronization but increases message volume
- Expire Alarms: Regular cleanup prevents unbounded growth of state data
- Delay Processing: Minimal overhead; delayed alarms are efficiently managed in memory
Troubleshooting
States Not Updating
- Verify incoming alarms are reaching the plugin
- Check that the CI identifier (group field) is consistent across alarms
- Ensure alarms have valid suppression keys
- Review MongoDB connectivity if using persistent storage
- Check if alarms are in delay period and haven't been promoted yet
Alarms Not Appearing in State
- Check whether the alarm is still in its delay period and waiting to be promoted
- Check if the alarm cleared (severity 0) during its delay period
- Ensure "Omit Severity Zero" setting matches your expectations
- Verify alarm validation (group and suppression key must be present)
High Message Volume
- Consider increasing the resync interval
- Enable "Omit Severity Zero" to reduce cleared alarm messages
- Verify alarm expiration is configured to clean up old data
- Review delay settings on incoming alarms
Missing State Information After Restart
- Ensure MongoDB collection name is configured for persistent storage
- Verify database connectivity and permissions
- Check that the specified collection exists and is accessible
Transient Alarms Causing Noise
- Implement appropriate delay values on incoming alarms
- Set delays based on expected duration of transient conditions
- Review alarm patterns to tune delay values
Best Practices
General Configuration
- Use Persistent Storage in Production: Configure a MongoDB collection for production environments
- Set Appropriate Expiration: Configure alarm expiration to match your operational needs (typically 24-48 hours)
- Enable Periodic Resync: Set resync to 6-12 hours for reliable state synchronization
- Monitor State Changes: Set up downstream processing to act on state severity changes
- Test Configuration: Start with in-memory mode during development and testing
Delay Strategy
- Match Delay to Symptom: Set delays based on how long a condition should persist before it's significant
- Consider Downstream Impact: Remember that delayed alarms won't trigger immediate notifications
- Critical Alarms: Use zero delay for conditions requiring immediate attention
- Test Delay Values: Monitor actual alarm patterns to tune delay values appropriately
- Document Delay Strategy: Ensure operations teams understand which alarms have delays and why
- Layered Approach: Use different delays for different severity levels (e.g., longer delays for warnings, shorter for critical alarms)
Required Alarm Fields
For alarms to be processed correctly, ensure incoming messages contain:
- group: Identifies the CI or component
- suppression_key: Uniquely identifies the alarm
- severity: Numeric severity level (0 = cleared, higher = more severe)
- delay (optional): Delay period in seconds before alarm affects state
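An upstream check along these lines can catch malformed alarms before they reach the plugin. This is a sketch only: `validate_alarm` is a hypothetical helper, and while the document states that group and suppression key must be present, the numeric checks on severity and delay are assumptions about what counts as valid.

```python
def validate_alarm(alarm):
    """Return a list of problems found in an incoming alarm (empty if none)."""
    errors = []
    for field in ("group", "suppression_key"):
        if not alarm.get(field):
            errors.append(f"missing {field}")

    def is_non_negative_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            not isinstance(value, bool)
            and isinstance(value, (int, float))
            and value >= 0
        )

    if not is_non_negative_number(alarm.get("severity")):
        errors.append("severity must be a non-negative number")
    # delay is optional and defaults to 0 (no delay)
    if not is_non_negative_number(alarm.get("delay", 0)):
        errors.append("delay must be a non-negative number of seconds")
    return errors
```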
Summary
The State Engine plugin aggregates alarms and manages state so that you can see the health of your infrastructure clearly. By consolidating multiple alarms per CI, filtering transient conditions through delays, and offering in-memory or MongoDB storage, it reduces alert fatigue while keeping critical conditions tracked and communicated. Configure the plugin to match your operational needs, apply delay strategies suited to each alarm type, and use periodic resync to keep downstream systems synchronized.