Version: SG FLX

Production Monitoring

Effective monitoring is critical for maintaining a secure and performant Search Guard deployment. This guide covers what to monitor, how to set up alerts, and how to respond to common issues.

What to Monitor

Security Events

Security events to watch, with suggested severity and response:

Event	Severity	Action
Multiple failed login attempts	High	Investigate potential brute force attack
Authorization failures	Medium	Check for misconfigured permissions
TLS handshake failures	High	Investigate possible MITM attack or certificate issue
Admin configuration changes	Medium	Verify authorized changes
Unusual data access patterns	High	Investigate potential data exfiltration
Certificate expiration warnings	High	Renew certificates immediately

Performance Metrics

Search Guard performance indicators:

Metric	Alert threshold	Likely cause
Authentication latency	> 100ms	Cache miss rate too high, slow auth backend
Authorization latency	> 50ms	Complex role definitions, too many roles
DLS query overhead	> 20% slower	Inefficient DLS queries
Audit log queue size	> 80% full	Audit processing bottleneck
Authentication cache hit rate	< 80%	TTL too short, cache too small
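
The thresholds in the table above can be encoded in a small helper so that every check uses the same limits. This is a minimal sketch: the metric labels (`authn_latency_ms` and so on) and the function name are hypothetical, and how each value is measured is left open, since that depends on your monitoring stack.

```shell
# Compare a measured value against the thresholds from the table above.
# Metric labels are hypothetical; values are whole numbers (ms or percent).
# Prints "alert", "ok", or "unknown".
check_sg_metric() {
  name=$1
  value=$2
  case "$name" in
    authn_latency_ms)    limit=100; above=1 ;;
    authz_latency_ms)    limit=50;  above=1 ;;
    dls_overhead_pct)    limit=20;  above=1 ;;
    audit_queue_pct)     limit=80;  above=1 ;;
    authn_cache_hit_pct) limit=80;  above=0 ;;  # alert when BELOW the limit
    *) echo "unknown"; return 2 ;;
  esac
  if [ "$above" -eq 1 ] && [ "$value" -gt "$limit" ]; then
    echo "alert"
  elif [ "$above" -eq 0 ] && [ "$value" -lt "$limit" ]; then
    echo "alert"
  else
    echo "ok"
  fi
}

# Demo with made-up measurements.
check_sg_metric authn_latency_ms 140     # prints "alert"
check_sg_metric authn_cache_hit_pct 92   # prints "ok"
```

Keeping the limits in one place makes it easier to adjust them as you learn what is normal for your cluster.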

System Health

Elasticsearch cluster health:

Cluster status (green/yellow/red)
Node availability
Shard allocation
JVM heap usage
GC pause times
Thread pool rejections
Disk space usage

Monitoring Tools

Elasticsearch APIs

Built-in monitoring endpoints:

1. Cluster health:

curl -XGET "https://localhost:9200/_cluster/health?pretty" \
  -u admin:password

Response:

{
  "cluster_name": "production",
  "status": "green",
  "number_of_nodes": 5,
  "active_shards": 1000
}

Alert if: status is yellow or red
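
A check script that wraps this call usually has to turn the `status` value into an exit code for the scheduler or monitoring agent. The sketch below assumes the widely used Nagios-style convention (0 OK, 1 warning, 2 critical, 3 unknown). The function name is illustrative and not part of Elasticsearch or Search Guard.

```shell
# Map a cluster health status string to a Nagios-style exit code.
# The function name and the exit-code convention are assumptions.
status_to_exit_code() {
  case "$1" in
    green)  echo 0 ;;  # OK
    yellow) echo 1 ;;  # WARNING
    red)    echo 2 ;;  # CRITICAL
    *)      echo 3 ;;  # UNKNOWN: empty or unexpected value, e.g. the request failed
  esac
}

# Demo with fixed values; in practice pass the "status" field of the response.
for s in green yellow red ""; do
  echo "status='$s' -> exit code $(status_to_exit_code "$s")"
done
```

Treating an empty or unexpected status as "unknown" rather than "OK" keeps a failed request from silently passing the check.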

2. Node stats:

curl -XGET "https://localhost:9200/_nodes/stats?pretty" \
  -u admin:password

Monitor:

jvm.mem.heap_used_percent (alert if > 75%)
jvm.gc.collectors.*.collection_time_in_millis (alert if increasing rapidly)
thread_pool.*.rejected (alert if > 0)
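
For a quick heap check without parsing the full node stats document, the compact `_cat/nodes` output can be filtered with awk. The sketch below assumes two whitespace-separated columns, node name and heap percentage, as returned by `GET /_cat/nodes?h=name,heap.percent`. The function name is illustrative.

```shell
# Print the nodes whose heap usage exceeds a threshold (default 75, as above).
# Reads "name heap_percent" pairs, one per line, from stdin.
# The function name is illustrative.
nodes_over_heap_threshold() {
  awk -v limit="${1:-75}" 'NF >= 2 && $2 + 0 > limit + 0 { print $1 " " $2 }'
}

# Demo with made-up sample data instead of a live cluster.
printf '%s\n' 'node-1 62' 'node-2 81' 'node-3 75' | nodes_over_heap_threshold
```

Against a live cluster the input would come from something like `curl -s -u admin:password "https://localhost:9200/_cat/nodes?h=name,heap.percent" | nodes_over_heap_threshold 75`.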

3. Search Guard health:

curl -XGET "https://localhost:9200/_searchguard/health?pretty" \
  -u admin:password

4. Search Guard authentication info:

curl -XGET "https://localhost:9200/_searchguard/authinfo?pretty" \
  -u admin:password

Audit Log Monitoring

Query audit logs for security events:

Failed logins in last hour:

curl -XGET "https://localhost:9200/sg_auditlog-*/_search?pretty" \
  -u admin:password \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"term": {"audit_category": "FAILED_LOGIN"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  },
  "aggs": {
    "by_user": {
      "terms": {
        "field": "audit_request_effective_user",
        "size": 10
      }
    }
  }
}'

Authorization failures:

curl -XGET "https://localhost:9200/sg_auditlog-*/_search?pretty" \
  -u admin:password \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"term": {"audit_category": "MISSING_PRIVILEGES"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  }
}'

Configuration changes:

curl -XGET "https://localhost:9200/sg_auditlog-*/_search?pretty" \
  -u admin:password \
  -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        {"term": {"audit_category": "SG_INDEX_ATTEMPT"}},
        {"range": {"@timestamp": {"gte": "now-24h"}}}
      ]
    }
  }
}'

Metrics Collection

Popular monitoring stacks:

1. Elastic Stack (Metricbeat + Kibana):

# metricbeat.yml
metricbeat.modules:
  - module: elasticsearch
    metricsets:
      - node
      - node_stats
      - cluster_stats
    period: 10s
    hosts: ["https://localhost:9200"]
    username: "monitoring_user"
    password: "${MONITORING_PASSWORD}"
    ssl.verification_mode: "full"
    ssl.certificate_authorities: ["/path/to/ca.pem"]

2. Prometheus + Grafana:

Use the Elasticsearch exporter:

# elasticsearch_exporter
./elasticsearch_exporter \
  --es.uri=https://localhost:9200 \
  --es.all \
  --es.cluster_settings \
  --es.ssl-skip-verify=false \
  --es.ca=/path/to/ca.pem

3. Datadog, New Relic, or other APM:

Install the vendor's agent and enable its Elasticsearch integration.

Log Aggregation

Centralize Elasticsearch logs:

1. Filebeat to ship logs:

# filebeat.yml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/elasticsearch/*.log
    fields:
      cluster: production
      node: ${NODE_NAME}

output.elasticsearch:
  hosts: ["https://logs-cluster:9200"]
  username: "filebeat"
  password: "${FILEBEAT_PASSWORD}"

2. Parse logs for errors:

\[WARN *\].*[Aa]uthentication.*failed
\[ERROR *\].*TLS.*handshake
\[ERROR *\].*[Cc]ertificate.*expired
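
Patterns of this kind can be applied to a log file with `grep -E`. In an extended regular expression an unescaped `[WARN]` is a character class that matches a single letter, so the opening bracket has to be escaped to match the log level literally. The sketch below counts matching lines per pattern. The function name and the exact pattern list are assumptions for illustration.

```shell
# Count the lines in a log file that match each error pattern.
# "\[" matches a literal opening bracket; " *" allows for the padding inside
# the level field, as in "[WARN ]". A "]" outside a bracket expression is
# already literal. Function name and pattern list are illustrative.
count_log_matches() {
  logfile=$1
  for pattern in \
    '\[WARN *].*[Aa]uthentication.*failed' \
    '\[ERROR *].*TLS.*handshake' \
    '\[ERROR *].*[Cc]ertificate.*expired'
  do
    # grep -c prints the number of matching lines (0 when none match)
    printf '%s\t%s\n' "$(grep -E -c -- "$pattern" "$logfile")" "$pattern"
  done
}

# Demo on a temporary file with two sample lines (timestamps are made up).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[2024-01-01T00:00:00,000][WARN ][c.f.s.a.BackendRegistry] Authentication finally failed for user1 from 192.168.1.100
[2024-01-01T00:00:01,000][INFO ][o.e.n.Node] started
EOF
count_log_matches "$tmp"
rm -f "$tmp"
```

Each output line holds a count, a tab, and the pattern, which is easy to feed into a threshold check or a dashboard.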

Alerting with Search Guard Signals

Failed Login Alert

Detect brute force attacks:

sg_signals/watch/failed-logins.json:

{
  "trigger": {
    "schedule": {
      "interval": ["5m"]
    }
  },
  "checks": [
    {
      "type": "search",
      "name": "failed_logins",
      "target": "sg_auditlog-*",
      "request": {
        "query": {
          "bool": {
            "must": [
              {"term": {"audit_category": "FAILED_LOGIN"}},
              {"range": {"@timestamp": {"gte": "now-5m"}}}
            ]
          }
        },
        "aggs": {
          "by_user": {
            "terms": {
              "field": "audit_request_effective_user",
              "size": 10,
              "min_doc_count": 5
            }
          }
        }
      }
    }
  ],
  "checks.0.severity": "warning",
  "actions": [
    {
      "type": "email",
      "name": "security_team",
      "throttle_period": "1h",
      "account": "default",
      "to": ["security@example.com"],
      "subject": "Alert: Multiple Failed Login Attempts Detected",
      "body": "User(s) with failed login attempts:\n\n- :  failures\n"
    }
  ]
}

Certificate Expiration Alert

Monitor certificate validity:

sg_signals/watch/cert-expiration.json:

{
  "trigger": {
    "schedule": {
      "daily": {"at": ["09:00"]}
    }
  },
  "checks": [
    {
      "type": "static",
      "name": "check_cert",
      "severity": "error",
      "target": "cert_expiry_days",
      "value": ""
    }
  ],
  "actions": [
    {
      "type": "email",
      "name": "ops_team",
      "account": "default",
      "to": ["ops@example.com"],
      "subject": "URGENT: SSL Certificate Expiring Soon",
      "body": "Certificate expires in  days. Renew immediately!",
      "checks": {
        "check_cert": {
          "severity": ["error", "critical"]
        }
      }
    }
  ]
}

Script to check certificate:

#!/bin/bash
# cert-check.sh

CERT_FILE="/etc/elasticsearch/certs/node.pem"
# Parse the notAfter date; "date -d" requires GNU date
EXPIRY_DATE=$(openssl x509 -enddate -noout -in "$CERT_FILE" | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
NOW_EPOCH=$(date +%s)
DAYS_UNTIL_EXPIRY=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))

# Index result for Signals to monitor
curl -s -XPOST "https://localhost:9200/cert-monitoring/_doc" \
  -u admin:password \
  -H 'Content-Type: application/json' -d "{
  \"@timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
  \"cert_expiry_days\": $DAYS_UNTIL_EXPIRY,
  \"cert_file\": \"$CERT_FILE\"
}"

# Exit with error if < 30 days
if [ "$DAYS_UNTIL_EXPIRY" -lt 30 ]; then
  echo "WARNING: Certificate expires in $DAYS_UNTIL_EXPIRY days!"
  exit 1
fi

Cluster Health Alert

Alert on cluster status changes:

sg_signals/watch/cluster-health.json:

{
  "trigger": {
    "schedule": {
      "interval": ["1m"]
    }
  },
  "checks": [
    {
      "type": "condition",
      "name": "cluster_status",
      "source": "data.cluster_health.status != 'green'",
      "severity": {
        "mapping": {
          "data.cluster_health.status == 'yellow'": "warning",
          "data.cluster_health.status == 'red'": "error"
        }
      }
    }
  ],
  "actions": [
    {
      "type": "slack",
      "name": "ops_channel",
      "account": "default",
      "channel": "#elasticsearch-alerts",
      "text": "⚠️ Cluster status is !\n unassigned shards.",
      "checks": {
        "cluster_status": {
          "severity": ["warning", "error"]
        }
      }
    }
  ]
}

High JVM Heap Usage Alert

{
  "trigger": {
    "schedule": {
      "interval": ["5m"]
    }
  },
  "checks": [
    {
      "type": "condition",
      "name": "high_heap",
      "source": "data.nodes_stats.nodes.any(node -> node.jvm.mem.heap_used_percent > 85)",
      "severity": "warning"
    }
  ],
  "actions": [
    {
      "type": "pagerduty",
      "name": "oncall",
      "account": "default",
      "routing_key": "${env.PAGERDUTY_KEY}",
      "event_action": "trigger",
      "summary": "High JVM heap usage detected",
      "severity": "warning"
    }
  ]
}

Unusual Data Access Pattern Alert

Detect potential data exfiltration:

{
  "trigger": {
    "schedule": {
      "interval": ["10m"]
    }
  },
  "checks": [
    {
      "type": "search",
      "name": "bulk_reads",
      "target": "sg_auditlog-*",
      "request": {
        "query": {
          "bool": {
            "must": [
              {"term": {"audit_category": "GRANTED_PRIVILEGES"}},
              {"prefix": {"audit_request_privilege": "indices:data/read/"}},
              {"range": {"@timestamp": {"gte": "now-10m"}}}
            ]
          }
        },
        "aggs": {
          "by_user": {
            "terms": {
              "field": "audit_request_effective_user",
              "size": 10
            },
            "aggs": {
              "doc_count": {
                "sum": {
                  "field": "audit_request_body.size"
                }
              }
            }
          }
        }
      }
    }
  ],
  "checks.0.severity": {
    "mapping": {
      "data.bulk_reads.aggregations.by_user.buckets.any(b -> b.doc_count.value > 100000)": "error"
    }
  },
  "actions": [
    {
      "type": "email",
      "name": "security_team",
      "account": "default",
      "to": ["security@example.com"],
      "subject": "SECURITY ALERT: Unusual bulk data access detected",
      "body": "Large data access detected:\n\n- User: , Documents: \n"
    }
  ]
}

Monitoring Dashboards

Kibana Dashboards

Create monitoring dashboards in Kibana:

1. Security Overview Dashboard:

Failed login attempts (last 24h)
Authorization failures by user
Recent configuration changes
Active sessions count

2. Performance Dashboard:

Search latency (p50, p95, p99)
Authentication cache hit rate
Audit log processing rate
JVM heap usage
GC pause times

3. Cluster Health Dashboard:

Cluster status
Node count
Shard allocation
Disk space usage
CPU and memory usage

Example visualization (Vega):

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {
    "url": {
      "index": "sg_auditlog-*",
      "body": {
        "query": {
          "range": {"@timestamp": {"gte": "now-24h"}}
        },
        "aggs": {
          "over_time": {
            "date_histogram": {
              "field": "@timestamp",
              "fixed_interval": "1h"
            },
            "aggs": {
              "by_category": {
                "terms": {"field": "audit_category"}
              }
            }
          }
        }
      }
    }
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "@timestamp", "type": "temporal"},
    "y": {"field": "count", "type": "quantitative"},
    "color": {"field": "audit_category", "type": "nominal"}
  }
}

Grafana Dashboards

Example Prometheus queries:

Authentication latency:

histogram_quantile(0.99,
  rate(searchguard_authentication_duration_seconds_bucket[5m])
)

Failed logins per minute:

rate(searchguard_failed_login_total[1m])

JVM heap usage:

elasticsearch_jvm_memory_used_bytes{area="heap"}
/
elasticsearch_jvm_memory_max_bytes{area="heap"} * 100

Log Analysis

Common Log Patterns

1. Failed authentication:

[WARN ][c.f.s.a.BackendRegistry] Authentication finally failed for user1 from 192.168.1.100

Action: Investigate source IP, check for brute force.

2. TLS handshake failure:

[WARN ][o.e.h.n.Netty4HttpServerTransport] caught exception while handling client http traffic, closing connection
io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record

Action: Check client TLS configuration, verify certificate validity.

3. Missing privileges:

[WARN ][c.f.s.p.PrivilegesEvaluator] No index-level permission match for user1 [indices:data/write/index] on index logs-2024

Action: Review user roles and permissions.

4. Certificate expiration:

[ERROR][c.f.s.s.DefaultSearchGuardKeyStore] Certificate expired: CN=node-1

Action: Renew certificate immediately.

Log Retention

Configure log rotation:

logrotate config (/etc/logrotate.d/elasticsearch):

/var/log/elasticsearch/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    # Elasticsearch keeps its log files open, so truncate in place
    # instead of moving the file and signalling the service
    copytruncate
}

Incident Response Runbooks

High Failed Login Rate

Detection: 5 or more failed logins per user within 5 minutes

Response:

Identify affected user(s) from audit logs
Check source IPs - single IP or distributed?
If single IP: Block at firewall level
If distributed: Enable account lockout
Contact user to verify if legitimate activity
Reset password if account compromised
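
The "single IP or distributed?" step can be scripted once the failed logins are reduced to user and source address pairs. The sketch below assumes one `user ip` pair per line on stdin. How the pairs are extracted from the audit log is left out (a field such as `audit_request_remote_address` may hold the address, if present in your audit documents), and the function name is illustrative.

```shell
# For each user, report whether failed logins came from one IP or several.
# Reads "user ip" pairs from stdin, one failed login per line.
# The function name is illustrative.
classify_login_sources() {
  awk '
    NF >= 2 {
      total[$1]++
      # count each (user, ip) combination only once
      if (!(($1, $2) in seen)) { seen[$1, $2] = 1; ips[$1]++ }
    }
    END {
      for (u in total) {
        kind = (ips[u] == 1) ? "single-ip" : "distributed"
        print u, kind, "ips=" ips[u], "failures=" total[u]
      }
    }' | sort
}

# Demo with made-up data.
classify_login_sources <<'EOF'
alice 10.0.0.1
alice 10.0.0.1
alice 10.0.0.1
bob 10.0.0.2
bob 10.0.0.3
EOF
```

A `single-ip` result points toward a firewall block, while `distributed` points toward account lockout, matching the two branches of the runbook.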

Cluster Status Yellow/Red

Detection: Cluster health API returns non-green status

Response:

Check unassigned shards: GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason
Check node status: GET /_cat/nodes?v
Review Elasticsearch logs for errors
If node failure: Restart failed node
If disk space: Free up disk or add nodes
If shard allocation issue: Review allocation settings
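
When many shards are unassigned, grouping them by reason shows which branch of the runbook applies. The sketch below assumes the column order of the `_cat/shards` request in the first step (index, shard, prirep, state, unassigned.reason) and reads that output from stdin. The function name is illustrative.

```shell
# Summarize unassigned shards by reason, most frequent first.
# Expects the columns index, shard, prirep, state, unassigned.reason on stdin.
# The function name is illustrative.
unassigned_by_reason() {
  awk '
    $4 == "UNASSIGNED" {
      reason = ($5 == "") ? "UNKNOWN" : $5
      count[reason]++
    }
    END { for (r in count) print count[r], r }' | sort -rn
}

# Demo with made-up sample rows.
unassigned_by_reason <<'EOF'
logs-2024 0 p STARTED
logs-2024 0 r UNASSIGNED NODE_LEFT
logs-2024 1 r UNASSIGNED NODE_LEFT
metrics 0 r UNASSIGNED ALLOCATION_FAILED
EOF
```

A header row (from the `v` parameter) is ignored automatically because its state column does not read `UNASSIGNED`.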

Certificate Expiration

Detection: Certificate expires in < 30 days

Response:

Generate new certificates
Test new certificates in staging environment
Schedule maintenance window
Perform rolling certificate update using hot reload
Verify TLS connectivity after update
Update monitoring with new expiration date

Unusual Data Access

Detection: User accessed > 100,000 documents in 10 minutes

Response:

Identify user from audit logs
Review query patterns - legitimate or suspicious?
Check source IP and time of day
If suspicious: Disable user account immediately
Contact user to verify if legitimate
Review data accessed - any sensitive data?
File security incident report
Reset user credentials if compromised

Health Checks

Automated Health Checks

Script to run every 5 minutes:

#!/bin/bash
# es-health-check.sh

ES_HOST="https://localhost:9200"
ES_USER="monitoring"
ES_PASS="${MONITORING_PASSWORD}"
EXPECTED_NODES=5

# Cluster health
CLUSTER_STATUS=$(curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cluster/health" | jq -r '.status')

if [ "$CLUSTER_STATUS" != "green" ]; then
  echo "ALERT: Cluster status is ${CLUSTER_STATUS:-unknown}"
  # Send alert
  curl -s -X POST "https://alerts.example.com/api/alert" \
    -H 'Content-Type: application/json' \
    -d "{\"message\": \"Cluster status: ${CLUSTER_STATUS:-unknown}\"}"
fi

# Check nodes (_cat/nodes prints one line per node)
NODE_COUNT=$(curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cat/nodes" | wc -l | tr -d ' ')
if [ "$NODE_COUNT" -lt "$EXPECTED_NODES" ]; then
  echo "ALERT: Only $NODE_COUNT nodes active (expected $EXPECTED_NODES)"
fi

# Check disk space: highest disk.percent across nodes.
# grep drops the empty value reported for unassigned shards.
DISK_USAGE=$(curl -s -u "$ES_USER:$ES_PASS" "$ES_HOST/_cat/allocation?h=disk.percent" | grep -E '[0-9]' | sort -rn | head -1 | tr -d ' ')
if [ -n "$DISK_USAGE" ] && [ "$DISK_USAGE" -gt 80 ]; then
  echo "ALERT: Disk usage is ${DISK_USAGE}%"
fi

Manual Health Checks

Weekly checklist:

Review audit logs for security anomalies
Check certificate expiration dates
Review failed login patterns
Verify backup completion
Check cluster performance metrics
Review recent configuration changes

Next Steps

Production Checklist - Complete pre-deployment verification
Security Hardening - Implement security best practices
Performance Tuning - Optimize performance
Audit and Compliance - Configure audit logging
