Production Monitoring
Effective monitoring is critical for maintaining a secure and performant Search Guard deployment. This guide covers what to monitor, how to set up alerts, and how to respond to common issues.
What to Monitor
Security Events
Critical security events requiring immediate attention:
| Event | Severity | Action |
|---|---|---|
| Multiple failed login attempts | High | Investigate potential brute force attack |
| Authorization failures | Medium | Check for misconfigured permissions |
| TLS handshake failures | High | Possible MITM attack or certificate issue |
| Admin configuration changes | Medium | Verify authorized changes |
| Unusual data access patterns | High | Potential data exfiltration |
| Certificate expiration warnings | High | Renew certificates immediately |
Performance Metrics
Search Guard performance indicators:
| Metric | Threshold | Issue |
|---|---|---|
| Authentication latency | > 100ms | Cache miss rate too high, slow auth backend |
| Authorization latency | > 50ms | Complex role definitions, too many roles |
| DLS query overhead | > 20% slower | Inefficient DLS queries |
| Audit log queue size | > 80% full | Audit processing bottleneck |
| Authentication cache hit rate | < 80% | TTL too short, cache too small |
System Health
Elasticsearch cluster health:
- Cluster status (green/yellow/red)
- Node availability
- Shard allocation
- JVM heap usage
- GC pause times
- Thread pool rejections
- Disk space usage
Monitoring Tools
Elasticsearch APIs
Built-in monitoring endpoints:
1. Cluster health:
curl -XGET "https://localhost:9200/_cluster/health?pretty" \
-u admin:password
Response:
{
"cluster_name": "production",
"status": "green",
"number_of_nodes": 5,
"active_shards": 1000
}
Alert if: status is yellow or red
2. Node stats:
curl -XGET "https://localhost:9200/_nodes/stats?pretty" \
-u admin:password
Monitor:
jvm.mem.heap_used_percent(alert if > 75%)jvm.gc.collectors.*.collection_time_in_millis(alert if increasing rapidly)thread_pool.*.rejected(alert if > 0)
3. Search Guard health:
curl -XGET "https://localhost:9200/_searchguard/health?pretty" \
-u admin:password
4. Search Guard authentication info:
curl -XGET "https://localhost:9200/_searchguard/authinfo?pretty" \
-u admin:password
Audit Log Monitoring
Query audit logs for security events:
Failed logins in last hour:
curl -XGET "https://localhost:9200/sg_auditlog-*/_search?pretty" \
-H 'Content-Type: application/json' -d '{
"query": {
"bool": {
"must": [
{"term": {"audit_category": "FAILED_LOGIN"}},
{"range": {"@timestamp": {"gte": "now-1h"}}}
]
}
},
"aggs": {
"by_user": {
"terms": {
"field": "audit_request_effective_user",
"size": 10
}
}
}
}'
Authorization failures:
curl -XGET "https://localhost:9200/sg_auditlog-*/_search?pretty" \
-H 'Content-Type: application/json' -d '{
"query": {
"bool": {
"must": [
{"term": {"audit_category": "MISSING_PRIVILEGES"}},
{"range": {"@timestamp": {"gte": "now-1h"}}}
]
}
}
}'
Configuration changes:
curl -XGET "https://localhost:9200/sg_auditlog-*/_search?pretty" \
-H 'Content-Type: application/json' -d '{
"query": {
"bool": {
"must": [
{"term": {"audit_category": "SG_INDEX_ATTEMPT"}},
{"range": {"@timestamp": {"gte": "now-24h"}}}
]
}
}
}'
Metrics Collection
Popular monitoring stacks:
1. Elastic Stack (Metricbeat + Kibana):
# metricbeat.yml
metricbeat.modules:
- module: elasticsearch
metricsets:
- node
- node_stats
- cluster_stats
period: 10s
hosts: ["https://localhost:9200"]
username: "monitoring_user"
password: "${MONITORING_PASSWORD}"
ssl.verification_mode: "full"
ssl.certificate_authorities: ["/path/to/ca.pem"]
2. Prometheus + Grafana:
Use Elasticsearch exporter:
# elasticsearch_exporter
./elasticsearch_exporter \
--es.uri=https://localhost:9200 \
--es.all \
--es.cluster_settings \
--es.ssl-skip-verify=false \
--es.ca=/path/to/ca.pem
3. Datadog, New Relic, or other APM:
Configure using agent and Elasticsearch integration.
Log Aggregation
Centralize Elasticsearch logs:
1. Filebeat to ship logs:
# filebeat.yml
filebeat.inputs:
- type: log
paths:
- /var/log/elasticsearch/*.log
fields:
cluster: production
node: ${NODE_NAME}
output.elasticsearch:
hosts: ["https://logs-cluster:9200"]
username: "filebeat"
password: "${FILEBEAT_PASSWORD}"
2. Parse logs for errors:
[WARN].*authentication failed
[ERROR].*TLS.*handshake
[ERROR].*certificate.*expired
Alerting with Search Guard Signals
Failed Login Alert
Detect brute force attacks:
sg_signals/watch/failed-logins.json:
{
"trigger": {
"schedule": {
"interval": ["5m"]
}
},
"checks": [
{
"type": "search",
"name": "failed_logins",
"target": "sg_auditlog-*",
"request": {
"query": {
"bool": {
"must": [
{"term": {"audit_category": "FAILED_LOGIN"}},
{"range": {"@timestamp": {"gte": "now-5m"}}}
]
}
},
"aggs": {
"by_user": {
"terms": {
"field": "audit_request_effective_user",
"size": 10,
"min_doc_count": 5
}
}
}
}
}
],
"checks.0.severity": "warning",
"actions": [
{
"type": "email",
"name": "security_team",
"throttle_period": "1h",
"account": "default",
"to": ["security@example.com"],
"subject": "Alert: Multiple Failed Login Attempts Detected",
"body": "User(s) with failed login attempts:\n\n- : failures\n"
}
]
}
Certificate Expiration Alert
Monitor certificate validity:
sg_signals/watch/cert-expiration.json:
{
"trigger": {
"schedule": {
"daily": {"at": ["09:00"]}
}
},
"checks": [
{
"type": "static",
"name": "check_cert",
"severity": "error",
"target": "cert_expiry_days",
"value": ""
}
],
"actions": [
{
"type": "email",
"name": "ops_team",
"account": "default",
"to": ["ops@example.com"],
"subject": "URGENT: SSL Certificate Expiring Soon",
"body": "Certificate expires in days. Renew immediately!",
"checks": {
"check_cert": {
"severity": ["error", "critical"]
}
}
}
]
}
Script to check certificate:
#!/bin/bash
# cert-check.sh
CERT_FILE="/etc/elasticsearch/certs/node.pem"
EXPIRY_DATE=$(openssl x509 -enddate -noout -in "$CERT_FILE" | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
NOW_EPOCH=$(date +%s)
DAYS_UNTIL_EXPIRY=$(( ($EXPIRY_EPOCH - $NOW_EPOCH) / 86400 ))
# Index result for Signals to monitor
curl -XPOST "https://localhost:9200/cert-monitoring/_doc" \
-H 'Content-Type: application/json' -d "{
\"@timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
\"cert_expiry_days\": $DAYS_UNTIL_EXPIRY,
\"cert_file\": \"$CERT_FILE\"
}"
# Exit with error if < 30 days
if [ $DAYS_UNTIL_EXPIRY -lt 30 ]; then
echo "WARNING: Certificate expires in $DAYS_UNTIL_EXPIRY days!"
exit 1
fi
Cluster Health Alert
Alert on cluster status changes:
sg_signals/watch/cluster-health.json:
{
"trigger": {
"schedule": {
"interval": ["1m"]
}
},
"checks": [
{
"type": "condition",
"name": "cluster_status",
"source": "data.cluster_health.status != 'green'",
"severity": {
"mapping": {
"data.cluster_health.status == 'yellow'": "warning",
"data.cluster_health.status == 'red'": "error"
}
}
}
],
"actions": [
{
"type": "slack",
"name": "ops_channel",
"account": "default",
"channel": "#elasticsearch-alerts",
"text": "⚠️ Cluster status is !\n unassigned shards.",
"checks": {
"cluster_status": {
"severity": ["warning", "error"]
}
}
}
]
}
High JVM Heap Usage Alert
{
"trigger": {
"schedule": {
"interval": ["5m"]
}
},
"checks": [
{
"type": "condition",
"name": "high_heap",
"source": "data.nodes_stats.nodes.any(node -> node.jvm.mem.heap_used_percent > 85)",
"severity": "warning"
}
],
"actions": [
{
"type": "pagerduty",
"name": "oncall",
"account": "default",
"routing_key": "${env.PAGERDUTY_KEY}",
"event_action": "trigger",
"summary": "High JVM heap usage detected",
"severity": "warning"
}
]
}
Unusual Data Access Pattern Alert
Detect potential data exfiltration:
{
"trigger": {
"schedule": {
"interval": ["10m"]
}
},
"checks": [
{
"type": "search",
"name": "bulk_reads",
"target": "sg_auditlog-*",
"request": {
"query": {
"bool": {
"must": [
{"term": {"audit_category": "GRANTED_PRIVILEGES"}},
{"term": {"audit_request_privilege": "indices:data/read/*"}},
{"range": {"@timestamp": {"gte": "now-10m"}}}
]
}
},
"aggs": {
"by_user": {
"terms": {
"field": "audit_request_effective_user",
"size": 10
},
"aggs": {
"doc_count": {
"sum": {
"field": "audit_request_body.size"
}
}
}
}
}
}
}
],
"checks.0.severity": {
"mapping": {
"data.bulk_reads.aggregations.by_user.buckets.any(b -> b.doc_count.value > 100000)": "error"
}
},
"actions": [
{
"type": "email",
"name": "security_team",
"account": "default",
"to": ["security@example.com"],
"subject": "SECURITY ALERT: Unusual bulk data access detected",
"body": "Large data access detected:\n\n- User: , Documents: \n"
}
]
}
Monitoring Dashboards
Kibana Dashboards
Create monitoring dashboards in Kibana:
1. Security Overview Dashboard:
- Failed login attempts (last 24h)
- Authorization failures by user
- Recent configuration changes
- Active sessions count
2. Performance Dashboard:
- Search latency (p50, p95, p99)
- Authentication cache hit rate
- Audit log processing rate
- JVM heap usage
- GC pause times
3. Cluster Health Dashboard:
- Cluster status
- Node count
- Shard allocation
- Disk space usage
- CPU and memory usage
Example visualization (Vega):
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"url": {
"index": "sg_auditlog-*",
"body": {
"query": {
"range": {"@timestamp": {"gte": "now-24h"}}
},
"aggs": {
"over_time": {
"date_histogram": {
"field": "@timestamp",
"interval": "1h"
},
"aggs": {
"by_category": {
"terms": {"field": "audit_category"}
}
}
}
}
}
}
},
"mark": "bar",
"encoding": {
"x": {"field": "@timestamp", "type": "temporal"},
"y": {"field": "count", "type": "quantitative"},
"color": {"field": "audit_category", "type": "nominal"}
}
}
Grafana Dashboards
Example Prometheus queries:
Authentication latency:
histogram_quantile(0.99,
rate(searchguard_authentication_duration_seconds_bucket[5m])
)
Failed logins per minute:
rate(searchguard_failed_login_total[1m])
JVM heap usage:
elasticsearch_jvm_memory_used_bytes{area="heap"}
/
elasticsearch_jvm_memory_max_bytes{area="heap"} * 100
Log Analysis
Common Log Patterns
1. Failed authentication:
[WARN ][c.f.s.a.BackendRegistry] Authentication finally failed for user1 from 192.168.1.100
Action: Investigate source IP, check for brute force.
2. TLS handshake failure:
[WARN ][o.e.h.n.Netty4HttpServerTransport] caught exception while handling client http traffic, closing connection
io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record
Action: Check client TLS configuration, verify certificate validity.
3. Missing privileges:
[WARN ][c.f.s.p.PrivilegesEvaluator] No index-level permission match for user1 [indices:data/write/index] on index logs-2024
Action: Review user roles and permissions.
4. Certificate expiration:
[ERROR][c.f.s.s.DefaultSearchGuardKeyStore] Certificate expired: CN=node-1
Action: Renew certificate immediately.
Log Retention
Configure log rotation:
logrotate config (/etc/logrotate.d/elasticsearch):
/var/log/elasticsearch/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 644 elasticsearch elasticsearch
postrotate
/usr/bin/systemctl reload elasticsearch > /dev/null 2>&1 || true
endscript
}
Incident Response Runbooks
High Failed Login Rate
Detection: > 5 failed logins per user in 5 minutes
Response:
- Identify affected user(s) from audit logs
- Check source IPs - single IP or distributed?
- If single IP: Block at firewall level
- If distributed: Enable account lockout
- Contact user to verify if legitimate activity
- Reset password if account compromised
Cluster Status Yellow/Red
Detection: Cluster health API returns non-green status
Response:
- Check unassigned shards:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason - Check node status:
GET /_cat/nodes?v - Review Elasticsearch logs for errors
- If node failure: Restart failed node
- If disk space: Free up disk or add nodes
- If shard allocation issue: Review allocation settings
Certificate Expiration
Detection: Certificate expires in < 30 days
Response:
- Generate new certificates
- Test new certificates in staging environment
- Schedule maintenance window
- Perform rolling certificate update using hot reload
- Verify TLS connectivity after update
- Update monitoring with new expiration date
Unusual Data Access
Detection: User accessed > 100,000 documents in 10 minutes
Response:
- Identify user from audit logs
- Review query patterns - legitimate or suspicious?
- Check source IP and time of day
- If suspicious: Disable user account immediately
- Contact user to verify if legitimate
- Review data accessed - any sensitive data?
- File security incident report
- Reset user credentials if compromised
Health Checks
Automated Health Checks
Script to run every 5 minutes:
#!/bin/bash
# es-health-check.sh
ES_HOST="https://localhost:9200"
ES_USER="monitoring"
ES_PASS="${MONITORING_PASSWORD}"
# Cluster health
CLUSTER_STATUS=$(curl -s -u $ES_USER:$ES_PASS "$ES_HOST/_cluster/health" | jq -r '.status')
if [ "$CLUSTER_STATUS" != "green" ]; then
echo "ALERT: Cluster status is $CLUSTER_STATUS"
# Send alert
curl -X POST "https://alerts.example.com/api/alert" \
-d "{\"message\": \"Cluster status: $CLUSTER_STATUS\"}"
fi
# Check nodes
NODE_COUNT=$(curl -s -u $ES_USER:$ES_PASS "$ES_HOST/_cat/nodes" | wc -l)
if [ $NODE_COUNT -lt 5 ]; then
echo "ALERT: Only $NODE_COUNT nodes active (expected 5)"
fi
# Check disk space
DISK_USAGE=$(curl -s -u $ES_USER:$ES_PASS "$ES_HOST/_cat/allocation?v&h=disk.percent" | tail -n +2 | sort -rn | head -1)
if [ ${DISK_USAGE%\%} -gt 80 ]; then
echo "ALERT: Disk usage is $DISK_USAGE"
fi
Manual Health Checks
Weekly checklist:
- Review audit logs for security anomalies
- Check certificate expiration dates
- Review failed login patterns
- Verify backup completion
- Check cluster performance metrics
- Review recent configuration changes
Next Steps
- Production Checklist - Complete pre-deployment verification
- Security Hardening - Implement security best practices
- Performance Tuning - Optimize performance
- Audit and Compliance - Configure audit logging