Hi Team,
How can we optimize the startup process of the AppDynamics Events Service Cluster (Events Service and Elasticsearch) under the operating system's system and service manager, including self-healing mechanisms that automatically detect and resolve issues arising during startup, for the scenarios below?
1. Graceful Shutdown
2. Unplanned Downtime (unexpected VM shutdown)
3. Accidental process kill (Events Service and Elasticsearch)
4. Optional: OOM (Out Of Memory)
To optimize the AppDynamics Events Service Cluster startup process and add self-healing, I recommend running both components as systemd service units with appropriate dependency, restart, and resource settings. This approach covers all four scenarios: graceful shutdown, unplanned downtime, accidental process termination, and OOM situations.
Here's a comprehensive solution:
1. Create systemd service units for both Events Service and Elasticsearch:
For Events Service (events-service.service):
```
[Unit]
Description=AppDynamics Events Service
After=network.target elasticsearch.service
Requires=elasticsearch.service
[Service]
Type=forking
User=appdynamics
Group=appdynamics
WorkingDirectory=/opt/appdynamics/events-service
ExecStart=/opt/appdynamics/events-service/bin/events-service.sh start
ExecStop=/opt/appdynamics/events-service/bin/events-service.sh stop
PIDFile=/opt/appdynamics/events-service/pid.txt
TimeoutStartSec=300
TimeoutStopSec=120
# Restart settings for self-healing
Restart=always
RestartSec=60
# OOM handling
OOMScoreAdjust=-900
# Post-start health check (a non-zero exit marks the start as failed so Restart= retries it)
ExecStartPost=/opt/appdynamics/scripts/events-service-health-check.sh
[Install]
WantedBy=multi-user.target
```
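A note on the unit above: Restart=always combined with a start that fails quickly (for example, a corrupted configuration) can produce a tight restart loop. A drop-in like the sketch below caps the retries; the file name is illustrative and the numbers should be tuned to your environment.
```bash
# Cap restart attempts so a persistently failing start does not loop forever
sudo mkdir -p /etc/systemd/system/events-service.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/events-service.service.d/restart-limits.conf
[Unit]
# Allow at most 5 start attempts within 10 minutes, then leave the unit in a
# failed state for manual investigation
StartLimitIntervalSec=600
StartLimitBurst=5
EOF
sudo systemctl daemon-reload
```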
For Elasticsearch (elasticsearch.service):
```
[Unit]
Description=Elasticsearch for AppDynamics
After=network.target
[Service]
Type=forking
User=appdynamics
Group=appdynamics
WorkingDirectory=/opt/appdynamics/events-service/elasticsearch
ExecStart=/opt/appdynamics/events-service/elasticsearch/bin/elasticsearch -d -p pid
ExecStop=/bin/kill -SIGTERM $MAINPID
PIDFile=/opt/appdynamics/events-service/elasticsearch/pid
# Restart settings for self-healing
Restart=always
RestartSec=60
# Resource limits for OOM prevention
LimitNOFILE=65536
LimitNPROC=4096
LimitMEMLOCK=infinity
LimitAS=infinity
# OOM handling
OOMScoreAdjust=-800
[Install]
WantedBy=multi-user.target
```
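Elasticsearch startup also depends on a couple of kernel settings that no unit file can provide. The snippet below is a sketch of the commonly documented prerequisites; the sysctl file name is illustrative, and LimitMEMLOCK=infinity only pays off if bootstrap.memory_lock is enabled in the bundled elasticsearch.yml.
```bash
# Persist the kernel settings Elasticsearch expects, so they survive VM reboots too
cat <<'EOF' | sudo tee /etc/sysctl.d/99-appd-elasticsearch.conf
# Elasticsearch needs a high mmap count for its index files
vm.max_map_count=262144
# Keep swapping to a minimum so heap pages stay resident
vm.swappiness=1
EOF
sudo sysctl --system
```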
2. Create a health check script (events-service-health-check.sh):
```bash
#!/bin/bash
# Health check for Events Service
EVENT_SERVICE_PORT=9080
MAX_RETRIES=3
RETRY_INTERVAL=20
check_events_service() {
    for i in $(seq 1 $MAX_RETRIES); do
        # -f makes curl return non-zero on HTTP errors (e.g. 500), not just on connection failures
        if curl -sf "http://localhost:${EVENT_SERVICE_PORT}/healthcheck" > /dev/null; then
            echo "Events Service is running properly."
            return 0
        else
            echo "Attempt $i: Events Service health check failed. Waiting $RETRY_INTERVAL seconds..."
            sleep "$RETRY_INTERVAL"
        fi
    done
    echo "Events Service failed health check after $MAX_RETRIES attempts."
    return 1
}
check_elasticsearch() {
    for i in $(seq 1 $MAX_RETRIES); do
        # Accept green or yellow cluster status; -f guards against HTTP error responses
        if curl -sf http://localhost:9200/_cluster/health | grep -q '"status":"green"\|"status":"yellow"'; then
            echo "Elasticsearch is running properly."
            return 0
        else
            echo "Attempt $i: Elasticsearch health check failed. Waiting $RETRY_INTERVAL seconds..."
            sleep "$RETRY_INTERVAL"
        fi
    done
    echo "Elasticsearch failed health check after $MAX_RETRIES attempts."
    return 1
}
main() {
    # Give both JVMs time to finish their initial startup
    sleep 30
    # This script runs as ExecStartPost of events-service.service, so it must not
    # call "systemctl restart" on that unit (or its Requires= dependency) from
    # inside its own start job. Exiting non-zero marks the start as failed and
    # lets systemd's Restart=always drive the retry instead.
    if ! check_elasticsearch; then
        echo "Elasticsearch failed its post-start health check."
        exit 1
    fi
    if ! check_events_service; then
        echo "Events Service failed its post-start health check."
        exit 1
    fi
    exit 0
}
main
```
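Before relying on the health check as an ExecStartPost step, it helps to run it by hand once and confirm that its messages land in the journal. A quick verification sketch, using the paths configured above:
```bash
# Run the health check manually and show its exit code
sudo -u appdynamics /opt/appdynamics/scripts/events-service-health-check.sh
echo "health check exit code: $?"

# ExecStartPost output is captured in the journal next to the rest of the unit's logs
journalctl -u events-service.service --since "15 minutes ago" --no-pager
journalctl -u elasticsearch.service --since "15 minutes ago" --no-pager
```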
3. Set up a watchdog script for OOM monitoring (oom-watchdog.sh). It samples heap usage with the JDK's jstat tool, so run it from root's crontab every 5 minutes on a host where the JDK is on the PATH:
```bash
#!/bin/bash
LOG_FILE="/var/log/appdynamics/oom-watchdog.log"
mkdir -p "$(dirname "$LOG_FILE")"
ES_HEAP_THRESHOLD=90
ES_SERVICE="elasticsearch.service"
EVENTS_HEAP_THRESHOLD=90
EVENTS_SERVICE="events-service.service"
log_message() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
}
check_elasticsearch_memory() {
    # Find the Elasticsearch JVM and its configured -Xmx value.
    # NOTE: adjust these grep patterns if they do not uniquely identify the JVMs in your environment.
    ES_PID=$(ps -C java -o pid=,cmd= | grep elasticsearch | awk '{print $1}' | head -1)
    ES_XMX=$(ps -C java -o cmd= | grep elasticsearch | grep -o "Xmx[0-9]*[mMgG]" | head -1)
    if [ -z "$ES_PID" ] || [ -z "$ES_XMX" ]; then
        log_message "Could not determine Elasticsearch PID or -Xmx setting; skipping heap check."
        return 1
    fi
    # Convert the -Xmx value to MB
    if [[ $ES_XMX == *[gG] ]]; then
        ES_MAX_HEAP=$(( ${ES_XMX//[^0-9]/} * 1024 ))
    else
        ES_MAX_HEAP=${ES_XMX//[^0-9]/}
    fi
    # Used heap in MB: jstat -gc reports S0U, S1U, EU, OU (columns 3, 4, 6, 8) in KB
    ES_CURRENT_HEAP=$(jstat -gc "$ES_PID" 2>/dev/null | tail -n 1 | awk '{print int(($3+$4+$6+$8)/1024)}')
    if [ -z "$ES_CURRENT_HEAP" ]; then
        log_message "jstat could not read Elasticsearch heap usage; skipping heap check."
        return 1
    fi
    ES_HEAP_PCT=$((ES_CURRENT_HEAP * 100 / ES_MAX_HEAP))
    if [ "$ES_HEAP_PCT" -gt "$ES_HEAP_THRESHOLD" ]; then
        log_message "Elasticsearch heap usage at ${ES_HEAP_PCT}% - exceeds threshold of ${ES_HEAP_THRESHOLD}%. Restarting service."
        systemctl restart $ES_SERVICE
        return 0
    fi
    return 1
}
check_events_service_memory() {
    # Find the Events Service JVM; exclude Elasticsearch, whose install path also contains "events-service"
    EVENTS_PID=$(ps -C java -o pid=,cmd= | grep events-service | grep -v elasticsearch | awk '{print $1}' | head -1)
    EVENTS_XMX=$(ps -C java -o cmd= | grep events-service | grep -v elasticsearch | grep -o "Xmx[0-9]*[mMgG]" | head -1)
    if [ -z "$EVENTS_PID" ] || [ -z "$EVENTS_XMX" ]; then
        log_message "Could not determine Events Service PID or -Xmx setting; skipping heap check."
        return 1
    fi
    # Convert the -Xmx value to MB
    if [[ $EVENTS_XMX == *[gG] ]]; then
        EVENTS_MAX_HEAP=$(( ${EVENTS_XMX//[^0-9]/} * 1024 ))
    else
        EVENTS_MAX_HEAP=${EVENTS_XMX//[^0-9]/}
    fi
    # Used heap in MB: jstat -gc reports S0U, S1U, EU, OU (columns 3, 4, 6, 8) in KB
    EVENTS_CURRENT_HEAP=$(jstat -gc "$EVENTS_PID" 2>/dev/null | tail -n 1 | awk '{print int(($3+$4+$6+$8)/1024)}')
    if [ -z "$EVENTS_CURRENT_HEAP" ]; then
        log_message "jstat could not read Events Service heap usage; skipping heap check."
        return 1
    fi
    EVENTS_HEAP_PCT=$((EVENTS_CURRENT_HEAP * 100 / EVENTS_MAX_HEAP))
    if [ "$EVENTS_HEAP_PCT" -gt "$EVENTS_HEAP_THRESHOLD" ]; then
        log_message "Events Service heap usage at ${EVENTS_HEAP_PCT}% - exceeds threshold of ${EVENTS_HEAP_THRESHOLD}%. Restarting service."
        systemctl restart $EVENTS_SERVICE
        return 0
    fi
    return 1
}
# Check if services are running
if ! systemctl is-active --quiet $ES_SERVICE; then
    log_message "Elasticsearch service is not running. Attempting to start."
    systemctl start $ES_SERVICE
fi
if ! systemctl is-active --quiet $EVENTS_SERVICE; then
    log_message "Events Service is not running. Attempting to start."
    systemctl start $EVENTS_SERVICE
fi
# Check memory usage
check_elasticsearch_memory
check_events_service_memory
```
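If you would rather keep everything under systemd instead of cron, the same script can be driven by a timer. A minimal sketch, with illustrative unit names:
```bash
# Optional alternative to cron: run the watchdog via a systemd timer
cat <<'EOF' | sudo tee /etc/systemd/system/oom-watchdog.service
[Unit]
Description=AppDynamics OOM watchdog (one-shot)

[Service]
Type=oneshot
ExecStart=/opt/appdynamics/scripts/oom-watchdog.sh
EOF

cat <<'EOF' | sudo tee /etc/systemd/system/oom-watchdog.timer
[Unit]
Description=Run the AppDynamics OOM watchdog every 5 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now oom-watchdog.timer
```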
4. Enable and start the services:
```bash
# Make scripts executable
chmod +x /opt/appdynamics/scripts/events-service-health-check.sh
chmod +x /opt/appdynamics/scripts/oom-watchdog.sh
# Place service files
cp elasticsearch.service /etc/systemd/system/
cp events-service.service /etc/systemd/system/
# Reload systemd, enable and start services
systemctl daemon-reload
systemctl enable elasticsearch.service
systemctl enable events-service.service
systemctl start elasticsearch.service
systemctl start events-service.service
# Add the OOM watchdog to root's crontab (it needs root privileges to run systemctl)
(crontab -l 2>/dev/null; echo "*/5 * * * * /opt/appdynamics/scripts/oom-watchdog.sh") | crontab -
```
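Once everything is enabled, it is worth verifying the self-healing behaviour rather than assuming it, ideally on a non-production node first. A quick test sketch for the accidental-kill scenario:
```bash
# Simulate an accidental kill of the Events Service JVM and watch systemd recover it
EVENTS_PID=$(systemctl show -p MainPID --value events-service.service)
echo "Killing Events Service PID ${EVENTS_PID}"
sudo kill -9 "$EVENTS_PID"

# RestartSec=60 means recovery takes roughly a minute; follow the state live
watch -n 5 'systemctl status events-service.service --no-pager | head -n 12'

# The restart and its reason are recorded in the journal
journalctl -u events-service.service -n 50 --no-pager
```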
This setup addresses all your scenarios:
1. Graceful Shutdown: The ExecStop commands, together with TimeoutStopSec, give each component time to shut down cleanly
2. Unplanned Downtime: Because both units are enabled (WantedBy=multi-user.target), systemd starts them automatically when the VM boots, in the correct order thanks to After=/Requires=
3. Accidental Process Kill: Restart=always restarts a killed process automatically after RestartSec
4. OOM Situations: A combination of OOMScoreAdjust, the resource limits, and the custom watchdog script; the snippet below shows how to confirm whether the kernel OOM killer was actually involved
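To tell a kernel OOM kill apart from an ordinary application crash, the kernel log and systemd's own bookkeeping are the places to look. A small diagnostic sketch:
```bash
# Look for kernel OOM-killer activity (covers the case where the kernel killed a JVM)
journalctl -k --since "1 day ago" | grep -iE "out of memory|oom-killer" \
  || echo "No kernel OOM events in the last day"

# systemd records how each service last exited (exit code, signal, oom-kill, etc.)
systemctl show events-service.service -p Result -p ExecMainStatus -p ExecMainCode
systemctl show elasticsearch.service -p Result -p ExecMainStatus -p ExecMainCode
```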
You may need to adjust paths and user/group settings to match your environment. This implementation provides comprehensive self-healing for AppDynamics Events Service and Elasticsearch.