Splunk AppDynamics

Events Service auto-restart scripts that handle graceful and forceful shutdown

Ajay_Prasad2
New Member

Hi Team,

How can we optimize startup of the AppDynamics Events Service Cluster (Events Service and Elasticsearch) under the operating system's service manager, including self-healing mechanisms that automatically detect and recover from problems in the following scenarios?

1. Graceful shutdown
2. Unplanned downtime (unexpected VM shutdown)
3. Accidental process kill (Events Service and Elasticsearch)
4. Optional: OOM (out of memory)


asimit
Path Finder
To optimize startup of the AppDynamics Events Service Cluster with self-healing, I recommend running both components as systemd service units with appropriate restart and resource settings. This approach covers all of your scenarios: graceful shutdown, unplanned downtime, accidental process termination, and OOM situations.

Here's a comprehensive solution:

1. Create systemd service units for both Events Service and Elasticsearch:

For Events Service (events-service.service):
```
[Unit]
Description=AppDynamics Events Service
After=network.target elasticsearch.service
Requires=elasticsearch.service

[Service]
Type=forking
User=appdynamics
Group=appdynamics
WorkingDirectory=/opt/appdynamics/events-service
ExecStart=/opt/appdynamics/events-service/bin/events-service.sh start
ExecStop=/opt/appdynamics/events-service/bin/events-service.sh stop
PIDFile=/opt/appdynamics/events-service/pid.txt
TimeoutStartSec=300
TimeoutStopSec=120

# Restart settings for self-healing
Restart=always
RestartSec=60

# OOM handling
OOMScoreAdjust=-900

# Health check script
ExecStartPost=/opt/appdynamics/scripts/events-service-health-check.sh

[Install]
WantedBy=multi-user.target
```

For Elasticsearch (elasticsearch.service):
```
[Unit]
Description=Elasticsearch for AppDynamics
After=network.target

[Service]
Type=forking
User=appdynamics
Group=appdynamics
WorkingDirectory=/opt/appdynamics/events-service/elasticsearch
ExecStart=/opt/appdynamics/events-service/elasticsearch/bin/elasticsearch -d -p pid
ExecStop=/bin/kill -SIGTERM $MAINPID
PIDFile=/opt/appdynamics/events-service/elasticsearch/pid

# Restart settings for self-healing
Restart=always
RestartSec=60

# Resource limits for OOM prevention
LimitNOFILE=65536
LimitNPROC=4096
LimitMEMLOCK=infinity
LimitAS=infinity

# OOM handling
OOMScoreAdjust=-800

[Install]
WantedBy=multi-user.target
```
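
Before enabling anything, it's worth sanity-checking the unit files. A minimal check, assuming you copy them to /etc/systemd/system/ and run systemctl daemon-reload first (and, for LimitMEMLOCK to matter, that bootstrap.memory_lock is enabled in your elasticsearch.yml):

```bash
# Syntax-check both unit files (reports unknown directives, missing executables, etc.)
systemd-analyze verify /etc/systemd/system/elasticsearch.service
systemd-analyze verify /etc/systemd/system/events-service.service

# Confirm the limits systemd will apply to the Elasticsearch unit
systemctl show elasticsearch.service -p LimitNOFILE -p LimitMEMLOCK
```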

2. Create a health check script (events-service-health-check.sh):
```bash
#!/bin/bash

# Health check for the Events Service, run from ExecStartPost of
# events-service.service after both units have been started.
EVENT_SERVICE_PORT=9080   # API port; depending on your install the healthcheck
                          # endpoint may be exposed on the admin port instead
MAX_RETRIES=3
RETRY_INTERVAL=20

check_events_service() {
  for i in $(seq 1 $MAX_RETRIES); do
    # -f makes curl fail on HTTP error responses, not only on connection errors
    if curl -sf "http://localhost:${EVENT_SERVICE_PORT}/healthcheck" > /dev/null; then
      echo "Events Service is running properly."
      return 0
    fi
    echo "Attempt $i: Events Service health check failed. Waiting $RETRY_INTERVAL seconds..."
    sleep $RETRY_INTERVAL
  done

  echo "Events Service failed health check after $MAX_RETRIES attempts."
  return 1
}

check_elasticsearch() {
  for i in $(seq 1 $MAX_RETRIES); do
    if curl -sf http://localhost:9200/_cluster/health | grep -q '"status":"green"\|"status":"yellow"'; then
      echo "Elasticsearch is running properly."
      return 0
    fi
    echo "Attempt $i: Elasticsearch health check failed. Waiting $RETRY_INTERVAL seconds..."
    sleep $RETRY_INTERVAL
  done

  echo "Elasticsearch failed health check after $MAX_RETRIES attempts."
  return 1
}

main() {
  # Give both JVMs time to finish their initial startup
  sleep 30

  if ! check_elasticsearch; then
    # Note: with Requires=elasticsearch.service, restarting Elasticsearch
    # also takes the Events Service through a restart.
    echo "Restarting Elasticsearch due to failed health check..."
    systemctl restart elasticsearch.service
  fi

  if ! check_events_service; then
    # Do not call 'systemctl restart events-service.service' here: this script
    # runs from that unit's own ExecStartPost. Exiting non-zero marks the start
    # as failed, and Restart=always then takes care of restarting it.
    echo "Events Service failed its health check; marking startup as failed."
    exit 1
  fi
}

main
```
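
You can dry-run the health check by hand before wiring it into ExecStartPost (the path below matches the one referenced in the unit file; note that the script restarts Elasticsearch if that check fails):

```bash
sudo /opt/appdynamics/scripts/events-service-health-check.sh
echo "health check exit code: $?"   # non-zero means systemd would treat the start as failed
```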

3. Set up a watchdog script for OOM monitoring (oom-watchdog.sh, run from cron every 5 minutes):
```bash
#!/bin/bash

# OOM watchdog: starts the services if they are down and restarts them when
# JVM heap usage crosses a threshold. Intended to run from root's crontab.
LOG_FILE="/var/log/appdynamics/oom-watchdog.log"
ES_HEAP_THRESHOLD=90
ES_SERVICE="elasticsearch.service"
EVENTS_HEAP_THRESHOLD=90
EVENTS_SERVICE="events-service.service"

mkdir -p "$(dirname "$LOG_FILE")"

log_message() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
}

# Convert an -Xmx value such as 4g or 4096m to megabytes.
xmx_to_mb() {
  local digits=${1//[^0-9]/}
  if [[ $1 == *[gG] ]]; then
    echo $((digits * 1024))
  else
    echo "$digits"
  fi
}

# Used heap in MB for a given PID, summed from the jstat -gc "used" columns
# (S0U + S1U + EU + OU). Requires a JDK jstat on the PATH.
used_heap_mb() {
  jstat -gc "$1" 2>/dev/null | tail -n 1 | awk '{printf "%d", ($3+$4+$6+$8)/1024}'
}

# Generic heap check: $1 = systemd unit, $2 = threshold percentage, $3 = label
check_heap() {
  local unit=$1 threshold=$2 label=$3
  local pid xmx max_heap current_heap heap_pct

  # Ask systemd for the main PID instead of pattern-matching process names
  pid=$(systemctl show -p MainPID --value "$unit")
  if [ -z "$pid" ] || [ "$pid" = "0" ]; then
    return 1
  fi

  xmx=$(ps -p "$pid" -o cmd= | grep -o "Xmx[0-9]*[mMgG]" | head -1)
  if [ -z "$xmx" ]; then
    return 1
  fi

  max_heap=$(xmx_to_mb "$xmx")
  current_heap=$(used_heap_mb "$pid")
  if [ -z "$current_heap" ] || [ "$max_heap" -eq 0 ]; then
    return 1
  fi

  heap_pct=$((current_heap * 100 / max_heap))
  if [ "$heap_pct" -gt "$threshold" ]; then
    log_message "$label heap usage at ${heap_pct}% - exceeds threshold of ${threshold}%. Restarting service."
    systemctl restart "$unit"
    return 0
  fi
  return 1
}

# Make sure the services are running at all
if ! systemctl is-active --quiet "$ES_SERVICE"; then
  log_message "Elasticsearch service is not running. Attempting to start."
  systemctl start "$ES_SERVICE"
fi

if ! systemctl is-active --quiet "$EVENTS_SERVICE"; then
  log_message "Events Service is not running. Attempting to start."
  systemctl start "$EVENTS_SERVICE"
fi

# Restart on excessive heap usage
check_heap "$ES_SERVICE" "$ES_HEAP_THRESHOLD" "Elasticsearch"
check_heap "$EVENTS_SERVICE" "$EVENTS_HEAP_THRESHOLD" "Events Service"
```
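
Before scheduling it in cron, you can run the watchdog once by hand and confirm it behaves as expected; on a healthy system it exits quietly and the log only receives entries when it has to intervene:

```bash
sudo /opt/appdynamics/scripts/oom-watchdog.sh
tail -n 20 /var/log/appdynamics/oom-watchdog.log 2>/dev/null
```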

4. Enable and start the services:
```bash
# Make scripts executable
chmod +x /opt/appdynamics/scripts/events-service-health-check.sh
chmod +x /opt/appdynamics/scripts/oom-watchdog.sh

# Place service files
cp elasticsearch.service /etc/systemd/system/
cp events-service.service /etc/systemd/system/

# Reload systemd, enable and start services
systemctl daemon-reload
systemctl enable elasticsearch.service
systemctl enable events-service.service
systemctl start elasticsearch.service
systemctl start events-service.service

# Add the OOM watchdog to cron
(crontab -l 2>/dev/null; echo "*/5 * * * * /opt/appdynamics/scripts/oom-watchdog.sh") | crontab -
```

This setup addresses all your scenarios:

1. Graceful Shutdown: The ExecStop commands give each component a clean shutdown, and the Requires=/After= ordering means systemd stops the Events Service before Elasticsearch
2. Unplanned Downtime: Because both units are enabled (WantedBy=multi-user.target), they come back up automatically when the VM boots; Restart=always covers crashes while the VM is running
3. Accidental Process Kill: Restart=always makes systemd restart a unit whose main process dies unexpectedly (a verification sketch follows below)
4. OOM Situations: Combination of OOMScoreAdjust, resource limits, and the custom watchdog script
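
As a sketch of how to verify scenario 3, you can kill the Events Service process and watch systemd bring it back (RestartSec=60, so allow a bit more than a minute):

```bash
# Simulate an accidental kill of the main process
sudo kill -9 "$(systemctl show -p MainPID --value events-service.service)"

# After roughly RestartSec (60s) the unit should be active again
sleep 70
systemctl status events-service.service --no-pager
journalctl -u events-service.service -n 20 --no-pager
```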

You may need to adjust paths and user/group settings to match your environment. This implementation provides comprehensive self-healing for AppDynamics Events Service and Elasticsearch. 
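
If the appdynamics user and group referenced in the unit files don't exist yet, one way to create them (an example only, adjust to your standards) is:

```bash
# Create a system account without a login shell and hand it the install directory
sudo useradd --system --user-group --shell /sbin/nologin --home-dir /opt/appdynamics appdynamics
sudo chown -R appdynamics:appdynamics /opt/appdynamics
```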