The AppDynamics Machine Agent supports remediation scripts, which allow you to define automated or manual actions in response to specific alerts or conditions. These scripts can be triggered when a predefined health rule violation occurs, enabling proactive responses to issues in your environment. Below is an overview of how remediation scripts work in the Machine Agent and how to configure and use them:
What Are Remediation Scripts?
Remediation scripts are custom scripts (written in languages like Shell, Python, Batch, or PowerShell) that are executed by the Machine Agent when triggered by Health rule violations
These scripts can perform various actions, such as restarting services, freeing up memory, or notifying teams.
Use Cases for Remediation Scripts include:
1. Restarting Services or Applications:
• Automatically restart a failed service (e.g., web server or database).
2. Clearing Logs or Temporary Files:
• Free up disk space by removing unnecessary files.
3. Scaling Infrastructure:
• Trigger an API call to scale up/down infrastructure (e.g., AWS, Kubernetes).
4. Sending Custom Notifications:
• Send notifications to external systems like Slack, PagerDuty, or email.
5. Custom Troubleshooting Steps:
• Collect diagnostics like thread dumps, heap dumps, or system logs.
The steps to configure a remediation script are documented here → https://docs.appdynamics.com/appd/24.x/25.4/en/splunk-appdynamics-essentials/alert-and-respond/actio....
Practical example:
Use case: enable debug-level or trace-level logs on HR violation for troubleshooting purposes.
1. Select HR type
2. Affects Nodes
3. Select specific Nodes.
4. Select one or multiple nodes
5. Add conditions for the HR
6. Select a single metric or Metrics expression (here we select Single Metric value (cpu|%Busy))
2. Setting the action
1. Set the action name.
2. The path to the trace.sh file
3. The path to log files
4. Script timeout in minutes set to 5
5. Set email for approval (if required) and Save.
3. Setting the policies to trigger the action
1. Policy name
2. Enabled
3. Select HR violation event.
4. Select specific Health Rules.
5. Selected the configured Health Rules.
6. Select specific objects.
7. From Tiers and Nodes.
8. Select Nodes.
9. Specific nodes.
10 Selected one or multiple nodes.
11. Add the action to be executed.
#!/bin/bash # Define the target file TARGET_FILE="matest/conf/logging/log4j.xml" # Backup the original file cp "$TARGET_FILE" "${TARGET_FILE}.backup" # Function to update the logging level update_logging_level() { local level=$1 echo "Updating logging level to '$level'..." # Use sed to change all loggers with level="info" to the desired level sed -i "s/level=\"info\"/level=\"$level\"/g" "$TARGET_FILE" if [ $? -eq 0 ]; then echo "Logging level successfully updated to '$level'." else echo "Failed to update logging level." exit 1 fi } # Set the logging level to 'trace' update_logging_level "trace" # Wait for 10 minutes (600 seconds) echo "Waiting for 10 minutes..." sleep 600 # Revert the logging level back to 'info' update_logging_level "info" echo "Logging level reverted to 'info'." |
When the action is triggered, the script will change the log level from info to debug and revert the change after 10 minutes.