Automate debug-level logging in Machine agent usin...

Roland_UN

The AppDynamics Machine Agent supports remediation scripts, which allow you to define automated or manual actions in response to specific alerts or conditions. These scripts can be triggered when a predefined health rule violation occurs, enabling proactive responses to issues in your environment. Below is an overview of how remediation scripts work in the Machine Agent and how to configure and use them:

What Are Remediation Scripts?

Remediation scripts are custom scripts (written in languages like Shell, Python, Batch, or PowerShell) that the Machine Agent executes when a health rule violation triggers them.

These scripts can perform various actions, such as restarting services, freeing up memory, or notifying teams.

Use Cases for Remediation Scripts include:

1. Restarting Services or Applications:

• Automatically restart a failed service (e.g., web server or database).

2. Clearing Logs or Temporary Files:

• Free up disk space by removing unnecessary files.

3. Scaling Infrastructure:

• Trigger an API call to scale up/down infrastructure (e.g., AWS, Kubernetes).

4. Sending Custom Notifications:

• Send notifications to external systems like Slack, PagerDuty, or email.

5. Custom Troubleshooting Steps:

• Collect diagnostics like thread dumps, heap dumps, or system logs.
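As an illustration of the second use case, a disk-cleanup remediation script could look like the sketch below. The function name purge_old_files and its two arguments are assumptions made for this example and are not part of the Machine Agent.

```shell
#!/bin/bash
# Sketch of a disk-cleanup helper for a remediation script.
# Assumption: purge_old_files and its arguments are illustrative names only.

# Delete regular files under a directory that were last modified
# more than the given number of days ago, printing each removed path.
purge_old_files() {
    local dir=$1
    local max_age_days=$2

    if [ ! -d "$dir" ]; then
        echo "Directory '$dir' not found; nothing to clean." >&2
        return 1
    fi

    # -mtime +N matches files whose age in whole days is greater than N.
    find "$dir" -type f -mtime +"$max_age_days" -print -delete
}
```

A script built on this helper would end with a call such as purge_old_files /path/to/temp 7, where the path and the threshold are chosen for the host in question.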

Step-by-step guide

The steps to configure a remediation script are documented here → https://docs.appdynamics.com/appd/24.x/25.4/en/splunk-appdynamics-essentials/alert-and-respond/actio....

Practical example:

Use case: enable debug-level or trace-level logs on a health rule (HR) violation for troubleshooting purposes.

1. Setting the health rule

Docs: https://docs.appdynamics.com/appd/24.x/25.4/en/splunk-appdynamics-essentials/alert-and-respond/confi...

1. Select HR type

Remediation actions are not available for servers.
You can create and run a remediation action for a health rule with application, tier, or node as an affected entity. Ensure that you select the same entities when you define the Object Scope for the associated policy.

2. Affects Nodes

3. Select specific Nodes.

4. Select one or multiple nodes

5. Add conditions for the HR

6. Select a single metric or Metrics expression (here we select Single Metric value (cpu|%Busy))

2. Setting the action

Docs: https://docs.appdynamics.com/appd/24.x/25.4/en/splunk-appdynamics-essentials/alert-and-respond/actio...

1. Set the action name.

2. The path to the trace.sh file

3. The path to log files

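4. Script timeout in minutes, set to 5 here. The example script in this article waits 10 minutes before reverting, so the timeout should be longer than the script's total run time.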

5. Set email for approval (if required) and Save.
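Before attaching a script to an action, it can help to check by hand whether it finishes within the time limit you plan to configure. The sketch below uses the coreutils timeout command for that; the helper name runs_within is an assumption for this example, and it only emulates a time limit locally rather than reproducing how the Machine Agent enforces its own timeout.

```shell
#!/bin/bash
# Sketch: run a command under a time limit and report whether it finished.
# Assumption: runs_within is an illustrative helper, not an AppDynamics tool.

runs_within() {
    local limit_secs=$1
    shift

    # timeout exits with status 124 when it had to stop the command.
    timeout "$limit_secs" "$@"
    local status=$?

    if [ "$status" -eq 124 ]; then
        echo "Timed out after ${limit_secs}s: $*"
        return 1
    fi
    return "$status"
}
```

For example, runs_within 300 ./trace.sh would report a timeout for any script that needs longer than five minutes.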

3. Setting the policies to trigger the action

Docs: https://docs.appdynamics.com/appd/24.x/25.4/en/splunk-appdynamics-essentials/alert-and-respond/polic...

1. Policy name

2. Enabled

3. Select HR violation event.

4. Select specific Health Rules.

5. Select the configured Health Rules.

6. Select specific objects.

7. From Tiers and Nodes.

8. Select Nodes.

9. Specific nodes.

10. Select one or multiple nodes.

11. Add the action to be executed.

On the agent's side.

Create the trace.sh script and place it in the <agent install directory>/local-scripts/ directory:

#!/bin/bash

# Define the target file (a relative path, resolved against the directory the script runs from)
TARGET_FILE="matest/conf/logging/log4j.xml"

# Back up the original file
cp "$TARGET_FILE" "${TARGET_FILE}.backup"

# Replace every level="<from>" attribute in the target file with level="<to>"
update_logging_level() {
    local from=$1
    local to=$2
    echo "Updating logging level from '$from' to '$to'..."
    if sed -i "s/level=\"$from\"/level=\"$to\"/g" "$TARGET_FILE"; then
        echo "Logging level successfully updated to '$to'."
    else
        echo "Failed to update logging level."
        exit 1
    fi
}

# Set the logging level to 'trace'
update_logging_level "info" "trace"

# Wait for 10 minutes (600 seconds)
echo "Waiting for 10 minutes..."
sleep 600

# Revert the logging level back to 'info'
# (the search pattern must be "trace" here; searching for "info" again would change nothing)
update_logging_level "trace" "info"
echo "Logging level reverted to 'info'."

When the action is triggered, the script changes the log level from info to trace and reverts the change after 10 minutes.
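The script already takes a backup copy, so one possible variation is to revert by restoring that copy instead of running a second sed pass. This puts the file back exactly as it was, including any loggers that were not at info to begin with. The sketch below is only an illustration: the function name enable_trace_temporarily and its argument order are assumptions, not part of the original script.

```shell
#!/bin/bash
# Variation on trace.sh: restore the saved copy instead of editing the file back.
# Assumption: enable_trace_temporarily and its arguments are illustrative only.

enable_trace_temporarily() {
    local target=$1      # path to the log4j.xml file
    local wait_secs=$2   # how long to keep trace logging enabled
    local backup="${target}.backup"

    # Keep an untouched copy of the configuration.
    cp "$target" "$backup" || return 1

    # Switch every logger that is at info to trace.
    sed -i 's/level="info"/level="trace"/g' "$target" || return 1
    echo "Trace logging enabled for ${wait_secs}s."

    sleep "$wait_secs"

    # Restore the exact original content from the backup.
    cp "$backup" "$target" || return 1
    echo "Original logging configuration restored."
}
```

To mirror the original behaviour, the script would end with enable_trace_temporarily "matest/conf/logging/log4j.xml" 600.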

Prerequisites for Local Script Actions

The Machine Agent must be installed and running on the host on which the script executes. To see a list of installed Machine Agents for your application, click View machines with machine-agent installed in the bottom left corner of the remediation script configuration window.
To be able to run remediation scripts, the Machine Agent must be connected to a SaaS Controller via SSL. Remediation script execution is disabled if the Machine Agent connects to a SaaS Controller on an unsecured (non-SSL) HTTP connection.
The Machine Agent and the APM agent must be on the same host.
The Machine Agent OS user must have full permissions to the script file and the log files generated by the script and/or its associated child processes.
The script must be placed in <agent install directory>\local-scripts.
The script must be available on the host on which it executes.
Processes spawned from the scripts must be daemon processes.
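Two of these prerequisites (script location and file permissions) can be checked from a shell before the action is enabled. The following is a minimal sketch that assumes it is run as the Machine Agent OS user; the function name check_remediation_script is hypothetical, and the agent install directory is passed in by the caller.

```shell
#!/bin/bash
# Pre-flight check sketch for a local remediation script.
# Assumption: check_remediation_script is an illustrative helper; run it as
# the OS user that the Machine Agent runs under.

check_remediation_script() {
    local agent_home=$1    # Machine Agent install directory
    local script_name=$2   # for example trace.sh
    local script_path="$agent_home/local-scripts/$script_name"

    # The script must be present in the local-scripts directory.
    if [ ! -f "$script_path" ]; then
        echo "MISSING: $script_path"
        return 1
    fi

    # The current user must be able to read and execute it.
    if [ ! -r "$script_path" ] || [ ! -x "$script_path" ]; then
        echo "NOT EXECUTABLE by $(id -un): $script_path"
        return 1
    fi

    echo "OK: $script_path"
}
```

Calling check_remediation_script with the agent install directory and trace.sh prints a single status line and returns a non-zero status when a check fails.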