
What are the Best Practices for collectd.conf for Infrastructure Monitoring?

sloshburch
Splunk Employee

I'm using the HTTP Event Collector (HEC) and storing the data in a metrics index (rather than an event index), since that approach is clearly the best practice for sending collectd data to Splunk.

I found five different documents where Splunk defines collectd configuration for OS metric-based data collection.

What is the best practice for collecting OS metrics with collectd and sending them to Splunk?

1 Solution

sloshburch
Splunk Employee

The Splunk Product Best Practices team provided this response. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices.

The industry has come a long way since Splunk first released the Splunk Add-on for Unix and Linux. Many folks are migrating to the more performant collectd for gathering OS performance data! Get your data into Splunk as metrics, use the Splunk App for Infrastructure, and you've got a game changer for operational insights!

All that said, there are subtle differences and redundancies across Splunk's collectd documentation. While the instructions for creating the HTTP Event Collector (HEC) token and for installing collectd are both well defined, there can be confusion, as the question points out, about the best way to define the collectd configuration file.

Below is the collectd.conf configuration that I use. For context on where this configuration fits into the Splunk picture, check out Manually configure metrics collection for *nix in the Splunk App for Infrastructure documentation. Before you copy and paste, consider the following:

  • write_splunk - this is a collectd plugin written by Splunk. Use it rather than write_http if possible. While either works, don't use both, since that duplicates your data collection. You can disable the unused plugin in the collectd configuration by starting each of its lines with the # character (see the example after this list).

  • replace_with_hec_* - look for these strings to find where you need to edit the file for your specific environment. Keep any quotes, but remove the angle brackets (< and >), as in the example below.

  • ReportByCpu - when set to false, CPU usage is reported as an aggregate of all processors on the system. Change it to true if you want per-processor detail.

  • Interval - set this to the desired collection frequency in seconds. Once a minute works for me, but adjust the frequency to meet your needs. Remember that more data means more license usage and more storage cost.
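
For example, here is roughly what the Splunk-specific pieces look like once the placeholders are filled in. The hostname, port, and token below are made up purely for illustration; substitute your own HEC details. Note how the unused write_http lines are disabled with a leading #:

<Plugin write_splunk>
    server "hec.example.com"
    port "8088"
    token "12345678-abcd-1234-abcd-123456789012"
    ssl true
    verifyssl false
</Plugin>

# Disabled because write_splunk is in use:
# LoadPlugin write_http
# <Plugin write_http>
#     ...
# </Plugin>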

I plan to update this collectd.conf as I learn more. I'll add comments to this answer to share updates and keep a change history.

This configuration applies to UNIX, Linux, and macOS.

#
# Config file for collectd(1).
# Please read collectd.conf(5) for a list of options.
# http://collectd.org/
#

##############################################################################
# Global                                                                     #
#----------------------------------------------------------------------------#
# Global settings for the daemon.                                            #
##############################################################################

#Hostname    "localhost"
FQDNLookup   false
#BaseDir     "/var/lib/collectd"
#PIDFile     "/var/run/collectd.pid"
#PluginDir   "/usr/lib64/collectd"
#TypesDB     "/usr/share/collectd/types.db"

#----------------------------------------------------------------------------#
# When enabled, plugins are loaded automatically with the default options    #
# when an appropriate <Plugin ...> block is encountered.                     #
# Disabled by default.                                                       #
#----------------------------------------------------------------------------#
#AutoLoadPlugin false

#----------------------------------------------------------------------------#
# When enabled, internal statistics are collected, using "collectd" as the   #
# plugin name.                                                               #
# Disabled by default.                                                       #
#----------------------------------------------------------------------------#
#CollectInternalStats false

#----------------------------------------------------------------------------#
# Interval at which to query values. This may be overwritten on a per-plugin #
# base by using the 'Interval' option of the LoadPlugin block:               #
#   <LoadPlugin foo>                                                         #
#       Interval 60                                                          #
#   </LoadPlugin>                                                            #
#----------------------------------------------------------------------------#
Interval     60

#MaxReadInterval 86400
#Timeout         2
#ReadThreads     5
#WriteThreads    5

# Limit the size of the write queue. Default is no limit. Setting up a limit is
# recommended for servers handling a high volume of traffic.
WriteQueueLimitHigh 1000000
WriteQueueLimitLow   800000

##############################################################################
# Logging                                                                    #
#----------------------------------------------------------------------------#
# Plugins which provide logging functions should be loaded first, so log     #
# messages generated when loading or configuring other plugins can be        #
# accessed.                                                                  #
##############################################################################

LoadPlugin syslog
LoadPlugin logfile
<LoadPlugin "write_splunk">
        FlushInterval 30
</LoadPlugin>

##############################################################################
# LoadPlugin section                                                         #
#----------------------------------------------------------------------------#
# Lines beginning with a single `#' belong to plugins which have been built  #
# but are disabled by default.                                               #
#                                                                            #
# Lines beginning with `##' belong to plugins which have not been built due  #
# to missing dependencies or because they have been deactivated explicitly.  #
##############################################################################

#LoadPlugin csv
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
LoadPlugin load
LoadPlugin disk
LoadPlugin interface

##############################################################################
# Plugin configuration                                                       #
#----------------------------------------------------------------------------#
# In this section configuration stubs for each plugin are provided. A desc-  #
# ription of those options is available in the collectd.conf(5) manual page. #
##############################################################################

<Plugin logfile>
    LogLevel info
    File "/var/log/collectd.log"
    Timestamp true
    PrintSeverity true
</Plugin>

<Plugin syslog>
    LogLevel info
</Plugin>

<Plugin cpu>
    ReportByCpu false
    ReportByState true
    ValuesPercentage true
</Plugin>

<Plugin memory>
    ValuesAbsolute false
    ValuesPercentage true
</Plugin>

<Plugin df>
    FSType "ext2"
    FSType "ext3"
    FSType "ext4"
    FSType "XFS"
    FSType "rootfs"
    FSType "overlay"
    FSType "hfs"
    FSType "apfs"
    FSType "zfs"
    FSType "ufs"
    ReportByDevice true
    ValuesAbsolute false
    ValuesPercentage true
    IgnoreSelected false
</Plugin>

<Plugin load>
    ReportRelative true
</Plugin>

<Plugin disk>
    Disk ""
    IgnoreSelected true
    UdevNameAttr "DEVNAME"
</Plugin>

<Plugin interface>
    IgnoreSelected true
</Plugin>

##############################################################################
# Customization for Splunk                                                   #
#----------------------------------------------------------------------------#
# This plugin sends all metrics data from other plugins to Splunk via HEC.   #
# Plugin available from https://splunkbase.splunk.com/app/3975.              #
#     within /appserver/static/unix_agent/unix-agent.tgz/write_splunk.so     #
#     save to collectd plugin directory (see PluginDir near top of file)     #
##############################################################################

<Plugin write_splunk>
    server "<replace_with_hec_domain>"
    port "<replace_with_hec_point>"
    token "<replace_with_hec_token>"
    ssl true
    verifyssl false
</Plugin>

##############################################################################
# Using write_http instead of write_splunk:                                  #
#----------------------------------------------------------------------------#
# Comment out the below if using the above. write_splunk OR write_http       #
##############################################################################
LoadPlugin write_http
<Plugin write_http>
    <Node "hec-to-splunk">
        URL "https://<replace_with_hec_domain>:<replace_with_hec_point>/services/collector/raw"
        Header "Authorization: Splunk <replace_with_hec_token>"
        Format "JSON"
        Metrics true
        StoreRates true
        VerifyPeer false
        VerifyHost false
    </Node>
</Plugin>

sloshburch
Splunk Employee

A recent idea that has been floated: for instances that already have Splunk installed, deploy this configuration with a deployment server, and use an Include statement in the installed collectd.conf to pull in the configuration that the deployment server pushes out.

For example:

Include "${SPLUNK_HOME}/etc/apps/my_collectd_config/local/collectd.conf"

vladislavplaksy
Explorer

@SloshBurch
Do you know where I can read documentation about the write_splunk plugin?
I am trying to configure the Splunk App for Infrastructure to collect metrics from a Hadoop cluster.
See my questions here:
https://answers.splunk.com/answers/741247/how-to-extract-custom-dimensions-from-plugin-insta.html
and
https://answers.splunk.com/answers/747633/how-to-extract-custom-dimension-from-metrics-in-sp.html?ch...


ntankersley_spl
Splunk Employee

write_splunk is an extension of the write_http plugin. It reformats metrics from common plugins like disk and cpu so that certain parts of the metric name, such as the volume or CPU number, become dimensions. It also pulls information about the host from the OS and adds that as dimensions.

write_splunk is the recommended plugin to use with collectd because it lets you add custom dimensions and, in the most recent versions, send data to a universal forwarder (UF) before it goes to Splunk. It is also the plugin required to use the data with the Splunk App for Infrastructure (SAI).
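
For what it's worth, the custom dimensions mentioned above go inside the write_splunk block. Here is a minimal sketch, assuming the Dimension "key:value" option syntax; verify the exact option name against your plugin version and the SAI documentation, and treat the server, port, token, and dimension values as illustrative only:

<Plugin write_splunk>
    server "hec.example.com"
    port "8088"
    token "12345678-abcd-1234-abcd-123456789012"
    ssl true
    verifyssl false
    Dimension "cluster:hadoop-prod"
    Dimension "role:datanode"
</Plugin>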


bsimon_splunk
Splunk Employee

@vladislavplaksy - I'm checking to see if any documentation is available.

Meanwhile, the Advanced data collection topic in the Administer Splunk App for Infrastructure manual may be of use. It taught me that collectd captures the metrics and that write_http or write_splunk simply directs that output to HEC. I don't know much about Hadoop, but if you can write the metrics to a log or to the HTTP Event Collector, you should be all set.
