Monitoring Splunk

Calculating IOPS using FIO testing

hkeyser
Splunk Employee

Installing Fio:

Linux:
Fio is part of the CentOS/Red Hat core repository, so you can check for it and install it with the following:
yum info fio
yum install fio

Fio is also part of the core Debian/Ubuntu repository and you can check/install via:
apt-cache search fio
apt-get install fio

Windows:
Navigate to:
https://bsdio.com/fio/

Alternatively:
The latest builds for Windows can also be grabbed from https://ci.appveyor.com/project/axboe/fio by clicking the latest x86 or x64 build, then selecting the ARTIFACTS tab.

Download the most recent version of fio. If it doesn't work, try the previous version.

The installer performs a basic installation that makes fio available from both Windows PowerShell and the standard Command Prompt. There is no sign of the install beyond the installer finishing (no desktop icon or program listing).


Using Fio:

OS Agnostic Info:

  1. Splunk MUST be down before running this test to get an accurate reading of the disk system's capabilities.
  2. The size of each test file is (2 x total RAM) / (number of CPU cores reported by Splunk). The goal is to fully saturate the RAM and to push the CPUs through read/write operations for a thorough test. Splunk doesn't take advantage of hyperthreading/multithreading, so we only run the test with the number of physical cores available or the number of CPUs assigned to the VM. In the Splunk log line below, look for "CPU cores", not "(virtual) CPUs":
     12-12-2018 09:59:17.240 -0500 INFO loader - Detected 8 (virtual) CPUs, 8 CPU cores, and 7822MB RAM

IMPORTANT: After the test has run, the test files remain in the directory where it was run. Clean them up, or they will keep occupying disk space equal to: (size of test file) x (number of CPU cores)
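
As a rough sketch of the arithmetic above, here is how the numbers in the sample log line work out. The function names are hypothetical, and rounding up to whole gigabytes is an assumption based on the "rounded up" note in the job file below.

```python
import math
import re

# Sample line from this post; the regex only assumes this exact wording.
LOG_LINE = ("12-12-2018 09:59:17.240 -0500 INFO loader - "
            "Detected 8 (virtual) CPUs, 8 CPU cores, and 7822MB RAM")

def parse_loader_line(line):
    """Return (cpu_cores, ram_mb) from a 'loader - Detected ...' line."""
    match = re.search(r"(\d+) CPU cores, and (\d+)MB RAM", line)
    if match is None:
        raise ValueError("not a loader 'Detected ...' line")
    return int(match.group(1)), int(match.group(2))

def job_size_gb(ram_mb, cpu_cores):
    """(2 x total RAM) / CPU cores, rounded up to a whole number of GB."""
    return math.ceil(2 * ram_mb / cpu_cores / 1024)

def leftover_gb(size_gb, cpu_cores):
    """Disk space the test files occupy until they are deleted."""
    return size_gb * cpu_cores

cores, ram_mb = parse_loader_line(LOG_LINE)
size = job_size_gb(ram_mb, cores)
print(f"size={size}g numjobs={cores} leftover={leftover_gb(size, cores)}g")
# prints: size=2g numjobs=8 leftover=16g
```

So an 8-core box with about 8GB of RAM needs roughly 16GB free in the test directory.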

Linux:
Create a file with a ".fio" extension on it (e.g. fiotest.fio)

Edit the file to include the following:

[random-rw]
rw=randrw
size=<2x Total RAM on Machine divided by number of CPU cores, rounded up> (size in: k, m, g)
blocksize=64k
ioengine=libaio
directory=<insert directory to test here>
numjobs=<number of CPU cores reported by Splunk>
iodepth=32
group_reporting

The directory must be one that Splunk writes hot/warm or cold buckets to, depending on where the I/O issue appears to be.

Then run the test by passing the job file to fio (fio filename.fio):

[root@host]# fio /root/fio/fiotest.fio
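
If you build these job files often, something like the following could generate them. This is only a sketch: render_job_file is a hypothetical helper, not part of fio, and it simply reproduces the settings listed above.

```python
def render_job_file(size, directory, numjobs, ioengine="libaio"):
    """Build the [random-rw] job file text used in this post."""
    lines = [
        "[random-rw]",
        "rw=randrw",
        f"size={size}",
        "blocksize=64k",
        f"ioengine={ioengine}",
    ]
    # The Windows job file in this post has no directory line, so allow None.
    if directory is not None:
        lines.append(f"directory={directory}")
    lines += [f"numjobs={numjobs}", "iodepth=32", "group_reporting"]
    return "\n".join(lines) + "\n"

# Example values taken from the test machine described later in this post.
print(render_job_file("2g", "/opt/splunk/var/lib/splunk", 8), end="")
```

Redirect the printed text into a file such as fiotest.fio and pass that file to fio.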

This test can take time. Run it during a scheduled window or during downtime (seriously, you could be there for hours).
Once the test finishes, copy/paste the results into a text file and (if necessary) upload it to your support case. Fio prints its results straight to the command line, similar to a "cat" command.
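
Instead of copy/pasting from the terminal, fio can write its report to a file with its --output option. A minimal sketch of driving that from Python follows. The helper names and file paths are hypothetical, and run_fio only works where fio is installed and on the PATH.

```python
import subprocess

def build_fio_command(job_file, output_file):
    """Argument list for: fio --output=<output_file> <job_file>"""
    return ["fio", f"--output={output_file}", job_file]

def run_fio(job_file, output_file, workdir=None):
    """Run fio from workdir (None means the current directory)."""
    command = build_fio_command(job_file, output_file)
    # check=True raises CalledProcessError if fio exits with an error.
    subprocess.run(command, cwd=workdir, check=True)

# Printing the command is safe anywhere; calling run_fio starts the real test.
print(" ".join(build_fio_command("/root/fio/fiotest.fio", "fio_results.txt")))
# prints: fio --output=fio_results.txt /root/fio/fiotest.fio
```

Passing workdir plays the same role as the cd step in the Windows instructions.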

See "Interpreting Fio Results" below


Windows:
As on Linux, create a file with a ".fio" extension (e.g. fiotest.fio)
Edit the file to include the following:

[random-rw]
rw=randrw
size=<2x Total RAM on Machine divided by number of CPU cores, rounded up> (size in: k, m, g)
blocksize=64k
ioengine=windowsaio
numjobs=<number of CPU cores reported by Splunk>
group_reporting
iodepth=32

If the file is not showing as a ".fio" file, open File Explorer, navigate to where you created the file, and turn on file extensions. Then rename the file so that it ends in ".fio" instead of ".txt", ".docx", etc.
Unlike Linux, you cannot specify a directory nicely on Windows. Instead, change to the directory where the test needs to occur in PowerShell or Command Prompt, then run the fio test from there:

EXAMPLE:
C:\Users\Administrator> cd "C:\Program Files\Splunk\var\lib\splunk"
C:\Program Files\Splunk\var\lib\splunk> fio C:\Users\Administrator\Desktop\fiotest.fio

This test can take time. Run it during a scheduled window or during downtime (seriously, you could be there for hours).
Once the test finishes, copy/paste the results into a text file and (if necessary) upload it to your support case. Fio prints its results straight to the command line, similar to a "cat" command.

See "Interpreting Fio Results" below


Interpreting Fio Results:

So you've successfully run a fio test. What now? That's a lot of output to parse through. Luckily, there is only one field we really need to be concerned with (listed below).

IOPS
This is a fairly straightforward field and is actually what we're looking for. It is reported once for reads and once for writes. If either value is too low, we can point to the disk system as having issues.

Remember your recommended requirements!
http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Referencehardware#Indexer
https://www.splunk.com/pdfs/technical-briefs/splunk-deploying-vmware-tech-brief.pdf

Also keep in mind whether you are running parallel ingestion pipelines. The requirements of the disk system go up by quite a bit for each additional pipeline (roughly 300-400 IOPS).

http://docs.splunk.com/Documentation/Splunk/latest/Capacity/Parallelization#Index_parallelization

Take this chart with a grain of salt: these are approximate values to keep in mind while scaling up.

# of Pipelines   Extra CPUs   Physical IOPS   VM IOPS
1 (default)      -----        800             1200
2                4-6          1100-1200       1500-1600
3                10-12        1500-1600       1700-1800
4                16-18        1700-1800       2100-2200
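
For a quick check against the chart, a lookup along these lines may help. It is a sketch only: meets_chart is a hypothetical helper, the values are the approximate ones from the chart, and reading the first row as 800 (physical) and 1200 (VM) is an assumption.

```python
# Approximate chart values from this post: pipelines -> (low, high) IOPS.
# Row 1 is read as 800 for physical and 1200 for VM, which is an assumption.
CHART = {
    "physical": {1: (800, 800), 2: (1100, 1200), 3: (1500, 1600), 4: (1700, 1800)},
    "vm": {1: (1200, 1200), 2: (1500, 1600), 3: (1700, 1800), 4: (2100, 2200)},
}

def meets_chart(measured_iops, pipelines, platform="physical"):
    """True if the measured IOPS reach the low end of the chart range."""
    low, _high = CHART[platform][pipelines]
    return measured_iops >= low

print(meets_chart(1150, 2))        # True
print(meets_chart(1150, 2, "vm"))  # False
```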


My test machine has 8 vCPU and 8GB of RAM. With this in mind, the test file itself looks like:

[random-rw]
rw=randrw
size=2g
blocksize=64k
directory=/opt/splunk/var/lib/splunk
ioengine=libaio
numjobs=8
group_reporting
iodepth=32

Below are the results of the test itself. Note: the test was intentionally stopped halfway through, to give a general idea of the results you'll have to parse through.

random-rw: (g=0): rw=randrw, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=32
...
fio-3.1
Starting 8 processes
random-rw: Laying out IO file (1 file / 2048MiB)
random-rw: Laying out IO file (1 file / 2048MiB)
random-rw: Laying out IO file (1 file / 2048MiB)
random-rw: Laying out IO file (1 file / 2048MiB)
random-rw: Laying out IO file (1 file / 2048MiB)
random-rw: Laying out IO file (1 file / 2048MiB)
random-rw: Laying out IO file (1 file / 2048MiB)
random-rw: Laying out IO file (1 file / 2048MiB)
bs: 8 (f=8): [m(8)][52.8%][r=17.4MiB/s,w=18.1MiB/s][r=278,w=289 IOPS][eta 03m:11s]
fio: terminating on signal 2

random-rw: (groupid=0, jobs=8): err= 0: pid=19368: Wed Dec 12 11:14:50 2018
read: IOPS=329, BW=20.6MiB/s (21.6MB/s)(4401MiB/213631msec)
slat (usec): min=448, max=484089, avg=24177.79, stdev=20589.49
clat (usec): min=4, max=1592.1k, avg=375811.07, stdev=164051.01
lat (msec): min=9, max=1752, avg=399.99, stdev=171.01
clat percentiles (msec):
| 1.00th=[ 155], 5.00th=[ 194], 10.00th=[ 215], 20.00th=[ 247],
| 30.00th=[ 275], 40.00th=[ 300], 50.00th=[ 330], 60.00th=[ 368],
| 70.00th=[ 422], 80.00th=[ 498], 90.00th=[ 600], 95.00th=[ 701],
| 99.00th=[ 911], 99.50th=[ 995], 99.90th=[ 1183], 99.95th=[ 1267],
| 99.99th=[ 1401]
bw ( KiB/s): min= 128, max= 5604, per=12.54%, avg=2645.05, stdev=1038.64, samples=3414
iops : min= 2, max= 87, avg=41.16, stdev=16.15, samples=3414
write: IOPS=329, BW=20.6MiB/s (21.6MB/s)(4396MiB/213631msec)
slat (usec): min=31, max=2663, avg=73.72, stdev=43.24
clat (msec): min=9, max=1509, avg=376.16, stdev=164.08
lat (msec): min=9, max=1509, avg=376.23, stdev=164.09
clat percentiles (msec):
| 1.00th=[ 155], 5.00th=[ 194], 10.00th=[ 215], 20.00th=[ 247],
| 30.00th=[ 271], 40.00th=[ 300], 50.00th=[ 330], 60.00th=[ 368],
| 70.00th=[ 422], 80.00th=[ 498], 90.00th=[ 609], 95.00th=[ 701],
| 99.00th=[ 911], 99.50th=[ 995], 99.90th=[ 1150], 99.95th=[ 1217],
| 99.99th=[ 1435]
bw ( KiB/s): min= 128, max= 6709, per=12.53%, avg=2640.95, stdev=1134.21, samples=3414
iops : min= 2, max= 104, avg=41.10, stdev=17.65, samples=3414
lat (usec) : 10=0.01%
lat (msec) : 10=0.01%, 20=0.01%, 50=0.02%, 100=0.06%, 250=21.09%
lat (msec) : 500=59.10%, 750=16.27%, 1000=2.97%, 2000=0.48%
cpu : usr=0.11%, sys=0.74%, ctx=70440, majf=0, minf=233
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwt: total=70416,70337,0, short=0,0,0, dropped=0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
READ: bw=20.6MiB/s (21.6MB/s), 20.6MiB/s-20.6MiB/s (21.6MB/s-21.6MB/s), io=4401MiB (4615MB), run=213631-213631msec
WRITE: bw=20.6MiB/s (21.6MB/s), 20.6MiB/s-20.6MiB/s (21.6MB/s-21.6MB/s), io=4396MiB (4610MB), run=213631-213631msec

Disk stats (read/write):
dm-0: ios=70357/58882, merge=0/0, ticks=1693633/1848407, in_queue=3547619, util=100.00%, aggrios=70416/58881, aggrmerge=0/1, aggrticks=1694829/1836861, aggrin_queue=3531591, aggrutil=100.00%
sda: ios=70416/58881, merge=0/1, ticks=1694829/1836861, in_queue=3531591, util=100.00%
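
If you'd rather not hunt for the IOPS lines by eye, a small parser can pull them out. This is a sketch that assumes the "read: IOPS=..." and "write: IOPS=..." summary format shown above (fio-3.1). parse_iops is a hypothetical helper, and it handles only the "k" abbreviation.

```python
import re

# Two summary lines copied from the sample output above.
SAMPLE = """\
read: IOPS=329, BW=20.6MiB/s (21.6MB/s)(4401MiB/213631msec)
write: IOPS=329, BW=20.6MiB/s (21.6MB/s)(4396MiB/213631msec)
"""

def parse_iops(report):
    """Return {'read': n, 'write': n} from fio's read:/write: summary lines."""
    pattern = re.compile(r"^\s*(read|write)\s*:\s*IOPS=([\d.]+)(k?)", re.MULTILINE)
    result = {}
    for kind, number, suffix in pattern.findall(report):
        # fio abbreviates large values, e.g. "IOPS=12.3k".
        result[kind] = float(number) * (1000 if suffix == "k" else 1)
    return result

print(parse_iops(SAMPLE))
# prints: {'read': 329.0, 'write': 329.0}
```

Both values here are well under the recommended numbers discussed earlier, which is what a struggling disk system looks like.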


Summary of the testing variables using my example from above:
[random-rw]
stanza header
rw=randrw
This is a random read/write with a ratio of 50/50
size=2g
This is the size of the testing file that will be written to disk before being randomly read/written to
blocksize=64k
The blocksize needs to be 64k because Splunk writes in 64k blocks
directory=/opt/splunk/var/lib/splunk
Where Splunk writes to disk
ioengine=libaio
The asynchronous I/O engine to use: libaio on Linux, windowsaio on Windows.
numjobs=8
I have 8 vCPUs on my test box, so I'm running 8 simultaneous Fio jobs.
group_reporting
This setting does not require an additional variable. What it does is aggregate the results of the 8 simultaneous jobs into one number that the system is capable of sustaining.
iodepth=32
The iodepth setting is the number of I/O requests each job keeps queued at once, which allows data to be written to many disks at once if they are available.

Docs:
https://media.readthedocs.org/pdf/fio/latest/fio.pdf
https://www.linux.com/learn/inspecting-disk-io-performance-fio
https://github.com/axboe/fio

Edit 10/21/2019

---bluestop.org/fio has moved to bsdio.com/fio. link fixed accordingly

Edit 12/13/2019

---Splunk docs changed the location of the VMWare tech brief pdf. link fixed

1 Solution

hkeyser
Splunk Employee

I'm a Technical Support Engineer for Splunk and felt the post would be helpful for anyone who is unsure of how to go about testing for disk I/O.

burras
Communicator

If you have separate hot/cold storage paths, I assume you would need to run this test multiple times with a different "directory" setting for each?

hkeyser
Splunk Employee

@burras yes. You'd want to test this separately against either your hot/warm or your cold/frozen storage paths with a different "directory" setting.
