Solved: How do I test my storage system using FIO across m...

ekost · 09-21-2020

Overview: Flexible I/O (FIO) is a storage I/O testing tool. It offers options to perform a variety of storage tests, has detailed reporting, is CLI-based, and can run simultaneous tests across many machines using one control node. A pre-compiled version is available for multiple *nix distributions and Windows. See Flexible I/O Binary packages for the latest builds.

ekost · 09-25-2020

Pre-requisites to performing a test:
* The FIO package. Compile your own build, pull via the package manager du jour, or grab a precompiled version for your distro. The FIO version must be consistent when testing across multiple machines.

* A sufficient machine count to perform the load test on the storage. These machines will run FIO with the 'server' switch and be managed by a captain. Example: If you're planning a 10-indexer cluster, run the test across all 10 indexer machines simultaneously.

* Identify the storage volume to be tested on each machine. This can be a shared storage path, a mounted storage path on all machines, etc.

* Check the free space on the selected storage volume. Depending upon the switches chosen for the test, the test files could consume 1TB or more of free space on the volume for each machine being tested.

* A user account with sufficient permission to run programs, write files, and open high TCP ports. Using root is not required. The account also needs read and write permissions to the storage volume being tested.

* An allowed open TCP port on each FIO 'server' machine (default TCP:8765) that'll accept an inbound request from the captain.

* Ample time for the test to run.

* A change control, or at least a friendly warning to any infrastructure monitoring teams as the test might trigger load alerts or other warnings.
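Two of the prerequisites above (free space and the TCP port) can be checked from a shell before starting. This is only a rough sketch, assuming a Linux host with POSIX df, bash, and the GNU coreutils timeout command. The function names has_free_space and check_fio_port are made up for illustration and are not part of FIO.

```shell
# has_free_space DIR NEED_GB: succeed if DIR has at least NEED_GB GiB available.
has_free_space() {
    dir=$1
    need_gb=$2
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    need_kb=$((need_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$need_kb" ]
}

# check_fio_port HOST [PORT]: succeed if a TCP connection to HOST:PORT opens
# within 3 seconds. PORT defaults to 8765, the FIO server default.
check_fio_port() {
    host=$1
    port=${2:-8765}
    # /dev/tcp/HOST/PORT is a bash redirection feature, so bash is called explicitly.
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

For example, run has_free_space /mnt/<volume_name> 1100 on each test machine. The port check only succeeds once fio --server is running on the target, so run check_fio_port <hostname> from the captain after the servers are started.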

Implementation tasks for testing:
* Select a machine to be the captain for the FIO testing. The test will not run on this node. It's only a coordinator.

* Distribute or build FIO on all non-captain machines running the test, and check the version.

* If Splunk is installed on the test machines, stop Splunk services before running the FIO test.

* Start FIO on all of the non-captain machines using the server switch:

fio --server

* Optional: Run the tool in the background if desired: fio --server &
* Optional: Verify the TCP port is open and listening: netstat -an | grep 8765

* On the captain machine, create a text file containing a list of all hosts to communicate with. For example: host.list
* Add each test machine by IP or hostname, one line per host.

* On the captain machine, create a job file to define the test parameters. For example: random_rw.fio

* On the captain machine, update the job file with the switches that best define the storage parameters you want to use for the test. Field reps testing storage for Splunk Enterprise with FIO commonly use these settings in the job file:

[global]
ioengine=libaio
iodepth=16
bssplit=4k/60:8k/:32k/
direct=1
size=<2x RAM size>
stonewall
group_reporting
directory=/mnt/<volume_name>

[random-read-write-1]
rw=randrw
numjobs=1

[random-read-write-many]
rw=randrw
numjobs=<CPU core count from OS>

[sequential-read-write-1]
rw=readwrite
numjobs=1

[sequential-read-write-many]
rw=readwrite
numjobs=<CPU core count from OS>
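The two placeholders in the job file (size and numjobs) have to be filled in by hand for each hardware profile. As a convenience, the sketch below writes the same job file with those values passed in as arguments. The function name write_job_file is made up for illustration, and the detection commands in the trailing comments assume Linux (/proc/meminfo) and GNU coreutils (nproc).

```shell
# write_job_file OUT DIR SIZE_GB CORES
# Writes the example job file to OUT, testing DIR, with size=SIZE_GB GiB per file
# and numjobs=CORES for the two "many" tests.
write_job_file() {
    out=$1
    dir=$2
    size_gb=$3
    cores=$4
    cat > "$out" <<EOF
[global]
ioengine=libaio
iodepth=16
bssplit=4k/60:8k/:32k/
direct=1
size=${size_gb}g
stonewall
group_reporting
directory=${dir}

[random-read-write-1]
rw=randrw
numjobs=1

[random-read-write-many]
rw=randrw
numjobs=${cores}

[sequential-read-write-1]
rw=readwrite
numjobs=1

[sequential-read-write-many]
rw=readwrite
numjobs=${cores}
EOF
}

# Example use on a test machine (Linux only; MemTotal is reported in kB):
#   ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024 + 0.5)}' /proc/meminfo)
#   write_job_file random_rw.fio /mnt/<volume_name> $((ram_gb * 2)) "$(nproc)"
```

Because the captain sends one job file to every server, detect the values on a machine that matches the test machines' RAM and core count.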

Notable settings used in the job file:
* size: The size of the file on disk used for read and write testing. Suggested 2x Total RAM on machine.
* numjobs: The number of jobs to run simultaneously. Suggested as the core count reported by the OS. Setting numjobs >1 creates a new file for each job using the 'size' setting.
Example: A test with size=32g and numjobs=16 (sized here at 1x RAM for a 32GB RAM host with 16 cores) will create sixteen 32GB files (~512GB) on the storage volume for each machine being tested. At the suggested 2x RAM (size=64g), the same test creates ~1TB.

Each test defined in the job file creates its own set of files. Example: [sequential-read-write-many] is just one of the 4 tests defined in the job file above. FIO does not remove these files after the test is complete.
* runtime: in seconds. Use to optionally limit the time the test is allowed to run.
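To put numbers on the settings above: since each test creates its own files and none are removed, the space used per machine is roughly the sum over all four example tests. The sketch below works through that arithmetic. The function name footprint_gb is made up for illustration, and it assumes the four-test job file shown earlier (two tests with numjobs=1 and two with numjobs set to the core count).

```shell
# footprint_gb SIZE_GB NUMJOBS
# Prints the total GB of test files left on one machine by the four example tests.
# Each job creates its own SIZE_GB file: 1 + NUMJOBS + 1 + NUMJOBS files in total.
footprint_gb() {
    size_gb=$1
    numjobs=$2
    echo $(( size_gb * (1 + numjobs + 1 + numjobs) ))
}

footprint_gb 32 16   # prints 1088
```

With size=32g and numjobs=16 that is 34 files of 32GB each, or 1088GB per machine, which is in line with the "1TB or more" free space prerequisite.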

Initiate the test:
Run FIO on the captain using the --client switch:

fio --client=host.list --output-format=json --output="fio_test_$(date '+%Y-%m-%d_%H.%M.%S').json" random_rw.fio

Test notes:
* If there's an FIO version mismatch between the captain (client switch) and the servers, the error "bad server cmd version XX" will appear on the captain when starting the test. The test will continue on all machines except for the machine having the mismatched version.
* It'll take time to run the test based upon settings chosen. Use the runtime setting under [global] to set a limit.
* Test results are written on the captain, in the directory FIO was launched from unless --output specifies another path.
* The example above uses the JSON output. There are other output reports available. See --output-format in command line options.

Post-test cleanup:
* Save the output report from the captain machine.
* Remove the leftover testing files from the tested filesystem. The FIO test files use the naming structure: <hostname>.<job_file_header> and will be found in the path defined in the job file.
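Before deleting anything, it can help to list what the test left behind. The sketch below is one cautious way to do that. It assumes the job names from the example job file (random-read-write-* and sequential-read-write-*) appear in the file names, and that find supports -maxdepth (GNU and BSD find do). The function name list_fio_files is made up for illustration.

```shell
# list_fio_files DIR
# Prints regular files directly inside DIR whose names contain one of the
# example job names. It only lists; nothing is deleted.
list_fio_files() {
    find "$1" -maxdepth 1 -type f \
        \( -name '*random-read-write-*' -o -name '*sequential-read-write-*' \) | sort
}
```

Run list_fio_files /mnt/<volume_name> on each tested machine (or once on a shared volume), review the list, and then remove only the files you recognize as FIO test files.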

Evaluate results:
See the continuation below for some notes on the FIO output report.


ekost · 09-25-2020

Can I do distributed FIO testing on Windows Servers?

At this time, the FIO pre-compiled binaries for Windows do not support the 'server' switch. The error is: 'fio: waitpid: Function not implemented'. But support might come in a future FIO build.

You can still run individual tests on each Windows host. You can also use PowerShell to launch the command across multiple machines. If someone has a script example, please post it here!

What do I need for a single-machine test using Windows?

* Review the "Pre-requisites to performing a test" for *nix above. You won't need to worry about the TCP port, but the other items apply.
* Get a pre-compiled FIO package for Windows. See Flexible I/O Binary packages.

Implementation for test:
* Set up the FIO folder in a temporary directory.
* Create a job file on the machine. Example: random_rw.fio
* Choose switches for your test. Use the job file example above, and change the settings:
* ioengine=windowsaio
* directory=X\:\<path>\<path>
The directory example above assumes you have mounted the storage as a drive letter. The backslash before the colon is intentional: FIO treats an unescaped colon as a separator.
* size: The size of the file on disk used for read and write testing. Suggested 2x Total RAM on machine.
* numjobs: The number of jobs to run simultaneously. Suggested as the core count reported by the OS. Setting numjobs >1 creates a new file for each job using the 'size' setting.
Example: A test with size=32g and numjobs=16 (sized here at 1x RAM for a 32GB RAM host with 16 cores) will create sixteen 32GB files (~512GB) on the storage volume. At the suggested 2x RAM (size=64g), the same test creates ~1TB.
Each test defined in the job file creates its own set of files. Example: [sequential-read-write-many] is just one of the 4 tests defined in the job file above. FIO does not remove these files after the test is complete.
* runtime: in seconds. Use to optionally limit the time the test is allowed to run.

Initiate the test:
Run FIO.

fio --output-format=json --output="fio_test_windows_host.json" random_rw.fio

Test notes:
* It'll take time to run the test based upon settings chosen. Use the runtime setting under [global] in the job file to set a limit.
* Test results are written to the directory FIO was launched from unless --output specifies another path.
* The example above uses the JSON output. There are other output reports available. See the FIO documentation option --output-format in command line options.

Post-test cleanup:
* Save the output report.
* Remove the leftover testing files from the tested filesystem. The FIO test files use the naming structure: <hostname>.<job_file_header> and will be found in the path defined in the job file.

Evaluate results:
* See the continuation below for some notes on the FIO output report.

How do I test my storage system using FIO across multiple *nix nodes simultaneously?
