I have a SOAP output file that I want to do metrics on in Splunk. There is a lot of data in the envelope that is useless to me, and only a couple of strings of text that are useful. Is it possible to either:
Limit or "pick out" the text that splunk actually indexes from the file, maybe based on regex?
or
Set it so that the file will index every minute, but only indexes the changes within the file, and not the whole file over and over again (potentially only one or two strings within the file are subject to change).
Hope this makes sense!
Example would be:
<?xml version="1.0" encoding="UTF-8"?><SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns1="http://www.example.org/Admin/"><SOAP-ENV:Body><ns1:PortalAuthentication><NpsHost name="server-nps-01" port = "5480" qhist = "yes" username = "test" timezoneoffset = "Europe/London" ><Status username="test" desc="Online" rev="7.0.0-0.F-1.P-1.Bld-26407" fss="schema_disabled" upper="true"/>
</ns1:PortalAuthentication></SOAP-ENV:Body></SOAP-ENV:Envelope>
Within that text, only the "desc" field is subject to change, with Online or Offline for example, but the rest is static. I do need the data, but I'll only need it once, or only if it changes (like firmware revision changes etc)
The only way I was able to solve the problem was to script the output to a staging file, then cat that file into a sed statement that cut out the sections of text I didn't want and replaced them with a newline (\n), which separated each server name into its own event. Then, in the data input wizard, I specified the time for all events as the timestamp at the end of each event, which I added with the Linux date command.
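For anyone who would rather not maintain the sed pipeline, the same idea can be sketched in Python. This is only an illustration, not the script I actually run: the function name `to_events`, the output field name `server`, and the choice to keep only `desc` and `rev` are assumptions made for the example.

```python
import re
from datetime import datetime, timezone

# One NpsHost element followed by its Status element, as in the example above.
HOST_RE = re.compile(r'<NpsHost\s+name="([^"]+)"[^>]*>\s*<Status\s+([^>]*?)/?>')
# A single attr="value" pair, tolerating spaces around the equals sign.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

# Shortened copy of the example envelope, used only for demonstration.
SAMPLE = (
    '<SOAP-ENV:Body><ns1:PortalAuthentication>'
    '<NpsHost name="server-nps-01" port = "5480" qhist = "yes" >'
    '<Status username="test" desc="Online" '
    'rev="7.0.0-0.F-1.P-1.Bld-26407" fss="schema_disabled" upper="true"/>\n'
    '</ns1:PortalAuthentication></SOAP-ENV:Body>'
)

def to_events(soap_text, now=None):
    """Return one line per server, with a timestamp appended at the end."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%S %z")
    events = []
    for host, status_attrs in HOST_RE.findall(soap_text):
        attrs = dict(ATTR_RE.findall(status_attrs))
        events.append('server="%s" desc="%s" rev="%s" %s' % (
            host, attrs.get("desc", ""), attrs.get("rev", ""), stamp))
    return events
```

Writing `"\n".join(to_events(text))` to the staging file gives one event per server, with the timestamp in the same place on every line.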
If you really want to "pick and choose" what Splunk indexes from a file, there are a couple of other choices.
Index only the first 1200 characters of each event: In props.conf
[yourstanzahere]
TRUNCATE = 1200
Use a regular expression to determine what is/isn't indexed from an event. The following would process your example so that only
NpsHost name="server-nps-01" Status username="test" desc="Online"
would be indexed.
props.conf
[yourstanzahere]
TRANSFORMS-t1=editMyEvent
transforms.conf
[editMyEvent]
SOURCE_KEY=_raw
REGEX=.*(NpsHost name="\S+?").*?(Status username="\S+?" desc="\S+?")
DEST_KEY=_raw
FORMAT=$1 $2
When you make these changes to props.conf and transforms.conf
1) they need to be made where the events are parsed (usually that's on the indexer)
2) you need to restart the indexer to have the changes take effect
3) the changes apply only to new data as it is parsed - no change will be made to data that was previously indexed
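Because a restart is needed for every change, it can help to check the regular expression outside Splunk first. The following is a rough sketch using Python's `re` module, which behaves like Splunk's PCRE engine for a simple pattern such as this one, though the two are not identical. It assumes the username value is meant to match `\S+?`, like the other two quoted values, and that an event which does not match is left unchanged. The helper name `rewrite_raw` is made up for the example.

```python
import re

# The pattern from the REGEX line, with "\S+?" used for every quoted value.
PATTERN = re.compile(
    r'.*(NpsHost name="\S+?").*?(Status username="\S+?" desc="\S+?")'
)

def rewrite_raw(raw):
    """Mimic FORMAT=$1 $2: return the rewritten event, or the input if no match."""
    match = PATTERN.search(raw)
    if match is None:
        return raw
    return "%s %s" % (match.group(1), match.group(2))

sample = (
    '<NpsHost name="server-nps-01" port = "5480" qhist = "yes" '
    'username = "test" timezoneoffset = "Europe/London" >'
    '<Status username="test" desc="Online" '
    'rev="7.0.0-0.F-1.P-1.Bld-26407" fss="schema_disabled" upper="true"/>'
)

print(rewrite_raw(sample))
# prints: NpsHost name="server-nps-01" Status username="test" desc="Online"
```

If the output here is the original event rather than the shortened one, the pattern is not matching and the transform would have no visible effect in Splunk either.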
This modification to the props.conf and transforms.conf sounds like the way forward, but when I make the change, nothing happens for some reason.
I'll try to re-create the file input and make the changes again.
If the new/changed information is appended to the end of the file, then a standard monitor input in Splunk already handles this: it indexes only the newly appended data.
If the entire file is replaced by the new data, there is no default input type that will index only the deltas.
However, you could use a scripted input. With a scripted input, you write the script and Splunk indexes the stdout of the script. The Splunk documentation covers scripted inputs in more detail.
Here is one way to code the logic of such a script:
1. When the script begins, check for the presence of an earlier version of the file.
2. If the earlier version does not exist: (a) copy the current file to the previous version, (b) output the current file to stdout so it will be indexed by Splunk, and (c) exit.
3. Use a "diff" command (whatever is appropriate to your OS) to identify what has changed between the current and previous version of the file. Direct the output (if any changes) to stdout.
4. Copy the current file to the previous version.
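The steps above can be sketched in Python roughly as follows. Instead of calling an external diff command, this version does a plain line comparison and prints every line of the current file that was not in the previous version, so each event is a complete line rather than a diff hunk. It assumes the file holds one server per line; the function names and the idea of passing both paths as arguments are made up for the example.

```python
import os
import shutil
import sys

def changed_lines(current_path, previous_path):
    """Return the lines of the current file that are new since the last run."""
    with open(current_path, encoding="utf-8") as f:
        current = f.read().splitlines()

    if os.path.exists(previous_path):
        with open(previous_path, encoding="utf-8") as f:
            previous = set(f.read().splitlines())
        # Keep whole lines that are new or changed.
        output = [line for line in current if line not in previous]
    else:
        # First run: there is no earlier version, so everything is new.
        output = current

    # Save the current file as the previous version for the next run.
    shutil.copyfile(current_path, previous_path)
    return output

def main(current_path, previous_path):
    # Splunk indexes whatever the scripted input writes to stdout.
    for line in changed_lines(current_path, previous_path):
        sys.stdout.write(line + "\n")
```

Calling `main("status.txt", "status.prev")` from the scripted input would print the whole file on the first run and only the changed lines afterwards; both file names are placeholders.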
I see only one possible problem with this approach - exactly how do you intend to retrieve and use this information? You might want to do a more sophisticated script than just a "diff" command, to make sure that the output of the script will provide sufficient context and be truly useful in Splunk.
I see that what you are doing is effectively a scripted input.
Effectively what I'm already doing is a scripted input, as I have splunk monitoring the file that my script is outputting. I was hoping to try and limit what splunk grabs from the file, because it either doesn't need all the information in the file, or only some output changes.
The main aim being that I don't want splunk to index data unnecessarily
Basically, I intend to create a "status" screen of various machines.
Through this SOAP output, it tells you the current status of the server when you initiate the request. I would intend to make it poll to create an output file every 1-5 minutes and as such only small changes would occur within the output.
With the raw data, I would create a dashboard to visualise the data better.