I'm trying to push Splunk to a customer to index a huge amount of data (almost 4.5GB/10M events per day). The data comes from the "trace" of all the business operations of a major Italian bank. The problem with this data is that there are almost no fields in it: it's a typical mainframe-like output, with positional fields (almost no spaces or field separators between fields). Moreover, the record layout is not fixed and can change depending on the business operation.
I'm doing some search speed tests and the results are quite disappointing. In fact, if I look for data that is recognized as a token of some sort (say, surrounded by blanks), the search is really fast (e.g., search
index=xxxx u024015). But if the key is not isolated between blanks I have to use the more general search
index=xxxx *u024015*. In this case the search time rises dramatically (the Job Inspector says that at 13% completion it had already taken 1,265.527 seconds for 3,361,383 scanned events).
In the first case the search is very fast because it looks up a specific string (u024015) in the index, while in the latter it must perform an almost full-text search and index scan.
Is there anything we can do to boost performance? Should we work on segmentation, and if so, how?
Thanks for suggestions, Marco
I tried to add a sample here but there isn't enough space. It looks like this:
0000060356323.03.201017.04.06010051750SCHASC6E 0000000000123.03.2010>MID010005120000 WQUFLAVDIS01010877208772U095981 w01085821u 9598120100323INQUI170405620 2303201000000215 I 0 01025 6035632010-03-2317.04.06010051750SCHASC6E LUYD03
Adding on to what Stephen said...
If modifying your input data is not an option (using an external tool, or using the
SEDCMD feature), then you could also use indexed fields. Generally speaking, indexed fields should be considered a last resort because they are less flexible and more difficult to set up properly. But in your case, if tweaking your input data is out of the question, this may be the best option for you.
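For reference, an indexed field is defined across three config files. A minimal sketch, where the sourcetype, transform name, field name, and regex are all placeholders you would adapt to your actual positional layout:

```
# transforms.conf -- names and regex are placeholders, adjust to your record layout
[trace_opcode]
REGEX = ^.{20}(\w{7})
FORMAT = opcode::$1
WRITE_META = true

# props.conf
[your_sourcetype]
TRANSFORMS-trace = trace_opcode

# fields.conf -- tells search heads the field lives in the index
[opcode]
INDEXED = true
```

With this in place, a search like opcode=u024015 would hit the index directly instead of scanning raw events.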
Another thought: If the only reason not to tweak your data is a business requirement that mandates you be able to retrieve the original content, then you could write your own search command that "undoes" the various changes made by your
SEDCMD filters so that the original data would still be retrievable.
For example, if your raw event looks like this: (this is a scrubbed NACHA record I found laying around)
627031309302 0000200000BATCH_NAME 9016 21-47830 0301 0313233260000003
You could transform into something that looks like this:
6|27|031309302| |0000200000|BATCH_NAME |9016| |21-47830| |0301 |031323326|0000003
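One way to produce that pipe-delimited form at index time is a SEDCMD that inserts "|" at fixed column offsets. A rough sketch only; the sourcetype name and field widths below are placeholders, not the real NACHA layout, so you would adjust the capture-group widths to match your records:

```
# props.conf -- widths are illustrative placeholders
[your_sourcetype]
SEDCMD-add_delims = s/^(.{1})(.{2})(.{9})(.{17})(.{10})/\1|\2|\3|\4|\5|/
```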
Now each individual field can be searched for quickly, since "|" is a major breaker, and you could still fairly easily retrieve your original content by simply stripping out the "|" characters with a search command like
rex mode=sed "s/\|//g". However, if your raw data sometimes contains literal "|" characters, then you may need a more sophisticated approach, which is where a custom search command would come into play.
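Putting it together: assuming the delimited form above has been indexed (the index name here is a placeholder), the key becomes a fast indexed-term search, and the original layout can be restored at search time:

```
index=xxxx 031309302
| rex mode=sed "s/\|//g"
```

By default rex mode=sed rewrites _raw, so the events display in their original un-delimited form.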
To make this type of search fast, you'll have to make sure that Splunk indexes the keywords that you need to key retrievals off of. The easiest way to do this is to add punctuation/whitespace to the fields before the data is indexed. If you don't have access to the logging format, you can modify the data as it comes in using a sed-like expression at index time (http://www.splunk.com/base/Documentation/4.1.3/Admin/Anonymizedatawithsed).
I think it would help to see what the events look like in this case. Wondering if you can post a couple of sample events. You can "scrub" them to disguise or hide any sensitive info, if you need to.