We have around two billion claims spanning roughly three years. The client wants a good random sample of, say, 100 claims. They mention that in SQL they do something like:
order by newid(), which fetches 100 random records from a table.
Any ideas how to do something similar with Hunk/Splunk?
In Hunk 6.4, sampling is done by file (split), not by event. The Hunk sampling flag is called vix.split.sample.rate.
When you set vix.split.sample.rate = 0.25, it means that each split has a 1-in-4 probability of being accepted (it does not mean that every 4th split will be accepted).
For large numbers of splits, this means that roughly 25% will be accepted, and that it will not be the same 25% each time, and that we are doing our best to make sure that iteration order does not determine which splits we accept. But for small numbers of splits, it’s hard to predict how many we get back.
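To make the small-versus-large behaviour concrete, here is a minimal Python sketch of independent per-split acceptance. It is only a simulation of the rule described above, not Hunk's actual code; the function name `accepted_splits` and the split counts are made up for illustration.

```python
import random

def accepted_splits(num_splits, rate, rng):
    """Count splits kept when each one is accepted independently with probability `rate`."""
    return sum(1 for _ in range(num_splits) if rng.random() < rate)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
for num_splits in (8, 90, 100_000):
    # Five independent runs per size: small sizes scatter widely around 25%,
    # large sizes stay close to it.
    counts = [accepted_splits(num_splits, 0.25, rng) for _ in range(5)]
    print(num_splits, counts)
```

With only a handful of splits the accepted count can swing from none to most of them, while with many splits it settles near a quarter.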
In Splunk 6.4 we also have the event sampling feature: http://docs.splunk.com/Documentation/Splunk/6.4.0/Search/Retrieveasamplesetofevents
Do you mean 100 events fairly sampled from a virtual index? If so, the problem is that, in order to do it efficiently, you would need an external index of some sort. Since Hunk does not directly manage your data, it does not maintain an index. Hunk cannot know how many events are in each file in your virtual index without reading them, and it does not know the starting offsets of the events within a file, so it cannot do fair sampling without reading all the files that your search would hit.
If you are willing to sample inefficiently, you could have Hunk read all the events and pipe them to a custom command that would do the sampling by randomly deciding whether to keep each event. I believe such a command is included in the Machine Learning Toolkit on Splunkbase. But if the whole reason you are sampling is to speed up the search, then this may not be what you want.
kschon_splunk, let's keep in mind that the claims are spread pretty much evenly across several years. They also reside in 90 Sqoop-generated files. What would be a reasonable way to generate these 100 samples? Speed is not important. It's more important that the process can get through 2 billion claims without bailing out.
In that case (i.e. speed is not important, you just need to make sure the query does not fail), there are indeed some ways to do this. If I wanted a random sample of 1/1000th of my data-set, I could try this:
index=foo | eval synthId=random()/2147483647 | table _time _raw synthId | search synthId < 0.001 | table _raw
The eval command creates a new synthetic ID field called synthId. The description for the random() function (http://docs.splunk.com/Documentation/Splunk/6.4.0/SearchReference/CommonEvalFunctions) states that it creates a pseudo-random number between 0 and 2^31-1 = 2147483647, so
random()/2147483647 creates a pseudo-random decimal number between 0.0 and 1.0. I want 1/1000th, so I take events with a value less than 1/1000. (The first table keeps _raw so that the final table _raw still has something to show.)
table guarantees that filtering will happen on the task nodes, instead of bringing all events to the search head. Any “aggregating” command will do.
In your case, we could try “search synthId < 0.00000005” (100 / 2 billion = 5*10^-8). But it’s now reasonable to start worrying about round-off error, so it’s probably better if we calculate for ourselves that (2^31-1) * 100 / 2 billion = 107.4, so we get:
index=foo | eval synthId=random() | table _time synthId | search synthId <= 107 | ….
This will get you a sample that is approximately the right size, is different every time, and is statistically correct. If you want a sample of exactly the right size, you could get too many items and take the first 100, e.g.:
index=foo | eval synthId=random() | table _time synthId | search synthId <= 150 | head 100 | ….
This will slightly bias the sample based on iteration order. On the other hand, if you want a repeatable sample, you can use a hash instead of a random number. For example, you could use
synthId=tonumber(substr(md5(_raw), -8), 16) to get a pseudo-random number between 0 and 2^32-1 = 4294967295 that will be the same for a given event, every time you calculate it. Then you can use all the tricks above, with the cutoffs scaled to this larger range.
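As a rough illustration of the hash-based variant, the Python sketch below mirrors the eval expression and works out the cutoff for the 2^32 range. It assumes that MD5 over the UTF-8 bytes of the event matches what md5(_raw) produces; the helper names `synth_id` and `cutoff` and the sample events are hypothetical.

```python
import hashlib

def synth_id(raw_event: str) -> int:
    """Mirror tonumber(substr(md5(_raw), -8), 16): last 8 hex digits of the MD5 as an integer."""
    return int(hashlib.md5(raw_event.encode("utf-8")).hexdigest()[-8:], 16)

def cutoff(sample_size: int, total_events: int, id_range: int = 2**32) -> int:
    """Integer cutoff c such that `synthId <= c` keeps roughly sample_size events."""
    return int(id_range * sample_size / total_events)

# Made-up events standing in for _raw; real claims would come from the virtual index.
events = [f"claim_id={i} amount={i * 7 % 1000}" for i in range(200_000)]
c = cutoff(100, len(events))
sample = [e for e in events if synth_id(e) <= c]
print(len(sample), "events kept with cutoff", c)
print("cutoff for 100 out of 2 billion:", cutoff(100, 2_000_000_000))
```

Because the ID depends only on the event text, rerunning the filter returns the same events, which is the point of using a hash instead of random(). The trade-off is that identical events always share an ID, so exact duplicates are kept or dropped together.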
Hopefully this helps.
This feature, if I am not mistaken, is available only in Hunk 6.4, not 6.3.3.
As we highlighted, the number of splits accepted is approximate, not exact. Therefore, for approximately 100, the value should be 100/2,000,000,000 = 0.00000005.
If you want to make sure the search is limited to only 100, you can add, in addition to the above, something like this:
index=abc | stats count by source limit=100 | ..
Are you still asking for a statistically verifiable data sample, or are you using this to return a limited subset of the total data?
Using this capability to return a limited subset of the data is more manageable than returning a statistically verifiable random sample of the events.
In native Splunk we own the data storage algorithm and can therefore create a verifiable sample of the total records. In Hunk you are relying on Hadoop to store the data, so we can only deliver a verifiable random sample by returning random Hadoop files. Any randomly selected Hadoop file may contain zero or more records that meet the search criteria. This means that we can only approximate the total percentage of records returned from the random sample of Hadoop files examined.
In fact, we will not know whether the 25% sample of files mentioned above contains 10% or 45% of the records in question, because we do not know the distribution of target records within a Hadoop file.
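A small Python simulation can show why the share of records is hard to predict from the share of files. The file layouts below are invented purely for illustration, and `record_share` is a hypothetical helper, not anything Hunk exposes.

```python
import random

def record_share(matches_per_file, rate, rng):
    """Fraction of all matching records that sit in files accepted with probability `rate`."""
    kept = [n for n in matches_per_file if rng.random() < rate]
    return sum(kept) / sum(matches_per_file)

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
# Hypothetical layouts of 90 files: one with matches concentrated in 10 files, one even.
skewed = [100_000] * 10 + [1_000] * 80
even = [12_000] * 90
for name, layout in (("skewed", skewed), ("even", even)):
    shares = [round(record_share(layout, 0.25, rng), 2) for _ in range(5)]
    print(name, shares)
```

When matching records are concentrated in a few files, the share returned depends heavily on whether those particular files happen to be accepted, so it can land well away from 25%.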
Remember there is no inherent organization in a Hadoop file. It is just whatever data your sources have put into it.
Now, here is the interesting point in this process: you could extract all of the target searchable terms from Hadoop and index them in a Splunk index, together with the name of the Hadoop file that contains them. You would then gain all of the Splunk advantages and also keep the original data in Hadoop. Searches become a two-step process:
1. Search the Splunk index to find the terms you want.
2. Use the list of Hadoop files that contain the terms and dates you are interested in to run a Hunk search that returns the actual records, and then perform whatever other investigation you need on a much reduced subset of your Hadoop files.
This means that all searches begin by selecting the terms and dates that apply, followed by a final search to return the raw data. At that point, you can format the output to suit your needs in Hunk or export all of the returned results to your favorite tool.
At the end of the first search, all of the statistics and sampling will work for those terms just as you requested.
For a relatively small table of 30 million, we tried the following:
index=provider | eval rand=random() % 100 | where rand=0 | head 100
It seems to work just fine for this small data-set but I don't know whether it can be used for 2 billion claims...
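One way to think about how this scales is to estimate how many events pass each filter before head 100 is applied. The Python sketch below does that arithmetic, assuming random() is uniform over the 2^31 integers from 0 to 2^31-1 as documented; `expected_matches` is a hypothetical helper.

```python
RANDOM_RANGE = 2**31  # random() is documented as returning 0 .. 2^31-1

def expected_matches(total_events: int, kept_values: int, value_range: int) -> float:
    """Expected events passing a filter that keeps `kept_values` of `value_range` equally likely values."""
    return total_events * kept_values / value_range

for total in (30_000_000, 2_000_000_000):
    modulo = expected_matches(total, 1, 100)             # random() % 100, where rand=0
    cutoff = expected_matches(total, 108, RANDOM_RANGE)  # synthId <= 107 keeps 108 values
    print(f"{total:>13,} events: modulo filter ~{modulo:,.0f}, cutoff filter ~{cutoff:,.1f}")
```

Under these assumptions, the modulo filter passes about 1% of events regardless of table size, so head 100 ends up choosing the first 100 of a very large candidate set in iteration order. A cutoff tuned to the table size keeps the candidate set close to 100, which reduces that ordering bias.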