## How to get a random sampling from a large data set in Hunk?

Ultra Champion

We have around two billion claims from roughly three years. The client is interested in a good data sample of, say, 100 claims. They mention that in SQL they do something like `order by newid()`, which fetches 100 random records from a table.
Any ideas on how to do something similar with Hunk/Splunk?

Ultra Champion

For a relatively small table of 30 million we tried the following -

```
index=provider | eval rand=random() % 100 | where rand=0 | head 100
```

It seems to work just fine for this small data-set but I don't know whether it can be used for 2 billion claims...

Splunk Employee

The reason I didn't suggest something like that before is that random() picks a number uniformly distributed from 0 to 2,147,483,647. Numbers ending in 00 to 47 show up one more time in that range than the numbers 48 to 99. In this case, there are 21,474,836 full sets of numbers and one incomplete set, so it will be very hard to detect the difference. But if you did:

```
... | eval rand=random() % 2000000000 | where rand < 100
```

Then every value from 0 to 147,483,647 will show up twice as often as every value from 147,483,648 to 1,999,999,999, so 0 - 99 will be significantly over-represented and you will get too many events.
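The counts in this answer can be checked with a little arithmetic. The following is a Python sketch, not SPL. It assumes, as stated above, that random() is uniform over the integers 0 to 2,147,483,647; the helper names `residue_count` and `expected_hits` are made up for illustration.

```python
# Assumption: random() is uniform over the integers 0 .. 2**31 - 1.
RANGE_SIZE = 2**31


def residue_count(r, modulus, range_size=RANGE_SIZE):
    """Count the integers v in [0, range_size) with v % modulus == r."""
    if r < 0 or r >= modulus or r >= range_size:
        return 0
    return (range_size - 1 - r) // modulus + 1


def expected_hits(limit, modulus, n_events, range_size=RANGE_SIZE):
    """Expected number of events for which random() % modulus < limit."""
    favourable = sum(residue_count(r, modulus, range_size) for r in range(limit))
    return n_events * favourable / range_size


if __name__ == "__main__":
    # Modulus 100: residues 0..47 occur one more time than 48..99.
    print(residue_count(47, 100), residue_count(48, 100))
    # Modulus 2,000,000,000: small residues occur twice, large ones once.
    print(residue_count(99, 2_000_000_000), residue_count(147_483_648, 2_000_000_000))
    # Expected size of the `rand < 100` sample over two billion events.
    print(expected_hits(100, 2_000_000_000, 2_000_000_000))
```

With a modulus of 100 the two counts differ by about one part in 21 million. With a modulus of 2,000,000,000 the residues 0 - 99 are each counted twice, so the expected number of hits over two billion events is close to 186 rather than 100.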

Splunk Employee

This just proves that statistics hurts the brain. @kschon, we will try what you've suggested above. It's an elegant solution, and we'll look to couple it with the vix sample rate for performance.

Splunk Employee

I agree, statistics and brain health are at odds. Hopefully this will help. You can, of course, use the simpler solution and add a "head 100" command. It just won't spread the events around as evenly.

Splunk Employee

Are you still asking for a statistically verifiable data sample, or are you using this to return a limited subset of the total data?

Using this capability to return a limited subset of the data is more manageable than returning a statistically verifiable random sample of the events.

In native Splunk we own the data storage algorithm and can therefore create a verifiable sample of the total records. In Hunk you are relying on Hadoop to store the data, so we can only deliver a verifiable random sample by returning random Hadoop files. Any randomly selected Hadoop file may contain zero or more records that meet the search criteria. This means that we can only approximate the total percentage of records returned from the random sample of Hadoop files examined.

In fact, we will not know whether the 25% sample of files mentioned above contains 10% of the records in question or 45% of them, because we do not know the distribution of target records within a Hadoop file.

Remember that there is no inherent organization in a Hadoop file. It is just whatever data your sources have put into it.

Now, and here is the interesting point in this process: you could extract all of the searchable target terms from Hadoop and index them in a Splunk index, together with the name of the Hadoop file that contains them. You would then gain all of the Splunk advantages while keeping the original data in Hadoop. Searches become a two-step process:

1. Search the Splunk index to find the terms you want.
2. Use the list of Hadoop files that contain the terms and dates you are interested in to run a Hunk search that returns the actual records, then perform whatever further investigation you need on a much-reduced subset of your Hadoop files.

This means that every search begins by selecting the terms and dates that apply, followed by a final search to return the raw data. At that point, you can format the output to suit your needs in Hunk or export all of the returned results to your favorite tool.

At the end of the first search, all of the statistics and sampling will work for those terms just as you requested.

Ultra Champion

Very kind of you Claw!!

Splunk Employee

In Hunk 6.4, sampling is done by files, not events. The Hunk sampling flag is called `vix.split.sample.rate`.

When you set `vix.split.sample.rate = 0.25`, it means that each split has a 1-in-4 probability of being accepted (it does not mean that every 4th split will be accepted).
For large numbers of splits, this means that roughly 25% will be accepted, and that it will not be the same 25% each time, and that we are doing our best to make sure that iteration order does not determine which splits we accept. But for small numbers of splits, it’s hard to predict how many we get back.
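The difference between "each split has a 1-in-4 chance" and "every 4th split" can be seen in a small simulation. This is a Python model of the behaviour described above, not Hunk's actual implementation; the function name `accepted_splits` and the split counts are hypothetical.

```python
import random


def accepted_splits(n_splits, rate, rng):
    """Return the indices of the splits kept when each one is accepted
    independently with probability `rate`."""
    return [i for i in range(n_splits) if rng.random() < rate]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only so the demo is repeatable
    for n_splits in (8, 100_000):
        kept = accepted_splits(n_splits, 0.25, rng)
        # For many splits the kept fraction settles near 0.25;
        # for a handful of splits it can be far from it.
        print(n_splits, len(kept), len(kept) / n_splits)
```

Running it with different seeds shows the accepted set changing from run to run, with the kept fraction stable only when the number of splits is large.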

In Splunk 6.4 we also have the event sampling feature: http://docs.splunk.com/Documentation/Splunk/6.4.0/Search/Retrieveasamplesetofevents

Splunk Employee

This feature - if I am not mistaken - is available only in Hunk 6.4, not 6.3.3.
As we highlighted, the number of splits accepted is approximate, not exact. Therefore, for approximately 100, the value should be 100/2,000,000,000 = 0.00000005.
If you want to make sure the search is limited to only 100, you can add, in addition to the above, something like this:
`index=abc | stats count by source limit=100 | ..`

Ultra Champion

Great. So, if we want exactly 100 "good" samples from two billion events in Splunk 6.3.3, what should we do?

Splunk Employee

Do you mean 100 events fairly sampled from a virtual index? If so, the problem is that, in order to do it efficiently, you would need an external index of some sort. Since Hunk does not directly manage your data, it does not maintain an index. Hunk cannot know how many events are in each file in your virtual index without reading them, and it does not know the starting offsets of the events within a file, so it cannot do fair sampling without reading all the files that your search would hit.

If you are willing to sample inefficiently, you could have Hunk read all the events and pipe them to a custom command that would do the sampling by randomly deciding whether to keep each event. I believe such a command is included in the Machine Learning Toolkit on Splunkbase. But if the whole reason you are sampling is to speed up the search, then this may not be what you want.

Ultra Champion

kschon_splunk, let's keep in mind that the claims are spread pretty much evenly across several years. They also reside in 90 Sqoop-generated files. What would be a reasonable way to generate these 100 samples? Speed is not important. It's more important that the process can handle 2 billion claims without bailing out.

Splunk Employee

In that case (i.e. speed is not important, you just need to make sure the query does not fail), there are indeed some ways to do this. If I wanted a random sample of 1/1000th of my data set, I could try this:

```
index=foo | eval synthId=random()/2147483647 | table _time synthId | search synthId < 0.001 | table _raw
```

- The middle command creates a new synthetic ID field called synthId. The description of the random() function (http://docs.splunk.com/Documentation/Splunk/6.4.0/SearchReference/CommonEvalFunctions) states that it creates a pseudo-random number between 0 and 2^31-1 = 2147483647, so `random()/2147483647` creates a pseudo-random decimal number between 0.0 and 1.0. I want 1/1000th, so I take events with a value less than 1/1000.
- The final `table` guarantees that filtering will happen on the task nodes, instead of bringing all events to the search head. Any "aggregating" command will do.

In your case, we could try `search synthId < 0.00000005` (100 / 2 billion = 5*10^-8). But it's now reasonable to start worrying about round-off error, so it's probably better to calculate for ourselves that (2^31-1) * 100 / 2 billion = 107.4, which gives:

```
index=foo | eval synthId=random() | table _time synthId | search synthId <= 107 | ...
```
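The threshold arithmetic can be reproduced outside Splunk. The Python sketch below assumes random() returns integers uniformly between 0 and 2^31-1, as the documentation cited above describes; the function names are hypothetical.

```python
RANGE_MAX = 2**31 - 1  # largest value random() is documented to return


def threshold_for(sample_size, population, range_max=RANGE_MAX):
    """Integer threshold T = floor(range_max * sample_size / population)."""
    return int(range_max * sample_size / population)


def expected_sample_size(threshold, population, range_max=RANGE_MAX):
    """Expected number of events kept by `synthId <= threshold`.

    threshold + 1 of the range_max + 1 possible values pass the filter.
    """
    return population * (threshold + 1) / (range_max + 1)


if __name__ == "__main__":
    t = threshold_for(100, 2_000_000_000)
    print(t, expected_sample_size(t, 2_000_000_000))
```

For 100 events out of two billion this gives a threshold of 107 and an expected sample of roughly 100.6 events, which is why the result is only approximately the requested size.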

This will get you a sample that is approximately the right size, is different every time, and is statistically correct. If you want a sample of exactly the right size, you could get too many items and take the first 100, e.g.:

```
index=foo | eval synthId=random() | table _time synthId | search synthId <= 150 | head 100 | ...
```
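To see why the threshold is raised before applying `head 100`, it helps to estimate how often the filter alone yields at least 100 events. The Python sketch below uses a Poisson approximation to the true binomial count, which is an assumption made here for simplicity; it also assumes two billion events and a uniform random() over 0 to 2^31-1. The function names are hypothetical.

```python
import math


def expected_kept(threshold, population, range_size=2**31):
    """Expected number of events passing `synthId <= threshold`."""
    return population * (threshold + 1) / range_size


def prob_at_least(k, mean):
    """P(X >= k) for X ~ Poisson(mean), summed term by term."""
    term = math.exp(-mean)  # P(X = 0)
    below = 0.0
    for i in range(k):
        below += term             # add P(X = i)
        term *= mean / (i + 1)    # step to P(X = i + 1)
    return 1.0 - below


if __name__ == "__main__":
    for threshold in (107, 150):
        mean = expected_kept(threshold, 2_000_000_000)
        print(threshold, mean, prob_at_least(100, mean))
```

Under these assumptions a threshold of 107 reaches 100 events only a little more than half the time, while a threshold of 150 does so in well over 99% of runs.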

This will slightly bias the sample based on iteration order. On the other hand, if you want a repeatable sample, you can use a hash instead of a random number. For example, you could use `synthId=tonumber(substr(md5(_raw), -8), 16)` to get a pseudo-random number between 0 and 4294967295 (2^32-1) that will be the same for a given event every time you calculate it. Then you can use all the tricks above, with the thresholds rescaled for the larger range.
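The hash-based variant can be sketched in Python to show what "repeatable" means here. This is an illustration rather than SPL: it assumes the event text is hashed as UTF-8 bytes, and the names `synth_id` and `keep` are hypothetical.

```python
import hashlib

HASH_RANGE = 2**32  # 8 hex digits give values 0 .. 2**32 - 1


def synth_id(raw_event):
    """Mirror tonumber(substr(md5(_raw), -8), 16): take the last 8 hex
    digits of the MD5 digest and read them as a base-16 integer."""
    digest = hashlib.md5(raw_event.encode("utf-8")).hexdigest()
    return int(digest[-8:], 16)


def keep(raw_event, rate):
    """Deterministically keep roughly `rate` of all events."""
    return synth_id(raw_event) < rate * HASH_RANGE


if __name__ == "__main__":
    events = [f"claim_id={i} amount={i * 3}" for i in range(20_000)]
    first = [e for e in events if keep(e, 0.1)]
    second = [e for e in events if keep(e, 0.1)]
    # Same events are selected on every run; the fraction is close to 0.1.
    print(first == second, len(first) / len(events))
```

Because the ID depends only on the event text, byte-identical events always land on the same side of the threshold, so duplicates are either all kept or all dropped.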

Hopefully this helps.
