Solved: Data Discovery of Index Fields

aportela_work · ‎10-02-2013

Does anyone know how to generate a report listing all fields (from an index) and their respective info (example follows)?
% not null of each field,
different values of each field,
% of each field per value.

This would be like the information seen in the Search page, on the left hand side, that shows "selected fields" and "Interesting Fields".

However, instead of being interactive (like in Search Page), it would be a search query that can run to generate a report. And, the report could be in table format or csv file or just regular output.

Any and all help would be much appreciated.
Cheers... Al

Ayn · ‎10-02-2013

First of all, just to make it clear - there are no fields in the index itself. Well almost no fields, a few like source and sourcetype are there but pretty much all field extraction in Splunk happens at search-time, AFTER the events have been extracted from the index. Basically Splunk only indexes the data and breaks it down into segments, and when you search for, say, "myfield=foo" Splunk will look for the term "foo" in the index first and THEN check if the instances it found can be connected to the field "myfield".

Anyway, on to the rest of the question. The first thing you'll want to do is find out how many events there are in the index, so you can calculate percentages. This query will do it:

| dbinspect index=yourindex | stats sum(eventCount) as eventCount

...and to put it in a field that you can use in the rest of your search, put it in a subsearch after eval:

... | eval [| dbinspect index=yourindex | stats sum(eventCount) as eventCount | format "" "" "" "" "" ""]

This will give you an eventCount field with the total number of events in the index.

Now, as for your first example, % of non-null events, I started playing with searches to get something working and it's actually more complex than I thought at first (either that, or I forgot something obvious). The problem lies in performing calculations on an arbitrary list of fields, which is needed for calculating the percentage. I couldn't really find a solution that works on pre-6.x instances of Splunk. Splunk 6 introduces the foreach command which solves this though. Here goes:

index=yourindex | stats count(*) as * | eval [| dbinspect index=yourindex | stats sum(eventCount) as eventCount | format "" "" "" "" "" ""] | foreach * [eval <<FIELD>> = '<<FIELD>>'/eventCount*100]

This will give you a table with one column per extracted field in the search over the whole index, and a value showing how many percent of events this field was extracted in.

As for the second one, values for each field, this is much easier:

index=yourindex | stats values(*) as *

That's it.

The third example I'm not really sure exactly what you mean by, so I'll just leave that for now. Please clarify this example and I'll see if I can write something up.

Anyway the drawback to all this is that it's probably be painfully slow - it requires Splunk to read ALL EVENTS in an index in order to be able to compile the report. You're probably going to want to take just a snapshot in order to make it a bit more speedy.

View solution in original post

Ayn · ‎10-02-2013

First of all, just to make it clear - there are no fields in the index itself. Well almost no fields, a few like source and sourcetype are there but pretty much all field extraction in Splunk happens at search-time, AFTER the events have been extracted from the index. Basically Splunk only indexes the data and breaks it down into segments, and when you search for, say, "myfield=foo" Splunk will look for the term "foo" in the index first and THEN check if the instances it found can be connected to the field "myfield".

Anyway, on to the rest of the question. The first thing you'll want to do is find out how many events there are in the index, so you can calculate percentages. This query will do it:

| dbinspect index=yourindex | stats sum(eventCount) as eventCount

...and to put it in a field that you can use in the rest of your search, put it in a subsearch after eval:

... | eval [| dbinspect index=yourindex | stats sum(eventCount) as eventCount | format "" "" "" "" "" ""]

This will give you an eventCount field with the total number of events in the index.

Now, as for your first example, % of non-null events, I started playing with searches to get something working and it's actually more complex than I thought at first (either that, or I forgot something obvious). The problem lies in performing calculations on an arbitrary list of fields, which is needed for calculating the percentage. I couldn't really find a solution that works on pre-6.x instances of Splunk. Splunk 6 introduces the foreach command which solves this though. Here goes:

index=yourindex | stats count(*) as * | eval [| dbinspect index=yourindex | stats sum(eventCount) as eventCount | format "" "" "" "" "" ""] | foreach * [eval <<FIELD>> = '<<FIELD>>'/eventCount*100]

This will give you a table with one column per extracted field in the search over the whole index, and a value showing how many percent of events this field was extracted in.

As for the second one, values for each field, this is much easier:

index=yourindex | stats values(*) as *

That's it.

The third example I'm not really sure exactly what you mean by, so I'll just leave that for now. Please clarify this example and I'll see if I can write something up.

Anyway the drawback to all this is that it's probably be painfully slow - it requires Splunk to read ALL EVENTS in an index in order to be able to compile the report. You're probably going to want to take just a snapshot in order to make it a bit more speedy.

Ayn · ‎10-03-2013

Mimicking that BEHAVIOUR is a non-trivial problem as it involves writing completely new widgets for splunkweb. I'm going to pass on that one. Sorry. 🙂

aportela_work · ‎10-02-2013

Thanks... good info. First and last tips work and are helpful. But, do you have code that would mimic search page results, left hand margin, "Selected Fields" or "Interest in Fields"? Here, Splunk (by default) provides a list of fields and respective # of distinct values. Click on one, and can see the % info of each value per field, and % not-null of entire field. Regardless, previous info is useful and helps with the data profiling effort.... Thanks again!

Data Discovery of Index Fields

Splunk Observability Cloud's AI Assistant in Action Series: Auditing Compliance and ...

Splunk Community Badges!

What You Read The Most: Splunk Lantern’s Most Popular Articles!

Are you a member of the Splunk Community?

Data Discovery of Index Fields

Splunk Observability Cloud's AI Assistant in Action Series: Auditing Compliance and ...

Splunk Community Badges!

What You Read The Most: Splunk Lantern’s Most Popular Articles!