categorization based on frequent text

New Member

Hi. I have an Excel dump of incident tickets generated from the ticketing tool.
Sample incident descriptions from the report:

  1. "Target: CI-xxxx Stateless event alarm Event details: HA recovered from a total cluster failure in cluster"
  2. "Server - CI-aaaa generates Multipath Issue Fibre Channel information: Multipathing ERROR, not all luns have 4 paths"
  3. "Servers generate CI-aaaa & CI-bbbbb - Multipath issue Fibre Channel information: Multipathing ERROR, not all luns have 4 paths"
  4. "Servers generate CI-aaaa & CI-bbbbb - Multipath issue Fibre Channel information: Multipathing ERROR, not all luns have 4 paths"
  5. "[VMware vCenter - Alarm Cluster high availability error] Insufficient resources to satisfy HA failover level on cluster"
  6. "F drive is having less disk space nagios-ebs: CI-xxxx "
  7. "Low disk space alert on CI-yyyyy"
  8. "Failed backup report for 2nd April 2012 : CI-xxxx , CI-aaaa , CI-bbbbb"
  9. "Failed backup report for 3rd April 2012 : CI-xxxx , CI-aaaa , CI-bbbbb"

There is no exclusive "category" field. My end objective is to perform a Trend Analysis to identify top recurring issues.
I could perform a grouping by going through the description fields one by one and identifying the incident type.

Desired output would be:

category ---- count of occurrence
HA ---- 2
Multipath ---- 3
disk space ---- 2
failed backup ---- 2

Manual grouping would not be feasible, though, for a list of 300+ incidents.

I was wondering whether Splunk could identify the common significant text in the description fields and return a similar grouping, without my having to key in search strings?

Communicator

I know this question was asked quite a while ago, but in case anyone stumbles across it in a search, I thought I'd mention that Prelert Anomaly Detective for Splunk (http://splunk-base.splunk.com/apps/68765/prelert-anomaly-detective) can categorize events by looking for common words in the raw text.

Legend

Do you mean whether Splunk can somehow automatically identify a category for each of these messages and return it? In that case the answer is no. Splunk doesn't know anything about what these logs actually mean; it simply indexes them like any other data. Any further intelligence has to be provided by you (or by someone else who has already packaged it in an app or similar).

If you mean that Splunk could match on individual strings in each message and create fields from that, certainly. You could match on the string "disk space" and put that into a field, same goes for any other string you're interested in.
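
As a rough illustration of the string-matching approach outside Splunk, here is a minimal Python sketch. The keyword map and the names `CATEGORY_KEYWORDS`, `categorize` and `count_categories` are made up for this example; only the phrases themselves come from the sample descriptions in the question.

```python
from collections import Counter

# Hypothetical keyword-to-category map. The phrases are taken from the sample
# descriptions; which phrase belongs to which category is an assumption.
CATEGORY_KEYWORDS = {
    "HA": ["ha recovered", "ha failover", "high availability"],
    "Multipath": ["multipath"],
    "disk space": ["disk space"],
    "failed backup": ["failed backup"],
}


def categorize(description):
    """Return the first category whose keyword occurs in the description."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def count_categories(descriptions):
    """Count how many descriptions fall into each category."""
    return Counter(categorize(d) for d in descriptions)


if __name__ == "__main__":
    samples = [
        "F drive is having less disk space nagios-ebs: CI-xxxx ",
        "Low disk space alert on CI-yyyyy",
        "Failed backup report for 2nd April 2012 : CI-xxxx , CI-aaaa , CI-bbbbb",
    ]
    for category, count in count_categories(samples).most_common():
        print(f"{category} ---- {count}")
```

The catch is the one already mentioned: someone still has to supply the keywords, and anything that matches none of them lands in "uncategorized".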

Legend

The index is in a proprietary binary format that can't be read in any way like that, so no, the assumption is false.
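
That said, per-word counts of the kind the poster has in mind can be computed straight from the exported descriptions, without touching the index. The following is only a sketch in Python, not anything Splunk does internally; the stop list and the name `word_counts` are assumptions for illustration.

```python
import re
from collections import Counter

# Assumed stop list: filler words that say nothing about the incident type.
STOPWORDS = {"the", "for", "from", "not", "all", "have", "having"}


def word_counts(descriptions):
    """Count lowercase alphabetic tokens across all descriptions.

    Tokens of one or two letters (such as the "ci" prefix of CI-xxxx) and
    stop words are skipped.
    """
    counts = Counter()
    for description in descriptions:
        tokens = re.findall(r"[a-z]+", description.lower())
        counts.update(t for t in tokens if len(t) > 2 and t not in STOPWORDS)
    return counts


if __name__ == "__main__":
    samples = [
        "F drive is having less disk space nagios-ebs: CI-xxxx ",
        "Low disk space alert on CI-yyyyy",
        "Failed backup report for 2nd April 2012 : CI-xxxx , CI-aaaa , CI-bbbbb",
    ]
    for word, count in word_counts(samples).most_common(5):
        print(count, word)
```

The most frequent words are candidate search strings, but they still need a human to decide which ones describe a category and which are just host names.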

New Member

I was referring to the index file that would get generated when I run Splunk on the file containing the incident description.

Based on the example provided, I am assuming the index file would have the following content:
5 aaaa
2 backup
4 bbbbb
3 channel
4 cluster
2 disk
3 fibre
3 luns
3 multipath
where the numbers specify the number of times the string appears in the content.

I was wondering whether I could read this index file to obtain the strings and counts, provided my assumption about the index file contents is correct.

Thanks!

Legend

Please clarify what you mean - what index file are you referring to, and which various strings?

New Member

Thanks Ayn!

Yes, the first part is what I am looking for, as I do not currently know what the possible incident categories are or which strings I should be searching for.

Would it be feasible to read the index file and identify from it the various strings and their number of occurrences?

Splunk Employee

Can you provide the data? It's still a little unclear to me what you're trying to accomplish.

New Member

You're right about the format: it doesn't have a common template. Thanks Lamar!

Splunk Employee

The problem you'll have with this data is that the events don't share a common format.

You have some events that have their description after a ":" and then some descriptions actually start at the beginning of the event/line.

You could create a hash of your event and key off that with a lookup or something similar to that.
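
One way to read the hashing idea, sketched in Python under my own assumptions: mask the parts that vary between otherwise identical events (CI names, dates, numbers), hash what is left, and use that hash as the key. The names `normalize`, `event_key` and `group_by_key`, and the masking rules, are hypothetical.

```python
import hashlib
import re
from collections import Counter


def normalize(description):
    """Mask variable parts so that repeated events produce identical text."""
    text = description.lower()
    text = re.sub(r"ci-\w+", "<ci>", text)               # CI identifiers
    text = re.sub(r"\d+(?:st|nd|rd|th)?", "<n>", text)   # numbers and ordinals
    text = re.sub(r"[^a-z<>]+", " ", text)               # punctuation, spacing
    return text.strip()


def event_key(description):
    """Short, stable hash of the normalized description."""
    digest = hashlib.sha1(normalize(description).encode("utf-8")).hexdigest()
    return digest[:12]


def group_by_key(descriptions):
    """Count how many descriptions share each key."""
    return Counter(event_key(d) for d in descriptions)


if __name__ == "__main__":
    samples = [
        "Failed backup report for 2nd April 2012 : CI-xxxx , CI-aaaa , CI-bbbbb",
        "Failed backup report for 3rd April 2012 : CI-xxxx , CI-aaaa , CI-bbbbb",
        "Low disk space alert on CI-yyyyy",
    ]
    for key, count in group_by_key(samples).most_common():
        print(key, count)
```

This only groups events whose wording is identical after masking, so the two backup reports share a key while differently worded multipath alerts do not. A lookup from key to category name would still have to be maintained by hand.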

New Member

Thanks for your time, Lamar!
I have edited my original post to include samples of my requirement. I trust this makes things clearer.
