Getting Data In

Indexing CSV files where each line is a single Help Desk ticket. - Not as simple as it sounds.

ocallender
Explorer

This is going to be long, but I hope it presents an interesting problem and, hopefully, has an elegant solution.

One of the things that really sold me on Splunk was the ability to throw a huge CSV file at it, index it as a one-off, and quickly create meaningful graphs and reports. The problem is that I have to manually copy and index each CSV snapshot as a one-off, and then modify my searches to point to the new source file. I want to automate this so it requires no manual intervention: my saved searches and dashboards should always have the latest information.

Here's an idea of what the CSV looks like.

Created_Time,Request_ID,Request_Type,Request_Status,Completed_Time
DD-MM-YYYY HH:MM:SS,#100,incident,open,
DD-MM-YYYY HH:MM:SS,#101,request,closed,DD-MM-YYYY HH:MM:SS

So I'm basically treating tickets as events, with _time = Created_Time.

#100 is open. The next time I take a snapshot of my tickets, #100 might be closed. I would want my dashboards to pick that up and report on that ticket in its closed state the next time they run. But Splunk can't change field values after indexing them. If I made Splunk monitor the CSV for changes and overwrote it with the entire snapshot, it would re-index the file, duplicating all the tickets from the previous snapshot.

If I were to overwrite the CSV with only the newly created tickets, then over time my indexed data would contain only open tickets, because ticket closures would be missing from the updates.

If I updated with newly created tickets plus tickets that have changed state, tickets would be duplicated in the index every time they changed status. Maybe this isn't such a bad thing, but I'm not sure how to search on it without counting the duplicates.

Then I thought about using a lookup table. The CSV could be loaded as a lookup file, and I'd update it with all the tickets keyed by ticket number and Created_Time. That way, each search would look up the latest version of the field values. I'd then index a copy of the ticket numbers and Created_Time and do a lookup against the full list of tickets. Does this sound like it could work?
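To sanity-check the lookup idea outside of Splunk, here's a small Python sketch (the sample data and names are my own, not anything Splunk-specific) that merges snapshots into a latest-state table keyed by Request_ID, which is essentially what an updated lookup file would give you:

```python
import csv
import io

def update_lookup(lookup, snapshot_csv):
    """Merge a snapshot into the lookup, keeping the latest state per ticket."""
    for row in csv.DictReader(io.StringIO(snapshot_csv)):
        lookup[row["Request_ID"]] = row  # later snapshots overwrite earlier state
    return lookup

# Two snapshots: ticket #100 is open in the first and closed in the second.
snap1 = """Created_Time,Request_ID,Request_Type,Request_Status,Completed_Time
01-03-2013 09:00:00,#100,incident,open,
"""
snap2 = """Created_Time,Request_ID,Request_Type,Request_Status,Completed_Time
01-03-2013 09:00:00,#100,incident,closed,02-03-2013 10:00:00
01-03-2013 11:00:00,#101,request,open,
"""

lookup = {}
update_lookup(lookup, snap1)
update_lookup(lookup, snap2)
print(lookup["#100"]["Request_Status"])  # closed
```

Searches that go through the lookup would then always see the closed state, even though the first snapshot said "open".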

I'm just learning how to use Splunk and this has been racking my brain.

1 Solution

Ayn
Legend

OK, I'm going to break up a part of your post and comment:

"#100 is open. The next time I take a snapshot of my tickets, #100 might be closed. I would want my dashboards to pick that up and report on that ticket in its closed state the next time they run. But Splunk can't change field values after indexing them."

Not sure what you mean by that last sentence. Event DATA cannot be altered after it's been indexed. Fields, however, are generally not created at index time, so you're free to alter how they're defined any way you want.

"If I made Splunk monitor the CSV for changes and overwrote it with the entire snapshot, it would re-index the file, duplicating all the tickets from the previous snapshot."

No - Splunk would certainly notice the change in the file's modtime, but it keeps a file pointer to track where it last read from the file and carries on from there. As long as you're only adding data to the file, not modifying it, only your new data will be indexed, so no duplicates.
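To illustrate the append-only approach, here's a hedged Python sketch (the file layout and function names are my own assumptions) of a snapshot script that appends only rows it hasn't written before, so the monitored file only ever grows and Splunk's file pointer picks up just the new lines:

```python
import csv
import os

def append_new_rows(target_path, snapshot_rows, seen):
    """Append only rows we haven't written before, so Splunk's file
    pointer picks up just the new lines on its next read."""
    write_header = not os.path.exists(target_path)
    with open(target_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["Created_Time", "Request_ID", "Request_Type",
                             "Request_Status", "Completed_Time"])
        for row in snapshot_rows:
            key = (row[1], row[3])  # (Request_ID, Request_Status): one line per state
            if key not in seen:
                writer.writerow(row)
                seen.add(key)
```

In real use the `seen` set would have to survive between runs (e.g. rebuilt from the file on startup); this sketch only shows the append-only idea.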


lguinn2
Legend

RE: your paragraph 2: there are ways to automate the uptake of CSV files. See the [batch://<path>] stanza in inputs.conf

http://docs.splunk.com/Documentation/Splunk/5.0.2/admin/Inputsconf

Also, you should be able to rewrite your searches so that they always point at the most recent data. As you note below, dedup might be part of this strategy.
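For reference, a minimal [batch] stanza might look like the following; the path and sourcetype here are placeholders, and note that move_policy = sinkhole means Splunk deletes each file after indexing it:

```
[batch:///opt/ticket_snapshots/*.csv]
move_policy = sinkhole
sourcetype = tickets
disabled = false
```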


dwaddle
SplunkTrust

Anytime someone starts talking about events that change over time and wanting to know the most recent state, I have to plug araitz's blog post on the topic.

The nice thing about this type of approach is that you have both types of data -- the most recent state and all of the collected state changes -- available to you.

Of course, there are bonus points available if you can update your 'ticket puller' script to only pull information on tickets that have been updated since the last time you ran it. That way, you avoid a number of duplicate events for a given ticket showing the exact same state as before.
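As a sketch of that bonus-points idea (the fetch call below is hypothetical; substitute your ticket system's real API), the puller just remembers when it last ran and keeps only tickets updated since then:

```python
from datetime import datetime, timezone

def pull_updated_tickets(fetch, since):
    """fetch(since) is a stand-in for your ticketing system's API call
    (hypothetical); filter defensively in case it returns extra tickets."""
    return [t for t in fetch(since) if t["updated"] > since]

# Example run: only #101 was updated after the last pull.
last_run = datetime(2013, 3, 1, tzinfo=timezone.utc)

def fake_fetch(since):
    return [
        {"id": "#100", "updated": datetime(2013, 2, 28, tzinfo=timezone.utc)},
        {"id": "#101", "updated": datetime(2013, 3, 2, tzinfo=timezone.utc)},
    ]

new_tickets = pull_updated_tickets(fake_fetch, last_run)
print([t["id"] for t in new_tickets])  # ['#101']
```

A real puller would persist `last_run` to disk between runs so restarts don't re-pull everything.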

lguinn2
Legend

Thanks for linking to Alex's blog post. Good suggestion about the "ticket puller" script.


lguinn2
Legend

If each CSV file is a snapshot of tickets, I would just index them all. This assumes that each snapshot is distinct, and that you are not adding to the end of the existing CSV file. Yes, there would be many events for each ticket; each event represents the state of the ticket at the snapshot time. But you could find the latest state of each ticket very easily:

yoursearchhere | stats latest(Request_Status) as Current_Status by Request_ID

Another interesting search - which frankly can be impossibly slow over very large datasets or very long timeframes -

yoursearchhere | transaction Request_ID

yoursearchhere could be something like sourcetype=tickets or whatever makes sense.
And you could use the first search to create a lookup table automatically on a schedule; depending on the type of searches that you want to run, that could be convenient. But I would try some other ideas first, to get a handle on what is possible.

ocallender
Explorer

Thank you for this. I thought about it but wasn't sure how to search through it. I'm guessing that the "dedup Request_ID" example in the answer above should give me the latest snapshot of each ticket, which I can then pipe to my existing searches. The "latest(x)" function seems to return only the latest value of one field; I'd want to run my searches across the latest state of all fields in a ticket.

Thank you. These two answers have put me in the right direction. I'm very grateful for the help. I wish I could make this a second accepted answer.



Ayn
Legend

If you insist on having Splunk act as a general DBMS, then you would more or less have to go the lookup route. But I think you should index the file just as-is, and if that means you get several events for each ticket, why would that be a problem? Just grab the last seen status for each ticket and you'll know the current state.

For instance

... | dedup Request_ID

will keep only the latest event per ticket, so if you want all currently open tickets, just do

... | dedup Request_ID | search Request_Status="open"
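In Python terms, just to illustrate the semantics (Splunk returns events newest-first, and `dedup` keeps the first event it sees per ticket; the sample data is mine), this is roughly what that pipeline computes:

```python
def dedup_then_filter(events, status):
    """Mimic `dedup Request_ID | search Request_Status=<status>`:
    keep the newest event per ticket, then filter on status."""
    seen, kept = set(), []
    for e in sorted(events, key=lambda e: e["_time"], reverse=True):
        if e["Request_ID"] not in seen:
            seen.add(e["Request_ID"])
            kept.append(e)
    return [e for e in kept if e["Request_Status"] == status]

events = [
    {"_time": 1, "Request_ID": "#100", "Request_Status": "open"},
    {"_time": 2, "Request_ID": "#100", "Request_Status": "closed"},
    {"_time": 2, "Request_ID": "#101", "Request_Status": "open"},
]
print([e["Request_ID"] for e in dedup_then_filter(events, "open")])  # ['#101']
```

Ticket #100 drops out because its newest event says closed; only #101 is currently open.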

ocallender
Explorer

Thanks for your response. You answered part of it by saying that field values can't be modified after they are indexed. I was looking for a way to modify the value of a field (e.g. Status) so that when a ticket changes, I still have only one entry in the index for that ticket, with updated values.

If I just log status changes, I should be able to create a search that counts the tickets as transactions with multiple events. But how can I search on the latest status update for reports like "All open tickets", showing only the most recent state of each ticket? I hope I'm making sense here.
