Getting Data In

How to correlate multiple CSV files using different columns?

changux
Builder

Hi all.

I have six CSV files extracted from a running system where I can't access the backend to install a forwarder, so my best option is to process the CSV output files.

The files look like this:

File1.csv = NumberID plus about 30 more columns.
File2.csv = NumberID, RegID plus about 20 more columns.
File3.csv = RegID plus about 40 more columns.
File4.csv = RegID plus about 20 more columns.
File5.csv = NumberID plus about 5 more columns.
File6.csv = RegID plus about 8 more columns.

I need to correlate all the files to build one big file with the relevant information from each (I will choose the useful columns), joining only on NumberID and RegID. But these fields are only present in certain files, so I need to switch the join column as I go.
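In other words, since File2.csv is the only file that contains both NumberID and RegID, it has to act as the bridge between the two keys. Something like this chained lookup is roughly what I am trying to build (assuming the files are uploaded as lookup table files; ValueColumn is just a placeholder for the columns I would pick):

| inputlookup File1.csv | lookup File2.csv NumberID OUTPUT RegID | lookup File3.csv RegID OUTPUT ValueColumn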

Based on this, I have some questions:

1.) If my CSV files change about once per week, what is the best way to get them ingested by Splunk? I mean, I need to analyze only my latest files, not the whole history of records.
2.) How can I do the correlation? I checked other answers like:

http://answers.splunk.com/answers/232031/how-to-correlate-data-from-three-csv-file-sources.html

But I don't know which is the best option.

Thank you so much for your help.


woodcock
Esteemed Legend

If you don't have very many events, you can use inputcsv and append (which has an upper limit of around 10K-50K results) and transaction (which slows down terribly on large datasets), like this:

| inputcsv File1.csv | append [| inputcsv File2.csv] | append [| inputcsv File3.csv] | append [| inputcsv File4.csv] | append [| inputcsv File5.csv] | append [| inputcsv File6.csv] | transaction NumberID RegID

You pretty much have to use transaction because it is the only practical way to handle a transitive key relationship like the one you have described.
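If you only want the relevant columns from each file in the final result, you could trim the merged events after the transaction, something like this (ColumnA and ColumnB are placeholders for whichever columns you pick):

| inputcsv File1.csv | append [| inputcsv File2.csv] | transaction NumberID RegID | table NumberID RegID ColumnA ColumnB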


changux
Builder

Thanks so much! Good recommendation.


lguinn2
Legend

First: why do you need to "correlate all files to build a big file with relevant information of each file"? And what do you mean by that? In Splunk, you can search across multiple inputs and combine them as you search - you don't normally do this as you ingest the data. Also, you could do it differently for different searches/reports, depending on what you need for each one.

How can you tell past data from current data? Is there a timestamp? All events in Splunk must have a timestamp - if no other timestamp is provided, Splunk uses the time when the data was indexed. So you can probably just search recent data. You can also decide how to age-out data from your indexes, but that's a topic for another post, when you know more about Splunk.
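For example, once your events have a usable timestamp (even if it is just the index time), restricting a search to recent data is as simple as adding a time modifier; the index name here is hypothetical:

index=csv_data earliest=-7d@d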

Also, if the data is static and not time-based - and you don't care about past values - you could create lookup files instead of indexing the data. Or you might index some of the data and put the rest in lookup files.
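As a sketch of that mixed approach (the sourcetype and column name are hypothetical, and this assumes File3.csv has been uploaded as a lookup table file): index the rows of File2 and enrich them at search time from a File3 lookup:

index=csv_data sourcetype=file2 | lookup File3.csv RegID OUTPUT StatusColumn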

The best option for correlation depends on the searches/reports that you want, and how you have chosen to ingest the data. The community needs a lot more information to answer this.

Finally, I think that you would benefit greatly from going through the Splunk Tutorial. You can even get a free Splunk Sandbox to play with, which has the tutorial data in it already. The sandbox is good for 14 days.
