I have a csv file that we're getting from an ALU application that is proving incredibly difficult to work with. This csv file represents metrics collected from various pieces of the system and application all melded into a single file. Each "group" or metrics collection within the file has a differing number of fields and the fields that are there are not always represented in the same order. In general, it has seemed impossible to work with the entire file so I've had to parse it out into multiple individual csv files representing each group. However, even at the group level there are inconsistencies.
Here's a sample of the parsed csv data for one group:
Start_Time_In_MS=1468990800004,Start_Time_Local=Wed_Jul_20_01:00:00_EDT_2016,End_Time_In_MS=1468991700003,End_Time_Local=Wed_Jul_20_01:15:00_EDT_2016,Site=tmpafl,Group=Diameter,Application=Proprietary,Command=PAR,Destination_Host=csb.tmpaflpcrf.prod,Destination_Realm=tmpaflpcrf.prod,Egress_Peer_Origin_Host=csb.tmpaflpcrf.prod,Egress_Peer_Origin_Realm=tmpaflpcrf.prod,Origin_Host=pcrf1.tmpaflpcrf.prod,Origin_Realm=tmpaflpcrf.prod,Result=DIAMETER_SUCCESS,Role=Client,Count=2,Rate=0.002222224691360768,Average_Latency=14.5,
Start_Time_In_MS=1469078100003,Start_Time_Local=Thu_Jul_21_01:15:00_EDT_2016,End_Time_In_MS=1469079000004,End_Time_Local=Thu_Jul_21_01:30:00_EDT_2016,Site=tmpafl,Group=Diameter,Application=NS,Command=PNR,Destination_Host=csb.tmpaflpcrf.prod,Destination_Realm=tmpaflpcrf.prod,Egress_Peer_Origin_Host=csb.tmpaflpcrf.prod,Egress_Peer_Origin_Realm=tmpaflpcrf.prod,Origin_Host=pcrf1.tmpaflpcrf.prod,Origin_Realm=tmpaflpcrf.prod,Result=DIAMETER_SUCCESS,Role=Client,Count=2,Rate=0.002222219753089163,Average_Latency=28.5,
Start_Time_In_MS=1469078100003,Start_Time_Local=Thu_Jul_21_01:15:00_EDT_2016,End_Time_In_MS=1469079000004,End_Time_Local=Thu_Jul_21_01:30:00_EDT_2016,Site=tmpafl,Group=Diameter,Application=GS,Command=PAR,Destination_Host=csb.tmpaflpcrf.prod,Destination_Realm=tmpaflpcrf.prod,Egress_Peer_Origin_Host=csb.tmpaflpcrf.prod,Egress_Peer_Origin_Realm=tmpaflpcrf.prod,Origin_Host=pcrf1.tmpaflpcrf.prod,Origin_Realm=tmpaflpcrf.prod,Result=DIAMETER_SUCCESS,Role=Client,Count=2,Rate=0.002222219753089163,Average_Latency=18.5,
The main problem I'm having is that despite having no header file, Splunk is insisting on trying to represent the first line as the header. Even if I don't attempt to use the INDEXED_EXTRACTION=CSV option and instead use default settings with manual configurations, Splunk still identifies fields improperly. As as example from the above data, there should be a field called "Application" with three resulting values - proprietary, NS, and GS. However, Splunk returns the field as "Application_Proprietary" with values of "Application=Proprietary, Application=NS, Application=GS". I have a feeling that if I can make it work on one field, I can represent that across all of the remaining fields to finally make this ingest work properly. I've tried various options inside of props.conf to try to get it to identify properly but I've had no luck. Any help would be greatly appreciated!
Try this in your props
[your sourcetype]
BREAK_ONLY_BEFORE = (Start)
DATETIME_CONFIG =
KV_MODE = auto
MAX_TIMESTAMP_LOOKAHEAD = 25
NO_BINARY_CHECK = true
TIME_FORMAT = %a_%b_%d_%H:%M:%S
TIME_PREFIX = Start_Time_Local=
category = Custom
pulldown_type = true
That looks like it might have fixed the field issue but definitely caused problems with line breaking. It's now pulling all of the individual lines into a single event. I tried removing the BREAK_ONLY_BEFORE statement and replacing it with SHOULD_LINEMERGE = false but still not breaking properly. Here's what I've got right now in props.conf:
[diameter_metrics]
SHOULD_LINEMERGE = false
KV_MODE=auto
MAX_TIMESTAMP_LOOKAHEAD=48
NO_BINARY_CHECK=true
TIME_PREFIX=Start_Time_Local=
TIME_FORMAT=%a_%b_%e_%H:%M:%S