Getting Data In

How to get splunk to ignore the second line of a log file

capilarity
Path Finder

Our call manager spits out hundreds and hundreds of small log files, each containing a header line and a second line of garbage.

Line 1: Header
"cdrRecordType","globalCallID_callManagerId","globalCallID_callId","origLegCallIdentifier","dateTimeOrigination","origNodeId","origSpan","origIpAddr","callingPartyNumber","callingPartyUnicodeLoginUserID","origCause_location","origCause_value","origPrecedenceLevel","origMediaTransportAddress_IP","origMediaTransportAddress_Port","origMediaCap_payloadCapability","origMediaCap_maxFramesPerPacket","origMediaCap_g723BitRate","origVideoCap_Codec","origVideoCap_Bandwidth","origVideoCap_Resolution","origVideoTransportAddress_IP","origVideoTransportAddress_Port","origRSVPAudioStat","origRSVPVideoStat","destLegIdentifier","destNodeId","destSpan","destIpAddr","originalCalledPartyNumber","finalCalledPartyNumber","finalCalledPartyUnicodeLoginUserID","destCause_location","destCause_value","destPrecedenceLevel","destMediaTransportAddress_IP","destMediaTransportAddress_Port","destMediaCap_payloadCapability","destMediaCap_maxFramesPerPacket","destMediaCap_g723BitRate","destVideoCap_Codec","destVideoCap_Bandwidth","destVideoCap_Resolution","destVideoTransportAddress_IP","destVideoTransportAddress_Port","destRSVPAudioStat","destRSVPVideoStat","dateTimeConnect","dateTimeDisconnect","lastRedirectDn","pkid","originalCalledPartyNumberPartition","callingPartyNumberPartition","finalCalledPartyNumberPartition","lastRedirectDnPartition","duration","origDeviceName","destDeviceName","origCallTerminationOnBehalfOf","destCallTerminationOnBehalfOf","origCalledPartyRedirectOnBehalfOf","lastRedirectRedirectOnBehalfOf","origCalledPartyRedirectReason","lastRedirectRedirectReason","destConversationId","globalCallId_ClusterID","joinOnBehalfOf","comment","authCodeDescription","authorizationLevel","clientMatterCode","origDTMFMethod","destDTMFMethod","callSecuredStatus","origConversationId","origMediaCap_Bandwidth","destMediaCap_Bandwidth","authorizationCodeValue","outpulsedCallingPartyNumber","outpulsedCalledPartyNumber","origIpv4v6Addr","destIpv4v6Addr","origVideoCap_Codec_Channel2","origVideoCap_Bandwidth_Channel2","origVideoCap_Resolution_Channel2","origVideoTransportAddress_IP_Channel2","origVideoTransportAddress_Port_Channel2","origVideoChannel_Role_Channel2","destVideoCap_Codec_Channel2","destVideoCap_Bandwidth_Channel2","destVideoCap_Resolution_Channel2","destVideoTransportAddress_IP_Channel2","destVideoTransportAddress_Port_Channel2","destVideoChannel_Role_Channel2","IncomingProtocolID","IncomingProtocolCallRef","OutgoingProtocolID","OutgoingProtocolCallRef","currentRoutingReason","origRoutingReason","lastRedirectingRoutingReason","huntPilotPartition","huntPilotDN","calledPartyPatternUsage","IncomingICID","IncomingOrigIOI","IncomingTermIOI","OutgoingICID","OutgoingOrigIOI","OutgoingTermIOI","outpulsedOriginalCalledPartyNumber","outpulsedLastRedirectingNumber"

Line 2: Garbage

INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(50),VARCHAR(128),INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(64),VARCHAR(64),INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(50),VARCHAR(50),VARCHAR(128),INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(64),VARCHAR(64),INTEGER,INTEGER,VARCHAR(50),UNIQUEIDENTIFIER,VARCHAR(50),VARCHAR(50),VARCHAR(50),VARCHAR(50),INTEGER,VARCHAR(129),VARCHAR(129),INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(50),INTEGER,VARCHAR(2048),VARCHAR(50),INTEGER,VARCHAR(32),INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(32),VARCHAR(50),VARCHAR(50),VARCHAR(64),VARCHAR(64),INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(32),INTEGER,VARCHAR(32),INTEGER,INTEGER,INTEGER,VARCHAR(50),VARCHAR(50),INTEGER,VARCHAR(50),VARCHAR(50),VARCHAR(50),VARCHAR(50),VARCHAR(50),VARCHAR(50),VARCHAR(50),VARCHAR(50)

Lines 3 onwards contain the useful data:
1,3,1675985,58571830,1421194778,3,0,1210359306,"7435220","USERNAME",0,16,4,1210359306,20952,4,20,0,0,0,0,0,0,"0","0",58571831,2,22,-399533046,"926011605","926011605","",0,0,4,-399533046,18484,4,20,0,0,0,0,0,0,"0","0",1421194792,1421195072,"926011605","a7220bcb-ff5c-4910-8c2a-ac653d828890","P_LOC_PSTN","P_AS_INTERNAL_MIGRATED","P_LOC_PSTN","P_LOC_PSTN",280,"IPPHONE","S0/SU0/DS1-1@LOC_GW_SRST_2",12,0,0,0,0,0,0,"Region",0,"","",0,"",3,1,0,0,64,64,"","29015220","26011605","ipaddress","ipaddress",0,0,0,0,0,0,0,0,0,0,0,0,3,"00000000001992D1037DBC3600000000",4,"6044-29015220-26011605",0,0,0,"","",5,"","","","","","","",""
1,3,1675987,58571834,1421195075,3,0,1210097162,"7435904","USERNAME",0,16,4,0,0,0,0,0,0,0,0,0,0,"0","0",58571835,0,0,0,"","","",0,0,4,0,0,0,0,0,0,0,0,0,0,"0","0",0,1421195075,"","209334c9-b9a6-4ad8-b592-1c2072ee874f","","P_AS_INTERNAL_MIGRATED","","",0,"IPPHONE","",12,0,0,0,0,0,0,"Region",0,"","",0,"",3,0,0,0,0,0,"","","","ipaddress","",0,0,0,0,0,0,0,0,0,0,0,0,0,"",0,"",0,0,0,"","",2,"","","","","","","",""

The problem is that I can't get Splunk to ignore line 2, so it doesn't extract the headers properly. At the moment all lines are added to the index with no KV pairs.

I've tried numerous ways of excluding line 2, even excluding the headers altogether, but to no avail.

Any ideas? Running 6.2.1 on Windows.

(Editing each file to delete the line is not an option, as there are too many and the same files are used elsewhere.)

claudio_manig
Communicator

Hi folks,

I have exactly the same situation as capilarity had, and I think this should be a perfect scenario for PREAMBLE_REGEX, but it does not work.

Did anyone ever actually get this scenario to work with PREAMBLE_REGEX? I have the feeling that it never works as documented. I'm not a big fan of null queues, because the indexer has to process the data; we don't want to put unnecessary load on the indexers, nor send the data from the UF at all, since we obviously don't need those events. I know this can be done on an HF as well, but there's not always one in place.

gjanders
SplunkTrust

Just to confirm, you're following this rule?

Splunk props.conf

  • This feature and all of its settings apply at input time, when data is
    first read by Splunk. The setting is used on a Splunk system that has
    configured inputs acquiring the data.

In other words the PREAMBLE_REGEX must be on the universal forwarder where the inputs.conf is...

Also it might be worth posting a new question since this one goes back to 2015...also refer to the caveats here (Caveats to extracting fields from structured data files) if you do use this property...

claudio_manig
Communicator

Yes, I did; I tested this on a standalone machine.

gjanders
SplunkTrust

While in theory you could have PREAMBLE_REGEX ignore the first two lines (I haven't tested it but you could make a regex that matches both), my personal opinion is that unless you need structured data extraction, use a LINE_BREAKER to drop the non-required lines (if they are at the start) and split the events using this option...

You might need a new question if you need help with that...

claudio_manig
Communicator

Well, line breaking is also done on the indexer/HF. Using a LINE_BREAKER that captures the unneeded lines in its pattern and gets rid of them that way is basically the same as using a transform and a null queue in terms of processing cost, no?

gjanders
SplunkTrust

LINE_BREAKING is indexer/HF but it's more efficient than not specifying a LINE_BREAKER...
If you use the structured extraction method the processing happens more on the universal forwarder, but the more important question is, why are you trying to avoid indexer/HF load here?

The monitoring console has pages about the % CPU used by different processes, previously I found the index time processing (when LINE_BREAKER was in use and minimal HF layer), was around 1-3% of the CPU time, the 80+% was search related CPU usage!

Personally I would use the LINE_BREAKER; in my opinion it is more efficient than the nullQueue, as this is all done in the parsing queue. Furthermore, it removes the overhead of aggregating the events using other parameters.
I've avoided the structured field parsing, but this is all your choice here.

Also I'd carefully try to measure how much CPU savings you might get by moving this off the HF/indexer, I suspect it is very minimal but always happy to see evidence otherwise!

claudio_manig
Communicator

I don't want the indexers to do more work than they have to. That's just based on personal experience with heavy-load environments; philosophy-wise, if you know what I mean.

A LINE_BREAKER is set already, but I feel it could be difficult to handle the example above, as the line we want to get rid of also matches the line breaker of all the other lines. I tried something like

([\r\n]+|[\r\n]+INTEGER.*$)\d

But it did not work.

Well, I think we got a bit off topic here; however, I appreciate your input on that. What bothers me in the end is that there's a setting which never works for me and, as it seems, for other Splunkers as well.
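
One way to reason about that LINE_BREAKER attempt outside Splunk is to emulate the documented rule: the text matched by the first capture group is discarded and the event boundary falls there. The sketch below does this with Python's re module, which is close to but not identical to the PCRE engine Splunk uses, so treat the result as a hint rather than proof. The sample lines are shortened, split_events is a hypothetical helper, and the candidate pattern is an assumption that has not been tested in Splunk.

```python
import re

def split_events(raw, line_breaker):
    """Emulate LINE_BREAKER: drop the text matched by capture group 1
    and start a new event right after it."""
    events, start = [], 0
    for match in re.finditer(line_breaker, raw):
        events.append(raw[start:match.start(1)])
        start = match.end(1)
    events.append(raw[start:])
    return events

# Shortened stand-ins for the header, the type line and two data lines.
header = '"cdrRecordType","pkid"'
type_line = 'INTEGER,UNIQUEIDENTIFIER'
data_1 = '1,"a7220bcb"'
data_2 = '1,"209334c9"'
raw = "\n".join([header, type_line, data_1, data_2])

# Pattern from the post above. The second alternative ends in "$" and is
# then followed by "\d", so in this emulation it can never match.
attempt = r'([\r\n]+|[\r\n]+INTEGER.*$)\d'

# Assumed alternative: optionally swallow the whole type line inside group 1.
candidate = r'([\r\n]+(?:INTEGER[^\r\n]*[\r\n]+)?)\d'

events_attempt = split_events(raw, attempt)
events_candidate = split_events(raw, candidate)
print(events_attempt)
print(events_candidate)
```

In this emulation the original pattern leaves the type line attached to the header event, while the candidate pattern removes it. Whether Splunk behaves the same way depends on the other props.conf settings in play (for example SHOULD_LINEMERGE and any structured-data extraction), so it is worth testing on a scratch index first.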

krusty
Contributor

It's me again.

I was able to find the problem and solve it.
First I just did a search on the _internal index for my configured source.
There I found this entry:

ERROR TailingProcessor - Ignoring path due to: File will not be read, seekptr checksum did not match (file=/myfile/test.out). Last time we saw this initcrc, the filename was different. You may wish to use a CRC salt on this source. Consult the documentation or file a support case online at http://www.splunk.com/page/submit_issue for more info.

After I changed inputs.conf to this,

[monitor:///splunk/ftp/cisco/cdr_*]
disabled = false
followTail = 0
host = cucm
sourcetype = csv
index = voice
recursive = false
followSymlink = false
crcSalt = <SOURCE>

I was able to index the files into splunk. yeeeeha!!!
So the important thing which was missing is crcSalt = <SOURCE>.
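
A rough illustration of why the salt matters here: by default the file monitor fingerprints only the first 256 bytes of a file (the initCrcLength setting), and every CDR file starts with the same header, which is far longer than that. The sketch below uses zlib.crc32 as a stand-in for Splunk's internal checksum, which is an assumption, as are the generated field names and the file names cdr_0001 and cdr_0002. It only shows the principle that identical leading bytes give identical fingerprints unless something file-specific, such as the source path, is mixed in.

```python
import zlib

INIT_CRC_LENGTH = 256  # default number of leading bytes checked (initCrcLength)

def initial_crc(content, salt=""):
    """Stand-in fingerprint: CRC32 over an optional salt plus the first bytes."""
    return zlib.crc32(salt.encode("utf-8") + content[:INIT_CRC_LENGTH])

# Every file starts with the same long header (placeholder field names).
header = (",".join('"field%d"' % i for i in range(60)) + "\n").encode("utf-8")
file_a = header + b"INTEGER,INTEGER\n1,3,1675985\n"
file_b = header + b"INTEGER,INTEGER\n1,3,1675987\n"

# Without a salt the two files look identical to the fingerprint.
same_without_salt = initial_crc(file_a) == initial_crc(file_b)

# With the source path mixed in, the fingerprints differ.
same_with_salt = (initial_crc(file_a, "/splunk/ftp/cisco/cdr_0001")
                  == initial_crc(file_b, "/splunk/ftp/cisco/cdr_0002"))

print(len(header) > INIT_CRC_LENGTH, same_without_salt, same_with_salt)
```

One caveat with crcSalt = <SOURCE>: because the path becomes part of the fingerprint, a file that is renamed or rotated is treated as new and read again.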

By the way, I changed the REGEX back to INTEGER. This works too.

Thanks everybody for help.

charlescabico
New Member

Hi Krusty,

I followed your recommended solution but it doesn't seem to work in my case.
Logs seem to indicate the configured transforms are processed as expected, but the 2nd line still shows up on the cucm index search.

inputs.conf
[monitor://C:\FTP]
disabled = false
index = cucm
sourcetype = csv
recursive = false
followTail = 0
followSymlink = false
crcSalt =

transforms.conf
[source::...\cdr_*]
TRANSFORMS-cdr_discard = eliminate_line_cdr

[source::...\cmr_*]
TRANSFORMS-cmr_discard = eliminate_line_cdr

props.conf
[source::...\cdr_*]
TRANSFORMS-cdr_discard = eliminate_line_cdr

[source::...\cmr_*]
TRANSFORMS-cmr_discard = eliminate_line_cdr

splunkd.log
03-24-2017 11:11:06.206 +1000 DEBUG PropertiesMapConfig - Performing pattern matching for: source::C:\FTP\cmr_StandAloneCluster_01_201703240110_466484|host::AU-BNE-SVR-SPK01|csv|2884
03-24-2017 11:11:06.206 +1000 DEBUG PropertiesMapConfig - Pattern 'source::...\cmr_' matches with lowest priority
03-24-2017 11:11:06.206 +1000 DEBUG PropertiesMapConfig - Pattern 'csv' matches with priority 100
03-24-2017 11:11:06.206 +1000 DEBUG PropertiesMapConfig - Performing pattern matching for: source::C:\FTP\cmr_StandAloneCluster_01_201703240110_466484|host::AU-BNE-SVR-SPK01|csv|2884
03-24-2017 11:11:06.207 +1000 DEBUG PropertiesMapConfig - Pattern 'source::...\cmr_' matches with lowest priority
03-24-2017 11:11:06.207 +1000 DEBUG PropertiesMapConfig - Pattern 'csv' matches with priority 100
03-24-2017 11:11:06.207 +1000 DEBUG regexExtractionProcessor - RegexExtractor: Instance found for eliminate_line_cdr
03-24-2017 11:11:06.207 +1000 DEBUG regexExtractionProcessor - RegexExtractor: Interpolated to nullQueue
03-24-2017 11:11:06.207 +1000 DEBUG regexExtractionProcessor - RegexExtractor: Extracted nullQueue

krusty
Contributor

Hi,

I'm in the same situation. I have to "remove"/"delete"/"ignore" the second line of the csv file.
My configuration looks like this:

inputs.conf

[Monitor://splunk/ftp/cisco/cdr_*]
disabled = false
followTail = 0
host = <hostname>
sourcetype = csv
index = voice
recursive = false
followSymlink = false

transforms.conf

[eliminate_line_cdr]
REGEX=^INTEGER.*
DEST_KEY = Queue
FORMAT = nullQueue

props.conf

[source::/splunk/ftp/cisco/cdr_*]
TRANSFORMS-cisco_voice_cdr = eliminate_line_cdr

Could somebody tell me where I'm making a mistake?
It looks good to me, but it does not work.

FYI: the configuration files are located on the splunk indexer. The same for the cdr_* files.

Any idea would be helpful.

Thanks

DalJeanis
Legend

That looks good to me. Have you restarted the indexer since you made the change?

You could also try this -

 REGEX=^(INTEGER|UNIQUEIDENTIFIER|VARCHAR(?>\(\d+\))?|(?>,)?)+

That assumes that your second header has only INTEGER, VARCHAR and UNIQUEIDENTIFIER field types.

krusty
Contributor

Hi DalJeanis,
thanks for your answer. Unfortunately it does not work with your REGEX either.

If I manually remove the line which starts with INTEGER,... the file will be indexed. But this is not the goal for me.

somesoni2
Revered Legend

A few questions: the props.conf and transforms.conf are placed on the indexers/heavy forwarders, right (and they were restarted after making the change)? Also, it seems there is a typo in the DEST_KEY attribute; it should be all lowercase: queue. Also, can you try this

[eliminate_line_cdr]
REGEX=,INTEGER,INTEGER
DEST_KEY = queue
FORMAT = nullQueue

krusty
Contributor

Hi somesoni2,

yes, I placed the props.conf and transforms.conf on the indexer. We only have a universal forwarder running for Windows events, but in this case it is not used. For the specific CDR logs I configured our voice environment to send the data by SFTP to our Splunk indexer.

Yes, I always restart the Splunk service by typing service splunk restart on the command line. 😉 I didn't see any errors regarding misconfiguration of *.conf files, so I guess that my configuration is fine.

I tried your REGEX and a DEST_KEY in lowercase, but it didn't solve the problem. If I place a new file into the folder which is monitored by Splunk, Splunk does nothing.
Do you have any idea how I can debug it?

Thanks

asaste
Path Finder

Hi Matt,
I have exactly the same requirement as you, and I am also not able to ignore the second line. Did you find a solution for this? If yes, can you please share it?
Thanks in Advance.

gjanders
SplunkTrust

Sending data to the null queue as per one of the previous answers should work, you would need to ensure this configuration is on the first heavy forwarder or indexer that works with the data.

The null queue is officially documented here http://docs.splunk.com/Documentation/Splunk/6.4.3/Forwarding/Routeandfilterdatad and there are some examples on Splunk Answers...

chanfoli
Builder

I have not experimented with PREAMBLE_REGEX so I am not sure it will work. Have you tried something like:

INDEXED_EXTRACTIONS = CSV
HEADER_FIELD_LINE_NUMBER = 1
PREAMBLE_REGEX = ^INTEGER

Failing this you could try a null-queue transform:

In props.conf:

[your source or sourcetype spec]
TRANSFORMS-null = discardit

and in transforms.conf:

[discardit]
REGEX=^INTEGER
DEST_KEY = queue
FORMAT = nullQueue
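
Before restarting anything, it can help to rule out the regex itself as the culprit. The sketch below checks the REGEX values suggested in this thread against shortened copies of the three line types from the question. It uses Python's re module rather than Splunk's PCRE engine, which is an assumption that holds for simple patterns like these, and the helper name would_discard is hypothetical.

```python
import re

# Shortened copies of the three line types from the question.
header_line = '"cdrRecordType","globalCallID_callManagerId","globalCallID_callId"'
type_line = 'INTEGER,INTEGER,INTEGER,INTEGER,VARCHAR(50),UNIQUEIDENTIFIER'
data_line = '1,3,1675985,58571830,1421194778,3,0,1210359306,"7435220","USERNAME"'

# REGEX values suggested in this thread, keyed by the poster who suggested them.
candidates = {
    "chanfoli": r"^INTEGER",
    "somesoni2": r",INTEGER,INTEGER",
}

def would_discard(regex, line):
    """True if a nullQueue transform with this REGEX would match the line."""
    return re.search(regex, line) is not None

results = {
    name: [would_discard(regex, line)
           for line in (header_line, type_line, data_line)]
    for name, regex in candidates.items()
}
print(results)
```

Both patterns match only the type line here, so if the line still reaches the index, the cause is more likely where the configuration lives or which stanza it is attached to than the regex.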

andrewtrobec
Motivator

This has just worked for me, thanks!

capilarity
Path Finder

Tried the null-queue transformation; it had no effect. I thought it might have been my regex, but your version didn't work either.

I have also tried PREAMBLE_REGEX to delete both lines 1 and 2, but nothing gets indexed at all if I use this.

thanks for the help though

chanfoli
Builder

Sorry it did not work out. If you do not find an answer using the "automatic" CSV index time extractions, you might consider using a different approach, namely a search-time extraction using KV_MODE=none, and the DELIMS = "," and FIELDS= "field1, field2, field3..." with a REPORT-class extraction as described here:

http://docs.splunk.com/Documentation/Splunk/6.2.1/Knowledge/Createandmaintainsearch-timefieldextract...

and in this old-ish example question:
http://answers.splunk.com/answers/3006/best-way-to-have-the-splunk-indexers-handle-a-csv-log-file.ht...
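
To illustrate what a delimiter-based extraction of this kind amounts to, the sketch below assigns field names to values purely by position, outside Splunk, using Python's csv module. It is only an analogy for the DELIMS/FIELDS idea, not Splunk's implementation. It uses the first ten header names and the start of the first data line from the question, and extract_by_position is a hypothetical helper.

```python
import csv

# First ten field names from the header line in the question.
FIELDS = [
    "cdrRecordType", "globalCallID_callManagerId", "globalCallID_callId",
    "origLegCallIdentifier", "dateTimeOrigination", "origNodeId",
    "origSpan", "origIpAddr", "callingPartyNumber",
    "callingPartyUnicodeLoginUserID",
]

# Start of the first data line from the question.
line = '1,3,1675985,58571830,1421194778,3,0,1210359306,"7435220","USERNAME"'

def extract_by_position(raw_line, field_names):
    """Split on commas (honouring double quotes) and pair values with names."""
    values = next(csv.reader([raw_line]))
    return dict(zip(field_names, values))

event = extract_by_position(line, FIELDS)
print(event["callingPartyNumber"], event["dateTimeOrigination"])
```

Because the names are supplied explicitly, this approach does not depend on the header or the type line being parsed at all, so those two lines only need to be filtered out or ignored at search time.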
