Splunk Search

Problems indexing into zip files

tim9gray
Explorer

Hi All,

I am monitoring files that land in the same directory that I wish to be considered as different source types. The way
I want to distinguish them is with their names. There will be three different source types and they will be csv files.
The naming conventions will be time_*.csv, pulse_*.csv, and flow_*.csv.

I actually have this working using the following in inputs.conf:

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\pulse_*.csv]
sourcetype = DGC_PULSE
index=main
host_segment = 4
crcSalt = <SOURCE>

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\flow_*.csv]
sourcetype = DGC_FLOW
index=main
host_segment = 4
crcSalt = <SOURCE>

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\time_*.csv]
sourcetype = DGC_TIME
index=main
host_segment = 4
crcSalt = <SOURCE>

This works exactly as I want. The use of crcSalt turns out to be necessary as many of the files have meta information that
is identical and this forces the indexer to consider them all.

As I said, the above works fine as long as the files to be monitored are landed as .csv files. My requirements have changed
and I will now be landing *.zip files containing the desired .csv files.

It is not clear to me why, but Splunk is not indexing the zip files using the above configuration. Everything I read would seem
to indicate that it should index the zip files. Perhaps the monitor stanza is excluding the zip files - I haven't been able to figure
that one out.

I can say that if the monitor stanza is left open ([monitor://C:\tpg\leamcsv\dualgamma_logs\...\]), it will index the contents of the zip files, but that leaves me unable to distinguish
the different sourcetypes (at least not in the way that I was doing).

After doing some research I read that attempting to index multiple sourcetypes from a common directory could lead to inconsistent
results (I don't have that link handy at the moment). At any rate, the suggestion was to use a more open qualification, as I mentioned
in the previous paragraph, and assign the sourcetype on a per-event basis or in props.conf. I chose to do this in props.conf. I
am using the following configuration:

inputs.conf:

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\]
index=main
host_segment = 4
crcSalt = <SOURCE>

props.conf

[source::...\pulse_*\.csv]
sourcetype=DGC_PULSE

[source::...\flow_*\.csv]
sourcetype=DGC_FLOW

[source::...\time_*\.csv]
sourcetype=DGC_TIME

The problem I see now is that none of my expected sourcetypes are assigned. Instead, I get csv, csv1, csv2, etc. for sourcetypes.
I suspect the issue is with the regular expressions I have used in props.conf. From everything I have read, these look like they
are correct, but I haven't been able to figure out what I am missing.

Does anyone have any suggestions about my approach, and/or what might be wrong with my regular expressions?

Thanks

0 Karma

ShaneNewman
Motivator

Copy and paste these into the identified conf files. Then restart each instance they are deployed to. Be sure to change your time format in the props.conf.

inputs.conf

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\]
sourcetype = DGC_TIME
index=main
host_segment = 4
crcSalt = <SOURCE>

transforms.conf

[extract_pulse_sourcetype]
SOURCE_KEY = MetaData:Source
REGEX = pulse_.*\.csv
DEST_KEY = MetaData:Sourcetype
FORMAT =  sourcetype::DGC_PULSE

[extract_flow_sourcetype]
SOURCE_KEY = MetaData:Source
REGEX = flow_.*\.csv
DEST_KEY = MetaData:Sourcetype
FORMAT =  sourcetype::DGC_FLOW

props.conf

[DGC_TIME]
TRANSFORMS-transform_1 = extract_pulse_sourcetype
TRANSFORMS-transform_2 = extract_flow_sourcetype
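# Placeholder: replace timeformat with the strptime format of your timestamps
TIME_FORMAT = timeformat
# Placeholder: set this to either false or true, not the literal false|true
SHOULD_LINEMERGE = false|true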
0 Karma

ShaneNewman
Motivator

Did that work for you?

0 Karma

ShaneNewman
Motivator

How about...

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\pulse_*]
sourcetype = DGC_PULSE
index=main
host_segment = 4
crcSalt = <SOURCE>

That would work regardless of whether they are .zip or .csv.
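As a rough illustration of why dropping the extension from the pattern covers both cases, here is a minimal Python sketch. It uses Python's fnmatch as a stand-in for the monitor wildcard, which is an assumption (the two are not guaranteed to behave identically), and the file names are hypothetical, modeled on the naming convention in the question.

```python
from fnmatch import fnmatchcase

# Hypothetical file names modeled on the naming convention in the question.
names = [
    "pulse_DGC_DG14_23_2013_10_09_09_07_37.csv",
    "pulse_DGC_DG14_23_2013_10_09_09_07_37.zip",
    "flow_DGC_DG14_23_2013_10_09_09_07_37.zip",
]

def matching(pattern, candidates):
    """Return the candidates whose file name matches the glob pattern."""
    return [n for n in candidates if fnmatchcase(n, pattern)]

csv_only = matching("pulse_*.csv", names)  # pattern from the original stanza
any_ext = matching("pulse_*", names)       # pattern without the extension

print(csv_only)
print(any_ext)
```

This only helps if the zip files themselves follow the pulse_/flow_/time_ naming, which the question does not state.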

Are they being bundled inside of a single .zip?

If so:
inputs.conf

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\]
sourcetype = DGC_TIME
index=main
host_segment = 4
crcSalt = <SOURCE>

transforms.conf

[transform_name1]
SOURCE_KEY = MetaData:Source
REGEX = pulse_.*\.csv
DEST_KEY = MetaData:Sourcetype
FORMAT =  sourcetype::DGC_PULSE

[transform_name2]
SOURCE_KEY = MetaData:Source
REGEX = flow_.*\.csv
DEST_KEY = MetaData:Sourcetype
FORMAT =  sourcetype::DGC_FLOW

props.conf

[DGC_TIME]
TRANSFORMS-transform_name = transform_name1, transform_name2
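# Placeholder: replace timeformat with the strptime format of your timestamps
TIME_FORMAT = timeformat
# Placeholder: set this to either false or true, not the literal false|true
SHOULD_LINEMERGE = false|true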
0 Karma

ShaneNewman
Motivator

I will post a new answer in the answer field so I can get the code bit to work.

0 Karma

tim9gray
Explorer

Looks like I had the same problem you had: the last one should have been "pulse_.*\.csv", as you put in your last comment.

0 Karma

tim9gray
Explorer

the source names look something like this:
time_DGC_DG14_23_2013_10_09_09_07_37.csv

so are you saying the regex ought to look something like this:
pulse_.csv or pulse_..csv or pulse_.\csv? None of those seem obvious to me.

0 Karma

ShaneNewman
Motivator

iPad isn't letting me select code "_.*\.csv"
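To show the difference between the two spellings, here is a small Python check. It assumes Python's re module behaves like Splunk's regex engine for patterns this simple, and the pulse_ source name is hypothetical, modeled on the time_ example given in this thread.

```python
import re

# Hypothetical source name, modeled on the time_ example given in this thread.
source = "pulse_DGC_DG14_23_2013_10_09_09_07_37.csv"

# "_*" means "zero or more underscores", so ".csv" would have to follow
# the underscores directly. That never happens with these file names.
underscore_star = re.search(r"pulse_*\.csv", source)

# "_.*" means "an underscore, then any run of characters", which is the intent.
dot_star = re.search(r"pulse_.*\.csv", source)

print(underscore_star is not None)
print(dot_star is not None)
```

The first pattern would only match names such as pulse_.csv or pulse___.csv, which is why the sourcetype was never overridden.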

0 Karma

ShaneNewman
Motivator

Ah, also the last bit goes in the props.conf.

What we are doing is saying that, by default, all data from the inputs path is to be known as source type DGC_TIME. Then in props.conf (by way of transforms.conf) we say that if the source matches the pulse_ regex, its source type should be DGC_PULSE, and if it matches the flow_ regex, its source type should be DGC_FLOW.

And I just noticed I did not escape the . So, replace _.csv in regex with _..csv
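The default-then-override logic described in this comment can be sketched in Python. This is only a simulation of the idea, not Splunk code: resolve_sourcetype and the sample file names are hypothetical, and the regexes are the "_.*\.csv" form given elsewhere in this thread.

```python
import re

# Default sourcetype, as set on the monitor stanza in inputs.conf.
DEFAULT_SOURCETYPE = "DGC_TIME"

# (regex applied to the source, sourcetype to assign), mirroring the two transforms.
TRANSFORMS = [
    (re.compile(r"pulse_.*\.csv"), "DGC_PULSE"),
    (re.compile(r"flow_.*\.csv"), "DGC_FLOW"),
]

def resolve_sourcetype(source):
    """Start from the default, then let each matching transform overwrite it."""
    sourcetype = DEFAULT_SOURCETYPE
    for pattern, override in TRANSFORMS:
        if pattern.search(source):
            sourcetype = override
    return sourcetype

# Hypothetical file names following the naming convention in the question.
for name in ("pulse_a.csv", "flow_b.csv", "time_c.csv"):
    print(name, "->", resolve_sourcetype(name))
```

Anything that matches neither regex, such as the time_ files, simply keeps the default, so no third transform is needed.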

0 Karma

ShaneNewman
Motivator

Btw, you can replace transform_name 1,2 with anything you want, I was just using it as a filler name. Just make sure the names get put into the props.conf

0 Karma

ShaneNewman
Motivator

What are the source names?

0 Karma

tim9gray
Explorer

this is what I am using in transforms.conf:

[transform_name1]
SOURCE_KEY = MetaData:Source
REGEX = pulse_*.csv
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::DGC_PULSE

and this is what I am using in props.conf:

[DGC_PULSE]
TRANSFORMS-transform_name = transform_name1

I am not sure about this one. I am not sure about the mapping of the stanza name to sourcetype, although I must admit I haven't looked at the doc on this yet...

0 Karma

tim9gray
Explorer

Thanks for the input. I tried this and am still getting csv, csv_1, etc for sourcetype. I did splunk clean all on both my splunk instance and my universal forwarder.

I think I understand what you have suggested and it looks very similar to what I was initially trying. Is it substantially different?

I am guessing that it is still failing on the regexes being used.

0 Karma

ShaneNewman
Motivator

After you do this, you will need to go to yoursplunkurl:8000/info and click reload EAI Objects wherever these configs are deployed: UF (will need an instance restart), Indexer, etc.

You may even want to restart the instance just for good measure.

0 Karma