
Problems indexing into zip files

Explorer

Hi All,

I am monitoring files that all land in the same directory, but I want them to be indexed as different source types.
I want to distinguish them by file name. There will be three source types, and all of the files will be CSV files.
The naming conventions will be time_*.csv, pulse_*.csv, and flow_*.csv.

I actually have this working using the following in inputs.conf:

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\pulse_*.csv]
sourcetype = DGC_PULSE
index=main
host_segment = 4
crcSalt = <SOURCE>

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\flow_*.csv]
sourcetype = DGC_FLOW
index=main
host_segment = 4
crcSalt = <SOURCE>

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\time_*.csv]
sourcetype = DGC_TIME
index=main
host_segment = 4
crcSalt = <SOURCE>

This works exactly as I want. The use of crcSalt turns out to be necessary as many of the files have meta information that
is identical and this forces the indexer to consider them all.
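
For anyone wondering why the salt matters, here is a rough sketch of the idea. It assumes the monitor input decides whether a file is new from a checksum over the first 256 bytes, and that crcSalt = <SOURCE> mixes the full source path into that checksum. zlib.crc32 is only a stand-in for whatever Splunk computes internally, and the paths and file contents below are made up for illustration.

```python
import zlib


def head_checksum(data: bytes, salt: str = "", length: int = 256) -> int:
    """Checksum of the first `length` bytes, optionally salted (illustrative only)."""
    return zlib.crc32(salt.encode("utf-8") + data[:length])


# Two made-up files whose first 256 bytes are the same header block.
header = b"timestamp,channel,value\n" * 20  # 480 bytes of identical meta information
file_a = header + b"1,1,1\n"
file_b = header + b"2,2,2\n"

# Hypothetical source paths, not the real ones from this thread.
path_a = r"C:\logs\pulse_a.csv"
path_b = r"C:\logs\pulse_b.csv"

unsalted_equal = head_checksum(file_a) == head_checksum(file_b)
salted_equal = head_checksum(file_a, path_a) == head_checksum(file_b, path_b)

print(unsalted_equal)  # True: the second file looks like one already seen
print(salted_equal)    # False: the path makes each checksum distinct
```

Without the salt the two files are indistinguishable to the checksum, which is why only one of them would be picked up.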

As I said, the above works fine as long as the files to be monitored are landed as .csv files. My requirements have changed
and I will now be landing *.zip files containing the desired .csv files.

It is not clear to me why, but Splunk is not indexing the zip files using the above configuration. Everything I have read seems
to indicate that it should index the zip files. Perhaps the monitor stanza is excluding the zip files, but I haven't been able to
confirm that.

I can say that if the monitor stanza is left open ([monitor://C:\tpg\leamcsv\dualgamma_logs\...\]), it will index the contents of the zip files, but that leaves me unable to distinguish
the different sourcetypes (at least not in the way that I was doing).

After doing some research I read that attempting to index multiple sourcetypes from a common directory could lead to inconsistent
results (I don't have that link handy at the moment). At any rate, the suggestion was to use a more open monitor path, as I mentioned
in the previous paragraph, and assign the sourcetype on a per-event basis or in props.conf. I chose to do this in props.conf. I
am using the following configuration:

inputs.conf:

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\]
index=main
host_segment = 4
crcSalt = <SOURCE>

props.conf

[source::...\pulse_*\.csv]
sourcetype=DGC_PULSE

[source::...\flow_*\.csv]
sourcetype=DGC_FLOW

[source::...\time_*\.csv]
sourcetype=DGC_TIME

The problem I see now is that none of my expected sourcetypes are assigned. Instead, I get csv, csv1, csv2, etc. for sourcetypes.
I suspect the issue is with the regular expressions I have used in props.conf. From everything I have read, these look
correct, but I haven't been able to figure out what I am missing.

Does anyone have any suggestions about my approach, and/or what might be wrong with my regular expressions?

Thanks


Re: Problems indexing into zip files

Motivator

How about...

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\pulse_*]
sourcetype = DGC_PULSE
index=main
host_segment = 4
crcSalt = <SOURCE>

That would work regardless of whether they are .zip or .csv.

Are they being bundled inside of a single .zip?

If so:
inputs.conf

[monitor://C:\tpg\leamcsv\dualgamma_logs\...\]
sourcetype = DGC_TIME
index=main
host_segment = 4
crcSalt = <SOURCE>

transforms.conf

[transform_name1]
SOURCE_KEY = MetaData:Source
REGEX = pulse_*\.csv
DEST_KEY = MetaData:Sourcetype
FORMAT =  sourcetype::DGC_PULSE

[transform_name2]
SOURCE_KEY = MetaData:Source
REGEX = flow_*\.csv
DEST_KEY = MetaData:Sourcetype
FORMAT =  sourcetype::DGC_FLOW

props.conf

[DGC_TIME]
TRANSFORMS-transform_name = transform_name1, transform_name2
TIME_FORMAT = timeformat
SHOULD_LINEMERGE = false|true

Re: Problems indexing into zip files

Motivator

After you do this, you will need to go to yoursplunkurl:8000/info and click reload EAI Objects wherever these configs are deployed: the UF (which will need an instance restart), the indexer, etc.

You may even want to restart the instance just for good measure.


Re: Problems indexing into zip files

Explorer

Thanks for the input. I tried this and am still getting csv, csv_1, etc. for sourcetype. I ran splunk clean all on both my Splunk instance and my universal forwarder.

I think I understand what you have suggested and it looks very similar to what I was initially trying. Is it substantially different?

I am guessing that it is still failing on the regexes being used.


Re: Problems indexing into zip files

Explorer

This is what I am using in transforms.conf:

[transform_name1]
SOURCE_KEY = MetaData:Source
REGEX = pulse_*.csv
DEST_KEY = MetaData:Sourcetype
FORMAT = sourcetype::DGC_PULSE

and this is what I am using in props.conf:

[DGC_PULSE]
TRANSFORMS-transform_name = transform_name1

I am not sure about this one. I don't understand how the stanza name maps to the sourcetype, although I must admit I haven't looked at the docs on this yet...


Re: Problems indexing into zip files

Motivator

What are the source names?


Re: Problems indexing into zip files

Motivator

Btw, you can replace transform_name1 and transform_name2 with anything you want; I was just using them as filler names. Just make sure the same names get put into props.conf.


Re: Problems indexing into zip files

Motivator

Ah, also the last bit goes in the props.conf.

What we are doing is saying that, by default, all data from the inputs path is to be known as source type DGC_TIME. Then in props.conf (by way of transforms.conf) we say that if the source matches pulse_*.csv its source type should be DGC_PULSE, and if it matches flow_*.csv its source type should be DGC_FLOW.

And I just noticed I got the wildcard wrong in the regex (a bare * only repeats the underscore before it). So, replace _*\.csv in the REGEX with _.*\.csv
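
The default-then-override logic described in this post can be sketched roughly as follows. This is only a Python illustration of the idea, not how Splunk evaluates transforms internally; the source paths are hypothetical, and the patterns use the _.*\.csv form of the regex.

```python
import re

# Sourcetype set in inputs.conf for everything under the monitored path.
DEFAULT_SOURCETYPE = "DGC_TIME"

# (regex on the source, sourcetype to assign), mirroring the two transforms.
TRANSFORMS = [
    (re.compile(r"pulse_.*\.csv"), "DGC_PULSE"),
    (re.compile(r"flow_.*\.csv"), "DGC_FLOW"),
]


def assign_sourcetype(source: str) -> str:
    """Start from the default and let any matching transform override it."""
    sourcetype = DEFAULT_SOURCETYPE
    for pattern, override in TRANSFORMS:
        if pattern.search(source):
            sourcetype = override
    return sourcetype


# Hypothetical source values for illustration.
print(assign_sourcetype(r"C:\logs\pulse_001.csv"))  # DGC_PULSE
print(assign_sourcetype(r"C:\logs\flow_001.csv"))   # DGC_FLOW
print(assign_sourcetype(r"C:\logs\time_001.csv"))   # DGC_TIME
```

Because the time files match neither pattern, they simply keep the default, which is why no third transform is needed.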


Re: Problems indexing into zip files

Motivator

iPad isn't letting me select code "_.*\.csv"
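
To see why the wildcard matters, here is a quick check of the two patterns using Python's re module. For patterns this simple it should behave the same as the regex engine Splunk uses, but treat that as an assumption; the source paths are made up.

```python
import re

# Hypothetical source values; the real paths in this thread will differ.
sources = [
    r"C:\logs\pulse_20131009.csv",
    r"C:\logs\flow_20131009.csv",
]

broken = re.compile(r"pulse_*\.csv")  # "_*" means zero or more underscores
fixed = re.compile(r"pulse_.*\.csv")  # ".*" means any run of characters

broken_hits = [bool(broken.search(s)) for s in sources]
fixed_hits = [bool(fixed.search(s)) for s in sources]

print(broken_hits)  # [False, False]
print(fixed_hits)   # [True, False]
```

The first pattern only matches names like pulse.csv or pulse__.csv, so it never fires on a file with anything else between the underscore and the extension.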


Re: Problems indexing into zip files

Explorer

the source names look something like this:
timeDGCDG1423201310090907_37.csv

so are you saying the regex ought to look something like this:
pulse.*csv or pulse..csv or pulse_.*\csv? None of those seem obvious to me.
