Getting Data In

Inconsistent number of events parsed with a custom sourcetype and the same source file

AccentureQBETA
Path Finder

Using Splunk version 4.3.3, build 128297
Using Windows Server 2008 Enterprise version 6 (Build 6002: Service Pack 2) - a Virtual Machine.

Why do I see a different number of events indexed (Event Count) in the UI at /en-GB/manager/launcher/data/indexes each time I add data to Splunk from a static file? I use the same file and a new index (created using the default settings) each time.

So far I have gotten these counts:

  • 13,281
  • 17,469
  • 16,273
  • 20,202

The source file, an Apache Tomcat server log, is 3,637,248 bytes on disk and has 21,319 lines. I've created a custom source type for it:

My props.conf:

[Apache-TomCat]
pulldown_type = true
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
REPORT-Apache-TomCat = Apache-TomCat
TRANSFORMS-comment = comment
LINE_BREAKER = ([\r\n]+)

My transforms.conf:

[comment]
REGEX = ^#
DEST_KEY = queue
FORMAT = nullQueue

[Apache-TomCat]
FIELDS="date", "time", "c-ip", "x-H(remoteUser)", "cs-method", "cs-uri", "sc-status", "time-taken", "x-H(requestedSessionId)", "x-P(inFrame)", "x-P(eventSource)", "x-P(eventParam)", "x-P(eventShift)", "x-P(rcounter)", "x-P(scrollPositions)", "x-P(objFocusId)", "x-P(__navigator_index)", "x-R(username)", "x-S(int_user_id)"
DELIMS = " "

I'm adding data to Splunk via the Splunk UI: I navigate to Manager > Data inputs > Add data > Files and directories > Add new, select "Upload and index a file", browse for the file (D:\NTPA1111_log_2012-07-30 - sample.txt), and add the following under More Settings:

  • Set Host: constant value
  • Host field value: NTXA1528
  • Set the source type: From List: Apache-TomCat
  • Set the destination index: test1

For testing, I created 6 more indexes and tried adding the file two more times with the current settings specified above:

  • 18921
  • 15590

I removed LINE_BREAKER = ([\r\n]+) from the local props.conf file and tried 2 more times:

  • 17,729
  • 18,803

I removed the [comment] stanza from the local transforms.conf file, removed TRANSFORMS-comment = comment from the local props.conf, and ran it 2 more times:

  • 15,244
  • 16,465

My results are still inconsistent 😞

I've just reinstalled Splunk, created the local transforms.conf and props.conf files (without the comment stanza and the LINE_BREAKER line), restarted Splunk, and then tried to index the file 3 more times:

  • 21321
  • 19,063
  • 18995

I'm really surprised this is happening. Any help or ideas would be greatly appreciated.

Example of the Log:

#Fields: date time c-ip x-H(remoteUser) cs-method cs-uri sc-status time-taken x-H(requestedSessionId) x-P(inFrame) x-P(eventSource) x-P(eventParam) x-P(eventShift) x-P(rcounter) x-P(scrollPositions) x-P(objFocusId) x-P(__navigator_index) x-R(username) x-S(int_user_id)
#Version: 2.0
#Software: Apache Tomcat/6.0.26
2012-07-30 07:00:01 255.255.255.255 - POST /Name/APP.do?ts=20383926 200 0.041 'F039AE0E56089412190ABAE26496B80E' - - - - - - - '0' - 'BBBBBB'
2012-07-30 07:00:01 255.255.255.255 - GET /Name/resources/Folder/images/image.gif 200 0.000 'F039AE0E56089412190ABEE26496B80E' - - - - - - - - - 'BBBBBB'
2012-07-30 07:00:05 255.255.255.255 - GET /Name/?internal=Y 401 0.001 - - - - - - - - - - -
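
As a rough illustration of what a space-delimited FIELDS/DELIMS extraction does with one of these lines, here is a minimal Python sketch. It is not how Splunk implements the extraction; `split_event` is a hypothetical helper, and the only inputs are the #Fields header and the first sample event from the log above.

```python
# Field names copied from the "#Fields:" header of the sample log.
FIELDS_HEADER = (
    "date time c-ip x-H(remoteUser) cs-method cs-uri sc-status time-taken "
    "x-H(requestedSessionId) x-P(inFrame) x-P(eventSource) x-P(eventParam) "
    "x-P(eventShift) x-P(rcounter) x-P(scrollPositions) x-P(objFocusId) "
    "x-P(__navigator_index) x-R(username) x-S(int_user_id)"
)


def split_event(line, header=FIELDS_HEADER):
    """Pair each space-delimited value with its field name (hypothetical helper)."""
    names = header.split(" ")
    values = line.split(" ")
    if len(names) != len(values):
        raise ValueError(
            "expected %d values, got %d" % (len(names), len(values))
        )
    return dict(zip(names, values))


# First event line from the sample log.
sample = (
    "2012-07-30 07:00:01 255.255.255.255 - POST /Name/APP.do?ts=20383926 "
    "200 0.041 'F039AE0E56089412190ABAE26496B80E' - - - - - - - '0' - 'BBBBBB'"
)

event = split_event(sample)
print(event["cs-method"], event["sc-status"], event["time-taken"])
```

A line whose value count differs from the header raises an error here, which is a cheap way to spot lines that would not split cleanly on a single space.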
0 Karma

lguinn2
Legend

How are you comparing the sizes? By looking at the Manager->Indexes page, or by running this command

index=* sourcetype=Apache-TomCat | stats count by index

And do you get the same answer both ways?

Did you consider using one of the built-in sourcetypes for Apache data - access_combined or access_combined_wcookie?

0 Karma

AccentureQBETA
Path Finder

Yes, I did consider access_combined, but it doesn't capture the fields I would like. I'm also unsure how that sourcetype will turn my logs into events, and whether we will be able to add any index-time or search-time field extractions with it. I'll try it today and see if it is any better.

0 Karma

AccentureQBETA
Path Finder

Hi lguinn, previously I was only looking at the Manager->Indexes page.

Now when I run this: index=cms_test_1 | stats count by index

I get this

index count

1 cms_test_1 20442

Notepad without word wrap shows I should get 20445 lines (so minus 3 for comments, and woohoo!)
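
The same "lines minus comments" check can be done outside Notepad. The following is a minimal Python sketch, assuming events are separated by runs of \r or \n (as in the LINE_BREAKER above) and that comment lines start with #. `expected_event_count` is a hypothetical helper, and the sample text is just the header plus three events from the log excerpt, shortened.

```python
import re

# Mirrors LINE_BREAKER = ([\r\n]+): any run of CR/LF separates events.
LINE_BREAK = re.compile(r"[\r\n]+")
# Mirrors the [comment] transform: a line that starts with '#'.
COMMENT = re.compile(r"^#")


def expected_event_count(text):
    """Count non-empty, non-comment lines (hypothetical helper)."""
    lines = [ln for ln in LINE_BREAK.split(text) if ln]
    return sum(1 for ln in lines if not COMMENT.search(ln))


sample_text = (
    "#Fields: date time c-ip\r\n"
    "#Version: 2.0\r\n"
    "#Software: Apache Tomcat/6.0.26\r\n"
    "2012-07-30 07:00:01 255.255.255.255 - POST /Name/APP.do?ts=20383926 200\r\n"
    "2012-07-30 07:00:01 255.255.255.255 - GET /Name/resources/image.gif 200\r\n"
    "2012-07-30 07:00:05 255.255.255.255 - GET /Name/?internal=Y 401\r\n"
)

print(expected_event_count(sample_text))
```

To run it against a real file, read the file in binary-safe text mode (for example `open(path, newline="")`) so that the original line endings reach the function unchanged.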

I tried it on 3 more files and it appears to not be working now...

Splunk Indexed:

File1 = 20442
File2 = 24350
File3 = 25425

Notepad shows:

file1 = 20442
file2 = 25467
file3 = 26540

Running index=cms_test_1 | stats count by index shows a total of 72449, all in one result, so the line breaking appears to be working.
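
For reference, here is the arithmetic on the figures posted above as a small Python sketch. It only restates the numbers from this post and is not a diagnosis; the dictionary names are made up for the example.

```python
# Per-file counts exactly as posted above.
indexed = {"file1": 20442, "file2": 24350, "file3": 25425}
notepad = {"file1": 20442, "file2": 25467, "file3": 26540}

# Lines Notepad shows that the per-file indexed count does not account for.
missing = {name: notepad[name] - indexed[name] for name in indexed}

total_indexed = sum(indexed.values())
total_notepad = sum(notepad.values())

print(missing)
print(total_indexed, total_notepad)
```

Note that the Notepad counts add up to 72449, the same total the search returns, while the per-file indexed figures add up to 70217.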

0 Karma

AccentureQBETA
Path Finder

OK 🙂 I've updated. Thanks. What about my main problem? Any ideas?

0 Karma

lguinn2
Legend

The circumflex is required to anchor the regular expression at the beginning of the line. Your regex will match comments - but it will also match other lines that have a #. If you are sure that no other events will have a # anywhere in the event, no worries.

I didn't think that # was a reserved character, but perhaps it is in some regex flavors. So maybe

REGEX = ^\#

is better and will work with RegExr.
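
To show the difference the anchor makes, here is a minimal sketch using Python's re module rather than Splunk's own regex engine (for a pattern this simple, anchoring behaves the same way). The comment line comes from the sample log; the event line with a # in its URI is hypothetical, made up for the illustration.

```python
import re

# A real comment line from the sample log.
comment_line = "#Version: 2.0"
# Hypothetical event whose URI happens to contain a '#'.
event_line = "2012-07-30 07:00:05 255.255.255.255 - GET /Name/page#top 401 0.001"


def matches(pattern, line):
    """True if the pattern is found anywhere in the line."""
    return re.search(pattern, line) is not None


# Compare the unanchored pattern with the two anchored variants.
for pattern in ("#", "^#", r"^\#"):
    print(pattern, matches(pattern, comment_line), matches(pattern, event_line))
```

The unanchored pattern matches both lines, so the event would be sent to the nullQueue along with the comment; both anchored variants match only the comment.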

0 Karma

AccentureQBETA
Path Finder

This is a static file.

Thanks for pointing out the field name problem; I've changed them now. After re-reading the transforms.conf doc, I realise that CLEAN_KEYS, which defaults to true, implicitly solved my problem with the field names. It probably has a performance impact, though.

Regarding the regex, I just checked your suggested syntax against what I was using in http://gskinner.com/RegExr/ and yours didn't highlight any comments beginning with #.

Splunk Team seem to suggest this tool too: http://wiki.splunk.com/Community:RegexTestingTools

How sure are you my regex is incorrect?

0 Karma

lguinn2
Legend

Is this a static file? Are more events being added to the file? What is the "linecount" of the file according to other tools?

Second, although you didn't ask, some of your field names are invalid in the Apache-TomCat stanza of the transforms.conf. Field names may contain only alphabetic characters, numbers and underscore; they must begin with an alphabetic character.
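
A quick way to see which names break that rule is the Python sketch below. It encodes only the rule as stated here (letters, digits, and underscore, starting with a letter) and makes no claim about how Splunk itself validates or cleans keys; `invalid_field_names` is a hypothetical helper.

```python
import re

# Rule as stated above: letters, digits, underscore; first character a letter.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Field names from the [Apache-TomCat] stanza in the question.
FIELD_NAMES = [
    "date", "time", "c-ip", "x-H(remoteUser)", "cs-method", "cs-uri",
    "sc-status", "time-taken", "x-H(requestedSessionId)", "x-P(inFrame)",
    "x-P(eventSource)", "x-P(eventParam)", "x-P(eventShift)",
    "x-P(rcounter)", "x-P(scrollPositions)", "x-P(objFocusId)",
    "x-P(__navigator_index)", "x-R(username)", "x-S(int_user_id)",
]


def invalid_field_names(names):
    """Return the names that do not satisfy the rule (hypothetical helper)."""
    return [name for name in names if VALID_NAME.fullmatch(name) is None]


bad = invalid_field_names(FIELD_NAMES)
print(len(bad), "of", len(FIELD_NAMES), "names are invalid")
```

Under this rule only date and time pass; every name containing a hyphen or parentheses fails.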

Finally, your comment regex should be
REGEX = ^#

You were not requiring that the line begin with a #!

0 Karma