Using Splunk version 4.3.3, build 128297
Using Windows Server 2008 Enterprise version 6 (Build 6002: Service Pack 2) - a Virtual Machine.
Why do I see a different number of events indexed (Event Count) at /en-GB/manager/launcher/data/indexes in the UI when I add data to Splunk from a static file, using the same file and a new index (created with the default settings) each time?
So far I have gotten these counts:
The source file, which is an Apache Tomcat server log, is 3,637,248 bytes on disk, with 21319 lines. I've created a custom source type for it:
[Apache-TomCat]
pulldown_type = true
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE = False
REPORT-Apache-TomCat = Apache-TomCat
TRANSFORMS-comment = comment
LINE_BREAKER = ([\r\n]+)
[comment]
REGEX = ^#
DEST_KEY = queue
FORMAT = nullQueue

[Apache-TomCat]
FIELDS = "date", "time", "c-ip", "x-H(remoteUser)", "cs-method", "cs-uri", "sc-status", "time-taken", "x-H(requestedSessionId)", "x-P(inFrame)", "x-P(eventSource)", "x-P(eventParam)", "x-P(eventShift)", "x-P(rcounter)", "x-P(scrollPositions)", "x-P(objFocusId)", "x-P(__navigator_index)", "x-R(username)", "x-S(int_user_id)"
DELIMS = " "
I'm adding data to Splunk via the Splunk UI: navigating to Manager > Data inputs > Add data > Files and directories > Add new, selecting "Upload and index a file", browsing for the file (D:\NTPA1111_log_2012-07-30 - sample.txt), and adding the below for More Settings:
For testing, I created 6 more indexes and tried adding the file two more times with the current settings specified above:
I removed LINE_BREAKER = ([\r\n]+) from the local props.conf file and tried 2 more times:
I removed the [comment] stanza from the local transforms.conf file, removed TRANSFORMS-comment = comment from the local props.conf, and ran it 2 more times:
Still, my results are inconsistent 😞
I've just reinstalled Splunk, created the local transforms.conf and props.conf files (without the comment stanza and the LINE_BREAKER line), restarted Splunk, and then tried to index the file 3 more times:
I'm really surprised this is happening. Any help or ideas would be greatly appreciated.
Example of the Log:
#Fields: date time c-ip x-H(remoteUser) cs-method cs-uri sc-status time-taken x-H(requestedSessionId) x-P(inFrame) x-P(eventSource) x-P(eventParam) x-P(eventShift) x-P(rcounter) x-P(scrollPositions) x-P(objFocusId) x-P(__navigator_index) x-R(username) x-S(int_user_id)
#Version: 2.0
#Software: Apache Tomcat/6.0.26
2012-07-30 07:00:01 255.255.255.255 - POST /Name/APP.do?ts=20383926 200 0.041 'F039AE0E56089412190ABAE26496B80E' - - - - - - - '0' - 'BBBBBB'
2012-07-30 07:00:01 255.255.255.255 - GET /Name/resources/Folder/images/image.gif 200 0.000 'F039AE0E56089412190ABEE26496B80E' - - - - - - - - - 'BBBBBB'
2012-07-30 07:00:05 255.255.255.255 - GET /Name/?internal=Y 401 0.001 - - - - - - - - - - -
How are you comparing the sizes? By looking at the Manager->Indexes page, or by running this command
index=* sourcetype=Apache-TomCat | stats count by index
And do you get the same answer both ways?
Did you consider using one of the built-in sourcetypes for Apache data - access_combined or access_combined_wcookie?
As for access_combined: yes, I considered it, but it doesn't capture the fields I would like. I'm also unsure how that sourcetype will turn my logs into events, and whether we will be able to add any index-time or search-time field extraction with it. I'll try using it today and see if it is any better.
Hi Iguinn, previously I was only looking at the Manager->Indexes page.
Now when I run this: index=cms_test_1 | stats count by index
I get this
1 cms_test_1 20442
Notepad without word wrap shows I should get 20445 lines (so minus 3 for the comments gives 20442, and woohoo!)
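For an independent check of the expected event count, a small script can count the lines that should survive indexing. This is a minimal sketch, assuming one event per non-empty line and that every line starting with # is dropped by the [comment] transform; `count_events` is a hypothetical helper name, and the sample lines are shortened from the log example in this thread.

```python
import io


def count_events(lines, comment_prefix="#"):
    """Count non-empty lines that do not start with the comment prefix."""
    total = 0
    for line in lines:
        stripped = line.rstrip("\r\n")
        if not stripped:
            # Blank lines produce no event when breaking on ([\r\n]+).
            continue
        if stripped.startswith(comment_prefix):
            # Header/comment lines are expected to go to the nullQueue.
            continue
        total += 1
    return total


# Shortened sample: three header lines followed by two events.
sample = io.StringIO(
    "#Fields: date time c-ip\n"
    "#Version: 2.0\n"
    "#Software: Apache Tomcat/6.0.26\n"
    "2012-07-30 07:00:01 255.255.255.255 - POST /Name/APP.do?ts=20383926 200 0.041\n"
    "2012-07-30 07:00:05 255.255.255.255 - GET /Name/?internal=Y 401 0.001\n"
)
print(count_events(sample))
```

To run it against the real file, pass an open file handle instead of the in-memory sample, then compare the result with the count returned by the stats search.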
I tried it on 3 more files and it appears to not be working now...
File1 = 20442
File2 = 24350
File3 = 25425
file1 = 20442
file2 = 25467
file3 = 26540
Running index=cms_test_1 | stats count by index shows a total of 72449, all in 1 result, so the line breaking appears to be working.
The circumflex is required to anchor the regular expression at the beginning of the line. Your regex will match comments - but it will also match other lines that have a #. If you are sure that no other events will have a # anywhere in the event, no worries.
I didn't think that # was a reserved character, but perhaps it is in some regex flavors. So maybe
REGEX = ^\#
is better and will work with RegExr
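The difference between the anchored and unanchored patterns can be checked outside Splunk. The sketch below uses Python's `re` module as a stand-in for Splunk's regex engine (an assumption; the two behave the same for these simple patterns). The event containing `/Name/page#top` is a made-up example of a non-comment line with a # in it, and `is_dropped` is a hypothetical helper.

```python
import re

# Python's re is used here in place of Splunk's regex engine (assumption).
ANCHORED = re.compile(r"^#")    # must start with '#'
ESCAPED = re.compile(r"^\#")    # same, with the '#' escaped
UNANCHORED = re.compile(r"#")   # '#' anywhere in the event


def is_dropped(pattern, event):
    """Return True if the pattern matches, i.e. the event would be routed away."""
    return pattern.search(event) is not None


comment = "#Version: 2.0"
# Hypothetical event: not a comment, but contains '#' in the URI.
event_with_hash = "2012-07-30 07:00:01 255.255.255.255 - GET /Name/page#top 200 0.000"

for name, pattern in [("^#", ANCHORED), (r"^\#", ESCAPED), ("#", UNANCHORED)]:
    print(name, is_dropped(pattern, comment), is_dropped(pattern, event_with_hash))
```

Only the unanchored pattern matches the second event; the escaped and unescaped anchored forms behave identically here.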
This is a static file.
Thanks for pointing out the field name problem; I've changed them now. After re-reading the transforms.conf doc, I realise that CLEAN_KEYS, which defaults to true, implicitly solved my problem with the field names. It probably has a performance impact, though.
Regarding the regex, I just checked your suggested syntax against what I was using in http://gskinner.com/RegExr/ and yours didn't highlight any comments beginning with #.
The Splunk team seems to suggest this tool too: http://wiki.splunk.com/Community:RegexTestingTools
How sure are you my regex is incorrect?
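One possible explanation for a tester not highlighting comment lines (not confirmed in this thread) is that a multi-line sample was pasted in as a single string without multiline mode enabled, in which case ^ only matches at the very start of the text. The sketch below shows that effect with Python's `re`; the sample deliberately puts an event line first, which is an assumption made for illustration.

```python
import re

# A multi-line sample pasted as one string; the first line is an event,
# the next two are comments (ordering chosen for illustration).
pasted = (
    "2012-07-30 07:00:01 255.255.255.255 - POST /Name/APP.do?ts=20383926 200 0.041\n"
    "#Version: 2.0\n"
    "#Software: Apache Tomcat/6.0.26\n"
)

# Default mode: '^' matches only at the start of the whole string.
single = re.findall(r"^#", pasted)

# Multiline mode: '^' also matches right after each newline.
multi = re.findall(r"^#", pasted, flags=re.MULTILINE)

print(len(single), len(multi))
```

Splunk applies the transform to each event separately after line breaking, so within Splunk the anchored pattern is tested against the start of each event rather than against the whole file.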
Is this a static file? Are more events being added to the file? What is the "linecount" of the file according to other tools?
Second, although you didn't ask, some of your field names are invalid in the Apache-TomCat stanza of the transforms.conf. Field names may contain only alphabetic characters, numbers and underscore; they must begin with an alphabetic character.
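The naming rule above (letters, digits and underscore only, starting with a letter) is easy to check mechanically. The following is a minimal sketch: `is_valid_field_name` encodes the rule as stated in this answer, and `suggest_name` is a hypothetical helper that proposes a replacement. It is not Splunk's own CLEAN_KEYS logic, and the `f_` prefix is an arbitrary choice.

```python
import re

# Rule as stated above: letters, digits, underscore; first character a letter.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_field_name(name):
    """Check a field name against the rule described in this answer."""
    return VALID_NAME.fullmatch(name) is not None


def suggest_name(name):
    """Propose a valid name (hypothetical helper, not Splunk's behaviour)."""
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)   # replace invalid characters
    cleaned = re.sub(r"_+", "_", cleaned).strip("_")  # tidy repeated underscores
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "f_" + cleaned                     # arbitrary prefix (assumption)
    return cleaned


for field in ["date", "c-ip", "x-H(remoteUser)", "x-P(__navigator_index)"]:
    print(field, is_valid_field_name(field), suggest_name(field))
```

Of the names in the FIELDS list, only plain ones such as date and time pass the check; the rest contain hyphens or parentheses.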
Finally, your comment regex should be
REGEX = ^#
You were not requiring that the line begin with a #!