Events are going missing from our search results. The "scanned events" total during the search is correct, but the "matched events" count is much smaller, even though we are doing a simple "source=foo" type of search, which normally does not filter out any events from a source.
The events are missing from contiguous timestamp ranges. Out of 9 similar sources, they are only missing from the 2 sources bigger than 500MB, and they only went missing after we started forwarding new data for the same "source" and "host" (most of the data for the source comes from a massive uncompressed backlog archive). Before we re-did it all on the 18th, the affected ranges were everything before 14:32 on 13 Nov and a big missing patch for 13-14 Nov, with only a few scattered patches of "matching" events since 14 Nov. Currently, events on the 2 sources are not "matching" for times before 17:23 on 17 Nov.
We re-did our Splunk setup from scratch, and encountered exactly the same problem. This is what we did to cause the problem:
Forward massive back-logs: on 18 Nov 2010 11:00, forward events from 9 uncompressed log archives totalling 1.7GB in a single monitored directory on a local Linux Box A, to the "livelogs" index on the Windows Splunk server. These logs span 2009-11-01 00:00 to 2010-11-18 00:00.
Besides the "license violation #1" warning on the 19th, this step went smoothly. Searching for "*" showed "scanned events" equal to "matched events" the whole way, and every last line of the logs was accounted for. NO MISSING DATA.
Forward realtime events: around 18 Nov 16:45, start forwarding real-time events (from 2010-11-18 00:00 onwards) from a monitored directory on a different Linux Box B (in a different country) to the "livelogs" index on the Windows Splunk server. Events have the same "source" and "host" as the imported backlogs, but come from the live environment and trickle in in real time. These log files are of course much smaller and roll at midnight (foo.log->foo.log.1).
Some back-logs were >500MB: The largest archives are a 540MB log starting from 1 Nov 2009, and a 777MB request log that starts from 1 Oct 2010. There also are some smaller logs (180MB, 2MB, etc) all starting from 1 Nov 2009.
Events go missing: We come in the next morning to discover that searching "source=" for the two largest sources (540MB and 777MB) scans all the events, but only events since 17:23 yesterday are "matching" and show in the results. For the 180MB source, the 2MB source and the other sub-500MB sources, all events come back (matched events equal to scanned events).
The livelogs index: I notice that the "main" index only has hot_v1_0, and about 50MB of misc. logs. The "livelogs" index has hot_v1_1 (28MB) and db_1290031199_1257053531_0 (846MB). The "Sources.data" therein shows the 9 logs for timestamps up to 2010-11-17 23:59:59.
Search errors: Sometimes we get this: Splunkd daemon is not responding: ('[Errno 10054] An existing connection was forcibly closed by the remote host',) on a search for source="/opt/logging/events/live/*.log" for "previous week" with "9,192 matching events|214,067 scanned events" -- note that the live/*.log are the only data in the "livelogs" index, so matching events is supposed to equal the scanned events.
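For anyone checking the bucket directory mentioned above: as far as I understand it, warm/cold bucket directories are named db_<newest_time>_<oldest_time>_<id>, with both times as Unix epoch seconds. The sketch below decodes the name under that assumption; `decode_bucket_name` is just a helper name I made up for illustration, not anything Splunk ships.

```python
from datetime import datetime, timezone

def decode_bucket_name(name):
    """Split a bucket directory name into its time span, reported in UTC.

    Assumes the db_<newest>_<oldest>_<id> naming convention.
    """
    prefix, newest, oldest, local_id = name.split("_")
    if prefix != "db":
        raise ValueError(f"not a warm/cold bucket name: {name!r}")
    return {
        "oldest": datetime.fromtimestamp(int(oldest), tz=timezone.utc),
        "newest": datetime.fromtimestamp(int(newest), tz=timezone.utc),
        "id": int(local_id),
    }

span = decode_bucket_name("db_1290031199_1257053531_0")
print(span["oldest"].isoformat(), "->", span["newest"].isoformat())
```

In UTC this gives a span from 2009-11-01 05:32:11 to 2010-11-17 21:59:59, so the warm bucket appears to hold the backlog and nothing from the live feed. How that lines up with the 23:59:59 shown in Sources.data depends on the server's timezone.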
Environment:
Splunk 4.1.5 Windows 64-bit on Windows Server 2008 R2
Free License, with plans to purchase next year provided we can resolve this issue.
Log entry flavour: ts=2010-11-17T23:59:59.170|Url=http://www.example.com/foo/bar|IPAddress=66.249.65.68
Log files are uncompressed
All logs are compressed-forwarded from a Linux Splunk (4.1.5) to the Windows server.
Related questions:
Disappearing logs?
Splunk Indexed Data Mysteriously Disappears
Clarifications for @gkanapathy
extract from system/local/indexes.conf:
[livelogs]
coldPath = $SPLUNK_DB\livelogs\colddb
homePath = $SPLUNK_DB\livelogs\db
thawedPath = $SPLUNK_DB\livelogs\thaweddb
extract from system/local/props.conf:
[default]
LEARN_SOURCETYPE = false
[pipekv]
LEARN_MODEL = false
SHOULD_LINEMERGE = false
TIME_FORMAT = %Y-%m-%dT%T.%Q
MAX_TIMESTAMP_LOOKAHEAD = 30
REPORT-fields = pipe-kv
extract from system/local/transforms.conf:
[pipe-kv]
DELIMS = "|", "="
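To show what the props/transforms settings above are meant to do with our log lines, here is a rough Python equivalent. It is only a sketch of the intent (split on "|", then on the first "=", then parse the ts value), not a claim about how Splunk implements DELIMS or TIME_FORMAT internally. The function names are made up, and the strptime pattern is my Python translation of %Y-%m-%dT%T.%Q.

```python
from datetime import datetime

# Sample line in the format described under "Log entry flavour".
SAMPLE = ("ts=2010-11-17T23:59:59.170"
          "|Url=http://www.example.com/foo/bar"
          "|IPAddress=66.249.65.68")

def parse_pipe_kv(line):
    """Split a line into key/value pairs: '|' between pairs, '=' inside a pair."""
    fields = {}
    for pair in line.rstrip("\n").split("|"):
        # Split on the first '=' only, so a URL containing '=' stays intact.
        key, sep, value = pair.partition("=")
        if sep:
            fields[key] = value
    return fields

def parse_ts(value):
    """Parse the ts field; %f accepts the 3-digit millisecond part."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")

event = parse_pipe_kv(SAMPLE)
when = parse_ts(event["ts"])
ts_span = len("ts=" + event["ts"])  # characters occupied by the timestamp prefix
print(when, event["Url"], event["IPAddress"], ts_span)
```

The "ts=" key plus its value takes up the first 26 characters of each line, which is inside the MAX_TIMESTAMP_LOOKAHEAD of 30, so I do not think the lookahead is cutting the timestamp short.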
Every event from the large source was visible in Splunk searches on the day that we indexed it. Things only disappeared the following day, after forwarding fresh data having the same "source".
By "missing", I mean that not all of the events scanned for "source=foo" are matching. The drop-down total of events in the source is correct, and the "scanned events" count during the search is correct, but the "matching events" count is much lower, and large spans of time do not match.
Right now I am searching for source="/opt/logging/events/live/RequestLog.log", which should match events from 1 Oct to present. It only has "matching" events from 5pm on Wed 17 Nov to 2pm on 19 Nov, then "missing" (scanned but not matched) until 00:00 Sun 21 Nov, and then "matching" events again to present time (9:26 Sun 21 Nov). Searching for all events (index=livelogs) shows that nothing matches, from sources large or small, between 14:22:38 on 19 Nov and 00:00 on 21 Nov. And then SplunkWeb crashed ("Your network connection may have been lost or Splunk web may be down."). I'm beginning to regret the move to Windows Server 2008.
Re timestamping, the logs have timestamps on every line (ts=2010-11-17T23:59:59.170 as the first key=value), and RequestLog events are Poisson-distributed at about 1 per second.
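Because the arrivals are roughly Poisson at about 1 per second, the chance of a natural quiet spell longer than t seconds is about exp(-rate * t), so long holes in the "matching" events cannot be real lulls in traffic. Below is a minimal sketch of how one could flag such holes from an exported list of matched timestamps. The `find_gaps` helper, the 1e-9 cut-off and the sample timestamps are all made up for illustration.

```python
import math
from datetime import datetime, timedelta

def find_gaps(timestamps, rate_per_sec=1.0, max_chance=1e-9):
    """Return (threshold, gaps) for gaps too long to be chance.

    For a Poisson process, P(gap > t) = exp(-rate * t), so the threshold
    is the t at which that probability drops to max_chance.
    """
    threshold = timedelta(seconds=-math.log(max_chance) / rate_per_sec)
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return threshold, gaps

# Hypothetical matched events: one per second on either side of a hole.
before = [datetime(2010, 11, 19, 14, 22, 30) + timedelta(seconds=i) for i in range(9)]
after = [datetime(2010, 11, 21, 0, 0, 0) + timedelta(seconds=i) for i in range(5)]

threshold, gaps = find_gaps(before + after)
for start, end in gaps:
    print("no matching events between", start, "and", end)
```

At 1 event per second the threshold works out to a little under 21 seconds, so a hole of a day and a half like the one described above is far outside anything the traffic pattern could explain.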