Splunk Search

Spans of missing events in search results from large sources (>500MB)

grahampoulter
Path Finder

Events are going missing from our search results. The "scanned events" total during the search is correct, but the "matched events" count is much smaller, even though we are doing a simple "source=foo" type of search, which normally does not filter out any events from a source.

The events are missing from contiguous timestamp ranges; out of 9 similar sources they are only missing from the 2 sources bigger than 500MB, and they only went missing after we started forwarding new data for the same "source" and "host" (most of the data for the source comes from a massive uncompressed backlog archive). Before re-doing it all on the 18th, it was all there for <14:32 on 13 Nov, with a big missing patch for 13-14 Nov and a few scattered patches of "matching" events since 14 Nov. Currently events on the 2 sources are not "matching" for times before 17:23 on 17 Nov.
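
For anyone trying to visualise the symptom, a search along these lines makes the missing spans show up as gaps in the chart (just a sketch -- the source path is the live-logs path quoted further down; widen the span for longer time ranges):

index=livelogs source="/opt/logging/events/live/*.log" | timechart span=1h count by source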

We re-did our Splunk setup from scratch, and encountered exactly the same problem. This is what we did to cause the problem:

Forward massive back-logs: on 18 Nov 2010 11:00, forward events from 9 uncompressed log archives totalling 1.7GB in a single monitored directory on a local Linux Box A, to the "livelogs" index on the Windows Splunk server. These logs span 2009-11-01 00:00 to 2010-11-18 00:00.

Besides the "license violation #1" warning on the 19th, this step went smoothly. Searching for "*" showed "scanned events" equal to "matched events" the whole way, and every last line of the logs was accounted for. NO MISSING DATA
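
(As a sanity check on per-source totals like this, the metadata search command reports totalCount and first/last event times for each source straight from the index metadata, without scanning events -- a sketch against our index:)

| metadata type=sources index=livelogs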

Forward real-time events: around 18 Nov 16:45, start forwarding real-time events from 2010-11-18 00:00 onwards from a monitored directory on a different Linux Box B (in a different country) to the "livelogs" index on the Windows Splunk server. Events have the same "source" and "host" as the imported backlogs, but come from the live environment and trickle in in real time. These log files are of course much smaller and roll at midnight (foo.log -> foo.log.1).
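
For completeness, the Box B side is nothing exotic -- a monitor input feeding a compressed tcpout, roughly as in the sketch below. The port, group name, server name and sourcetype here are illustrative placeholders, not copies of our actual files (I'm assuming the [pipekv] props stanza further down is a sourcetype name):

inputs.conf on Box B:

[monitor:///opt/logging/events/live]
index = livelogs
sourcetype = pipekv
disabled = false

outputs.conf on Box B:

[tcpout]
defaultGroup = windows_indexer

[tcpout:windows_indexer]
server = splunk-win.example.com:9997
compressed = true

(The receiving splunktcp input on the Windows server needs compressed = true as well.)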

Some back-logs were >500MB: The largest archives are a 540MB log starting from 1 Nov 2009, and a 777MB request log that starts from 1 Oct 2010. There are also some smaller logs (180MB, 2MB, etc.), all starting from 1 Nov 2009.

Events go missing: We come in the next morning to discover that searching source= for the two largest sources (540MB and 777MB) scans all the events, but only events since 17:23 the previous day are "matching" and show in the results. For the 180MB source, the 2MB source and the other sub-500MB sources, all events come back (matched events equal to scanned events).

The livelogs index: I notice that the "main" index only has hot_v1_0, and about 50MB of misc. logs. The "livelogs" index has hot_v1_1 (28MB) and db_1290031199_1257053531_0 (846MB). The "Sources.data" therein shows the 9 logs with timestamps up to 2010-11-17 23:59:59.
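
Incidentally, the bucket directory name encodes the newest and oldest event times it contains as epoch seconds (db_<newestTime>_<oldestTime>_<localId>), so db_1290031199_1257053531_0 is a single 846MB bucket whose events span from about 1 Nov 2009 through the end of 17 Nov 2010. Depending on the version, the dbinspect search command will list the same per-bucket information (sketch; the exact output fields vary by version):

| dbinspect index=livelogs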

Search errors: Sometimes we get this: Splunkd daemon is not responding: ('[Errno 10054] An existing connection was forcibly closed by the remote host',) on a search for source="/opt/logging/events/live/*.log" for "previous week" with "9,192 matching events|214,067 scanned events" -- note that the live/*.log files are the only data in the "livelogs" index, so matching events should equal scanned events.

Environment:

  • Splunk 4.1.5 Windows 64-bit on Windows Server 2008 R2
  • Free License, with plans to purchase next year provided we can resolve this issue.
  • Log entry flavour: ts=2010-11-17T23:59:59.170|Url=http://www.example.com/foo/bar|IPAddress=66.249.65.68
  • Log files are uncompressed
  • All logs are compressed-forwarded from a Linux Splunk (4.1.5) to the Windows server.

Clarifications for @gkanapathy

extract from system/local/indexes.conf:

[livelogs]
coldPath = $SPLUNK_DB\livelogs\colddb
homePath = $SPLUNK_DB\livelogs\db
thawedPath = $SPLUNK_DB\livelogs\thaweddb

extract from system/local/props.conf:

[default]
LEARN_SOURCETYPE = false
[pipekv]
LEARN_MODEL = false
SHOULD_LINEMERGE = false
TIME_FORMAT = %Y-%m-%dT%T.%Q
MAX_TIMESTAMP_LOOKAHEAD = 30
REPORT-fields = pipe-kv

extract from system/local/transforms.conf:

[pipe-kv]
DELIMS = "|", "="
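
To make the DELIMS line concrete: the first delimiter ("|") splits an event into pairs and the second ("=") splits each pair into key and value, so the sample event from the Environment section comes out roughly as (illustrative rendering):

ts=2010-11-17T23:59:59.170|Url=http://www.example.com/foo/bar|IPAddress=66.249.65.68

ts        = 2010-11-17T23:59:59.170
Url       = http://www.example.com/foo/bar
IPAddress = 66.249.65.68
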
  1. Every event from the large source was visible in Splunk searches on the day that we indexed it. Things only disappeared the following day, after forwarding fresh data having the same "source".

  2. By "missing", I mean that not all of the events scanned for "source=foo" are matching. The drop-down total of events in the source is correct, the "scanned events" during the search is correct, but the "matching events" is much less, and large spans of time do not match.

  3. Searching for source="/opt/logging/events/live/RequestLog.log" right now should match events from 1 Oct to the present, but it only has "matching" events from 5pm on Wed 17 Nov to 2pm on 19 Nov, then "missing" (scanned but not matched) until 00:00 Sun 21 Nov, and then "matching" events again up to the present time (9:26 Sun 21 Nov). Searching for all events (index=livelogs) shows that nothing matches, from sources large or small, between 14:22:38 on 19 Nov and 00:00 on 21 Nov. .... and then SplunkWeb crashed ("Your network connection may have been lost or Splunk web may be down."). I'm beginning to regret the move to Windows Server 2008.

  4. Re timestamping, the logs have timestamps on every line (ts=2010-11-17T23:59:59.170 as the first key=value), and RequestLog events are Poisson-distributed at about 1 per second.

1 Solution

grahampoulter
Path Finder

I have a possible solution that I'm going to test.

I was reading HowSplunkStoresIndexes and think the problem may be that the "livelogs" index, being a "custom" index, by default has only one hot bucket, and the hot bucket max range is 90 days. So it makes sense that indexing 1.7GB of archival data spanning 12 months into "livelogs" would have issues.
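
(For reference, that 90-day cap is the maxHotSpanSecs setting, which defaults to 7776000 seconds. The alternative to adding more hot buckets would be to let a single hot bucket span a wider time range, something like the sketch below -- not what I ended up doing, and untested:)

[livelogs]
# allow a single hot bucket to span roughly 400 days of event time
maxHotSpanSecs = 34560000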

Also, in indexes.conf the default quarantinePastSecs is 300 days, but parts of each archival data source are as much as 360 days old, so I'm raising it to 420 days to be safe.
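
(Working the number: 420 days x 86,400 seconds/day = 36,288,000 seconds, which is the quarantinePastSecs value in the config below.)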

After the archival data is indexed, I'm going to manually roll the hot buckets to make sure there are fresh hot buckets for the new data.
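
I believe there is a REST endpoint for rolling hot buckets, so the manual roll could be something like the line below -- I haven't confirmed it is available on 4.1.x, and the credentials are placeholders:

splunk _internal call /data/indexes/livelogs/roll-hot-buckets -method POST -auth admin:changeme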

Here is my new system/local/indexes.conf:

# only have 327GB on D:
maxTotalDataSizeMB = 250000
# quarantineFutureSecs 14 days
quarantineFutureSecs = 1209600
# allow archival data up to 420 days old
quarantinePastSecs = 36288000

[livelogs]
coldPath = $SPLUNK_DB\large\colddb
homePath = $SPLUNK_DB\large\db
thawedPath = $SPLUNK_DB\large\thaweddb
maxMemMB = 20
maxConcurrentOptimizes = 6
# roll a hot bucket to warm after 1 day with no new data
maxHotIdleSecs = 86400
# 10 x 90-day hot buckets to deal with archival data
maxHotBuckets = 10
# at most 750MB per bucket
maxDataSize = auto
# freeze(delete) after 3 years (1095 days)
frozenTimePeriodInSecs = 94608000



grahampoulter
Path Finder

It worked. The lesson is, when adding archival data, do not merely use a separate index, but ALSO configure it similarly to the settings the "main" index gets in system/default/indexes.conf -- one hot bucket and more than a 90-day range of data arriving all at once is a recipe for disaster.

Also, when your archival data is very old, adjust quarantinePastSecs to avoid your oldest data being quarantined.

http://answers.splunk.com/questions/712/put-data-in-separate-index-based-on-timestamp


gkanapathy
Splunk Employee

Hmm. Would you mind posting your indexes.conf settings? You could update them in your posting.

Can you also clarify, for the logs that are "missing" data, did you ever see them in Splunk? i.e., was the data there and searchable but then gone, or has it always (as far as you know) been unsearchable?

Also, do you possibly have large logs either without timestamps, or many hundreds of thousands of entries with identical (down to 1 second) timestamps?


grahampoulter
Path Finder

I have added the requested clarifications to the post under "Clarifications for @gkanapathy".
