About jrodman

jrodman · ‎01-19-2016

I do not agree with this flat advice. If you have to monitor a location where you have a very large corpus of files that will not change, ignoreOlderThan can be critical to achieving a workable result. The cost of repeatedly calling stat() on a fileset in the hundreds of thousands or millions of files will preclude effective data acquisition on any type of rotating storage or even hybrid storage. Even on solid state storage the cpu overhead may become unworkable. Yes, it's vastly better in many ways to simply not retain your archived data in the location that Splunk monitors. ignoreOlderThan is a workaround for cases where you cannot dictate the policy in the log-storage location. This will lower unavoidable operating-system overhead caused by providing metadata on these files to Splunk, as well as Splunk costs related to content-tracking to determine these files are already handled. There are also disk-space costs with all the content tracking, and in-memory costs in file input to retain some information about all of these files. But sometimes the choice is simply not available.
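To get a rough feel for the metadata cost described above, here is a minimal Python sketch that walks a directory tree, calls stat() on each file once, and reports how long the pass took and how many files fall outside a cutoff. The function name `survey_tree`, the 14-day default, and the returned fields are illustrative assumptions, not anything Splunk provides. It only approximates a single polling pass, and on a warm cache it will understate the cost on cold or networked storage.

```python
import os
import time


def survey_tree(root, cutoff_days=14.0, now=None):
    """Stat every file under root once and summarise the result.

    Hypothetical helper: approximates one metadata pass of a file monitor.
    """
    now = time.time() if now is None else now
    cutoff = now - cutoff_days * 86400  # files modified before this are "old"
    total = 0
    stale = 0
    started = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += 1
            if st.st_mtime < cutoff:
                stale += 1
    elapsed = time.perf_counter() - started
    return {"files": total, "older_than_cutoff": stale, "seconds": elapsed}
```

If nearly all files land in `older_than_cutoff` and the pass takes longer than about a second, that is the kind of location where ignoreOlderThan, or moving the archive elsewhere, is worth considering.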

jrodman · ‎01-15-2016

Additionally, you are likely seeing interaction with modern Windows (post-Vista) modtimes. Windows, for its inscrutable reasons, does not update the modification time on files when the files are modified. Thus a file that is open for many days will become many days old, with the modtime only updated when the file is closed. Independently, ignoreOlderThan must be very aggressive, because it is designed to improve cases where extremely large numbers of files are present, and even checking the modtime on the files becomes too large an expense to keep up with the data. That said, there is room to make this more explicitly documented and we are doing so.

jrodman · ‎01-15-2016

Indeed, this is precisely why the spec file states:

* As a result, do not select a cutoff that could ever occur for a file you wish to index. Take downtime into account! Suggested value: 14d , which means 2 weeks

jrodman · ‎01-13-2016

Huh, we use rpath already on some platforms. No real reason we don't use it on OS X.

jrodman · ‎12-24-2015

Incidentally, hyperthreaded cores will not offer an increase in real performance for Splunk tasks commensurate with their quantity. The benefit is real, but not linear with hyperthread counts.

jrodman · ‎12-12-2015

Since the LDAP rework in Splunk Enterprise 4.2, Splunk always uses v3 of the LDAP protocol.

jrodman · ‎12-12-2015

In old versions of Splunk (e.g. 4.0) it was possible to select the use of LDAP v2 or v3. Is this still possible in the current product? If not, what version does Splunk use?

jrodman · ‎12-07-2015

Indeed, I've worked through painful, multi-week system diagnosis problems where customers were enduring outages and brownouts due to running over capacity. Determining the full set of necessary work, the physical configuration of the entire search environment, the workload from user configuration, and the aggregate load of all installed apps is just not quick work without knowing the environment. Messages like this will save tens of thousands of dollars in waste.

jrodman · ‎10-08-2015

That's true, but I'm speaking not of the number of open files but of the number of files that Splunk has discovered in the configured monitored location.

jrodman · ‎10-08-2015

[batch://] inputs and [monitor://] inputs are both handled by tailing and batchreader. batchreader handles the over-20MB-to-read files for both batch:// and monitor://. However, batchreader is a distinct component, which is what you and I both wrote about, and it does not implement the full set of logic. This is all going to be out of date soon anyway, as we're moving to a model where we have the finder thread and multiple reader threads (one per pipeline), so the concept of "batch reader" should more or less go away.

jrodman · ‎10-08-2015

If you actually have a single directory containing 100k files in it directly, there are many filesystems which will themselves degrade horribly. The process of asking the operating system for the list of the 100k contained files can be exhaustively expensive on some systems. I hope that's not the proposed situation. (Hashed directories are the typical solution for making this not break, and some systems can handle it.)

jrodman · ‎10-08-2015

I would caveat that there are design limitations with the current tailing system that will probably prevent effectively monitoring a pool of files in the millions range.

Typically, it's true that at very large numbers of files, the aggregate data rate becomes a problem first: how much data the forwarder can digest per second (cpu limited), how fast it can transmit over the network (sometimes cpu limited, sometimes network limited), or most commonly how fast the indexing tier can push to disk per second in aggregate (tricky, can involve contention with searches). This means that the ability of tailing to read from large NUMBERS of files typically isn't relevant, as with very large installations the above problems are hit first. So when data is "spotty", it almost always means that the system as a whole is not able to keep up with the aggregate data rate, not that there is a problem with tailing (file monitoring) itself. Of course we need to look at the system diagnostics to confirm this, but file count would not be the first guess.

Aside: As of 6.3, Splunk Enterprise (and other downstream splunkd products) has implemented parallel pipelines, allowing use of more cpu to process data, which should lower some of those classic cpu bottlenecks, at the price of higher total core count in use for incoming data.

Regardless of those truths, tailing can still get into trouble in two sorts of situations:

Situation 1, uncached metadata in the hundreds of thousands: In brief, if the files are stored somewhere that the current size and time can remain "warm", like in caches, then 10k to 100k files can work okay, but if files are somewhere this cannot be done, like NFS, the metadata requests are likely too demanding at file numbers like 100k.

Situation 2, millions of files: Into the millions of monitored files, either the per-file memory cost of tracking the files, or the I/O cost of retrieving the current size or time of the files, will become too large.

Situation 1 in detail: Tailing checks frequently for whether files have changed, in order to queue those files to be read promptly. Because the goal is to achieve near-realtime in well-maintained conditions, the max intended wait time between checks is around a second. (There are other timers for error conditions, files currently open, etc.) This means a minimum intended 10,000 to 100,000 (relative to file count) stat() calls for unix or GetFileInfo calls for Windows every second. If the backing data (the file size and time information) is in memory, this isn't a big problem. However, if the files are stored on a storage system that cannot cache the information locally (some types of networked or clustered filesystems), that may result in an unworkably high number of I/O operations per second.

The file monitoring code will gracefully degrade if it cannot achieve this intended schedule of file checks. The files will still be checked, and changed files will still be read from. However, not meeting that intended rate of file checks will typically mean that the storage system is being significantly taxed by the random IOPS of metadata retrieval. This will lessen its capability to provide the practical file data, and on a shared-function storage system could introduce contention with other applications as well.

Situation 2 in detail: It may be surprising, but it's unavoidably necessary for the file monitoring component to keep some amount of memory in use for every discovered file (even files you rule out with controls like ignoreOlderThan). This means that as the number of files grows to infinity, the ram needed by file monitoring will too. In practice, 100k files will use a significant amount of ram (perhaps hundreds of megabytes), but millions of files will easily reach multiple gigabytes. That can be accommodated (with grumbling perhaps), but at some point, e.g. 100 million files, it will not work.

More likely to be a problem is the same issue from Situation 1. With millions of files, it becomes more likely that the operating system's caching logic will not keep all the file status information in ram, or simply that the system calls to request that information will become too much of a cpu bottleneck for the system to perform smoothly.

If there is a true need (especially if growing) to monitor a set of multi-millions of files from one installation, please present that case very actively through both support and sales channels.
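The arithmetic behind the two situations can be sketched in a few lines of Python. The one-second check interval comes from the description above. The per-file tracking cost of 2048 bytes is purely an assumption, picked so that 100k files lands in the "hundreds of megabytes" range mentioned; the real figure is not stated here and will vary by version and configuration.

```python
def metadata_checks_per_second(file_count, check_interval_s=1.0):
    """Intended stat()/GetFileInfo calls per second for a set of monitored files."""
    return file_count / check_interval_s


def tracking_memory_bytes(file_count, bytes_per_file=2048):
    """Rough RAM needed to track discovered files.

    bytes_per_file is a hypothetical figure, not a documented Splunk value.
    """
    return file_count * bytes_per_file


if __name__ == "__main__":
    for count in (10_000, 100_000, 1_000_000, 100_000_000):
        checks = metadata_checks_per_second(count)
        mem_mib = tracking_memory_bytes(count) / (1024 * 1024)
        print(f"{count:>11,} files: {checks:>13,.0f} checks/s, ~{mem_mib:,.0f} MiB tracked")
```

Both quantities grow linearly with file count, which is why a location that is merely slow at 100k files stops being workable somewhere in the millions.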

jrodman · ‎10-08-2015

Are you sure batchreader implements time-before-close itself? It should simply hand the file back to tailing, but I don't know for sure what it does. The batchreader/tailing design last I looked was under review so I don't have the complete authority for the current 6.3 state.

jrodman · ‎06-09-2015

This answer isn't about an issue. It's an informational answer about splunk terminology.

jrodman · ‎06-05-2015

I don't know the full logic that went into the docs being written in that precise way. I would suggest commenting directly on the doc. I can pass it to the docs team generally but it seems more natural to raise the conversation directly with the docs team to keep yourself in the loop.

jrodman · ‎06-05-2015

minFreeSpace is enforced by the indexer, which isn't even loaded into memory on a Universal Forwarder.

jrodman · ‎06-05-2015

The more general paths for these controls across different Linux distributions are:

/sys/kernel/mm/transparent_hugepage/enabled
/sys/kernel/mm/transparent_hugepage/defrag

The Red Hat-specific paths have to do with backporting the feature to older kernels when it was new. The more general way to persist this type of kernel system setting is via sysctl, but distributions or local practice may have preferred alternatives.
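For checking the current state of those two controls, a small Python sketch follows. It relies on the kernel's convention of listing every option and marking the active one with square brackets, e.g. `always madvise [never]`. The helper names `active_setting` and `read_thp_settings` are made up for this example, and on a system without these files each value simply comes back as None.

```python
import re
from pathlib import Path

# General (non-distribution-specific) locations of the THP controls.
THP_FILES = {
    "enabled": Path("/sys/kernel/mm/transparent_hugepage/enabled"),
    "defrag": Path("/sys/kernel/mm/transparent_hugepage/defrag"),
}


def active_setting(text):
    """Return the bracketed (active) option from a THP control file's contents."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_thp_settings(files=THP_FILES):
    """Read each control file; report None where it is missing or unreadable."""
    settings = {}
    for name, path in files.items():
        try:
            settings[name] = active_setting(path.read_text())
        except OSError:
            settings[name] = None
    return settings


if __name__ == "__main__":
    print(read_thp_settings())
```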

jrodman · ‎06-05-2015

Be sure you ACTUALLY have events over 256 lines long.

jrodman · ‎05-28-2015

I haven't personally tested, but my understanding was that this behavior came from a Linux kernel problem, subsequently fixed. The same high cpu problem was reported in many other applications during that time interval, and could be quiesced by simply setting the system date to the date it already had. So assuming the Linux kernel team has fully resolved the problem, and you've updated your kernels since 2012, it should not reoccur. Personally I'm encouraging an internal test at Splunk Corporate for the leap second traversal now. We can't cover all cases, but we should cover the common one.

jrodman · ‎05-22-2015

What are you trying to do with mvcombine here? It looks like your stats command is requesting a multivalue field for user, but then you're trying to combine it. mvcombine works on multiple events, with single-value fields. What do you want as your ultimate table?

jrodman · ‎05-13-2015

FWIW, in both recipes you probably want to consider manipulating the PATH variable as well, to put the things you'll want to use higher on the path than splunk's bin (or dropping the splunk bin dir entirely if that matches your goals), but that's harder to encapsulate in an example.

jrodman · ‎05-13-2015

When I wrote this I probably should have mentioned that you are likely to want to manipulate the PATH environment variable, to select which will win in case of a name conflict when running additional commands: system binaries, Splunk-provided binaries, or any custom binaries (e.g. in /usr/local or special paths). This is typically relevant for bzip2, python itself, cherryd, the openssl utility program, and node.js's node executable, but more executables could possibly be added to Splunk in the future.
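A minimal sketch of that PATH manipulation in Python, assuming a hypothetical install at /opt/splunk and a wrapper that launches other commands. The function name `reorder_path` and its parameters are invented for illustration; it only builds the new PATH string, leaving it to the caller to place the result in the environment of whatever it runs.

```python
import os


def reorder_path(current_path, preferred, drop=()):
    """Return a PATH string with `preferred` dirs first and `drop` dirs removed.

    Remaining entries keep their original order; duplicates are dropped.
    """
    seen = set()
    ordered = []
    for entry in list(preferred) + current_path.split(os.pathsep):
        if not entry or entry in drop or entry in seen:
            continue
        seen.add(entry)
        ordered.append(entry)
    return os.pathsep.join(ordered)


if __name__ == "__main__":
    env = dict(os.environ)
    # /opt/splunk/bin is an assumed install location, not a given.
    env["PATH"] = reorder_path(
        env.get("PATH", ""),
        preferred=["/usr/local/bin", "/usr/bin"],
        drop=("/opt/splunk/bin",),
    )
    print(env["PATH"])
```

Putting system directories first lets the system python or openssl win a name conflict while Splunk's bin directory stays reachable; passing it in `drop` removes it entirely, if that matches your goals.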

jrodman · ‎05-05-2015

Years later.. No. The bulletin board messages are not attached to a log. The code paths that produce bulletin board messages need to pass the relevant information to logging channels when appropriate as well as the bulletin board. Thus, some messages (for example search output) may appear in search.log for relevant searches, and many important status messages from the backend are written to splunkd.log as well as the messages system. For the old python modules system that would have been handling that sort of view validation when this question was asked, it should have logged to web_service.log, but apparently did not.

jrodman · ‎05-04-2015

My Linux system has 30GB of RAM available. Is there a way I can limit the memory used by Splunk so that I do not exhaust the total memory in use on the system and cause a service outage?

jrodman · ‎05-04-2015

It's possible to use Linux "control groups" to apply a ceiling to the memory use of any group of processes via various means. Control groups were introduced originally to start meeting the needs of "containers" or in-operating-system virtualization goals like Virtuozzo, OpenVZ, KVM and so on, but have since found many other uses. Here's an article which describes steps which can be used on current releases of Linux (e.g. RHEL/CentOS 7 or Debian 8) to limit all memory used by a particular userID (e.g. user splunk). http://wiki.splunk.com/Community:Limiting_Splunk_Memory_Linux_ControlGroups#Limiting_Splunk_Memory_with_Linux_Control_Groups

Posts	949
Solutions	172
Karma Given	397
Karma Received	987
Member Since	‎01-15-2010
