About kulick

kulick · ‎01-15-2025

I know this is a while ago now, but maybe helpful to others...try using the "hidden" dimension `_timeseries`. This is a JSON string that is an amalgamation of all of the dimensions for each datapoint. Take care, the results may be (very) high arity and splunkd doesn't (yet?) have very strong protections for itself (in terms of RAM used while searching) when using this code path, so it is (IMHO) easy to crush your indexer tier's memory and cause lots of thrashing.

kulick · ‎04-15-2023

Just reading this thread. Cool. I didn't know about estdc(). I will definitely look for chances to use it now. Regarding your question @VatsalJagani, I suspect this works by just hashing the values (the hash function is O(1)-ish if chosen properly; maybe actually O(n) in the size of the value, but that is small, <1024 or something), then incrementing a "distinct" counter if that hash has not been seen before, and finally recording that the hash has been seen in a table (at an index based on the hash). That should all be basically constant time for each row/event considered. Of course, if two distinct values hash to the same "bucket", then you will undercount, but if your hash is large enough to minimize those collisions, then you will get approximately the right answer. Lots of edge cases need to be handled in that description, but they are basically the same edge cases that arise in any hashtable implementation, and (sane) hashtable implementations are well understood to be O(1) for insert and lookup. Thanks for the Answer, @khourihan_splun !
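The scheme guessed at in the post above can be sketched in a few lines of Python. This is only an illustration of that guess, not a description of how estdc() is actually implemented; the function name `approx_distinct_count`, the `num_buckets` parameter and the choice of BLAKE2 as the hash are all assumptions made for the example.

```python
import hashlib

def approx_distinct_count(values, num_buckets=1 << 20):
    """Estimate the number of distinct values with a fixed-size 'seen' table.

    Hypothetical sketch: each value is hashed to one bucket. Two distinct
    values that land in the same bucket are counted once, so the result
    can only undercount, never overcount.
    """
    seen = bytearray(num_buckets)  # one flag per hash bucket
    distinct = 0
    for value in values:
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        bucket = int.from_bytes(digest, "big") % num_buckets
        if not seen[bucket]:
            seen[bucket] = 1  # remember that this bucket has been hit
            distinct += 1
    return distinct
```

Each value costs one hash and one table lookup, which is the constant-time-per-event behavior described above. Shrinking `num_buckets` makes the collision undercount easy to observe.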

kulick · ‎01-26-2023

Our teams have noticed an issue since we upgraded to Splunk 9.0.3 (from 8.1.x) with the chart legend interactions. When the legend is a long list of series, attempting to scroll the legend list by clicking on the "scroll button" is instead mistakenly interpreted as a click on one of the legend's series, causing a drilldown search. This is seen, at least, in the Search Assistant. Has anyone else seen this? Is it a known (to Splunk) problem? Any idea which versions/situations do/don't exhibit the bug?

kulick · ‎12-02-2020

Sorry for the slow reply and thanks for the code review. Definitely misused eval there in my example. I will use searchWhenChanged="true" in the future. Thanks!

kulick · ‎09-29-2020

I needed something like this also. How about this inefficient solution? It seems to work as long as you get the right fields into the `foreach` part... | makeresults count=10 | streamstats count | eval a="1", b="2", name="count" | foreach * [ eval value=mvindex(mvappend(case(name="<<FIELD>>",'<<FIELD>>'),value),0) ]

kulick · ‎04-28-2020

If you want to avoid mvexpand/mvcombine (which have performance and capping risks), try this... | makeresults | eval str="JOHN SMITH new york city dEvOnShIrE" | eval str=lower(str) | rex mode=sed field=str "s/^([a-z])/__\1__/ s/ ([a-z])/ __\1__/g s/__a__/A/g s/__b__/B/g s/__c__/C/g s/__d__/D/g s/__e__/E/g s/__f__/F/g s/__g__/G/g s/__h__/H/g s/__i__/I/g s/__j__/J/g s/__k__/K/g s/__l__/L/g s/__m__/M/g s/__n__/N/g s/__o__/O/g s/__p__/P/g s/__q__/Q/g s/__r__/R/g s/__s__/S/g s/__t__/T/g s/__u__/U/g s/__v__/V/g s/__w__/W/g s/__x__/X/g s/__y__/Y/g s/__z__/Z/g" Not the prettiest, but reasonably efficient, I'd guess.
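For readers who want to see what the sed script above is doing step by step, here is a rough Python equivalent. It is a sketch of the same mark-then-substitute idea, not Splunk code, and the function name `title_case_via_markers` is made up for the example.

```python
import re
import string

def title_case_via_markers(text):
    """Capitalize the first letter of every space-separated word.

    Mirrors the sed script: wrap each word-initial letter in __x__
    markers, then replace each marker with its uppercase letter.
    """
    text = text.lower()
    # Mark the first letter of the string and every letter after a space.
    text = re.sub(r"^([a-z])", r"__\1__", text)
    text = re.sub(r" ([a-z])", r" __\1__", text)
    # One substitution per letter, like the 26 s/__a__/A/g expressions.
    for letter in string.ascii_lowercase:
        text = text.replace(f"__{letter}__", letter.upper())
    return text

print(title_case_via_markers("JOHN SMITH new york city dEvOnShIrE"))
# prints "John Smith New York City Devonshire"
```

The markers are needed in the SPL version because the sed mode of rex has no way to uppercase a captured group directly, hence the 26 fixed substitutions.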

kulick · ‎04-28-2020

How about this SPL leveraging sed-mode of rex... | makeresults | eval str="The foo is bar. baz is fine. what?" | rex mode=sed field=str "s/\.( +)([a-z])/. \1__\2__/g s/__a__/A/g s/__b__/B/g s/__c__/C/g s/__d__/D/g s/__e__/E/g s/__f__/F/g s/__g__/G/g s/__h__/H/g s/__i__/I/g s/__j__/J/g s/__k__/K/g s/__l__/L/g s/__m__/M/g s/__n__/N/g s/__o__/O/g s/__p__/P/g s/__q__/Q/g s/__r__/R/g s/__s__/S/g s/__t__/T/g s/__u__/U/g s/__v__/V/g s/__w__/W/g s/__x__/X/g s/__y__/Y/g s/__z__/Z/g" Ugly, but effective. I use a variant (that doesn't require a period) to capitalize every word in a fragment for display in a table.

kulick · ‎01-28-2020

I can confirm both that tokens don't expand/work properly in search nodes' base attributes and that answer works well! Nice!

kulick · ‎01-24-2020

Maybe accept above answer for karma? 😉

kulick · ‎01-24-2020

Sorry, I didn't originally see your follow up here. I think that the underlying bug that I am working around in the patch referenced above produces different behaviors depending on many different factors (eg. custom command input event to output event ratio (does it turn each input event into multiple output events?), Splunk architecture (are there many indexers serving the query or just one?), the SPL structure (is the SPL command starting from a generating command or from a real search?), etc.) Based on all of these differences, the bug may or may not be triggered. I didn't completely follow your request here, but I do not believe that simply adding or removing a flush() command at the right time will work around the current Splunk daemon bug. Instead, the custom command must carefully manage the timing of how it collects and returns information to its Splunk parent process to avoid the bug...

kulick · ‎01-24-2020

I am having trouble supporting both URL link parameters for a form input/token and "derived" value tokens which are computed from the value of a form token. As an example, imagine that I have two tokens, tok and derived. The token derived is set when token tok is changed, as follows...

    <input token="tok">
      <label>Summary Interval</label>
      <choice value="10minute">10 minutes</choice>
      <choice value="1hour">1 hour</choice>
      <choice value="1day">1 day</choice>
      <default>1hour</default>
      <change>
        <condition value="10minute">
          <eval token="derived">minspan=10m</eval>
        </condition>
        <condition>
          <set token="derived"></set>
        </condition>
      </change>
    </input>

The problem comes when a user links to this dashboard setting tok to "10minute" in the URL (with something like "tok=10minute" in the args). In this case, it seems that the token derived will not be set at all, causing any dependent dashboard panels to wait on token input. I can work around by computing derived in a search instead, but it sure is convenient to simply set such a related (derived) token right here in the token it depends on. Does this usage of a change clause in the input pretty much ensure that bookmarks and other URL links containing token tok will be a problem?

kulick · ‎01-22-2020

Some reports that this may be fixed in 8.0. See similar question... https://answers.splunk.com/answers/777206/use-mobile-device-to-open-browser-with-splunk-dash.html

kulick · ‎01-22-2020

I would love a CSS or JS patch to repair this, if possible, to help users until we are ready to go to 8.0+. Anyone know what was changed? Is a simple patch/hack on specific dashboards viable?

kulick · ‎11-25-2019

I have attempted to handle this issue in a more recent change to the original patch. Details here: https://github.com/splunk/splunk-sdk-python/compare/master...TiVo:large-scale-custom-cmds Let me know if that resolves the issues you were seeing. Good luck! 🙂

kulick · ‎11-25-2019

Oh, latest upload here: https://github.com/splunk/splunk-sdk-python/compare/master...TiVo:large-scale-custom-cmds

kulick · ‎11-25-2019

Howdy. 😃 I actually had some additional synchronization change deltas on that original hacking that I never uploaded to github. I have pushed them now. Perhaps they handle the situation you were hitting? The changes were related to cases that hit maxresultrows and the behavior of the base class...

kulick · ‎08-09-2019

Thanks for these links. I put a pointer to my github changes in that issue.

kulick · ‎08-09-2019

I think that my latest update to my previous "answer" now actually is an answer to your original question and the problem that we were both experiencing. I'd love to know if it helps you...

kulick · ‎08-07-2019

And in fact, Martin taught me a great trick to avoid needing mvexpand . The trick covers cases where you would ultimately just be using the field in question in a group by clause of a subsequent stats command. In this case, you can simply leave the multi-valued field multi-valued and things will "just work". Cool trick! Thanks for showing me that one, Martin!

kulick · ‎07-31-2019

See my answer above, but I believe adding a flush() each time through the loop triggers the problem faster because each time the custom command process (child) writes a batch (or even a single) event back to the Splunk daemon (parent), the parent responds by sending an "empty" chunk back. Since the default python SDK never reads stdin after collecting the initial batch of events, this stdin pipe can rapidly fill up and the parent will eventually either block or get an EWOULDBLOCK errno on write calls to the other end of the pipe. Sadly, repairing this issue by teaching the child to monitor and continually drain stdin was not sufficient to prevent this error from occurring, though it does reduce the frequency somewhat.
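The "monitor and continually drain stdin" idea mentioned above can be sketched as follows. This is a minimal illustration using the standard select module on a POSIX system, not the code from the patch; the helper name `drain_if_readable` and the 64 KiB read size are assumptions for the example.

```python
import os
import select

def drain_if_readable(fd, max_bytes=65536):
    """Read pending bytes from fd without blocking.

    Returns b"" when nothing is waiting. Calling this between output
    batches keeps the parent's writes from backing up in the pipe.
    """
    # A timeout of 0 makes select() a pure poll: it never waits.
    readable, _, _ = select.select([fd], [], [], 0)
    if not readable:
        return b""
    return os.read(fd, max_bytes)
```

In a custom command this would be called with `sys.stdin.fileno()` each time a batch of results is written. As noted in the post, draining alone reduced the frequency of the error but did not eliminate it.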

kulick · ‎07-31-2019

UPDATE: With the help of a few hints from a friendly at Splunk, I believe that I have managed to get this working. I have tested on numerous configurations (single server vs. 3 SHC with 6 indexer cluster, generating vs. eventing base searches, with and w/o previews, localop and parallel on indexers) and all seem to work. Sometimes you must tune a timing parameter ( throttleusec ) that helps the custom command child process throttle the results passing back to the Splunk parent daemon, but I have gotten this version to work with hundreds of millions of events very reliably. The solution is embodied in the new echo custom command implemented in this change... https://github.com/TiVo/splunk-sdk-python/commit/5188f7d709cadd80e786692b371a64c4ae0991d2 Also, Splunk reports on my service ticket that this underlying timing bug will be resolved in a future release. ORIGINAL ANSWER: I have spent a couple of days attempting to better understand and work around this problem. At the end of my efforts, I have concluded that there is a bug in the Splunk daemon itself that behaves somewhat differently (timing-wise) from version to version and machine to machine. Along the way, I found multiple opportunities to enhance/improve the python SDK, but my fixes did not ultimately prevent the underlying problem from recurring. Details below. We initially observed this problem in production as a scheduled job began consistently failing after working fine for months. The problem signature was (as stated in the question here): <timestamp> ERROR ChunkedExternProcessor - Failure writing result chunk, buffer full. External process possibly failed to read its stdin. <timestamp> ERROR ChunkedExternProcessor - Error in '<our_custom_cmd>' command: Failed to send message to external search command, see search.log. Once this error appeared, it occurred consistently. The exact timing of when it occurred to the search relative to the search launch time varied somewhat. 
We were quickly able to reproduce this problem on local, much simpler Splunk workstation installs (single machine) using | makeresults or even index=_* | head 1000000 | table _time host source sourcetype style base searches connected to our custom command. As also already stated here, we quickly determined that reducing the custom command to the simplest possible configuration (one that simply yielded back its input) still produced the problem. During these rounds of testing, we found that the error was not 100% consistent, and varying the number of events sent to the custom command and the size of those events seemed to change the frequency of the reported error. Additionally, debugging (esp. logging) added to the custom command affected the likelihood of hitting the error. Given the text of the error, we began to suspect that somehow our custom command was allowing a pipe to fill, causing this issue. Especially suspicious was the custom command's stdin, which we knew to be connected to the Splunk daemon that was reporting the error. Reviewing the implementation strategy of the command, including the python SDK base classes, presented a few potential optimizations. First, the python SDK will currently simply read the entirety of the data input into RAM before processing (due to the implementation of _read_chunk() in SearchCommand). This seemed problematic for multiple reasons (memory usage of the custom command, lack of a true streaming implementation for large data sets). We first attempted to repair this by building a "chunk-aware" input and processing the events as we read them from stdin. This timing change (reading the input records more slowly and producing output records while doing so) seemed to trigger the buffer full failure much more quickly, so, while we think this is actually the best implementation, we abandoned it.

    class ChunkedInput(object):
        """Iterate over the lines of one chunk, reading at most `limit` bytes."""

        def __init__(self, infile, limit):
            self._file = infile
            self._limit = limit  # bytes of the chunk body not yet consumed

        def __iter__(self):
            while self._limit > 0:
                line = self._file.readline()
                if not line:
                    # EOF before the chunk was complete: stop instead of
                    # yielding empty strings forever.
                    return
                yield line
                self._limit -= len(line)

Instead, we repaired the "read everything into RAM" problem by implementing a new class StreamingSearchCommand, derived from SearchCommand. In this implementation, we reworked _read_chunk() and _records_protocol_v2() to download the incoming events into a gzip'd file in the dispatch directory and then reopen and stream them back from there. This greatly reduced the memory footprint required by our simple test custom SPL command, but the buffer full errors continued.

Next, we imagined that perhaps, since our command was not continuing to monitor and read from stdin once it had collected all of the incoming events, the pipe attached to our stdin might be filling to the point where the Splunk daemon was going to block attempting to write into it. We imagined that this condition could be underlying the error reported here. We repaired this oversight by improving our custom SmartStreamingCommand class to also occasionally poll (select, actually) the stdin file descriptor and read out a chunk if data was present. Testing of this implementation confirmed that the Splunk daemon did, in fact, continue to occasionally write things to us through this pipe, even after we had collected all input records. Still, this improvement did not completely prevent the dreaded buffer full error.

Finally, reviewing the python SDK implementation further, we were concerned that it might not be flushing records in a streaming fashion, but instead waiting until the generator chain ( self._record_writer.write_records(process(self._records(ifile))) ) drained. So, we added an occasional flush() call to our SmartStreamingCommand implementation. Unfortunately, we continued to hit buffer full errors.
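The "spool to a gzip'd file, then stream back" approach described above can be illustrated with a short sketch. This is not the code from the referenced patch: the function names `spool_to_gzip` and `stream_from_gzip` are hypothetical, and a plain directory argument stands in for the Splunk dispatch directory.

```python
import gzip
import os
import tempfile

def spool_to_gzip(lines, directory):
    """Write incoming lines to a gzip file in `directory` and return its path."""
    fd, path = tempfile.mkstemp(suffix=".gz", dir=directory)
    os.close(fd)  # gzip.open reopens the file by path
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line)
    return path

def stream_from_gzip(path):
    """Yield the spooled lines one at a time instead of loading them all."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            yield line
```

The point of the design is that memory use stays roughly constant regardless of input size, at the cost of one compressed write and one read of the data on disk.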
At this point, we decided to bring out the big guns (strace) and we started by monitoring our process. We could easily see that we were regularly monitoring stdin and reading it quickly if data was present. Everything on the python SDK/custom command process side seemed okay, so we switched to strace-ing the Splunk daemon itself. We found that attaching strace had the (Heisenberg) effect of eliminating the problem altogether. Excellent. After numerous tries, changing the number and list of syscalls that we were intercepting, we finally managed to catch the failure in action one time. We expected review of this precious, captured output to show a system call returning EWOULDBLOCK or similar, allowing us to work backwards to understand the condition that caused the Splunk daemon to become upset and produce this error. Unfortunately, after quite a bit of time tracing file descriptors, futexes and signals across threads in the Splunk daemon, all of the system calls looked fine and no clear culprit was illuminated. Additional inspection of the many search.log examples that were generated during the testing and evaluation of this issue did seem to show a pattern. Specifically, the Splunk daemon would fairly consistently issue this error approximately 80ms after the custom command had flushed a batch of input to it. We believe this suggests that the code associated with reading event records back from the custom command is likely implicated in this issue. We attempted, with some but not complete success, to leverage this observation by adding a slight sleep() before flushing each batch of records. After numerous attempts to work around this issue, including building lower footprint, more efficient python SDK replacements, we remain stuck, unable to build custom commands that process more than a million or so events without causing issues. This is a significant weakness in our current "big data" infrastructure and is blocking us on a few fronts.
We would welcome advice or collaboration intended to work towards a solution to this issue. I will file a Splunk support ticket referencing this item on Splunk Answers...

kulick · ‎11-16-2018

Unfortunate, this behavior, but if it is the current state of the art, then ER seems the best path forward. Thanks.

kulick · ‎11-16-2018

I think you need to be sure that your data in field 'container' is numeric if you use... <scale type="linear"></scale> Maybe try... <scale type="category"></scale> ...instead for non-numeric data.

kulick · ‎11-12-2018

I like and need mvexpand to work with some of my data. Sometimes, our input events contain information about multiple, underlying events (esp. rich JSON data sources). I understand that mvexpand can, under certain situations, lead to scaling challenges with SPL. I generally think of these problematic cases as examples where each individual input event expands into lots (hundreds, thousands or more) of new events. I can imagine this being especially tricky when the arity of the expansion varies greatly from input event to input event. I want to believe that cases where mvexpand causes the event count to be doubled should be safe. It seems that these cases could be implemented to be fully streamable (at the indexers) and that the SPL should scale out embarrassingly easily. Here's an example query:

    | makeresults count=10000 | streamstats count | eval count=1000*round((count-1)/1000-0.5,0) | eval mcount=mvrange(0,99,10) | mvexpand mcount | fields count mcount | fields - _raw | eval ucount=mvrange(0,49,10) | mvexpand ucount | fields count mcount ucount | fields - _raw | eventstats count as total by count | eventstats count as mtotal by mcount | eventstats count as utotal by ucount | stats count, values(eval(count." (".total.")")) as cvalues, values(eval(mcount." (".mtotal.")")) as mvalues, values(eval(ucount." (".utotal.")")) as uvalues

This SPL makes 10,000 events and then applies mvexpand twice, once by 10x and once by 5x. The result is 500,000 events as expected. By tweaking the makeresults and mvrange commands, we can test different limits of the mvexpand command. Adjusting the ucount to mvrange(0,99,10) produces the expected 1,000,000 events. This, however, is the highest number that works as I expected. Once the total number of events exceeds 1,000,000, some (undesirable) caps begin to be applied, as with any mvexpand. In my case, I need to use mvexpand where the base search itself produces many tens or hundreds of millions of events.
The "expansion factor", if you will, is a small, constant number (<100, likely less than 10, and can be constrained). Here is an example where the final expansion merely doubles the event count (in a completely local way) that I believe should work...

    | makeresults count=10000 | streamstats count | eval count=1000*round((count-1)/1000-0.5,0) | eval mcount=mvrange(0,99,1) | mvexpand mcount | fields count mcount | fields - _raw | eval ucount=mvrange(0,49,25) | mvexpand ucount | fields count mcount ucount | fields - _raw | eventstats count as total by count | eventstats count as mtotal by mcount | eventstats count as utotal by ucount | stats count, values(eval(count." (".total.")")) as cvalues, values(eval(mcount." (".mtotal.")")) as mvalues, values(eval(ucount." (".utotal.")")) as uvalues

Instead of roughly 2,000,000 events (1,980,000 to be exact, since mvrange(0,99,1) yields 99 values), I only get 984,200 in my environment. I am imagining building my own custom command, but I suspect that others have hit this limit. It certainly seems that mvexpand /could/ be smarter than this. Any advice? (For the record, I have already tried the fields - _raw trick shared in other mvexpand answers.)
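The expected event counts quoted in this post can be checked with a little arithmetic. The sketch below assumes that mvrange excludes its end value, as Python's range does; the helper name `expected_events` is made up for the example.

```python
def expected_events(base_events, *mv_ranges):
    """Multiply the base event count by the size of each mvrange expansion.

    Each entry in mv_ranges is a (start, end, step) tuple. Python's
    range() is used as a stand-in for mvrange, assuming both exclude
    the end value.
    """
    total = base_events
    for start, end, step in mv_ranges:
        total *= len(range(start, end, step))
    return total

print(expected_events(10_000, (0, 99, 10), (0, 49, 10)))  # prints 500000
print(expected_events(10_000, (0, 99, 10), (0, 99, 10)))  # prints 1000000
print(expected_events(10_000, (0, 99, 1), (0, 49, 25)))   # prints 1980000
```

Under that assumption the last query should produce 1,980,000 events, so the observed 984,200 is roughly half of what the expansion would yield without a cap.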

kulick · ‎02-07-2018

Data Set Characteristics

We have an index containing ~100k events that are each about 1k in size, making a roughly 100MB collection of data. This covers one week, but eventually we'd like to operate on months of this data. Let's consider this one week first.

Problem Statement

We desire to study these events and group them together based on a field we call 'transaction_id'. In practice, most events have unique 'transaction_id' fields and collecting them together will only reduce the number of results to around 65% of the initial input events. The problem we face is that attempts to aggregate this data hit memory limits (stats command) or run very slowly (transaction command).

Attempt #1: transaction

When we were rookies, we started with transaction. It has a simple and effective "UI" and we used it successfully on this problem. It is however slower and does not scale as well as a 'stats'-based solution so we desire to move on.

Attempt #2: stats

We have read the old scrolls and understand that we should prefer 'stats' if possible (for multiple reasons). We can generally get this technique to work, but if we apply too many input events, our jobs are crushed by the resource limit enforcer on our search head. Our SPL command is very simple...

    search index=data | stats first(field1) as field1 by transaction_id

Our reasoning regarding the required memory usage, from first principles, is thus. Our data set is, in total, about 100MB. In order to aggregate our data, the 'stats' command must hold it all (memory or disk) and combine then return it. If we imagine that the stats command's data structures and in-memory representation require 100x the original data input size, then we would need around 100MBx100 = 10GB of RAM for this operation, if no disk is to be used. While we understand conceptually that the 'stats' command is smart enough to 'spill to disk' while aggregating to avoid needing infinite amounts of memory, our experience with it in these cases is not rewarding. It gets killed while using far more memory than we would expect given our 'max_mem_usage_mb' setting. Our job size limit is 64GB on this search head...

    search_process_memory_usage_threshold = 64000
    search_process_memory_usage_percentage_threshold = 25

    -bash-4.2$ cat /proc/meminfo | grep MemTotal
    MemTotal: 263936036 kB

We expect that any given command in our SPL pipeline that honors 'max_mem_usage_mb' will try to use around or under our configured 200MB limit...

    limits.conf:
    [default]
    max_mem_usage_mb = 200

We want to believe that this should mean that our 'stats' command will begin spilling to disk before using GBs of memory. However, when we monitor this SPL query while running (watching 'top' on the search head), it is easy to see that it will grow well into the tens of GBs of resident set size. This leads to the following questions...

How does 'stats' decide when to 'spill to disk' and change its algorithmic approach?

Are we just hitting some current weakness in the algorithm that selects the aggregation strategy?

We have taken note of the 'phased_execution' documentation in limits.conf.spec and wonder how we might learn more... 🙂

Other Thoughts

We are also aware of this talk which proposes a novel and interesting strategy for combatting this kind of problem. We are still reviewing the mechanism it proposes. https://conf.splunk.com/files/2016/slides/superspeeding-transaction-monitoring-with-the-kvtransaction-command.pdf

Posts	30
Solutions	1
Karma Given	33
Karma Received	17
Member Since	‎09-07-2015


Is there a bug with clicking to scroll chart legen...

Bad Dashboard Interaction: Form Input Derived Toke...

Can you help me with the following issue involving...

Transacting with High Cardinality Groups and Memor...

How might I slice/filter data by dynamically chang...

Re: How would I build a table of all metrics and t...

Re: Whats the difference between dc (distinct coun...

Is there a bug with clicking to scroll chart legen...

Re: Bad Dashboard Interaction: Form Input Derived ...

Re: Read a field value which field name is in anot...

Re: Capitalize every word of field in search resul...

Re: Automatically capitalize the first letter of e...

Re: Change base search based on dropdown

Re: How can I define column-color when I do not kn...

Re: Chunked=True SmartStreamingCommand to support ...

Bad Dashboard Interaction: Form Input Derived Toke...

Re: Touch-based usage of Splunk Web restricted - v...

Re: Use mobile device to open browser with splunk...

Re: Chunked=True SmartStreamingCommand to support ...

Re: How come custom search commands (CSC) SCPv2 ca...

Re: How come custom search commands (CSC) SCPv2 ca...

Re: How come custom search commands (CSC) SCPv2 ca...

Re: How come custom search commands (CSC) SCPv2 ca...

Re: Can you help me with the following issue invol...

Re: How come custom search commands (CSC) SCPv2 ca...

Re: How come custom search commands (CSC) SCPv2 ca...

Re: Can you help me with the following issue invol...

Re: How can I define column-color when I do not kn...

Can you help me with the following issue involving...

Transacting with High Cardinality Groups and Memor...
