I have an interesting problem that I can't work out with the AWS TA, specifically with an S3 input.
I am collecting CloudTrail logs from an S3 bucket (no SQS, because the existing environment was preconfigured). I am using a Generic S3 input, but I have limited the collection window to only events written to S3 in the last week (ish).
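For reference, the input is configured along these lines. This is a sketch rather than my exact stanza (the input name, account name, bucket name and datetime are placeholders); `initial_scan_datetime` and `polling_interval` are the AWS TA parameters I'm using to limit the window and set the poll cadence:

```ini
[aws_s3://cloudtrail-s3]
aws_account = my-aws-account
bucket_name = my-cloudtrail-bucket
# Only pick up objects written after this point (roughly the last week)
initial_scan_datetime = 2019-01-01T00:00:00Z
# Poll the bucket every 30 minutes, matching the expected ingest latency
polling_interval = 1800
sourcetype = aws:cloudtrail
```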
As an aside, the bucket is already lifecycle managed and only has 90 days of logs within it.
I am running the TA on an AWS HF instance with the necessary IAM roles, and data is collected; however, there is a significant delay between logs being written to S3 and being collected and indexed.
After some investigation, I have discovered that the HF is chewing through its entire swap disk, whilst physical (well, virtual) memory usage peaks at ~4 GB of the total available.
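For anyone wanting to reproduce the observation, this is roughly how I was watching the swap pressure — standard Linux /proc inspection, nothing TA-specific:

```shell
# Kernel-wide swap counters
grep -E 'SwapTotal|SwapFree' /proc/meminfo

# Per-process swap usage, largest first
# (the VmSwap field only appears in /proc/<pid>/status when swap is in use)
for p in /proc/[0-9]*/status; do
  awk -v f="$p" '$1 == "VmSwap:" {print $2, f}' "$p" 2>/dev/null
done | sort -rn | head
```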
I have been debugging this issue on a variety of EC2 instance types (c5.4xl/2xl/xl), and system memory/core count has no impact on the behaviour.
The behaviour was such that the swap file located on the / volume would be heavily written to; this in turn would eat the available IOPS on the volume, which, when depleted, caused high iowait, and eventually the system became unresponsive.
To check whether this was a problem with the process needing a large amount of virtual memory for the initial 'read', I moved the instance to an m5d, which provides an ephemeral 140 GB SSD. I used this as a dedicated swap device, as it is not IOPS-limited (except by the underlying hardware device). As predicted, this has stopped the IOPS depletion, and the iowait condition is prevented.
The box has a steady load average of ~3 (yes, I know it only has 4 cores) but is otherwise quite happy.
However, the S3 Python process has consumed 124 GB of virtual memory, of which 123 GB is in swap.
I have never seen anything like this before with this TA across many deployments on AWS.
The logs from the TA report nothing untoward in the relevant log file, and whilst CT logs are getting in, they are taking 2-4 hours to arrive, rather than the 30 minutes configured in the input.
I have dumped the /swap partition with strings, and I can see that the swap file contains data from the CT log files read from S3. My present assumption is that the entire bucket's log files are being read into swap, multiple times over, as there are only 7-8 GB of logs in the bucket. It seems that once swap is full, the oldest logs are evicted from swap, and the HF finally processes them and sends them to the indexers.
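To illustrate my working theory — and this is a sketch of the suspected behaviour, not the TA's actual code — if each gzipped CloudTrail object is decompressed fully into memory before being handed to the pipeline, 7-8 GB of compressed logs can easily balloon to the numbers I'm seeing, whereas streaming each object line by line would keep the footprint bounded:

```python
import gzip
import io

def read_eagerly(raw: bytes) -> list:
    """Decompress the whole object into memory at once --
    memory use is proportional to the decompressed size."""
    return gzip.decompress(raw).splitlines()

def read_streaming(raw: bytes):
    """Stream the object line by line -- memory use stays
    roughly constant regardless of object size."""
    with gzip.open(io.BytesIO(raw), "rt") as fh:
        for line in fh:
            yield line.rstrip("\n")

# Stand-in for an S3 object body; real code would read from
# boto3's StreamingBody, which is omitted here.
body = gzip.compress(
    b"\n".join(b'{"eventName": "evt%d"}' % i for i in range(1000))
)

eager = [line.decode() for line in read_eagerly(body)]
streamed = list(read_streaming(body))
assert eager == streamed
```

Both readers yield the same events; the difference is only in peak memory, which is why I suspect the eager pattern is what's filling swap here.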
Side note – if I run the HF without swap, the process gets OOM-killed and restarts, and no crashlog is generated. Even so, physical memory usage never peaks above 3 GB.
Does anyone have any ideas what could be up?