All Topics

We have a disconnected network and have Splunk installed on a Red Hat Linux server. I can log in to the web interface with a local Splunk account just fine but cannot log in with a domain account. This machine has been configured with domain logins for quite a while and has worked, but only recently stopped working with a domain login. I recently needed to put in a temporary license until we get our re-purchase of a new license. Have not gotten far with troubleshooting yet. Where can I look to troubleshoot this issue? Thank you.
I have two different data sets within the Updates data model. I created a few panels within a dashboard that I use to collect the installed updates and update errors. I want to combine both of these searches into one by combining the datasets to correlate which machines are updating or encountering errors. Here are the two searches I have so far.

Installed Updates:

| datamodel Updates Updates search
| rename Updates.dvc as host
| rename Updates.status as "Update Status"
| rename Updates.vendor_product as Product
| rename Updates.signature as "Installed Update"
| eval isOutlier=if(lastTime <= relative_time(now(), "-60d@d"), 1, 0)
| `security_content_ctime(lastTime)`
| eval time = strftime(_time, "%m-%d-%y %H:%M:%S")
| search * host=$host$
| rename lastTime as "Last Update Time"
| table time host "Update Status" "Installed Update"
| `no_windows_updates_in_a_time_frame_filter`

Update Errors:

| datamodel Updates Update_Errors search
| eval time = strftime(_time, "%m-%d-%y %H:%M:%S")
| search * host=$host$
| table _time, host, _raw
I am playing around with the splunk-rolling-upgrade app in our DEV environment. We don't use a kvstore there, and we don't use a kvstore on our indexers in PROD either, which is where I would like to use this once I sort out the process. However, the automated upgrade process appears to be failing because it is looking for a healthy kvstore. Is there a flag or something I can put into the rolling_upgrade.conf file so that it ignores the kvstore, especially when it comes to our CM and indexers where we have the kvstore disabled?
Hello to everyone! My question looks very dumb, but I really can't understand how I can resolve it. So, here is what we have, step by step:

1. Some network device sends an event via UDP directly to an indexer
2. The indexer receives the message, according to a Wireshark capture
3. Then I try to find this event on a search head, and I see nothing
4. Somehow I generate another event on the network device
5. Then I expect to see two events during the search, but I see only the previous one

This behavior is a little bit random but easy to reproduce with network devices that send events infrequently. Additionally, I can easily detect the wrong behavior because of the significant difference between _time and _indextime of those events.

A couple of words about indexer settings; props.conf on the indexer looks like this, nothing special:

[cat_syslog]
DETERMINE_TIMESTAMP_DATE_WITH_SYSTEM_TIME = true
MAX_TIMESTAMP_LOOKAHEAD = 24
SHOULD_LINEMERGE = false
TIME_PREFIX = ^<\d{1,3}>\d+:\s+.*:\s+\d+:\s+

Overall, here is what I can assume:

1. According to my props.conf, the indexer expects to find the default ([\r\n]+) to apply the line-breaking rule and create a single event
2. But for some reason it fails to do so
3. From this moment, the indexer waits until the next event
4. And, I don't know why, but ([\r\n]+) appears in the next message

So, the question is, how do I NOT wait until the next event in this situation? I also understand that I can't change the line-breaking rule because of the very infrequent events. And also, there are no special characters at the end of events, because they look like this:

<172>153702: 172.21.0.13: 153696: Sep 13 16:30:50.797 RTZ: %RADIUS-4-RADIUS_ALIVE: RADIUS server 172.28.20.80:1812,1813 is being marked alive.
<174>153700: 172.21.0.13: 153694: Sep 13 16:30:30.714 RTZ: %RADIUS-6-SERVERALIVE: Group AAA_RADIUS: Radius server 172.21.20.80:1812,1813 is responding again (previously dead).
<173>153695: 172.21.0.13: 153689: Sep 13 16:25:05.626 RTZ: %LINEPROTO-5-UPDOWN: Line protocol on Interface FastEthernet0/9, changed state to up
Hi - I have a quick props question. I need to write a props for a particular sourcetype, and the messages always start with one of the following before the timestamp:

ukdc2-pc-sfn122.test.local -
OR
ukdc2-pc-sfn121.test.local -

When writing the TIME_PREFIX, can a regex be written to account for this? Is it just a basic one, and if so, can someone provide it? Thanks
Hi, I am trying to list the steps to interface Splunk with ServiceNow and to create an incident in ServiceNow from a Splunk alert. Is it mandatory to use the Splunk Add-on for ServiceNow | Splunkbase? And what are the steps after that? Thanks
Hi all, Is it possible to pass parameters to the action [[action|sendtophantom]] in the field "Next Steps"? For example, pass it the severity or the SOAR instance? Thanks
Hello, Could you please provide guidance on how to retrieve the daily quantity of logs per host? Specifically, I am looking for a method or query to get the amount of logs generated each day, broken down by host. Best regards,
Hi, I have instrumented a Node.js agent with auto-instrumentation in the Cluster Agent. My application is reporting, but no call graphs have been captured for BTs. I have checked the agent properties and discovered that by default this property is disabled. AppDynamics options: excludeAgentFromCallGraph,true. Can anyone suggest how I can enable this property for the auto-instrumentation method?
Hi All, I need to download and install the below app via the command line:

https://splunkbase.splunk.com/app/263

Please help me with the exact commands. I tried multiple commands; the login is successful and I get a token, but during the app download I get a 404 bad request error.
How can I use the top command after migrating to tstats? I need the same result, but it looks like it can only be done using top, so I need it:

index IN (add_on_builder_index, ba_test, cim_modactions, cisco_duo, cisco_etd, cisco_multicloud_defense, cisco_secure_fw, cisco_sfw_ftd_syslog, cisco_sma, cisco_sna, cisco_xdr, duo, encore, fw_syslog, history, ioc, main, mcd, mcd_syslog, notable, notable_summary, resource_usage_test_index, risk, secure_malware_analytics, sequenced_events, summary, threat_activity, ubaroute, ueba, whois) sourcetype="cisco:sma:submissions" status IN ("*")
| rename analysis.threat_score AS ats
| where isnum(ats)
| eval ats_num=tonumber(ats)
| eval selected_ranges="*"
| eval token_score="*"
| eval within_selected_range=0
| rex field=selected_ranges "(?<start>\d+)-(?<end>\d+)"
| eval start=tonumber(start), end=tonumber(end)
| eval within_selected_range=if( (ats_num >= start AND ats_num <= end) OR token_score="*", 1, within_selected_range )
| where within_selected_range=1
| rename "analysis.behaviors{}.title" as "Behavioral indicator"
| top limit=10 "Behavioral indicator"

I tried this, but it doesn't return the percent:

| tstats prestats=true count as Count from datamodel=Cisco_Security.Secure_Malware_Analytics_Dataset where index IN (add_on_builder_index, ba_test, cim_modactions, cisco_duo, cisco_etd, cisco_multicloud_defense, cisco_secure_fw, cisco_sfw_ftd_syslog, cisco_sma, cisco_sna, cisco_xdr, duo, encore, fw_syslog, history, ioc, main, mcd, mcd_syslog, notable, notable_summary, resource_usage_test_index, risk, secure_malware_analytics, sequenced_events, summary, threat_activity, ubaroute, ueba, whois) sourcetype="cisco:sma:submissions" Secure_Malware_Analytics_Dataset.status IN ("*") by Secure_Malware_Analytics_Dataset.analysis_behaviors_title
| chart count by Secure_Malware_Analytics_Dataset.analysis_behaviors_title
| sort - count
| head 20
Hi, I have a case where I want to update/append a macro with the results from a lookup. I don't want to do this manually each time, so is there any way I could use a scheduled search and update the macro if the lookup has any new values?
Hi, I need to do a heat map visualization. I have checked the Dashboard Examples add-on, and in this example a lookup is used: | inputlookup sample-data.csv. Is it possible to do the same thing without a lookup, please? I mean by using an index and an eval command, for example: if the field "Value" is < 50 the color is green, < 30 the color is orange, and < 10 the color is red in my heat map. Regards
Hi Team, I am sending JSON data to a Splunk server and I want to create a dashboard out of it. My data is in the below format and I need help in creating the dashboard out of it.

Example: {"value": ["new-repo-1: 2: yes: 17", "new-repo-2: 30:no:10", "new-one-3:15:yes:0", "old-repo: 10:yes:23", "my-repo: 10:no:15"]} and many more similar entries.

My dashboard should look like:

repos        count   active   count
new-repo     2       yes      17
new-repo-2   30      no       10
new-one-3    15      yes      0
old-repo     10      yes      23
my-repo      10      no       15

I am able to write the rex for a single field using extract pairdelim="\"{,}" kvdelim=":" but not able to do it for the complete dashboard. Can someone help?

Thanks, Veeresh Shenoy
I see this "extracted_eventtype" field in many saved searches and dashboard inline searches. However, I cannot find where it is generated. In the DUO events I do see "event_type" and "eventtype" fields. But not "extracted_eventtype". Dashboards with that field show "No results found." because that field is nowhere to be found in DUO events. Any thoughts / pointers would be very much appreciated!
How can I monitor whether a user is using the wireless network in the company? Thank you!
Is it possible to password protect emailed reports?
In my SPL JOIN query, I want to get the events for, let's say, the window between T1 and T2; however, the relevant events on the right side of the query happened between T1-60m and T2. I can't figure out how to do it in the dashboard or just a report. Using relative_time won't work for some reason. I appreciate any help.

index=myindex
| fields a, b, c
| join type=inner left=l right=r where l.keyid=r.keyid
    [search index=myindex ```<- how to change the earliest to earliest-60m?```
    | fields d, f ]
| table l.a, l.b, l.c, r.d, r.f
Hi, I've been struggling for some time with the way baselines seem to work - to the extent that I'm feeling like I can't trust them to be used to alert us to degraded performance in our systems. I thought I would describe the issue and get the thoughts of the community. I'm looking for some thoughts from folks who are happy with baselines and how they are mitigating the issue I'm experiencing, or some input confirming that my thinking on this is correct. I have proposed what I think could be a fix towards the end. Apologies if this ends up being a bit of a long read, but it feels to me like this is an important issue - baselines are fundamental to AppD alerting and currently I don't see how they can reliably be used.

To summarise the issue before I go into more detail: it looks to me like AppD baselines, and the moving average used for transaction thresholds, ingest bad data when there is performance degradation, which renders baselines unfit for their purpose of representing 'normal' performance. This obviously then impacts any health rules or alerting that make use of these baselines.

Let me provide an example which will hopefully make the issue clear. A short time ago we had a network outage which resulted in a Major Incident (MI) and significantly increased average response time (ART) for many of our BTs. Because the ART metric baseline uses these abnormal ART values to generate the ongoing baseline, the baseline itself rapidly increased. The outage should have significantly exceeded multiple SDs above the expected 'normal' baseline. But because the bad data from the outage increased the baseline, other than the very brief spike right at the start, the increase in ART barely reached 1 SD above baseline.

Furthermore, the nature of the Weekly Trend - Last 3 Months baseline means that this 'bad' baseline will propagate forward. Looking at the first screenshot above, we can clearly see that the baseline is expecting 'normal' ART to be significantly elevated every Tuesday morning now. Presumably this will continue until the original outage spike moves out of the baseline rolling window in 3 months. This is more clearly shown if we look more closely at the current week so that the chart re-scales without the original ART spike present. As far as the baseline is concerned, a large spike in ART every Tuesday morning is now normal.

This means that less extreme (but still valid) ART degradation will not trigger any health rules that use this baseline. In fact, this could also generate spurious alerts on healthy performance if we were using an alert based on < baseline SD, as the healthy ART now looks to be massively below the 'normal' baseline.

To my mind this simply can't be correct behaviour by the baseline. It clearly no longer represents normal performance, which by my understanding is the very purpose of the baselines. The same problem is demonstrated if we use other baselines, but I'll not include my findings here for the sake of this already long post not becoming a saga.

This issue of ingesting bad data also impacts the Slow/Very Slow/Stalled thresholds and the Transaction Score chart: as can be seen, we had a major network outage which caused an increase in ART for an extended period.
This increase was correctly reflected in the Transaction Score chart for a short period, but as the bad data was ingested and increased the value of the moving average used for thresholds, we can see that even though the outage continued and ART stayed at an abnormal level, the health of the transactions stopped being orange Very Slow and moved through yellow Slow back to green Normal. And yet the outage was ongoing, the Major Incident was ongoing, and the ART had not improved from its abnormally high, service-impacting value. These later transactions are most certainly not Normal by a very long way, and yet AppD believes them to be normal because the moving average has been polluted by ingesting the outage ART data. So after a short period of time, the moving average used to define a Slow/Very Slow transaction no longer represents normal ART but instead has decided that the elevated ART caused by the outage is the new normal. I'd like to think that I'm not the only one who thinks this is undesirable. Any alerting based on slow transaction metrics would stop alerting and would report normal performance even though the outage was ongoing with service still being impacted.

Now, it's not my way to raise a problem without at least trying to provide a potential solution, and in this case I have two initial thoughts:

1. AppD adds the ability to lock the baseline in much the same way as we lock BTs. So a BT is allowed to build up a baseline until it looks like it matches 'normal' behaviour as closely as we're likely to get. At this point the baseline is locked and no further data is added to it. If a service changes and we believe we have a new normal performance, then the baseline can be unlocked to ingest the new metrics and update the baseline to the new normal, at which point it can be locked again.

2. Instead of locking baselines, AppD could perhaps implement a system whereby bad data is not ingested into the baseline. Perhaps something like: any data point which comes in that triggers a health rule (or transaction threshold) is taken as evidence of abnormal performance and is not used to generate the baseline; maybe instead the last known non-triggering data point is used for the baseline. This would mean that the baseline probably would still increase during an outage (working on the assumption that a service degrades before failing, so the points immediately prior to the triggering of an alert might still be elevated above normal). But this should mean that the baseline change would not be as fast or as catastrophic as the current method of calculating the rolling baseline/moving average.

Well, that pretty much wraps it up, I think. If you've made it this far then thanks for your time, and I'd really appreciate knowing if other folks are having a similar issue with baselines or have found ways to work around it.
Behind every business-critical application, you'll find databases. These behind-the-scenes stores power everything from login and checkout to content lookups and "likes," so issues with slow queries, too many full table scans (or too few index scans), incorrectly configured indices, or resource exhaustion directly impact application reliability and user experience. Thankfully, we can capture key database metrics to expose such issues and ensure optimal performance, efficient troubleshooting, and the overall reliability of our applications.

In this post, we'll explore monitoring the open-source relational database PostgreSQL. Postgres is widely used in enterprise applications for its scalability, extensibility, and support. It also collects and reports a huge amount of information about internal server activity with its statistics collector. We'll harness these stats using the OpenTelemetry Collector and first focus on the database and infrastructure itself in Splunk Observability Cloud. Then we'll see how everything connects to our application performance data.

Which metrics matter and why

Monitoring database metrics is critical to proactively identifying issues, performance optimizations, and database reliability, but with so many stats coming from the statistics collector, it can be difficult to determine what to focus on. How do we isolate what's critical to monitor effectively? It can help to focus on operation-critical key metrics like those related to:

- Query performance (query throughput/latency, locks, query errors, index hit rate)
- Resource utilization (connections, CPU, memory, disk space, table/index size, disk I/O, cache hits)
- Database health (replication lag, deadlocks, rollbacks, autovacuum performance)

Query Performance

Slow, resource-intensive queries or queries with high throughput can decrease the response time of our applications and degrade user experience. To prevent things like slow page load time, we want to focus on metrics related to query time: total response time, index scans per second, and database latency. These metrics will indicate whether our database has the right indexes, is missing indexes, has fragmented tables, has too many locks, and so on.

Resource Utilization

Exceeding resource thresholds can halt application operations altogether. If total active connections are too high, resources might be exhausted, and users might not be able to interact with our application at all. Monitoring resource usage like CPU, memory, and table/index size can keep our databases up and running, while also allowing for accurate capacity planning and optimal user experience.

Database health

Things like a high rollback-to-commit rate can indicate user experience issues; for example, users might be unable to complete product checkout on an e-commerce site. An increase in the number of dead rows can lead to degraded query performance or resource exhaustion with similar effects. Proactively monitoring these metrics helps easily identify inefficiencies, eliminate bottlenecks, reduce database bloat, and ultimately improve user experience.

How to get the metrics

So how do we get these metrics from PostgreSQL to the OpenTelemetry Collector? The first step is installing the OpenTelemetry Collector. If you're working with the Splunk Distribution of the OpenTelemetry Collector, you can follow the guided install docs.
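To make the next steps concrete, here is a minimal sketch of running the Collector as a Compose service next to the application and Postgres. The image tag, file paths, service names, and environment variables are illustrative assumptions rather than values taken from the setup described below:

```yaml
# docker-compose.yaml (sketch) - the Splunk Distribution of the OpenTelemetry
# Collector running alongside the app and Postgres services.
services:
  postgres:
    image: postgres:16                                      # assumed Postgres image/version
    environment:
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}

  otel-collector:
    image: quay.io/signalfx/splunk-otel-collector:latest    # Splunk distribution image
    command: ["--config=/etc/otel/otel-collector-config.yaml"]
    volumes:
      # mount the manually created Collector config described below
      - ./otel-collector-config.yaml:/etc/otel/otel-collector-config.yaml
    environment:
      - SPLUNK_ACCESS_TOKEN=${SPLUNK_ACCESS_TOKEN}          # Splunk Observability Cloud ingest token
      - SPLUNK_REALM=${SPLUNK_REALM}                        # e.g. us0
    depends_on:
      - postgres
```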
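And here is a sketch of the matching Collector configuration: a postgresql receiver feeding a metrics pipeline that exports to Splunk Observability Cloud via the signalfx exporter. The endpoint, credentials, and database name are placeholders; treat this as an illustration of the shape of the file rather than a drop-in config:

```yaml
# otel-collector-config.yaml (sketch)
receivers:
  postgresql:
    endpoint: postgres:5432                  # host:port of the Postgres service
    username: otel_monitor                   # placeholder monitoring user
    password: ${env:POSTGRES_MONITOR_PASSWORD}
    databases:
      - exampledb                            # placeholder database name
    tls:
      insecure: true                         # unencrypted traffic inside the private network

processors:
  batch:

exporters:
  signalfx:
    access_token: ${env:SPLUNK_ACCESS_TOKEN}
    realm: ${env:SPLUNK_REALM}

service:
  pipelines:
    metrics:
      receivers: [postgresql]
      processors: [batch]
      exporters: [signalfx]
```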
I'm using Docker Compose to set up my application, Postgres service, and OpenTelemetry Collector, so I added the Splunk Distribution of the OTel Collector as its own service in my Compose file, along the lines of the first sketch above.

If you already have your OpenTelemetry Collector configuration file ready to edit, you can proceed to add a PostgreSQL receiver to the receivers block so you can start collecting telemetry data from Postgres. Because I set up the Collector with Docker Compose, I manually created my Collector configuration file (otel-collector-config.yaml) and added the PostgreSQL receiver there.

Note: generally, your database and microservices would be behind network and API security layers, so your databases and services would talk to each other unencrypted, which is why I have tls set to insecure: true. If your database requires an authenticated connection, you'll need to supply a certificate similar to what's shown in the documentation's sample configuration.

I'm also exporting data for my application to my Splunk Observability Cloud backend, so I've added an exporter for that and included both the new receiver and the new exporter in my metrics pipeline.

If you're not using the Splunk Distribution of the OpenTelemetry Collector or not exporting data to Splunk Observability Cloud, configuring the PostgreSQL receiver block will still follow the same pattern, but you'll need to configure a different exporter and add it to the metrics pipeline.

That's it! Now either build, start, or restart your service (I did a docker compose up --build) and watch your database metrics flow into your backend observability platform of choice.

Note: If you're working with a complex service architecture and the Splunk Distribution of the OpenTelemetry Collector, you might want to consider using automatic discovery. This allows the Collector to automatically detect and instrument services and their data sources. Depending on your environment and Collector installation method, you can follow the appropriate docs (Linux, Windows, Kubernetes) to deploy the Collector with automatic discovery.

How to see the data in Splunk Observability Cloud

Now that we're collecting Postgres data, let's jump over to Splunk Observability Cloud Infrastructure to visualize our telemetry data. We can select the Datastores section and open up either our PostgreSQL databases for database-level metrics or PostgreSQL hosts for metrics related to the infrastructure hosting your PostgreSQL database(s).

Going into the PostgreSQL databases navigator, we can see the metrics related to all of our databases. Here we see those key metrics that can hint at performance issues, like total operations, index scans per second, and rollbacks. If total operations are high, we'll know at a glance if our database resources can handle the current workload intensity. If our index scans per second drop, this can suggest we're not using indexes efficiently. Databases with a high number of rollbacks could be experiencing an increase in transaction failures or deadlocks. All of these things can lead to slow or unreliable performance for our users.

Clicking into our database, we see database-specific metrics. We can monitor index size for efficient resource optimization and right-sized indexes. Dead-row monitoring helps ensure efficient vacuuming to decrease table bloat and increase performance. It looks like we have 18 total operations per second but 0 index scans per second, which might mean we aren't indexing and could have some query performance inefficiencies.
Going into the PostgreSQL hosts navigator, we can view things like changes in operations per second, transactions, and disk usage to ensure our system can handle current workloads and maintain consistent performance. We can also click into a specific host to view individual host metrics, like how many transactions succeeded, failed, or were rolled back, and the cache hits versus disk hits, both of which impact overall performance.

Moving between Infrastructure, APM, and Log Observer

Our database monitoring journey will most likely start at the service level or with the applications they back, so let's dig into query performance and how to view its impacts on overall application performance.

From within our PostgreSQL host navigator, if we select a specific host, we can view logs or related content in APM to view services that have a dependency on the currently selected host. We can then jump to Database Query Performance to view and analyze query time, latency, and errors to see which specific areas are impacting response time and user experience, and where we might be able to optimize our query performance.

Closing this out, we can see our Service Map and where the current database sits so that we can investigate specific errors, traces, or related logs. We moved from Infrastructure to Application Performance Monitoring, but we could have just as easily started with our Service Map and begun troubleshooting database performance issues from there using Database Query Performance, database requests/errors, or traces.

Wrap Up

Monitoring key metrics from the databases that power our applications is critical to the performance and reliability that our users count on. Configuring the OpenTelemetry Collector to receive PostgreSQL telemetry data and export this data to a backend observability platform is an easy process that provides invaluable visibility into the databases that back our services. If you'd like to try exporting your Postgres data to Splunk Observability Cloud, try it free for 14 days!

Resources

- Automatic Discovery and Instrumentation of PostgreSQL with Splunk OpenTelemetry Collector
- Database Monitoring: Basics & Introduction
- OpenTelemetry Collector
- Configuring Receivers