About davidpaper

davidpaper · ‎07-21-2020

Hi. While it is true that there is no direct support for 2FA (Duo) in Splunk Cloud anymore (it was removed after 6.6, I believe), you can still achieve 2FA by doing it at the SAML/IDP layer. As long as your IDP supports 2FA and the 2FA bits happen outside of Splunk, go for it.

davidpaper · ‎05-27-2020

Hi, I'm David Paper. You might remember me from such posts as SmartStore Behaviors and How to read HEC introspection logs. I'm here today to talk to you about a scourge in Splunk, and that is having too many channels created in too little time on your indexers.

What is a channel? An ingestion channel for a FWD connection over the Splunk2Splunk protocol (generally on port 9997) to an indexer (or HWF) is an identifier of a particular tuple made up of host+sourcetype+source+tcp_connection. It is created when a FWD connects to an IDX and begins to send data in. Sometimes these tuples show up in a log when there are problems registering one, like:

04-01-2019 23:41:19.706 +0000 ERROR TcpInputProc - Encountered Streaming S2S error=Cannot register new_channel="source::[log location]|host::[hostname]|sourcetype::iis|263011": desired_field_count=19 conflicts with existing_field_count=0 for data received from src=[HF-IP]:51090

So, when you rolled out a bunch of new FWDs or updated existing FWDs with new apps or inputs that began ingesting new sources and/or sourcetypes, you created more potential tuples in your environment. All of those tuples ultimately have to be handled by the indexers. Every channel created to handle a new tuple has to be removed at some point. This constant creation and removal load becomes problematic when it gets above a certain rate. More on this below.

With this information, how do I know I have too many tuples for my current indexer configuration to handle? There are a number of signs that, taken together, can help you know where to look.

On the FWD end:
- Many new FWD endpoints
- Many new sources/sourcetypes
- Combo of both
- FWD queues (UF or HWF) are all backing up and don't seem to ever catch up, but some data is still being ingested
- A FWD restart seems to push a lot of data through for a few minutes, then regresses back to pre-restart throughput levels
- Adding additional ingestion pipelines to FWDs hasn't helped with the queue congestion
- netstat output on the FWD shows tcp data sitting in the Send-Q queue for indexer connections
- tcpdump on either the IDX or the FWD indicates the TCP window size gets set to 0 from the IDX side, indicating data needs to pause flowing into the IDX

On the IDX end:
- Ingestion throughput on indexers drops significantly
- Ingestion queues aren't blocked any more than before
- No disk I/O issues or CPU contention issues apparent
- Indexer bucket replication queue may be full
- Adding additional ingestion pipelines to indexers hasn't helped
- Events are getting delayed on ingestion, by minutes or hours
- netstat output on the IDX shows tcp data sitting in the Recv-Q queue waiting to be processed by the indexer
- tcpdump on either the IDX or the FWD indicates the TCP window size gets set to 0 from the IDX side, indicating data needs to pause flowing into the IDX (yes, it's here twice, it's important)

Metrics.log example with channel info:

metrics.log.4:03-15-2019 03:42:28.183 -0400 INFO Metrics - group=map, name=pipelineinputchannel, current_size=172, inactive_channels=91, new_channels=47202, removed_channels=54982, reclaimed_channels=516, timedout_channels=6, abandoned_channels=0

Search to sum up new_channel creations and removals by indexer:

index=_internal host= source=*metrics.log* new_channels | timechart span=1m avg(new_channels) avg(removed_channels) by host limit=0 useother=f

- Target: stay under an average of 5,000 new channels created on an indexer each minute (regardless of how many ingestion pipelines are on the indexer)
- An average of 5,000 to 10,000/min will start to show some FWD queuing, and may not recover
- An average of over 10,000/min will almost always result in problems

If the new_channels and removed_channels values move up and down in a similar pattern, and the new channel creation rate is consistently above 5,000/min, it is time to take action.

Now you know if you have a problem that needs addressing. There are two places it can be addressed: the FWD side and the IDX side.

On the FWD side, tell the FWDs to park themselves on a single IDX connection longer, so the tuples they use stay active longer.

For all FWDs, update outputs.conf:
- unset autoLBVolume (if set)
- autoLBFrequency=120 (up from default of 30)

If UF, update props.conf for all sourcetypes:
- EVENT_BREAKER_ENABLE=true
- EVENT_BREAKER=regex_expression

If HWF, update outputs.conf:
- forceTimebasedAutoLB=true

On the IDX side, tell the IDX to keep more inactive tuples around in memory. This isn't free, and will take ~100MB of memory for every 10k, but on an indexer with 10s of GB of RAM, this shouldn't be an issue.

limits.conf:
- [input_channels]
- max_inactive = 20000

The max_inactive value may need to be raised higher if the new_channels creation rate doesn't drop far enough to get below the 5,000/min threshold.
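As a rough illustration of the thresholds and the memory rule of thumb above, here is a minimal Python sketch. The function names and the "ok"/"warning"/"problem" labels are made up for this example; only the 5,000 and 10,000 per-minute figures and the ~100MB per 10k inactive channels estimate come from the post. It assumes you have already pulled the per-indexer avg(new_channels) value out of the timechart search.

```python
# Per-indexer, per-minute averages quoted in the post.
TARGET_MAX = 5_000      # stay under this
PROBLEM_MIN = 10_000    # above this, problems are almost certain
MB_PER_10K_INACTIVE = 100  # rough figure from the post (~100MB per 10k)


def classify_channel_rate(avg_new_channels_per_min: float) -> str:
    """Map an average new-channel creation rate to a rough severity label.

    The labels are hypothetical; the boundaries follow the post.
    """
    if avg_new_channels_per_min < TARGET_MAX:
        return "ok"
    if avg_new_channels_per_min <= PROBLEM_MIN:
        return "warning"  # some FWD queuing likely, may not recover
    return "problem"


def estimate_inactive_channel_memory_mb(max_inactive: int) -> float:
    """Estimate memory used by inactive channels for a given max_inactive."""
    return max_inactive / 10_000 * MB_PER_10K_INACTIVE


if __name__ == "__main__":
    # Arbitrary sample rates, not measurements.
    for rate in (3_200, 7_500, 12_000):
        print(rate, classify_channel_rate(rate))
    print(estimate_inactive_channel_memory_mb(20_000), "MB")
```

With max_inactive = 20000, the estimate works out to about 200MB, which is why the setting is cheap on an indexer with tens of GB of RAM.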

davidpaper · ‎05-27-2020

My Splunk environment was humming right along until I had a need to very quickly add several thousand new FWDs and a bunch of new apps on those endpoints collecting many new sources and sourcetypes. I happen to have an intermediate forwarding tier of HWFs, but I don't know if that makes any difference. After I added all these new FWDs, sources and sourcetypes, I noticed my rate of ingestion on all my IDXs dropped a lot. Restarting IDXs via rolling restart doesn't seem to make a difference. What's going on?

davidpaper · ‎05-18-2020

Good points @joshd . For #3, this can definitely be a problem, as averages hide the best and the worst of that value quite nicely. This is another good reason to avoid re-using the same HEC token for more than 1 data source or sourcetype. Ingestion metrics aren't the only thing that can cause problems when reusing HEC tokens. Error detection in the ingestion pipeline (think data quality view in MC) only gets as granular as the HEC token for some things, so reusing HEC tokens makes those errors more difficult to track down.

davidpaper · ‎05-14-2020

There are a few things to unpack here. HEC data is reported in introspection every 1 minute. All discussions below assume 1 min intervals unless otherwise stated.

There are two types of introspection logs for HEC: one type that summarizes all HEC activity on the host, and one type that provides a summary for each unique token received. What's the difference? An example of the first type is below. Note that it does not have a token_name: field in it.

{
  component: HttpEventCollector
  data: {
    format: json
    num_of_ack_requests: 0
    num_of_auth_failures: 0
    num_of_errors: 0
    num_of_events: 3
    num_of_parser_errors: 0
    num_of_requests: 1
    num_of_requests_acked: 0
    num_of_requests_in_mint_format: 0
    num_of_requests_to_disabled_token: 0
    num_of_requests_to_incorrect_url: 0
    num_of_requests_waiting_ack: 0
    series: http_event_collector
    total_bytes_indexed: 72
    total_bytes_received: 111
    transport: http
  }
  datetime: 05-11-2020 10:52:15.826 -0400
  log_level: INFO
}

The second type does have a token_name in it.

{
  component: HttpEventCollector
  data: {
    format: json
    num_of_errors: 0
    num_of_events: 12
    num_of_parser_errors: 0
    num_of_requests: 4
    num_of_requests_in_mint_format: 0
    num_of_requests_to_disabled_token: 0
    series: http_event_collector_token
    token_name: testing multiple events
    total_bytes_indexed: 288
    total_bytes_received: 444
    transport: http
  }
  datetime: 05-11-2020 11:02:15.775 -0400
  log_level: INFO
}

And my token name now shows up. Apart from token_name and the series value, the per-token entry carries a subset of the fields in the host summary. Each token seen by the indexer will generate its own unique introspection log entry. So, if there are 10 unique tokens sent to the indexer, expect to see 11 introspection events (1 for each token + 1 summary). If you are summing up HEC usage data, be careful not to count the same data more than once.

- format is always json.
- A HEC request may have one or more Splunk events in it. A multi-event request is called a batch.
- num_of_events is a sum of all Splunk events received by the indexer.
- num_of_requests is how many individual HEC requests the indexer received.
- total_bytes_indexed is how much data was ingested via HEC.
- total_bytes_received is how much data was received by HEC, including headers, other non-event data, and data with parsing errors in it. This number should always be larger than total_bytes_indexed and may be significantly larger if there are parsing errors in json data that would stop it from being indexed.

If you have multiple requests show up in this data, you should be aware of how large each request is on average. A large request with many events batched has to be processed by a single indexer, which isn't normally an issue. But if a request is 100MB or 500MB, that's a lot of data for one indexer to swallow in one shot, and it may cause indexing delays for other data sources trying to ingest to that indexer. The average size of a request = total_bytes_received / num_of_requests . In this example, that's 444/4=111 bytes per request. This can become important if data is being sent to a Splunk Cloud HEC listener, as the service details specify 1MB as a max size (https://docs.splunk.com/Documentation/SplunkCloud/8.0.2003/Service/SplunkCloudservice#Service_limits_and_constraints).

num_of_parser_errors indicates that malformed JSON data was encountered. There will be corresponding errors logged in _internal splunkd.log.
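The two calculations described above (average request size, and summing usage without double counting the host summary and the per-token entries) can be sketched in Python. This is only an illustration: it assumes each introspection event has already been parsed into a dict shaped like the examples in this post, and the function names are made up for the sketch.

```python
from typing import Iterable, Optional


def average_request_size(entry: dict) -> Optional[float]:
    """Return total_bytes_received / num_of_requests, or None if no requests."""
    data = entry["data"]
    requests = data["num_of_requests"]
    if requests == 0:
        return None
    return data["total_bytes_received"] / requests


def total_events(entries: Iterable[dict],
                 series: str = "http_event_collector_token") -> int:
    """Sum num_of_events for one series only.

    Picking a single series avoids counting the same events twice, once in
    the host-wide summary and once in the per-token entries.
    """
    return sum(
        e["data"]["num_of_events"]
        for e in entries
        if e["data"].get("series") == series
    )


if __name__ == "__main__":
    # Values copied from the two example events in this post.
    summary = {"data": {"series": "http_event_collector", "num_of_events": 3,
                        "num_of_requests": 1, "total_bytes_received": 111}}
    per_token = {"data": {"series": "http_event_collector_token",
                          "token_name": "testing multiple events",
                          "num_of_events": 12, "num_of_requests": 4,
                          "total_bytes_received": 444}}
    print(average_request_size(per_token))     # 444 / 4
    print(total_events([summary, per_token]))  # per-token entries only
```

Filtering on series is one way to avoid the double count; filtering on the presence of token_name would work equally well for the events shown here.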

davidpaper · ‎05-14-2020

I'm ingesting data via HEC and I know there is data about it in _introspection, but I don't know what I'm looking at when I search for it. Here is what I know so far. I have a HEC token named testing multiple events . The token itself looks like f2584364-976f-4a68-ac3b-4a4d481ec8cd . I'm searching for introspection data about HEC via index=_introspection sourcetype="http_event_collector_metrics" . Some data has been sent to it for testing. Can someone explain what I'm seeing when looking at the entry below?

{
  component: HttpEventCollector
  data: {
    format: json
    num_of_errors: 0
    num_of_events: 3
    num_of_parser_errors: 0
    num_of_requests: 1
    num_of_requests_in_mint_format: 0
    num_of_requests_to_disabled_token: 0
    series: http_event_collector_token
    token_name: testing multiple events
    total_bytes_indexed: 72
    total_bytes_received: 111
    transport: http
  }
  datetime: 05-11-2020 10:52:15.827 -0400
  log_level: INFO
}

davidpaper · ‎12-08-2019

yeah, you got it. #4, straight to the directory you pointed coldPath to. In the example above, homePath and coldPath use the same volume, but different directories on the same filesystem.

davidpaper · ‎10-21-2019

That file is where the info is stored to block events from showing up in search that have had "|delete" run against them in the past.

davidpaper · ‎10-16-2019

This is spot on, and a behavior I hadn't understood until very recently. Reassigning coldPath to homePath is an excellent idea.

davidpaper · ‎09-10-2019

This response is provided in conjunction with the Splunk Product Best Practices team. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices. The post What's the order of operations for upgrading Splunk Enterprise? outlines the high-level process for upgrading a Splunk Enterprise deployment. This post focuses on the verification phase and what to check after the upgrade to make sure all the components have upgraded successfully. You can do these checks for each component as you upgrade them, or all at once after the upgrade. During upgrade, make sure you allow adequate time for one component to stabilize before upgrading the next component. The post-upgrade checks have two main goals: Make sure the system is healthy after upgrade Validate that system performance is on par with or better than it was before the upgrade Here's a high-level snapshot for what to check after upgrading. These checks go in the order of upgrade. We dive into details below. Check upgrade success on the monitoring console Check upgrade success on the license master Check upgrade success on the indexer cluster master Check upgrade success on the search tier (stand-alone) Check upgrade success on the search tier (clustered) Check upgrade success on the deployer Check upgrade success on the deployment server Check upgrade success on the indexers 1 Check upgrade success on the monitoring console The first step to confirm that the monitoring console has upgraded successfully is to log into the monitoring console UI. Once logged in, check the following: Note: The exact location of items in the monitoring console may vary depending on which version of Splunk Enterprise you're running. Verify that all search heads, indexers, deployment server(s), license server, cluster master, the deployer, and forwarders (heavy and regular) are reporting a healthy status. (Monitoring Console > Overview). 
Verify that components have correct roles associated with them (Monitoring Console > Instances) Review resource utilization (CPU, RAM, disk) for search head and indexer tier, and compare to screenshots taken before the upgrade to verify that performance levels are comparable (Monitoring Console > Resource Usage > Instance > role: search head and role: indexer). Review search scheduling and performance. Investigate and correct any skipped and deferred searches. (Monitoring Console > Search > Scheduler Activity > Deployment). For more about the scheduler activity dashboards, see Search: Scheduler activity in the Monitoring Splunk Enterprise Manual. For tips about discovering skipped searches, see the Answers post Skipped Searches on SHC. For background about the search scheduler and how it prioritizes (and possibly skips) searches, see Configure the priority of scheduled reports in the Reporting Manual. Review ingestion queues on the indexers. Ensure they are not filling and failing to recover. (Monitoring Console > Instances > Indexer > Views > Indexing Performance). For insight about how to troubleshoot ingestion issues, see Identify and triage indexing performance problems in the Troubleshooting Manual. For guidance about monitoring indexer performance, see Use the monitoring console to view indexing performance in the Managing Indexers and Clusters of Indexers Manual. 2 Check upgrade success on the license master Verify that all indexers are checking into the license master. (Monitoring Console > Instances > Group=License Master) Verify that _* indexes are successfully forwarding data to the indexing tier (if configured to do so). 
Run the following search and validate that the license master host is present in the list (you can also check for the cluster master host, the deployment server host, and the deployer host): index=_internal earliest=-5min | stats count by host For guidance about how to set up data forwarding, see Best practice: Forward search head data to the indexer layer in the Distributed Search Manual. For tips about how to set this up for forwarders and license master, see Best practice: Forward master node data to the indexer layer in the Distributed Search Manual. 3 Check upgrade success on the indexer cluster master On the cluster master host, check the load average and IOPS to determine that the cluster master has finished processing all upgrade-related activity (at the 'nix command line: iostat -zx 1 or sar -d ). Cluster masters are generally not IO intensive, but IO jumps up considerably when indexer rolling restarts occur. Use your favorite method to monitor the cluster master's swap space, for example (on 'nix): vmstat 1 to show pages swapping in and out (the si and so columns), or iostat -zx 1 to watch the swap device for activity (you can get the device name from /etc/fstab ). After upgrade activity is finished and the system has returned to a steady state, review the clustering dashboard to ensure the cluster is searchable. If RF/SF fixup tasks are queued, verify that the fixups are in progress. For guidance about the cluster recovery process, see the presentation Indexer Clustering Fixups from .conf2017. For more about bucket fixing and a link to bucket fixing resources in the Splunk Enterprise documentation, see bucket fixing in the Splexicon. Look for search peers that are alternating ("flapping") between up and pending states, or restarting outside of a rolling restart, by running the following search: index=_internal source=*splunkd.log sourcetype=splunkd host=cluster_master component=CMPeer peer transitioning NOT bid | eval transition = from." -> ".to | timechart count by transition 
If the search results alternate between "Pending → Up" and "Up → Pending", the indexers may need more time to finish all upgrade-related activity and check in properly. If the situation persists, you may need to adjust timeouts in the cluster master configuration to give the components more time to come back online. Verify that forwarders are communicating with the cluster master using the following search: index=_internal sourcetype=splunkd component=CMIndexerDiscovery Verify that the monitoring console can still see the cluster master as a search peer. (Monitoring Console > Overview). Verify that you can successfully complete a bundle push to indexers. For guidance, see Distribute the configuration bundle in the Managing Indexers and Clusters of Indexers Manual. For troubleshooting tips, see Configuration bundle issues in that same manual. 4 Check upgrade success on the search tier (stand-alone) Verify that external auth is working (if configured), including certificates if you're using SAML or another SSO outside of Active Directory. Verify that the new Splunk Enterprise version works with all apps (searches, dashboards, add-ons, external inputs). Verify that basic searches work from each standalone search head, and that all the indexers reply by running the following search: | tstats count where earliest=-5m by splunk_server Look for skipped or deferred searches that were not skipped or deferred before the upgrade. Run the following search to evaluate the size of the search bundle being pushed to indexers to determine if it is close to the maximum setting. If you have a search head cluster, run the search once on any search head member. If you don't have a search head cluster, run this search on each search head in your environment. 
index=_internal sourcetype=splunkd group=bundles_uploads search_group=dmc_group_search_head | eval baseline_bundle_size_mb=round((average_baseline_bundle_bytes/1024)/1024,1) | chart max(baseline_bundle_size_mb) AS Max_bundle_size by host | eval Max_bundle_size=Max_bundle_size . "M" For guidance about maximum bundle settings, see the topic Modify the Knowledge Bundle in the Distributed Search Manual. Verify that users can log in utilizing remote auth (if configured) on each search head node. Review the Report Acceleration summary status. Look for 100% once the summaries have caught up. (Settings > Report Acceleration Summaries). 5 Check upgrade success on the search tier (clustered) Check all upgrade success indicators in the search tier (stand-alone) list in the previous section. Verify that all search head cluster members are visible in the monitoring console. (Monitoring console > Indexing > Indexer Clustering: Status). Verify the search head cluster captain and member details in the monitoring console. (Monitoring console > Search > Search head clustering: Status and Configuration). Verify that search traffic is distributed evenly in the search head cluster Scheduler Delegation dashboard in the monitoring console by sorting the first panel by instance. (Monitoring Console > Search > Search Head Clustering: Scheduler Delegation). Measure the time a search head cluster member is taking to spin up by running the following search with a time range before and after the upgrade. Major swings could indicate a problem on the members. index=_internal uri=*delegatejob* | timechart median(spent) as median_spent max(spent) as max_spent Look for any errors and warnings in the logs using this search: index=_internal sourcetype=mongod earliest=-15m Verify that the search head cluster can push a bundle successfully to all indexers, especially if a search head cluster connects to multiple indexing clusters. 
See the topic Update search head cluster members in the Distributed Search Manual for instructions. Verify that the KVstore comes online on each node and replicates correctly across nodes using this view: (Monitoring Console > Search > KVStore > KVStore: Deployment) 6 Check upgrade success on the deployer Verify that a bundle can be pushed from the deployer to all search head nodes. 7 Check upgrade success on the deployment server Verify that config reload is successful. You can push a config from the forwarder management UI or the command line ( splunk reload deploy-server ). If there are issues with individual lines in serverclass.conf , they will appear in splunkd.log as ERROR and will be skipped, and Splunk will continue loading the rest of the file. Verify that all forwarders that should be phoning home are doing so successfully (Monitoring Console > Forwarders > Forwarders: Deployment). 8 Check upgrade success on the indexers After the upgrade and restart, allow at least 15 minutes for the cluster to finish processing all upgrade-related activity. Check the following indicators to verify that the upgrade is complete and successful. Verify that all the nodes are present in the UI, either in the cluster master UI, or in the monitoring console (Monitoring Console > Indexing > Indexer clustering: Status) Verify that all data is searchable, and that replication factor and search factor are fully met (Monitoring Console > Indexing > Indexer Clustering: Status). Verify that cleanup/fixup tasks are moving forward while continuing to watch load and IO on the cluster master (at the 'nix command line: iostat -zx 1 or sar -d ). Verify that basic searches work and all the indexers are replying by running this search: | tstats count where earliest=-5m by splunk_server Verify that all indexers are ingesting data. Check that ingestion rates are continuous and, if they dropped or spiked, whether they returned to the mean: Check HEC port (if configured). 
(Monitoring Console > Indexing > Inputs > HTTP Event Collector: Deployment). Check S2S port(s). (Monitoring Console > Indexing > Inputs > Splunk TCP Input Performance: Deployment). Review ingestion queues on the indexers. Ensure they are not filling and failing to recover. (Monitoring Console > Instances > Indexer > Views > Indexing Performance). For insight about how to troubleshoot ingestion issues, see Identify and triage indexing performance problems in the Troubleshooting Manual. For guidance about monitoring indexer performance, see Use the monitoring console to view indexing performance in the Managing Indexers and Clusters of Indexers Manual. Review load average and IOPS to determine that the cluster master has finished processing all upgrade-related activity (at the 'nix command line: iostat -zx 1 or sar -d ). Refer to the existing resource utilization metrics collected in the steps outlined in the Answers post How do I benchmark system health before a Splunk Enterprise upgrade? to determine when the cluster master has returned to its normal state of operations. (Monitoring Console > Resource Usage > Deployment). Scan the internal logs on the cluster master for warnings and errors. You can also check the internal logs of the indexers for warnings and errors, although these logs can contain many entries for unrelated conditions, such as parsing errors, and so on. index=_internal sourcetype=splunkd source=*splunkd.log log_level!=info Repeat the checks outlined in the Search tier (clustered) section above to ensure that searches complete in a timely way. If all else fails... Here are some resources if you run into any upgrade-related snags. For specific issues, refer to the Splunk Enterprise Troubleshooting Manual. If all else fails, contact your Splunk account rep or Splunk Support and Services. Related upgrade resources For tips about what to monitor before an upgrade, see the Answers post How do I benchmark system health before a Splunk Enterprise upgrade? 
to make sure your deployment is ready for upgrade, and that you have taken a benchmark of performance ranges on your Splunk Enterprise components that you can compare with post-upgrade performance. For tips about what to monitor and check during an upgrade, see the Answers post How do I monitor system health during a Splunk Enterprise upgrade? to make sure the upgrade goes smoothly for all components. For high-level post-upgrade guidance, review the post-upgrade guidelines in Phase 3: Verify everything works after the upgrade in the Splunk Enterprise Installation Manual.
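
The flapping check in section 3 of this post (peers alternating between "Pending → Up" and "Up → Pending") can be illustrated with a small Python sketch. It assumes the CMPeer transition events have already been exported as dicts with from and to keys, mirroring the from and to fields the search uses. The function names and the min_cycles threshold are hypothetical choices for this sketch, not Splunk settings.

```python
from collections import Counter
from typing import Iterable


def count_transitions(events: Iterable[dict]) -> Counter:
    """Count "from -> to" pairs, like the eval + timechart in the search."""
    return Counter(f"{e['from']} -> {e['to']}" for e in events)


def looks_like_flapping(events: Iterable[dict], min_cycles: int = 2) -> bool:
    """Return True if both directions of the Up/Pending transition repeat.

    min_cycles is an arbitrary illustrative threshold: a single
    "Pending -> Up" after a restart is expected, repeated round trips are not.
    """
    counts = count_transitions(events)
    return min(counts["Pending -> Up"], counts["Up -> Pending"]) >= min_cycles


if __name__ == "__main__":
    healthy = [{"from": "Pending", "to": "Up"}]
    flapping = [
        {"from": "Pending", "to": "Up"},
        {"from": "Up", "to": "Pending"},
        {"from": "Pending", "to": "Up"},
        {"from": "Up", "to": "Pending"},
    ]
    print(looks_like_flapping(healthy))
    print(looks_like_flapping(flapping))
```

In practice the timechart in the search gives the same picture visually; the sketch only makes the "both directions keep recurring" condition explicit.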

davidpaper · ‎09-10-2019

I need details about what to validate after the upgrade so I know it was successful. How can I tell that everything got upgraded correctly, and that the system is healthy and ready to go?

davidpaper · ‎09-10-2019

This response is provided in conjunction with the Splunk Product Best Practices team. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices. The Answers post What's the order of operations for upgrading Splunk Enterprise? outlines the high-level process for upgrading a Splunk Enterprise deployment. This post focuses on what to monitor during the upgrade phase to make sure the upgrade goes smoothly for all components. Keep two things in mind during upgrade: Upgrade components in the right order. This post gives a suggestion about upgrade order, but the order you should follow depends on your topology. For a high-level overview, see the Answers post What's the order of operations for upgrading Splunk Enterprise? For specific guidance, see the topic How to upgrade Splunk in the Splunk Enterprise Installation Manual. Ensure that each component has stabilized before proceeding with the next upgrade step. This post provides some guidance on how to make sure you're ready to move to the next step. Before you start upgrading, thoroughly read the topic About upgrading Splunk Enterprise: READ THIS FIRST in the Splunk Enterprise Installation Manual. The guidelines in this post supplement the detailed instructions in the upgrade documentation. Also review the Answers post How do I benchmark system health before a Splunk Enterprise upgrade to make sure your deployment is ready for upgrade. That post walks you through how to take a benchmark of performance ranges on your Splunk Enterprise components that you can compare with post-upgrade performance. We strongly recommend you put together a detailed upgrade plan that matches your topology before you upgrade. Here's a high-level snapshot for what to check during an upgrade. We dive into details below. 
Upgrade and check progress of single-step components Upgrade and check progress of the indexer cluster master Check forwarder function during upgrade Upgrade and check progress of indexers (distributed) Upgrade and check progress of indexers (clustered) 1 Upgrade and check progress of single-step components Several components are single-step upgrades (no bundle pushes, waiting for fix-ups, or sync waits for other components). These components are present in both distributed and clustered deployments. If all these components are on separate machines, you can upgrade them in the following order. If they are collocated on a single machine, you only need to run the upgrade once. License master Deployment server Search head cluster deployer Monitoring console After you upgrade the code for each component, verify that you can log in successfully to each component using the UI. From the Monitoring Console > Overview, verify that the license master, deployment server, and deployer components are visible and running in expected ranges compared to the benchmarks you took before upgrade. Note: The exact location of items in the monitoring console may vary depending on which version of Splunk Enterprise you're running. 2 Upgrade and check progress of the indexer cluster master If you have an indexer cluster, there are several indicators you can check to ensure that the cluster master has upgraded fully and is ready for the next step: Exit maintenance mode at key points to let the cluster master fully recover. Maintenance mode halts most bucket fixup activity and prevents frequent rolling of hot buckets. To help facilitate cluster recovery after running the upgrade, take the cluster master out of maintenance mode so it can process fixups and manage buckets. This is a best practice after any cluster master restart, and between upgrades of each site in a multi-site indexing cluster. Check progress indicators at each step. 
The act of monitoring the cluster master can affect performance because the cluster UI makes REST calls that can compete for resources as the cluster stitches itself together. There are several indicators you can view at the OS layer without adding load to the cluster master. The cluster master is generally ready for the next step when the following indicators are present: Load average has dropped. On the operating system, run w , uptime , or top to see the system load average. Disk IO has returned to pre-upgrade levels. At the 'nix command line, run iostat -xz 1 or sar -d . Threads are no longer pegging a single CPU at 99%+. At the 'nix command line, run top -H , or press H to turn on the thread view once top has started normally. The log splunkd.log returns to normal. Run tail -f $SPLUNK_HOME/var/log/splunk/splunkd.log . The rate that data gets written to this log slows significantly when the cluster master has caught up. The type of messages written to this log also changes to info-only. Compare current resource usage with pre-upgrade levels. When the upgraded cluster master and cluster come up, verify that the resource usage after upgrade compares with screenshots taken before the upgrade in the Monitoring Console > Resource Usage > Machine. 3 Check forwarder function during upgrade This check is to ensure that forwarders are still checking in during upgrade (for example, with the deployment server), and forwarding data (for example, to the indexer). Using the monitoring console, ensure that data ingestion continues to flow at the expected rate for the time of day and/or day of the week (monitoring console > Forwarders: Deployment). 4 Upgrade and check progress of indexers (stand-alone) As indexers are upgraded and brought back online, ensure they are ingesting and participating in search. 
Run the following searches. To check that each indexer is ingesting: index=_internal component=Metrics per_index_thruput earliest=-30m | eval mb=(kb/1024) | timechart span=5m sum(mb) by host To check that each indexer is responding to search: | tstats count where earliest=-5m by splunk_server 5 Upgrade and check progress of indexers (clustered) Verify that indexers rejoin the cluster as they come back online and are marked Status=up and Fully Searchable=yes in Monitoring Console > Indexing > Indexer Clustering > Indexer Clustering: Status. What's next? To address the question about what to monitor to verify a successful Splunk Enterprise upgrade, see the Answers post What do I validate after I upgrade Splunk Enterprise to confirm the upgrade was successful? What's your experience? We'd like to hear from you. We'll be updating this topic as we gather more input.

davidpaper · ‎09-10-2019

I need details about what to monitor during my upgrade so I know it is proceeding as expected. What should I monitor during an upgrade?

davidpaper · ‎09-10-2019

This response is provided in conjunction with the Splunk Product Best Practices team. Read more about How Crowdsourcing is Shaping the Future of Splunk Best Practices. The post What's the order of operations for upgrading Splunk Enterprise? outlines the high-level process for upgrading Splunk Enterprise. One of the steps is to benchmark system health before the upgrade. Benchmarking system health in prep for upgrade has two main goals: Make sure the system is healthy enough to go forward without breaking or bogging down mid-upgrade Establish a baseline of performance before the upgrade so you can tell if your system is performing within expected ranges after the upgrade Before you start benchmarking, though, make sure you are familiar with your Splunk environment. To get a complete layout of your deployment architecture, Use the monitoring console to determine your topology, as described in the Inherit a Splunk Enterprise Deployment manual. Review all the topics in that manual if you are new to the Splunk environment you are about to upgrade. Here's a high-level snapshot of what to check before upgrading. We dive into details below. Benchmark and check system health with the monitoring console Benchmark and check forwarder system health Benchmark and check indexer system health Benchmark and check search tier system health 1 Benchmark and check system health with the monitoring console First, check basic system health indicators in the monitoring console with the Health Check. If you haven't already, Download health check updates and then Use the health check as described in the Monitoring Splunk Enterprise Manual. Then check these basic indicators in the monitoring console at a minimum: Note: The exact location of items in the monitoring console may vary depending on which version of Splunk Enterprise you're running. 
Verify that the monitoring console is configured correctly, and all Splunk Enterprise components are listed and have the correct roles associated with them. (Monitoring Console > Settings > General Setup). For guidance, see Configure the Monitoring Console in distributed mode in the Monitoring Splunk Enterprise Manual. Verify that all Splunk Enterprise components are connected and reporting back data. Check search heads, indexers, deployment server, license master, cluster master (if in use), deployer (if in use), and heavy forwarder (if in use). (Monitoring Console > Settings > General Setup: "monitoring" and "state" columns). Review existing resource utilization (CPU, RAM, disk) for search head and indexer tier. Take screenshots for comparison after upgrade. (Monitoring Console > Resource Usage > Deployment). Review search scheduling and performance. Correct any skipped and deferred searches before upgrade. (Monitoring Console > Search > Scheduler Activity > Deployment). For more about the scheduler activity dashboards, see Search: Scheduler activity in the Monitoring Splunk Enterprise Manual. For tips about discovering skipped searches, see the Answers post Skipped Searches on SHC. For background about the search scheduler and how it prioritizes (and possibly skips) searches, see Configure the priority of scheduled reports in the Reporting Manual. Review ingestion queues on the indexers. Ensure they are not filling and failing to recover. (Monitoring Console > Instances > Indexer > Views > Indexing Performance). For insight about how to troubleshoot ingestion issues, see Identify and triage indexing performance problems in the Troubleshooting Manual. For guidance about monitoring indexer performance, see Use the monitoring console to view indexing performance in the Managing Indexers and Clusters of Indexers Manual. 
Check the search head cluster If you're using a search head cluster, run the following checks on the monitoring console: Review replication latency for errors (top of view) and consistency for time taken (bottom of view). Ideally, you would investigate and correct errors before the upgrade, but if you don't, your error rate before and after the upgrade should be consistent. For the time taken, determine whether replication times vary significantly (spikes) or if they have a natural oscillation based on time of day, day of week, or any other pattern. Whatever that pattern is before the upgrade, it should continue after the upgrade. (Monitoring Console > Search > Search Head Clustering > SHC Configuration Replication). Check each search head cluster member and ensure that the KVStore role is applied to them. (Monitoring Console > Search > Search Head Clustering > Search Head Clustering: Status and Configuration). Edit the assignments as needed. For guidance, see Configure the Monitoring Console in distributed mode in the Monitoring Splunk Enterprise Manual. Review the KVstore oplog, specifically “Operations Log Window of KV Store Captain.” Look for a value of at least one hour. Three to four hours is ideal for a busy SHC, the higher the better. Values below 15 minutes are problematic and you should investigate and fix them before upgrade. (Monitoring Console > Search > KVStore > KVStore: Deployment). In the KVStore: Deployment view, ensure the following settings: KVstore in the search head cluster has a captain and one or more secondaries total queued=0 for all nodes Instances by Average Replication Latency is in the range of 0-10, except for search head clusters running ITSI, which can have a latency range in the 30s or higher Check the search tier Run this check on all search tier servers: If you have implemented report acceleration, review the Summary Status column in the Report Accelerations Summaries for completeness. Review the Access Count column for usage. 
(Settings > Report Acceleration Summaries). Consider disabling any report accelerations that have never been accessed. If report accelerations aren’t at 100%, the reason is likely related to skipped searches. Correct before upgrading. For guidance about accelerating reports, see the topic Accelerate Reports in the Reporting Manual. Check the deployer/search head cluster Run these checks on the monitoring console for the deployer/search head cluster: Verify that the status of the cluster is fully healthy. (Monitoring Console > Search > Search Head Clustering: Status and Configuration). Verify that you can complete a bundle push to all search head cluster nodes successfully. For instructions, see the topic Update search head cluster members in the Distributed Search Manual. If you are using a static captain, know which search head cluster node is set to captain. (Monitoring Console > Search > Search Head Clustering: Status and Configuration). Validate that KV store(s) replicate without issue. (Monitoring Console > Search > KV Store > KV Store Deployment, bottom of view). For guidance about how to resync the KV store, see Resync the KV store in the Admin Manual. Check the indexer cluster master If you are using an indexer cluster master, run the following checks on the monitoring console: Verify that all data is searchable, and that replication factor and search factor are fully met. (Monitoring Console > Indexing > Indexer Clustering: Status). Verify that you can successfully complete a bundle push to indexers. For guidance, see Distribute the configuration bundle in the Managing Indexers and Clusters of Indexers Manual. For troubleshooting tips, see Configuration bundle issues in that same manual. Benchmark disk IOPS and load average so you can compare them after the upgrade to verify healthy function (at the *nix command line: iostat -xz 1 or sar -d, or in Monitoring Console > Resource Usage > Deployment). Verify that unique bucket counts are within reasonable ranges. 
Although there are no set limits, a good benchmark is 5 million or less for Splunk Enterprise versions 6.6, 7.0, and 7.1, or 9 million or less for Splunk Enterprise version 7.2. If unique bucket counts get much higher than these ranges, you could start experiencing performance degradation. (Monitoring Console > Indexing > Indexes and Volumes: Deployment). Run the following search on the cluster master: | rest splunk_server=local /services/cluster/master/peers | stats sum(bucket_count) AS bucket_count_all | eval bucket_count = round(bucket_count_all / 1000 / 1000,2)."M" | eval replication_factor = [ | rest splunk_server=local /services/cluster/config | return $replication_factor ] | eval unique = round(bucket_count_all / replication_factor / 1000 / 1000,2)."M" | fields bucket_count unique | rename bucket_count AS "Total Buckets", unique AS "Unique Buckets" If the unique bucket counts are significantly higher than 5 or 9 million, investigate the reasons and fix them. Consider setting high bucket count configurations on the CM and IDX servers before upgrading. For guidance, see slide 21 of the presentation from Splunk .conf2017, Indexer clustering internals, scaling, and performance testing. Identify the pass4SymmKey in plain text in case it needs to be re-keyed into any configurations after upgrade. This password is managed outside of Splunk. Check the license master Run these checks on the monitoring console for the license master: Verify that all indexers are checking into the license master. (Monitoring Console > Instances > Group=License Master). Verify that _* indexes are successfully forwarding data to the indexing tier (if configured to do so). 
Run the following search and validate that the license master host is present in the list (you can also check for the cluster master host, the deployment server host, and the deployer host): index=_internal earliest=-5min | stats count by host For guidance about how to set up data forwarding, see Best practice: Forward search head data to the indexer layer in the Distributed Search Manual. For tips about how to set this up for forwarders and license master, see Best practice: Forward master node data to the indexer layer in the Distributed Search Manual. Archive copies of license(s) off host, or verify that they are included in backups. Make copies of the .lic files in $SPLUNK_HOME/etc/licenses/enterprise/* . Check the deployment server Run these checks on the monitoring console for the deployment server: Validate that config reload is successful. You can push a config from the forwarder management UI or the command line ( splunk reload deploy-server ). If there are issues with individual lines in serverclass.conf , they will appear in splunkd.log as ERROR and will be skipped, and Splunk will continue loading the rest of the file. Validate that all forwarders that should be phoning home are doing so successfully. (Monitoring Console > Forwarders > Forwarders: Deployment). 2 Benchmark and check forwarder system health Verify the following on your forwarders before upgrading your Splunk Enterprise version. Verify that your current forwarders will work with the new version of the indexers, that is, that the version combinations are supported. To check forwarder compatibility between versions, see Compatibility between forwarders and indexers in the Splunk Products Version Compatibility Manual. Verify that the SSL and cipher suite configurations are compatible. For details, see Configure secure communications between Splunk instances with updated cipher suite and message authentication code in the Securing Splunk Enterprise Manual. 
If you are using an app that requires a heavy forwarder or makes external queries, such as DBX or JMX, validate that it works with the new Splunk Enterprise version. Ensure that any forwarder code management tools you have set up (such as Puppet, Chef, Ansible, or SCCM) can reach all forwarders to be upgraded. 3 Benchmark and check indexer system health Run these checks on your indexers: Ensure there is sufficient disk space to take local backups before the upgrade and to deploy the new code during upgrade. For guidance about managing disk space, see the topic Estimate your storage requirements and related topics in the Capacity Planning Manual. For items that may affect disk space during upgrade, see the topic About upgrading READ THIS FIRST in the Splunk Enterprise Installation Manual. Run this search to verify that indexers aren’t running scheduled searches: index=_internal source="*/scheduler.log" search_group=dmc_group_indexer sourcetype=scheduler | dedup host savedsearch_name | stats count(savedsearch_name) by savedsearch_name Verify that basic searches work and that all the indexers are replying by running this search: | tstats count where earliest=-5m by splunk_server 4 Benchmark and check search tier system health Run these checks on your search tier components: Validate that the upgrade target version works with all apps (searches, dashboards, add-ons, external inputs). Check version compatibility via Splunkbase for premium and non-premium apps. For guidance, see Splunk Products Version Compatibility and applications on Splunkbase. Also verify the end-of-support status of Splunkbase apps. For details, see End of Availability: Splunk-Built Apps and Add-ons on Splunk Blogs. Test homegrown apps. For guidance, see the topic Test your apps before upgrade in the Splunk Enterprise Installation Manual. Have copies of SSL keys, SAML configs, and external auth credentials such as passwords available in plaintext. 
Look for searches that fail because of missing users in external auth and correct the issues prior to upgrade. Run the following search to evaluate the size of the search bundle being pushed to indexers to determine if it is close to the maximum setting. If you have a search head cluster, run the search once on any search head member. If you don't have a search head cluster, run this search on each search head in your environment. index=_internal sourcetype=splunkd group=bundles_uploads search_group=dmc_group_search_head | eval baseline_bundle_size_mb=round((average_baseline_bundle_bytes/1024)/1024,1) | chart max(baseline_bundle_size_mb) AS Max_bundle_size by host | eval Max_bundle_size=Max_bundle_size . "M" For guidance about maximum bundle settings, see the topic Modify the Knowledge Bundle in the Splunk Enterprise Distributed Search Manual. What's next? To tackle the question about what to monitor during a Splunk Enterprise upgrade, see the Answers post How do I monitor system health during a Splunk Enterprise upgrade? What's your experience? We'd like to hear from you. We'll be updating this topic as we gather more input.

davidpaper · ‎09-10-2019

I need details about what to check before I upgrade so I know if my deployment is ready to upgrade. What do I monitor, and how do I benchmark system health before the upgrade?

davidpaper · ‎06-11-2019

Once an index is converted to use SmartStore, you are spot on. No more need for a coldPath entry for that index. Edit: The above is incorrect. You still need a coldPath entry in indexes.conf for the index, but the cold volume shouldn't be actively used once the buckets have been evicted from there.

davidpaper · ‎06-05-2019

Ah, this isn't really the case, but I can see how it might appear this way. There is now only "hot" and "not hot" in terms of a bucket lifecycle in S2. The concept of warm and cold being separate is no longer really a thing. Hot (read/write) is still replicated based on CM RF/SF settings until it rolls to read-only. At that point 1 copy of the bucket is made to S3, and the other local copies are marked for deletion by the indexers' cachemanager process. The cachemanager retrieves read-only buckets from S3 when it needs them to complete a search, and those buckets share the same file system as hot...so make sure your hot/cachemanager filesystem is nice and fast.

davidpaper · ‎04-08-2019

S2 behaviors in no particular order. I will update this post as new information is learned. RF/SF only apply to hot buckets. Once a bucket is rolled, it is uploaded to S3 and any bucket replicas are marked for eviction. The S2 cachemanager will download components of a bucket as searches determine what’s needed. These may be bloomfilters, deletes, journal.* or other components, so it may look like multiple downloads are happening for the same bucket, but per component, no duplicate downloads should happen. Evictions don’t always seem to show up in MC on the S2 pages, but they do show up in this search: index=_internal sourcetype=splunkd source=*splunkd.log action=evictDeletes Starting in 7.2.4, additional metrics were added to count downloaded bytes. Prior to this version, Splunk was metrics-blind to the (potentially significant) impact on the network/storage that a rolling restart induces. During a rolling restart: As each indexer is marked to go down, the CM begins to reassign primacy for buckets on that indexer to other indexers. All buckets on the indexer being restarted are marked for eviction, effectively flushing the cache on that indexer. As indexers in the cluster are restarted, others will start downloading buckets from S3 to satisfy search requests. This can take a heavy toll on the local network and storage if they are not prepared for this level of data transfer in a short period of time, as all other indexers not being restarted will likely start requesting buckets to download at once. SmartStore only allows one indexer at a time to be primary searchable for a bucket, and no other indexers are allowed to have copies of that bucket cached. The CM will issue eviction notices to any indexers with copies of that bucket locally. This ensures that only 1 indexer will search that bucket and return results. As a result, there is a huge amount of data shuffling and downloading that happens during a full cluster rolling restart. 
Bucket rebalance works more quickly with S2 than without it because the only buckets to rebalance are hot buckets. Added Nov 2019: Disk part 1: S2 disk I/O requirements seem to be higher than non-S2, because the bucket downloading process needs to write large amounts of data quickly as cachemanager populates buckets for search. The default downloading config allows for 8 simultaneous downloads. Disks previously able to shoulder the load may not be up to the task of S2’s caching requirements. I'm looking at you, RAID5 volumes. By definition it's cache space (and hot bucket space, but hot is replicated), so use RAID0 (stripe) for the fastest disk possible without wasting a MB of available disk space. RAID10 (mirrored stripes) is also acceptable, but cuts usable disk space by 50%. Disk part 2: To expand on the above a bit, S2 performance is about more than just high IOPS; it's about throughput too. Customers running S2 in AWS who have chosen gp2 EBS volumes for hot/cachemanager will likely see severe IO contention, with IO wait % jumping during periods of heavy S2 bucket downloads from remote storage. This is quite easy to see in top or iostat when users run searches that trigger large bucket evictions & bucket downloads from remote storage. gp2 has a limit of 250MB/sec, which doesn't take long to hit when the network is 10 gig or faster. Yes, a fast network means data is written to the kernel buffer cache at a high rate, and when it's time to sync to disk, the storage won't be able to keep up. The io1 EBS type is better, at 1000MB/s, but can still exhaust its throughput capacity during periods of concurrent heavy bucket downloads and searches that tax the storage for both reads and writes, in addition to ingestion and hot bucket replication. In AWS, it is highly recommended to use NVMe for hot/cachemanager (i3 and i3en instance types work very well here) in RAID 0 and to consider setting RF/SF=3 (still applies to hot buckets) to sleep better at night. 
Disk part 3: If deploying S2 outside of AWS, strive to obtain the fastest disks (throughput & IOPS) available, whether local SSDs or NVME to avoid storage bottlenecks getting in the way of your Splunk performance.
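To put rough numbers on the throughput point in Disk part 2, here is a minimal Python sketch. It only uses the figures quoted in the post (gp2 at 250MB/s, io1 at 1000MB/s, a 10 gig network); the function names are made up for illustration, and it ignores protocol overhead and everything else competing for the disk, so treat it as back-of-the-envelope math rather than a sizing tool.

```python
def network_mb_per_s(link_gbit):
    """Convert a link speed in Gbit/s to MB/s (decimal units, no overhead)."""
    return link_gbit * 1000 / 8


def oversubscription(storage_mb_per_s, link_gbit):
    """How many times faster the network can deliver data than the disk can absorb it."""
    return network_mb_per_s(link_gbit) / storage_mb_per_s


# Throughput limits as quoted in the post above.
volumes = {"gp2": 250, "io1": 1000}

for name, limit in volumes.items():
    ratio = oversubscription(limit, 10)
    print(f"{name}: a 10 gig link can deliver {ratio:.2f}x what the volume can write")
```

A ratio above 1 means bucket downloads alone can outrun the volume before ingestion, search reads, and hot bucket replication are even counted.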

davidpaper · ‎04-08-2019

I'd like to better understand what behaviors SmartStore is going to exhibit in my environment, and how do I manage them? What can I do to prepare my environment for SmartStore?

davidpaper · ‎04-02-2019

On the cluster master, the following search provides answers to both questions. | rest splunk_server=local /services/cluster/master/peers | stats sum(bucket_count) AS bucket_count_all | eval bucket_count = round(bucket_count_all / 1000 / 1000,2)."M" | eval replication_factor = [| rest splunk_server=local /services/cluster/config | return $replication_factor ] | eval unique = round(bucket_count_all / replication_factor / 1000 / 1000,2)."M" | fields bucket_count unique | rename bucket_count AS "Total Buckets", unique AS "Unique Buckets"
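For anyone who wants to see the arithmetic the search performs, here is a small Python sketch of the same calculation. The per-peer counts are invented sample numbers, and the function name is hypothetical. Like the search, it assumes every bucket has a full set of replicated copies, so "unique" is simply the total divided by the replication factor.

```python
def bucket_summary(peer_bucket_counts, replication_factor):
    """Mirror the SPL above: sum per-peer bucket counts, then divide by RF."""
    total = sum(peer_bucket_counts)
    unique = total / replication_factor

    def in_millions(n):
        # Same rounding as round(x / 1000 / 1000, 2)."M" in the search.
        return f"{round(n / 1000 / 1000, 2)}M"

    return {"Total Buckets": in_millions(total), "Unique Buckets": in_millions(unique)}


# Hypothetical cluster: three peers, replication factor 3.
print(bucket_summary([4_000_000, 4_000_000, 4_000_000], 3))
```

If the cluster is mid-fixup and some buckets don't yet have all their copies, the unique figure will be an underestimate.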

davidpaper · ‎04-02-2019

I want to know how many buckets I have in my indexing clustered environment, both the total count of all buckets and how many of them are unique.

davidpaper · ‎08-16-2018

In discussions with Architecture gurus at Splunk, including @jkerai, some general guidelines emerged to answer the question. Pipeline count: While you can technically run many pipelines (recently tested running 12), we had diminishing results beyond 3. The main challenge is to keep all UF pipelines balanced and fed with data. In most cases, due to the way UFs get their data, not all pipelines are busy and thus aggregate throughput from the UF is barely past 30-40MB/s. If you are reading data from disk by monitoring files, you will need a strategy that ensures that there are enough files to be read in parallel by different UF pipelines. If data is coming over raw TCP, there should be a good number of connections coming into the UF so that they get evenly spread across different pipelines. In the majority of cases I would guess that data is coming from a few big files that are constantly being written to. This leads to high utilization for a few pipelines and very low utilization for the remaining ones. Throughput & Data distribution: Each pipeline makes its own connection to one of the entries listed in outputs.conf. So, 3 pipelines will connect to 3 different entries listed in outputs. Pipelines work independently, roughly equivalent to having multiple UFs installed and running concurrently. Each pipeline picks its next-hop connection at random, so statistically they should be talking to different indexers. Each pipeline gets its own allocation of the limits.conf [thruput] maxKBps=# setting. If this is set to 0 (unlimited), then all pipelines shove as much throughput as they can push and the remote side can accept. Example: If maxKBps=5000, then each pipeline gets a max of 5MBytes/sec (yes bytes not bits) of throughput. This value is enforced before the data goes through the outbound compression routines, so the amount of data that appears on the wire should be considerably smaller (roughly 90% compression on average). 
So, 3 pipelines at 5MB/s = 15MB/s raw * 0.1 (compression ratio) = 1.5MB/s or 12Mb/s on the wire. Adding extra pipelines to your forwarder can help maintain a 2:1 forwarder:indexer pipeline ratio, which helps data distribution be more even across indexers. The higher the ratio, the more evenly distributed data is across the indexing tier. This matters when it comes to search performance (you want all indexers participating in all searches whenever possible) and balanced disk usage. Resource utilization: Each pipeline enabled takes up memory, CPU, disk and network resources. The one that seems to be the problem most often is CPU. A UF pipeline can consume 2 full cores. A HWF pipeline can consume 4 cores. So, 3 UF pipelines can chew up 6 cores on the host running the forwarder. 3 pipelines on a HWF could use up to 12 cores. RAM usage will also grow, as each pipeline maintains its own queues and buffers. If you have tuned your output buffers or queue sizes, be prepared for RAM usage to grow accordingly. Forcing your forwarders to dig into swap space is never a good idea for a production server! Disk can become impacted when persistent queues are enabled on the inputs side. Each pipeline will get its own directory for its queue and could potentially fill it up to the max size of the queue if the next hop stops accepting data for a period of time. Make sure you have enough disk space to accommodate full persistent queues on all pipelines. Network utilization is discussed above.
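The throughput and CPU numbers above are easy to rework for your own pipeline count. Here is a rough Python sketch of that arithmetic. The function name and parameters are made up for illustration; the 0.1 compression ratio and the 2 (UF) and 4 (HWF) cores per pipeline are the rule-of-thumb figures from this post, not measured values, and maxKBps is treated as decimal KB the same way the example above does.

```python
def forwarder_budget(pipelines, max_kbps, compression_ratio=0.1, cores_per_pipeline=2):
    """Estimate worst-case throughput and CPU for a forwarder with N pipelines."""
    # Each pipeline gets its own maxKBps allocation, so the caps add up.
    raw_mb_per_s = pipelines * max_kbps / 1000
    # The cap is applied before compression, so the wire sees much less.
    wire_mb_per_s = raw_mb_per_s * compression_ratio
    return {
        "raw_MB_per_s": raw_mb_per_s,
        "wire_MB_per_s": wire_mb_per_s,
        "wire_Mbit_per_s": wire_mb_per_s * 8,
        "max_cores": pipelines * cores_per_pipeline,
    }


uf = forwarder_budget(pipelines=3, max_kbps=5000)
hwf = forwarder_budget(pipelines=3, max_kbps=5000, cores_per_pipeline=4)
print("UF: ", uf)
print("HWF:", hwf)
```

With maxKBps=0 (unlimited) there is no cap to multiply, so this estimate only applies when a limit is actually set.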

davidpaper · ‎08-16-2018

I'm trying to figure out how many pipelines to set on my forwarders to maximize the following: throughput, data distribution to my indexers, and resource utilization. What are the things I need to be aware of when adding more pipelines? The default is 1.

davidpaper · ‎06-18-2018

The logic behind bucket replication sourcing works like this: 1) We will prefer a local site source for RF replication. 2) However, if the local source is already at max capacity for how many replications it can be involved in (max_peer_rep_load), then we can definitely go cross site for RF replication. 3) For SF replication, there is no preference; it ends up being random. On the CM in server.conf, the [clustering] max_peer_rep_load setting can be used to throttle up/down how many replication jobs are happening at once. Lowering this will slow down non-streaming (warm/cold) bucket replication, but will not affect streaming (hot) bucket replication. This value represents "slots" for each indexer to participate in non-streaming replication, either as a source or as a target. Huh, what? Need an example. Imagine 3 peers on site1, with bucket A and B that we want to be replicated intrasite (site1 needs to have 2 copies of A and B buckets), max_peer_rep_load=1 (for a simplified example), and 1 peer on site2: Site1: Peer1 - Bucket A; Peer2 - Bucket B; Peer3 - Bucket C. Site2: Peer4 - Bucket B. We may trigger replication of Bucket A from Peer1 to Peer2. Since Peer1 and Peer2 are both involved in that replication, the single "peer rep" slot on each of them is now taken. Peer3 has a slot available, so it can receive a copy of Bucket B, but Peer2 (the local source) doesn't have a slot available, so the copy comes from another site (Peer4 in site2), thus triggering an inter-site copy. Unfortunately, when we fix buckets, we fix them in some fixed (but random) order, and if the bucket we're scheduling next for replication doesn't have an available source on the local site, it will go to an alternate site. A huge thank you to @dxu_splunk for the background to answer the question. -dave
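The slot example above can be walked through in a few lines of Python. This is only a toy model of the behavior described in this post, not Splunk's actual scheduler: the function names, the data structures, and the "first candidate wins" tie-breaking are all assumptions made for illustration.

```python
def pick_source(bucket, target_site, holders, site_of, free_slots):
    """Pick a peer to copy `bucket` from, preferring the target's own site."""
    # Only peers that hold the bucket and still have a free rep slot qualify.
    candidates = [p for p in holders[bucket] if free_slots[p] > 0]
    local = [p for p in candidates if site_of[p] == target_site]
    if local:
        return local[0]
    # No local source has a free slot, so fall back to another site.
    return candidates[0] if candidates else None


def replicate(bucket, target, holders, site_of, free_slots):
    """Schedule one non-streaming replication; it uses a slot on both ends."""
    source = pick_source(bucket, site_of[target], holders, site_of, free_slots)
    if source is None or free_slots[target] <= 0:
        return None
    free_slots[source] -= 1
    free_slots[target] -= 1
    return source


site_of = {"peer1": "site1", "peer2": "site1", "peer3": "site1", "peer4": "site2"}
holders = {"A": ["peer1"], "B": ["peer2", "peer4"]}
free_slots = {peer: 1 for peer in site_of}  # max_peer_rep_load=1

first = replicate("A", "peer2", holders, site_of, free_slots)
second = replicate("B", "peer3", holders, site_of, free_slots)
print("Bucket A copied from", first)
print("Bucket B copied from", second, "in", site_of[second])
```

Raising the slot count to 2 in this model lets Peer2 serve Bucket B locally, which is the trade-off max_peer_rep_load controls: more concurrent replication load per peer in exchange for fewer cross-site copies.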

davidpaper · ‎06-18-2018

Scenario: multi-site cluster site1 and site2 site_rep_factor=origin:2, total:3 site_search_factor=origin:2, total:3 bucket12345 has 2 copies in site1 (origin) and 1 copy in site2. When a copy of the bucket is deleted in the origin site1 (the rb_* copy), the CM kicks off a job to make a new copy of that bucket. I see it being copied from an indexer in site2, instead of an indexer in site1. I expected Splunk to use a copy in the same site as the source, but it's not doing that. Why?

Posts	126
Solutions	22
Karma Given	227
Karma Received	224
Member Since	‎06-22-2011

Online Status	Offline
Date Last Visited	‎02-28-2023 01:16 AM

Why did ingestion slow way down after I added thou...

What is included in HEC introspection data?

What do I validate after I upgrade Splunk Enterpri...

How do I monitor system health during a Splunk Ent...

How do I benchmark system health before a Splunk E...

SmartStore Behaviors

How do I get a count of unique and total buckets i...

How many pipelines should I use on a forwarder?

Multi-site indexer clustering: why isn't data sour...

How to detect duplicate GUIDs on forwarders?
