EDIT: This list has more to do with platform stability than "defending the turf," but it's much easier to identify problems in an otherwise healthy environment than in a sick one.
I generally do the following:
1. Configure the Monitoring Console and enable alerts. If you're using forwarders, configure forwarder monitoring as well (the first sketch after this list is a quick ad-hoc check). This should cover basic availability monitoring.
2. Create a report or dashboard quantifying _internal (or app-specific) ERROR and WARN* events by source, component, or whichever category works best for you conceptually. Manage these as defects using quality-control tools, e.g. Pareto charts; the second sketch below builds the underlying table.
3. Identify hosts and sources present today that were not present yesterday, i.e. new sources (the third sketch below covers this item and the next).
4. Identify hosts and sources present yesterday that are not present today, i.e. missing sources.
5. Identify anomalous changes in event counts across critical hosts and sources (fourth sketch below).
6. Work with your infrastructure or capacity team (if they're separate functions) to baseline Splunk performance and identify anomalous variance in the core resources: CPU, memory, I/O, and storage (last sketch below).
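
For item 1, beyond what the Monitoring Console gives you out of the box, a quick ad-hoc check for forwarders that have gone quiet looks something like this. A minimal sketch, assuming your forwarders show up in the usual metrics.log tcpin_connections events and that an hour of silence is worth flagging (both are assumptions to tune):

    index=_internal source=*metrics.log* group=tcpin_connections
    | stats latest(_time) as last_seen by hostname
        ```one row per forwarder, keeping its most recent connection time```
    | eval minutes_quiet = round((now() - last_seen) / 60)
    | where minutes_quiet > 60
        ```60 minutes is an arbitrary cutoff; tune it to your send intervals```
    | sort - minutes_quiet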
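For item 2, the table behind the Pareto chart can be a straightforward count by component. A sketch against splunkd's own logs; swap in your app's index and whichever category field you settled on:

    index=_internal sourcetype=splunkd (log_level=ERROR OR log_level=WARN*)
    | stats count by component
    | sort - count
        ```streamstats/eventstats turn the sorted counts into the cumulative
        percentage line that makes this a Pareto chart```
    | streamstats sum(count) as running_total
    | eventstats sum(count) as grand_total
    | eval cumulative_pct = round(100 * running_total / grand_total, 1)

Chart count as columns and cumulative_pct as a line overlay and you have the classic Pareto view.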
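Items 3 and 4 are mirror images, so one metadata search can handle both; this example does hosts, and type=sources does sources. A sketch using the firstTime/lastTime fields that metadata returns (the 24-hour windows are assumptions, and if your version's metadata command doesn't take a wildcarded index, list the indexes explicitly):

    | metadata type=hosts index=*
    | eval status = case(firstTime >= relative_time(now(), "-24h"), "new",
                         lastTime  <  relative_time(now(), "-24h"), "missing")
        ```first seen in the last 24h = new; silent for more than 24h = missing```
    | where isnotnull(status)
    | convert ctime(firstTime) ctime(lastTime)
    | table host status firstTime lastTime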
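For item 5, a z-score against each host's own recent history catches the gross anomalies. A sketch assuming a rolling window of the last eight complete days and a three-sigma threshold, both of which are arbitrary starting points:

    | tstats count where index=* earliest=-8d@d latest=@d by host _time span=1d
        ```daily event counts per host```
    | eventstats avg(count) as baseline stdev(count) as spread by host
    | eval z = round((count - baseline) / spread, 2)
    | where _time >= relative_time(now(), "-1d@d") AND abs(z) > 3
        ```keep yesterday's buckets that sit 3+ sigma off the baseline;
        zero-variance hosts drop out because z evaluates to null```

Rerun it with by source instead of by host for the source-level view.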
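For item 6, the introspection data Splunk collects about itself is a natural feed for whatever baseline your capacity team keeps. A sketch of hourly host-wide CPU, assuming the default _introspection collection is enabled; the same events carry data.mem_used for memory, and component=IOStats covers disk I/O:

    index=_introspection sourcetype=splunk_resource_usage component=Hostwide
    | eval cpu_pct = 'data.cpu_system_pct' + 'data.cpu_user_pct'
        ```system + user CPU combined into one utilization figure```
    | timechart span=1h avg(cpu_pct) by host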
Beyond the basics, you're getting into service quality and quantifying/qualifying user behavior: search performance, search coverage, data retention relative to storage pools, etc. A starting point for the search-performance piece is sketched below.
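
On search performance specifically, the audit trail is a reasonable first stop. A sketch summarizing completed-search runtimes per user from _audit, which is populated by default; p90 is assumed here because tail latency usually tells you more than the mean:

    index=_audit action=search info=completed
    | stats count avg(total_run_time) as avg_runtime p90(total_run_time) as p90_runtime by user
        ```runtime distribution per user; sort by the tail, not the average```
    | sort - p90_runtime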