This is the core of the product: it indexes events, stores them into buckets (the rawdata folder), and creates tsidx files (time-series index pointers) to make them searchable. (In case of replication, not all copies are searchable.)
The format and process are of course proprietary, but you can find some details of the different pipelines involved.
http://docs.splunk.com/Documentation/Splunk/6.1.4/Deploy/Datapipeline
For collection, there are many ways:
see http://docs.splunk.com/Documentation/Splunk/latest/Data/WhatSplunkcanmonitor
local monitoring : monitor inputs that watch files and folders
remote monitoring for Windows : WMI inputs or AD monitoring
network inputs : UDP or TCP ports listening on the indexers, for example syslog on port 514.
forwarding : the indexers have a listening port (splunktcp on port 9997, for example), and forwarder agents on remote servers monitor and send the data.
see http://docs.splunk.com/Documentation/Splunk/latest/Forwarding/Aboutforwardingandreceivingdata
see http://docs.splunk.com/Documentation/Splunk/latest/Deploy/Distributedoverview
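The inputs above all map to stanzas in configuration files. A minimal sketch of what they could look like — the stanza types (monitor, udp, splunktcp, tcpout) are real, but every path, port, and hostname here is just an example:

```ini
# --- inputs.conf (on the indexer) -- example values only ---

# local monitoring: tail every file in a folder
[monitor:///var/log/myapp]
sourcetype = myapp

# network input: listen for syslog on UDP 514
[udp://514]
sourcetype = syslog

# receiving from forwarders: splunktcp listener on 9997
[splunktcp://9997]

# --- outputs.conf (on the forwarder) -- example values only ---

# send monitored data to the indexer's splunktcp port
[tcpout:my_indexers]
server = indexer.example.com:9997
```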
If in doubt, run btool on the inputs, or use the SOS app's metrics dashboards to identify the forwarders.
./splunk cmd btool inputs list --debug
http://docs.splunk.com/Documentation/Splunk/6.1.4/Troubleshooting/Usebtooltotroubleshootconfigurations
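Besides the SOS dashboards, you can identify forwarders directly from the indexer's metrics.log, which logs a group=tcpin_connections entry per forwarder connection. A rough sketch — the sample lines below are made up, only modeled on the typical metrics.log format, so check your own log before relying on the field names:

```python
import re

# Made-up sample in the style of metrics.log tcpin_connections entries
sample = """\
12-01-2014 10:00:01.000 +0000 INFO Metrics - group=tcpin_connections, connectionType=cooked, sourceIp=10.0.0.5, sourceHost=web01, kb=42.5
12-01-2014 10:00:01.000 +0000 INFO Metrics - group=tcpin_connections, connectionType=cooked, sourceIp=10.0.0.6, sourceHost=app02, kb=13.0
12-01-2014 10:00:31.000 +0000 INFO Metrics - group=tcpin_connections, connectionType=cooked, sourceIp=10.0.0.5, sourceHost=web01, kb=40.1
"""

def forwarders(text):
    """Return the set of sourceHost values seen on tcpin_connections lines."""
    hosts = set()
    for line in text.splitlines():
        if "group=tcpin_connections" in line:
            m = re.search(r"sourceHost=([^,\s]+)", line)
            if m:
                hosts.add(m.group(1))
    return hosts

print(sorted(forwarders(sample)))  # ['app02', 'web01']
```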