I have a DBX 3.1.2 job that's failing partway through. I don't get any error messages (all logging is set to DEBUG), just the following line in the metrics log:
2018-05-03 12:06:37.976 -0400 INFO c.s.dbx.server.task.listeners.JobMetricsListener - action=collect_job_metrics connection=my_db_connection jdbc_url=null record_read_success_count=3444 db_read_time=397794 record_read_error_count=1 hec_upload_time=102 hec_record_process_time=13 format_hec_success_count=3444 hec_upload_bytes=1631645 status=FAILED input_name=my_db_input batch_size=1000 error_threshold=N/A is_jmx_monitoring=false start_time=2018-05-03_12:00:00 end_time=2018-05-03_12:06:37 duration=397965 read_count=3444 write_count=3000 filtered_count=0 error_count=0
As you can see, not every record counted in read_count is making it into write_count. But when I search for error messages related to this input, I don't find anything beyond this.
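For anyone who wants to eyeball the gap themselves, the metrics line is just whitespace-separated key=value pairs, so it's easy to pull apart. This is only a sketch for inspecting the log line above (the line is abbreviated here); nothing in it is DBX-specific API:

```python
# Parse the key=value pairs out of the DBX metrics line and compare
# the read/write counters. Field names come straight from the log above.
metrics_line = (
    "action=collect_job_metrics connection=my_db_connection jdbc_url=null "
    "record_read_success_count=3444 status=FAILED read_count=3444 "
    "write_count=3000 filtered_count=0 error_count=0"
)

# Each token is "key=value"; split only on the first "=" so values
# containing "=" would survive.
metrics = dict(pair.split("=", 1) for pair in metrics_line.split())

read_count = int(metrics["read_count"])
write_count = int(metrics["write_count"])
print(f"dropped records: {read_count - write_count}")  # 444 in this run
```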
Has anybody else had this problem? Where did you look?
Sounds like HEC performance, which usually means indexer pushback. Look at your indexing queues.
3000 is a suspiciously round number and also a suspicious multiple of your batch_size.
Also, that hec_upload_time of 102 seconds is... I hope that's in ms. Even then that seems kind of high for a few thousand records totaling a MB and a half.
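One way to check those indexing queues is the queue metrics Splunk writes to its own metrics.log. A starting-point search, assuming you can read the indexers' _internal index (field names are the standard metrics.log queue fields, but verify against your version):

```
index=_internal source=*metrics.log group=queue name=indexqueue
| eval fill_pct = round(current_size_kb / max_size_kb * 100, 1)
| timechart max(fill_pct) by host
```

If the fill percentage is pinned near 100 during your DBX runs, the indexers are pushing back and HEC will stall.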
Have you confirmed whether the right number of records made it into Splunk? I'm pretty sure it didn't, but maybe this is an error in the internal metrics?
I agree that it's a suspicious multiple.
Upload times are in ms as far as I can see... this is one of the most heavily taxed databases in the environment, so it's going to be a bit higher than one would like.
We have confirmed that Splunk is not receiving the expected number of records. We are missing entries.