A client produces a weekly magazine, in PDF format. There are 17 different versions of the zine each week, where the only difference is the cover page and the file name.
The file names are specific to the week's issue of the zine and the particular version of that issue. For example: OI022012.pdf, OI022012.hni.pdf, and OI022012.oi.pdf.
The client is requesting (frequently...) that I tell him how many downloads he gets each week. Obviously that's a valid request.
Optimally, I'd be able to tell him how many downloads of each file, each week.
If I were able to just count the code 200's (the successful, full-file downloads), I'd be all set. (But, then I wouldn't be asking a question here.)
The problem is that the log files are full of code 206's. Partial downloads. Eyeballing the log files reveals that the partials are bunched together from each client (usually), and look like this:
22.214.171.124 - - [24/Feb/2012:07:37:01 -0500] "GET /_pdf_/OI022012.pdf HTTP/1.1" 206 4317 "-" "Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
126.96.36.199 - - [24/Feb/2012:07:37:00 -0500] "GET /_pdf_/OI022012.pdf HTTP/1.1" 200 3692245 "-" "Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
188.8.131.52 - - [24/Feb/2012:07:37:01 -0500] "GET /favicon.ico HTTP/1.1" 200 1406 "-" "Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
184.108.40.206 - - [24/Feb/2012:07:37:01 -0500] "GET /_pdf_/OI022012.pdf HTTP/1.1" 206 265 "-" "Mozilla/5.0 (Windows NT 6.0; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
(etc., etc... there are about 20 lines in a row from that same IP address, all code 206's)
Of course, other files are in there too (like the favicon.ico).
Can anybody suggest a way to build a report which tells me how many downloads there have been of each file, per week? Obviously, 10 partial (code 206) downloads of the file from the same IP address will have to count as just one download.
I don't see a way to do it. I may have to write a cgi which redirects to the appropriate pdf file, just to track the hits.
Clarification, added later:
I received a response to this which (for some reason) is not listed here on the site.
The suggestion was to make a transaction report, something like this:
... | transaction clientip, file | stats count by file | sort - count
That only works if I ignore NAT. In reality, most of the people downloading these files are getting them at work, and most of them work at large companies. So all of the requests from a given office are appearing to come from the same IP address.
Is there any way to make the transaction include an element of time?
Looking the logs over, it appears that all the partials for a file from a single IP address come in bursts that last only a couple of seconds. I.e., an entire file will be downloaded with 50 "partial" requests in a matter of a couple of seconds. Then, some time later, there will be another group of partial downloads.
If there were some way to only count a request as part of the transaction if it happens within 30? 60? seconds of the first one in the group, I think I'd be all set.
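To pin down the grouping I'm describing, here is a minimal Python sketch of the logic, outside of Splunk. It is only an illustration under some assumptions: the log is in the combined format shown above, a "download" is any burst of 200/206 requests for the same PDF from the same IP that falls within a window of the first request in the burst, and weeks are ISO weeks. The function name count_downloads and the window_seconds parameter are made up for this example.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Combined log format: client IP, timestamp, requested path, status code.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

def count_downloads(lines, window_seconds=30):
    """Count PDF downloads per (ISO year, ISO week, file name).

    Requests for the same file from the same IP are treated as one
    download if they arrive within window_seconds of the first request
    in the burst (hypothetical rule, mirroring the idea in the post).
    """
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m["status"] not in ("200", "206"):
            continue
        if not m["path"].lower().endswith(".pdf"):
            continue  # skips favicon.ico and other non-PDF requests
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        hits.append((ts, m["ip"], m["path"].rsplit("/", 1)[-1]))

    # Log lines are not strictly in time order, so sort before grouping.
    hits.sort()

    burst_start = {}   # (ip, file name) -> time of first request in burst
    counts = Counter()
    for ts, ip, name in hits:
        key = (ip, name)
        start = burst_start.get(key)
        if start is None or ts - start > window:
            # A new burst begins: count it as one download.
            burst_start[key] = ts
            iso = ts.isocalendar()
            counts[(iso[0], iso[1], name)] += 1
    return counts
```

Calling count_downloads(open("access.log")) (the file name is just a placeholder) would return a Counter keyed by week and file. Two people behind the same NAT who grab the same file within the same window would still be merged into one download, so the window size is a trade-off rather than a fix.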
Does that make sense? I have no idea whether this is even possible or, if it is, what the syntax would look like, but here's what I'm imagining:
... | transaction clientip, file, time[range=30s] | stats count by file | sort - count