Example code - Dicts
Latest revision as of 17:04, 1 March 2024
Files used in example
Log from web server: https://teaching.healthtech.dtu.dk/material/22101/apachewiki.log
Web server statistics
It is of interest to the BOSS to see how many visitors a web site has and how many web pages have been seen. Therefore someone (you) has to make some statistics from the web server log file.
Example of a log entry
This is a single line from the log. It is long, so it may wrap when displayed.
52d3ccde.dynamic-ip.k-net.dk - - [07/Mar/2017:19:48:44 +0100] "GET /teaching/index.php?title=-&action=raw&gen=js&useskin=monobook HTTP/1.1" 200 290 "http://wiki.bio.dtu.dk/teaching/index.php/Course27617Spring2017" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0.2 Safari/602.3.12"
In this example, we are only interested in two parts of the entry. The first is the IP-number or host name of the visitor, which is the first field on the line (here 52d3ccde.dynamic-ip.k-net.dk). The second is the file served to the visitor, which follows the request method (here /teaching/index.php?title=-&action=raw&gen=js&useskin=monobook).
The problem is that the web is filled with search engines indexing web sites, and they don't count as humans - yet. So they must be excluded from your statistics. Most indexing robots look at a file called /robots.txt to find out what they are allowed to index, so a visitor that requests this file can be identified as a non-human.
Also, many web pages use pictures, meaning that the web server also serves picture files to the visitors. These pictures do not count as page views, since they are part of a page. Picture files mostly have the extensions .gif, .jpg, .png and .jpeg, so they can also be identified.
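The following minimal sketch shows how the two interesting parts can be pulled out of the sample entry by splitting on whitespace, and how the two rules above can be expressed as tests on the requested file. The helper names is_crawler_request and is_picture are made up for illustration and are not used in the program on this page.

```python
# The sample log entry from above, written as one string.
entry = ('52d3ccde.dynamic-ip.k-net.dk - - [07/Mar/2017:19:48:44 +0100] '
         '"GET /teaching/index.php?title=-&action=raw&gen=js&useskin=monobook HTTP/1.1" '
         '200 290 "http://wiki.bio.dtu.dk/teaching/index.php/Course27617Spring2017" '
         '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 '
         '(KHTML, like Gecko) Version/10.0.2 Safari/602.3.12"')

# Splitting on whitespace puts the host in field 0 and the served file in field 6.
field = entry.split()
visitor = field[0]
served = field[6]

# Extensions that mark a picture file (taken from the text above).
PICTURE_EXTENSIONS = ('.gif', '.jpg', '.png', '.jpeg')

def is_crawler_request(path):
    """Hypothetical helper: True if the request is for /robots.txt."""
    return path == '/robots.txt'

def is_picture(path):
    """Hypothetical helper: True if the path ends with a picture extension."""
    return path.endswith(PICTURE_EXTENSIONS)

print(visitor)
print(served)
print(is_crawler_request(served), is_picture(served))
```

Note that str.endswith accepts a tuple of suffixes, so all four extensions are checked in one call.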
The program
#!/usr/bin/python3
# Makes statistics on an Apache web server log file
import sys

# Get file
try:
    logfile = open('apachewiki.log', 'r')
except IOError as err:
    print('There seems to be a problem with the file:', str(err))
    sys.exit(1)

# Dict for counting page views
hosts = dict()
# Set of search engine crawlers
crawler = set()
for line in logfile:
    field = line.split()
    # Skip malformed lines that do not contain a requested file
    if len(field) < 7:
        continue
    # Is this a crawler?
    if field[6] == '/robots.txt':
        crawler.add(field[0])
    # Is this a pic?
    if not field[6].endswith(('.gif', '.jpg', '.png', '.jpeg')):
        # Must be a page view then
        if field[0] in hosts:
            hosts[field[0]] += 1
        else:
            hosts[field[0]] = 1
logfile.close()

# Remove page views made by crawlers
crawlCount = 0
for item in crawler:
    crawlCount += hosts[item]
    del hosts[item]

# Sort page views by size
# Print top ten
print('Top 10 viewers')
for host in sorted(hosts.keys(), reverse=True, key=hosts.get)[:10]:
    print(host, hosts[host])

# Get total page views
total = 0
for val in hosts.values():
    total += val

print('Unique visitors:', len(hosts))
print('Unique crawlers:', len(crawler))
print('Total visitor page views:', total)
print('Total crawler page views:', crawlCount)
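As a possible variation, the same counting can be sketched with collections.Counter from the standard library, which is a dict subclass that starts every missing key at zero. This is not the approach taken by the program above, only an alternative. The function name summarize and the log lines in sample are invented for illustration, so the logic can be tried without the real log file.

```python
from collections import Counter

def summarize(lines):
    """Return (page views per human host, set of crawler hosts, crawler page views)."""
    hosts = Counter()
    crawlers = set()
    for line in lines:
        field = line.split()
        # Skip malformed lines that do not contain a requested file
        if len(field) < 7:
            continue
        if field[6] == '/robots.txt':
            crawlers.add(field[0])
        if not field[6].endswith(('.gif', '.jpg', '.png', '.jpeg')):
            hosts[field[0]] += 1  # No "in" test needed with a Counter
    # A /robots.txt request is itself counted as a page view,
    # so every crawler is guaranteed to be a key in hosts.
    crawl_count = sum(hosts.pop(host) for host in crawlers)
    return hosts, crawlers, crawl_count

# Invented log lines, shortened to the fields the function uses.
sample = [
    'alice.example - - [01/Jan/2024:10:00:00 +0100] "GET /index.html HTTP/1.1" 200 100',
    'alice.example - - [01/Jan/2024:10:00:01 +0100] "GET /logo.png HTTP/1.1" 200 100',
    'alice.example - - [01/Jan/2024:10:00:02 +0100] "GET /about.html HTTP/1.1" 200 100',
    'bot.example - - [01/Jan/2024:10:00:03 +0100] "GET /robots.txt HTTP/1.1" 200 100',
    'bot.example - - [01/Jan/2024:10:00:04 +0100] "GET /index.html HTTP/1.1" 200 100',
    'bob.example - - [01/Jan/2024:10:00:05 +0100] "GET /index.html HTTP/1.1" 200 100',
]

hosts, crawlers, crawl_count = summarize(sample)
print('Top 10 viewers')
for host, views in hosts.most_common(10):
    print(host, views)
print('Unique visitors:', len(hosts))
print('Unique crawlers:', len(crawlers))
print('Total visitor page views:', sum(hosts.values()))
print('Total crawler page views:', crawl_count)
```

Counter.most_common(10) replaces the sorted(...)[:10] expression, and sum(hosts.values()) replaces the explicit loop for the total. The plain dict version above is still worth knowing, since it shows what the Counter does behind the scenes.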