Example code - Dicts: Difference between revisions

Revision as of 18:04, 1 March 2024

Files used in example

Web server statistics

It is of interest to the BOSS to see how many visitors a web site has and how many web pages have been viewed. Therefore someone (you) has to compile some statistics from the web server log file.

Example of a log entry

This is a single log line, split in three here so that all of it can be seen.

52d3ccde.dynamic-ip.k-net.dk - - [07/Mar/2017:19:48:44 +0100] "GET /teaching/index.php?title=-&action=raw&gen=js&useskin=monobook HTTP/1.1"
   200 290 "http://wiki.bio.dtu.dk/teaching/index.php/Course27617Spring2017"
   "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 (KHTML, like Gecko) Version/10.0.2 Safari/602.3.12"

In this example, we are only interested in the red part (the first field on the line), which is the IP number or host name of the visitor, and the green part (the path following GET), which is the file served to the visitor. The problem is that the web is full of search engines indexing web sites, and they don't count as humans - yet. So they must be excluded from your statistics. Most indexing robots look at a file called /robots.txt to find out what they are allowed to index, so a request for that file is a way to identify the non-humans.

Also, many web pages use pictures, meaning that the web server also serves picture files to the visitors. These pictures do not count as page views (since they are part of a page). Picture files mostly have the extensions .gif, .jpg, .png and .jpeg, so they can also be identified.
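As a small illustration of the two rules described above, the sketch below splits the example log entry on whitespace and classifies the request. The function name `classify` and the constant `PICTURE_EXTENSIONS` are made up for this illustration; it also treats a /robots.txt request as purely a crawler marker, which is a simplification.

```python
# The example log entry from above, joined back into a single line.
line = ('52d3ccde.dynamic-ip.k-net.dk - - [07/Mar/2017:19:48:44 +0100] '
        '"GET /teaching/index.php?title=-&action=raw&gen=js&useskin=monobook HTTP/1.1" '
        '200 290 "http://wiki.bio.dtu.dk/teaching/index.php/Course27617Spring2017" '
        '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/602.3.12 '
        '(KHTML, like Gecko) Version/10.0.2 Safari/602.3.12"')

# Hypothetical name; the extensions are the ones listed in the text.
PICTURE_EXTENSIONS = ('.gif', '.jpg', '.png', '.jpeg')

def classify(line):
    """Return (host, kind), where kind is 'crawler', 'picture' or 'page'."""
    field = line.split()
    host = field[0]   # the "red part": IP number or host name
    path = field[6]   # the "green part": the file that was served
    if path == '/robots.txt':
        return host, 'crawler'
    if path.endswith(PICTURE_EXTENSIONS):   # endswith accepts a tuple
        return host, 'picture'
    return host, 'page'

host, kind = classify(line)
print(host, kind)   # prints: 52d3ccde.dynamic-ip.k-net.dk page
```

Splitting on whitespace works here because the host name and the requested path contain no spaces; fields further to the right, such as the browser description, are broken into several pieces and are not used.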

The program

#!/usr/bin/python3
# makes statistics on an Apache web server log file
import sys

# get file
try:
    logfile = open('apachewiki.log', 'r')
except IOError as err:
    print('There seems to be a problem with the file:', str(err))
    sys.exit(1)

# Dict for counting page views
hosts = dict()
# Set of search engine crawlers
crawler = set()

for line in logfile:
    field = line.split()
    # Skip blank or malformed lines that have no request path
    if len(field) < 7:
        continue
    # Is this a crawler
    if field[6] == '/robots.txt':
        crawler.add(field[0])
    # Is this a pic?
    if not field[6].endswith(('.gif','.jpg','.png','.jpeg')):
        # Must be a page view then
        if field[0] in hosts:
            hosts[field[0]] += 1
        else:
            hosts[field[0]] = 1

logfile.close()

# Remove page views made by crawlers
crawlCount = 0
for item in crawler:
    crawlCount += hosts[item]
    del hosts[item]

# Sort page views by size
# Print top ten
print('Top 10 viewers')
for host in sorted(hosts.keys(), reverse=True, key=hosts.get)[:10]:
    print(host, hosts[host])
# Get total page views
total = sum(hosts.values())
print('Unique visitors:', len(hosts))
print('Unique crawlers:', len(crawler))
print('Total visitor page views:', total)
print('Total crawler page views:', crawlCount)
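The counting, crawler removal and top-ten steps of the program can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the method used in the program above, and the (host, path) pairs are made-up sample data standing in for parsed log lines.

```python
from collections import Counter

# Made-up (host, path) pairs standing in for parsed log lines.
requests = [
    ('alice.example', '/index.php'),
    ('alice.example', '/logo.png'),
    ('alice.example', '/about.php'),
    ('bot.example', '/robots.txt'),
    ('bot.example', '/index.php'),
    ('bob.example', '/index.php'),
]

PICTURE_EXTENSIONS = ('.gif', '.jpg', '.png', '.jpeg')

hosts = Counter()   # missing keys count as 0, so no if/else is needed
crawler = set()
for host, path in requests:
    if path == '/robots.txt':
        crawler.add(host)
    if not path.endswith(PICTURE_EXTENSIONS):
        hosts[host] += 1

# pop() removes the crawler from the counts and returns its page views
crawl_count = sum(hosts.pop(host, 0) for host in crawler)

top = hosts.most_common(10)   # list of (host, count), largest first
total = sum(hosts.values())

print('Top 10 viewers')
for host, count in top:
    print(host, count)
print('Unique visitors:', len(hosts))
print('Unique crawlers:', len(crawler))
print('Total visitor page views:', total)
print('Total crawler page views:', crawl_count)
```

A plain dict, as in the program above, shows the mechanics more directly; `Counter` mainly saves the membership test and the explicit sort.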

@@ Line 1: / Line 1: @@
 __NOTOC__
 == Files used in example ==
-[http://teaching.healthtech.dtu.dk/material/36610/apachewiki.log Log from web server]
+[http://teaching.healthtech.dtu.dk/material/22101/apachewiki.log Log from web server]
 == Web server statistics ==