Python script for Visitor Paths
Last updated: October, 2010
download: Visitor Path analysis
Description:
This Python script is a log analyzer for collections of Apache log files in common format. It produces two types of statistics:

(1) Visitor paths of the individual visitors (by IP). If you are using Awstats, you can link to its DNS cache and the script will try to resolve the IP.
(2) Global visitor paths: aggregated numbers on how many visitors went from page A to page B.

You can turn off statistics of type 1 (they are very verbose and are generated in memory) and generate only statistics of type 2. Type 2 statistics are written as comma-separated values (CSV) and are easy to import into a spreadsheet application for further analysis (csv example). Several other parameters, such as the location of the log files, can be set before running the Python script.

The zip file you download contains several files. The script to execute (and modify) is vp.py. ApacheCommon.py is a generic log parser class and FileCrawl.py is a generic file reader (line by line). VisitorPaths.py is the class that uses FileCrawl.py and ApacheCommon.py to gather the statistics based on the configuration in vp.py.
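To make the two statistic types concrete, here is a minimal sketch of how they could be gathered from common-format log lines. It is not the code shipped in VisitorPaths.py or ApacheCommon.py; the names parse_line, collect_paths and transitions_to_csv, the regular expression, and the CSV column names are all assumptions made for illustration. Note that it counts page-to-page transitions per IP, which may differ from how the script counts visitors.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Common Log Format: host ident user [time] "METHOD path PROTOCOL" status size
# (hypothetical pattern, not taken from ApacheCommon.py)
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (ip, time, page) for a parsable line, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group("ip"), match.group("time"), match.group("page")


def collect_paths(lines, keep_per_visitor=True):
    """Build type 1 (per-IP paths) and type 2 (A -> B counts) statistics.

    Set keep_per_visitor=False to skip the verbose, in-memory type 1 data.
    """
    last_page = {}                   # ip -> most recent page seen
    per_visitor = defaultdict(list)  # ip -> [(time, page), ...]
    transitions = Counter()          # (from_page, to_page) -> count
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not look like log entries
        ip, time, page = parsed
        if keep_per_visitor:
            per_visitor[ip].append((time, page))
        previous = last_page.get(ip)
        if previous is not None:
            transitions[(previous, page)] += 1
        last_page[ip] = page
    return dict(per_visitor), transitions


def transitions_to_csv(transitions):
    """Render the aggregated transitions as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["from_page", "to_page", "count"])
    for (src, dst), count in sorted(transitions.items()):
        writer.writerow([src, dst, count])
    return buffer.getvalue()
```

Keeping only the last page per IP means the type 2 counts need very little memory, which is why the per-visitor lists are the part worth switching off for large log collections.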
When you run the script with multiple log files, make sure that the log files are listed in the right order (in the setParameter section). Each log file contains its oldest entries first and its newest entries last, so order the log files from the oldest (first) to the most recent (last). The log file analyzer then processes the entries from oldest to newest, and the text-based output reflects that order. The screenshot below shows a sample of the output. Per IP/domain there is an ordered list of date/time, page visited and referer (search engine, other site, etc.). A dash (-) means there was no referer.
Visitor Paths Output Example.
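The oldest-to-newest ordering of log files described above can be checked or produced automatically. The following is a minimal sketch, not part of the downloaded scripts: first_timestamp and order_oldest_first are hypothetical helper names, and the sketch assumes uncompressed log files with English month abbreviations in the usual [day/month/year:hour:minute:second zone] timestamp.

```python
import re
from datetime import datetime

# Timestamp between square brackets, e.g. [10/Oct/2010:13:55:36 -0700]
TIME_RE = re.compile(r'\[([^\]]+)\]')


def first_timestamp(path):
    """Return the timestamp of the first parsable entry in a log file, or None."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = TIME_RE.search(line)
            if match is None:
                continue
            try:
                return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
            except ValueError:
                continue  # bracketed text that is not a timestamp
    return None


def order_oldest_first(paths):
    """Sort log files by their first entry; files without entries are dropped."""
    dated = []
    for path in paths:
        stamp = first_timestamp(path)
        if stamp is not None:
            dated.append((stamp, path))
    dated.sort()
    return [path for stamp, path in dated]
```

The resulting list could then be copied into the log file parameter in vp.py, oldest file first.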