Log analysis of my website
I write these essays in part as a promotional activity. I'm a
consultant, and expect people to find out more about what I do through
reading what I've written.
I've wondered if it's been useful, but have put off doing the analysis
of my website. At first it was because I didn't have enough essays to
do interpretable analysis. And then I just put it off. At the German
Chemoinformatics Conference I talked to quite a few people, mostly
grad students, who had gotten information from my site. That was
enough to make me finally do some analysis.
I used awstats, chosen after some web searching. I wanted something
that could analyze my Apache logs and generate static pages. There are
other tools, but since awstats did what I wanted I didn't try
anything else.
So far this year I've had 1.1 million "hits", which corresponds to
330,000 page views. A "hit" counts every request, including images, so
one page view can produce several hits because of CSS, images, and
other embedded content. Another nearly 500,000 page views come from
web spiders and other identifiably non-human requests - more page
requests from robots than from people. All told, I use less than 20 GB
of bandwidth per year. I use pair Networks for my hosting; my basic
account allows 400 GB/month of transfer, so I'm not even close.
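awstats makes the hit/page-view split for me, but it's easy to
approximate from the raw log. A rough sketch in Python, assuming a
combined-format Apache log named access_log; the filename and the list
of embedded-content extensions are my own guesses, not awstats's
actual rules:

    import re
    from collections import Counter

    # Matches the request and size fields of an Apache "combined" log line, e.g.
    # 1.2.3.4 - - [12/Nov/2007:10:00:00 +0100] "GET /x.html HTTP/1.1" 200 5120 "-" "..."
    LOG_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+" \d{3} (\d+|-)')

    # Assumption: requests for these suffixes are embedded content, not pages.
    EMBEDDED = (".css", ".js", ".png", ".gif", ".jpg", ".ico")

    counts = Counter()
    total_bytes = 0
    with open("access_log") as log:
        for line in log:
            m = LOG_RE.search(line)
            if m is None:
                continue  # malformed or unusual request line
            path, size = m.groups()
            counts["hits"] += 1
            if not path.lower().endswith(EMBEDDED):
                counts["page views"] += 1
            if size != "-":
                total_bytes += int(size)

    print(counts["hits"], "hits;", counts["page views"], "page views;",
          "%.1f GB sent" % (total_bytes / 1e9))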
Of the robots, Yahoo Slurp pulled down 1.6 GB, MSNBot 810 MB, and
Googlebot 290 MB. Another 80 MB went to Google's RSS reader, 7 MB to
Bloglines, and 5 MB to UniversalFeedParser. Of the users, 64.5% use
Windows, 17% use Linux, 11.5% use Macs, and, jumping over the BSD and
Solaris users,
a full 88 requests came from an IRIX machine. The browser stats are 45% Firefox, 33.5% IE, 4% each for Mozilla and Safari, and 3% Opera.
Top hit (no surprise) is my RSS feed, viewed 82,000 times this year -
including by aggregators, so interpret the count as you wish. Next was
my LOLPython page, which wasn't a surprise. I wrote it deliberately
because of the then-high popularity of lolcats and lolcode. It got
17,500 views.
About 1,200 downloads from people who weren't me.
The next two were surprising. I did a series of lectures for
the NBN. These were for the most part graduate students in
biology, going into computational biology, who needed more programming
training. The page on Javascript validation got 7,300 hits, and the
one on threads in Python got 5,800. My screen scraping page was also
popular, at 5,600 views.
Going further down the list:
Naming molecules is the first chemistry page, at 4,300 hits - I think because it uses the word "vodka".
My wide finder commentary is only a few months old and already sits at #11, with 4,200 hits. Basking in Tim Bray's shadow.
3,200 people viewed this slide. Why? People searching for "sample use case". But it's an image - how do the search engines know about it?
The ANTLR work I did is also popular. Only 50 days old and already 2,500 hits. Well, it was on the ANTLR home page for a while.
I do a lot of work with cheminformatics, but that's the details. In
most cases my topic is more general, like how to write a C extension
for Python (that just happens to use a chemistry toolkit). The
highest cheminformatics-specific hit is my article on SMILES
tokenization, with 1,500 hits. Most of the links come from
Wikipedia's SMILES
page. My most popular bioinformatics page is on
BLAST parsing at just under 1,400 hits.
You can easily see that most people who come to my pages are there
because of popular topics of the day (LOLPython, wide-finder) or
general computing questions (threading, validation, HTML templates,
Python, ANTLR). Very few came to my pages for cheminformatics reasons.
Then again, there are very few people doing cheminformatics.
The top search phrases were:
python basics - 2,200
screen scraping - 1,600
python trace - 1,000
naming molecules - 1,000
sample use case - 809
use case sample - 610
pyrssgen - 600
sample use cases - 580
boa constructor - 510 (that's a very old review of mine)
lolpython - 500
Yes folks, 2,000 people came to my site for one image I have of a use
case, from a 10-minute presentation I gave at a bioinformatics
conference trying to convince people that usability analysis is
important. I don't think it had any effect. No one came to my site
searching for information on OEChem.
60% of the page requests came from "direct address or bookmarks", 31%
from search engines, and 10% from referrers. The top referrers were
lolcode.com, then Pythonware's Daily Python-URL (probably for
lolpython), followed by the already-mentioned wide-finder (via the
effbot) and the ANTLR home page. programming.reddit.com linked to my
lolpython page, and the matplotlib cookbook links to my page showing
how to use matplotlib without a GUI.
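In case you came here the same way: the usual no-GUI approach is to
select a non-interactive backend such as Agg before pyplot is ever
imported. A minimal sketch of the idea (not a substitute for the full
page):

    # Select a non-interactive backend before pyplot is imported, so
    # matplotlib renders to files and never needs a display or GUI toolkit.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    plt.plot([1, 2, 4, 8, 16])
    plt.savefig("growth.png")   # writes the PNG directly; no window appears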
Lastly, hostname analysis. Who is 207.172.151.225? That's registered
to the RCN Corporation and resolved at
207-172-151-225.c3-0.gth-ubr1.lnh-gth.md.cable.rcn.com. They sucked
down 780 MB of my 20 GB. All to read my RSS file every hour. Whoever
it is doesn't know how to send an If-Modified-Since header, as they are
downloading the entire thing (usually unchanged) every time. How do I
complain?
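That hostname came from a reverse DNS lookup, which is a single call
in Python. A minimal sketch, using the address from my log:

    import socket

    # Reverse-resolve an address pulled from the access log.
    addr = "207.172.151.225"
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(addr)
        print(hostname)   # the rcn.com name above, when I looked
    except socket.herror:
        print("no reverse DNS entry for", addr)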
The next hog is NewsAlloy, at 207.230.13.10, which downloaded 450 MB
by making full requests every 20 minutes. I emailed them this:
Your RSS reader at 207.230.13.10 , identified as "NewsAlloy/1.1
(http://www.NewsAlloy.com; 1 subscribers)" is taking up 5% of my
upload bandwidth. While that's only 400 MB/year, the underlying reason
is that your service doesn't send the headers needed for HTTP
conditional GET. My server should only need to return a 304 Not
Modified in most cases, rather than a 200 OK (along with over 100K
of content). You poll every 20 minutes, so that adds up.
You would decrease your bandwidth use by quite a bit - perhaps an
order of magnitude - by adding support for conditional GET requests.
See for example:
http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers .
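For anyone writing a feed reader, the client side of conditional GET
is only a few lines. A minimal sketch using current Python's standard
library, with my feed URL as the example:

    import urllib.request
    import urllib.error

    FEED_URL = "http://www.dalkescientific.com/writings/diary/diary-rss.xml"

    # First fetch: save the validators the server sends back.
    with urllib.request.urlopen(FEED_URL) as f:
        body = f.read()
        last_modified = f.headers.get("Last-Modified")
        etag = f.headers.get("ETag")

    # Later polls: echo the validators back. If the feed is unchanged the
    # server answers with a tiny "304 Not Modified" instead of the full body.
    request = urllib.request.Request(FEED_URL)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(request) as f:
            body = f.read()        # 200: the feed changed, keep the new copy
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise                  # 304 means the cached copy is still good

The Last-Modified and ETag values from the first response are the
validators; echoing them back on later polls is all "conditional"
means here.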
I admit: I did this partly to see what would happen. I got an answer
within a few hours. They said it shouldn't have happened and asked
for more details. Looking into it further, I see that whoever
subscribed via their service unsubscribed a few months ago. NewsAlloy
hadn't made a request since then.
I don't know who uses NewsAlloy. I will say that they had very
responsive service.
Next on the list, at only 6 MB, is my ISP. That's me checking things
on my server; my browser's home page is my own web site. After that is
a friend (I recognized the domain name) at 4 MB. He's configured his
RSS reader to poll every 30 minutes.
Looking for hosts in my field, I see 2,000 requests from a
biotech in England. Ah-ha, it's one person, reading this from a
machine with "Windows-RSS-Platform/1.0 (MSIE 7.0; Windows NT 5.1)". Hi!
There are 700 page requests from the rest of pharma: 200 from one
site (all through Google searches finding my PyDaylight work) and 100
from another site.