Log analysis of my website
I write these essays in part as a promotional activity. I'm a
consultant, and expect people to find out more about what I do through
reading what I've written.
I've wondered if it's been useful, but have put off doing the analysis
of my website. At first it was because I didn't have enough essays to
do interpretable analysis. And then I just put it off. At the German
Chemoinformatics Conference I talked to quite a few people, mostly
grad students, who had gotten information from my site. That was
enough to make me finally do some analysis.
I used awstats, chosen after some web searching. I wanted something
that could analyze my Apache logs and generate static pages. There are
other tools, but since awstats did what I wanted I didn't try
anything else.
So far this year I've had 1.1 million "hits", which corresponds to
330,000 page views. A "hit" counts every request, including images, so
one page view can produce several hits because of CSS, images, and
other embedded content. Another nearly 500,000 page views come from
web spiders and other identifiably non-human requests - more page
requests from robots than from people. All told, I use less than 20 GB
of bandwidth per year. I use pair Networks for my hosting; my basic
account allows 400 GB/month of transfer, so I'm not even close.
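awstats makes the hit/page-view split for me, but it's easy to
approximate from the raw log. A rough sketch in Python, assuming a
combined-format Apache log named access_log; the filename and the list
of embedded-content extensions are my own guesses, not awstats's
actual rules:

    import re
    from collections import Counter

    # Matches the request and size fields of an Apache "combined" log line, e.g.
    # 1.2.3.4 - - [12/Nov/2007:10:00:00 +0100] "GET /x.html HTTP/1.1" 200 5120 "-" "..."
    LOG_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+" \d{3} (\d+|-)')

    # Assumption: requests for these suffixes are embedded content, not pages.
    EMBEDDED = (".css", ".js", ".png", ".gif", ".jpg", ".ico")

    counts = Counter()
    total_bytes = 0
    with open("access_log") as log:
        for line in log:
            m = LOG_RE.search(line)
            if m is None:
                continue  # malformed or unusual request line
            path, size = m.groups()
            counts["hits"] += 1
            if not path.lower().endswith(EMBEDDED):
                counts["page views"] += 1
            if size != "-":
                total_bytes += int(size)

    print(counts["hits"], "hits;", counts["page views"], "page views;",
          "%.1f GB sent" % (total_bytes / 1e9))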
Of the robots, Yahoo Slurp pulled down 1.6 GB, MSNBot 810 MB, and
Googlebot 290 MB. Another 80 MB went to Google's RSS reader, 7 MB to
Bloglines, and 5 MB to UniversalFeedParser. Of the users, 64.5% use
Windows, 17% use Linux, 11.5% use Macs, and, jumping over the BSD and
Solaris users,
a full 88 requests came from an IRIX machine. The browser stats are 45% Firefox, 33.5% IE, 4% each for Mozilla and Safari, and 3% Opera.
Top hit (no surprise) is my RSS feed, viewed 82,000 times this year -
including by aggregators, so interpret the count as you wish. Next was
my LOLPython page, which wasn't a surprise. I wrote it deliberately
because of the then-high popularity of lolcats and lolcode. It got
17,500 views.
About 1,200 downloads from people who weren't me.
The next two were surprising. I did a series of lectures for
the NBN. These were for the most part graduate students in
biology, going into computational biology, who needed more programming
training. The page on Javascript validation got 7,300 hits, and the
one on threads in Python got 5,800. My screen scraping page was also
popular, at 5,600 views.
Going further down the list:
Naming molecules is the first chemistry page, at 4,300 hits - I think because it uses the word "vodka".
My wide finder commentary is only a few months old and already sits at #11, with 4,200 hits. Basking in Tim Bray's shadow.
3,200 people viewed this slide. Why? People searching for "sample use case". But it's an image - how do the search engines know about it?
The ANTLR work I did is also popular. Only 50 days old and already 2,500 hits. Well, it was on the ANTLR home page for a while.
I do a lot of work with cheminformatics, but that's the details. In
most cases my topic is more general, like how to write a C extension
for Python (that just happens to use a chemistry toolkit). The
highest cheminformatics-specific hit is my article on SMILES
tokenization, with 1,500 hits. Most of the links come from
Wikipedia's SMILES
page. My most popular bioinformatics page is on
BLAST parsing at just under 1,400 hits.
You can easily see that most people who come to my pages are there
because of popular topics of the day (LOLPython, wide-finder) or
general computing questions (threading, validation, HTML templates,
Python, ANTLR). Very few came to my pages for cheminformatics reasons.
Then again, there are very few people doing cheminformatics.
The top search phrases were:
python basics - 2,200
screen scraping - 1,600
python trace - 1,000
naming molecules - 1,000
sample use case - 809
use case sample - 610
pyrssgen - 600
sample use cases - 580
boa constructor - 510 (that's a very old review of mine)
lolpython - 500
Yes folks, 2,000 people came to my site for one image I have of a use
case, from a 10-minute presentation I gave at a bioinformatics
conference trying to convince people that usability analysis is
important. I don't think it had any effect. No one came to my site
searching for information on OEChem.
60% of the page requests came from "direct address or bookmarks", 31%
from search engines, and 10% from referrers. The top referrers were
lolcode.com, then Pythonware's Daily Python-URL (probably for
lolpython), followed by the already-mentioned wide-finder (via the
effbot) and the ANTLR home page. programming.reddit.com linked to my
lolpython page, and the matplotlib cookbook links to my page showing
how to use matplotlib without a GUI.
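In case you came here the same way: the usual no-GUI approach is to
select a non-interactive backend such as Agg before pyplot is ever
imported. A minimal sketch of the idea (not a substitute for the full
page):

    # Select a non-interactive backend before pyplot is imported, so
    # matplotlib renders to files and never needs a display or GUI toolkit.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    plt.plot([1, 2, 4, 8, 16])
    plt.savefig("growth.png")   # writes the PNG directly; no window appears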
Lastly, hostname analysis. Who is 207.172.151.225? That's registered
to the RCN Corporation and resolved at
207-172-151-225.c3-0.gth-ubr1.lnh-gth.md.cable.rcn.com. They sucked
down 780 MB of my 20 GB. All to read my RSS file every hour. Whoever
it is doesn't know how to send an If-Modified-Since header, as they are
downloading the entire thing (usually unchanged) every time. How do I
complain?
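That hostname came from a reverse DNS lookup, which is a single call
in Python. A minimal sketch, using the address from my log:

    import socket

    # Reverse-resolve an address pulled from the access log.
    addr = "207.172.151.225"
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(addr)
        print(hostname)   # the rcn.com name above, when I looked
    except socket.herror:
        print("no reverse DNS entry for", addr)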
The next hog is NewsAlloy, at 207.230.13.10, which downloaded 450 MB
by making full requests every 20 minutes. I emailed them this:
Your RSS reader at 207.230.13.10 , identified as "NewsAlloy/1.1
(http://www.NewsAlloy.com; 1 subscribers)" is taking up 5% of my
upload bandwidth. While that's only 400 MB/year, the underlying reason
is that your service doesn't send the headers needed for HTTP
conditional GET. My server should only need to return a 304 Not
Modified in most cases, rather than a 200 OK (along with over 100K
of content). You poll every 20 minutes, so that adds up.
You would decrease your bandwidth use by quite a bit - perhaps an
order of magnitude - by adding support for conditional GET requests.
See for example:
http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers .
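For anyone writing a feed reader, the client side of conditional GET
is only a few lines. A minimal sketch using current Python's standard
library, with my feed URL as the example:

    import urllib.request
    import urllib.error

    FEED_URL = "http://www.dalkescientific.com/writings/diary/diary-rss.xml"

    # First fetch: save the validators the server sends back.
    with urllib.request.urlopen(FEED_URL) as f:
        body = f.read()
        last_modified = f.headers.get("Last-Modified")
        etag = f.headers.get("ETag")

    # Later polls: echo the validators back. If the feed is unchanged the
    # server answers with a tiny "304 Not Modified" instead of the full body.
    request = urllib.request.Request(FEED_URL)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(request) as f:
            body = f.read()        # 200: the feed changed, keep the new copy
    except urllib.error.HTTPError as err:
        if err.code != 304:
            raise                  # 304 means the cached copy is still good

The Last-Modified and ETag values from the first response are the
validators; echoing them back on later polls is all "conditional"
means here.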
I admit: I did this partly to see what would happen. I got an answer
within a few hours. They said it shouldn't have happened and asked
for more details. Looking into it further, I see that whoever
subscribed via their service unsubscribed a few months ago. NewsAlloy
hadn't made a request since then.
I don't know who uses NewsAlloy. I will say that they had very
responsive service.
Next on the list, at only 6 MB, is my ISP. That's me checking things
on my server; my browser's home page is my own web site. After that is
a friend (I recognized the domain name) at 4 MB. He's configured his
RSS reader to poll every 30 minutes.
Looking for hosts in my field, I see 2,000 requests from a
biotech in England. Ah-ha, it's one person, reading this from a
machine with "Windows-RSS-Platform/1.0 (MSIE 7.0; Windows NT 5.1)". Hi!
There are 700 page requests from the rest of pharma: 200 from one
site (all through Google searches finding my PyDaylight work) and 100
from another site.