The Artima Developer Community
Sponsored Link

Ruby Buzz Forum
MySQL And GeoIP

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Matt Bauer

Posts: 48
Nickname: bauer
Registered: Apr, 2007

Matt Bauer owns Mosquito Mole Multiworks - a Rails Hosting and Consulting Company
MySQL And GeoIP Posted: Jan 10, 2008 10:39 AM
Reply to this message Reply

This post originated from an RSS feed registered with Ruby Buzz by Matt Bauer.
Original Post: MySQL And GeoIP
Feed Title: blogmmmultiworks
Feed URL: http://blog.mmmultiworks.com/feed/rss.xml
Feed Description: Thoughts on Ruby, hosting and more
Latest Ruby Buzz Posts
Latest Ruby Buzz Posts by Matt Bauer
Latest Posts From blogmmmultiworks

Advertisement

It's 2008 and I have a couple Ruby community goals I want to achieve. One of which is to actually get the GemStats site running. I made an attempt last year but a bad design and lack of time really derailed me. Not this year though. I have a new design that is fast as snot, I'm scheduling time in my calendar to work on it and am getting a designer to help with the UI. I've also thought of some damn cool features to add but first I need to get the basics done.

There are something like 17.4 million records detailing each and every gem download. Each record needs to be sanitized and normalized which requires the geocoding of ip addresses among other things. I'm using MaxMind's Geolocating Products for this. The data is contained in two tables. The first is defined as:

CREATE TABLE geoip_blocks (
  loc_id int(11),
  ip_address_start int(11) unsigned,
  ip_address_end int(11) unsigned
) ENGINE=MyISAM
and has about 2.8 million records. The second table is:
CREATE TABLE geoip_locations (
    loc_id int(11),
    country_code char(2),
    region_code char(2),
    city varchar(255),
    postal_code varchar(255),
    dma_code varchar(255),
    area_code varchar(255),
    latitude float,
    longitude float
) ENGINE=MyISAM
The following indexes are then added:
CREATE INDEX geoip_locations_loc_id ON geoip_locations (loc_id)
CREATE INDEX geoip_locations_country_code ON geoip_locations (country_code)
CREATE INDEX geoip_locations_region_code ON geoip_locations (region_code)
    
CREATE INDEX geoip_blocks_ip_address_start ON geoip_blocks (ip_address_start)
CREATE INDEX geoip_blocks_ip_address_end ON geoip_blocks (ip_address_end)

All ip address are encoded as unsigned integers and MySQL provides two built in functions to convert from dotted octet to the unsigned integer and back: INET_ATON and INET_NTOA. Using these functions we can get the geoip_location field from an ip address with the following query:

SELECT geoip_locations.* FROM geoip_blocks, geoip_locations
WHERE INET_ATON('216.243.185.70') 
BETWEEN geoip_blocks.ip_address_start 
   AND geoip_blocks.ip_address_end 
AND geoip_blocks.loc_id = geoip_locations.loc_id
This query does indeed work but is too slow.

To make it faster MySQL needs some hints. The first hint to leverage the fact that the geoip_blocks data is already ordered in the table. This means the matching row will be the maximum ip_address_start that is just below the ip address you are searching for. MySQL can be told of this hint by just adding and ORDER BY clause which will scan the ordered ip_address_start index:

SELECT geoip_locations.* FROM geoip_blocks, geoip_locations
WHERE INET_ATON('216.243.185.70') 
BETWEEN geoip_blocks.ip_address_start 
  AND geoip_blocks.ip_address_end 
AND geoip_blocks.loc_id = geoip_locations.loc_id
ORDER BY geoip_blocks.ip_address_start DESC
This does improve the speed but it can be made faster with another hint. There is only one record for any given ip address meaning MySQL can stop looking once it finds a matching row. This hint is applied by adding a LIMIT 1 to the query:
SELECT geoip_locations.* FROM geoip_blocks, geoip_locations
WHERE INET_ATON('216.243.185.70')
BETWEEN geoip_blocks.ip_address_start
   AND geoip_blocks.ip_address_end
AND geoip_blocks.loc_id = geoip_locations.loc_id
ORDER BY geoip_blocks.ip_address_start DESC LIMIT 1
This works pretty well except for larger ip address. Looking at explains, the problem is clear. When a larger ip address is used, MySQL picks up the ip_address_end index instead of the ip_address_start index. This makes the ORDER BY clause actually costly. It would be possible to make two different queries depending on the value of the ip address we are searching for but special cases are usually problematic. A better solution is to stick a general case. Remember, the ip_address_start must be the maximum value just below the ip address being searched for. This can be made even more explicitly for MySQL:
SELECT geoip_locations.* FROM geoip_blocks
INNER JOIN (
   SELECT MAX(ip_address_start) AS max_ip
     FROM geoip_blocks
     WHERE ip_address_start <= INET_ATON('216.243.185.70')
   ) s ON s.max_ip = geoip_blocks.ip_address_start
INNER JOIN geoip_locations ON geoip_locations.loc_id = geoip_blocks.loc_id
WHERE geoip_blocks.ip_address_end >= INET_ATON('216.243.185.70')
This gives execution times (as reported by MySQL) of around 0.09s for the entire ip range so long as the ip address is in the geoip_blocks table. Otherwise a full table scan occurs as MySQL tries to find a match.

Variations of the above query happen throughout GemStats and as a result I created a function to find the loc_id given an ip address:

CREATE FUNCTION ip_address_to_loc_id(ip_address INT(11) UNSIGNED)
  RETURNS INTEGER
BEGIN
  DECLARE loc_id_var INTEGER;
  SELECT loc_id INTO loc_id_var
  FROM geoip_blocks
  INNER JOIN (
    SELECT MAX(ip_address_start) AS max_ip
    FROM geoip_blocks
    WHERE ip_address_start <= ip_address
   ) s ON s.max_ip = geoip_blocks.ip_address_start
  WHERE ip_address_end >= ip_address
  RETURN loc_id_var;
END
This function executes (as reported by MySQL) in 0.00s for the entire ip address range. With this function the query now becomes:
SELECT geoip_locations.* FROM geoip_locations
WHERE ip_address_to_loc_id(INET_ATON('216.243.185.70')) = geoip_locations.loc_id
Fast and clean now. For more info see these discussions and posts:

MySQL Forum - slow select to find an ip in a range
On efficiently geo referencing ips with maxmind geoip and mysql gis
Fast mysql range queries on maxmind geoip tables

Read: MySQL And GeoIP

Topic: What's New in Edge Rails - In Chinese! Previous Topic   Next Topic Topic: Last day for Early-Bird Prices

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use