This post originated from an RSS feed registered with Ruby Buzz
by Matt Bauer.
Original Post: MySQL And GeoIP
Feed Title: blogmmmultiworks
Feed URL: http://blog.mmmultiworks.com/feed/rss.xml
Feed Description: Thoughts on Ruby, hosting and more
It's 2008 and I have a couple Ruby community goals I want to achieve. One of which is to actually get the GemStats site running. I made an attempt last year but a bad design and lack of time really derailed me. Not this year though. I have a new design that is fast as snot, I'm scheduling time in my calendar to work on it and am getting a designer to help with the UI. I've also thought of some damn cool features to add but first I need to get the basics done.
There are something like 17.4 million records detailing each and every gem download. Each record needs to be sanitized and normalized which requires the geocoding of ip addresses among other things. I'm using MaxMind's Geolocating Products for this. The data is contained in two tables. The first is defined as:
CREATE INDEX geoip_locations_loc_id ON geoip_locations (loc_id)
CREATE INDEX geoip_locations_country_code ON geoip_locations (country_code)
CREATE INDEX geoip_locations_region_code ON geoip_locations (region_code)
CREATE INDEX geoip_blocks_ip_address_start ON geoip_blocks (ip_address_start)
CREATE INDEX geoip_blocks_ip_address_end ON geoip_blocks (ip_address_end)
All ip address are encoded as unsigned integers and MySQL provides two built in functions to convert from dotted octet to the unsigned integer and back: INET_ATON and INET_NTOA. Using these functions we can get the geoip_location field from an ip address with the following query:
SELECT geoip_locations.* FROM geoip_blocks, geoip_locations
WHERE INET_ATON('216.243.185.70')
BETWEEN geoip_blocks.ip_address_start
AND geoip_blocks.ip_address_end
AND geoip_blocks.loc_id = geoip_locations.loc_id
This query does indeed work but is too slow.
To make it faster MySQL needs some hints. The first hint to leverage the fact that the geoip_blocks data is already ordered in the table. This means the matching row will be the maximum ip_address_start that is just below the ip address you are searching for. MySQL can be told of this hint by just adding and ORDER BY clause which will scan the ordered ip_address_start index:
SELECT geoip_locations.* FROM geoip_blocks, geoip_locations
WHERE INET_ATON('216.243.185.70')
BETWEEN geoip_blocks.ip_address_start
AND geoip_blocks.ip_address_end
AND geoip_blocks.loc_id = geoip_locations.loc_id
ORDER BY geoip_blocks.ip_address_start DESC
This does improve the speed but it can be made faster with another hint. There is only one record for any given ip address meaning MySQL can stop looking once it finds a matching row. This hint is applied by adding a LIMIT 1 to the query:
SELECT geoip_locations.* FROM geoip_blocks, geoip_locations
WHERE INET_ATON('216.243.185.70')
BETWEEN geoip_blocks.ip_address_start
AND geoip_blocks.ip_address_end
AND geoip_blocks.loc_id = geoip_locations.loc_id
ORDER BY geoip_blocks.ip_address_start DESC LIMIT 1
This works pretty well except for larger ip address. Looking at explains, the problem is clear. When a larger ip address is used, MySQL picks up the ip_address_end index instead of the ip_address_start index. This makes the ORDER BY clause actually costly. It would be possible to make two different queries depending on the value of the ip address we are searching for but special cases are usually problematic.
A better solution is to stick a general case. Remember, the ip_address_start must be the maximum value just below the ip address being searched for. This can be made even more explicitly for MySQL:
SELECT geoip_locations.* FROM geoip_blocks
INNER JOIN (
SELECT MAX(ip_address_start) AS max_ip
FROM geoip_blocks
WHERE ip_address_start <= INET_ATON('216.243.185.70')
) s ON s.max_ip = geoip_blocks.ip_address_start
INNER JOIN geoip_locations ON geoip_locations.loc_id = geoip_blocks.loc_id
WHERE geoip_blocks.ip_address_end >= INET_ATON('216.243.185.70')
This gives execution times (as reported by MySQL) of around 0.09s for the entire ip range so long as the ip address is in the geoip_blocks table. Otherwise a full table scan occurs as MySQL tries to find a match.
Variations of the above query happen throughout GemStats and as a result I created a function to find the loc_id given an ip address:
CREATE FUNCTION ip_address_to_loc_id(ip_address INT(11) UNSIGNED)
RETURNS INTEGER
BEGIN
DECLARE loc_id_var INTEGER;
SELECT loc_id INTO loc_id_var
FROM geoip_blocks
INNER JOIN (
SELECT MAX(ip_address_start) AS max_ip
FROM geoip_blocks
WHERE ip_address_start <= ip_address
) s ON s.max_ip = geoip_blocks.ip_address_start
WHERE ip_address_end >= ip_address
RETURN loc_id_var;
END
This function executes (as reported by MySQL) in 0.00s for the entire ip address range. With this function the query now becomes:
SELECT geoip_locations.* FROM geoip_locations
WHERE ip_address_to_loc_id(INET_ATON('216.243.185.70')) = geoip_locations.loc_id
Fast and clean now. For more info see these discussions and posts: