The Artima Developer Community

Python Buzz Forum
Searching user lists

Phillip Pearson


Phillip Pearson is a Python hacker from New Zealand
Searching user lists Posted: Mar 7, 2006 5:18 PM

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: Searching user lists
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange


A year or so ago I wrote some code to search the user list for aSmallWorld, and I keep meaning to write a general tool to do that sort of job for other sites.

Originally ASW was using a MySQL query like this:

SELECT * FROM member_info WHERE firstname LIKE '%$fn%' AND lastname LIKE '%$ln%'

Testing this on a snapshot of their database, the query takes about 1.5 seconds to search through the 129K rows in the member_info table. How about copying the IDs and names into a memory table?

SET MAX_HEAP_TABLE_SIZE=99000000;
CREATE TABLE mem_inf_cache (custid INT, firstname VARCHAR(255), lastname VARCHAR(255)) ENGINE=MEMORY SELECT custid,firstname,lastname FROM member_info;

Executing the SELECT on the mem_inf_cache table takes 0.28 sec to scan the table. This is still not particularly impressive. So let's dump the data out into a text file and see what grep can do.

SELECT custid,firstname,lastname FROM member_info INTO OUTFILE '/tmp/asw_members.txt';

This produces a 2.8 megabyte text file containing user IDs and names. Running time grep 'Pearson' /tmp/asw_members.txt reports 0.013 s real, 0.002 s user and 0.011 s system. That's roughly twenty times as fast as MySQL's memory table, presumably because we don't have the overhead of MySQL's internal table format - it's just one big string.

(Going back into MySQL and making a mem table with one big string for everything results in the query taking 0.19 secs - still not bad.)

grep's not that fun to use, though. How about if we load it into Python and do the same thing?

>>> import time
>>> s = open("/tmp/asw_members.txt").read()
>>> st = time.time(); s.find("Mark"); print time.time() - st
7386
0.000225067138672

>>> st = time.time(); s.find("Marc\tCanter"); print time.time() - st
426977
0.00199580192566

>>> st = time.time(); s.find("Phillip\tPearson"); print time.time() - st
1966918
0.00865697860718

>>> st = time.time(); s.find("someone who doesn't exist"); print time.time() - st
-1
0.0148859024048

The last search, which scans the entire string, took about the same time as the grep, which makes sense: grep was probably reading from the disk cache, and Python's string find is implemented in C, so each did a similar amount of work.
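(For anyone redoing these measurements today: repeated one-off time.time() deltas are noisy, and the timeit module averages over many runs for you. A sketch in modern Python 3 syntax, using a made-up repeated record as a stand-in for the real dump file:)

```python
import timeit

# Hypothetical stand-in for the tab-separated member dump (~2.4 MB of text).
record = "92478\tPhillip\tPearson\n"
s = record * 100000

# Worst case: a needle that isn't present forces a scan of the whole string.
t = timeit.timeit(lambda: s.find("someone who doesn't exist"), number=10) / 10
print("full scan: %.4f s" % t)
```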

Now let's try doing it with a regex instead:

>>> import re
>>> st = time.time(); re.findall("Pearson", s); print time.time() - st
['Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson', 'Pearson']
0.017783164978

>>> st = time.time(); re.findall(".*?Pearson.*?", s); print time.time() - st
[..., '92478\tPhillip\tPearson', ...]
2.39711594582

So it doesn't take much longer for a simple regex to run over the string than it does to do a straight find, although it gets expensive if you try to get the regex to pull out the full string for you.
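(One way to keep the cheap plain find but still recover the full record is to locate the match with str.find and then expand to the surrounding newline boundaries by hand, instead of making the regex engine do it. A sketch in Python 3 syntax over a hypothetical three-record string, not code from the original post:)

```python
# Hypothetical miniature of the custid/firstname/lastname dump.
s = "7\tMark\tSmith\n92478\tPhillip\tPearson\n99\tJane\tDoe\n"

def find_record(haystack, needle):
    """Return the full line containing needle, or None if absent."""
    i = haystack.find(needle)
    if i == -1:
        return None
    # rfind returns -1 when the match is on the first line; -1 + 1 == 0.
    start = haystack.rfind("\n", 0, i) + 1
    end = haystack.find("\n", i)
    if end == -1:
        end = len(haystack)
    return haystack[start:end]

print(find_record(s, "Pearson"))  # 92478\tPhillip\tPearson
```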

How does this scale? If I go back to the Python code and run everything on 100 copies of the text (i.e. 287 megabytes), s.find("someone who doesn't exist") takes 1.48 s and re.findall("Pearson", s) takes 1.72 secs, so it's pretty linear - you can search 129K users in 0.015 s or 12.9M users in 1.5 s (i.e.: 8.6M users/s on an Athlon XP 2000+).

It would be nice to optimise a little further, as member searching is quite a popular activity, and the query volume will go up along with the data volume as you get more users.

There's no way to make searching a large number of names as fast as searching a small one, but you can scale better than linearly. One interesting observation: of the 129K users, there are only 19K unique first names and 19K unique last names. Those counts should grow more slowly than linearly with the number of users, so you could build a list of unique names, run the substring search on that, then query into the main index to get the user IDs once you have full strings, which should run a bit quicker in MySQL.
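(The two-stage idea above can be sketched in a few lines of Python 3: substring-search the much smaller set of unique last names, then fan the hits back out to user IDs. The records here are hypothetical stand-ins for the member_info dump:)

```python
from collections import defaultdict

# Hypothetical (custid, firstname, lastname) rows.
records = [
    (92478, "Phillip", "Pearson"),
    (12345, "Marc", "Canter"),
    (54321, "Anna", "Pearson"),
]

# Stage 1 index: unique last name -> list of user IDs sharing it.
by_lastname = defaultdict(list)
for custid, first, last in records:
    by_lastname[last].append(custid)

def search_lastname(fragment):
    # Substring scan over the unique names only, then expand to IDs.
    ids = []
    for name, custids in by_lastname.items():
        if fragment in name:
            ids.extend(custids)
    return sorted(ids)

print(search_lastname("Pears"))  # [54321, 92478]
```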

That's all I have time for today. I'd like to spend some more time looking into the problem of 'advanced search' - searching for more than just names. This can be a pain to do in MySQL because you need to create so many indices to cater to the many different queries the user may select - indexing requirements are quite different for "firstname like '%phillip%' and age > 20 and age < 30 and most_often_in='christchurch'" and "lastname like '%pearson%' and age < 30 and country='new zealand' and industry='software'", for example. It's easy enough to do a linear search (like the grep approach above) for this, but I'd like to think about better-scaling alternatives.
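(The linear grep-style approach generalises to advanced search quite naturally: hold the records in memory and filter with one predicate per criterion, no per-query index needed. A sketch in Python 3, with made-up field names and values:)

```python
# Hypothetical user records with the extra profile fields.
users = [
    {"first": "Phillip", "last": "Pearson", "age": 27,
     "country": "new zealand", "industry": "software"},
    {"first": "Marc", "last": "Canter", "age": 45,
     "country": "usa", "industry": "media"},
]

def advanced_search(users, *predicates):
    # A user matches only if every predicate accepts it - a linear scan,
    # so cost is O(users * predicates) regardless of which fields are used.
    return [u for u in users if all(pred(u) for pred in predicates)]

hits = advanced_search(
    users,
    lambda u: "pearson" in u["last"].lower(),
    lambda u: u["age"] < 30,
    lambda u: u["country"] == "new zealand",
)
print([u["first"] for u in hits])  # ['Phillip']
```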

Copyright © 1996-2019 Artima, Inc. All Rights Reserved.