The Artima Developer Community
Sponsored Link

Python Buzz Forum
Removing duplicates using itertools.groupby

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Thomas Guest

Posts: 236
Nickname: tag
Registered: Nov, 2004

Thomas Guest likes dynamic languages.
Removing duplicates using itertools.groupby Posted: Oct 7, 2008 10:57 PM
Reply to this message Reply

This post originated from an RSS feed registered with Python Buzz by Thomas Guest.
Original Post: Removing duplicates using itertools.groupby
Feed Title: Word Aligned: Category Python
Feed URL: http://feeds.feedburner.com/WordAlignedCategoryPython
Feed Description: Dynamic languages in general. Python in particular. The adventures of a space sensitive programmer.
Latest Python Buzz Posts
Latest Python Buzz Posts by Thomas Guest
Latest Posts From Word Aligned: Category Python

Advertisement

Eliminating duplicates

Previously I discussed merging sorted streams using Python; a call to a standard library function for those who’ve updated to 2.6.

>>> from heapq import merge
>>> from itertools import count, imap, islice
>>> m2, m3, m5 = [imap(n.__mul__, count(1)) for n in (2, 3, 5)]
>>> m235 = merge(m2, m3, m5)
>>> list(islice(m235, 10))
[2, 3, 4, 5, 6, 6, 8, 9, 10, 10]

Here, we merge positive multiples of 2, 3 and 5 into a single stream. 6 appears twice in the output, being a multiple of both 2 and 3. 10 is similarly duplicated and 30 would feature three times.

Itertools.groupby can remove the repeated entries.

>>> from itertools import groupby
>>> help(groupby)
Help on class groupby in module itertools:

class groupby(__builtin__.object)
 |  groupby(iterable[, keyfunc]) -> create an iterator which returns
 |  (key, sub-iterator) grouped by each value of key(value).

If the key function is not specified or is None it defaults to an identity function, and groupby partitions an iterable into subiterators over equal valued elements.

>>> from operator import itemgetter
>>> first = itemgetter(0)
>>> m235 = merge(*(imap(n.__mul__, count(1)) for n in (2, 3, 5)))
>>> u235 = imap(first, groupby(m235))
>>> list(islice(u235, 10))
[2, 3, 4, 5, 6, 8, 9, 10, 12, 14]

Groupby is very like the Unix uniq tool.

Generally (the documentation goes on to say) “the iterable needs to already be sorted on the same key function” but groupby is perfectly well defined when this isn’t the case. Here’s an artificial example showing groupby turning a random sequence into an alternating one.

Random sequence

Let’s toss a coin to generate a random sequence of heads and tails.

>>> def toss_a_coin():
...     from random import choice
...     return random.choice("HT")
...
>>> from itertools import repeat
>>> def call(f):
...     return f()
...
>>> tosses = imap(call, repeat(toss_a_coin))
>>> spacer = ' '.join
>>> >>> spacer(list(islice(tosses, 7)))
'T T H T H H H'

tailstailsheadstailsheadsheadsheads

Alternating sequence

We can filter the unique values from this random sequence of coin tosses to generate an alternating sequence.

>>> def uniq(seq):
...     return imap(first, groupby(seq))
... 
>>> spacer(list(islice(uniq(tosses), 7)))
'H T H T H T H'

headstailsheadstailsheadstailsheads

Constant sequence

Every other element of this filtered sequence will be the same.

>>> tails = islice(uniq(tosses), 0, None, 2)
>>> spacer(list(islice(tails, 7)))
'T T T T T T T'

tailstailstailstailstailstailstails

Applying uniq again gives a single element infinitely repeated.

>>> tt = uniq(tails)
>>> tt.next()
'T'
>>> tt.next()
  C-c C-cTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in call
  File "<stdin>", line 2, in toss_a_coin
KeyboardInterrupt

tails

Uniq -c

Imitating uniq -c, we can yield a repeat count paired to each unique value.

>>> def ilen(it):
...     return sum(1 for _ in it)
... 
>>> def counted_uniq(seq):
...     return ((ilen(i), k) for k, i in groupby(seq))
... 
>>> ht = counted_uniq(tosses)
>>> list(islice(ht, 5))
[(1, 'H'), (3, 'T'), (1, 'H'), (3, 'T'), (3, 'H')]

How long before we see a run of 7 equal throws?

>>> from itertools import takewhile
>>> tosses = imap(call, repeat(toss_a_coin))
>>> counts = imap(first, counted_uniq(tosses))
>>> sum(takewhile(lambda n: n < 7, counts))
44

tailstailstailstailstailstailstails

Progression

The coin illustrating this note is a 1956 farthing, a small coin with a value of just ¼ of a penny dating from a time when there were 240 pence in every pound. The reverse features a wren, the obverse Queen Elizabeth II. The farthing was demonetised in 1960. In 1971 British currency went decimal and now, 37 years later, the 1971 coin designs are being revamped. Hiding amongst the loose change in my pocket today I noticed a couple of shiny new 2008 pennies. It’s about time we got euros.

2008 small change

Read: Removing duplicates using itertools.groupby

Topic: High Fructose Corn Syrup a Natural Product? Previous Topic   Next Topic Topic: Make sure mod_python and python match unicode settings

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use