The Artima Developer Community

Artima Developer Spotlight Forum
Optimizing ArrayList.removeAll()

41 replies on 3 pages. Most recent reply: Mar 9, 2007 9:39 PM by damien morton

amin ahmad

Optimizing ArrayList.removeAll() Posted: Feb 28, 2007 12:30 PM

Recently, while working on the design of a high-performance data structure in Java, I made a startling discovery about the performance of ArrayList.removeAll(): the average runtime of this method under Sun's Java 6 SDK is O(ns), where n is the size of the array list and s is the number of items to remove! Worst-case performance is an even more dismal O(n²).

This short article explains the problem in detail, and illustrates two solutions with O(n) worst-case performance. Finally, a concrete performance test is presented to justify the theoretical discussion.

The Problem

The problem with ArrayList.removeAll() was discovered in Sun's Java 6 SDK. ArrayList does not directly implement the removeAll() method: rather, it inherits the implementation from its superclass, AbstractCollection. The listing below shows that implementation.

public boolean removeAll(Collection<?> c) {
    boolean modified = false;
    Iterator<?> e = iterator();
    while (e.hasNext()) {
        if (c.contains(e.next())) {
            e.remove();
            modified = true;
        }
    }
    return modified;
}
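
Each call to e.remove() here lands in ArrayList.remove(int), which shifts every element after the removal point forward by one: an O(n) operation repeated once per removed item, giving O(ns) overall. A throwaway harness along the following lines (hypothetical, not part of the benchmark suite referenced below) makes the growth visible; doubling s roughly doubles the runtime.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveAllTiming {
    // Build a list of n Integers, then remove s of them spread across the
    // list. A HashSet is passed in so that contains() is O(1) and only the
    // cost of ArrayList's element shifting is being measured.
    static long time(int n, int s) {
        List<Integer> list = new ArrayList<Integer>();
        for (int i = 0; i < n; i++) list.add(i);
        Set<Integer> toRemove = new HashSet<Integer>();
        for (int i = 0; i < s; i++) toRemove.add(i * (n / s));
        long start = System.nanoTime();
        list.removeAll(toRemove);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        for (int s = 100; s <= 1600; s *= 2)
            System.out.println("s=" + s + ": " + time(100000, s) + "ns");
    }
}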

Two Solutions

Luckily, correcting the performance of removeAll() happens to be rather easy. Only a single pass through the array is required, and the algorithms are in-place to boot. The listing below shows a simple re-implementation of removeAll() within the ArrayList class itself.

public boolean removeAll(Collection<?> c) {
    int x = 0;
    int i = 0;
    for (i = 0; i < size; i++) {
        if (!c.contains(elementData[i])) {
            // compact the surviving elements toward the front
            elementData[x++] = elementData[i];
        }
    }
    size = x;
    return x != i;
}

Note that this implementation also avoids unnecessary re-computation of the modified flag: the standard algorithm resets it for each item removed. This algorithm would be optimal were it not for the fact that most operating systems provide extremely fast native operations for moving blocks of memory around, and this algorithm doesn't manage to get in on the action.

To take advantage of fast memory copies, we need to utilize System.arraycopy() to move data around. This comes at the expense of a more complicated implementation.

public boolean removeAll(Collection<?> c) {
    int oldHi = 0, newHi = 0, top = 0;
    for (int i = 0; i < size; ++i) {
        if (c.contains(elementData[i])) {
            oldHi = newHi;
            newHi = i;

            // At the end of this loop, newHi will be the non-inclusive
            // upper limit of the range to delete.
            while (++newHi < size && c.contains(elementData[newHi]));

            // Copy the surviving block that precedes the deleted range.
            final int length = i - oldHi;
            System.arraycopy(elementData, oldHi, elementData, top, length);
            i = newHi;
            top += length;
        }
    }
    if (newHi > 0) {
        // Copy the final surviving block and shrink the logical size.
        System.arraycopy(elementData, newHi, elementData, top, size - newHi);
        size += top - newHi;
        return true;
    } else {
        return false;
    }
}

The algorithm above represents my best effort at speed. It copies blocks rather than individual elements, and uses native code to move the data. Like the previous function, it doesn't waste time recomputing a modification flag.

A series of benchmarks and their results are available from the original source of this article at:

Optimizing ArrayList.removeAll


James Watson

Re: Optimizing ArrayList.removeAll() Posted: Feb 28, 2007 1:45 PM

> To take advantage of fast memory copies, we need to
> utilize System.arraycopy() to move data around. This
> comes at the expense of a more complicated
> implementation.

Your test results would suggest that this 'optimized' implementation is fairly equivalent to the first. In some cases it should decrease performance (when the lists are largely the same), as your test results show. On a side note, setting the flag multiple times is negligible. In any event, both are still O(ns).

One other way the algorithm could be changed would be to add all the items in the parameter Collection c to a HashSet and then check the elements in n against the HashSet. This should make the time complexity O(n + s). Whether that amounts to improved performance would depend on the size and nature of the two collections. And there is of course a trade-off with increased memory usage.
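
A sketch of what I mean, reusing the article's first rewrite (elementData and size are the ArrayList internals from the article; copying c into a HashSet up front is the only change):

// Sketch: copy c into a HashSet so each contains() probe is O(1) on
// average, making the whole pass O(n + s) rather than O(ns).
public boolean removeAll(Collection<?> c) {
    Set<Object> lookup = new HashSet<Object>(c); // O(s) time and space
    int x = 0, i = 0;
    for (i = 0; i < size; i++) {
        if (!lookup.contains(elementData[i])) {
            elementData[x++] = elementData[i];
        }
    }
    size = x;
    return x != i;
}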

amin ahmad

Re: Optimizing ArrayList.removeAll() Posted: Feb 28, 2007 1:56 PM

> > To take advantage of fast memory copies, we need to
> > utilize System.arraycopy() to move data around. This
> > comes at the expense of a more complicated
> > implementation.
>
> Your test results would suggest that this 'optimized'
> implementation is fairly equivalent to the first.

If by "the first", you mean the first O(n) solution, you are correct. If you mean Sun's implementation, take another look at the numbers.

> In some cases it should decrease performance (when the
> lists are largely the same), as your test results show.
> On a side note, setting the flag multiple times is
> negligible. In any event, both are still O(ns).

How are both still O(ns)? Sun's impl is O(ns), not the implementations I have provided!

>
> One other way the algorithm could be changed would be to
> add all the items in the parameter Collection c to a
> HashSet and then check the elements in n against the
> HashSet. This should make the time complexity O(n + s).
> Whether that amounts to improved performance would depend
> on the size and nature of the two collections. And there
> is of course a trade-off with increased memory usage.

The removeAll method should work directly with the collection it's given. The performance of the collection's contains() method is a separable concern. As you mention, a HashSet would be the most efficient data structure to supply to the method. However, you would not want to move the items to be removed into a HashSet /inside/ the removeAll method. It would limit flexibility, cause unnecessary memory allocation, and, in general, the overhead would almost always be unjustified.

Alan Keefer

Re: Optimizing ArrayList.removeAll() Posted: Feb 28, 2007 3:12 PM

The remove method is being called on the Iterator, not on the ArrayList. The version of Iterator returned by AbstractList implements remove() such that it calls remove(int) rather than remove(Object), and in ArrayList remove(int) is basically just an arraycopy operation. So in fact the original implementation has the same algorithmic complexity as the two alternative implementations, even though they might be slightly more efficient by not doing a copy on every removal.
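
For reference, remove(int) in the Java 6 ArrayList sources is essentially the following (paraphrased; the shift is the arraycopy in the middle):

public E remove(int index) {
    RangeCheck(index);
    modCount++;
    E oldValue = (E) elementData[index];
    int numMoved = size - index - 1;
    if (numMoved > 0) // shift the tail left by one slot: O(n - index)
        System.arraycopy(elementData, index + 1, elementData, index, numMoved);
    elementData[--size] = null; // let gc do its work
    return oldValue;
}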

amin ahmad

Re: Optimizing ArrayList.removeAll() Posted: Feb 28, 2007 4:26 PM

> The remove method is being called on the Iterator, not on
> the ArrayList. The version of Iterator returned by
> AbstractList implements remove() such that it calls
> remove(int) rather than remove(Object), and in ArrayList
> remove(int) is basically just an arraycopy operation. So

Agreed.

> in fact the original implementation has the same
> algorithmic complexity as the two alternative
> implementations, even though they might be slightly more
> efficient by not doing a copy on every removal.

I don't think that's correct. Remember, the array copy operation works by shifting all elements after the element to be removed forward by one position. That's an O(n) procedure. This in turn is repeated /each time/ an item that needs to be removed is encountered. If there are s items to remove, the runtime is therefore O(n*s), which has a worst-case time of O(n²).

Another way to verify what I am saying is to download the benchmarks and observe that doubling the number of items to remove from the array list roughly doubles the runtime of Sun's implementation, whereas the runtimes of the O(n) implementations stay unchanged.

James Watson

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 6:24 AM

> How are both still O(ns)? Sun's impl is O(ns), not the
> implementations I have provided!

You are iterating over every element in the current set. For each element you are calling c.contains(), and c.contains() has a base O(n) complexity.

So if n is the number of elements in your set, and s is the number of elements in c, you are calling an O(s) operation n times == O(n * s) == O(ns).

> > One other way the algorithm could be changed would be
> > to add all the items in the parameter Collection c to a
> > HashSet and then check the elements in n against the
> > HashSet. This should make the time complexity O(n + s).
> > Whether that amounts to improved performance would
> > depend on the size and nature of the two collections.
> > And there is of course a trade-off with increased
> > memory usage.
>
> The removeAll method should work directly with the
> collection it's given.

Where is that rule written? The Collections class does this kind of thing in sort().

> The performance of the collection's
> contains() method is a separable concern.

Not if you care about overall performance, it's not.

> As you mention,
> HashSets would be the most efficient data structure to
> supply to the method. However, you would not want to move
> the items to be removed into a hashset /inside/ the
> removeAll method. It would limit flexibility,

How would it limit flexibility?

> cause
> unnecessary memory allocation, and, in general, the
> overhead would almost always be unjustified.

Unjustified? With 2500 elements in each collection, O(ns) -> 6250000; O(n + s) -> 5000, i.e. reduced 1250 times. This comes at a cost of, oh, let's say 50K of memory (it's probably less). A lot of applications could benefit from this tradeoff. I've worked in environments where 50 MB of memory was trivial. So if n and s were 2500000, the reduction in time complexity from 6250000000000 to 5000000 is 1250000 times less, for a temporary memory usage that we don't care about.

Why not add it to your test suite?

I'm not going to say it should be done in the JDK. The memory usage could be too great for some applications. This is just a naive solution, anyway. It could be altered to reduce memory consumption. It's definitely something to consider for real (speed) optimization and is definitely not unjustified. It's much more justified than worrying about setting a boolean more than once.

amin ahmad

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 6:49 AM

> > How are both still O(ns)? Sun's impl is O(ns), not the
> > implementations I have provided!
>
> You are iterating over every element in the current set.
> For each element you are calling c.contains(), and
> c.contains() has a base O(n) complexity.
>
> So if n is the number of elements in your set, and s is
> the number of elements in c, you are calling an O(s)
> operation n times == O(n * s) == O(ns).

I understand your analysis, but your statement that c.contains() has a base O(n) complexity is not always true.

>
> > > One other way the algorithm could be changed would
> > > be to add all the items in the parameter Collection
> > > c to a HashSet and then check the elements in n
> > > against the HashSet. This should make the time
> > > complexity O(n + s). Whether that amounts to improved
> > > performance would depend on the size and nature of
> > > the two collections. And there is of course a
> > > trade-off with increased memory usage.
> >
> > The removeAll method should work directly with the
> > collection it's given.
>
> Where is that rule written? The Collections class does
> this kind of thing in sort().

I didn't realize this. However, O(n) preprocessing is more justifiable with sort, given that it's O(n log n). By "justifiable" I mean "more likely to pay dividends".
>
> > The performance of the collection's
> > contains() method is a separable concern.
>
> Not if you care about overall performance, it's not.

???
Whether or not you care about overall performance, the performance of contains() /is/ separable when it comes to computing the complexity of the removeAll method. The speed of memory allocation or comparison operations /is/ a concern for the performance of mergesort, and yet those concerns are separable during a performance analysis of mergesort.

>
> > As you mention,
> > HashSets would be the most efficient data structure to
> > supply to the method. However, you would not want to
> move
> > the items to be removed into a hashset /inside/ the
> > removeAll method. It would limit flexibility,
>
> How would it limit flexibility?

What if the data structure a developer passes in has better contains() performance than HashSet (e.g., a hash set specialized for ints)? There would be nothing he could do to avoid using the HashSet and instead use the faster collection as-is. This is what I mean by "limits flexibility".

>
> > cause
> > unnecessary memory allocation, and, in general, the
> > overhead would almost always be unjustified.
>
> Unjustified? With 2500 elements in each collection, O(ns)
> -> 6250000; O(n + s) -> 5000, i.e. reduced 1250 times.
> This comes at a cost of, oh, let's say 50K of memory
> (it's probably less). A lot of applications could benefit
> from this tradeoff. I've worked in environments where 50
> MB of memory was trivial. So if n and s were 2500000, the
> reduction in time complexity from 6250000000000 to
> 5000000 is 1250000 times less, for a temporary memory
> usage that we don't care about.
>
> Why not add it to your test suite?

I'll probably do this, though, for the record, my benchmarks are already passing HashSets to removeAll.

>
> I'm not going to say it should be done in the JDK. The
> memory usage could be too great for some applications.
> This is just a naive solution, anyway. It could be
> altered to reduce memory consumption. It's definitely
> something to consider for real (speed) optimization and
> is definitely not unjustified. It's much more justified
> than worrying about setting a boolean more than once.

The benefit I see of your solution is that it puts a hard limit of O(n) on the removeAll method, /regardless/ of the performance of the original contains() method. This comes at the expense of some upfront O(n) (presumably) processing that would be unnecessary if a developer were already passing in a fast collection.

James Watson

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 7:39 AM

> > > How are both still O(ns)? Sun's impl is O(ns), not
> > > the implementations I have provided!
> >
> > You are iterating over every element in the current
> > set. For each element you are calling c.contains(),
> > and c.contains() has a base O(n) complexity.
> >
> > So if n is the number of elements in your set, and s
> > is the number of elements in c, you are calling an
> > O(s) operation n times == O(n * s) == O(ns).
>
> I understand your analysis, but your statement that
> c.contains() has a base O(n) complexity is not always true.

Yes it is. Some collections have better time complexity. Regardless, the base for all collections is still O(n).

> ???
> Whether or not you care about overall performance, the
> performance of contains /is/ separable when it comes to
> computing the complexity of the removeAll method. Just
> like the speed of memory allocation or comparison
> operations /is/ a concern for the performance of
> mergesort, but yet, those concerns are separable during
> performance analysis of mergesort.

If you don't want to separate them, you must assume the lowest common denominator for the complexity of contains(), which is O(n). You have assumed the best case, O(1), and built your tests around that. Changing your code to pass in an ArrayList and adding the implementation I have suggested gives the following results:

testing java.util.ArrayList
removing 150 from head of list (size=2500) average time=10033207ns.
removing 150 from tail of list (size=2500) average time=9554117ns.
removing nothing from list (size=2500) average time=255920ns.
removing 150 from interleaved every other with the first 300 of list (size=2500) average time=9841237ns.
removing 150 random items from list (size=2500) average time=9800201ns.
testing org.ahmadsoft.arrays.ArrayList2
removing 150 from head of list (size=2500) average time=9270435ns.
removing 150 from tail of list (size=2500) average time=9104636ns.
removing nothing from list (size=2500) average time=106303ns.
removing 150 from interleaved every other with the first 300 of list (size=2500) average time=8959013ns.
removing 150 random items from list (size=2500) average time=9193129ns.
testing org.ahmadsoft.arrays.ArrayList3
removing 150 from head of list (size=2500) average time=8991530ns.
removing 150 from tail of list (size=2500) average time=10572200ns.
removing nothing from list (size=2500) average time=78511ns.
removing 150 from interleaved every other with the first 300 of list (size=2500) average time=8503098ns.
removing 150 random items from list (size=2500) average time=9291676ns.
testing org.ahmadsoft.arrays.ArrayList4
removing 150 from head of list (size=2500) average time=212747ns.
removing 150 from tail of list (size=2500) average time=209233ns.
removing nothing from list (size=2500) average time=138670ns.
removing 150 from interleaved every other with the first 300 of list (size=2500) average time=218966ns.
removing 150 random items from list (size=2500) average time=230186ns.


I expect the ArrayList4 version will pull even farther ahead as n and s increase but I can't stand waiting for the other tests to execute with higher capacities.

> What if the data structure a developer passes in has
> better contains() performance than HashSet (e.g., a hash
> set specialized for ints)?

In terms of time complexity, you cannot beat O(1) without quantum computers.

> There would be nothing he could do
> to avoid using the HashSet and instead use the faster
> collection as-is. This is what I mean by "limits
> flexibility".

Surprisingly, before I realized you were passing in HashSets already, my new version was basically just as fast and actually edged out the other new versions in some tests (I'm guessing this was a result of JIT because it doesn't make any sense to me otherwise.)

> I'll probably do this, though, for the record, my
> benchmarks are already passing HashSets to removeAll.

Yes. I think that's not correct. It's 'cooking the books'.

> The benefit I see of your solution is that it puts a hard
> limit of O(n) on the removeAll method, /regardless/ of
> the performance of the original contains() method. This
> comes at the expense of some upfront O(n) (presumably)
> processing that would be unnecessary if a developer were
> already passing in a fast collection.

One thing that could be done if we are talking about fixing the JDK is to add a marker interface similar to RandomAccess. It could be called FastContains or something. This could be used to mark TreeSet or other non-standard collections with better contains performance.

You could also add code to check the size of the input collection: if it's smaller than some threshold, just execute normally; under some larger threshold, use a sorted structure; beyond that, use a hash table. The only problem would be some rare circumstances where memory was at a high premium. My biased opinion is that those rare circumstances could be worked around, because for the vast majority of cases this kind of thing would be a big improvement.
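
Roughly what I have in mind, as a sketch (the method name chooseLookup and the threshold values are invented for illustration, and the sorted branch assumes Comparable elements):

// Hypothetical dispatch used inside removeAll(); thresholds are made up.
private Collection<?> chooseLookup(Collection<?> c) {
    if (c instanceof Set || c.size() <= 8) {
        return c; // tiny, or contains() is already presumed fast
    } else if (c.size() <= 64) {
        return new TreeSet<Object>(c); // sorted structure, O(log s) contains()
    } else {
        return new HashSet<Object>(c); // hash table, O(1) contains()
    }
}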

Another interesting test would be to sort both collections (in new ArrayLists or even raw arrays) and loop over both of them once. I believe the time complexity for that is O(n*log(n) + s*log(s) + n), which isn't as good in terms of time complexity but probably has a little less overhead.
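
Sorting the list itself would scramble its order, so here is a close cousin of that idea as a sketch (assumes Comparable elements whose compareTo is consistent with equals): sort a copy of the removal collection once, then binary-search it for each list element, for O(s log s + n log s) overall.

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

class SortSearchRemove {
    // Keep each element that binary search fails to find among the sorted
    // removal candidates; the list's original order is preserved.
    static <T extends Comparable<? super T>> List<T> removeAll(List<T> list, Collection<? extends T> c) {
        List<T> removals = new ArrayList<T>(c);
        Collections.sort(removals); // O(s log s), done once
        List<T> kept = new ArrayList<T>(list.size());
        for (T item : list) {
            if (Collections.binarySearch(removals, item) < 0) {
                kept.add(item); // not a removal candidate
            }
        }
        return kept;
    }
}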

Tobias Mattsson

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 11:35 AM

Don't forget to null out the remaining unused slots of the underlying array. Neglecting to do so will cause memory leaks.
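
Against the article's first rewrite, the fix is a few lines after the compaction loop (a sketch; x is the number of surviving elements, and size still holds the old length at this point):

// Null out the abandoned tail so the removed objects become eligible
// for garbage collection, then shrink the logical size.
for (int j = x; j < size; j++) {
    elementData[j] = null;
}
size = x;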

amin ahmad

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 11:42 AM

> Don't forget to null out the remaining unused slots of the
> underlying array. Neglecting to do so will cause memory
> leaks.

Thanks, that's a good catch. I'll go back and fix this.

amin

Mark Thornton

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 11:47 AM

Using a HashSet internally might change the set or even the number of elements removed --- the Collection passed to removeAll might be based on a different concept of equality than HashSet's. For example, consider a HashSet of Integers vs. the keySet of an IdentityHashMap of Integers (it doesn't matter what they map to, of course).
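
A tiny demonstration of the divergence (the class name EqualityDemo is arbitrary; new Integer(1000) is used deliberately to get two objects that are equals() but not ==):

import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;

public class EqualityDemo {
    public static void main(String[] args) {
        Integer a = new Integer(1000);
        Integer b = new Integer(1000); // equals() to a, but a != b

        Set<Integer> hashSet = new HashSet<Integer>();
        hashSet.add(a);

        Map<Integer, Boolean> m = new IdentityHashMap<Integer, Boolean>();
        m.put(a, Boolean.TRUE);
        Set<Integer> identityKeys = m.keySet();

        System.out.println(hashSet.contains(b));      // true: equals-based
        System.out.println(identityKeys.contains(b)); // false: identity-based
    }
}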

James Watson

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 2:56 PM

> Using a HashSet internally might change the set or even
> the number of elements removed --- the Collection passed
> to removeAll might be based on a different concept of
> equality than HashSet's. For example, consider a HashSet
> of Integers vs. the keySet of an IdentityHashMap of
> Integers (it doesn't matter what they map to, of course).

You're right. It would also cause issues with collections whose Comparators are not consistent with equals().

Technically all of these issues result from breaking the contract of contains():

"More formally, returns true if and only if this set contains an element e such that (o==null ? e==null : o.equals(e))."

But what is done is done I suppose.

James Watson

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 2:58 PM

> > Don't forget to null out the remaining unused slots of
> > the underlying array. Neglecting to do so will cause
> > memory leaks.
>
> Thanks, that's a good catch. I'll go back and fix this.

Have you looked at retainAll()? I would guess it has the same flaw.

damien morton

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 3:22 PM

Sort by hash code, binary search to find: maybe an intermediate way forward.
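
A sketch of the idea (the class name HashSortedLookup is invented; equal hash codes still have to be resolved with equals(), as the contains() contract requires):

import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;

class HashSortedLookup {
    private final Object[] sorted; // removal candidates, sorted by hash code

    private static final Comparator<Object> BY_HASH = new Comparator<Object>() {
        public int compare(Object a, Object b) {
            int ha = hash(a), hb = hash(b);
            return ha < hb ? -1 : (ha == hb ? 0 : 1);
        }
    };

    HashSortedLookup(Collection<?> c) {
        sorted = c.toArray();
        Arrays.sort(sorted, BY_HASH); // O(s log s), done once
    }

    boolean contains(Object o) {
        int idx = Arrays.binarySearch(sorted, o, BY_HASH); // O(log s)
        if (idx < 0) return false;
        // Hash codes can collide, so scan the whole run of equal hashes
        // in both directions, resolving with equals().
        int h = hash(o);
        for (int i = idx; i >= 0 && hash(sorted[i]) == h; i--)
            if (eq(o, sorted[i])) return true;
        for (int i = idx + 1; i < sorted.length && hash(sorted[i]) == h; i++)
            if (eq(o, sorted[i])) return true;
        return false;
    }

    private static int hash(Object o) { return o == null ? 0 : o.hashCode(); }
    private static boolean eq(Object a, Object b) { return a == null ? b == null : a.equals(b); }
}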

James Watson

Re: Optimizing ArrayList.removeAll() Posted: Mar 1, 2007 3:34 PM

> Sort by hash code, binary search to find: maybe an
> intermediate way forward.

But the Comparator thing would still cause problems. A TreeSet can base its contains() method on something completely inconsistent with equals(). I think you could get away with it for known collections (ArrayList, LinkedList), though.

In any event, I think it's good to point out that passing a collection with O(n) contains() to a removeAll call has bad time complexity. Even if it can't be fixed in the JDK, it's something people should avoid when the lists are fairly large and speed matters.
