The Artima Developer Community
Sponsored Link

Artima Developer Spotlight Forum
Optimizing ArrayList.removeAll()

41 replies on 3 pages. Most recent reply: Mar 9, 2007 9:39 PM by damien morton

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 41 replies on 3 pages [ « | 1 2 3 ]
Surender Kumar

Posts: 10
Nickname: darthvader
Registered: Jun, 2005

Re: Optimizing ArrayList.removeAll() Posted: Mar 7, 2007 11:37 AM
Reply to this message Reply
Advertisement
my mistake..was under the impression that if input list has 2 references references and actual list 3 references, removeAll() in that case should remove only 2. The implementation doesn't do that either, so it's fine.

Can anyone please explain then what's the issue with removing to the hashset?

Amin, if you're still convinced that Sun's impl is O(n^3) or O(n*n*K) and not O(n^2) like your solutions, please elaborate.

-Surender

amin ahmad

Posts: 15
Nickname: aahmad
Registered: Feb, 2007

Re: Optimizing ArrayList.removeAll() Posted: Mar 7, 2007 12:53 PM
Reply to this message Reply
> my mistake..was under the impression that if input list
> has 2 references references and actual list 3 references,
> removeAll() in that case should remove only 2. The
> implementation doesn't do that either, so it's fine.
>
> Can anyone please explain then what's the issue with
> removing to the hashset?
>
> Amin, if you're still convinced that Sun's impl is O(n^3)
> or O(n*n*K) and not O(n^2) like your solutions, please
> elaborate.

Ignore this discussion for a second, it is tangential. See my comments below. (On this point, the correct worst-case complexity is O(n*n + K*s) for Sun's, and O(n + K*s) for mine, where K is the cost of "contains"),

>
> -Surender

Okay, I will try to make this as clear as possible. The article states that my removeAll is O(n) while Sun's is O(n*n).

Surender, *assume* for a minute that both versions of the method (mine and Sun's) accept a hashset with O(1) contains performance. Now assume that the list size is "N", and that the hashset I pass in contains those same "N" items. What is the time complexity of the two implementations, by your reckoning?

amin ahmad

Posts: 15
Nickname: aahmad
Registered: Feb, 2007

Re: Optimizing ArrayList.removeAll() Posted: Mar 7, 2007 12:55 PM
Reply to this message Reply
> Ignore this discussion for a second, it is tangential. See my comments below. (On this point, the correct worst-case complexity is O(n*n + K*s) for Sun's, and O(n + K*s) for mine, where K is the cost of "contains"),

Sorry, that's also wrong, I was in a rush.

Surender Kumar

Posts: 10
Nickname: darthvader
Registered: Jun, 2005

Re: Optimizing ArrayList.removeAll() Posted: Mar 8, 2007 4:22 AM
Reply to this message Reply
In that case, I agree your solution scores over Sun's as Sun still has inner code of O(n) 'cos of remove(). Yours become O(n) while Sun's remains O(n^2). Agreed.

What we were first arguing over was that your solution is worst case O(n) while Sun's O(n^2).

So now let's talk the real case where contains() has a base complexity of O(n) as James also observed.

*Assume* that argument Collection in removeAll() is the same 'ArrayList' as the actual one. This is the worst case with both having exactly the same elements.

In this case 'contains()' iterates over the entire list thereby having O(n). So if your solution also invokes contains() (as it does) in the loop you have O(n^2) as well.

In terms of time complexity in big-Oh notation, there's no improvement.

But as i stated previously, your solution has got rid of remove() only and not contains(). Sun's time is O(n * (2n)) i.e O(n^2). That 2n is contains() + remove().

Worst case time complexity in Big-Oh is same for both if contains() is O(n). Your solution has subtracted 'n' cycles from Sun's but not divided it by 'n' cycles.


contains() = O(1) : Amin's solution's wonderful.
contains() = O(n) : No significance difference.

As i get your solution doesn't change contains().

I hope this time I'm clear to you.

amin ahmad

Posts: 15
Nickname: aahmad
Registered: Feb, 2007

Re: Optimizing ArrayList.removeAll() Posted: Mar 8, 2007 6:30 AM
Reply to this message Reply
> In that case, I agree your solution scores over Sun's as
> Sun still has inner code of O(n) 'cos of remove(). Yours
> become O(n) while Sun's remains O(n^2). Agreed.

Okay, cool. That was the main point of the article.

>
> What we were first arguing over was that your solution is
> worst case O(n) while Sun's O(n^2).

Yes, I see the source of confusion now. I was talking worst case of removeAll in isolation.

>
> So now let's talk the real case where contains() has a
> base complexity of O(n) as James also observed.

Okay, though I don't see why you consider the O(n) case the "real" case.

>
> *Assume* that argument Collection in removeAll() is the
> same 'ArrayList' as the actual one. This is the worst case
> with both having exactly the same elements.
>
> In this case 'contains()' iterates over the entire list
> thereby having O(n). So if your solution also invokes
> contains() (as it does) in the loop you have O(n^2) as
> well.

Okay, agreed, but lets make that more thorough:

n: elements in the collection, C, on which you are doing removeAll
s: elements in the list, L, you pass to removeAll
r: the elements in s that are in n (that is the number of elements that will be removed).

So for suns solution we have, roughly:
r*n: the cost of doing a remove r times
+ n*s: the cost of doing an L.contains() call on each element in C, assuming O(n) performance for contains.

for my solution, we have, roughly:
n : the cost to remove r items
+ n*s: the cost of doing an L.contains() call on each element in C, assuming O(n) performance of contains.

There you have the difference between the two. Now, if:

a) r=s=n, we get n^2 performance for both, though mine would still perform better (removal is constant vs. n^2).
b) s=0, and r is much, much larger than n (r=1000000, n=1), then the performance of the list dominates, and both algorithms perform poorly.

Side note: I think it was said on this list before, that passing a collection with poor contains performance is a *bad thing*, unless the list is really small.


>
> In terms of time complexity in big-Oh notation, there's
> no improvement.
>
> But as i stated previously, your solution has got rid of
> remove() only and not contains(). Sun's time is O(n *
> (2n)) i.e O(n^2). That 2n is contains() + remove().
>
> Worst case time complexity in Big-Oh is same for both if
> contains() is O(n). Your solution has subtracted 'n'
> cycles from Sun's but not divided it by 'n' cycles.
>
>
> contains() = O(1) : Amin's solution's wonderful.
> contains() = O(n) : No significance difference.
>
> As i get your solution doesn't change contains().
>
> I hope this time I'm clear to you.

Okay, I think we are roughly on the same page. By the way, the concrete timings are available here: http://www.ahmadsoft.org/articles/removeall/index.html. You can see that the solutions I provided outperform Sun's by a factor of 3 when removing 150 items from a list of size 2500 (passing in a hashset of course). There is no case where Sun's solution is seen outperforming the one I have provided (obvious, based on the analysis above).

amin

Peter Booth

Posts: 62
Nickname: alohashirt
Registered: Aug, 2004

Re: Optimizing ArrayList.removeAll() Posted: Mar 8, 2007 11:24 AM
Reply to this message Reply
This is an interesting discussion. My two cents:

I saw an example of the poor performance of ArrayList.removeAll() when tuning a sluggish risk management application where calling removeAll() to reset a cache was creating noticeable latencies in what could have been a snappy application. I looked at implementation and asked myself, "WTF would I want to call this method anyway?" and replaced

cachedProducts.removeAll()

with

cachedProducts = new ArrayList()

One line of code and this application use case was much faster.

I thought no more of this issue until I saw the posting. Maybe I am missing something but I can't imagine why it is ever necessary to use this method? I think that the standard Java Class Libraries on the whole are excellent (Date would be one obvious exception) but it's also naive to expect to build software that depends purely on general purpose components. A good example would be StringTokenizer which creates objects out the wazoo. This is necessary because the class attempts to solve 99% of user's problems. A version of the class that is limited to single character field separaters can be implemented without any extra object creation, solving 80% of users problems at much lower runtime cost. Now creating or using a custom implementation like this is only warranted when the general purpose StringTokenizer is an application's hotspot. Tuning code that has no performance impact is a waste of time.

This doesn't mean that StringTokenizer sucks. It just means that general purpose means general purpose and not every purpose.

Surender Kumar

Posts: 10
Nickname: darthvader
Registered: Jun, 2005

Re: Optimizing ArrayList.removeAll() Posted: Mar 8, 2007 11:48 AM
Reply to this message Reply
> > In that case, I agree your solution scores over Sun's
> as
> > Sun still has inner code of O(n) 'cos of remove().
> Yours
> > become O(n) while Sun's remains O(n^2). Agreed.
>
> Okay, cool. That was the main point of the article.
>
> >
> > What we were first arguing over was that your solution
> is
> > worst case O(n) while Sun's O(n^2).
>
> Yes, I see the source of confusion now. I was talking
> worst case of removeAll in isolation.

yes.

> >
> > So now let's talk the real case where contains() has a
> > base complexity of O(n) as James also observed.
>
> Okay, though I don't see why you consider the O(n) case
> the "real" case.
>
> >
> > *Assume* that argument Collection in removeAll() is the
> > same 'ArrayList' as the actual one. This is the worst
> case
> > with both having exactly the same elements.
> >
> > In this case 'contains()' iterates over the entire list
> > thereby having O(n). So if your solution also invokes
> > contains() (as it does) in the loop you have O(n^2) as
> > well.
>
> Okay, agreed, but lets make that more thorough:
>
> n: elements in the collection, C, on which you are doing
> removeAll
> s: elements in the list, L, you pass to removeAll
> r: the elements in s that are in n (that is the number of
> elements that will be removed).
>
> So for suns solution we have, roughly:
> r*n: the cost of doing a remove r times
> + n*s: the cost of doing an L.contains() call on each
> element in C, assuming O(n) performance for contains.
>
> for my solution, we have, roughly:
> n : the cost to remove r items
> + n*s: the cost of doing an L.contains() call on each
> element in C, assuming O(n) performance of contains.
>
> There you have the difference between the two. Now, if:
>
> a) r=s=n, we get n^2 performance for both, though mine
> ne would still perform better (removal is constant vs.
> n^2).
> b) s=0, and r is much, much larger than n (r=1000000,
> 0, n=1), then the performance of the list dominates, and
> both algorithms perform poorly.
>
> Side note: I think it was said on this list before, that
> passing a collection with poor contains performance is a
> *bad thing*, unless the list is really small.
>

Missed it completely if it was discussed that strongly.

> >
> > In terms of time complexity in big-Oh notation,
> there's
> > no improvement.
> >
> > But as i stated previously, your solution has got rid
> of
> > remove() only and not contains(). Sun's time is O(n *
> > (2n)) i.e O(n^2). That 2n is contains() + remove().
> >
> > Worst case time complexity in Big-Oh is same for both
> if
> > contains() is O(n). Your solution has subtracted 'n'
> > cycles from Sun's but not divided it by 'n' cycles.
> >
> >
> > contains() = O(1) : Amin's solution's wonderful.
> > contains() = O(n) : No significance difference.
> >
> > As i get your solution doesn't change contains().
> >
> > I hope this time I'm clear to you.
>
> Okay, I think we are roughly on the same page. By the way,
> the concrete timings are available here:
> http://www.ahmadsoft.org/articles/removeall/index.html.
> You can see that the solutions I provided outperform Sun's
> by a factor of 3 when removing 150 items from a list of
> size 2500 (passing in a hashset of course). There is no
> case where Sun's solution is seen outperforming the one I
> have provided (obvious, based on the analysis above).
>
> amin

Agree with you.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Optimizing ArrayList.removeAll() Posted: Mar 8, 2007 12:16 PM
Reply to this message Reply
> my mistake..was under the impression that if input list
> has 2 references references and actual list 3 references,
> removeAll() in that case should remove only 2. The
> implementation doesn't do that either, so it's fine.
>
> Can anyone please explain then what's the issue with
> removing to the hashset?

Not all collections follow the contract of contains as stated in Collections. Consider the example that was given of the IdentityHashMap.keySet(). In an IdentityHashMap, an Object is only considered to be in the map if the map references that exact Object. This is not the way HashSet works. For example:
IdentityHashMap idMap = new IdentityHashMap();
HashSet hashSet = new HashSet();
TreeSet treeSet = new TreeSet(String.CASE_INSENSITIVE_ORDER);
 
String a = "test";
String b = new String("test");
String c = "TEST";
 
idMap.put(a, null);
hashSet.add(a);
treeSet.add(a);
 
System.out.println(idMap.keySet().contains(b));
System.out.println(hashSet.contains(b));
System.out.println(treeSet.contains(c));
System.out.println(hashSet.contains(c));


You could salvage this idea but you need to certain things about the collection that is passed in and without other changes to the JDK, that means you can only do this with other ArrayLists or with LinkedLists (and not subclasses of these.)

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Optimizing ArrayList.removeAll() Posted: Mar 8, 2007 12:18 PM
Reply to this message Reply
> This is an interesting discussion. My two cents:
>
> I saw an example of the poor performance of
> ArrayList.removeAll() when tuning a sluggish risk
> management application where calling removeAll() to reset
> a cache was creating noticeable latencies in what could
> have been a snappy application. I looked at implementation
> and asked myself, "WTF would I want to call this method
> anyway?" and replaced
>
> cachedProducts.removeAll()
>
> with
>
> cachedProducts = new ArrayList()
>
> One line of code and this application use case was much
> faster.
>
> I thought no more of this issue until I saw the posting.
> Maybe I am missing something but I can't imagine why it is
> ever necessary to use this method?

I think you are confusing removeAll with clear. removalAll requires a parameter which is a collection of elements to remove.

Surender Kumar

Posts: 10
Nickname: darthvader
Registered: Jun, 2005

Re: Optimizing ArrayList.removeAll() Posted: Mar 9, 2007 12:41 AM
Reply to this message Reply
> > my mistake..was under the impression that if input list
> > has 2 references references and actual list 3
> references,
> > removeAll() in that case should remove only 2. The
> > implementation doesn't do that either, so it's fine.
> >
> > Can anyone please explain then what's the issue with
> > removing to the hashset?
>
> Not all collections follow the contract of contains as
> stated in Collections. Consider the example that was
> given of the IdentityHashMap.keySet(). In an
> IdentityHashMap, an Object is only considered to be in the
> map if the map references that exact Object. This is not
> the way HashSet works. For example:
>
> IdentityHashMap idMap = new IdentityHashMap();
> HashSet hashSet = new HashSet();
> TreeSet treeSet = new
> TreeSet(String.CASE_INSENSITIVE_ORDER);
> 
> String a = "test";
> String b = new String("test");
> String c = "TEST";
> 
> idMap.put(a, null);
> hashSet.add(a);
> treeSet.add(a);
> 
> System.out.println(idMap.keySet().contains(b));
> System.out.println(hashSet.contains(b));
> System.out.println(treeSet.contains(c));
> System.out.println(hashSet.contains(c));
> 

>
> You could salvage this idea but you need to certain things
> about the collection that is passed in and without other
> changes to the JDK, that means you can only do this with
> other ArrayLists or with LinkedLists (and not subclasses
> of these.)

Actually I take this as a shortcoming of Sun's removeAll where object equality should be the criterion to remove instead of contains(). I understand that results would be totally different in that situation when it should not be. A case for another Bug in the parade?

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: Optimizing ArrayList.removeAll() Posted: Mar 9, 2007 5:42 AM
Reply to this message Reply
> Actually I take this as a shortcoming of Sun's removeAll
> where object equality should be the criterion to remove
> instead of contains(). I understand that results would be
> totally different in that situation when it should not be.
> A case for another Bug in the parade?

Even if you were able to convince Sun that this is a bug (good luck with that one) it's not going to matter because changing the behavior now could (and probably would) break lots of code.

It's thing like this and the dusty corners of Java generics that makes me think that Java needs to be petrified to some degree and get a fresh start with a new language.

It seems though that most of the really bad warts of any language (or standard library) come about in the first few versions. It would be nice to have a new language created with the lessons of the past in mind.

damien morton

Posts: 15
Nickname: dmost
Registered: Jan, 2004

Re: Optimizing ArrayList.removeAll() Posted: Mar 9, 2007 9:39 PM
Reply to this message Reply
ArrayList.removeAll is performing an operation usually of interest when dealing with sets rather than arrays.

Since theres no way to expose this operation on arrays without incurring an excessive overhead, should the operation be exposed at all?

Should it not be implemented on ArraySet rather than ArrayList, and then only when performing this set operation on collections that provide a faster-than-O(n) lookup operation.

Flat View: This topic has 41 replies on 3 pages [ « | 1  2  3 ]
Topic: Working Effectively With Characterization Tests Previous Topic   Next Topic Topic: The Expressive Power of Languages

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use