Friday, December 9, 2011

The java collections framework for newbies

I don't consider myself a java expert by any measure, but there's a disturbing thing I've noticed. There are a LOT of people who claim to be "java developers", but they have zero clue what the "java collections framework" is. This post is designed for folks who keep getting stumped on interview questions or are mystified when someone starts talking about the difference between a Set and a List (for example).
If you google "java collections framework for dummies" you'll find this link which has a more complete, if fairly dense explanation. I'm going to do you one better and give a rule of thumb that you can use without thinking about it.
At it the root of things, a collection is something you can store other things inside. Just like in real life, a collection of marbles is just a "bunch" of marbles. The big difference in the collections framework is that the different implementations have different things they DO with the marbles that you need to understand.
For example, let's consider the ArrayList... everybody and their brother should know this... if not, you are not a java developer, go read a book. Some special things about an array list: It stores the entries in order of when they are added, you can select an element by it's index, it can contain duplicates of the same element. From a performance perspective, it is VERY fast to lookup and add things by index and add things to an ArrayList, on average, it is slow to see if a particular object is present because you must iterate the elements of the list to see if it's there.
Next, let's talk about HashSet... I realize that this might sound vaguely drug related to the uninitiated, but a hashset has some interestingly different characteristics from a list. First off, a HashSet has no concept of order or index, you can add things to it, you can iterate over it, but you cannot look things up by index nor are there any guarantees of what order things will be presented to you when it loop over members. Another interesting characteristic is that it cannot contain duplicates, if you try to add the same object twice, it will NOT fail, it will just return false and you can happily move on.
Last but not least, there is the Hashtable (or his slightly more dangerous cousin, the HashMap). This is used to store key/value pairs. Instead of keying things by an index (like an arraylist), you can key them by just about anything you want. You can do things like myMap.put("foo","bar") and then myMap.get("foo") will return bar...
There is a LOT more to this, but with this quick reference you can at least begin to do useful things in java.
Examples of using a List
ArrayList myList = new ArrayList();
myList.add("Second Thing");
myList.add("Second Thing");
myList.add("First Thing");

System.out.println(myList.get(0));
will output
Second Thing
An interesting thing to note is that the size of this is 3
System.out.println(myList.size());
will output
3
The following:
for (String thing: myList) {
  System.out.println(thing);
}
will always output:
Second Thing
Second Thing
First Thing
Next lets look at a set:
HashSet mySet = new HashSet();
mySet.add("Second Thing");
mySet.add("Second Thing");
mySet.add("First Thing");
The first difference we can see is that
System.out.println(mySet.size());
returns
2
Which makes complete sense if you understand that sets cannot contain duplicates (and you understand how the equals method of String works...;) Another interesting thing is that: The following:
for (String thing: myList) {
  System.out.println(thing);
}
might output:
Second Thing
First Thing
or it might output:
First Thing
Second Thing
It so happens that it returns the second version on my machine but it's really JVM/runtime specific (it depends on how the HashSet is implemented and how hashcode is implemented and a bunch of other variables I don't even fully understand).
More importantly, the following will be likely be much faster for LARGE collections:
System.out.println(mySet.contains("Third Thing"));
Finally, the grandDaddy of all the entire framework, hashtable.
 Hashtable myMap = new Hashtable();
 myMap.put("a", "Second Thing");
 myMap.put("b", "Second Thing");
 myMap.put("c", "First Thing");

 System.out.println(myMap.get("a"));

Will output
Second Thing
and the following:
for (Map.Entry entry: myMap.entrySet()) {
    System.out.println(entry.getKey() + "=" + entry.getValue());
}
will output
b=Second Thing
a=Second Thing
c=First Thing
Hopefully with these examples, you can get an idea of the capabilities of the collections framework. There is much much more to it and I encourage ANYONE doing java development to spend time playing around and learning the different characteristics of the various components as I've only lightly skimmed the surface.

7 comments:

Emily S said...

I'd like to point out that you should favour HashMap over Hashtable, should you need to do multi-threading then a ConcurrentMap (ConcurrentHashMap) would be preferable. Hashtable is a legacy class which has been retrofitted to also be a map.

Michael Woloszynowicz said...

Good post Mike, it's true that so many people fail this in an interview.

I agree with Emily, concurrency is a huge issue with HashMap, particularly when the map is re-sizing.

Mike Mainguy said...

Interesting, about concurrentHashMap, I believe this must have come in with the concurrency stuff. Now that I think about it, I'm not sure it's exactly a 1 for 1 replacement though. I debated on the HashMap-Hashtable choice, on one hand, HashMap is faster, but on the other hand, I know too many people who the first thing they use a map for is a public static final cache of some sort which means HashMap ends up introducing subtle problems under load.

Anonymous said...

"From a performance perspective, it is VERY fast to lookup and add things by index and add things to an ArrayList, on average, it is slow to see if a particular object is present because you must iterate the elements of the list to see if it's there."

What does that sentence mean? Lookup is fast but seeing if particular element exists is slow? Seems like you are mixing up ArrayList with LinkedList. You should take your own advice and study on the collections a bit more...

Stackoverflow has a good summary on the differences: http://stackoverflow.com/questions/322715/when-to-use-linkedlist-over-arraylist

Craig Doremus said...

Newbie Beware! This guy doesn't know what he's talking about. Start here: http://docs.oracle.com/javase/tutorial/collections/ instead.

Mike Mainguy said...

Ha! you got me, I could certainly spend more time studying this stuff, of course when I'm interviewing someone who has been writing java for a year (and/or studied in school) and they cannot explain any difference between a List and Set, you could argue I'm more informed than them. Many java "Experts" forget that many people are just starting out and, from a practical perspective, the subtle nuances than can be learned from studying the bytecode/blog posts/documentation yield diminishing returns for the "average" beginning developer. I applaud folks who willingly submit to studying the inner workings of these things for hours (days/years), but I am not one of them and I would contend that 99% of the folks who write java professionally also are not.

I do appreciate folks taking the time to clarify and comment, every little bit helps...

Janeve George said...

A wonderful article for newbies! You might find the article on "Which Java collection to use?" interesting.

It doesn't go into depth of using each Collection class. But it includes a PDF file that compares the popular classes in the Collection Framework..