15 June 2016

Java Lists

I was again called into code review mode in encountering uses of java.util.Vector throughout a piece of code in method arguments, return values, and fields, and took it upon myself to instruct the junior developer on the error of his ways. These kinds of conversations are sometimes a little too one-sided, such as I say "don't do that" or "do it this way" and they say "okay" and don't do that anymore. This time was one of those other cases where they have an opinion and disagree want to learn more from my sagacity and we have a short discussion about the pros and cons of Vector in particular and more generally how to refer to and use types in OO languages.

As a result of the discussion, some particular points arose for me to capture in my infrequently used blog that I hope others also find of use.

There are many articles and blog posts that can be found comparing Java "List vs array vs ArrayList vs LinkedList vs Vector vs ...". Some are more recent and some are older, and some have benchmarks on particular operations. The results of the benchmarks have changed over time as improvements have been made in various areas of the Java Runtime, such as in garbage collection, JIT and HotSpot compilation, and reduced overhead of unnecessary synchronization, which are more relevant to this discussion than some other improvements. One of the biggest relevant changes was the unification of many similar "list" types under the Java Collections Framework, which allowed various list types to be used interchangeably in an improved OO way.

My intent is not to again benchmark which list implementation is the best, but to outline how to choose and how to structure your code so you can both make a choice and change that choice as it is appropriate to do so. Therefore, when trying to decide "which List type to use?", here are some guidelines.

Use an abstraction

My first tip: "always use an abstraction". One rule of good OOP is to use the highest level abstraction you can get away with. In this case, that means to generally use java.util.List type reference everywhere except where you must not, which is when instantiating. This then allows you to change the implementation without an effect to most of your code. Why would you want to change the implementation? See below.

List instead of Array

The second tip is: "prefer List over array". The fuller version is: "use an array only if you really really need it for performance and have run profiling to determine this is really really what you need". Otherwise, use a List, i.e.: "always" prefer List over array for safety, improved API, and interoperability.

Arrays are certainly useful, just like any other of the lowest-level building blocks of data structures. But an array is not a substitute for an abstract data type. An array should be wrapped in the semantics of a class, encapsulated, and had the invariants of that semantic protected. Transforming the array to do anything but the simplest of operations can be far more pain and overhead in future maintenance than the savings in performance of a raw array over a List in your APIs. You may end up with an entire library of static methods to manage arrays rather than simply combine state and behavior into a class and enable leveraging the good parts of OO.

Certainly there are times when use of arrays is a must. One example is in large-scale processing of low-level data, such as image manipulation. You don't want to deal with a List<Byte>, and the overhead of wrapper types would be unreasonable. However, that doesn't mean you should resort to only passing arrays and primitive types through your APIs to manage that image data.

Start with ArrayList

The tip is: "use java.util.ArrayList as your 'go to' list implementation". If it turns out you need a different one, it should be a simple case of changing the code that instantiates. ArrayList gives good performance for typical use, such as adding a bunch of elements, iterating, not much overhead.


Vector is Synchronized

Since the conversation started with Vector, the tip is that "Vector is rarely if ever faster than ArrayList". Vector is like ArrayList, but adds synchronization. This means it checks for mutual exclusion locks on every method call into your list (depending on the JVM optimizations, of course).

In versions of Java somewhere before 8, this was a significant performance hit, but more recently is much less so. The performance hit is more than zero, but if you use List throughout your code it doesn't matter. In particular if you only use the instance in the context of a single thread, there is no reason to pay for something not being used.

You can also achieve a similar thread-safe result by wrapping the ArrayList (or any List) with Collections.synchronizedList, giving not only another option in instantiation but in profiling.


LinkedList is essentially useless

The next tip is "java.util.LinkedList is rarely the best option for the situation". LinkedList is like ArrayList, but stores in a linked data structure instead of contiguous. In computer science theory, an add or remove in the middle of the linked structure is technically fewer operations at O(1) since the subsequent list elements don't have to be shifted and a get operation is slower at O(n) because it is not random access.

In practice, array manipulation at lower levels is crazy optimized and benchmarks have consisitently shown it is actually faster to shift contiguous memory than to perform the linked insert. Of course, indexes and other management overhead and lazy operations may be added to improve performance to the internal implementation of the LinkedList, but would only increase its memory footprint. In my experience then, there is little benefit in using java.util.LinkedList.


The correct answer is always "it depends"

The final and most important tip: "it depends". For anyone with a graduate degree, you have likely uttered and heard this innumerable times. Many times a problem scope (such as a benchmark) is too narrow to cover a practical case and must include other information to characterize the situation enough to make the best, informed decision. One of the best ways to get that situational awareness is to have benchmarks available but then profile your case and see where it aligns with the benchmarks.

Benchmarking "add" can be misleading between Vector and ArrayList because they grow at different rates. If all you ever do is "add", then likely Vector is better, but why are you only ever calling add? You either have profiled your code or it is not significant enough to profile and validate your expertise with science. Benchmarking any operation in isolation (concurrent access, random element access, head/tail insert, head/tail remove, inner element removal, iteration) can be misleading, which is why it is useful to be able to change the implementation when your code changes or you discover where performance gains will make the most impact.

I reminded the developer during the conversation that raw running time is not always the ultimate end. Many times perceived performance outweighs actual performance. Also, when dealing with user interfaces, as is common in our shop, seconds or hundreds of milliseconds are more our concern and then we move on to the next task in the never-shrinking queue.

So to best be resilient to change, you must first know the task you are solving and what choices will likely make it work best, then design to the proper level of abstraction for your case, then profile and change where it will make a difference.