
Allocation Styles

Stevey's Drunken Blog Rants™

Over the weekend, I was pondering different ways I've seen people solve programming problems, and a new distinction occurred to me — a way to classify certain styles of programming that I hadn't read about or thought of before. The distinction is in how different types of programmers approach memory allocation.

The Classic Categorization

One popular way to classify programming languages is to divide them into Imperative, Functional, and Declarative styles. (It's interesting that people associate a language with a style, since language and style are to some extent independent.) These three styles are a bit fuzzy and overlap somewhat, but are still essentially very different approaches to programming.

Imperative basically means statement-oriented: do this, do that. In the general (non-programming) sense, an imperative is a command, and the imperative style is seen as telling the computer what to do. People often further subdivide the Imperative style into Procedural and Object-Oriented categories. They're both still imperative styles; the procedural/OO distinction is more of an organizational one — how you organize your code and data.

The Functional style is basically expression-oriented: compute this, compute that. It tries to avoid side-effects (meaning, changing the values of variables), since side-effect-free functions are easier to reuse and compose. There are other differences between functional and imperative styles, and one of them is neatly captured by the distinction I thought of this weekend, which for the moment I'm thinking of as allocation strategy. More on that in a bit.

Declarative programming is a very high-level way of programming, where you say what you want, and leave the rest to the computer. You might call it query-oriented, although I think many people also consider UI-layout languages like Jakarta Jelly to be declarative, in the sense that you're declaring the structure of your object tree, and leaving it to the computer to handle the instantiations and setting up parent/child links, etc. In any case, query languages like SQL or XPath are generally considered declarative, and so are logic languages like Prolog and Jess.

Like I said, it's a fuzzy distinction. And you can personally use almost any style in any language — for instance, you can do imperative programming in Lisp (most Emacs-Lisp code is basically imperative, for instance), and you can do functional programming in Java or C++. However, most languages naturally tend to support some styles better than others, and cutting against the "grain" of the language can be a lot of additional effort.

In any case, I was thinking about memory-allocation strategies in day-to-day programming, and realized that I could think of at least five very distinct styles. Programmers aren't usually taught about these styles explicitly, so whichever one you pick up first is likely to be the one you use all the time, and doing it any other way feels unnatural.

However, different allocation styles are actually suited to very different problem domains, and if you're using the wrong one, you're cutting against the grain. You'll wind up struggling without necessarily even realizing it. I think that very good programmers understand these allocation styles intuitively, and can choose the best one for a given situation, rather than just pulling the old hammer out for every occasion.

Stevey's Made-Up Allocation Strategies

I haven't decided on the best order for presenting the allocation styles I've thought of. You could order them by popularity, innate performance, ease of use, etc. I'll order them by "how much you worry about memory allocation when you're writing the code", and see how that goes.

So far I've thought of six distinct styles. Here are the fancy names I came up with for them. Aren't they just grand? I'm still not quite happy with the last three, though, and I'm not entirely sure yet whether #5 and #6 are different.

    1. Allocation-free programming

    2. Buffer-oriented programming

    3. Collection-oriented programming

    4. List-oriented programming

    5. Query-oriented programming

    6. Pattern-match programming

Generally speaking, I tend to think of these styles as being most idiomatic for particular programming languages:

    1. Allocation-free style is common in C, Fortran, and Pascal

    2. Buffer-oriented style is common in C, C++, and Java written by C/C++ programmers

    3. Collection-oriented style is common in Java

    4. List-oriented style is common in most higher-level languages, but is probably best exemplified in Lisp, Ruby, and Haskell

    5. Query-oriented style is common in Lisp, Python, Haskell, and SQL

    6. Pattern-oriented style is common in Haskell, Erlang, and ML; in mini-languages like XPath, regular expressions, and other path-expression languages; and in many problem domains for which it doesn't quite have first-class language status.

I should point out that my use of "pattern" has little to do with "design patterns", which is more of a software-engineering discipline based on patterns of reuse in software.

I also want to make it clear that I think of these styles as crosscutting (to borrow a term from AOP) the traditional distinctions of OO, procedural, functional, declarative, etc. Hopefully I can make that clearer as I describe them.

Allocation-free programming

If you program in any non-garbage-collected language, dealing with memory allocation is somewhat painful. Although we usually program in C or C++, it's equally true of Pascal, Fortran, even assembly language. C++ offers some strategies to help you manage memory somewhat automatically (e.g. RAII), so I'm really talking more about classic C programming.

The basic idea with this style is that you always try to avoid allocating dynamic memory, if you can at all help it. Some common strategies include:

    1. representing your data as bit fields whenever possible, so you can pass around values using ints and longs

    2. doing "in-place" sorting, searching and filtering of buffers by swapping data values

    3. using static buffers allocated at program startup

    4. using stack-allocated variables and buffers

Using 32-bit and 64-bit numbers to represent data values is an especially idiomatic approach for this style. Several people (including me) used this style for their C or assembly-language solutions to the farmer/dog/chicken/grain challenge we had a few months ago. There were only 16 possible states, so it was trivial to represent them as bits. There was even a nice, logical way to do it using a single byte: the high-order bits are the left bank, and the low-order bits are the right bank, with the bits ordered as F, D, C, G on each bank. So 1111 0000 is the start state, and if the farmer carries the grain, it becomes 0110 1001. And so on.

Using this style results in doing stuff like this:

#define FARMER  0x08
#define DOG     0x04
#define CHICKEN 0x02
#define GRAIN   0x01

#define FARMER_AND_CHICKEN (FARMER | CHICKEN)
#define FARMER_AND_GRAIN   (FARMER | GRAIN)
#define FARMER_AND_DOG     (FARMER | DOG)

#define START_STATE 0x0F
#define GOAL_STATE  0x00

#define CARRY_NOTHING 0x08
#define CARRY_DOG     0x04
#define CARRY_CHICKEN 0x02
#define CARRY_GRAIN   0x01

...and this:

switch (m)
{
    case CARRY_NOTHING:
        /* e.g. 0000 0111 */
        return farmer_on_west(s) ? s & 0x07 : s | FARMER;
    case CARRY_DOG:
        /* e.g. 0000 0011 */
        return farmer_on_west(s) ? s & 0x03 : s | FARMER_AND_DOG;
    case CARRY_CHICKEN:
        /* e.g. 0000 0101 */
        return farmer_on_west(s) ? s & 0x05 : s | FARMER_AND_CHICKEN;
    case CARRY_GRAIN:
        /* e.g. 0000 0110 */
        return farmer_on_west(s) ? s & 0x06 : s | FARMER_AND_GRAIN;
    default:
        printf("fatal error in next_state, state: %d, move: %d\n", s, m);
        exit(1);
}

You wind up with lots of comments describing the mapping of bits to data values (or states, or relationships, or anything else you might want to represent with bit patterns).

Because it's often the case that you only need a subset of the full range of possible values, your code is often littered with assertions that do bounds checks or limit checks or whatever, to make sure you're passing around a valid value. (One example: it may be the case that your representation never allows for even values, so you might throw in preprocessor assertions about the low-order bit.)

Is this approach fast? Yup. Once it's running and debugged, it can be lightning-fast. It can also be incredibly slow, e.g. if you've chosen an exponential or factorial-time algorithm. Whether you use bits, chars, strings, or objects for the actual storage in your data representation, it's still just a constant overhead, and as you scale up, your choice of algorithm begins to overwhelm whatever storage choices you made. The storage choice might cut the run-time (and/or run-space) in half, but the algorithm can affect time and space by arbitrary orders of magnitude.

Of course, everyone would ideally like to have the best algorithm and the optimal storage representation, because then their code will be running "as fast as it can". And that's the most important criterion, right? After all, hardware is extremely expensive; people are dirt cheap. (Note: irony alert) So we should always choose the solution that would perform optimally when scaled up, even if it looks like we'll never need to scale it. Because you never know if you might need to scale it someday.

But what if the storage choice doesn't make any difference at all to speed? In the farmer/dog/chicken/grain problem, which is basically 100% I/O bound unless you're doing something really crazy, it would make no human-detectable (or even profiler-detectable) difference in memory usage or runtime speed to use (say) arrays of chars, or even strings, or even first-class objects instead of bit fields. Why go to such lengths to avoid allocation in this case? Is the bit-field approach easier to read? Is it less error-prone? In what way is it actually "better", such that I and just about everyone else who used C or assembly chose that approach?

One answer is that it's actually the easiest approach in C. Yes, you have to define a bunch of constants, and you have to use masks and bitwise operators, and it's potentially error-prone, and your data "type" is totally opaque to both your compiler and your debugger, and it's not a generic solution for search-space problems, and there are umpteen other objections to the approach. It's still easier than mucking with dynamic memory allocation, because memory allocation has even more choices and things that can go wrong:

    • In most cases (not this one, but most of the time) you have to worry about whether it's the responsibility of the caller or the callee to free the memory. This isn't always an easy decision, and it's one that you can't abstract away; it becomes a part of your APIs, hence your architecture, and it's hard to change later.

    • You still have to choose some sort of non-ideal representation — perhaps strings or char arrays or structs — that may suffer from many of the same problems as the bit-field approach (e.g. having to check for invalid values), and now you've added memory management as an extra hassle.

    • Even if it's easy to allocate and free the memory, you still have to deal with pointers, and every single pointer access is something you could potentially mess up.

C doesn't really give you very good ways of "hiding" the memory allocation strategy, so using either the machine types (i.e. ints of various sizes) or static buffers is usually the easiest approach, if the problem lets you get away with it.

This is why C is often called a "glorified assembly language" — although it seems very expressive compared to assembly, you'll find that there's a close correlation between what C and assembly make it easy to do. It's almost as easy to work with pointers, machine types, and static buffers in assembly as in C. Straight C is really only a constant-factor improvement in ease of use over assembly.

Ease of use aside, there's another reason that C programmers often use allocation-free programming: it's macho. C programmers know there are lots of lamers out there who learned to program in higher-level languages, and they have no idea (or no recollection) of how computers work, or how to map their high-level abstractions onto bits and bytes. So C programmers revel in that constant-factor improvement they're getting (assuming they're using the same algorithm as their peers using a higher-level language) by using low-level data representations.

Sometimes C programmers call solutions like this "elegant". And in a certain sense, they are elegant, at least in C: they've managed to optimize for both runtime performance (in space and time) and code size.

I'm not sure if it's reasonable to call this style "elegant" in general. It's the kind of bit-twiddly programming that makes a lot of sense for doing, say, device-driver work, or maybe doing the context-switching code in a process scheduler, or the memory allocator in a real-time or embedded system.

For anything higher-level than that, I think most programmers here at Amazon would agree that it's probably optimizing for the wrong things, and it's likely to cause you trouble down the road, as your requirements change. For instance, using static/global buffers doesn't work well at all when you try to make your system multi-threaded. And using bit-fields doesn't work well for shared APIs, where callers need to know the exact bit pattern syntax to debug values going in and out of the APIs. There are really all sorts of ways it can get you in trouble.

Still, this is a surprisingly common allocation strategy in production code, because it's often easiest, and its machismo is usually defensible in our industry, which is still generally enthralled with micro-optimizations.

Buffer-oriented programming

This is the next step up, where you are doing dynamic memory allocation, and it's used nearly as often in C++ as it is in straight C. The style is characterized by allocating buffers to hold data values — arrays, basically, since C and C++ operators blur the distinction somewhat between arrays and memory pointers.

As with allocation-free programming, you can usually tell this style by just glancing at the code. Telltale signs include:

    • using lots of pointers (and double-pointers)

    • passing in length (and/or index) parameters along with every buffer in a function or method signature

    • heavy use of library functions like memcpy, strcpy, and others that operate on buffers or subsections of buffers

    • APIs being documented as "in-place" or "destructive", meaning the function has side-effects, cautioning you not to expect your original data to remain unchanged after the call
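To make those telltale signs concrete, here's a tiny, made-up Java sketch of the style: a buffer plus offset/length bookkeeping, and an "in-place" (destructive) operation that modifies the caller's data.

public class BufferStyle {
    // Reverse a subrange of the buffer in place ("destructive"): the caller's
    // array is modified, and the offset/length bookkeeping travels in the signature.
    static void reverseRange(byte[] buf, int offset, int length) {
        int lo = offset, hi = offset + length - 1;
        while (lo < hi) {
            byte tmp = buf[lo];
            buf[lo++] = buf[hi];
            buf[hi--] = tmp;
        }
    }

    public static void main(String[] args) {
        byte[] buf = {1, 2, 3, 4, 5, 6};
        reverseRange(buf, 1, 4);          // buf is now {1, 5, 4, 3, 2, 6}
        System.out.println(java.util.Arrays.toString(buf));
    }
}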

This is, for better or worse, still probably the most common allocation strategy worldwide. C++ offers some alternatives, and if you use STL heavily you can almost entirely avoid buffer-oriented programming. But I think that's still considered to be "advanced" C++ programming by your average programmer (not just considering Amazon, but industry-wide).

Buffer-oriented programming is the universal lingua franca: it still maps pretty closely to the capabilities of the underlying hardware, it's portable across just about every imaginable OS, and it's completely domain-independent: everyone programming in C or assembly uses this style, and it's still very common in C++. Since everyone "has to know it", it's usually a currency accepted in job interviews, and in some cases, it's the only currency interviewers will accept, since it really is more universally applicable than the styles used in (say) STL.

Buffer-oriented programming doesn't mix particularly well with recursion, for two very different reasons:

    1. It requires passing in and keeping track of extra indices, sometimes quite a few of them, that delimit various positions in the buffer(s) you're operating on. This makes the function signatures (not to mention the functions themselves) messier, and thus more complex to write.

    2. Recursion is generally slower than iteration (assuming the iteration isn't crossing function-call boundaries), and it's generally more prone to overflowing the stack than iteration is prone to overflowing counters or buffers.
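Here's a quick, hypothetical illustration of point 1: even a trivial recursive routine over a buffer has to drag its delimiting indices through every call.

public class RecursiveBuffer {
    // Recursive binary search: the buffer plus two delimiting indices have to be
    // threaded through every call, which is exactly the signature clutter that
    // point 1 above is complaining about.
    static int search(int[] buf, int target, int lo, int hi) {
        if (lo > hi) return -1;                   // not found
        int mid = lo + (hi - lo) / 2;
        if (buf[mid] == target) return mid;
        return (buf[mid] < target)
                ? search(buf, target, mid + 1, hi)
                : search(buf, target, lo, mid - 1);
    }

    public static void main(String[] args) {
        int[] buf = {2, 3, 5, 7, 11, 13};
        System.out.println(search(buf, 7, 0, buf.length - 1));   // prints 3
    }
}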

As a result, programmers who primarily employ the buffer-oriented style rarely use recursion, not even to model problems or processes that are most naturally expressed recursively (e.g. finding subsets or permutations, tree and graph traversals, most kinds of parsing, and many others). Consequently, they often have trouble thinking recursively when the occasion demands it.

To be fair, people who typically use allocation strategies that are more recursion-friendly often have trouble thinking in terms of buffers and pointers when the occasion demands. I think most programmers (including me) tend to favor one style over the others, especially when working in the language they favor. And the language will generally determine the allocation style they use.

Buffer-oriented programmers often think of the world in sequential terms. If you ask a BOP to name five data structures, often as not you'll hear "uh, vector, array, queue, stack, and linked list". And they'll probably have trouble coming up with any practical, real-world example of where you'd actually use a linked list.

This is because buffers are just arrays, so they naturally lend themselves best to implementing (duh) arrays, queues, stacks, fixed-sized circular lists, and vectors (if you want to put in the effort, although BOPs rarely see the need).

In fact, you can use buffers to implement all sorts of data structures, including binary trees and priority queues. Many important algorithms (including n*log(n) sorting algorithms, all kinds of radix/bucket algorithms, and many others) have been developed for use on fixed-size buffers.
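The classic example is a binary heap stored in a single array, where plain index arithmetic stands in for child pointers. A minimal sketch, using the usual zero-based layout:

public class HeapIndexing {
    // For a binary heap laid out in one array, the "pointers" are pure arithmetic:
    static int parent(int i)     { return (i - 1) / 2; }
    static int leftChild(int i)  { return 2 * i + 1; }
    static int rightChild(int i) { return 2 * i + 2; }

    public static void main(String[] args) {
        // A valid min-heap: each element is <= its children.
        int[] heap = {1, 3, 2, 7, 5, 4};
        System.out.println(heap[parent(4)]);      // parent of index 4 is index 1 -> 3
        System.out.println(heap[leftChild(1)]);   // left child of index 1 is index 3 -> 7
    }
}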

Heck, you can even do hashtables with a single buffer — I've just answered my own, long-standing question as to why on earth they still teach (and use) linear and quadratic probing as collision-resolution schemes when they're inferior in many ways to using simple linked lists. (Among other things, they make deletions complicated.) It's because buffer-oriented programming is a way of life, and linear/quadratic probing let you continue that lifestyle if you're using hashtables.
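Here's a stripped-down sketch of that single-buffer hashtable idea, a toy version of my own with no resizing and no deletion, just linear probing in one flat array:

public class ProbingTable {
    private final String[] keys = new String[16];   // one flat buffer, no linked lists
    private final int[] values  = new int[16];

    // Assumes the table never fills up (no resizing in this sketch).
    public void put(String key, int value) {
        int i = (key.hashCode() & 0x7fffffff) % keys.length;
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % keys.length;              // linear probing: just step to the next slot
        }
        keys[i] = key;
        values[i] = value;
    }

    public Integer get(String key) {
        int i = (key.hashCode() & 0x7fffffff) % keys.length;
        while (keys[i] != null) {
            if (keys[i].equals(key)) return values[i];
            i = (i + 1) % keys.length;
        }
        return null;                                 // not present
    }

    public static void main(String[] args) {
        ProbingTable t = new ProbingTable();
        t.put("dog", 1);
        t.put("chicken", 2);
        System.out.println(t.get("chicken"));        // prints 2
    }
}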

It's fair to say you can do 100% of all programs (at least the ones worth writing) using buffer-oriented programming. Which is why I talk to so many interview candidates (and co-workers, for that matter) who maintain steadfastly that C is all you need, and that higher-level languages are for wimps who don't know how computers work and want to write in Basic. The hard-core ones, who will remain nameless but their initials are Willie Lewis, maintain that assembly is all you really need.

And in a sense, they're right. For five years I worked at a company whose entire code base was in 80x86 assembler — an entire OS, libraries, applications, and even many of the tools.

I think you now understand buffer-oriented programming, as I've described it, well enough for me to move on. I'll defer passing judgment on any of these allocation strategies until I've had a chance to discuss them all.

Collection-oriented programming

I've not personally done any work with STL. My professional and academic experience with C++ either pre-dates STL, or it was then new and problematic enough that people didn't use it much if they were trying to write portable software (which I was.) So I don't know offhand which of my six allocation strategies it's most similar to, or if it's different enough to merit its own category, perhaps in conjunction with RAII ("resource acquisition is initialization"), a very common C++ idiom that uses stack-allocated objects that are cleaned up when the current scope exits.

Perhaps someone can comment on this? I've tried to study STL, and it looks somewhat interesting (in that the creator of STL hated OOP, loved functional programming, and basically tried to simulate FP — to some extent — in STL). But every time I try to read about it, I'm put off by the fact that 25% of every page of documentation dedicated to STL is a series of legal disclaimers saying that the lack of error-checking and counterintuitive rules are virtually guaranteed to hurt you in ways you won't soon forget. Something about that just demotivates me; not sure why.

In any case, I'm quite familiar with the most idiomatic allocation strategies in production Java code. In Java, it's about a 50/50 split between buffer-oriented and what I'm calling "collection-oriented" programming.

Java makes buffer-oriented programming quite a bit easier than it is in C, for various reasons:

    1. there are no explicit pointers or dereferencing, so you work only with indices (usually ints.)

    2. machine primitive types have predefined sizes (e.g. ints are always signed 32-bit values), and arrays remember their own types, so you never have to worry that you may be indexing into an invalid offset in the buffer.

    3. garbage collection means that unless you're holding a reference to a buffer (e.g. in a long-lived variable or some sort of data structure), you can forget about the buffer when you're done with it

    4. arrays and strings carry their own sizes along with them, so you don't have to employ tricks to try to avoid scanning to the end to get the buffer length (which eliminates a ton of common C idioms from Java code).

    5. automatic bounds-checking on array accesses means you never get core-dumps. This has lots of advantages. Core dumps are more painful to debug, and they cause you to lose all the state you may have built up in your program, so recovering from a core dump is a pretty big interruption. Java exceptions generally don't stop program execution (unless you don't bother to catch them), and the stack trace usually points you to the exact line number where the error occurred.

So buffer-oriented programming in Java is nowhere near as onerous as it is in C or even C++. And since many Java programmers are ex-C/C++ programmers, it's a very common way to write Java code.

I'm not sure if I'd go so far as to call it "idiomatic" in Java. I suppose it depends on what you're using it for. Most of the time, you're probably better off using collection-oriented programming in Java, because explicit buffers are still prone to many of the same problems they had in C:

    • you may still be tempted to use bit-fields or other primitives for representing rich(er) data types, and be stuck having to check whether the values are actually valid representations.

    • you still have to pass extra parameters around for keeping track of indices, which complicates both the signatures and the method implementations.

    • you're stuck hand-managing reallocations (if the buffer wasn't big enough) and shifting data (if you delete something from the middle).

    • buffers aren't "active" objects in any sense; you can't create subclasses of arrays in Java, so they can't know about their own operations. So you're still stuck with the fundamental problem that your data representation isn't well encapsulated.

I've done a fair amount of this kind of programming in Java, but mostly for network protocols, implementing your own general-purpose data structures, and other situations where it really does make sense to try to pack the data tightly and/or exercise maximal control over memory allocation. In those cases, you could say it's idiomatic style. But they're few and far between.

Collection-oriented allocation means you use Java's built-in collection classes. This is an intrinsically higher-level type of programming than buffer-oriented style. You can usually express yourself more succinctly, and you're more likely to choose a more appropriate data structure — e.g. a linked list, for situations where you know you'll often be deleting or inserting in the middle as you're traversing the list.
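As a small, hypothetical illustration of that linked-list case: a ListIterator over a LinkedList lets you delete and insert in the middle of a traversal without shifting anything around.

import java.util.LinkedList;
import java.util.ListIterator;

public class MiddleEdits {
    public static void main(String[] args) {
        LinkedList<String> words = new LinkedList<String>();
        words.add("farmer");
        words.add("dog");
        words.add("chicken");
        words.add("grain");

        // Walk the list, deleting and inserting in the middle as we go.
        for (ListIterator<String> it = words.listIterator(); it.hasNext(); ) {
            String w = it.next();
            if (w.equals("dog")) {
                it.remove();            // delete the current element
                it.add("boat");         // insert right where the cursor is
            }
        }
        System.out.println(words);      // [farmer, boat, chicken, grain]
    }
}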

It's generally easier to use collections in Java than it is in C or C++, for various reasons that include:

    • Java has a common base class for the class hierarchy, so every object can be placed in a collection (including hashtables).

    • garbage collection makes it a lot easier to do deletions and replacements on collections. (However, it's also much easier to create caches in Java, so your code is highly prone to "leaks" via references held in long-lived collections.)

    • the collections package is reasonably well-designed, with a polymorphic interface hierarchy that makes it easy to swap out implementations after you better understand your usage patterns.

Generally speaking, you also know exactly how much allocation is going on, and if you're reasonably thoughtful about it, you'll wind up doing the same amount as, or only a little more than, you'd have done in C or C++.

Java used to have a real problem with casting, because putting an object in a collection automatically lost its type information, and you had to do an unsafe downcast when pulling it back out again. That's generally no longer a problem (assuming you're using Java 5, which you really should be at this point) now that they've added generics.

Java's collection-oriented programming makes it easier to solve day-to-day problems than buffer-oriented programming. I was handed first-hand evidence of this when a friend of mine had me solve an interview question that he'd just solved himself. He'd heard another interviewer ask it, but didn't know offhand how to solve it, so he gave it a try. Then he came and watched me solve it. He'd taken about 20 minutes, and I took about 11 minutes. The difference? It was a simple parsing problem, and I'd immediately stuffed all the tokens into a LinkedList, which meant I could pull them off the front, dramatically simplifying the recursion. He'd done it using buffers and indices, which made it really messy.
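I won't reproduce the actual interview question, but the trick looks roughly like this made-up sketch: once the tokens are in a LinkedList, each recursive call just pulls what it needs off the front.

import java.util.Arrays;
import java.util.LinkedList;

public class TinyParser {
    // Parse a fully parenthesized expression like "( 1 + ( 2 + 3 ) )" and evaluate it.
    // Each call simply consumes tokens from the front of the list; no index juggling.
    static int parse(LinkedList<String> tokens) {
        String t = tokens.removeFirst();
        if (!t.equals("(")) {
            return Integer.parseInt(t);          // a bare number
        }
        int left = parse(tokens);
        String op = tokens.removeFirst();        // "+" or "-"
        int right = parse(tokens);
        tokens.removeFirst();                    // consume ")"
        return op.equals("+") ? left + right : left - right;
    }

    public static void main(String[] args) {
        LinkedList<String> tokens = new LinkedList<String>(
                Arrays.asList("(", "1", "+", "(", "2", "+", "3", ")", ")"));
        System.out.println(parse(tokens));       // prints 6
    }
}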

Solving a problem 50% faster may not seem like much when it's 10 minutes vs. 20 minutes, but what if you could get your project done with half as many people? Or in half the amount of time?

My use of collection-oriented allocation was obviously more "wasteful" of memory than his approach, so it presumably doesn't scale as well. Of course, in this case (and in many day-to-day programming situations), you have no idea whether it will ever need to really scale to the point where your allocation strategy becomes the performance bottleneck.

If you're not sure whether it'll need to scale, buffer-oriented programming in Java is emphasizing "might need to save hardware someday" over "make it easy on the writer and readers of the code". I'm not passing judgment on that choice (at the moment), but it's undeniably the choice you're making.

Collection-oriented programming may seem refreshingly high-level to people who've spent years doing buffer-oriented programming. You can get used to it in a hurry. However, it's got its own set of "expressiveness" drawbacks.

For starters, Java doesn't have first-class functions. C and C++ do have them, via function pointers (and to some extent, templates). So it's pretty easy to do higher-order functions — meaning "apply this function to every element of the collection" — in C, and even more convenient in C++. It's essentially impossible in Java, unless you hardwire together a combination of an interface (e.g. "Comparator") and a function that accepts an instance of that interface (e.g. "findMin()" or "sort()"). Then of course you have to instantiate at least one instance of the interface, and if it needs to accumulate values as it goes, you must instantiate one for every application of the function. In C++ you can completely avoid dynamic allocation when using higher-order functions.
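For the record, here's what that hardwired workaround looks like in practice: the standard Comparator idiom, with an anonymous class standing in for the function you actually wanted to pass, and Collections.min() playing the role of findMin().

import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PoorMansHigherOrder {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("chicken", "dog", "grain", "farmer");

        // The "function" we want to pass is wrapped in a one-off object.
        Comparator<String> byLength = new Comparator<String>() {
            public int compare(String a, String b) {
                return a.length() - b.length();
            }
        };

        // Collections.min() acts as the higher-order function that applies it.
        System.out.println(Collections.min(words, byLength));   // prints "dog"
    }
}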

As a result, Java collections make heavy use of the Iterator pattern. Rather than defining a function to apply to the collection elements (which is tantamount to the Visitor pattern, except that in non-broken languages you don't have to declare sixteen classes each time you want to implement it), you ask the collection to give you its elements, one at a time, on demand.

This is in most respects a pretty awful way to do things, at least compared to using higher-order functions. For one thing, Iterators are inherently non-thread-safe, unless the collection happens to support robust iteration (a la ConcurrentHashMap), but few of them do. To make it thread-safe, you have to rely on an error-prone convention of obtaining some pre-specified lock for the entire time you're actually using the Iterator. It's not always convenient or even safe to use the monitor lock for the Collection itself, since that may already be used for semantically different purposes.
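That convention is the one spelled out in the Collections.synchronizedList() documentation: you're expected to hold the list's own lock for the entire traversal, by hand.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class LockedIteration {
    public static void main(String[] args) {
        List<String> list = Collections.synchronizedList(new ArrayList<String>());
        list.add("farmer");
        list.add("chicken");

        // Individual calls like add() are synchronized for you, but iteration is not:
        // by convention you must hold the list's monitor for the whole traversal.
        synchronized (list) {
            for (Iterator<String> it = list.iterator(); it.hasNext(); ) {
                System.out.println(it.next());
            }
        }
    }
}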

Iterators also make it harder to design interfaces over a network — an iterator will inherently result in more traffic than a visitor, because you have to pull an element over to your side before you can decide whether you want to use it. A visitor makes exactly one round-trip. (On the other hand, iterators are more appropriate for displaying a long list of results to a user who's likely to bail out early.)

Even for non-networked, single-threaded access, iterators still don't generally support concurrent modification by the same thread. Some of them support deleting the element the iterator is pointing at right now, but you generally can't delete a different element, or add new elements. You also generally can't tell how far along you are in the iteration, and I'm sure you can think of other issues with it. An Iterator is often referred to as a "poor-man's collection".

Aside from the lack of first-class functions (and the consequent heavy use of iterators) in Java, the other big expressiveness problem with Java collections is that they're designed for imperative-style programming. This manifests in two ways:

    1. There's almost never an option to make a non-destructive modification; i.e. to return a copy of the collection with your requested change made to the copy.

    2. Mutative operations usually return void, or some relatively useless value, so you can't chain operations together.

Most Java programmers would say: "so what?" But once you've played a bit with higher-level allocation strategies (starting with the next one, List-oriented allocation), you'll start to understand that Java's approach is still clunky.

It's actually not bad that Java does things this way. What's bad, in my opinion, is that Java doesn't provide you with good options for doing it any other way. The situation has improved very slightly with Java 5, but I'll leave the topic of functional programming in Java out of scope for now.

To summarize, collection-oriented programming has the following characteristics:

    1. It uses object-oriented Collection interfaces rather than primitive arrays (aka "buffers").

    2. It's primarily imperative in style, making sequential, destructive updates to data structures.

    3. Handling multi-threading is complicated and requires fairly elaborate synchronization schemes for doing iteration in a concurrent environment.

    4. It generally results in cleaner, smaller, clearer, more "maintainable" code than buffer-oriented programming. It also generally results in more memory allocation and more virtual-method invocations, so its innate performance, while tunable, is typically some constant-factor lower than the performance of buffer-oriented style.

Hope that makes sense. Let's move on to the next level of abstraction and see how it differs.

List-oriented programming

This is the default in Lisp, Haskell, and most functional languages. It also happens to be the default in Ruby and (to some extent) Perl; not so much in Python, oddly enough.

It's almost exactly like collection-oriented programming, with the following essential differences:

    1. By default, operations on collections do not have side-effects. If you make a change to a collection, you get a copy of the collection back, unless you specifically use the destructive version of the operation. Most languages (Haskell being the notable exception) provide destructive versions.

    2. The language always has first-class functions, so there's typically a standard suite of higher-order functions applicable to all collections: collect, enumerate, filter, findMin, findMax, dropWhile, takeWhile, and so on. The best-known and possibly best-designed set of higher-order functions can be found in the Haskell Prelude.

    3. Operations on collections are (almost) always composable (Lisp screws it up somewhat with the stupid-arsed distinction between lists and returning multiple values) — this is the benefit of every operation returning a collection as a value, rather than doing what Java does and returning either void or perhaps a single value.

Why is any of this useful?

Well, composing functions can yield pretty compact, readable code. My favorite example is from Ruby — let's say you want to get a sorted list of all the methods on class String that actually change the contents of the string; in other words, the destructive operations. Each language has its own convention for how to name the destructive variants (except for Common Lisp, which has no such convention, just a bunch of crappy old unintuitive names). In Ruby, the convention is to end the name with "!", so you just say:

irb(main):001:0> "".methods.sort.grep(/!/)
=> ["capitalize!", "chomp!", "chop!", "delete!", "downcase!",
"gsub!", "lstrip!", "next!", "reverse!", "rstrip!", "slice!",
"squeeze!", "strip!", "sub!", "succ!", "swapcase!", "tr!", "tr_s!",
"upcase!"]

Nice. If you're a Java programmer, an interesting exercise is to write this in Java. Since Strings are immutable in Java, assume you're looking for all String-class methods that contain the case-insensitive substring "index" (e.g. all the indexOf/lastIndexOf variants, six in all).

I'd be surprised if you could do it in fewer than 10 lines of code, and it would be ugly enough to make a grown man cry. In fact, I'm going to do it, just 'cuz I want to make this point crystal clear. Here's one way to do it:

List results = new ArrayList();
Method[] methods = String.class.getMethods();
for (int i = 0; i < methods.length; i++) {
    Method m = methods[i];
    if (m.getName().toLowerCase().indexOf("index") != -1) {
        results.add(m.getName());
    }
}
String[] names = (String[]) results.toArray(new String[0]);
Arrays.sort(names);
return names;

Most ways to do it in Java look pretty much like this.

The code above may not look so ugly in isolation, but next to the Ruby version, it's downright hideous. Part of this is because unlike higher-level languages, Java has no syntax for reflection; it's all done through a (very) clunky OO interface. This particular function doesn't look so bad, but when it comes time to start invoking methods reflectively, it gets even worse.

But there are a lot of little things going on here, all idiomatic Java style, that add up to create the relative ugliness of the code above:

    1. You have to explicitly allocate the result list. Languages with higher-order functions automatically allocate a collection to hold the results of your visitor function. They provide ways to do it yourself, if you need to, but the default is to do it for you.

    2. You have to use an iterator (in this case, a for-loop, because Java primitive arrays are non-orthogonal to Java Collections, and lack OO APIs) to grub through the methods, looking for matches.

    3. Simple string comparisons are awkward in Java (and what's with the magic number -1, anyway? Jeez.), let alone regexp comparisons. Although to be fair, regexp matching got a little easier in Java 5, when they added some methods to String class to match regexps, bypassing the otherwise clunky OOP interface you normally use.

    4. Collections don't have a sort() method in Java! (Ack!) The List interface doesn't, which "sort of" makes sense, since you wouldn't want to inadvertently try sorting a linked list or doubly-linked list (or worse, a circular list). So you either reach for the static Collections.sort() utility or, as above, manually convert the results to an array, which still (even in Java 5) requires an unsafe cast.

    5. Arrays.sort(), which is already a poor substitute for a sort() method on the arrays themselves, doesn't return the sorted array. It returns void. So you have to return the result on a separate line.
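For what it's worth, a Java 5 rewrite with generics, the for-each loop, and the static Collections.sort() utility tightens things up a little, though it's still a full method plus imports next to the Ruby one-liner:

import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class IndexMethods {
    static List<String> indexMethodNames() {
        List<String> results = new ArrayList<String>();
        for (Method m : String.class.getMethods()) {
            if (m.getName().toLowerCase().indexOf("index") != -1) {
                results.add(m.getName());
            }
        }
        Collections.sort(results);     // still a static utility, not a method on the list
        return results;
    }

    public static void main(String[] args) {
        System.out.println(indexMethodNames());
    }
}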

This might not seem like a big deal, but the Ruby code was an order of magnitude smaller than the equivalent Java code. When you scale it up, you wind up with 10 million lines of Java for every million lines of Ruby.

What about C++? In C++, you simply cannot do it, not in any portable way. Although it doesn't have anything to do with allocation styles, programmers using languages that have no reflection wind up writing in very different styles when they need access to this kind of information — usually by hardcoding it somehow, perhaps in a config file.

OK, the chaining/compositing of functions makes some amount of sense, I suppose — with the caveat that if you get a null pointer exception in the middle of the chain, it's going to be tricky to debug.

But why, oh why would you ever want to have a copy of a collection returned whenever you modify it? Isn't that, well, just piggishly wasteful? Shouldn't there be some sort of, I dunno, EPA citation filed against people who deliberately and knowingly waste memory like that?

And while I'm at it, doesn't that mean your code will be butt slow?

I mean, the Ruby code above created a copy of the list at each step in the chain, and it didn't really have to... why would that be the default?

This one's a bit harder to explain, but the basic idea is that side-effects are evil — they're one of the leading causes of bugs in programs. This empirical observation has held up in study after study, project after project, year after year, in many disciplines and domains. Your code will simply be more robust, on average, if you always try to avoid side-effects.

Why am I not just calling this "functional programming style"? Well, even though this allocation strategy is used heavily in functional programming, the term "functional programming" has a lot of history (and hence, baggage). As one example, it's often contrasted with object-oriented style, but Ruby is about as object-oriented as a language can get, so that distinction doesn't hold here.

Perhaps "list-oriented programming" isn't the best term for it. I tend to think of it as pipelines: each stage is some modification to the output of the previous stage. It's like pipes in Unix, where the output of one command is the input to the next. It's convenient.

It also lends itself very nicely to parallelizing your code across N boxes. Calls from machine to machine are inherently side-effect-free: you don't return the actual data structure from a distributed call; you return a copy of it, one that was serialized over the network (e.g. via XML/RPC).

Will your (in-process) code be butt slow? Well, yeah. Sort of. Most of the time you'll never notice. Ruby code is generally an order of magnitude slower than C code. But it's also at least an order of magnitude shorter than C (or even Java) code. Lisp can approach (and probably exceed) Ruby's succinctness in a big enough project, after you've added domain-specific macros. Other functional languages are equally terse.

But isn't an order of magnitude performance decrease going to be completely unacceptable in a distributed system like Amazon's? Maybe. Sometimes, certainly. But is every single program you write going to be slamming your CPU and/or maxing out memory?

I think good programmers make intelligent choices about allocation styles. And I think the intelligent choice is to use the most expressive style you can get away with — let your profiler tell you if you need to step down a level. Most of your code is not going to be executed very often, and many programs in any environment are going to be I/O bound. So why make the code harder on yourself?

But it's your code, and your call.

Query-based and Pattern-matching styles

Unfortunately I'm out of time, and I want to publish this entry so I can get back to slogging through buffers and indices (guess what style I'm using in my current project?)

Basically the next level up encompasses a whole variety of interesting approaches in which you specify what you want at a very high level, and the compiler generates code to go fetch it for you.

One example is list comprehensions in Python and Haskell. They look a bit like SQL queries. You can say things like "[(x, y) for x, y in mylist if x > 2 and y < 3]", and it'll search for all the (x, y) pairs in mylist (presumably itself a collection of pairs, although the "presume" part is actually an absolute guarantee in Haskell, because it does type inference) meeting your criteria, and return them in a newly-allocated list.

Another example is regular expressions, which provide a very compact way of specifying patterns in text strings. The underlying engine is a pretty complicated state machine — one that you'd really rather not have to write yourself.
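In Java, for instance, the pattern is essentially the whole program; Pattern.compile() builds the matching machinery and you never see it. A trivial sketch, with a made-up input string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {
    public static void main(String[] args) {
        // One compact pattern; the regex engine supplies the state machine.
        Pattern datePattern = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
        Matcher m = datePattern.matcher("published 2005-04-19, edited 2005-04-20");
        while (m.find()) {
            System.out.println(m.group());   // prints each date found
        }
    }
}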

XPath is another example, and so are context-free grammars used (e.g.) in parser generators like yacc, antlr or javacc.

Logic programming is another example of very high-level programming, where you declare a bunch of facts about the world, and then make a statement that's proven true or false — or perhaps it solves for a set of variables that satisfies the expression and all the facts.

There are actually quite a few different kinds of programming that fall into this category, and again, I'm not sure if it's right to split them or combine them as a single style.

Using declarative expressions and pattern matching almost always yields code that's more compact, readable, and maintainable than the alternatives. Some pattern languages are notoriously ugly — regexps are perhaps the best example, and alternate, more readable syntaxes for regexps exist in some languages. But not all of them have to be. SQL is pretty readable, all in all, and XPath isn't so bad either.

Of course, as far as allocation goes, you have no idea what's really happening under the covers. You might be able to make some educated guesses, but that's about it. You'd have to do a lot of work to figure out what allocations are happening for even the simplest queries or pattern-matches.

Does that mean they're bad? Of course not. It just means that you should be reasonably judicious when using them — it probably makes sense, for example, to cache queries and/or results in any sufficiently long-lived process.

There are other nifty examples I'd love to talk about, particularly the pattern-matching syntax in Haskell, ML and Erlang. But I'm afraid I'm out of time, and no doubt so are you.

Summary

Given that I just thought of these classifications this weekend, I haven't had all that much time to absorb them. If you've got comments or opinions, I'd love to hear them — just post a comment on this blog. I'm curious to know whether there are other distinct allocation styles in use, and what the pros and cons are.

I hope this has been a little eye-opening for you. I'm not really in the mood to preach about any particular allocation style. I do think it's useful to be aware of them as you're writing code, and my personal approach is usually to use the highest-level allocation strategy I can, until I've proven that it's not going to work.

Anyway, I hope it was interesting. I thought so!

(Published April 19th, 2005)

Comments

"Hardware is expensive, people are cheap."

The modern thought in most companies is the opposite. It appears that you are speaking tongue-in-cheek, so my comment is probably redundant.

In the end thanks to Taiwan and other countries, hardware is often the cheapest part of any business.

Posted by: Ryan R. at April 19, 2005 08:55 PM

Yes, it was tongue-in-cheek. Although I was speaking on behalf of all the people at Amazon using C++, because obviously they feel that hardware is indeed the gating resource. That, or they don't realize how much trouble they're giving themselves for no good reason.

Posted by: Steve Yegge at April 19, 2005 10:16 PM

Clearly you and me talk to different people. Most definitely everyone near me considers that people and our ability to actually get things _done_ with those people (think 6 hours builds) is the gating "resource" here. Hardware is cheap, plentiful and while not an excuse for bad code, we have gotten more in trouble for not delivering, or delivering buggy code, than delivering slow code.

Posted by: Ryan R at April 19, 2005 11:29 PM

Well... my feeling is that Java is the best solution out there today. Our support for it at Amazon isn't ideal yet, but the gains you get by switching to Java far outweigh any Amazon-specific build or deployment difficulties you'll encounter.

I also like Ruby a lot, and it wouldn't surprise me in the least to see Ruby become a viable language for a great many production tasks.

Personally, I think C++ doesn't really have a place in server-side software development today. I'd use C (or D) for anything truly and demonstrably performance-critical *and* for which you could make some financial case that Java isn't capable of performing well enough. And I'd use Java or higher-level languages for everything else. But that's just my opinion, and this is just a blog...

Posted by: Steve Yegge at April 20, 2005 01:20 AM

"But that's just my opinion, and this is just a blog..."

Don't let me get you down. :)

Actually I agree with you here. I like C++ a lot, for the problem areas where it's suited. But I actually turned to an office-mate the other day and said "why do we use C++ for our services, anyway?"

When I first got here, 8 months ago, I assumed that services were written in C++ because the service layer was a bottleneck in the critical path. But when I think about the services that our team owns, they're all database-bound. All C++ has going for it in this space is inertia.

Posted by: Josh H. at April 22, 2005 09:06 PM