Decision Time

Stevey's Drunken Blog Rants™

The several dozen of you folks who read my Stevey's Drunken Blog Rants (tm) semi-regularly have been watching with some amusement as I frantically bounce around like a ping-pong ball in a wind tunnel, looking for some sort of productivity solution. Looking for a handout, really, from the open-source community. A silver bullet! Mike C. and Peter D. have been secretly (and sometimes not-so-secretly) laughing at me the whole time, and with good reason. I'm sure I'm as fun to watch as the Three Stooges.

I've been blogging internally for almost a year now, and although I've published nearly fifty ten- to twenty-page articles, essays and just plain rants, I think only five of them have been truly interesting. There've been a few nuggets here and there: minor insights, basically, but nothing worth writing home about.

Five interesting essays. This particular blog entry, I hope, is Number Six. I'll recap the ones I liked, to set the context for where it's aaaaaall been leading.

My Personal Favorites

My first (mildly) interesting blog was When Polymorphism Fails. It featured a funny and highly non-polymorphic Opinionated Elf, and was my first concrete articulation of something I'd been feeling for several years -- namely that "traditional" OOP (especially Java's take on it) had let me down. Nothing major there, but it was a small personal breakthrough, as I realized that after maybe 8 years of using Java extensively, I was starting to be pretty unhappy with it. Plus I really liked that elf.

I used to think my Practical Magic essay was pretty interesting. It explores a seemingly paradoxical problem that haunts our interviewing process and our development processes -- the basic problem being that high-level abstractions simultaneously make people more effective via productivity gains, but less effective via a potential loss of self-sufficiency stemming from over-reliance on the abstractions (particularly frameworks).

But of course some random non-Amazon blogger, I forget who now, answered the question in one sentence. Paraphrasing him, he said something like: "Abstractions form a ladder, with the appropriate rung being different for different situations, and all good programmers can intuitively slide up and down the abstraction ladder as necessary." And he's right. With the paradox resolved, my essay loses most of its force.

So I think my next interesting essay was It's Not Software. I wrote it in nearly a trance, in a single three-hour sitting. I had been wrestling for weeks with this question: "Why are we still using C++ to write our services?" It seemed (and still seems) patently inappropriate, given all the evidence that's surfaced over the past three to five years of post-dot-com-boom introspection. It dawned on me, after a four-hour conversation about the problem one evening with Todd Stumpf, that most people are trained to write shrink-wrap software, but that's not what we're doing at all. Even most CS curricula focus on 1-box problems, while we're an N-box problem. So I wrote that essay to try to get people to realize that the shrink-wrap world in general, and C++ in particular, has little relevance to our work here.

I got a lot of nice compliments on that essay. They all basically said: "Wow, great write-up, Steve. I agree with you 100%, except for that weird segue about C++ being bad, which I didn't get at all and don't agree with. But aside from that, it was outstanding." I'm sure you can imagine how I felt. Let's just say the dagger of irony has a poison tip.

My third was Being the Averagest. I really like that one, although I was fairly nervous for a few weeks after writing it. Structurally, the essay is a multi-level pun on Paul Graham's famous "Beating the Averages" essay. My title is (almost) an anagram of his, and my sub-sections are a fugue-like inversion of his essay. Paul posits a hypothetical middle-of-the-road language called Blub, which I use as a gross unit of measure for stank-ranking SDEs, and I posit a hypothetical middle-of-the-road programmer named Bob, who doesn't realize he's stuck in Paul's "Blub Paradox". The central thesis of the essay is that programmers have no idea how much better they could be, and even those who do have little incentive to improve. It's basically an individual call-to-arms: don't stop studying after you graduate.

Incidentally, on average, the Bobs I've known have generally been very smart. I just like using the name because: (1) it's short, (2) there's always a chance I'll get to work it in as a doubly-recursive self-referential acronym, and (3) it worked well with the name "Sue" to help illustrate a secondary theme of the essay, which is that HR stack-ranking systems, clumsy as they are, are still the most common measure of SDE productivity, after over 30 years of research in the field.

My fourth halfway-decent essay was Ancient Languages: Perl, in which I finally found a way to get Perl out of my system. I'm sure it sounds like it was just blind hate-politics to many people, but it was in fact the most difficult essay I've ever written. It's surprisingly hard to build a convincing case against an entire (and popular) programming language in a very short space. The language adherents will hear nothing but a faint buzzing sound unless your arguments are simultaneously shocking, persuasive, and memorable.

As I pondered (on and off for nearly a year) how to articulate how I felt about Perl, I had some genuine, perhaps even original insights about the relationship between programming-language religion and "real" religion. Most of it never made it to the presses, but a capsule-summary survived the chopping block near the end of the essay, and that paragraph alone is worth the price of admission.

I know the essay was effective, even if it will take years to make any headway. I know this because our nice Perl community made a respectful effort to send a delegate to negotiate a peace treaty with me, in the comments section. Unfortunately I had to resort to the rather medieval technique of sending the messenger's head back in a box, just to make sure they know it's War we're talking about. Sorry, Doug!

Of course I should have known that wasn't the end of it. I hear we're gearing up to hire Larry Wall here at Amazon, which means I'll have to delete the blog entry on the grounds that one oughtn't be disrespectful to one's co-workers. There's no grandfather clause for disrespect, I'm afraid. But years from now, those of you with long memories may still remember my valiant and desperate stand, during which for a time it even appeared I had the upper-hand. Ah, me. At least there's still the shorter, incredibly funny blog entry by Mike Spille, Perlaphobia, which (little did I know) predates my rant by about nine months.

The fifth and final blog I'd consider semi-interesting is The Emacs Problem, from back in January. And it's only interesting at a meta-level: it's not so great in itself, but I think it's the entry that best captures the ping-pong ball mentality that pervades most of the rest of my writing. "Lisp is great! Except for how much it sucks! Ruby is great! Except it also sucks! Java is also sucky/great! I am a human ping-pong ball!"

Plus, as a side-bonus, that entry contains a pretty good explanation of why Lisp programmers are so dismissive of (most uses of) XML.

That's it! 5 out of 50. 10% interesting, 90% crap. I'm a walking, talking example of Theodore Sturgeon's Law. ("They say 90% of Science Fiction is crap. Well, 90% of everything is crap.")

My most interesting blogs were written at odd times, under oddly stressful conditions. Today's is no exception -- I've had insomnia for three straight days, it's 4:00am, and I have meetings and deliverables all day tomorrow. I hope this winds up being coherent.

The Underlying Problem

So I had a thought, and rather than try (too hard) to be clever and "lead" you to it, I'll just dump it on you unglamorously.

Of all the stuff we do at Amazon, how much of it is provided by C/C++, as a language feature or library? Almost none. In fact we've even replaced many of the standard libraries with our own proprietary, in-house versions.

Same question goes for Perl, and for Java. They do provide a lot of standard libraries, and that helps. But we've still written a huge pile of Amazon-specific stuff in both languages, and we continue to churn more of it out by the truckload.

My thought was basically along these lines...

1) I know a bunch of us want a better language. I've talked about the problems with Java and Perl, and hinted at the problems with C++. I think I've made my point, and I should stop talking about it so much.

Just looking at the number of subscribers to functional-programming@, lisphacker@, and other non-mainstream language lists, it's obvious that we have a bunch of folks here who are at least somewhat interested in harnessing the power of language abstraction to solve our N-box problems.

2) We actually have a pretty good understanding of the essential and desirable characteristics of this language. For instance, it needs to be growable, and it should support massive concurrency in the core language and runtime, since you can't just add that in later as a library.

3) It's becoming obvious that all of the candidates for the Next Big Language (a relatively short list, all in all, of perhaps six or seven languages) are what you might call "fixer-uppers". They're going to need a LOT of assistance to get them production-ready.

4) But hey, didn't we just conclude that C++, Java and Perl all required a LOT of assistance to become production-ready here? Why yes! Because what we do here is different. It's not traditional software development, not by a long shot, and our scale demands are so large that we're arguably even materially different from simpler companies with similar business models -- they can get away with using EJB and Anyone's Middleware and Joe's Database and Bob's Hardware. We can't.

So no matter what language we use, we're going to wind up spending a lot of developer resources on writing custom code -- some of it quite low-level, too: kernel hacking, writing our own memory allocators, writing compilers for our custom languages like Catsubst and JEG, and so on.

In fact, I think you'd be quite surprised at the brilliant and sophisticated coding going on around the company that has nothing to do with e-commerce per se. It's got a lot more to do with the general problems we face -- problems caused by our scale, not so much our business model.

Although there are dozens (possibly hundreds) of equally good examples, a few should serve to illustrate my point:

    • the general problem of real-time function monitoring led us to write at least one custom time-series analysis language.

    • the general problems of scalable configuration and software load-balancing are forcing us to write our own replacements for traditional solutions like DNS, LDAP, hardware routers, and even core protocols.

    • we've had to write our own custom build systems to deal with the scaling problems associated with our actual source code -- it's one of the biggest code bases in the world.

Well, gosh. I could go on all day about "component frameworks" that we've had to write: languages, persistence engines, caches, configuration and build systems, deployment systems, templating systems... it's a lot.

We're actually rather (in)famous for writing so much custom software and servware. What third-party products do we actually use for "production" stuff -- i.e. anything that's even remotely related to the ordering and fulfillment pipelines? I can't think of many:

    • We use Oracle -- but for less and less these days. We're starting to lean more towards other persistence engines for production stuff: Berkeley DBs, MySQL, XML databases, and even home-grown solutions. Oracle is expensive and, well, buggy. I mean it's great in certain ways, but let's face it: we've spent hundreds of person-years trying to come up with clever ways to keep it from being the main scaling bottleneck and the leading cause of availability problems. Even after all that effort, it's still near the top of both lists.

    • We use BEA WLE, but we're trying to phase it out (in part because BEA's about to go Tango Uniform, but also because we're not really using much of its functionality, and we've hacked on it so much that it's become clear we could have done a better job writing it ourselves).

    • We use Tibco, but we're trying to phase it out because it doesn't scale, for reasons that I scarcely need mention at this point. If you're brand-new to Amazon, the synopsis is: we were worried that it wouldn't scale, but we're not worried about that anymore. It went ahead and proved it on its own, using the much-dreaded Proof by Repeated Example.

    • We use Apache, which is famously un-scalable, but it's such a useful piece of software that we've put in the effort to make it scale, by not beating on it too hard.

    • We use Linux, which (bless its heart) has actually been gradually improving in performance and scalability. Perhaps someday we may even reap the benefits of those improvements. You do know we're still running a whole bunch of RedHat 6.x boxes all over the place, right?

      • Yup. And it's entirely, 100% C++'s fault. It's totally non-portable, even across operating system versions. Amazon can't afford the labor and testing hours to do the port! This is one of several (maybe a dozen) rather counterintuitive ways in which using C++ is killing our ability to create robust, scalable, highly-available systems here. It's the main reason I write this blog, actually.

I can't really think of much else. Remedy and Perforce, if you count them as "production" systems. Sure, why not, let's count them. And sure enough, they both have significant scaling issues, and they're not even in the direct line-of-fire from our website.

Language to the rescue?

Various people around the company, including me and my boss, have been pondering this problem, and we've all been hoping that the Big Search for the Next Language will wrap up, and an individual or perhaps a small team will demonstrate amazing, earth-shattering productivity gains by building something incredibly cool in almost zero time, something that obviously couldn't have been done by a human being using Java, C++, or (possibly) Perl.

If that sounds comically lame to you, well, you should've told me. Because that's the only approach being contemplated right now. There are a few small groups getting ready to try using Common Lisp or Scheme for some production projects, here and there, and a few others are toying with the idea of trying Erlang or Haskell or OCaml.

After studying this problem quite a bit, I think I've come to the following unexpected conclusion:

There *is* such a thing as a silver bullet, but you have to buy them in bulk and hand them out to a whole army.

In other words, languages can be powerful tools, but they need strong community support in order to realize their potential payback.

I still firmly believe that our languages are holding us back, each in its own unique way, and that language abstraction is one of the best vessels for producing large step-functions in productivity. However, that belief has been slowly refined in two big ways.

The first refinement is that not just Any Old Abstraction will do. List comprehensions, first-class functions, path-expressions, destructuring bind, pattern-matching and their ilk won't buy us an order-of-magnitude improvement. Sure, they'll help. But even in aggregate, they're not really a game-changer, because Amazon is a unique game.

I think what we need is Amazon-specific language abstraction. Perhaps not necessarily e-commerce-specific, but scaling-specific. We need a language for doing distributed computing.

It's actually a bunch of languages (or abstractions) for a wide range of specific problems, including persistence, load-balancing, deployment, service calls, and many others. But I'll call it "a language" for the moment, to keep my arguments on track.

The second refinement to my belief is that languages need help. Especially at Amazon. I'm going to have trouble articulating all the subtle distinctions I'm envisioning here, but I'll do my best.

I've already brought up the first piece of evidence, which is that we've created a ton of libraries (and even several custom languages) to enhance the "out-of-the-box" support offered by our three primary development languages. I'm talking specifically about shared stuff -- code that's useful across teams with very different business responsibilities. FOL, BSF, all the SPLAT stuff, and a growing amount of "open-source" (but still Amazon-internal) software that's being shared on the free market via ad-hoc mechanisms.

I know we don't like sharing. I know we're decentralizing. When we talk, it's supposed to be through hardened APIs. I know this is the party line, and for the most part, I actually agree with it.

However, we do have a bunch of shared stuff, and not enough SDEs in SPLAT and Dev Tools to support it adequately, and everyone knows it's an open problem that isn't magically going to go away. Many people want it to go away, because:

    1. Centralized groups don't fit with the pristine beauty of our independent-satellite/2PT/dev-center/etc. mental models of decoupled development. Any way you look at it, it doesn't bode well: as long as large centralized groups remain, it appears (superficially) that either our basic model is flawed or our execution is flawed. (I'll offer what I think is a better explanation later -- I don't think either is fundamentally flawed, but some things need to be permanently special-cased.)

    2. Even if they were officially blessed as part of the model, centralized groups are intrinsically hard to work with. For one thing, they have to queue requests when things get busy, and you don't like waiting in a queue, because you have stuff to do. For another, they tend to prioritize based on the Greatest Common Good, so if you have a request that's unique to your group, you're inherently less likely to get it dealt with. There are other problems as well.

    3. Centralized groups have a way of growing without bound if you don't keep a careful rein on them, because of various subtle dilution/diffusion-of-responsibility dynamics that I won't go into here. But you know what I mean. It's not a knock against them; it's just natural laws at play.

So the company's trying hard to minimize (if not eliminate) all centralized IT functions, and yet there are some stubborn holdouts. Why? Are we screwing something up?

I don't think we are. I mean, yeah, we screw some stuff up, and everything could be better, and so on. But when you get right down to it, and dig into what the remaining centralized groups are actually working on right now, it's stuff that either must be shared in order for groups to communicate at all, or stuff that's obviously, provably more cost-effective (at least at the moment) to keep as a centralized function.

Libraries vs. Languages

My personal take on it is that many of the primary coding responsibilities of Infrastructure, Dev Services and the other remaining centralized software groups are things that really ought to be rock-solid abstractions. At least libraries, but I tend to favor language abstraction when possible, and harp on it as much as I can, for several reasons:

    • Libraries have a funny way of growing into fat frameworks -- they're easy dumping grounds. But putting on those extra pounds makes them harder to learn, use, configure, and re-use for purposes other than their original intent. Language abstractions don't typically suffer from this pitfall.

    • Language abstractions are more amenable to support from compiler-like tools. A compiler can't check whether you're using a library or service properly (e.g. initializing it properly, calling the functions in the right order, and other dynamic usage issues). But it stands a pretty good chance, depending on the language, of figuring out whether you're using the language itself properly.

    • Language abstractions are a rock-solid barrier: the boundary between the source code and the compiled code is pretty strong. It can feel like the language is your universe, even though when you stop to think about it, you're really programming on a tower of machines.

      • In fact, if you think about this hard enough, you'll realize you're not on the top floor of this tower, and that you're actually programmatically manipulating large, murky, ill-defined machines that should be better abstracted, but that seem to defy your ability to encapsulate them as reusable, extensible libraries. Take Brazil or Apollo, for instance. Or the incredibly complex pipeline that transforms a page request into page content. These machines are bigger than a library, or in fact any OS-level or box-level abstraction. That's the nature of the servware space.

But why do I think the programming language can help with these problems better than libraries can -- at least with some of the problems, anyway? Am I talking about, you know, for-loops and stuff? What I'm saying may not initially make much sense.

Unfortunately, the answer is slightly off-topic, and this blog is turning out to be way longer than I anticipated (as usual), so I'll have to shelve the bulk of that discussion for now. But I'll offer a few examples, just to prove I haven't totally gone off the deep-end here.

One example is Erlang. It's a language that I'm not (yet) super familiar with, but I will be soon, as I've finally got a good book about it.

Erlang is a language invented by the Ericsson telecommunications company for implementing high-performance, real-time (both hard and soft real-time), massively distributed and scalable telecommunications networks and online transaction processing systems. The FAQ says they started the project in the mid-1980s, and the goal was to make it easier to program telecommunications systems, by including features that made writing such systems simpler, and avoiding features that made them more complex or error prone.

That's an important distinction, you know. They tackled both sides of the problem. Writing some libraries might make it easier to do our jobs, but C++ and Perl are both notoriously error-prone languages, and there's essentially nothing you can do to "remove" the error-prone-ness from them. You have to document it, remember a lot of strange rules, and hope that not too many of the inevitable accidents will make the front page of the Wall Street Journal.

Stop and think for a second about all the training and campfire lore that goes into teaching our developers (and reminding ourselves) how to avoid the hundreds of common pitfalls in Perl and C++. Think about the high premium we place on knowledge of C++ language details during the interview process (in the many groups that still use C++) -- we know it'll save us time, bugs, and outages if the candidates already come pre-equipped with a lot of that knowledge.

Ericsson, in focusing on writing a safe language, erased a great deal of that effort permanently. At Amazon, it's a tax. It never ends, and we pay it constantly, because at our current rate of growth, we'll always have people using these dangerous languages without being experts in them. The tax is unavoidable; the only choice you get is how you want to pay it:

    • Pay with time -- pad the schedule to give the developer extra time to study the language.

    • Pay with people -- tax the time of your folks who are more familiar with the language, and have them help with code reviews, design decisions, training and brownbags, etc.

    • Pay with quality -- let the bugs fall where they may, and when your pager goes off, have the on-call look into it.

At the risk of beating a dead horse here: If your core language is error-prone, libraries and APIs typically don't help make it less error-prone (or they can only do so along finite, narrow dimensions).

It's way too much to cover here, but Erlang has language-level facilities (i.e. syntax, language-runtime support, and carefully-defined semantics) for dealing with things like massive concurrency (millions of threads on a box), restarting processes, doing peer re-elections, handling exponential back-off and other network-level queuing strategies, distributing data and computations seamlessly... they've come awfully close to making N-box programming feel just like 1-box programming, but without sacrificing robustness in the face of failure.
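
Just to make that last bit concrete, here's a tiny sketch of the "huge pile of cheap threads, each with its own mailbox" style. It's in Haskell (another candidate on my list) rather than Erlang, since I haven't learned enough Erlang yet to do it justice; the thread count and the make-work (doubling a number) are invented purely for illustration. The only point is that each worker can block on its own private mailbox without tying up the box, and everybody communicates by messages instead of shared state.

    -- A minimal sketch, assuming GHC's lightweight (green) threads.
    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan (newChan, readChan, writeChan)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM, forM_, replicateM_)

    main :: IO ()
    main = do
      let n = 100000 :: Int              -- push this toward a million on a big box
      replies <- newChan                 -- one shared channel for worker replies
      mailboxes <- forM [1 .. n] $ \i -> do
        mbox <- newEmptyMVar             -- a private mailbox for this worker
        _ <- forkIO $ do
          msg <- takeMVar mbox           -- blocking here parks only this thread
          writeChan replies (i, msg * 2) -- reply by message, not shared state
        return mbox
      forM_ (zip [1 .. n] mailboxes) $ \(i, mbox) ->
        putMVar mbox i                   -- send each worker one message
      replicateM_ n (readChan replies)   -- wait until every worker has answered
      putStrLn (show n ++ " lightweight threads exchanged messages")

Erlang's spawn and receive give you the same basic shape natively, with the restart, supervision, and distribution machinery layered on top -- which is exactly the part we'd otherwise end up writing ourselves.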

Erlang is an interesting case-study, and for all I know, it might actually be the best choice for our next language. But at the moment, I'm just citing it as an example of how programming languages can provide robust, convenient abstractions for real problems that you're dealing with every day.

I didn't mean to give the impression that Erlang is the only example. Build systems (and I'm pointing my finger directly at both Ant and Brazil) are complicated machines that would be far better served with direct language support than the ad-hoc assemblages of XML, config-files, scripts, plug-in systems, and the other flotsam that they're (both) composed of today. And I'm guessing people would be a lot happier with Apollo if it offered specific language support as well.

That's it -- no more examples. Cripes, it's 10:30am, and I've been writing for hours; I'm really out of time here. So I'll wrap up.

If not Centralization, then at least Cooperation

My rambling so far can be summarized as follows:

    • One way to view my past year of blogging is that I'm a zany crackpot who can't make up his mind. But I prefer to view it as an exploration of our problem space(s): developing servware, interviewing, maximizing our productivity, minimizing our outages, and so on.

    • My blogs contain a fairly vast body of evidence suggesting that C++, Perl, and Java aren't serving us well -- partly due to fundamental problems with these languages, and partly due to the unique nature of our business and scaling problems, most of which aren't addressed directly by these languages.

    • Mailing-list subscribership and other signs would seem to indicate that there's a good-sized body of programmers here at Amazon who would be more than happy to be using a higher-level language.

    • All of the possible candidates for this language appear to be missing things (usually tools, libraries, docs) that we take for granted in C++, Perl and Java. This is because C++, Perl, and Java are more popular languages. It takes a community to build strong support levels.

    • Most of the efforts to bring in one of these languages have been piecemeal and small-scale. But my recent realization (obvious in hindsight, I suppose), is that to do it right, we need a bunch of people to jump in and commit to picking the best candidate language and growing it. Ourselves. As in, rolling our own, Amazon-style, not waiting for handouts from the external Open Source community. Because they're taking too damn long.

Interestingly enough, any developers who have excellent low-level C/C++/Linux programming skills would be ideal candidates for making changes to the actual language runtime -- e.g. adding in support for high-performance I/O, if the language doesn't have it yet, or tuning the support for our flavor(s) of Linux.

People who like to write tools, they'd write tools. Many tools can be written in some other language than the target language; e.g. our Perl programmers could wind up writing support tools for Haskell or Erlang or whatever. Perhaps not as ideal as having them do it in the target language, but it all has to be balanced against pragmatic concerns. Schedule, for one. And some tools may not initially even be possible in the target language.

A lot of people would have to contribute (or find and port) libraries. This is just a fact of life. It means that a fair amount of work would be spent implementing "shared" (or dare I say it, "centralized") functionality. Certainly the work itself could be distributed.

And some teams would have to bite the bullet and code up production systems in this language, which would put them in the line of fire for potentially tricky support issues -- until we have enough depth of experience with the language, and tools for diagnosing problems, you could just as easily be debugging a problem in the language itself as a problem in your application logic. That makes it harder.

I'm not claiming this will be easy. However, I'm also not the first person to suggest it -- I know at least three or four SDEs in widely different Amazon orgs who have been hinting to me that we might want to tag-team on bootstrapping a new language.

Now it's official: I'm out of time. But I think I got the initial message across. Please comment -- including objections, fierce resistance, etc. I'd rather get that stuff up front, so we can think about it carefully, than have it sprung on us after we've already started down this path.

Initially, just FYI, my personal candidate pool of languages, from which I'd want us to agree on one, includes Gambit Scheme, Erlang, Ruby, Common Lisp, OCaml, and Haskell, in approximately decreasing order of preference. Each has major pros and cons over the others. It would need some serious discussion. But I'm tossing it out to get the shock-factor out of the way as early as possible.

(Published April 28, 2005)

Comments

Continuous optimization. This is a huge and unexplored area of supply chain that I am on task to work on. I haven't worked on it yet, and while tentatively we've planned on using Java (anything is better than C++!), I am 100% open to leapfrogging my performance.

The issues I have found with non-supported languages at Amazon are as follows:

    • Manager/Peer pressure. "You want to use WHAT?!" Having a cheering section to at the very least cheer you up helps.

    • Database interoperability. The DBAs don't allow you to connect to Oracle without using APLOracle? Crap! One solution is to use MySQL, which is what I may consider.

    • Service and communication interoperability. Rarely does any important system stand alone. You need to hit up remote cat, you need to capture and inject pubsub messages, you need to access HTTP/BSF services, etc.

All of these are easily solvable, but not by just one person with a day job.

I want to do this, we need to talk more obviously!

PS: I've been talking about these things for a while, just not on a blog, and also I've been stuck with the 3 problems outlined above. I think the hardest problem (#1) for me won't be a big issue anymore, but I still need help.

Posted by: Ryan R. at April 28, 2005 11:12 PM

For what it's worth, C++, Perl and Java were all disallowed at Amazon early on. Each of them succeeded in overcoming popular resistance by accumulating a critical mass of developers who eventually just pressed forward and started getting real work done in these languages.

Initially it was just ANSI C, our in-house web-templating language, and Emacs-Lisp for the original Customer Service app suite (which was quite popular for many years, actually). Oh, and the inevitable handful of scripting languages that are essential in Unix -- shell-script, awk, etc. All other languages were strictly forbidden.

Perl broke through first, and they had the easiest time doing it, because (a) there were many problems for which Perl was arguably the best option in 1995-1998, and (b) Perl is kiiind of soooort of a standard on Unix, so it was hard to keep it out.

I personally think Python would have been a smarter choice; it's the choice Google made, so obviously it doesn't cause grievous ills for highly-scalable problem spaces. But Perl had Python beat in the CGI space for quick-and-dirty internal web apps, in Unix integration, and in overall popularity.

C++ came next. The "why can't I use C++?" griping started to ramp up heavily from mid-1998 to end-1998, and then it started appearing in big projects like Customer Master. Of course it *immediately* (as in, on the very same project) began causing portability/compatibility headaches -- we had to do a GCC-to-CXX migration for BEA compatibility, then migrate back later during PARCS. It was months of work and caused no end of bugs. But the C++ fans pressed forward fairly aggressively, once they realized they outnumbered the original Amazon architects and could brazenly flaunt their non-portable template tricks.

Java took much longer. It was used first in SCOS, and their framework was adopted by CS, but Swing didn't play well with the existing xterm-based network architecture, so it was incredibly sluggish through most of 1999, and I believe it was the infrastructure guys who put up the most resistance, saying there was no possible way we could afford the hardware costs to make Java scale. For all I know, they may have been right; I'm just recounting the sequence of events as I remember it.

I don't think Java transitioned to a first-class language at Amazon (where you could use it with impunity) until... mid-2003, maybe? Pretty recently. And again, it was largely a big mass of competent Java programmers who finally just forced the issue. All they needed to do was prove that it worked. They were carrying the pagers for their own systems, and they were hitting deadlines, and the systems were performing adequately for the most part, depending on who you asked.

Even today, though, it remains to be seen whether someone could sell the idea of running a Java 5 JVM instance on essentially every online and production machine. Many people are still highly skeptical of Java's value-add, and take a "run it on your box, but not MY box" stance. I personally can sympathize with both sides of the argument, but generally speaking, I'd favor replacing C++ with Java.

I should make the important point that I trust Java and the JVM. Regardless of how I feel about Java's long-term viability, I think that being a Java shop (plus Ruby or Python for scripts, of course) would be a huge improvement over being a C++ shop, and could easily carry us for 4 to 6 years. Unfortunately, Java seems to be losing some ground to C++, for a very good reason: having a unified platform for our primary websites is far more important than the particular language choice; just ask anyone who has to deploy to three or more platforms today.

C++, Java and Perl all have a lot going for them: they're well-documented, stable, mature languages with plenty of libraries, big communities, and reasonable interoperability options. At least we're no worse off than the rest of the industry.

But my hope is that a bunch of us can agree to team up, follow the example set by the Perl, C++ and Java early-adopters in years past, and share the pain of driving a higher-quality language to Amazon-level production readiness.

I envision more than just incremental improvement. I think that if we make good choices, and bust our butts on it for a while, we could put ourselves in a position to do the really fancy stuff Amazon always dreams about, like building self-regulating systems with reinforcing feedback loops that need far less human intervention. Right now I just don't see much of that happening. I know it's a minority opinion at the moment, but I hold C++, Perl and Java largely accountable for our lack of progress on this front.

But I can guarantee a fair amount of mundane work up front -- we'll have to solve issues like your #2 and #3 before we can move on to more interesting problem domains. (FWIW, the Ruby folks here have quietly gone and solved at least #3 already. That's one of the side-benefits of using a cool high-level language: you can generally get stuff working really fast, with very little code, and it's more fun too.)

Posted by: Steve Yegge at April 29, 2005 12:49 AM

Problems #2 and #3 can be fun anyways. I don't dislike working on platform issues, especially when you are working towards a magnificent new future.

So if it is your hope that some people will team up, any suggestions for those of us who have the projects and time to team up on?

Posted by: Ryan R. at April 29, 2005 01:29 AM

Well, I'd like to give it a couple days, to give more people time to read this, think it over, and comment. In the meantime, I'm finally forcing myself to learn Erlang and Gambit Scheme, since they have uniquely scalable concurrency support that doesn't seem to exist in any other languages out there.

In both Gambit and Erlang, it's easy to fire up half a million to a million active threads on a single workstation, passing messages, each of them able to block on I/O at will without affecting any of the others. You Just Can't Do That with Java (or Ruby, or Python, or any "normal" multi-threaded or multi-processing language I'm aware of). It makes approaching the problem of thousands of concurrent connections a lot easier when the language has that level of concurrency support built-in.

If we were to choose any language other than Gambit or Erlang, it would be a prerequisite to be able to modify the language runtime to give it Erlang-style concurrency -- and get the language maintainers to accept the patch, so we don't have to maintain it in-house. Depending on the language, that could be nontrivial, to say the least.

Hence my initial predisposition towards Gambit. Although it would need a lot of library work, it has a surprisingly strong core set of language features and tools. It would be a good foundation to build on. But I need to study it for a few days to get a feel for it.

Posted by: Steve Yegge at April 29, 2005 02:00 AM

I learned Scheme in university. While I wasn't 100% enthusiastic about it at the time, I didn't have the difficulty some other people did with its highly functional and recursive nature. I didn't understand why tail-recursion is better until a few years later.

But one thing we never touched on was macros. Does Scheme have macros? Is that why you are predisposed to Gambit Scheme? From the respective sites, it appears that Erlang is a bit more, ahem, production-ready than Gambit Scheme. Maybe I read the wrong site, though?

Posted by: Ryan R. at April 29, 2005 04:28 AM

Gambit Scheme tracks the R4RS and IEEE Scheme definitions, which don't specify a macro implementation. However, it does implement both Common-Lisp-style macros and a superset of the hygienic macros specified in R5RS.

Posted by: Derek U. at April 29, 2005 08:21 AM

Hmm, we could instead try to hire Matz and dedicate a few engineers to getting Ruby 2.0 out the door :)

Posted by: Andrew W. at April 29, 2005 08:35 PM

That's not such a bad idea. But I think he's already being paid to work on Ruby 2 and the VM full-time; probably all that hiring him here would accomplish is making him lose 2-3 months from his schedule.

Posted by: Steve Yegge at April 29, 2005 09:58 PM