Distributed Programming Boot Camp

by Michael Ernest



Five words on distributed programming: done right, it ain't easy. And by "done right" I don't mean code that merely compiles and runs; I mean that the programmer understands what their code does, and can apply that understanding to reading exception dumps correctly, fixing problems, and, more importantly, knowing when there's nothing to fix.

So let's first talk about what is not, in the fullest sense of the word, a distributed program. The term we often use for examples like the one below is network programming. You'll see why very soon.

Applets are a good example of what we do not mean by distributed programming. Applets download from an HTTP server to a browser. The browser runs a local JVM to 'execute' the applet. On one hand, there are context-sensitive hooks an applet can use to point to the hosting web page (getDocumentBase()) and the enclosing browser (getAppletContext()). On the other hand, none of these hooks point to an active process, just file/directory locations or environment variables.

In a "true" distributed program, two or more processes interact; they share data (and possibly code) in such a way as to "distribute" the work that needs to be done. In this sense, applications that follow the client-server model don't "truly" qualify, although the semantics are often the same. If you really want to get snooty about it, you would even say that web servers and browsers aren't "true" client-server roles, but rather a "request-response" model -- in short, "network" programming.

Imagine instead a program running on your machine that takes on a large job that can be broken down into chunks; its primary purpose is to determine how long it will take to process the job, based on how many compute servers are available at the time. It then submits the chunks to the servers, one at a time, requests updates on each chunk (or just waits on the result), and provides regularly updated estimates of completion time to the user. That's distributed work -- several computers working together on a common task.
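The shape of that chunking logic can be sketched in a few lines of Java. Everything here is invented for illustration -- the "compute servers" are simulated with local threads, since the point is only the pattern: split the job, hand out the chunks, collect the results.

```java
import java.util.ArrayList;
import java.util.List;

// A toy "job": summing a large array, split into chunks that could
// each be shipped to a separate compute server. Here the servers are
// simulated by local threads; all names are hypothetical.
public class JobChunker {
    static class ChunkWorker extends Thread {
        private final long[] data;
        private final int from, to;   // half-open range [from, to)
        long partial;                 // result of this chunk

        ChunkWorker(long[] data, int from, int to) {
            this.data = data; this.from = from; this.to = to;
        }

        public void run() {
            for (int i = from; i < to; i++) partial += data[i];
        }
    }

    public static long sum(long[] data, int servers) throws InterruptedException {
        int chunk = (data.length + servers - 1) / servers;
        List<ChunkWorker> workers = new ArrayList<ChunkWorker>();
        for (int from = 0; from < data.length; from += chunk) {
            ChunkWorker w = new ChunkWorker(data, from, Math.min(from + chunk, data.length));
            workers.add(w);
            w.start();               // "submit the chunk to a server"
        }
        long total = 0;
        for (ChunkWorker w : workers) {
            w.join();                // wait on (or poll for) each result
            total += w.partial;
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] data = new long[1000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data, 4));  // prints 500500
    }
}
```

A real version would replace start() and join() with network submissions and status requests; the bookkeeping -- which chunk went where, and which results are still outstanding -- stays the same.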

Or imagine a program that tracks company purchase requests. Many such requests have to be checked against inventory, an approved vendor list, purchase amount, and finally management approval. An application that expedites such a request, known as a workflow application, must also reduce the overall metadata for purchases down to a view that each party (store manager, contract manager, purchase officer, division vice-president) cares about. It must also ensure the request doesn't fall through the cracks or sit in an indefinite holding pattern. That's also distributed work, in that the overall product (an approved request) relies on the individual work of several other processes.

You may surmise that neither of these examples needs a distributed architecture. The job chunker could be written as a batch program, for example, and email often suffices for workflow tasks with many responsible parties. Both of these approaches are better understood and simpler than a distributed solution, but they need people to serve as the knowledge base or "glue" between the pieces. It's the need for better efficiency and consistency that drives people toward a distributed approach -- business people, anyway. You geeks probably just think it would be "cool." Seriously, though, for some companies only this kind of architecture allows them to manage complex work acceptably and stay competitive.

What distributed programs benefit from is the asynchronous nature of complex, interrelated work. Many related but separate activities run at their own pace. To link them together, you need a rendezvous scheme that doesn't force one activity to halt progress while waiting for another activity to finish. The notion of de-coupling, a benefit of good object-oriented design, is critical to distributed systems.
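That rendezvous idea fits in a short sketch using two threads and a shared "mailbox" (names invented for illustration). The consuming activity doesn't halt while the producer works at its own pace; it keeps making progress and only commits to waiting when it has nothing left to do.

```java
// A minimal rendezvous sketch: one thread produces a result at its own
// pace; another keeps working and collects the result only when ready.
public class Rendezvous {
    private Object result;           // the shared "mailbox"
    private boolean done = false;

    // Called by the producing activity when it finishes.
    public synchronized void put(Object value) {
        result = value;
        done = true;
        notifyAll();                 // wake anyone who chose to wait
    }

    // Non-blocking check: lets the consumer keep working in the meantime.
    public synchronized boolean isDone() {
        return done;
    }

    // Blocking take, for when the consumer has nothing left but to wait.
    public synchronized Object take() throws InterruptedException {
        while (!done) wait();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        final Rendezvous r = new Rendezvous();
        new Thread(new Runnable() {
            public void run() {
                // Simulate slow, independent work.
                try { Thread.sleep(100); } catch (InterruptedException e) { }
                r.put("chunk result");
            }
        }).start();

        int progress = 0;
        while (!r.isDone()) {        // stay busy instead of halting
            progress++;              // stand-in for useful local work
        }
        System.out.println(r.take() + " arrived after " + progress + " checks");
    }
}
```

The de-coupling is the point: neither thread knows anything about the other's internals, only about the mailbox they agreed to meet at.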

What distributed systems suffer from is the unreliability of networks. By this term, we do *not* mean that networks fail anytime, all the time. Of course they can fail, as can anything else. Unreliability in this context is relative -- in this case, relative to the likelihood that a CPU instruction will fail.

The system I'm on now, a SPARC Ultra 1 running at 167 MHz, runs a potential 167 million cycles (instructions) per second -- about 600 billion possible instructions an hour, or over 14 trillion a day. How many of those fail, affecting program operation? Don't know -- it's been quite a while since I've had that happen. Compared to that, networks fail a lot, to such a greater degree that we can't write distributed programs with the same level of comfort that we write local, single-process programs.

So how does this manifest itself in the form of working code? In the simplest terms, make a method in one program call a method in another program. "But wait a minute," you OS-savvy types will say, "isn't that kinda sorta what any process does when it talks to the kernel of our operating system? Isn't our program a process that is communicating with another process, the kernel?" Well, yeah, but kernels only emulate asynchronous contact with user programs. And, if they were unreliable, we wouldn't use them for any reason (in the Unix world, anyway. You may feel differently, depending on your current Service Pack).

To work effectively together, two programs have to

a) be able to find each other and connect
b) talk to each other correctly
c) take network unreliability into account
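Java's RMI gives you a handle on all three requirements, and a minimal sketch fits in one file. This is only an illustration: it runs the registry, "server," and "client" in a single JVM, and the interface, names, and port are all invented. Requirement (a) shows up as the registry lookup, (b) as the shared Remote interface, and (c) as the RemoteException that every remote method must declare.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Requirement (b): a shared interface both programs agree on.
interface Greeter extends Remote {
    // Requirement (c): every remote method can fail in transit.
    String greet(String name) throws RemoteException;
}

class GreeterImpl extends UnicastRemoteObject implements Greeter {
    GreeterImpl() throws RemoteException { super(); }
    public String greet(String name) {
        return "Hello, " + name;
    }
}

public class RmiSketch {
    public static void main(String[] args) throws Exception {
        // Requirement (a): a well-known place to find each other.
        Registry reg = LocateRegistry.createRegistry(2099);
        reg.rebind("greeter", new GreeterImpl());

        // The "client" side: look up the stub, call it like a local method.
        Greeter stub = (Greeter) reg.lookup("greeter");
        try {
            System.out.println(stub.greet("world"));   // prints Hello, world
        } catch (RemoteException e) {
            // Requirement (c) again: the network may fail mid-call.
            System.err.println("remote call failed: " + e);
        }
        System.exit(0);  // the exported object keeps a non-daemon thread alive
    }
}
```

Notice that the catch block is not boilerplate; deciding what to do when a remote call fails is precisely the part of distributed programming that doesn't exist in local code.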

Of course, you don't have to go through Uncle Mike's boring old lecture on the Way All Things Are Done Right if you haven't got the time, patience, or respect for an old man in his last few years of good health. You can, if you like, dive right in with RMI tutorials, only they won't tell you why you want distributed programs to plague you for the rest of your life; they'll only show you how to invoke the plague. Talk back to the Newsletter editor; if you'd like to hear more about the foundations of distributed programs, I'll say more. If not, I'll go back to flirting with Map all day long. Your call, readers!

If you're willing to listen but still want to do something, check out the "live tutorial" we started building in the Distributed Forum some time ago. It's in need of more stuff, and I'm in need of more reader input to write it. Let's help each other out, OK? Here's the link:


Thanks for reading!