Programming in the Large

In their 1976 article, DeRemer and Kron distinguish between programming-in-the-large and programming-in-the-small. They separate the “glue code” that organizes independent chunks of computation (programming-in-the-large) from the “number crunching” that executes low-level computation efficiently (programming-in-the-small). For the field of scientific workflow languages, a similar idea has been proposed by Samuel Lampa. He compares programming-in-the-small to the genetic machinery inside the cell nucleus, which is fast and local, and programming-in-the-large to hormone-based inter-cell communication, which is slow but distributed.

For scientific workflows and, in fact, for large-scale data analysis applications in general, the distinction between the organizational and the computational layer is fundamental. For example, in MapReduce the organizational layer is confined to just one program, a sequence of map, shuffle, and reduce, while the computational layer allows the integration of arbitrary Java code.
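To make the MapReduce example concrete, here is a minimal, single-process sketch in Python rather than Java. The names run_mapreduce, count_words, and sum_counts are made up for illustration and are not part of any MapReduce framework. The point is only that the organizational layer is one fixed pipeline while the computational layer is whatever functions get plugged in.

```python
from collections import defaultdict

# Organizational layer: the one fixed program (map, shuffle, reduce).
def run_mapreduce(records, mapper, reducer):
    # Map: apply the user-supplied mapper to every input record.
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: apply the user-supplied reducer to every group.
    return {key: reducer(key, values) for key, values in groups.items()}

# Computational layer: arbitrary user code (hypothetical word count).
def count_words(line):
    for word in line.split():
        yield word.lower(), 1

def sum_counts(word, counts):
    return sum(counts)

result = run_mapreduce(["a b a", "b c"], count_words, sum_counts)
print(result)
```

The pipeline itself never changes; only the two plugged-in functions do.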

In contrast, in general-purpose single-layer languages this difference is often blurred, and language designers take pride in making their languages general enough to shine in both worlds. Here, multi-paradigm programming is the norm. For example, Perl gives you access to many different programming paradigms: you can implement an algorithm in an object-oriented manner and glue algorithms together in a functional manner, all in the same language.

In this post, I want to discuss both approaches. Separating the organizational from the computational layer allows us to consider the requirements for both independently and come up with a special-purpose pair of languages. But it also places the burden on us to manage any inconsistencies between the two language layers at their contact surface. In contrast, a single-layer language allows us to maintain a consistent view independent of the abstraction level. But it also places the burden on us to come up with a one-size-fits-all language, which is harder than it sounds.

The choice of approach, separating the large from the small versus integrating them, has far-reaching implications for the language design space, especially for error handling and types. I am writing this post as the maintainer of Cuneiform, a language that separates the large from the small, handles errors without exception handling or continuations, and provides only simple types. In the following, I argue that this combination, while prohibitively simplistic for a general-purpose language, saves the day for programming in the large.

Simple Can Do

Type systems are like a stencil allowing you to write only programs that make sense. The crux of type systems is that, in order to be sound, they must be either very rigid (disallowing some programs that would actually compute) or very complex. Many real-world type systems also maintain simplicity at the cost of soundness, i.e., they introduce loopholes. But there is really no winning in this game. Simple type systems are either unsound or too rigid, and complex type systems are hard to build, hard to learn, hard to prove sound, and may not even do the trick.

In the wild, we see only type systems that are either complex or unsound (sometimes both) because the combination of simple and sound is so painstakingly explicit that it just isn’t productive in a general-purpose language. However, dividing a language into a layer for programming-in-the-large and a layer for programming-in-the-small changes the lay of the land.

It turns out that knowing all the details of a data item and being able to generalize or specialize its type is much more relevant to the computational layer, where data is actually accessed, than to the organizational layer, where data is just handed over from one library to the next. This allows us to introduce a simple and sound type system in the organizational layer while the computational layer can be governed by a different typing regime.

For instance, the simply typed Cuneiform provides Booleans, strings, and files as base data types. Complex data items need to be serialized into a file and thus become black boxes until the next operator accesses them. For example, the following Cuneiform function produces a file with the content “Hello world”. The function definition constitutes the contact surface between Cuneiform and Bash.

def greet() -> <out : File> in Bash *{
  out=output-file.txt
  echo "Hello world" > $out
}*

greet();

Dropping the details of a data item (serializing data) and later reintroducing the exact same details (de-serializing data) is an additional burden compared to single-layer languages. In addition, simple types force us to be explicit and unambiguous all the time. But there is also something we get: Code becomes very legible, no guessing, no implicit conversions. Error messages always point to the source of error and they are meaningful. The possibility of runtime errors is entirely excluded on the organizational layer. This means that whenever you see a runtime error, it is guaranteed to originate in the computational layer.
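The serialize and de-serialize round trip at the contact surface can be sketched in Python as follows. This is an illustration under assumptions, not Cuneiform code: make_table, total_reads, workflow, and the JSON file layout are all invented for the example. The organizational part only ever handles file paths, while each computational function re-reads the details it needs.

```python
import json
import os
import tempfile

# Computational layer: knows the structure of the data (hypothetical example data).
def make_table(out_path):
    rows = [{"sample": "a", "reads": 10}, {"sample": "b", "reads": 32}]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(rows, f)  # serialize: the details are dropped into a file
    return out_path

def total_reads(in_path, out_path):
    with open(in_path, encoding="utf-8") as f:
        rows = json.load(f)  # de-serialize: the same details are reintroduced
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"total": sum(row["reads"] for row in rows)}, f)
    return out_path

# Organizational layer: files are black boxes that are only handed over.
def workflow(workdir):
    table = make_table(os.path.join(workdir, "table.json"))
    return total_reads(table, os.path.join(workdir, "total.json"))

with tempfile.TemporaryDirectory() as workdir:
    with open(workflow(workdir), encoding="utf-8") as f:
        summary = json.load(f)
print(summary)
```

Note that workflow never inspects the content of either file, which is what keeps its typing simple.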

This last point is much more important in large-scale data analysis than in most other programming disciplines because turnaround times are often on the order of hours or days. Catching potential errors upfront (at compile time) becomes extremely important when turnaround times are long. So, what is a close call for or against static type checking in the general setting becomes a no-brainer (in favor of static type checking) in the data analysis setting.
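As a rough illustration of what checking upfront buys us, the following Python sketch verifies the wiring of a call against a declared signature before any long-running task would start. It is not Cuneiform's type checker; Signature, check_call, and the align example are hypothetical, and only the three base type names are taken from the text above.

```python
from dataclasses import dataclass

# Base types of the organizational layer, as named in the text.
BASE_TYPES = {"Bool", "Str", "File"}

@dataclass
class Signature:
    params: dict  # parameter name -> base type name
    result: str   # base type name of the result

def check_call(sig, args):
    """Return the result type if the argument types match, else raise TypeError.

    `args` maps each parameter name to the type name of the supplied value.
    """
    if set(args) != set(sig.params):
        raise TypeError(f"expected parameters {sorted(sig.params)}, got {sorted(args)}")
    for name, expected in sig.params.items():
        if args[name] not in BASE_TYPES:
            raise TypeError(f"unknown type {args[name]!r} for parameter {name!r}")
        if args[name] != expected:
            raise TypeError(f"parameter {name!r}: expected {expected}, got {args[name]}")
    return sig.result

# Hypothetical task that might run for hours once started.
align = Signature(params={"reads": "File", "reference": "File"}, result="File")

# The check is cheap and happens before any computation is scheduled.
ok_type = check_call(align, {"reads": "File", "reference": "File"})
print(ok_type)
```

Passing a string where a file is expected fails in check_call immediately instead of hours into the run.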

It surprises me a bit that “scripting languages” like Bash or Perl are traditionally associated with dynamic type checking, even though rigging a language for maximum safety may hurt many things but not gluing libraries together.

Error Handling and Exceptions

Similarly, the discussion about error handling is a completely different one in single-layer languages than in languages that separate the organizational and the computational layer. In single-layer languages, one size must fit all. Hence, in the wild, we find only languages that at least support exception handling.

I would argue that there are two common patterns in using exceptions that account for about everything we do with them: either we use them to propagate an unanticipated error, which halts the entire program, or we use them to signal a certain anticipated outcome, in which case we catch the exception and continue computation. While the first scenario has a global effect with a simple handling mechanism, the second scenario has only a local effect with an arbitrarily complex handling mechanism.
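The two patterns can be sketched in Python. The function names parse_port and load_config are invented for this illustration.

```python
# Pattern 1: an unanticipated error propagates. Nothing here catches it,
# so a missing file ends the whole program unless a caller intervenes.
def load_config(path):
    with open(path, encoding="utf-8") as f:
        return f.read()

# Pattern 2: the exception signals an anticipated outcome. It is caught
# right where it occurs and computation continues with a fallback.
def parse_port(text, default=8080):
    try:
        return int(text)
    except ValueError:
        return default

print(parse_port("80"))
print(parse_port("not-a-port"))
```

In the second function the exception never leaves the function body, which is what makes its effect local.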

Note that in the organizational layer the second scenario is entirely irrelevant. If exception handling is necessary in the organizational layer, it is to propagate the error and halt. But if we can assume that all exceptions used for signaling are part of a computation of limited scope, then an organizational language can safely opt for simplicity, handling all exceptions as simple errors.
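A minimal Python sketch of this division of labor follows. It does not show how Cuneiform is implemented; WorkflowError, run_task, and safe_ratio are hypothetical names. Signaling exceptions stay inside the computational function, and anything that escapes a task is treated by the organizational part as a plain error that halts the workflow.

```python
class WorkflowError(Exception):
    """Hypothetical single error kind of the organizational layer."""

# Organizational layer: no catch-and-continue, every failure halts.
def run_task(name, task, *args):
    try:
        return task(*args)
    except Exception as exc:
        raise WorkflowError(f"task {name!r} failed: {exc}") from exc

# Computational layer: a signaling exception with strictly local scope.
def safe_ratio(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return 0.0

def workflow():
    run_task("ratio", safe_ratio, 1, 0)           # handled inside the task
    return run_task("to_int", int, "not a number")  # escapes, so the workflow halts
```

Calling workflow() raises WorkflowError for the second task, while the division by zero in the first task is never visible at the organizational level.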

Cuneiform

Cuneiform is a functional language for large-scale data analysis. As mentioned above, it features only a simple type system and no exception handling. Although simply typed, functions can be recursive. It provides no references. Thus, variables can be shadowed but never mutated.

Cuneiform uses the divide between the organizational layer and the computational layer in two more ways: First, it supports several languages for the computational layer, e.g., Bash, Python, or Racket. That is also why the lack of a standard library is not much of a loss in Cuneiform. It scavenges the standard libraries of its computational languages. Second, Cuneiform uses the language divide to determine the size of the computation chunks to parallelize and distribute.
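The second point can be illustrated with a small Python sketch. It is only an analogy, assuming a local thread pool in place of a distributed runtime, and word_count and workflow are made-up names. Each call that crosses into the computational layer is treated as one opaque chunk of work, and these chunks are what gets scheduled in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Computational layer: one call is one opaque chunk of work.
def word_count(text):
    return len(text.split())

# Organizational layer: decides only how chunks are wired and scheduled.
def workflow(documents):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Every call to word_count is an independent unit of parallelism.
        counts = list(pool.map(word_count, documents))
    return sum(counts)

total = workflow(["a b c", "d e", "f"])
print(total)
```

The scheduler never needs to look inside word_count; the call boundary alone defines the granularity.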

This way Cuneiform, as a programming language, is open, because it integrates many languages on its computational layer, but also general, because despite the shortcuts it takes, it is a functional programming language. The choice of plain error handling, simple types, and the omission of references keeps Cuneiform simple. In this post I hope I have clarified in how many ways Cuneiform leverages that simplicity without actually suffering from its limitations.

Conclusion

The distinction between programming-in-the-large and programming-in-the-small allows us to contemplate the language design space separately for both worlds. The design decisions we traditionally make for general-purpose single-layer languages may not be the same for a special-purpose language pair. Often, we sacrifice simplicity for other ends, but a two-layer language allows us to spare the organizational layer such undue compromises. With Cuneiform, I have shown how such a language might look.