TDD Pro-Tip: Against Automated Macrotests

TDD Pro-Tip: I advocate against automated macro-tests, those whose base is entire running programs, as their cost is high and their benefit is doubtful. I very rarely write them.

There is a bewildering variety of terminology out there around what I’m calling macro-tests, so let’s poke around a little.

The central idea of "macro-test" is that we write code that launches an entire subject program and probes its behavior "from the outside". There are often multiple programs in play. Macro-tests sometimes launch all the programs, and sometimes assume implicit manual starts.

A typical example: firing up a web server, then using something like Selenium to launch a browser, bounce through pages, and look at results.

Another example: we have written two programs, one of which is the firmware on new hardware, and one of which is the monitor/control program on a desktop. We fire up both programs and run tests by giving the monitor/control program directions.
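To make the shape concrete, here is a minimal sketch of a macro-test in Python. Everything in it is an assumption made for illustration: the subject program is a hypothetical few-line script standing in for a real application, and the names `run_subject` and `scrape_total` are invented here. The point is only the shape: launch the whole program, then probe what it shows from the outside.

```python
import re
import subprocess
import sys

# Hypothetical subject program: a stand-in for "the whole application".
# It prints a report formatted for humans, as production programs tend to.
SUBJECT_PROGRAM = """
items = [("widget", 2, 3.50), ("gadget", 1, 5.50)]
total = sum(qty * price for _, qty, price in items)
print("=== Order Report ===")
print(f"Total due: ${total:.2f}")
"""

def run_subject():
    """Launch the entire program as a child process and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", SUBJECT_PROGRAM],
        capture_output=True,
        text=True,
        timeout=30,
    )

def scrape_total(report_text):
    """Screen-scrape the total out of text that was formatted for people."""
    match = re.search(r"Total due: \$(\d+\.\d{2})", report_text)
    if match is None:
        raise AssertionError("no total found in: " + repr(report_text))
    return float(match.group(1))

def macro_test_order_total():
    result = run_subject()
    assert result.returncode == 0, result.stderr
    assert scrape_total(result.stdout) == 12.50
```

Even at this toy size, the test needs a process launch, a timeout, and a regular expression that breaks whenever the report wording changes.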

My case against macro-tests stands on two legs. 1) The cost of making them is normally extremely high. 2) The benefit of having them, where it actually exists, can normally be obtained in other ways.

Please, nota bene: "normally". There are exceptions, and we’ll mention them.

One part of the macro-test cost is inherent in the nature of "program" as a concept. A program is a large package of code that runs in the context of an operating system, and is visible to and invocable by that o/s.

There are lots of boxes in code, from functions to objects to modules to processes to programs to computers. The order there is significant, as it represents magnitude on three dimensions at once: size, opacity, and control. Each of these dimensions correlates with difficulty in testing.

Programs are large. For many applications, they’re the biggest box involved. Follow-on consequences to this: They are comparatively slow to start & stop. They are comparatively complex, with many options and sub-behaviors.

Programs are opaque. Most do their work by hiding the majority of their operation from view. Even when they expose function and data, it is often exposed in ways that are difficult for machines to see. Witness the many complex variants of screen-scraper & stream-parser.

Programs are clumsy. Clumsiness is like opacity, but on the input side: programs are hard to reach with input & control, which makes it hard to set up test datasets and invoke the function we wish to test. Programs, by design, expose production function in production context, not testing function in testing context.

All of this adds up to cost.

Writing the test code that works around all this (size, opacity, and clumsiness) is not a trivial task. In fact, people who are capable of doing all that are even rarer than people who can write the production code in the first place.

There are mitigations, deriving from either our will as programmers or our actual application domain, and we should consider them. If enough mitigations add up just right, they can create exceptions to the general statement about the cost of macro-testing.

The *nix family of operating systems is based around operators that are programs communicating through streams. That domain is much friendlier to testing.
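A rough sketch of why stream-based programs are friendlier, in Python. The filter here is a hypothetical stand-in (it just upper-cases its input), and `run_filter` is a name invented for this example. When a program's whole contract is "text in on stdin, text out on stdout", the test needs no scraping and no special control channel.

```python
import subprocess
import sys

# Hypothetical filter program in the *nix style: read lines on stdin,
# write results on stdout, nothing else. Here it upper-cases each line.
FILTER_PROGRAM = """
import sys
for line in sys.stdin:
    sys.stdout.write(line.upper())
"""

def run_filter(text):
    """Feed text to the filter's stdin and return what it writes to stdout."""
    result = subprocess.run(
        [sys.executable, "-c", FILTER_PROGRAM],
        input=text,
        capture_output=True,
        text=True,
        timeout=30,
        check=True,  # raise if the filter exits with a non-zero status
    )
    return result.stdout

def test_filter_uppercases_each_line():
    assert run_filter("alpha\nbeta\n") == "ALPHA\nBETA\n"

def test_filter_handles_empty_input():
    assert run_filter("") == ""
```

The program is still launched as a whole, so startup cost remains, but input and output are both plain data the test can supply and compare directly.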

Secondly, I’m a big advocate for the steerability premise: the idea that tests & testability are first-class participants in design.

If we build the code to be tested at the program level, we can also mitigate our difficulty.
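One possible reading of that, sketched in Python under stated assumptions: the program's top-level behavior lives in a function that receives its arguments and streams as parameters, so a test can drive it without launching a process. The program itself (it sums integers) and the names `run` and `main` are hypothetical, chosen only to illustrate the idea.

```python
import io
import sys

def run(argv, stdin, stdout):
    """Whole-program behavior, with its inputs and outputs passed in.

    Hypothetical program: sums the integers it reads, one per line.
    With --verbose it also reports how many numbers it saw.
    """
    numbers = [int(line) for line in stdin if line.strip()]
    if "--verbose" in argv:
        stdout.write(f"count={len(numbers)}\n")
    stdout.write(f"sum={sum(numbers)}\n")
    return 0  # exit code

def main():
    """Production entry point: the same function, wired to the real streams."""
    return run(sys.argv[1:], sys.stdin, sys.stdout)

def test_program_level_behavior_in_process():
    out = io.StringIO()
    exit_code = run(["--verbose"], io.StringIO("1\n2\n3\n"), out)
    assert exit_code == 0
    assert out.getvalue() == "count=3\nsum=6\n"
```

The design choice is that testability shaped the entry point: production wiring is a one-line `main`, and everything else is reachable from a test.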

A dozen years ago, for instance, Emily and Geoff Bache made a splash in the XP world by showing a pretty sophisticated macro-testing approach that involved inventing what we now call a DSL at the "top but one" level of their app.

That same DSL was capable of specifying 100% of the input data as well. (We’ll come back to this idea, because it’s a form of the chaining premise at work, and we’ll see it when we argue that we can get the same benefit in a different way.)

A third possible variant: I’m hearing a lot of microservice-lambda folks suggesting that this architecture, precisely because it makes "program" so small on those three dimensions, is far easier to test. For now, color me "skeptical, but could be convinced with a real project".

All these mitigations are possible, and likely some more beyond that. And if they line up just so, the "cost leg" of my argument goes down in strength. But I want to stress that none of these are really the general majority case in the trade today.

Skilled designers and developers can do amazing things. Sadly, the demand for them far outstrips the supply. I don’t advocate widespread adoption of technique requiring high levels of mastery until we invest in getting widespread high levels of mastery.

Domains can mitigate, but domains vary dramatically. The overwhelming majority of the software written today is in twin domains that are radically different: website+database apps, and firmware control software. If your domain doesn’t help, it just doesn’t.

And the benefit we get from them? Some of the claimed benefit is high, some low, some frankly nonsense.

But the second leg of my argument is that most of it can be obtained using far cheaper methods.

There are three reasons to write tests at all, at any scope against any production code.

Confidence.
Productivity.
Design.

They inter-twingle: each affects and supports the other, like a tensegrity sculpture.

Tests give us confidence that our production code does what we made it to do. They do this by setting up variant settings, invoking commands, and probing the results.

Tests give us productivity because they narrow mental bandwidth requirements, they’re machine-repeatable for stability, and they serve as "side by side" executable documents.

Tests give us design because — somewhat to our surprise — it turns out the theoretical design principles we’ve learned in 60 years drive us to certain designs, and so do the pragmatic coding behaviors we use when we’re maximizing tests and testability. One target, two paths.

My contention here is that all three of those benefits, confidence, productivity, and design, can be achieved by developing our code using tests that have far lower cost than macro-tests. (In the case of productivity the actual benefit is dramatically increased.)

Time to introduce what I call the Chaining premise.

"Test a chain by testing every link."

It is the deep underpinning argument for smaller-scale tests. So small, in fact, that I coined the term microtest many years ago to capture the idea.

The idea at its simplest: go to the bottom of the dependency stack. If Y depends only on Z, and we have very high confidence that Z "does what we want", we can "transfer" that confidence to Y by testing only the Y-Z link. Repeat that with X-Y, and again and again.
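A small sketch of that simple take, in Python. The three functions are hypothetical stand-ins for Z, Y, and X (an invented pricing example, not anything from a real codebase): each depends only on the one below it, and each test checks only its own link while leaning on the confidence already earned below.

```python
# Z: the bottom of the stack; depends on nothing of ours.
def parse_price(text):
    """Turn a string like '$3.50' into integer cents."""
    dollars, _, cents = text.lstrip("$").partition(".")
    return int(dollars) * 100 + int((cents + "00")[:2])

# Y: depends only on Z.
def line_total(price_text, quantity):
    return parse_price(price_text) * quantity

# X: depends only on Y.
def order_total(lines):
    return sum(line_total(price, qty) for price, qty in lines)

def test_z_parses_prices():
    # Establish confidence in Z on its own, including its odd cases.
    assert parse_price("$3.50") == 350
    assert parse_price("$4") == 400
    assert parse_price("$0.5") == 50

def test_y_z_link():
    # Z is already trusted, so only check what Y adds: the multiplication.
    assert line_total("$3.50", 2) == 700
    assert line_total("$3.50", 0) == 0

def test_x_y_link():
    # Y is already trusted, so only check what X adds: the summing.
    assert order_total([("$3.50", 2), ("$5.50", 1)]) == 1250
    assert order_total([]) == 0
```

Note that the X test never revisits price-parsing edge cases; those were settled at Z, which is the whole point of the chain.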

Of course, that’s the simple take, and as such it’s a caricature. Real programming in real life presents lots of oddities and variations. Recognizing them, and learning how to deal with them, is why TDD is such a rich and difficult subject for study & experience.

(In my experience, as tricky as it is to get very strong at TDD, there is a path to it that allows us to harvest value all along the way from noob to master. No such path seems to exist for approaches that seem to depend for their value on already having mastery.)

I’m running long, so I’m going to dispense pretty quickly with confidence and design and turn to productivity. (As I say, they’re intertwingled anyway.)

The key insight of the chaining premise isn’t just that we trust the part at the bottom. It’s that if we trust any component in a dependency tree fully, we can grow that confidence to cover any other component that depends on it.

If we pile up enough confidences in our individual components, we pile up a great deal of, if not the exact equal of, any confidence given to us by a macro-test.

As for design, the truth of the matter is that macro-tests enforce design at the macro level. When programs were small single-function one-dependency one-geek operations, the macro-tests had a higher impact on design. Programs aren’t that, anymore.

The app you’re using to read this thread has hundreds of components in it. And it’s (face it) about the smallest app you’re going to use today. Most of the others have dependency trees with thousands of components. Large-scale design impact is not enough design impact.

Now for productivity. We touched on the fundamental idea here before, in the context of the TDD path-to-mastery: continuous ongoing harvest of value.

Microtests "close" the boxes they’re applied to, as they’re applied. That closure, that (persistent) "settling of the matter of confidence on that box", creates immense impact on productivity.

When we program computers, the hard part is the thinking. It’s not the typing, and after the first year it’s not the language or the libraries. It’s thinking, about the domain, about modeling it, about composition and decomposition.

And we have ample evidence that the more things we have to think about, the less productive we are in the thinking. When I close one of these boxes, I am reducing my mental bandwidth requirement, literally the number of things I have to think about.

The effect on my productivity is stunning, and it compounds. The more that closure — that confidence and that design — we get along the way, the faster we go.

So now you see the argument unfolded. I don’t use macrotests — unless I have no other choice — because their cost is very high and the benefit I get from them can be obtained in other, cheaper ways, ways that have actually better benefit in some areas.

Now, there’s details. Lots and lots, and I could go on at great length. I have, in fact, done so already. Check geepawhill.org, especially the TDD category.

In particular, there are a couple of key videos, one describing my five TDD premises, and one debunking the lump of coding fallacy:

TDD & The Lump Of Coding Fallacy | Video | GeePawHill.org

Five Underplayed Premises Of TDD | Video | GeePawHill.org

It’s a Tuesday afternoon, and to be honest, I got too much WIP going on, too much mental bandwidth requirement.

I wish I had, and I hope you can get, the narrow bandwidth requirements you need for your best performance. I’m off to go look for mine. 🙂

Damnit. One more thing. I meant to discuss the couple of cases where I do have to use macro-scale tests, and I didn’t. Without further discussion for now: when I deal with genuine hardware indeterminacy, that’s when I’m most likely to have no choice but to rely on macrotests.
