Upstream Uptime #2: The Problem In Practice

Part 2 of 4 in the series Upstream Uptime

We started talking about the new upstream-centric style of architecture the other day.

I wanted that to be a first swing at describing the problem, but I didn’t really get there. Let’s do that now.

The problems I see in many orgs attempting this approach (service-orientation, microservices, and so on) are most visible when we come not to theory but to practice.

One sentence: Old-school practice works in old-school apps because it holds a large part of the app fixed in place while we change a small part, and that doesn’t work in upstream-centric apps because those upstreams are — by intent — not fixed in place.

Upstream-centric architectures are giving us two benefits. One of them is application scalability, the promise of the cloud. The other one is organizational scalability, the promise of distributed parallel development by separately staffed, funded, and managed teams. And it’s that second benefit that is essentially telling us "you can’t hold a large portion of it fixed in place because we want to change all of it all of the time".

Three years ago I worked on a team that had seven upstreams. Some of those upstreams, in turn, had their own upstreams. In toto, the number of separate apps was just under twenty. Several of them were quite large, some were really microservices.

These are the problems we faced, and more than one of us faced more than one of these issues every single day. I am not exaggerating for effect here. Multiple developers 100% dead-blocked multiple times every single day. Dead. Unable to do anything in the code.

First, and this will surprise many theorists, there were simple network outages. Hardware, I’m talking. They were usually short, and they didn’t happen at any consistent connection, but naturally they caused a lot of confusion, especially given the other problems I’ll describe.

Second, upstreams would go down for builds at random intervals and for random periods of time, because the upstream developers needed to do that.

Third, SSO roles & permissions all timed out on a per-app basis, with the clock starting at your first access. Upstreams couldn’t tell us a role had expired, because they can’t talk to us, because we don’t have permission to be spoken to.

Fourth, all of the upstream-instances and their backing datasets were shared. This made it very difficult and sometimes impossible for a developer to tell whether the code she just ran did exactly and only what she wanted it to.

Fifth, transport-layer errors were the default means of indicating all problems. (HTTP result codes, mostly.) So if you went to look up an X that didn’t exist, you’d get, say, a 400, Bad Request. And those codes were overloaded, each one standing for several unrelated problems, and they came with no human-readable error description.

Sixth, roughly half of these upstreams were in current full-bore development, so calls that worked a minute ago would stop working, for no detectable reason.

Seventh, nearly all of these systems used different transport formats for different kinds of output. HTML — seriously, HTML — and/or JSON and/or XML and/or one of these wrapped up inside one of the others.

Eighth, the logs for the upstreams were a) accessible only through a browser, b) using a separate set of roles that would time out, c) on a new unbookmarkable link every day, and d) full of hundreds of thousands of lines of enterprise-standard useless logging.

Ninth, the collaboration framework was "I know a gal over there, let me see if she’s around", alternating with backlogged ticket systems that took an hour to fill out.

Okay, I could probably think of more, but they all have exactly the same flavor: the upstream and downstream ecology was entirely focused on an as-yet-unachieved endpoint, the Made, and entirely blind to the extent to which that focus destroyed productivity, the Making.

And this is the key takeaway: Every problem I described to you is entirely a self-inflicted wound. Not one of them had to be that way. All of them came from chosen practice. All of them came from policy decisions.

And ultimately, all of them came from the imbalance of focusing over-much on the made and under-much on the making. It was a kind of mass corporate delusion, that the central act of software development is something other than changing code as needed.

This whole thread seems really negative. I want to be clear here, I am not hating on that org. The problems I saw there I see at many orgs that are first undertaking upstream-centric architectures. It could happen to anyone, and it often does.

So. Advice time. I’m going to telegraph these for now and call it a day. But I’m happy to keep going with details, on some of them in particular, in coming days.

Adopt TDD and CI, because they are our current best understanding of how to work with change instead of against it.
Use automated content-level versioning every time, on every request and every response.
Use content-level diagnostics, including both numerics and human-readable text, on every request and every response.
Make every service team ship a local-runnable version of their service. There is really no excuse for shared instances for developers. That is a gigantic instance of being penny-wise and pound-foolish.
Use a local-box reverse-proxy that can display every kind of traffic from every upstream in one location. If you can’t buy that, build it. It’s not nearly as hard as it sounds.
Support developer roles in your permission schemes at the enterprise level, and build those schemes to be informative rather than stonewalled, including contact points or links or whatever is needed to make it straightforward to renew.
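
To make the versioning, diagnostics, and permissions advice a little more concrete, here is a minimal sketch of what a content-level envelope might look like. Everything in it is an assumption for illustration, not what that org used: the Envelope and Diagnostic names, the numeric codes 1404 and 1401, the version string, and the renew URL are all made up. The point is only that every payload carries its own version, a numeric code for programs, text for humans, and, when a role has expired, somewhere to go to fix it.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List, Optional

# Hypothetical content-level version of this payload format.
SCHEMA_VERSION = "1.2.0"


@dataclass
class Diagnostic:
    code: int                        # numeric, for programs (made-up values)
    message: str                     # human-readable, for the developer
    renew_url: Optional[str] = None  # where to go to fix it, when that applies


@dataclass
class Envelope:
    version: str
    ok: bool
    data: Any = None
    diagnostics: List[Diagnostic] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested Diagnostic dataclasses.
        return json.dumps(asdict(self))


def found(item: dict) -> Envelope:
    return Envelope(SCHEMA_VERSION, True, data=item)


def not_found(item_id: str) -> Envelope:
    diag = Diagnostic(1404, f"No item with id '{item_id}' exists.")
    return Envelope(SCHEMA_VERSION, False, diagnostics=[diag])


def role_expired(app: str, renew_url: str) -> Envelope:
    diag = Diagnostic(1401, f"Your developer role for '{app}' has expired.", renew_url)
    return Envelope(SCHEMA_VERSION, False, diagnostics=[diag])


def parse(raw: str, supported_major: str = "1") -> Envelope:
    """Downstream side: check the version first, then read the diagnostics."""
    doc = json.loads(raw)
    major = doc["version"].split(".")[0]
    if major != supported_major:
        raise ValueError(f"Unsupported payload version {doc['version']}")
    diags = [Diagnostic(**d) for d in doc.get("diagnostics", [])]
    return Envelope(doc["version"], doc["ok"], doc.get("data"), diags)


if __name__ == "__main__":
    reply = parse(role_expired("billing", "https://sso.example.invalid/renew").to_json())
    for d in reply.diagnostics:
        print(d.code, d.message, d.renew_url)
```

With something like this in place, a downstream developer who hits a missing X or an expired role reads the reason in the body instead of guessing at it from a bare 400, and a version mismatch fails loudly at parse time rather than as a mystery three calls later.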

Finally, and broadly: worry less about what the city on the hill is going to look like, and more about how we are going to get there at all, every day, by changing code collaboratively.

As I say, in coming days we’ll tackle some of these in much greater detail. If you have questions in the meantime, have at it, and I’ll do my best.

It’s Thursday. I swore I was only going to require myself to do one thing today and then I’d be a free man in Paris, and I have done three things. So, basically, I totally won.

I hope you totally win today, too.
