Why I built a hardware test automation framework
I collapsed the test plan, the test code, and the infrastructure around them into one framework, one file.

The Document Is the Test: An Architecture Decision, Made on Purpose
I'll start with the conclusion, because the reasoning matters more than the suspense: a few months ago I decided that the test document and the executable test should be a single artifact, and I built a framework around that constraint. This is the first in a series documenting that build in order - the decisions, the tradeoffs, and the ones I got wrong.
A word on where I'm standing. I've been developing software for over 15 years. For most of that time I've been a user of frameworks, not an author of them: from CMSs like Drupal to Python and JavaScript frameworks like Django and Nuxt.js. I have also spent the past 6 years in the belly of the infrastructure that large-scale software runs on: physical and cloud servers, as well as the tiny SoCs that power IoT devices. The user-versus-author distinction matters here, because it shaped everything. Years of living inside the tools of others - the good ones and the frustrating ones - left me with firm opinions about ergonomics and developer experience: what makes a tool feel obvious, and what makes capable engineers quietly route around it. I did not start this project discovering systems architecture. I started it with a fairly precise picture of the experience I wanted, and the harder problem of making that picture real.
The Problem, Stated Plainly
Testing connected-home hardware is a different category of work from testing a typical application. A single meaningful test for a smart lock might cut power to the device, wait for it to reboot and rejoin its wireless network, open a mobile app, navigate to a control, issue a lock command, and then confirm, from an independent vantage point, that the physical bolt moved and the cloud recorded the change.
That is not one assertion. It is a coordinated sequence spanning a physical device, a radio protocol, a mobile application, and a cloud backend, each with timing you cannot fully control.
Much of this was being done by hand. The problem I focused on was not the manual effort itself; it was that the effort produced nothing durable. Each run began from zero. The proof of a passing test lived in screenshots and in people's recollection. A defect that surfaced once and then disappeared was frequently impossible to reproduce on command. In a hardware program, that is the costly issue hiding beneath the visible one: the testing knowledge does not accumulate. A result you cannot reproduce is a result you must produce again, manually, indefinitely.
The Structural Choice
In most teams, two artifacts are meant to describe the same reality and never quite agree: the test plan, written for humans, and the test code, which actually runs. They diverge the moment they exist. The plan goes stale; the code becomes opaque to everyone except its author.
My decision was to remove the gap by removing the duplication. One artifact instead of two. You write a readable document (objective, preconditions, procedure, expected results), and that exact document is what executes. There is no translation layer and no "the script approximates the plan." The document is the source of truth, the machine follows it literally, and it records its own evidence back into that same structure.
A test case is just a Markdown file. It opens by declaring what it needs, then lays out the procedure as steps a human reads, with the executable part inline:
id: step-3
type: api
method: POST
url: /devices
expected: Device is created and its ID is captured
validate:
  status: 201
outputs:
  DEVICE_ID: id
There's no second file. That block is both the line a reviewer reads and the request the runner fires. outputs captures the new ID into DEVICE_ID; the next step uses ${DEVICE_ID}. The document carries its own state forward.
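To make that hand-off concrete, here is a minimal Python sketch of how captured outputs could flow into later steps. It is an illustration under stated assumptions, not the framework's actual code: the function names `capture_outputs` and `substitute`, the fake response body, and the `/devices/${DEVICE_ID}/lock` URL are all hypothetical. It also assumes `outputs` maps a variable name to a top-level field of a JSON response.

```python
import re

# Matches ${NAME} placeholders such as ${DEVICE_ID}.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def capture_outputs(outputs: dict, response_body: dict, variables: dict) -> None:
    """Copy each field named in a step's `outputs` mapping into run state.

    Assumption: `outputs` maps VARIABLE_NAME -> top-level response field.
    """
    for var_name, field_name in outputs.items():
        variables[var_name] = response_body[field_name]


def substitute(text: str, variables: dict) -> str:
    """Replace every ${NAME} in `text`; fail loudly if NAME was never captured."""

    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"variable {name!r} used before any step captured it")
        return str(variables[name])

    return _VAR.sub(lookup, text)


# Hypothetical run: step-3 captures the ID, a later step consumes it.
variables = {}
step_3 = {"id": "step-3", "outputs": {"DEVICE_ID": "id"}}
fake_response = {"id": "dev-42", "name": "front-door-lock"}

capture_outputs(step_3["outputs"], fake_response, variables)
next_url = substitute("/devices/${DEVICE_ID}/lock", variables)
print(next_url)  # prints /devices/dev-42/lock
```

Raising on an unset variable, rather than leaving the placeholder in the URL, is a deliberate choice in this sketch: a step that depends on state nobody captured should stop the run instead of firing a malformed request.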
What makes hardware testing different is that one test crosses worlds, and some steps simply can't be automated. The document says so, in the same grammar:
id: step-5
type: manual
action: Power-cycle the device to begin pairing.
expected: Device restarts and enters pairing mode.
capture:
- operator_confirmation
A manual step sits beside an automated one, same shape, same record. The framework doesn't pretend the physical world is fully scriptable; it makes the human stitch a first-class, logged part of the run. That honesty is the point.
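One way to picture "same shape, same record" is a runner that dispatches on `type` and returns an identical record type from every handler. The sketch below is a hedged illustration, not the framework's implementation: `StepRecord`, `run_api`, `run_manual`, and `run_step` are hypothetical names, and the `ask` and `send` callables are stand-ins for a real operator prompt and a real HTTP client.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StepRecord:
    """The one record shape every step produces, automated or manual."""
    step_id: str
    step_type: str
    expected: str
    passed: bool
    evidence: dict = field(default_factory=dict)


def run_api(step: dict, send: Callable[[str, str], int]) -> StepRecord:
    """Fire the request via `send` and compare the status to `validate.status`."""
    status = send(step["method"], step["url"])
    passed = status == step["validate"]["status"]
    return StepRecord(step["id"], "api", step["expected"], passed, {"status": status})


def run_manual(step: dict, ask: Callable[[str], str]) -> StepRecord:
    """Show the action to the operator and log their answer as evidence."""
    answer = ask(f"{step['action']}\nExpected: {step['expected']}\nConfirm? [y/n] ")
    confirmed = answer.strip().lower() in ("y", "yes")
    return StepRecord(step["id"], "manual", step["expected"], confirmed,
                      {"operator_confirmation": answer.strip()})


def run_step(step: dict, *, ask: Callable[[str], str],
             send: Callable[[str, str], int]) -> StepRecord:
    """Dispatch on the step's declared type."""
    if step["type"] == "api":
        return run_api(step, send)
    if step["type"] == "manual":
        return run_manual(step, ask)
    raise ValueError(f"unknown step type: {step['type']!r}")


api_step = {"id": "step-3", "type": "api", "method": "POST", "url": "/devices",
            "expected": "Device is created and its ID is captured",
            "validate": {"status": 201}}
manual_step = {"id": "step-5", "type": "manual",
               "action": "Power-cycle the device to begin pairing.",
               "expected": "Device restarts and enters pairing mode."}

# Stubbed collaborators so the sketch runs without a device or an operator.
records = [
    run_step(api_step, ask=lambda prompt: "", send=lambda method, url: 201),
    run_step(manual_step, ask=lambda prompt: "y", send=lambda method, url: 0),
]
for record in records:
    print(record.step_id, record.step_type, "PASS" if record.passed else "FAIL")
```

In real use, `ask` could simply be the built-in `input`. Because both handlers return a `StepRecord`, whatever writes the run log never needs to know whether a person or a machine performed the step.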
If this holds, three problems resolve together. Documentation cannot drift stale, because a document that no longer reflects reality simply fails to run. A new engineer can read a test and understand it, because it was written to be read first. And every execution produces a structured, reproducible record - the same procedure, the same evidence, with or without a person watching.
Why I Call This Architecture, Not Tooling
This is the distinction I most want to be precise about, because sixteen years taught it to me slowly.
When engineers meet a hard problem, the common reflex is to reach for a tool: a better runner, another library, a cleverer script. Tools are horizontal. They help with the task in front of you.
But what I had was not a task. It was a recurring structure across many tasks - knowledge that fails to accumulate. No individual tool corrects a structure like that. You address it by deciding where the source of truth lives, and then requiring everything else to serve that decision.
Choosing that the document is the single source of truth, and that code exists to serve the document rather than the reverse, is an architectural decision. It is a constraint I placed on myself before writing the parts I actually find fun. As with most useful constraints, it made some things harder and made an entire class of failures impossible. It meant I could not take shortcuts that would let document and code drift apart. And it meant building a validator that refuses to run a malformed document at all:
$ spec validate ./tests # zero errors before it's allowed near a device
Wrong shape, no run. The check happens before a single device is touched; the framework argues with you while it's cheap, not at 2 a.m. mid-run. The unglamorous layer (parse, validate, structure) had to be genuinely solid before the interesting layer (driving real hardware) earned the right to exist.
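The kind of check that gate implies can be sketched in a few lines of Python. This is a simplified illustration, not what `spec validate` actually does: the `REQUIRED_FIELDS` table and the `validate_steps` function are hypothetical, and the required fields are inferred only from the two step examples shown earlier.

```python
# Assumed required fields per step type, inferred from the examples above.
REQUIRED_FIELDS = {
    "api": {"id", "type", "method", "url", "expected"},
    "manual": {"id", "type", "action", "expected"},
}


def validate_steps(steps: list) -> list:
    """Return a list of human-readable errors; an empty list means 'safe to run'."""
    errors = []
    seen_ids = set()
    for index, step in enumerate(steps, start=1):
        label = step.get("id", f"step #{index}")
        step_type = step.get("type")
        if step_type not in REQUIRED_FIELDS:
            errors.append(f"{label}: unknown type {step_type!r}")
            continue
        missing = REQUIRED_FIELDS[step_type] - set(step)
        if missing:
            errors.append(f"{label}: missing {', '.join(sorted(missing))}")
        step_id = step.get("id")
        if step_id is not None:
            if step_id in seen_ids:
                errors.append(f"{label}: duplicate id")
            seen_ids.add(step_id)
    return errors


good = [
    {"id": "step-3", "type": "api", "method": "POST", "url": "/devices",
     "expected": "Device is created and its ID is captured"},
    {"id": "step-5", "type": "manual",
     "action": "Power-cycle the device to begin pairing.",
     "expected": "Device restarts and enters pairing mode."},
]
bad = [
    {"id": "step-3", "type": "api", "method": "POST"},  # no url, no expected
    {"type": "telepathy"},                              # unknown type, no id
]

print(validate_steps(good))  # prints []
for error in validate_steps(bad):
    print(error)
```

Collecting every error before returning, instead of stopping at the first, is what lets a validator report the whole shape problem in one pass, while refusing to run is left to the caller: any non-empty list means no device gets touched.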
I committed the first version, then immediately committed a refactor, because the first structure was wrong and I would rather absorb that cost on day one than on day one hundred. That sequence is the working philosophy in miniature: commit to a decision, then stay willing to correct its shape quickly.
What This Series Will and Won't Be
Twice a week, I'll document this in order, and I'll keep it honest. I'll cover how a plain Markdown document became something a machine executes against real devices. I'll cover the day I deleted code I was pleased with, because it was misreporting whether tests had actually finished. I'll cover the transition from a tool I ran by hand to a service that runs unattended for days and reports, accurately, whether it is healthy.
I'll write some of it for engineers who have never understood why anyone builds a framework rather than shipping features; I'll make that case without hand-waving. I'll write some of it for senior engineers who have watched internal frameworks decay into unmaintainable swamps, and are correct to be wary - I'll show the specific decisions intended to prevent that here, and you're welcome to challenge them.
I'm calling this authoritative and experimental at the same time, and I mean both. The fundamentals are not new to me. This particular build is, and I'm choosing to think it through in public.
Next, I'll get specific about the most load-bearing decision in the project: turning a documentation file into a living, executing test, and what that demanded of the design.
If you disagree with the premise, or think a single artifact is the wrong call, that's exactly the conversation I want in the comments.