No. 06 · Technical
The Well-Built Harness
How the public service can build and ship enterprise-grade software at speed, without lowering the bar.
Abstract. Coding tools like Cursor, Lovable, and Claude Code produce applications that often lack the controls a public service requires. AI agents are often unreliable witnesses of their own performance. Clear instructions, tests, and templates help to overcome these challenges. Alberta has developed and released a collection of files called a 'harness' which clearly defines your expectations for the AI, driving enterprise-quality outputs at the speed of the 'vibe coded' application.
Large language models (LLMs) have become highly proficient at developing application code. Early in our AI Maximalist program in 2025, Technology and Innovation staff began to evaluate these tools to see how they could be deployed within our government environment. The team explored a range of agentic coding tools, such as Cursor, Lovable, Gemini CLI, and Claude Code. These tools made application development faster for skilled developers and accessible even to non-technical staff. The time to transform an idea into a functional prototype dropped from roughly 6 weeks (as benchmarked by management in that unit) to as little as 30 minutes.

The speed increases have changed how teams work. Subject matter experts can now participate in co-design as ideas are worked out, providing real-time feedback and sign-off. This factor alone has significantly increased the speed of delivery, reducing or eliminating the time-consuming exchange of emails, screenshots, and designs for manual review. Often, such work can now be completed in a single session.

This experimentation also identified some large gaps when these agents were tasked with producing highly secure, repeatable, and compliant applications. Several challenges needed to be addressed before these tools were usable in day-to-day government delivery.

## §01 Aligning with AI agents

Despite the gains in performance, Alberta's AI Maximalists observed several significant challenges in AI coding practices.

First, it was difficult to recreate the same application twice. Because the models are probabilistic, each interaction produced an application with a different look and feel, layout, and functionality. The inability to reproduce an application poses a significant barrier to deploying AI in a government environment.
Further, it eroded trust from Ministry partners and delayed rework and the sharing of tasks between teams whose AIs would make different design or technical decisions.

Second, these tools prioritize user experience while frequently failing to implement the necessary security protocols. While prototypes can be produced quickly, it often takes significant time and effort to fix the gaps in the security model (even with AI assistance). As with the user interfaces, the fixes applied to the security, accessibility, and mobile code were often inconsistent and difficult to replicate between any two applications. Months of time saved on the frontend can easily be lost on the backend, implementing the security controls needed to protect the cyber posture of these applications.

The third significant challenge was the over-confidence of the coding agent. AI agents routinely claim that their code is complete but, when probed, reveal significant gaps. They are unreliable witnesses to the state and performance of the code they create, and as a result, non-technical staff were unable to accurately evaluate application readiness. Trusting the agent's output without independent evaluation resulted in erroneous outputs which failed downstream security audits.

Understanding these limitations, the team began to address these challenges through trial and error, experimentation, and smart play.

## §02 Experimentation and play

From the start of the AI Maximalist program in March 2025 to today, in May 2026, the Department has produced far more than 1,000 applications. The vast majority are prototypes, but dozens have made it into production, offering real value. The prototypes themselves follow a pattern of exploration, experimentation, and even play. Without a sense of play, it is impossible for staff to learn the boundaries of these models and to learn how to work effectively with them.
It may be odd to say 'play' within this context, especially in a government environment, but that is how we humans learn. We learn best when we are in a psychologically safe environment, when we are challenged to push our boundaries, and when we are having fun.

In those 1,000+ applications we've built, we've explored the limits of AI, and found its real gaps. Each insight allowed us to capture and refine our own processes, manually closing and overcoming those gaps with better instructions, better scripts, and better tests. We see where AI breaks down, and we find innovative ways to use its immense capability while also recognizing its real, and often glaring, shortcomings. Over this period of exploration and play, we've created a recipe book on how to work well.

Significantly, the variability between people was also revealed to be a major factor. Two people can have a very different experience with the same model on the same task. How you introduce an idea, explain context, and set direction can drive extremely different outcomes. And in an environment where repeatability and quality are paramount, these small differences can drive unacceptable divergence between individuals in their use of AI. The genius of the AI needed a set of rules within which to work effectively.

## §03 Introducing the AI 'harness'

To drive repeatable outcomes, we need to follow a set of patterns and styles that increase the probability that the model will perform the correct series of tasks. For a large language model, it should be no surprise that we codify these rules through words. Effective AI instructions, which are called prompts, explain our intent with a high degree of specificity. When we want the AI to be creative, we provide outcomes-based prompts and let it explore. When we want to be prescriptive, we provide step-by-step rules-based prompts.
Through iteration, Alberta has assembled thousands of these prompts into a collection of so-called AI 'skills': instructions which describe the methods we follow and the outcomes we intend to achieve. These skills are highly 'opinionated', a technical term meaning that we leave little room for imagination to our AI coders. They also try to be as minimalist as possible, avoiding prose to preserve the AI agent's 'context window' attention space. Finding that middle ground is truly an art mastered through practice.

Through rounds of trial and error, Alberta staff developed steps to increase assurance in the quality and consistency of the code output. Starting with small application features (like a landing page or API), the team developed instructions for the LLM to follow in its prompt, which narrowed the creativity of the LLM by limiting uncertainty. These standards, now known as 'skill files', are provided to the coding agent at the beginning of a task. A well-written specification reduced uncertainty and resulted in a consistent output. These instructions are bundled together into a collection known as a 'harness'.

## §04 Building harnesses

Building harnesses is difficult and takes extreme diligence. You're curating the skills that a human might gain intuitively over 10-20 years. You cannot vibe code a harness. Using AI to create harnesses has the 'photocopier' problem: each derivative version degrades in quality, causing future and unpredictable breakdowns. You need to take the time to meditate on every feature, and test (and test again) each iteration to ensure that the system is faithfully building what you need. Slop must be avoided at all costs.

You must show the AI what good looks like. In the technology space, this means providing an application as near to perfect as humanly possible as a template for it to emulate.
This is similar to how students at arts academies throughout history learned to paint: by studying the great masters of the Renaissance, copying and emulating them until they came as close to perfection as possible. These 'masterpiece' templates mean everything to an AI.

"Riffing off Arthur C. Clarke's Third Law, which states any sufficiently advanced technology is indistinguishable from magic, I would say that any sufficiently advanced template is indistinguishable from code. The best harness is code." · Janak Alford, Deputy Minister, Ministry of Technology and Innovation

A harness is a small number of parts that work together, and the first is a file named CLAUDE.md. The agent reads it at the start of every session and follows it as the project's rules. It is short and plain: what the project is, who it serves, the standards it holds to, what the agent may change, and what it must never touch. It is the one place where the project's settled decisions are written down, so the agent, and any person who joins, starts from the same page.

The skills sit in a folder named .claude/skills. Each skill is a small playbook for one kind of work: how to structure requirements, how to write a database migration, how to run a security audit, how to review a stretch of prose. The agent does not load them all at once. At the start it reads only each skill's name and one-line description, enough to know when each one applies, and it pulls the full playbook only when the task in front of it matches. The skills come in when they are needed, the way a tradesperson reaches into a roll for the one tool the job calls for.

The template is the masterpiece the harness is built around, and it is deliberately opinionated about the technology. The stack is fixed and open source: Vue.js with Vite on the front end, Node.js with Express on the back end, Postgres 18 for the data, and single sign-on for identity.
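
The way the agent discovers skills, as described above, can be sketched in a few lines of Python. This is a minimal illustration, not Alberta's implementation and not a description of Claude Code's internals. It assumes each skill lives in its own folder under .claude/skills as a `SKILL.md` file whose opening lines carry `name:` and `description:` fields, and it uses a deliberately naive keyword match (the hypothetical `matching_playbooks` function) as a stand-in for the agent's own judgement about which skill applies.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SkillSummary:
    """The only facts held up front: a name, a one-line description, a path."""
    name: str
    description: str
    path: Path  # the full playbook stays on disk until a task needs it


def read_summary(skill_file: Path) -> SkillSummary:
    """Read just the 'name:' and 'description:' header lines (assumed layout)."""
    fields = {}
    with skill_file.open(encoding="utf-8") as handle:
        for line in handle:
            key, sep, value = line.partition(":")
            if sep and key.strip() in ("name", "description"):
                fields[key.strip()] = value.strip()
            if len(fields) == 2:
                break  # stop early; the body of the playbook is not read here
    return SkillSummary(
        name=fields.get("name", skill_file.parent.name),
        description=fields.get("description", ""),
        path=skill_file,
    )


def build_index(skills_dir: Path) -> list[SkillSummary]:
    """Build the lightweight index the agent would see at session start."""
    return [read_summary(p) for p in sorted(skills_dir.glob("*/SKILL.md"))]


def matching_playbooks(index: list[SkillSummary], task: str) -> list[str]:
    """Hypothetical matcher: load a playbook only if every word of the
    skill's hyphenated name appears in the task description."""
    task_words = set(task.lower().split())
    return [
        skill.path.read_text(encoding="utf-8")
        for skill in index
        if set(skill.name.lower().split("-")) <= task_words
    ]
```

The design point is that the index stays small however many skills the folder holds, which is what preserves the agent's context window. A real agent decides relevance from the descriptions themselves rather than by keyword overlap.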
The template ships as a working pair, a client and a server, with the hard technical problems already solved and running from day one. More than a hundred decisions that a public system has to get right (how a user signs in, how input is validated, how secrets are held, how the application logs and defends itself and meets the accessibility standard) are made once, here, before any project begins.

Skills, standards, templates, and crystal-clear requirements are what drive the outcome. The requirements say what to build. The skills say how, leading the work through each stage in turn, from requirements to planning, architecture, development, testing, and deployment, with each stage refusing to proceed until the one before it is real. The standards say what good looks like. And the template carries all of it already wired together, so wherever the agent looks it reads the same set of preferences. This is also what lets a non-technical Builder succeed: they add the content their service needs on top of a foundation an auditor has already cleared, and the only thing that changes between projects is the part that genuinely should.

Riding alongside the build skills are four review agents, named for colours, that check every application for quality, consistency, and security as it is built. Green covers code and supply-chain hygiene, Yellow the writing, Red adversarial testing from the outside, and Blue the defensive review. They are bundled into the harness, and they are the subject of the next paper, Red, Blue, Green, and Yellow Agents.

All of this runs on Claude. The long-horizon reasoning on a project (the planning, the architecture, the threat modelling) uses Claude Opus, which can hold a whole delivery loop in a single context. The many smaller jobs that run in parallel, the deterministic scans and the reviews, run on Claude Sonnet through the Claude Agent SDK, the toolkit for driving a model through many steps with a fixed set of tools.
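
The stage gating described earlier in this section, in which each stage refuses to proceed until the one before it is real, can be illustrated with a small Python sketch. The stage names come from the text; everything else is an assumption made for illustration. In particular, the paper does not say how a stage is judged to be "real", so the hypothetical `evidence` mapping below simply records a yes or no per stage (for example, a signed-off requirements document or a passing test run). The sketch also checks every earlier stage, not only the immediately preceding one, which is a slightly stricter reading.

```python
from typing import Mapping, Optional

# Stage order as listed in the text.
STAGES = [
    "requirements",
    "planning",
    "architecture",
    "development",
    "testing",
    "deployment",
]


def can_start(stage: str, evidence: Mapping[str, bool]) -> bool:
    """A stage may start only when every earlier stage has evidence.

    `evidence` is a hypothetical record of which stages have produced
    a verified artefact; a missing entry counts as "not yet real".
    """
    position = STAGES.index(stage)
    return all(evidence.get(earlier, False) for earlier in STAGES[:position])


def next_stage(evidence: Mapping[str, bool]) -> Optional[str]:
    """Return the first stage still lacking evidence, or None when done."""
    for stage in STAGES:
        if not evidence.get(stage, False):
            return stage
    return None
```

With `evidence = {"requirements": True, "planning": True}`, `can_start("architecture", evidence)` holds while `can_start("development", evidence)` does not, so the agent's only permitted next step is architecture.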
The skills and standards themselves, however, are model agnostic and can be run with any leading frontier or open-source AI model. What differentiates the outcome is the intellectual property of the harness, more than the specific model. Any individual, team, or organization can codify and share these skills, uplifting the common capacity of all team members. Suddenly everyone has the best coding skills, requirement-gathering skills, project-planning skills, and security skills. These are like tools in your agentic toolbelt. You still need to learn how to use them.

## §05 Applying the harness at scale

Alberta runs a program to invest in our public servants called the Alberta AI Academy. This program will be covered in depth in a future paper, but suffice it to say that thousands of our public servants and more than ten thousand public participants have used the AlbertaAIAcademy.com platform to learn the essentials of effective AI use, from prompting through product delivery. Staff participate in 'cohorts', groups ranging from sixty to six hundred people who learn over three levels.

The most recent Level 3 cohort ran over six days, during which students took a harness-based approach to delivery. Participants came from a range of backgrounds and levels of seniority; three of the department's five Assistant Deputy Ministers sat in the room as students, alongside executive directors, directors, managers, staff, and interns. Few would have called themselves software engineers when the week began. Yet all were willing to be builders, working alongside AI with a harness-driven approach.

Over those six days the cohort produced more than **five hundred and sixty working applications**. To graduate Level 3, every student's application cleared the cyber review, the accessibility check, and the look-and-feel standard at first release.

Level 3 cohort outcome: 560 / 100. More than five hundred and sixty applications produced by one hundred students in six days.
On the final day of the class, members of the Cyber team came online with their own toolkit and scanned every submission. The top submissions cleared the bar with a clean bill of health: zero known cyber vulnerabilities or accessibility violations and full compliance with the government standard. Many of the Cyber team's best methods now lived natively within the AI harnesses that the students were using, so cybersecurity was a first-class concern for the agent. At every commit and milestone, these agents can be run to perform more than ninety-five checks. The apps ship with their own clean bill of health and the receipts to prove it. By the time cybersecurity has to sign off on the app, all the security issues are mitigated, resulting in apps that are highly consistent, easy to understand, and fully cyber compliant out of the gate.

"We brought the cyber team into the tent. We asked them to build the toughest review agent they could imagine. Then we baked it in to the harness." · Janak Alford, Deputy Minister, Ministry of Technology and Innovation

## §06 Lessons learned

Looking back on the last 16 months of learning within the AI Maximalist program, we have found some key insights worth surfacing:

Give yourself time to experiment. These methods are neither obvious nor easy, and providing staff time to learn, create, and refine these methods is a worthwhile investment.
Tags: velocity, harness, claude-code, agents, skills, hooks, cyber, open-source