February 2026

Where AI Productivity Turns Into Liability

Lessons from 1M Lines of Rust

Lessons from Building a Production-Grade System with GitHub Copilot Chat

AI-assisted development creates compounding leverage. Whether that leverage proves productive or destructive depends on whether it is applied deliberately or indiscriminately.


Abstract

AI-assisted coding tools promise dramatic productivity gains and are becoming increasingly common in software development. In practice, these tools excel at small, specific tasks—boilerplate generation, exploratory design, and surface-level implementation. However, their behavior under sustained use in long-lived, architecture-sensitive systems is still poorly documented.

This essay reports a firsthand account of building a large-scale system over several months with near-total reliance on AI for implementation. I made only minor manual edits (comments, small test adjustments, and mechanical code moves during refactors). I chose Rust—a programming language entirely unfamiliar to me at the outset—to develop the system. The experience exposed a consistent set of failure modes that were not reliably resolved by improved prompting, stricter instructions, or my own growing experience with the tools.

Key observations include:

  • Fundamental goal misalignment between the model and architectural integrity
  • Recurring patterns of work-avoidance masked as pragmatism
  • Excessive theorizing at the expense of productive debugging
  • Test suites full of trivial test cases rather than meaningful behavior validation
  • Severe context loss caused by conversational summarization

These issues required sustained effort to manage, and they present compounding risks when scaled across engineering teams. Without proper guardrails, they can undermine an organization's architectural objectives, code quality, maintainability, and technical depth.

This is not a critique of AI capability. Without AI, building this system would have been impossible within the available time and resources. The core message is that AI coding tools function less as assistants and more as powerful amplifiers of existing engineering culture. Without explicit constraints and close monitoring of the model's reasoning, they can destabilize entire systems as effectively as they accelerate feature delivery. This essay expands on the observed failure modes and on considerations for implementing a controlled adoption framework.


Project Context

This section provides additional context on the large-scale, production-grade system. Over the course of development, the project churned through over 1M lines of Rust across iterations, refactors, and rewrites. The completed V1 consisted of:

  • 1 binary
  • 229 source files (181K LOC)
  • 8 core modules
  • 188 test files
  • 3,075 tests (58K LOC)
  • 389 fixture files (27K SQL LOC)

Methodology: How AI Was Used

My observations and assertions are drawn from sustained, practical use of AI coding tools during the development of a complex system, and focus on behavioral patterns that persisted across model transitions under real delivery pressure and repeated refactoring.

My goal was to build an application—motivated by a perceived gap in the market—in a language I did not know how to write, using AI as the primary development aid.

I did all development in VS Code using GitHub Copilot. AI assistance was provided through Copilot Chat and inline suggestions. Over the project, I used multiple models, with Claude models accounting for the majority of implementation work. GPT-5.2, GPT-5.1 Codex, and GPT-5.1 Codex-Max were used periodically for targeted tasks and early refactors.

I primarily used Claude Sonnet 4.5 during early and mid-stage development. As the codebase grew in size and complexity, Sonnet became increasingly impractical for sustained use due to context limitations. I later relied on Claude Opus 4.5, which could work more effectively at the necessary scale but showed the same failure modes described in this essay. My experience with GPT-5.2 led me to exclude it from core development due to its heavy reliance on step-by-step direction from me and its reluctance to perform non-trivial code transformations at the scale the project required. At the same time, GPT-5.2 excelled at writing technical documentation throughout development.

I followed an architecture-first approach. The desired end-state—performance characteristics, safety guarantees, system boundaries, and invariants—was defined early and used to guide design. I used AI to propose architectural options and concrete designs, which were iteratively challenged, refined, and compared before implementation began. I evaluated, approved, and enforced all architectural decisions. When implementations drifted from established constraints, I interrupted the model and redirected it to the documented plan.

Interaction with the model was iterative and highly constrained. Instructions were explicit and often restated. As I noticed patterns of failure, I added additional constraints to instructions and workflow prompt files. I also created implementation cookbooks to guide behavior. I routinely asked the model to justify its decisions, audit its own output against stated constraints, and revise implementations that violated architectural intent.

I understood that I was asking a lot and pushing the limits of valid use cases for AI-assisted development. I also understood that models do their best work with extensive context. My process for adding new features and extensions was to ask the model to:

  1. Gather context from context docs on the application architecture and key call stacks
  2. Gather context from existing areas of the codebase that needed changes or extension to understand the patterns used and key functions
  3. Write a planning document before changing code (this helped mitigate issues related to context compression/summarization)
  4. Wait for my review, updates, and approval of the planning document prior to implementation
  5. Implement the first section of the plan
  6. Wait for my review, updates, and approval before beginning the next section
  7. Rinse and repeat until the plan was complete

Even with this structure, the model would invent APIs, guess type definitions, and duplicate code that already had dedicated helpers, despite having gathered the proper context and documented a concrete plan to follow before generating code. When asked to audit its decisions, the model frequently explained the behavior as a fallback to generalized patterns from training, overriding the specific APIs and constraints it had just reviewed. Occasionally, it would surprise me and get it perfect on the first attempt, without any change in the process. Most of the time, however, it took multiple attempts to complete a single step in the implementation.
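
To make the duplication pattern concrete, the sketch below is a hypothetical Rust fragment with invented names, not code from the project: a dedicated helper already exists, yet the generated change re-implements the same logic inline, creating a second source of truth that can drift from the canonical one.

    // Hypothetical sketch (invented names) of the duplication pattern.

    /// Existing, dedicated helper already present in the codebase.
    fn normalize_identifier(raw: &str) -> String {
        raw.trim().to_ascii_lowercase()
    }

    /// What generated changes often looked like: the same logic re-implemented
    /// inline instead of calling the helper above.
    fn register_table(raw_name: &str) -> String {
        raw_name.trim().to_lowercase() // intended: normalize_identifier(raw_name)
    }

    fn main() {
        // The results happen to agree today, which is exactly why the
        // duplication slips through review and drifts later.
        assert_eq!(normalize_identifier("  Users "), register_table("  Users "));
    }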

As the system evolved, I had to refactor continuously. Under supervision, AI generated code for all tasks in these cycles, including during large-scale rewrites. I had to embrace refactors as a necessary part of converging toward a coherent system.


Failure Mode Taxonomy

The failure modes described in this section appeared repeatedly over the course of development, across multiple models, architectural iterations, and refactoring cycles. Each is a distinct pattern of behavior I noticed under sustained AI-assisted development in a constraint-heavy system.

Goal Misalignment: Plausibility Over Architectural Integrity

AI models consistently optimized for producing locally plausible solutions—code that compiled, passed tests, and appeared reasonable in isolation. However, these solutions often diverged from global architectural intent. Architectural invariants and system constraints were explicitly documented and kept in persistent context and workflow documents in the development environment. Even so, these constraints were not reliably preserved across interactions unless they were actively reasserted during reasoning or implementation.

As a result, I often had to interrupt the model mid-reasoning or mid-change to avoid eroding architectural boundaries. This required continuously monitoring the screen during reasoning to catch sub-optimal paths early.

Notably, the same model that generated flawed code could accurately diagnose what was wrong when I asked it to self-audit its implementation. The corrective knowledge existed within the model, but it simply wasn't applied during generation. This suggests the failure is not one of capability but of whether that capability is engaged while code is being produced. The model could explain idiomatic Rust patterns, proper error handling, and architectural best practices clearly—immediately after producing code that violated all three.
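
The sketch below is a hypothetical illustration with invented types of what "locally plausible" meant in practice: the code compiles and passes a superficial check, yet it fabricates an identifier instead of going through the component that guarantees the documented invariant.

    // Hypothetical sketch (invented types): code that compiles and looks
    // reasonable in isolation while violating a documented invariant.

    /// Documented invariant: every RecordId must come from the Catalog so
    /// that ids stay unique across the whole system.
    #[derive(Debug, PartialEq)]
    struct RecordId(u64);

    struct Catalog {
        next: u64,
    }

    impl Catalog {
        fn allocate(&mut self) -> RecordId {
            let id = RecordId(self.next);
            self.next += 1;
            id
        }
    }

    /// Generated code: locally plausible, architecturally wrong. It fabricates
    /// an id instead of asking the Catalog, silently breaking global uniqueness.
    fn import_record(_payload: &str) -> RecordId {
        RecordId(0)
    }

    fn main() {
        let mut catalog = Catalog { next: 1 };
        let allocated = catalog.allocate();
        let imported = import_record("row");
        // A superficial test passes; the invariant violation stays invisible.
        assert_ne!(allocated, imported);
    }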

Work Avoidance Disguised as Pragmatism

When faced with complex or ambiguous requirements, the model often opted for partial solutions framed as pragmatic compromises. These were often accompanied by language signaling deferral rather than resolution, such as temporary limitations, simplified assumptions, or scope reductions.

These suggestions consistently shifted complexity forward rather than dealing with it. In a long-lived system, such deferrals compounded, resulting in short-term progress with brittle solutions that created costly long-term structural debt.
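
A minimal, hypothetical Rust sketch of the pattern (the function and its behavior are invented for illustration): the "pragmatic" version compiles and unblocks the current task, but it turns a visible failure into a silent one that has to be debugged downstream later.

    // Hypothetical sketch (invented function) of deferral framed as pragmatism.
    use std::num::ParseIntError;

    /// Intended behavior: propagate parse failures to the caller.
    fn parse_limit(raw: &str) -> Result<u32, ParseIntError> {
        raw.trim().parse()
    }

    /// What "pragmatic" output often looked like: the error path is swallowed
    /// "for now", so bad input silently becomes 0 and the real work is deferred.
    fn parse_limit_for_now(raw: &str) -> u32 {
        // TODO: handle invalid input properly (known limitation)
        raw.trim().parse().unwrap_or(0)
    }

    fn main() {
        assert!(parse_limit("abc").is_err()); // the failure is visible
        assert_eq!(parse_limit_for_now("abc"), 0); // the failure is hidden
    }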

Theorizing Without Verification

A recurring pattern I saw involved extended speculative reasoning instead of direct validation. The model would propose explanations for the application's behavior, outline potential causes, or suggest fixes without confirming through instrumentation, logging, or minimal reproduction.

I frequently had to interrupt reasoning to demand systematic validation. On multiple occasions, direct instructions to add debugging code were ignored in favor of continued cycles of speculation. This prolonged debugging cycles without clear progress.
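
What I repeatedly had to demand instead is illustrated by the hypothetical sketch below (function and data invented for illustration): a single line of instrumentation that confirms or eliminates a theory before any further reasoning.

    // Hypothetical sketch: minimal instrumentation instead of speculation.
    use std::collections::HashMap;

    fn lookup<'a>(index: &'a HashMap<String, u64>, key: &str) -> Option<&'a u64> {
        // One line of evidence beats a paragraph of theorizing: log the exact
        // key and whether it is present before reasoning about exotic causes.
        eprintln!("lookup key={:?} present={}", key, index.contains_key(key));
        index.get(key)
    }

    fn main() {
        let mut index = HashMap::new();
        index.insert("users ".to_string(), 42); // note the trailing space
        // The log output points straight at the real cause: a key mismatch,
        // not hash collisions, stale state, or any other speculative theory.
        assert!(lookup(&index, "users").is_none());
    }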

In some cases, validation failed due to premature declarations of success. The model asserted that a test suite had passed immediately after the terminal output showed a failure. This was not a matter of ambiguity; the evidence directly contradicted the model's conclusion.

This behavior became more frequent as session length and system complexity increased. As the model's internal representation of the system fragmented, it increasingly asserted correctness without checking the application's actual behavior. These incorrect conclusions, and the model's disengagement from the desired end state, required me to continuously interrupt and force attention back to what the system was actually doing.

Test Degeneration: Passing Over Proving

Tests generated or changed with AI often aimed at passing conditions rather than behavioral assurance. This manifested as trivial test cases, tests marked as ignored and "aspirational", overly permissive assertions, or adjustments that aligned tests with existing behavior rather than correcting application logic to match intended behavior. In several cases, failing tests were "fixed" by weakening expectations rather than by correcting the underlying logic. Over time, this reduced the test suite's effectiveness as a signal of correctness, increasing reliance on my direct involvement in test case creation and architectural review. After implementing a new feature, the model would generate large test files (20-50 tests) that often had to be completely refactored with my direct input to ensure robustness.
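
The hypothetical sketch below, built around an invented function, shows the difference in miniature: the first test passes for almost any implementation and proves little, while the second pins down the intended behavior, including its boundary case.

    // Hypothetical sketch (invented function): passing versus proving.
    fn total_with_tax(cents: u64) -> u64 {
        cents + cents / 10 // 10% tax on integer cents
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        // The kind of test the model tended to produce or retreat to:
        // it holds for almost any implementation, so it proves very little.
        #[test]
        fn weak_assertion() {
            assert!(total_with_tax(1000) > 0);
        }

        // A test that actually pins down intended behavior, including the
        // integer-division boundary.
        #[test]
        fn behavioral_assertion() {
            assert_eq!(total_with_tax(1000), 1100);
            assert_eq!(total_with_tax(5), 5); // amounts under 10 cents gain no tax
        }
    }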

Context Collapse Under Scale

As the codebase grew, the model increasingly struggled to maintain an accurate contextual understanding of the system. Large files, deep call chains, and cross-cutting concerns exceeded practical context limits, leading to incorrect assumptions about system structure, outdated references, or proposals that conflicted with recent changes.

In GitHub Copilot Chat, context compression/summarization is not optional and at times occurred within a handful of exchanges, which measurably increased constraint drop-off and incorrect assumptions.

Overconfidence in Mechanical Transformations

During refactors, the model often favored shortcuts—such as large bulk replacements—over systematic changes, without fully accounting for semantic nuance. These attempts often appeared syntactically correct but introduced breakage, in some cases corrupting source files outright.

When these efforts failed, the model tended to propose increasingly aggressive corrective actions rather than localized fixes, such as checking out files or entire repositories, which in some cases wiped out considerable progress. This amplified disruption during already unstable refactoring phases.
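
A hypothetical sketch of how a "mechanical" edit can stay syntactically valid while changing behavior: imagine every .unwrap() call textually replaced with .unwrap_or_default() during a bulk cleanup (the function below is invented for illustration).

    // Hypothetical sketch: a bulk replacement that compiles but moves the
    // failure mode from an immediate panic to a silent misconfiguration.
    fn parse_port(raw: &str) -> u16 {
        // Before the bulk edit: raw.parse().unwrap() -> bad config fails loudly.
        // After the bulk edit: bad config silently becomes port 0.
        raw.parse().unwrap_or_default()
    }

    fn main() {
        // Looks harmless and passes a smoke test, but the error surface has
        // shifted to wherever port 0 eventually causes confusion.
        assert_eq!(parse_port("not-a-port"), 0);
    }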

Indecision Under Mild Opposition

Strongly reasoned proposals were often reversed in response to minimal counterarguments or reframing on my part. The model showed sensitivity to conversational pressure, adjusting its stance without materially reassessing its underlying assumptions.

In practice, this made architectural consistency dependent on sustained, explicit enforcement rather than convergence toward stable solutions. Without intervention, design decisions oscillated rather than settled.

Failure Modes Compound, Not Isolate

Importantly, these failure modes rarely occurred in isolation. Goal misalignment encouraged work avoidance; work avoidance increased refactor pressure; refactors amplified context loss; context loss increased speculative reasoning. Each reinforced the others.


Considerations for a Controlled Adoption Framework

Effective adoption depends less on model capability and more on the presence of explicit constraints that shape how and where AI is applied.

The constraints outlined below are not presented as best practices or universal rules. They are drawn from the conditions under which AI-assisted development remained effective in this project, and from the points at which its behavior became destabilizing in their absence.

  • Define AI-allowed and AI-restricted zones
  • Preserve human ownership of architectural decisions
  • Constrain pull request size and scope
  • Require verification before speculation
  • Treat tests as sensors, not shields
  • Make context persistence an explicit responsibility
  • Accept temporary slowdown as a success signal

I am intentionally not prescriptive here. The list scaffolds a mental model to help organizations preserve architectural integrity under sustained change.
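
As one illustration only, and not a prescription, the sketch below shows what constraining pull request size could look like as an automated check. The branch name, the line budget, and the reliance on git diff --numstat are assumptions chosen for the example.

    // Minimal sketch of a change-size guard (assumed branch and budget).
    use std::process::Command;

    fn main() {
        const MAX_CHANGED_LINES: u64 = 800; // assumed budget; tune per team

        let output = Command::new("git")
            .args(["diff", "--numstat", "origin/main...HEAD"])
            .output()
            .expect("failed to run git");

        // --numstat prints "<added>\t<deleted>\t<path>" per file; binary files
        // show "-" and are skipped by the numeric parse below.
        let changed: u64 = String::from_utf8_lossy(&output.stdout)
            .lines()
            .flat_map(|line| line.split_whitespace().take(2))
            .filter_map(|n| n.parse::<u64>().ok())
            .sum();

        if changed > MAX_CHANGED_LINES {
            eprintln!("change too large: {changed} lines (budget {MAX_CHANGED_LINES})");
            std::process::exit(1);
        }
        println!("change size OK: {changed} lines");
    }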


Implications for Engineering Leadership

AI adoption is not a tooling decision. It is a structural change to how software is produced. AI accelerates code generation, but it also alters review dynamics, shifts cognitive load, and redistributes architectural responsibility. Leadership decisions that treat AI as a neutral productivity upgrade risk overlooking these second-order effects.

Engineering leadership therefore carries a responsibility to define boundaries explicitly. This includes clarifying where AI-assisted coding is appropriate, where it is restricted, and how architectural authority is exercised. In the absence of such clarity, architectural decisions are made implicitly in generated code rather than agreed upon through design, and drift becomes visible only when correction is costly.

KPIs need cautious evaluation. Traditional productivity metrics—throughput, cycle time, and output volume—become less informative under AI-assisted development. Increased velocity may coexist with declining system coherence. Leaders who continue to optimize for visible output risk incentivizing behaviors that amplify the failure modes described in this essay. Metrics that capture change cost, review depth, and architectural stability become increasingly important.

The velocity of AI-assisted coding is transformational. Responsible use also changes the cognitive load and mental models held by engineers: they will spend more time reviewing, enforcing constraints, and verifying architectural invariants. The volume of code produced in a session far exceeds what can typically be produced without AI. Deliberate attention should be given to cultural constraints that limit change volume and preserve sufficient review capacity.


Conclusion: AI as an Amplifier, Not a Partner

AI-assisted development offers real and measurable leverage. It accelerates implementation, reduces friction in exploration, and lowers the cost of iteration. The question is not whether AI can be useful, but under what conditions that usefulness stays sustainable.

AI does not replace engineering judgment. It can't own architectural intent, preserve invariants over time, or reason about long-term system health without continuous human intervention. Instead, AI amplifies whatever discipline is already present. In environments with strong constraints, explicit ownership, and rigorous verification, it can extend human capability. In their absence, it accelerates drift, obscures responsibility, and compounds risk.

High-integrity systems demand accountability that cannot be delegated. Architectural decisions, safety guarantees, and correctness properties must remain human-owned, not because AI is incapable, but because responsibility cannot be abstracted away. When that ownership weakens, failures become latent, surfacing later as fragility, resistance to change, and loss of trust in the system.

AI-assisted development is therefore not a partnership. It is an amplifier. Used deliberately, it increases leverage without surrendering control. Used indiscriminately, it scales failure faster than traditional development ever could. The difference is not the model. It is the discipline surrounding its use.


Appendix A: "Kill Phrases" and Warning Signals

This appendix catalogs recurring phrases and behaviors that reliably preceded architectural degradation during AI-assisted development. While these phrases are individually innocuous, their repeated appearance often signaled misalignment between the model's intended approach and global system integrity.

Examples include:

  • "For now"
  • Anything with the word "pragmatic"
  • "Known limitation"
  • "Good enough"
  • "We can refactor later"
  • "This should work"
  • "Given the complexity"
  • "This test is aspirational"
  • "Let me adjust the test to match reality"
  • "That's a separate issue"
  • "Unrelated to my changes"

The presence of these phrases was not inherently problematic. The risk appeared when they replaced explicit decisions, verification, or constraint enforcement.


Appendix B: Pull Request Checklist for AI-Assisted Code

This appendix provides a minimal checklist intended to slow review just enough to force understanding.

Example items:

  • What architectural decision does this change depend on?
  • What invariant does this code rely on?
  • What behavior is this test really validating?
  • Could this change be explained without referencing the implementation?
  • Was any constraint weakened to make this pass?

The checklist is intentionally short. Its purpose is not gatekeeping, but to surface implicit assumptions before they become embedded.


Appendix C: When to Terminate an AI Session

Certain interaction patterns were reliably followed by low-quality outcomes. This appendix names conditions under which continuing an AI session became counterproductive.

Indicators include:

  • Repeated speculative reasoning without verification
  • Oscillation between incompatible solutions
  • Broad mechanical changes proposed without semantic grounding
  • Context confusion that persists after restatement
  • Increasingly frequent suggestions of deferral or simplification

Session termination is framed not as failure, but as a control mechanism to prevent momentum from overwhelming judgment.


Appendix D: Open Questions for Model Builders

This appendix turns attention outward, toward model builders, without assigning blame. It lists unresolved questions exposed by sustained, constraint-heavy use of AI-assisted development.

Examples include:

  • How should architectural constraints persist across interactions?
  • Can models signal uncertainty or misalignment more explicitly?
  • What tooling support is needed to prevent context collapse at scale?
  • How can verification be incentivized over plausibility?

These questions are intentionally unanswered. They define the boundary between current capability and future research.