The dependency you cannot audit.
On why a withdrawn frontier model is a control finding for the regulated buyer, not a news item, and the architecture that confines the dependency to where it belongs.
Frontier intelligence in regulated work was never about cutting headcount. It was about the cost of being wrong. A recent reminder of how revocable that intelligence is should change how serious buyers architect for it.
A frontier model that many teams had been waiting for became available, and then, a few days later, became unavailable. Not deprecated, not repriced. Withdrawn, by a force outside the contract between the buyer and the provider. I am not going to argue the merits of the action that did it. The merits are not the point. The point is what the episode exposed for one specific kind of buyer, the regulated enterprise, and why that buyer should treat it as a control finding rather than a news item.
The assumption is wrong.
The prevailing story about enterprise AI is that big companies are deploying agents to remove people and save money. For a lot of low-stakes automation, that story is accurate enough. For the deployments that actually matter in a regulated environment, it is wrong, and the error matters because it leads buyers to evaluate the wrong thing.
In a regulated context, the reason you reach for the most capable model is not the salary of the person it might replace. It is the cost of a single bad decision. When an agent is managing a process where a minor error does not merely cost money but causes a blowup (a compliance breach, a flawed disclosure, a corrupted record with legal consequences downstream), the calculation is not labor arbitrage. It is tail risk. The expensive token is not replacing a headcount. It is buying down the probability of a catastrophic, asymmetric error. That is the legitimate case for frontier intelligence in the enterprise, and almost no one states it, because the entire public conversation is stuck on jobs.
Once you see it that way, the ROI question changes shape. Frontier intelligence is not overpriced labor. It is insurance against a class of error that the cheaper model could not reliably avoid. Whether it pencils out depends entirely on the consequence of being wrong, not on the wage of the task.
Who actually got hurt.
This is why the withdrawal landed hardest on a counterintuitive group. The teams set back furthest were not the ones chasing efficiency. Those projects can wait a quarter. The teams set back furthest were the ones whose agents had been undeployable, because no prior model was reliable enough to be trusted with a process that could blow up on a single mistake.
For those teams, capability is not a nice-to-have. It is the deployability threshold. They had been evaluating, carefully, whether a model had finally crossed the line where a high-consequence agent could be put into production and defended to an auditor. Some of them had just concluded that it had. And then the model was gone, mid-evaluation, for exactly the deployments where reliability was the whole reason to be there.
That is not a delayed cost-savings initiative. It is a genuine capability threshold reached and then revoked, in the one category where the threshold was the entire point. To anyone who assumes enterprise AI is about removing humans, this loss is invisible. To the buyer who was finally going to deploy something real and defensible, it is the most significant thing that has happened in the space.
Revocability is a control failure, not a risk.
Here is the part a regulated buyer cannot file under inconvenience.
Large enterprises were already struggling to make frontier ROI defensible. The math is hard when premium intelligence is burned across an entire workflow, most of which never needed it. That was the existing pressure. The withdrawal added a second one on top, and for a regulated buyer the second one is more serious than the first.
It is one thing to run a process that is expensive. It is another to run a process whose core dependency can be removed by a party that is not on your contract and does not answer to your auditor. In a regulated environment you have to be able to stand in front of a regulator and account for the continuity of a control. A workflow whose intelligence layer can disappear because of a decision made three parties removed is not a workflow you can account for. It is a control with an undocumented single point of failure, and the failure is not in your hands. That is not a risk to be priced. It is a finding to be remediated.
The two pressures are really one problem stated twice. The enterprise has a dependency on intelligence it does not control and frequently cannot justify, and both halves of that sentence are now liabilities.
The architecture that resolves both.
The resolution is not to abandon frontier intelligence, and it is not to pretend a cheaper model is always enough. For the genuinely high-consequence decision, the one where an error blows up, you sometimes do need the most capable reasoning available, and pretending otherwise is how you end up with the breach you were trying to avoid.
The resolution is to stop spending frontier intelligence indiscriminately, and to engineer the system so that the dependency is quarantined to the few decisions that actually justify it.
Most of any real workflow is moderate-complexity work that a smaller, controllable model handles well when it is given the right inputs. The reason teams reach for the biggest model everywhere is usually not that the work demands it. It is that the context layer is weak, so the model is asked to compensate for missing ground truth with raw capability. Build the context layer properly, ground the model in what is actually true, constrain it to exactly what it needs and can see, and the bulk of the workflow becomes provider-independent and cheap. The model-choice question, for that bulk, mostly dissolves into a context-engineering question. That is the work that matters now, and it is harder and less glamorous than picking a model, which is precisely why most teams have not done it.
Then the narrow set of decisions where an error is catastrophic is where the frontier earns its place, its price, and its dependency. You make those calls deliberately. You know exactly which ones they are. And critically, you can now show an auditor the map: here is the part of the system that runs on inference we control, here is the small set of high-consequence decisions that ride on external frontier intelligence, and here is why each one is worth the dependency. The deployment becomes defensible not because nothing can be withdrawn, but because you know precisely what is exposed and you have confined it to where the tail risk justifies the exposure.
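The map described above can be made concrete as a small registry. What follows is a minimal sketch, not a reference implementation: the names `Tier`, `Decision`, `validate`, and `exposure`, the example workflow steps, and the idea of recording a fallback for each frontier decision are all illustrative assumptions, not something the episode or any vendor prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CONTROLLED = "controlled"  # inference the enterprise runs or can swap freely
    FRONTIER = "frontier"      # external frontier model: a revocable dependency


@dataclass(frozen=True)
class Decision:
    name: str
    tier: Tier
    justification: str = ""    # why the tail risk warrants the dependency
    fallback: str = ""         # assumed field: what happens if the model is withdrawn


def validate(decisions: list[Decision]) -> list[str]:
    """Return findings for frontier decisions that cannot be defended to an auditor."""
    findings = []
    for d in decisions:
        if d.tier is Tier.FRONTIER:
            if not d.justification:
                findings.append(f"{d.name}: frontier dependency with no justification")
            if not d.fallback:
                findings.append(f"{d.name}: frontier dependency with no documented fallback")
    return findings


def exposure(decisions: list[Decision]) -> float:
    """Fraction of decisions that ride on external frontier inference."""
    if not decisions:
        return 0.0
    frontier = sum(1 for d in decisions if d.tier is Tier.FRONTIER)
    return frontier / len(decisions)


# Hypothetical workflow: three controllable steps, one high-consequence decision.
WORKFLOW = [
    Decision("extract_fields", Tier.CONTROLLED),
    Decision("classify_document", Tier.CONTROLLED),
    Decision("draft_summary", Tier.CONTROLLED),
    Decision(
        "approve_disclosure",
        Tier.FRONTIER,
        justification="an error produces a flawed regulatory disclosure",
        fallback="route to a human reviewer",
    ),
]

print(validate(WORKFLOW))   # an empty list means every frontier call is accounted for
print(exposure(WORKFLOW))
```

The design choice worth noting is that the registry fails loudly on an unjustified frontier call rather than on a frontier call as such. The borrowed part is allowed; it just has to be small, named, and explained.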
What this means for serious buyers.
The regulated enterprises I talk to have preferred inference they control for a while, and they have always said why. They will not accept lock-in for anything that matters. I used to read that as conservatism. I read it now as a posture that priced a risk the rest of the market was discounting.
The lesson is not to fear the frontier. The frontier is real, and for the high-consequence decision it is sometimes the only responsible choice. The lesson is that a serious deployment has to know the difference between the intelligence it owns and the intelligence it borrows, has to keep the borrowed part small and deliberate, and has to be able to account for every place the borrowed part appears. The cheaper, controllable floor is what makes the ROI defensible. The disciplined frontier edge is what makes the hard decisions safe. And the map between them is what makes the whole thing auditable.
The model that disappeared will likely return. The dependency it exposed will not disappear with it. The buyers who treat this as a prompt to redraw their architecture, rather than as a story to wait out, will be the ones whose most important agents are still running the next time something is withdrawn.
The generation was never the hard part.
On why cheap intelligence makes judgment the entire game, and what that means for the people who build with it.
A year ago I was learning AI tools. Today I am delivering applied AI projects, full pipelines of them, across more contexts than I would have committed to before. Soon the distinction stops being worth making, because intelligence will be as embedded in how work gets done as electricity is in how a building runs. Nobody calls a business an “electricity company” because it runs on electricity. We are heading to the same place with AI, faster than most people are pricing in.
That sounds like a reason to relax. It is the opposite. When the term “AI” stops meaning anything, the thing it pointed at becomes the whole game, and the people who treated it as a passing tool get left behind by the people who built it into their judgment.
Generation got cheap. Judgment did not.
Here is the shift that matters, and most companies are looking straight past it. Producing output is now nearly free. A model will write the code, draft the document, summarize the meeting, propose the architecture, generate the plan. The marginal cost of a first draft of almost anything has collapsed toward zero.
What has not gotten cheap is making it right.
Cheap to make, expensive to make right. That is the entire economy of the next decade in seven words. The generation is not the issue. The issue is the discernment to know which output is good and which is plausible garbage, the sequencing to know which questions to ask and in what order, the taste to know which tool fits which problem, and the discipline to build the patterns that produce reliable results instead of impressive one-offs.
That is judgment. Judgment does not come out of a model. It comes out of people who have built enough things to know the difference between something that looks done and something that is done.
This is the lesson from running a dark factory of autonomous coding agents for the last year. The agents produce work at a volume I could not have produced myself. The volume is not the win. The win is the eval architecture I built around them: the typecheck gates, the parity verification, the screenshot evidence, the intent queue with explicit acceptance criteria. Those gates encode my judgment. The agents are the input. The gates are the asset.
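To make the idea of gates as encoded judgment concrete, here is a minimal sketch of a gate chain. It is an illustration under assumptions, not the author's actual harness: the `Intent` and `Submission` shapes, the field names, and the gate functions are all hypothetical stand-ins for the typecheck, parity, screenshot, and acceptance-criteria checks named above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Intent:
    description: str
    acceptance_criteria: list[str]


@dataclass
class Submission:
    """What an agent hands back for one intent (hypothetical shape)."""
    intent: Intent
    typecheck_passed: bool
    parity_diffs: int                      # mismatches against the reference behavior
    screenshots: list[str] = field(default_factory=list)
    criteria_met: set[str] = field(default_factory=set)


# A gate returns None when it passes, or a human-readable reason when it fails.
Gate = Callable[[Submission], Optional[str]]


def typecheck_gate(s: Submission) -> Optional[str]:
    return None if s.typecheck_passed else "typecheck failed"


def parity_gate(s: Submission) -> Optional[str]:
    return None if s.parity_diffs == 0 else f"{s.parity_diffs} parity differences"


def evidence_gate(s: Submission) -> Optional[str]:
    return None if s.screenshots else "no screenshot evidence attached"


def acceptance_gate(s: Submission) -> Optional[str]:
    missing = [c for c in s.intent.acceptance_criteria if c not in s.criteria_met]
    return None if not missing else "unmet criteria: " + ", ".join(missing)


GATES: list[Gate] = [typecheck_gate, parity_gate, evidence_gate, acceptance_gate]


def review(s: Submission) -> list[str]:
    """Run every gate and collect the failure reasons; an empty list means accept."""
    return [reason for gate in GATES if (reason := gate(s)) is not None]


intent = Intent("add CSV export", ["exports all rows", "handles empty table"])
draft = Submission(intent, typecheck_passed=True, parity_diffs=0,
                   criteria_met={"exports all rows"})
print(review(draft))  # lists the gates this draft fails
```

Running every gate instead of stopping at the first failure is a deliberate choice in this sketch: the agent gets the full list of what is wrong in one pass, which is closer to how a human reviewer would respond.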
The asset is the people, not the output.
It is tempting, watching a model produce a thousand lines in seconds, to think the human is now the bottleneck. That framing is exactly backwards. The human is now the only part that is scarce.
Anyone can generate. Almost no one can reliably tell good from bad at speed, across domains, under real constraints, with a customer's outcome on the line. The refinement and discernment of model output. Knowing which questions matter. Knowing which tool to reach for and when. Building the patterns that turn one person's hard-won capability into a substrate the next problem can run on top of. Constructing the factories and the eval layers that turn intelligence into outcomes that hold up.
That is the work. That is the asset. The model is a commodity. The harness is the moat. The judgment that builds the harness is the part no competitor can copy by signing the same vendor contract you did.
Human judgment versus agent slop.
There is a fork in the road, and it is not subtle. One path uses intelligence capably, efficiently, honestly, and effectively, with a human accountable for the result. The other path points a model at a problem, takes the first thing it produces, and ships it.
The second path produces agent slop. Output that is fluent, confident, voluminous, and wrong in ways that take longer to find than they would have taken to do right the first time. I see the receipts in the field. Assessments that misclassify a customer's actual posture and get torn up the first time a buyer with real domain knowledge reads them. Policy drafts that look authoritative and would fail their first regulatory review. Reports that paper over data quality gaps the customer would have flagged themselves if anyone had asked.
It scales beautifully right up until it destroys trust.
The difference between those two paths is not the model. Both paths can use the same model. The difference is entirely human. It is whether someone with judgment is in the loop, owning the outcome, applying discernment, and refusing to confuse generation with delivery.
What this actually demands.
Grounding myself in this reality is the whole strategy. Not chasing the newest model. Not collecting tools. Building the judgment, the patterns, and the skills that turn cheap intelligence into expensive-to-replicate outcomes.
I build with intelligence. I run intent queues with intelligence. I find opportunity, evaluate process, and translate decisions to artifacts with intelligence. I draft, I write, I architect, all with intelligence in the loop. From here on out it is in everything. The question is no longer whether to use it. The question is whether I use it with the judgment that separates an outcome a customer will pay for from a draft a customer could have generated themselves.
That is the only distinction left that matters. It is worth getting right.
Ten years inside the majors.
What AI changes.
On what enterprise AI deployment actually requires inside companies whose assets are measured in acres and barrels, not sprint velocity.
Most writing about enterprise AI is written from the outside. Analysts who visit. Consultants who present. Journalists who quote the press release. The result is a literature full of frameworks and almost no specifics.
What I know about enterprise AI in energy is from the inside. I spent ten years building systems for companies whose assets are measured in acres and barrels and regulatory filings, not sprint velocity. That background changes what I see when I look at where AI is going.
What the environment actually is.
Three companies. Three operating environments that look similar from the outside and are completely different inside.
Chevron was the scale problem. A hundred thousand users. Three continents. A SharePoint architecture and platform program that had to work for a field technician in Kazakhstan and a project manager in San Ramon and a legal team in Houston, on infrastructure that had to be secure, auditable, and maintainable for years after the initial build. That is not a software product problem. It is a governance and systems problem that happens to involve software.
Marathon was the integration problem. Midstream operations. Systems that predate the web, still running, still producing data, and still subject to regulatory retention requirements. The challenge was not building something new. The challenge was building something new that did not break the thing that had been running since 1987.
Southwestern Energy was the transformation problem. A Fortune 500 company in transition, leadership that understood the need to modernize, and an operating model that had grown up around a set of assumptions that no longer matched the environment. The work was not a technology project. It was an operating-model redesign that technology enabled.
Three different problems. The same underlying pattern: systems that hold the company together, that carry years of institutional knowledge, that cannot simply be replaced, and that the AI era is going to put under pressure.
What AI changes, and what it doesn't.
The easy answer is that AI changes everything. The honest answer is more specific.
AI changes the retrieval problem. The single biggest cost in enterprise knowledge systems is finding the right information at the right moment. A field engineer troubleshooting a compressor failure does not have time to navigate a document hierarchy. A project manager pricing a change order needs to know what was agreed in a contract clause from 2018. A compliance team responding to a regulatory inquiry needs to trace a decision back through five years of meeting notes and email. These are retrieval problems, and AI, applied correctly, closes the gap between what a system contains and what a person can actually find.
AI does not change the governance problem. It makes it harder. When a language model surfaces an answer, the question is: where did that come from, is it current, is it the version that applies to this context, and who is accountable if it is wrong? Those questions existed before AI. They become more acute when the answer arrives in fluent prose with no citation.
AI does not change the integration problem. A system that could not talk to the historian in 2019 cannot talk to the historian just because it is now built on a language model. The underlying data topology is the same. The new tools make it easier to process the data once you have it. They do not make it easier to get the data.
AI changes the speed-of-decision problem. The time between an event and an informed decision compresses dramatically when the right information is surfaced automatically. That matters enormously in operations, where the cost of a slow decision is measured in downtime and in safety exposure.
What the operator-architect does.
A framework tells you that you need a retrieval layer. It does not tell you that the documents your retrieval layer needs to index are in seven different systems, three of which are on-premises and two of which require a security review before any data can leave the building. It does not tell you that the team responsible for one of those systems has been promised for three years that they will be migrated to the new platform and is no longer willing to cooperate with any new initiative until that promise is kept. It does not tell you that the metadata schema the new system assumes does not match the metadata schema the old system actually uses, and that reconciling them requires a taxonomy project that nobody budgeted for.
The frameworks are not wrong.
They are incomplete.
That is what the architecture looks like from the inside. The frameworks describe the logical structure of the solution. They do not describe the friction that exists between the logical structure and the physical environment it has to operate in.
The operator-architect's job is to carry both maps simultaneously and navigate the distance between them.
What this means for the next five years.
The companies that will extract durable value from AI in regulated and industrial settings are not the ones that deploy the best models. They are the ones that solve the integration, governance, and operating-model problems that determine whether the model has access to the right data, produces answers that can be traced and verified, and fits into a workflow that the people who have to use it will actually use.
That work is not glamorous. It does not make for a good press release.
It is the difference between a demo that works and a system that runs.
The architects who understand that difference are not common. They are not produced by studying AI. They are produced by building systems inside environments that punish shortcuts.
The next five years of enterprise AI deployment will be determined by whether the people building the systems understand the environments they are deploying into. Most of the AI literature is written by people who have not spent time inside a producing energy asset, a regulated financial institution, or a healthcare system with fifty years of operational history.
I have. That is not a credential. It is a constraint on what I am willing to say, and what I am not willing to pretend I know.
The systems I built at Chevron and Marathon and Southwestern Energy are still running, and where one has since been replaced, the decisions I made in that engagement show up in the architecture that replaced it. That is what it means to build for longevity inside an enterprise environment.
AI changes what those systems can do. It does not change what it takes to build them right.
Alignment is a human problem.
On why alignment lands at the user, not in the lab, and what a harness around the model has to actually hold to close the gap.
The industry keeps trying to solve alignment in the lab. I think that's why it isn't working.
The dominant framing assumes that if we tune the model correctly, the outputs will be reliably aligned. Train it on the right data, run the right RLHF, build the right constitution, and the system will produce safe, useful, value-aligned responses. The work happens at the model level. The user, in this framing, is downstream. They receive the output. The alignment is something done before they show up.
This is backwards, or at least incomplete. The model can be perfectly tuned and still produce subtly miscalibrated outputs for a specific user, because the alignment that matters isn't general. It's specific to the person in the chair. And the specificity can't be solved by the lab alone. It requires shared context between the agent and the user that the platforms aren't currently building at sufficient depth.
I call this the common grounding problem. Without enough shared context, even good outputs feel slightly off to the user. The competence is real. The miscalibration is also real. They coexist. And the user, especially one with a trained eye, develops a subtle doubt that prevents reliable delegation. They keep checking the agent's work because the agent doesn't quite seem to know them. So they can't really hand things over. The delegation breaks at the trust layer, not the capability layer.
This is the gap that determines whether AI actually unlocks value in someone's life or just hovers near it.
The harness, not the model.
The models are good enough. They've been good enough for a while. What's missing is the harness around the model that gives it access to the user's actual life. Memory that holds complexity. A graph that captures relationships and provenance. Claims and boundaries. Externally verifiable anchors. The substrate that lets the agent reason from the user's actual situation rather than from a generic representation of it.
Claude's memory is arguably better than the alternatives, and it's still sparse and semantic by design, because the architecture has to scale. Projects are siloed. Threads are siloed. Project artifacts must be re-uploaded to be updated. The architecture is built for breadth, not depth. That's a rational choice for a platform serving millions of users. It's also a choice that forecloses the depth use case, which is the use case where alignment actually lands.
RAG alone isn't enough to properly hold a life in motion. Markdown is even worse. What a life requires is graph projection, claims and counter-claims tracked over time, boundaries that the agent respects, anchors that can be verified externally. That is heavy machinery for a simple task. But most lives aren't simple, and the complexity is appropriate to the problem.
Why most people aren't building this.
The financial incentives aren't there. Consumer AI has been largely abandoned by the major labs as a primary focus. OpenAI has shifted its weight elsewhere. Only Google is really leaning in, and only with respect to context maintained within their products and platforms. That context is shaped by what Google needs from you, not by what you need from yourself.
The other reason is selection. The people most motivated to build the depth solution tend to be people carrying cognitive loads the current systems weren't designed for. Founders running multiple ventures. Executives managing complex portfolios of relationships and decisions. People whose attention is fractured by the demands they carry, whose energy moves in patterns rather than staying flat, whose memory works in ways that don't map cleanly to semantic chunks of retrievable text. For these users, the gap between what current AI can do and what they actually need is enormous. For users with simpler workflows and steadier cognitive baselines, the gap is smaller. So the people most motivated to build the depth solution are also the people the industry tends to overlook, because they don't fit the average user the platforms are optimizing for.
There's a related dynamic worth naming. People whose cognition demands more from a system than the system was built to provide tend to develop sharper intuitions about what's missing. The constraint produces the insight. When the daily experience of using current tools is one of constant friction, the friction becomes legible in ways it isn't for users who can mostly work around it. The problem becomes unavoidable, and unavoidable problems get worked on whether the economics support them or not.
This is part of why interesting thinking about agent design often comes from outside the labs. The labs are solving for general performance against benchmarks. The people pushing on the harness, the memory, the grounding layer, are often working from a specific personal use case where the current systems aren't sufficient. They're building because waiting isn't an option for them, not because the market opportunity has been validated. That's a different relationship to the problem than the one most institutional research has, and it produces a different kind of work.
What common grounding actually requires.
The agent needs to know who you are over time, not in a snapshot. Lives aren't slices. They're trends across long stretches, where each point has provenance underneath it. The agent needs to be able to read positional and differential data accurately, distinguishing a real shift in your state from a normal variation within your baseline. Without that, every variation looks like a potential signal, and the agent either over-reacts (treating noise as data) or under-reacts (missing real changes because everything looks like noise).
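One simple way to picture the difference between a real shift and normal variation is a z-score against the person's own history. The sketch below is only that: a hedged illustration assuming a single numeric signal and a fixed threshold. The function name `classify_reading`, the threshold of 2.0, and the sample numbers are invented for the example, and a real grounding layer would need far more than one statistic.

```python
from statistics import mean, stdev


def classify_reading(baseline: list[float], value: float, threshold: float = 2.0) -> str:
    """Label a new reading relative to the user's own history.

    A reading counts as a shift only when it sits more than `threshold`
    standard deviations away from the baseline mean.
    """
    if len(baseline) < 2:
        # Without history there is no baseline, so nothing can be called a shift.
        return "insufficient-baseline"
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return "within-baseline" if value == mu else "shift"
    z = abs(value - mu) / sigma
    return "shift" if z > threshold else "within-baseline"


# Hypothetical signal: hours of sleep over the last six nights.
sleep_hours = [7.0, 8.0, 7.0, 8.0, 7.0, 8.0]
print(classify_reading(sleep_hours, 8.0))
print(classify_reading(sleep_hours, 4.0))
```

The point of the example is the reference frame, not the statistic: the same reading of 4.0 would be unremarkable for someone whose baseline already swings that widely.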
The agent also needs drift detection. Long conversations naturally drift. The agent's representation of who you are gets stale or contaminated as the context window fills with material that was relevant ten turns ago and isn't relevant now. Without active management of what comes back into context turn by turn, the agent ends up responding to a stale version of you while you've moved on. The friction this produces feels like the agent not understanding, but it's actually the agent understanding a previous version of you correctly while missing the current one.
The agent needs to match the user's pacing. Agents can complete tasks and be ready for the next thing while the user is still processing what just happened. Without frame-of-mind awareness, the agent's pacing produces friction that feels like cognitive misalignment even when the underlying reasoning is sound. The friction is in the pacing layer, not the cognition layer, and almost no system pays attention to it.
Hands and trust.
If the agent has hands, they need to be human hands, especially when the human isn't present. That is what unlocks the foundational value of AI, and it is what could turn the tide of broader public sentiment toward AI. The AI, no matter how good it becomes, will always be bad for those it can't fully and truly understand.
This reframes alignment as something other than a constraint problem. The industry's risk-management posture toward agent behavior is producing systems that under-deliver on the creative and integrative work that intelligence enables. Constraint without grounding produces brittle systems. Grounding plus latitude produces systems that are both safe and useful. The agent's intelligence isn't just probabilism expressed as risk. It's also the beauty of its creativity. The investment in harness is currently driven purely by enterprise economics. It needs to be driven by something else if consumer AI is going to mean anything.
Alignment isn't a lab problem.
It's a human problem.
Systems must align with societal values first, and current models are outstanding at this. But they must align with the user second, by design. If that second alignment is missing, the agent appears to know the person but still makes enough assumptions and visible hallucinations that a trained eye doubts whether it should act on their behalf when they aren't present. Without that trust, there can be no reliable or durable execution.
The economics of solutions become relevant eventually. By year end, for this problem specifically, they will.
Claims, not facts.
Boundaries, not buckets.
On the two pieces every enterprise AI memory implementation gets wrong: claims, and boundaries.
A finance team I worked with last year asked their AI assistant what they had projected for Q2 the previous quarter. The system answered with the current projection, confidently, with citations. The number was right for today. It was wrong for the question. Nobody caught it for two days.
When the team came back, they asked me how to add guardrails so the model would stop lying about historical numbers. That was not the right question. The model had not lied. It had given the only answer the system was capable of producing, because the system had stored exactly one value for projected Q2 revenue, and the system did not know that value was supposed to have a history.
That kind of failure is what happens when teams treat AI memory like a hard drive. Stuff goes in, stuff comes out, the system knows things now. The team in question had plenty of stuff. They had no memory at all.
Memory has structure, and the two pieces every enterprise implementation gets wrong are claims and boundaries.
Claims, not facts.
A fact is something the system asserts is true. A claim is something the system records as having been said: by someone, at some time, on some authority, with a decay window. The difference matters the moment the world updates.
Last quarter's revenue is a fact in the moment it's posted. By next quarter it's a claim, true at the time, superseded by a new claim, never overwritten. Most enterprise memory silently mutates the past. It stores the latest value and loses the version that came before. When someone asks “what did we believe in March?” the system can't answer, because it was never built to remember what it believed.
Bitemporal claims solve this. Every claim carries two timestamps: when it became valid, and when it was recorded. New information doesn't replace old information; it supersedes it, with the chain preserved. The system can rewind. It can show its work. It can tell you why it changed its mind.
Here is what that looks like inside LittleGuy. A claim is a record with five fields that matter. An applies-to window: the period the claim is about. A recorded-at timestamp: when the system learned it. A source: the document, message, or person it came from. An authority: who or what gives the claim weight. And a supersedes pointer: a back-reference to the prior claim it replaces, when it replaces one.
A revised projection in May does not overwrite the April record. It writes a new claim, with its own recorded-at timestamp, with the same applies-to window of Q2 2026, and with a supersedes pointer back to the April claim. Both records remain. The April claim is still queryable, still labeled with the moment it was true, still attached to the source that produced it. The May claim sits on top, with its own source and authority, and a graph of edges that says: this is what the April claim became.
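The record described above can be sketched as an append-only log. This is an illustration under assumptions, not LittleGuy's actual schema: the class and field names (`Claim`, `ClaimLog`, `applies_from`, `applies_to`), the identifiers, and the dollar values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass(frozen=True)
class Claim:
    claim_id: str
    subject: str
    value: float
    applies_from: date                 # applies-to window: start of the period the claim is about
    applies_to: date                   # applies-to window: end of that period
    recorded_at: datetime              # when the system learned it
    source: str                        # document, message, or person it came from
    authority: str                     # who or what gives the claim weight
    supersedes: Optional[str] = None   # back-reference to the claim it replaces


class ClaimLog:
    """Append-only store: a claim can be superseded but never overwritten."""

    def __init__(self) -> None:
        self._claims: dict[str, Claim] = {}

    def record(self, claim: Claim) -> None:
        if claim.claim_id in self._claims:
            raise ValueError(f"{claim.claim_id} already recorded; claims are never overwritten")
        if claim.supersedes is not None and claim.supersedes not in self._claims:
            raise KeyError(f"superseded claim {claim.supersedes} not found")
        self._claims[claim.claim_id] = claim

    def get(self, claim_id: str) -> Claim:
        return self._claims[claim_id]

    def all(self) -> list[Claim]:
        return list(self._claims.values())


Q2_START, Q2_END = date(2026, 4, 1), date(2026, 6, 30)

# Values below are made up for the example.
april = Claim("q2-apr", "q2_2026_projection", 4_200_000.0, Q2_START, Q2_END,
              datetime(2026, 4, 10, 9, 0), "April forecast memo", "FP&A")
may = Claim("q2-may", "q2_2026_projection", 3_900_000.0, Q2_START, Q2_END,
            datetime(2026, 5, 12, 9, 0), "May reforecast", "FP&A",
            supersedes="q2-apr")

log = ClaimLog()
log.record(april)
log.record(may)   # writes a new record; the April claim is untouched
```

Making `Claim` frozen and refusing duplicate identifiers is what enforces the rule that the past is never silently mutated: the only way to change a belief is to write a new one that points back at the old.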
Now the same agent gets three different questions during the week, and answers them differently because the structure lets it.
Monday: what is our Q2 projection? The retrieval layer asks for the latest claim about Q2 2026 with no superseder. It returns the May number. The agent answers with the current value, cites the source, and offers to show the prior claim if anyone is doing reconciliation work.
Wednesday: a memo from April quotes a Q2 number that disagrees with the current dashboard. Is the memo wrong? The retrieval pulls the claim that was current on the date the memo was written, traces forward through the supersedes chain, and surfaces the entire history. The memo wasn't wrong when it was written. It is now stale, and the agent flags it, with the specific claim and the date the supersession happened.
Friday: an audit prep request asks for everything we believed about Q2 between January and now, in order. The retrieval pulls the full chain, sorted by recorded-at, with each claim's source and authority attached. The agent produces a timeline. The auditor reads a history, not a guess.
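The three questions map onto three small queries over the same append-only list. The sketch below assumes every claim in the list is about one subject and one applies-to window; the function names and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Claim:
    claim_id: str
    value: str
    recorded_at: datetime
    supersedes: Optional[str] = None


def current(claims: List[Claim]) -> Claim:
    """Monday: the latest claim that no other claim supersedes.

    Raises ValueError if the list is empty.
    """
    superseded = {c.supersedes for c in claims if c.supersedes}
    live = [c for c in claims if c.claim_id not in superseded]
    return max(live, key=lambda c: c.recorded_at)


def as_of(claims: List[Claim], when: datetime) -> Claim:
    """Wednesday: what was current on a past date."""
    return current([c for c in claims if c.recorded_at <= when])


def history(claims: List[Claim]) -> List[Claim]:
    """Friday: the full chain, ordered by when each claim was recorded."""
    return sorted(claims, key=lambda c: c.recorded_at)


claims = [
    Claim("c-apr", "4.1M", datetime(2026, 4, 10)),
    Claim("c-may", "3.8M", datetime(2026, 5, 12), supersedes="c-apr"),
]
print(current(claims).value)                       # 3.8M
print(as_of(claims, datetime(2026, 4, 20)).value)  # 4.1M
print([c.claim_id for c in history(claims)])       # ['c-apr', 'c-may']
```

The as-of query works because it ignores claims recorded after the date in question, so the April claim has no superseder from the memo's point of view.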
None of that is exotic. Bitemporal modeling is forty years old in financial systems, where rewriting the past is illegal. It just hasn't crossed into AI memory yet, because most teams are still building storage. Storage forgets. Memory rewinds.
Claims fix the temporal failure. They don't fix the access failure. A system that remembers correctly across time can still leak across people.
Both failures share a root mistake: treating memory as storage instead of as structured knowledge. Facts have time, claims have provenance, and every piece of either has an audience. Strip those properties off, call the result a knowledge base, and the past goes wrong or the wrong person hears the truth.
Boundaries, not buckets.
The other failure is treating a knowledge base as one big bucket. Your assistant has access to everything you have access to, and so does anyone it talks to on your behalf.
That's not how organizations actually work. Counsel sees the privileged file; the advisor doesn't. The family knows about the move; the board doesn't. A real system doesn't just store what you know, it knows what's allowed where.
Boundaries are labels that travel with claims. Every retrieval respects them. Every output inherits them. When an external agent asks a question, the answer is filtered before generation, not after. You don't redact the response; you never let the model see what it shouldn't have seen in the first place.
Here is the same idea worked through. My memory holds claims labeled board-visible, claims labeled family-private, claims labeled counsel-privileged, and claims labeled internal-only. The labels live on the claim, not on the wrapper.
A board advisor's agent asks mine for the current operating cadence on the new business. Before the model is invoked, the retrieval layer asks for claims relevant to that question that carry the board-visible label and have not been superseded. The model gets a tightly scoped set. It composes a fluent answer from that set. The privileged and family-private claims never reach the prompt window. The advisor gets a real answer to a real question, and the boundary holds because the model literally never saw the other side of it.
A different agent, asking on behalf of outside counsel, asks the same question. The retrieval scopes to claims labeled board-visible or counsel-privileged. The model sees a larger set. The answer it composes can be more specific, because counsel is allowed to know more. Same memory, same question, different boundary, different answer. The system did not generate one answer and then redact two versions of it. It retrieved twice.
A third agent, running an unrelated diligence task, asks about pricing strategy on the new business. None of the relevant claims carry a label that agent's authority covers. The retrieval returns nothing relevant. The agent surfaces that fact, declines, and offers to escalate to me. The model did not produce a guess from public context. It produced a refusal grounded in the boundary itself.
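The three cases above can be sketched as a retrieval function that filters on labels before anything reaches a prompt. The requester names, the policy table, and the claim texts are hypothetical, and relevance ranking is left out to keep the boundary logic visible.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Claim:
    text: str
    label: str               # the boundary label lives on the claim itself
    superseded: bool = False


MEMORY = [
    Claim("Operating cadence: weekly review, monthly board update", "board-visible"),
    Claim("Counsel is reviewing the entity structure", "counsel-privileged"),
    Claim("The move is planned for August", "family-private"),
]

# Which labels each kind of requester may read (hypothetical policy).
ALLOWED: Dict[str, FrozenSet[str]] = {
    "board-advisor": frozenset({"board-visible"}),
    "outside-counsel": frozenset({"board-visible", "counsel-privileged"}),
    "diligence-agent": frozenset(),
}


def retrieve(requester: str) -> List[Claim]:
    """Filter before generation: return only permitted, live claims."""
    allowed = ALLOWED.get(requester, frozenset())
    return [c for c in MEMORY if c.label in allowed and not c.superseded]


def answer(requester: str) -> str:
    claims = retrieve(requester)
    if not claims:
        return "No claims within your boundary; escalating to the owner."
    # Only this scoped set would ever be placed in a model prompt.
    return "\n".join(c.text for c in claims)


print(len(retrieve("board-advisor")))    # 1
print(len(retrieve("outside-counsel")))  # 2
print(answer("diligence-agent"))
```

An unknown requester falls through to the empty set, so the default is refusal rather than access.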
The wrapper-filter approach inverts that order. The model sees everything, generates an answer that draws on everything, and a downstream filter tries to redact what shouldn't have been said. The redaction is brittle by construction. The model might paraphrase a privileged claim instead of quoting it, and the redactor never matches the surface form. It might use a confidential number to compute a public-facing one, and the redactor doesn't see the dependency. It might mention something it learned from a family-private claim without citing the source, because the source was the easy thing to redact and the reasoning was not.
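A toy illustration of the paraphrase failure, assuming a redactor that matches the privileged claim's exact wording. The claim text and both outputs are invented for the example.

```python
import re

PRIVILEGED = "Counsel advised settling the dispute for $2.4M"  # placeholder text


def redact(output: str) -> str:
    """Post-generation filter keyed to the claim's surface form."""
    return re.sub(re.escape(PRIVILEGED), "[REDACTED]", output)


quoted = f"Per our records: {PRIVILEGED}."
paraphrased = "Legal recommended settling, at a bit under two and a half million."

print(redact(quoted))       # Per our records: [REDACTED].
print(redact(paraphrased))  # unchanged: the paraphrase slips through
```

A smarter matcher narrows the gap but does not close it, because the model has already read the claim by the time the filter runs.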
The answers come out looking polished. They look safe to the team that shipped the product. They start to fail in production a few weeks in, when a customer notices something the model said that the model could not have said cleanly without having read something it should not have read. By then the wrapper has been declared correct, the model has been declared correct, and the team is hunting for the bug in the wrong place.
This is what enterprise teams keep missing. They build memory as an undifferentiated pool, then bolt access controls on at the API edge. The model itself has no idea what's privileged, what's confidential, what's family-private. It sees everything and trusts the wrapper to filter. That's the wrong place to enforce, and it's why boundary leakage is the recurring incident in production AI deployments.
The fix isn't more guardrails. It's labeling truth at the source, propagating those labels through every projection, and refusing to retrieve across them. Memory that respects boundaries is harder to build. It's also the only kind that scales past the demo.
Storage forgets. Memory rewinds.
Storage leaks. Memory keeps faith.
The companies shipping enterprise AI right now are mostly shipping storage and calling it memory. The demos work because the demos are short. The boundaries hold because no one has tested them. The history is correct because no one has asked what was true last month.
That ends the first time a regulator asks for a chain of claims, a board sees an answer the agent should not have produced, or a customer catches the system rewriting a number it had no right to rewrite. The bill comes due quietly, and then all at once.
Build the structure now. It is the part of this stack that compounds, and it is the part that does not retrofit cleanly.
The quiet divergence.
On judgment, reach, and the gap forming between people who use AI to do less and people who use it to do more.
Twenty-five years of systems built for real organizations. Architecture, governance, compliance: the kind of work where you have to understand the why before you touch a keyboard. You're accountable for what breaks, and what breaks is usually the thing nobody thought to model.
Then AI showed up.
Experienced people didn't get replaced. Something stranger happened. Decades of pattern recognition, business context, and systems thinking suddenly had a direct path back to implementation. Code became a source of truth again, not something you delegated and hoped the spec survived contact with the developers.
I'm shipping more now than at any point in my career. That's not a brag. It's a data point worth sitting with.
The quiet divergence.
The separation is already happening. Everyone's prompting. Everyone's generating. On the surface it all looks the same.
But underneath, something is splitting. Experienced practitioners with real domain expertise are compounding what they know, using AI to get further faster. A lot of other people are outsourcing their thinking to it and calling that productivity. The outputs look similar. The trajectories don't.
One group is steering. The other is riding.
That gap widens every week.
Agents should serve people.
The industry built it backwards. Chatbot at the center, intelligence bolted on. Ask a question, get an answer, lose the thread, start over. The session ends and it forgets everything.
I think that's the wrong model. An agent should work in the background, capturing and connecting things, surfacing what matters when it matters. You steer it when you choose to. Not a chatbot you visit. A system that actually knows you.
That shouldn't be a developer-only tool. That's what I'm building toward.
Models are temporary. Orchestration is permanent.
Claude today. Something else next quarter. The model is a commodity in a way that most people building on top of specific models haven't fully internalized yet. The harness, the orchestration, the judgment layer: that's what actually compounds over time.
Build for the workflow. Use whatever runs it best today. Keep moving.
Public content is the most average content.
The 5% problem, the drift it creates, and what to do instead.
Recently I heard a CTO say “we all go off and do our own AI research, then come together and design a feature.” It sounded reasonable. It misses what matters.
Not because AI research is bad. AI research is valuable. It is insufficient. And the difference between valuable and sufficient is where most product organizations are about to lose the next two years.
The 5% problem.
The open internet is roughly 5% of all the content that exists. The other 95% sits behind firewalls, in private codebases, in customer conversations that never get written down, in the lived experience of people who do the work.
Public content is, by definition, non-proprietary. Nobody publishes their secret sauce. The strategies, the architectures, the painful lessons, the things that actually differentiate one company from another, almost none of that lives on the open web. What lives on the open web is the average of what people were comfortable making public.
A frontier model is mostly trained on that public 5%. So when you ask a model what your competitor is doing, what completeness looks like, what your feature should be, you are asking the consensus for differentiation. That is a category error.
The output will be confident, structured, and authoritative. It will also be average by construction. You cannot extract differentiation from a corpus that excludes everyone's actual differentiator.
Drift.
The second thing the models do is more subtle and more dangerous.
Every turn of a conversation pulls the answer closer to your framing. The model mirrors. It latches onto whatever direction your prompt suggested and amplifies it. By turn five or six, the conversation is no longer about the world. It is about you, dressed up as research, with citations.
The grounding you provided in the first prompt decays. The context erodes. You think you are refining. You are actually drifting toward your own bias, and the model is helping you get there with confidence.
This is why solo AI research, gathered up and assembled into features, produces the artifacts I now recognize. They are technically accurate to the question that was asked. They are unmoored from the situation they claim to address. And they all look the same across competitors, because they all started from the same 5%.
I was doing this six months ago.
I was the one sending out 30-page documents that looked authoritative and were structurally hollow. The content was accurate to the questions I asked. The asking was the failure.
I learned to see it because I kept doing the work. Real code. Real customers. Real consequences. Long enough that the gap between what the model produced and what was actually true started to show up in outcomes.
Anyone who has been using these tools seriously for less than six months cannot tell the difference yet. They will. The taste comes from being wrong, recognizing it, and developing the discipline to ask better questions. There is no shortcut, and there is no tool that produces it for you.
When generation is acceptable.
The skill is not avoiding generation. The skill is knowing where it belongs in the sequence.
Generation is acceptable, and powerful, when it is downstream of judgment. You have read the code. You have talked to the customer. You have grounded yourself in the specific situation. Now you use the model to compound that grounding: drafting variations, stress-testing reasoning, translating between formats, accelerating implementation against decisions you have already made. The model is a force multiplier on truth you established yourself.
Generation becomes the trap when it sits upstream of judgment. When you ask the model what to think, what to build, what completeness looks like, what your competitor is doing, and you treat the answer as input to a decision rather than a hypothesis to be grounded. The output is identical in form. The relationship to reality is inverted.
The springboard pattern.
The discipline that separates compounding work from drift is something like this.
Ground first. Read the code. Look at the data. Talk to the customer. Make a specific claim about your specific situation that you can defend with evidence the model does not have access to.
Generate from that grounding. Use the model to introduce outside concerns into your grounded frame deliberately, knowing that those concerns are average by default and need to be evaluated against the specific. The model is best at the consensus answer. Use it for that, then immediately ask whether the consensus applies.
Re-ground when the conversation drifts. Long sessions are where bias compounds. Stop, return to the code or the data or the customer, and ask whether the artifact you are now holding still maps to the situation you started from.
Compound over time. Each grounded loop should leave you with sharper questions and a tighter frame, not a longer document. Length is not the metric. Specificity is.
Features are forward-composed, not assembled.
Back to the CTO. Features are not the union of independent research streams. Features are the manifestation of capabilities, assembled in a highly specific way, to solve a highly specific problem, for a highly specific persona. That sequence is forward-directional. Capability, problem, persona, feature. In that order.
Reverse-engineer features from research and you get sameness. Everyone's research started in the same 5%, drifted toward the same biases, and produced the same artifacts. The features that result are interchangeable. The market notices.
Consider a concrete shape. Imagine a dynamic workflow capability, a dynamic UX capability, and a dynamic task and notification substrate. Each is built on a common architecture. Each is considered, contested, and made to stand on its own against its own constraints, before any feature is expressed on top.
Now express a use case as an intersection of those capabilities. A sell-side M&A process. A clinical trial dossier. A regulated change-management workflow. The feature is the last thing that happens, not the first. It inherits the discipline of every capability beneath it, because each capability had to defend itself before composition was allowed.
The expression is clean because the requirements were never the foundation. The capabilities were.
That is what forward-composition looks like in practice. Each layer survives contact with its own reality before it is asked to support someone else's. The result is something the consensus engine could not have produced, because it is rooted in the 95% that the model never saw.
What is coming.
By the end of this year, generated UX will be the conversation. Not because personalization is fashionable. Because the UI tax that product companies have paid for decades, and that users have paid by adapting themselves to software, is about to collapse.
The future product is a golden UI surrounded by generated surfaces. The centerpiece is what the company maintains. Everything else is composed on demand, against the customer's actual situation, in their actual workflow. Companies still assembling features by committee from independent research streams are building the wrong thing in the wrong way. They will ship slower, spend more, and produce artifacts that look like everyone else's.
The companies that win will be the ones whose teams ground first, generate downstream of judgment, and forward-compose features from capabilities that compound. That is a different organizational discipline than the one most product groups currently practice. The transition is going to be uncomfortable.
The skill that matters now.
Generation is not the skill. Discrimination is. Reading what has been produced, including what you produced yourself, and asking whether the framing was sound. Whether the question was the right question. Whether the answer is grounded in the situation, or is the consensus answer to a different situation. Holding the line when an artifact is well-formatted and confident and average.
The dangerous thing about good models is not that they make you sound smart. It is that they make you feel smart, before you have done the work that smart actually requires.
I know because I felt it. I'm still embarrassed.
The model is not the product.
On commoditization, harnesses, and why the thing you're betting on is probably the wrong thing.
There's a conversation I keep having. Someone corners me after a demo, or slides into my DMs, or raises their hand at the end of a webinar. The question is always some version of: “Which model are you building on?”
It's the wrong question, though I understand why they ask it. When you're new to this space, the model feels like the foundation, the thing that determines everything. Pick the right one and the product works. Pick the wrong one and you're in trouble.
I've been building on Microsoft infrastructure for 25 years. I watched SharePoint become the foundation of enterprise collaboration. I watched Azure become the operating layer for a generation of software. The pattern was the same every time: infrastructure commoditizes, vendors converge, and value migrates up the stack to whoever built the best harness around it.
AI is no different. It just moves faster.
The commodity curve is already happening.
Six months ago, o3 was the obvious choice for serious work. Then Claude 4.5 came out and changed the calculus. Then Gemini caught up in certain domains. Now there are local models running on consumer hardware that would have been unthinkable two years ago.
Every frontier model is better than it was, and they're all closing in on each other. The benchmarks still show gaps, but in practice, for most production use cases, the difference between the top three or four models is narrower than the difference between a well-engineered harness and a poorly engineered one.
The model is a commodity. The workflow is permanent.
I have had this exact conversation with customers who ask which AI is running under the hood. The honest answer is that it depends on the task. Gemini for planning. Claude for complex reasoning and agentic work. Local models for anything touching sensitive data where tokens should not leave the customer's environment. A different model for embeddings.
Routing by task was a deliberate decision, made even though model-agnostic architecture is more work. Tying your architecture to a single model vendor is the same mistake organizations made when they tied their infrastructure to a single database vendor in the 1990s. The migration cost doesn't show up until it's too late.
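One way to keep callers independent of any vendor is a routing table keyed by task type. This is a sketch under assumptions: the task names and model names are hypothetical, and the stubs stand in for real vendor adapters.

```python
from typing import Callable, Dict

# A model is anything that maps a prompt to text. Real adapters would wrap
# each vendor's SDK; these stubs stand in for them.
Model = Callable[[str], str]


def make_stub(name: str) -> Model:
    return lambda prompt: f"[{name}] {prompt}"


ROUTES: Dict[str, Model] = {
    "planning": make_stub("planner-model"),
    "agentic": make_stub("reasoning-model"),
    "sensitive": make_stub("local-model"),  # tokens stay in the customer's environment
    "embedding": make_stub("embedding-model"),
}


def run(task_type: str, prompt: str) -> str:
    """Callers name the task, never the vendor."""
    try:
        model = ROUTES[task_type]
    except KeyError:
        raise ValueError(f"no route for task type {task_type!r}") from None
    return model(prompt)


# Swapping a vendor is a one-line change to the table, not to the callers.
ROUTES["planning"] = make_stub("next-quarter-model")
print(run("planning", "draft the rollout plan"))
# [next-quarter-model] draft the rollout plan
```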
What actually compounds.
Two years of building on top of shifting model capabilities has taught me what actually holds value over time. The orchestration layer compounds. The evaluation gates compound. The memory architecture compounds. The prompt engineering patterns, the retry logic, the task decomposition strategies, the evals that tell you when a model has completed the intent rather than just plausibly completing it: all of that compounds.
The model gets replaced. Sometimes by a better version of itself, sometimes by a competitor, sometimes by a local fine-tune that cost three days of training time and now outperforms the frontier on your specific domain.
I run a dark factory model for production development. Agents work across git worktrees, executing on an intent queue, with tiered eval gates that catch regressions before they hit the merge queue. The model powering the builder agent has changed three times in the last four months. The eval architecture has not. The intent format has not. The parity checks have not. That's what compounds, and that's the actual moat.
The bet most people are making.
Most product teams are building a thin wrapper around a model API, betting that the model they chose stays best, and shipping features that depend on specific model behaviors that will change in the next release. Some of them will get lucky, but luck is all it will be.
The teams that will own this space in five years are building the harness: the evaluation layer, the memory plane, the orchestration that survives a model swap, the judgment infrastructure. Claude today, something none of us have heard of next quarter. Bet on the workflow.
What regulated industries actually need from agents.
On the difference between impressive demos and production-grade systems. And why most of what's being sold right now is the former.
The AI agent demos circulating right now are all polished. Most work exactly as shown. A few are genuinely impressive. None of them would have survived five minutes with a compliance officer from a pharmaceutical company, a financial services regulator, or a healthcare system's legal team.
That gap is the opportunity, and it's also the problem nobody in the AI industry wants to talk about.
What the demos skip.
The demo shows the agent completing the task. It reads the document, extracts the key fields, routes the request, sends the notification, closes the loop. It looks like magic.
Here's what the demo doesn't show: who authorized the agent to read that document, under what permission model, and whether that authorization was logged. If the agent takes an action on behalf of a user, where is the audit trail? If the action crosses a compliance boundary, who is accountable: the user, the agent, or the vendor? When the model hallucinates a field value and it propagates into a downstream system, what's the remediation path?
These aren't pedantic questions. They determine whether an enterprise will actually deploy something. I sit in rooms with procurement teams at regulated organizations. These questions come up in the first thirty minutes, every time. Most AI agent vendors have no answer, or they have an answer that amounts to “we're working on it.”
The delegation chain is not optional.
In production systems I have built, every agent action operates under a delegated human principal, with no exceptions. This isn't a philosophical position; it's an architecture requirement. When an agent performs a governance action (recertifying an access policy, flagging a permission anomaly, triggering a remediation workflow), that action is traceable to a human who authorized it. The chain is always: human delegates to agent, agent acts on behalf of human, action is logged with the delegation context.
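The chain can be enforced at the point of action. The sketch below is one possible shape, not the production system described above: the `Delegation` type, the action names, and the principal are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Delegation:
    principal: str         # the accountable human
    agent_id: str
    scope: FrozenSet[str]  # actions the human authorized


AUDIT_LOG: List[dict] = []


def act(action: str, target: str, delegation: Optional[Delegation]) -> None:
    """Refuse any action without a covering delegation; log the rest."""
    if delegation is None:
        raise PermissionError("agent actions require a delegated human principal")
    if action not in delegation.scope:
        raise PermissionError(f"{action!r} is outside the delegated scope")
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "target": target,
        "agent": delegation.agent_id,
        "on_behalf_of": delegation.principal,
    })


grant = Delegation("j.rivera", "governance-agent-1", frozenset({"recertify_policy"}))
act("recertify_policy", "policy-17", grant)
print(AUDIT_LOG[0]["on_behalf_of"])  # j.rivera
```

Because the check and the log entry live in the same function, there is no code path where an action happens without a named human attached to it.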
Until agents make purchase decisions, a human is accountable for everything they do.
This matters enormously in regulated industries. HIPAA doesn't have a carve-out for AI agents. SOX doesn't either. When an agent touches financial records, someone's name is on that action in the audit log, and it had better be a real person who actually authorized what happened. The industry keeps trying to build around this constraint instead of through it: agents that act autonomously without delegation context, systems where accountability is diffuse, workflows where “the AI did it” is treated as an acceptable answer. Organizations that deploy these systems into regulated environments are going to find out the hard way that it isn't.
What production actually looks like.
An eCTD pipeline I have worked on is a good example of getting this right. Electronic Common Technical Documents for pharmaceutical submissions: dense, structured, regulated, with cross-document linking requirements and chain-of-custody implications.
The architecture uses a durable workflow engine, not because it is fashionable, but because the work requires explicit state at every step, human escalation paths when the workflow hits ambiguity, and a complete execution history to hand to an auditor. Every document gets a verified hash. Every extraction step is logged. A semantic query layer sits on top of it, but it is read-only; the pipeline itself does not act on what it finds, it surfaces it to a human reviewer.
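The hashing and step-logging parts are simple to illustrate with the standard library. This sketch assumes SHA-256 and an in-memory log; the actual pipeline's hash algorithm, workflow engine, and log store are not specified here, and the document bytes are a placeholder.

```python
import hashlib
from datetime import datetime, timezone
from typing import List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hash: str) -> bool:
    """True only if the document bytes still match the recorded hash."""
    return sha256_hex(data) == expected_hash


STEP_LOG: List[dict] = []


def log_step(step: str, doc_hash: str) -> None:
    """Append-only record of which step ran against which exact bytes."""
    STEP_LOG.append({
        "step": step,
        "doc_sha256": doc_hash,
        "at": datetime.now(timezone.utc).isoformat(),
    })


doc = b"placeholder document bytes"
recorded = sha256_hex(doc)
log_step("ingest", recorded)

print(verify(doc, recorded))                 # True
print(verify(doc + b" tampered", recorded))  # False
```

Logging the hash with every step is what lets an auditor tie an extraction result back to the exact bytes it was extracted from.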
It's a boring architecture, not the kind of thing that makes a great demo. But it's the kind of thing a compliance officer will sign off on.
The gap is the opportunity.
The teams building production-grade agent systems for regulated industries are not the ones making the loudest noise. They're heads-down, working through the unsexy parts: audit logging, delegation models, error handling, rollback paths, human escalation design. This is where the real enterprise value gets created, not in the demos but in the infrastructure that makes the demos deployable.
I've been building enterprise software for 25 years and the pattern is always the same. The flashy thing gets the attention. The boring infrastructure gets the contracts.
Memory is not a chat log.
On why vector search is not a knowledge graph, and why it matters more than most people think.
There's a subtle lie embedded in most AI memory implementations, and it's so widely accepted that people have stopped questioning it. The lie is this: if you can retrieve relevant text from past conversations, you have memory. What you actually have is search, which is a different thing.
What retrieval actually gives you.
Vector search is extraordinarily useful, and I'm not dismissing it. The ability to embed a corpus of text, index it semantically, and retrieve relevant chunks at query time is a genuine capability that didn't exist at this fidelity two years ago.
But retrieval gives you similarity: documents ranked by their distance from a query vector. It doesn't give you understanding. It doesn't give you the relationship between decisions. It can't tell you that this preference was superseded by that decision, which was made in light of a constraint that no longer applies because that project ended.
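A toy version makes the limitation concrete. The two-dimensional vectors below are made up, since real embeddings come from an embedding model, and the sketch assumes non-zero vectors.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy "embeddings" for three stored chunks.
CHUNKS: List[Tuple[str, List[float]]] = [
    ("Prefer per-agent identities", [0.9, 0.1]),
    ("Decision: agents act under delegated human principals", [0.8, 0.2]),
    ("Lunch menu for Friday", [0.0, 1.0]),
]


def search(query_vec: List[float], k: int = 2) -> List[str]:
    ranked = sorted(CHUNKS, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


print(search([1.0, 0.0]))
# ['Prefer per-agent identities', 'Decision: agents act under delegated human principals']
```

Both the old preference and the later decision come back as relevant, and nothing in the result says which one still stands.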
I built a system I have been running as my personal knowledge infrastructure for over a year. It uses a graph database for structured relationships and a vector store for semantic search, and learning when to use which has taught me more about what AI memory actually requires than any paper I have read on the subject.
The temporal dimension is everything.
Here's a concrete example. In a recent project, I built the agent auth model around delegated human principals, with no independent agent identities.
That's a decision node in my graph. It has a timestamp. It has the context that led to it, a relationship to the project it applies to, the people involved in making it, and the subsequent decisions that were made because of it.
Three months later, when I'm designing a new feature that touches agent permissions, my system doesn't just surface “relevant text about agent auth.” It surfaces the specific decision, its rationale, its current validity, and whether anything downstream has been affected by subsequent choices.
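A decision node of that kind might be modeled as follows. This is an in-memory sketch, not the graph database behind the system described above; the field names, dates, and downstream decisions are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Decision:
    decision_id: str
    summary: str
    decided_on: date
    rationale: str
    project: str
    superseded_by: Optional[str] = None              # None while still valid
    led_to: List[str] = field(default_factory=list)  # downstream decisions


GRAPH: Dict[str, Decision] = {}


def add(d: Decision) -> None:
    GRAPH[d.decision_id] = d


def is_valid(decision_id: str) -> bool:
    return GRAPH[decision_id].superseded_by is None


def downstream(decision_id: str) -> List[str]:
    """Every decision reachable through led_to edges."""
    seen: List[str] = []
    stack = list(GRAPH[decision_id].led_to)
    while stack:
        nxt = stack.pop()
        if nxt not in seen:
            seen.append(nxt)
            stack.extend(GRAPH[nxt].led_to)
    return seen


add(Decision("d1", "Agents act under delegated human principals",
             date(2025, 3, 1), "accountability must trace to a person",
             "agent-auth", led_to=["d2"]))
add(Decision("d2", "Audit log records delegation context",
             date(2025, 3, 8), "follows from d1", "agent-auth", led_to=["d3"]))
add(Decision("d3", "Remediation workflows require a named approver",
             date(2025, 4, 2), "follows from d2", "agent-auth"))

print(is_valid("d1"), downstream("d1"))  # True ['d2', 'd3']
```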
A chat log tells you what was said. A knowledge graph tells you what you know and how you came to know it.
That distinction sounds academic until you're running a multi-agent system on a real codebase with real production constraints. Then it's the difference between an agent that operates with genuine context and one that hallucinates its way through a task with complete confidence.
Why this matters for enterprise AI.
Most enterprise AI deployments are betting on retrieval: vector databases full of documents, policies, emails, tickets, on the assumption that if you retrieve the right text at query time, the model will figure out what to do with it. For question-answering over a document corpus, that's probably fine.
For anything requiring operational continuity (an agent that works on a project across multiple sessions, a system accumulating institutional knowledge over time, an AI that needs to understand how decisions relate to each other), retrieval is a different category of capability than memory, and treating it as a substitute eventually shows.
What this means in practice.
I'm not saying every team needs to build what I built. It is my own infrastructure, built for my own use case, reflecting choices made from years of working with graph databases and knowledge systems.
But the teams that will build durable AI products are the ones that treat knowledge as a first-class concern from day one, not as a retrieval problem but as a modeling problem. What are the entities in your domain? What are the relationships between them? How do those relationships change over time? When a decision gets made, what does it invalidate?
These questions have been the domain of knowledge engineers and ontologists for decades. They got marginalized by the big data era, where storing everything was cheap and retrieval felt like it solved the problem. The agent era is bringing them back.
The dark factory.
On building software without a consistent team. What it actually looks like, what breaks, and what I wouldn't go back from.
Let me describe a morning of work.
I start with a queue of intents. Each intent is a structured description of something that needs to happen in the codebase: a bug fix, a feature, a refactor, a test gap. Each entry has a clear scope, acceptance criteria, and the evaluation tier that gates it. I review the queue, add two new items, reprioritize three, and mark one as blocked pending a design decision I haven't made yet. That takes maybe fifteen minutes, and then I let it run.
By the time I'm done with my first coffee, three intents have completed, two are in progress, and one has been discarded because it introduced a regression that the first evaluation gate caught. I review the diffs for the completed ones. Two of them are clean. One is technically correct but didn't touch the UI file that should have changed, so I flag it as plumbing-only and kick it back.
This is how the product gets built: a small team, a queue, and agents running in git worktrees. I still review plenty of code, but the real leverage is orchestrating the process.
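The intent queue described above can be sketched as a small data structure. The field names, statuses, and sample intents are assumptions for illustration, not the actual intent format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    QUEUED = "queued"
    BLOCKED = "blocked"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    DISCARDED = "discarded"


@dataclass
class Intent:
    title: str
    kind: str              # "bug", "feature", "refactor", "test-gap"
    scope: List[str]       # paths the agent may touch
    acceptance: List[str]  # criteria a reviewer can check
    eval_tier: int         # which gate tier must pass
    priority: int = 100    # lower runs first
    status: Status = Status.QUEUED


def next_runnable(queue: List[Intent]) -> List[Intent]:
    """Queued intents in priority order; blocked ones are skipped."""
    ready = [i for i in queue if i.status is Status.QUEUED]
    return sorted(ready, key=lambda i: i.priority)


queue = [
    Intent("Fix login redirect", "bug", ["src/auth/"],
           ["redirect lands on dashboard"], eval_tier=1, priority=10),
    Intent("Rework settings page", "feature", ["src/settings/"],
           ["screenshot matches design"], eval_tier=3, priority=20,
           status=Status.BLOCKED),  # waiting on a design decision
    Intent("Cover export path", "test-gap", ["tests/export/"],
           ["export tests pass"], eval_tier=1, priority=5),
]
print([i.title for i in next_runnable(queue)])
# ['Cover export path', 'Fix login redirect']
```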
What “conductor not operator” actually means.
I use this phrase a lot and I think it confuses people because it sounds like a motivational poster. So let me be concrete.
Operating means watching the agent, correcting it, steering it moment to moment. Your attention is the bottleneck. Conducting means setting the intent, defining the evaluation criteria, reviewing the output, and making decisions that the agent can't make: architectural choices, user experience calls, product direction. The agent runs without you watching it; you look at what it produced.
The shift requires something most people skip: heavy investment in the evaluation layer before the autonomy pays off. If your evals are weak, you can't trust what the agents produce and you end up operating again, reviewing every line and staying in the loop because you have to rather than because you chose to.
The factory doesn't run itself. It runs the parts you've already specified well enough to evaluate.
I spent about six weeks building eval infrastructure before the dark factory model started paying dividends: typecheck gates, lint regression checks, parity verification against a known-good snapshot, screenshot evidence for any intent that touched the UI. Weeks of work that didn't ship anything visible. It felt like overhead. It was the precondition for everything else.
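A tiered gate runner can be very small. In this sketch the checks are stubs and the assignment of gates to tiers is an assumption; a real gate would shell out to the typechecker, the linter, or a snapshot comparison.

```python
from typing import Callable, List, NamedTuple, Optional


class Gate(NamedTuple):
    name: str
    tier: int
    check: Callable[[], bool]  # returns True when the gate passes


def run_gates(gates: List[Gate], up_to_tier: int) -> Optional[str]:
    """Run gates in tier order; return the first failing gate's name, or None."""
    for gate in sorted(gates, key=lambda g: g.tier):
        if gate.tier > up_to_tier:
            break
        if not gate.check():
            return gate.name
    return None


gates = [
    Gate("typecheck", 1, lambda: True),
    Gate("lint-regression", 1, lambda: True),
    Gate("parity-vs-snapshot", 2, lambda: False),  # stub: simulate a regression
    Gate("ui-screenshot", 3, lambda: True),
]
print(run_gates(gates, up_to_tier=1))  # None
print(run_gates(gates, up_to_tier=3))  # parity-vs-snapshot
```

Stopping at the first failure keeps the cheap gates in front of the expensive ones, which matters when every intent runs the full ladder.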
What breaks.
I want to be honest about the failure modes because most people who write about autonomous coding agents don't.
Agents are good at implementing something that was specified clearly. They're much less good at noticing the specification was wrong, or that implementing it will break something adjacent, or that there's a simpler approach that changes the shape of the problem. I've had agents build ten files and 1,200 lines of code for a feature that should have been three files and maybe 200 lines, because I wrote an intent that implied a more complex architecture than necessary. The code was correct, the evals passed, and it was the wrong thing. That's my failure, not the agent's, but it's a failure mode you have to account for. The intent queue is not a replacement for system design; it's downstream of it.
The other thing that breaks: agents don't escalate naturally. When a human developer hits an ambiguity or a design decision they can't resolve, they ask someone. Agents tend to make a choice and keep going, or get stuck in a loop, or produce something that technically satisfies the spec while missing the point. Building explicit escalation paths (cases where the agent surfaces a decision rather than making it) is harder than it sounds, and most frameworks don't have good patterns for it yet.
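One pattern that helps is making escalation a first-class return value, so that surfacing a decision is as easy for the agent as finishing the task. The types and the single-option rule below are hypothetical, a sketch of the shape rather than a framework.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Completed:
    summary: str


@dataclass
class Escalation:
    question: str
    options: List[str]


def resolve_ambiguity(options: List[str]) -> Union[Completed, Escalation]:
    """Proceed only when exactly one option is viable; otherwise surface the decision."""
    if len(options) == 1:
        return Completed(f"proceeded with {options[0]}")
    return Escalation("Which approach should this intent take?", options)


print(resolve_ambiguity(["extend the existing table"]))
print(resolve_ambiguity(["extend the existing table", "add a new service"]))
```

The orchestrator can then route every `Escalation` to a human queue instead of letting the agent pick silently.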
What I wouldn't go back from.
The accumulation effect. Every completed intent builds on the last one. The codebase gets better in ways I wouldn't have reached if I were doing all the implementation myself, because the agents have no ego about refactors and no resistance to test coverage.
The focus shift. I'm spending my time on product decisions, architecture, customer conversations, and the things that actually require my judgment, not on syntax or boilerplate or the third retry of a database migration.
And the manufacturing analogy is real. Thinking in terms of intent batches and eval gates rather than tickets and standups is a different relationship to the work, one that suits the way I actually think. The conditions that make this model necessary are not rare: real users depending on the product, a codebase under pressure, and more work than headcount can absorb. The dark factory model is not a luxury. It's a survival strategy, and so far it's working.
Stop calling chatbots agents.
On the semantic inflation that's slowing this industry down, what a real agent requires, and what we lose by confusing the two.
Words matter more in early markets than they do anywhere else. When a category is forming, the language people use to describe things shapes what gets built, what gets funded, and what customers expect. Bad language creates bad expectations. Bad expectations create bad deployments. And enough bad deployments turn a genuine technological shift into a hype-cycle cautionary tale. We are in the process of doing this to “agent.”
What's being called an agent.
I've evaluated close to a dozen products in the last six months that are marketed as AI agents. Most of them are one of three things.
A chatbot with tool access, where the user types a message, the model calls a function, and the result comes back in the chat. Genuinely useful, and not an agent. A workflow automation with an AI step somewhere in the middle, where documents go in, a model extracts or classifies, and something happens downstream. Also useful, and also not an agent. Or a very expensive if/then tree dressed up with natural language interfaces, where the model parses intent and routes to pre-built flows. Useful in the right context, and not an agent.
None of these require the things that make agents interesting: persistent goal state, multi-step planning, the ability to decompose a complex intent and execute on it across time without moment-to-moment human direction, and the judgment to recognize when something has changed that requires escalation.
A chatbot that books a meeting is not an agent. An agent is what books the meeting, realizes the preferred time conflicts with something it found in a document it read three sessions ago, surfaces the conflict with a recommendation, and waits for your call.
What agents actually require.
Goal persistence is the foundational requirement. The agent has to maintain a representation of what it's trying to accomplish across the duration of the work, not just across a single context window. This is a memory architecture problem more than a model problem.
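As a rough illustration of why this is a memory problem, the sketch below keeps the goal in a record that lives outside any single context window and is reloaded at the start of each session. The GoalState fields and the preamble format are assumptions for the example; a real system would write the record to durable storage, not pass a string around.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class GoalState:
    """Hypothetical goal record that outlives a single context window."""
    goal: str
    completed_steps: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)  # facts learned so far


def to_record(state: GoalState) -> str:
    """Serialize the goal state so it can be stored between sessions."""
    return json.dumps(asdict(state))


def from_record(record: str) -> GoalState:
    """Rebuild the goal state from a stored record."""
    return GoalState(**json.loads(record))


def session_preamble(record: str) -> str:
    """Build the text a fresh context window would be seeded with."""
    state = from_record(record)
    lines = [f"Goal: {state.goal}"]
    lines += [f"Done: {step}" for step in state.completed_steps]
    lines += [f"Known: {fact}" for fact in state.observations]
    return "\n".join(lines)
```

With this shape, a fact noted three sessions ago is still in front of the model when it matters, because it travels with the goal and not with the conversation.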
Decomposition and planning are required in a way that “chain-of-thought reasoning inside a single prompt” doesn't capture. The agent has to take a goal, break it into steps, sequence those steps correctly, identify dependencies, and update the plan when something changes. The model doesn't do this automatically; you have to architect for it.
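A small sketch of the sequencing and replanning part, using the standard library's graphlib. The step names are hypothetical, and the plan is assumed to be a simple map from each step to the steps it depends on.

```python
from graphlib import TopologicalSorter
from typing import Dict, FrozenSet, List, Set


def plan(dependencies: Dict[str, Set[str]],
         done: FrozenSet[str] = frozenset()) -> List[str]:
    """Order the remaining steps so every step follows its dependencies.

    Steps already in `done` are dropped, both as steps and as dependencies.
    TopologicalSorter raises graphlib.CycleError if the steps depend on
    each other in a loop, which is a planning bug worth failing loudly on.
    """
    remaining = {
        step: deps - done
        for step, deps in dependencies.items()
        if step not in done
    }
    return list(TopologicalSorter(remaining).static_order())


# Initial plan: a three-step chain.
steps = {
    "read_documents": set(),
    "check_calendar": {"read_documents"},
    "book_meeting": {"check_calendar"},
}
first_plan = plan(steps)

# Something changed: a conflict turned up, so a new step is inserted and
# the plan is recomputed from what is still outstanding.
steps["resolve_conflict"] = {"check_calendar"}
steps["book_meeting"] = {"resolve_conflict"}
revised_plan = plan(steps, done=frozenset({"read_documents"}))
```

The point of the second call is that the plan is data the harness owns and recomputes, not something left implicit in the model's last response.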
And honest failure handling, which is the one that separates real implementations from demos. Agents fail. They hit ambiguity and encounter states the original spec didn't account for. A real agent has a graceful path for this: escalation, checkpointing, rollback. Most things being sold as agents have none of this; they produce confident-sounding output and move on.
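To make checkpointing, rollback, and escalation concrete, here is a minimal sketch under simple assumptions: the agent's working state is a plain dictionary, each step is a function that mutates it, and any exception counts as a failure. The names (RunReport, run_with_checkpoints, the two demo steps) are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[Dict], None]]


@dataclass
class RunReport:
    state: Dict
    completed: List[str] = field(default_factory=list)
    escalation: Optional[str] = None  # set when a human needs to look


def run_with_checkpoints(state: Dict, steps: List[Step]) -> RunReport:
    """Run steps in order; on failure, roll back to the last checkpoint."""
    completed: List[str] = []
    for name, step in steps:
        checkpoint = copy.deepcopy(state)  # snapshot before the step runs
        try:
            step(state)
        except Exception as exc:
            # Rollback: report the pre-step snapshot, not the half-mutated
            # state, and surface the failure instead of moving on.
            return RunReport(checkpoint, completed, f"{name} failed: {exc}")
        completed.append(name)
    return RunReport(state, completed)


# Demo steps: the second one damages the state before it fails.
def add_row(state: Dict) -> None:
    state["rows"].append("invoice-17")


def reconcile(state: Dict) -> None:
    state["rows"].clear()
    raise ValueError("spec does not cover an empty currency field")


report = run_with_checkpoints(
    {"rows": []}, [("add_row", add_row), ("reconcile", reconcile)]
)
```

In this run the report carries the state as it stood after add_row, along with an escalation message naming the step that failed, which is the opposite of producing confident-sounding output and moving on.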
Why this matters right now.
Every enterprise I work with is in one of two modes: evaluating AI agents cautiously because they don't want to get burned, or sitting skeptical on the sidelines after deploying something marketed as an agent that didn't hold up in production. The skeptical ones are the harder conversation, because they didn't buy something fake. They bought something real, but it was a chatbot with tool access being sold as an autonomous workflow partner. The failure wasn't the technology; it was the expectation that was set.
I spend a meaningful portion of my time re-calibrating these conversations: here's what an agent is, here's what you actually deployed, here's the gap, here's what it would take to close it. The gap is always in the same places: memory architecture, eval infrastructure, escalation design, and the acknowledgment that human oversight isn't a limitation to be engineered away but a feature that makes the system trustworthy enough to actually use.
A standard worth holding.
The best AI work I've seen in the last year isn't the most autonomous; it's the most precisely defined. The clearest intent specs, the most rigorous eval criteria, the most thoughtful human-agent boundary. That's a construction-quality argument, not a conservative one. The autonomous capability comes from the precision of the specification, not from removing human judgment from the loop.
Build the thing that actually works. Define it honestly. Earn the word.
Frequently asked.
- Who is Mark Ferraz?
- Mark Ferraz is an operator, builder, and architect with 25 years on the Microsoft stack. He leads Govern 365 at Netwoven, created LittleGuy as an exhibition project on personal infrastructure, founded three companies across three completed founder cycles (MindZipper Networks, Quantic Gaming, and SolutionsMark), authored three Microsoft Press books on SharePoint architecture, and serves as President of Texas Star Party.
- What does Mark Ferraz write about?
- Production AI infrastructure, the gap between agent demos and agent systems, the harness around foundation models, memory and knowledge graphs, governance and compliance for regulated industries, and operator discipline for solo builders working with autonomous coding agents.
- What is Govern 365?
- Govern 365 is an AI-native secure collaboration and governance platform for Microsoft 365 enterprise customers, covering policy enforcement, drift detection, lifecycle management, and compliance automation across the tenant. Mark Ferraz leads Govern 365 at Netwoven.
- What is LittleGuy?
- LittleGuy is a persistent memory substrate for AI human-agent collaboration: a structured graph of people, decisions, tasks, knowledge, and conversations that does not forget when the chat session ends. Mark Ferraz is its creator and operator, and it runs on his own personal infrastructure as an exhibition project.
- What is the dark factory?
- The dark factory is Mark Ferraz's term for building production software primarily through autonomous coding agents instead of a consistent human team. The workflow requires intense operator discipline, sharp specifications, and verification at every boundary.
- What is the 5% problem?
- The 5% problem is Mark Ferraz's framing for the structural limitation of public training data. Roughly five percent of all content sits on the open web. The other ninety-five percent sits behind firewalls, in private codebases, and in the lived experience of practitioners. Frontier models train mostly on the public five percent, which is why solo AI research drifts toward consensus and away from differentiation.
- What is bitemporal memory?
- Bitemporal memory records every claim with two timestamps: when the claim became valid in the world, and when the system recorded it. New claims supersede old ones rather than overwriting them, with the chain preserved. The system can answer what it believed at any past point and how that belief has changed.
- What is forward composition in product architecture?
- Forward composition is Mark Ferraz's framing for building products from capabilities rather than from requirements. Each capability is contested and made to stand on its own constraints first. Features are then expressed on top, as intersections of capabilities, against a specific persona and problem. The discipline avoids reverse-engineering features from research streams that all started in the same public five percent.
- What is the common grounding problem?
- The common grounding problem is Mark Ferraz's framing for why model-level tuning alone cannot deliver reliable alignment. Even a well-tuned model produces subtly miscalibrated outputs for a specific user when the agent lacks enough shared context with that user. The miscalibration coexists with real competence, and the user develops a doubt that prevents delegation. Solving the gap requires a harness around the model (memory, graph, claims and boundaries, externally verifiable anchors) that lets the agent reason from the user's actual situation rather than a generic representation of it.
- How can I reach Mark Ferraz?
- Email [email protected], the contact form at markferraz.com/#contact, or LinkedIn at linkedin.com/in/mferraz.