ERC-8004 & Codatta: Three Paths Toward the Knowledge Layer
In a previous article, I did a full deep dive into ERC-8004 — its vision, its purpose, and how it works under the hood.
Of course, that breakdown wasn’t written just for fun. For one of the industry’s many “uncelebrated but perpetually useful” workhorses, staying on top of AI × Web3 developments is a survival skill, not a hobby. Beyond keeping up with the industry, evaluating whether a protocol is actually viable and usable is just as critical. And if it truly connects with real use cases one day, well… being early could turn out to be a very profitable head start.
After all, once you step into this arena, you quickly learn a simple truth: in a frontier like this, falling behind one year is no different from falling behind ten. The jungle has its own laws, and the sense of urgency never really goes away.
So I spent some time digging deeper and—no surprise—there really are meaningful intersections. As always, the roads that lead toward the peak of technology tend to rhyme with one another.
But before we get into how these two worlds connect, we need to introduce the protagonists properly. ERC-8004 has already had its moment in the spotlight, so let’s focus on the other player making its entrance—Codatta, an emerging solution aimed at building the Knowledge Layer for AI.
What Is the Knowledge Layer?
Let’s start with the “official” framing: Codatta positions itself as the Knowledge Layer for AI.

In plain terms: it provides high-quality data for AI training.
Why does AI need high-quality training data? Because the way AI evolves today is completely different from how it did many years ago. Early on, humans taught AI explicit rules, like — “red light stop,” “green light go.” Easy to teach, easy to learn.
But then human ambition showed up — we wanted AI to do more and more. As expectations kept rising, the tasks became more complex, the rules more subtle, and many of them became nearly impossible to describe.
Take “say the right thing to the right person in the right context” — aka emotional intelligence. Even humans struggle. How do you teach that to a machine?
So we switched strategies:
Instead of hard-coding every rule, we feed AI tons of data and let it extract the patterns on its own. Yep, back to our favorite traffic light example: show it traffic videos — when the light turns red, all the cars stop; when it turns green, everyone starts moving again. And every now and then, someone insists on running a red light and gets a fine for it.
Little by little, after watching enough of those videos, the model starts to understand the idea—even without passing a driving theory test, it learns that red means wait and green means go.
It’s the same principle as an old saying:
If you soak in enough great poetry, you may not become a poet, but you’ll learn how to sound like one.
For this kind of learning, data needs to meet three basic criteria:
- Abundant enough to learn from.
- Accurate enough to trust.
- Clear enough to use.
That should sound familiar—we humans learn from textbooks in much the same way. The only difference is scale: what counts as “a lot” for humans is nowhere near enough for AI.
Where does all this data come from?
Part of it is produced by humans; the rest can be generated by AI itself — what we call synthetic data. Of course, even when the data is submitted by humans, AI can still be brought in for pre-labelling and assisted validation: let AI handle the heavy, repetitive work, and let humans focus on the deeper, higher-value refinement on top of it, which reduces costs and improves efficiency.
Sounds easy, right?
Well… it’s not.
Because humans are, frankly, relentless.
They’re no longer satisfied with AI just recognizing traffic lights.
No, no, no — they want AI to handle multilingual translation, write market reports, ship production-ready code, perform logical reasoning, solve math olympiad problems, and hey, why not prove the Goldbach Conjecture while we’re at it?
Just listen to that.
These are things humans themselves can’t do reliably — but apparently, instead of pressuring their kids, they’ve decided to pressure AI instead.
Whether that’s a blessing or a curse for the kids… I honestly don’t know.
All these wild expectations don’t just demand more data — they demand better data, too.
This isn’t as simple as snapping a random picture.
Even a photo needs retouching afterward, right? Now imagine the time and effort required to produce high-quality training data. In many cases, it takes serious work. We call that kind of refined, insight-carrying data Knowledge.
If you want AI to learn how to write code, you can’t just feed it a pile of messy, unstructured snippets that don’t even compile. You need to show it good code. Because if what it sees is always spaghetti code that fails to build, what exactly do you expect it to learn? Garbage in, garbage out.
You want a high score? You grind through practice questions until your soul evaporates.
But someone has to keep writing those exam questions — suffering in administrator mode. And they can’t just copy-paste old questions either. The questions have to feel fresh, convince students they’re leveling up, and, with any luck, snipe a few real exam items.
But that’s not the end of it. A new problem emerges: after spending all that time and effort creating high-quality data for AI, what do I get in return?
As AI becomes smarter and more capable, the first thing it might do is replace me. It learns every trick I teach it—and then takes my place.
The old saying still applies: You teach the apprentice and the apprentice eats your lunch.
Some days, you really do wonder whether all this effort is just getting rugged by the universe.
Fortunately, there are always more solutions than problems. And this is where Codatta comes in: its goal is to help solve these challenges by providing AI with large volumes of high-quality data, while also taking good care of the people who produce that data—not just emotionally, but economically.
How Codatta achieves this deserves its own deep dive, and we’ll leave that for another article. For today’s topic—whether ERC-8004 and Codatta can be combined to create a whole that is greater than the sum of its parts, and how such integration could work—it’s enough to simply understand what Codatta is here to do.
Before we explore how they might work together, let’s briefly recap the two players: ERC-8004, the Trustless Agents standard that gives AI Agents an on-chain identity — their resume — on Ethereum, and Codatta, a platform focused on producing high-quality data, or Knowledge, for AI while rewarding the people who produce it.
With a good sense of what each player does, and after examining the nature of both, I’ve sketched three potential ways they could connect—ranging from deep to loose integration.
For simplicity, let’s call them:
👉 Reconstruction
👉 Fusion
👉 Upstream & Downstream
Reconstruction
Reconstruction: Rebuilding Codatta DID using the ERC-8004 standard, so that DID information is recorded in the EVM ecosystem in an ERC-8004-compatible form.
In case we haven’t formally introduced it: Codatta DID is Codatta’s decentralized identity system, which records the identity information of Codatta users—in other words, it is essentially the resume of a Codatta user.
Although ERC-8004 was originally proposed as a protocol for Trustless Agents, its openness and extensibility — fewer constraints, less rigid structure — make it adaptable to other use cases through customized field definitions.
We’ve seen similar evolution before: ERC-721 began life in CryptoKitties, but later proved to be capable of far more—PFP (Profile Picture NFT), brand IP, membership passes, even today’s rising RWA experiments. Standards evolve as ecosystems evolve.
As noted in the earlier ERC-8004 breakdown, one of its core purposes is to represent and manage Agent resumes. But despite its Agent-centric posture, nothing in the standard actually enforces that exclusivity. At the execution layer, it imposes virtually no binding constraints.
And beyond that, who can even say with certainty whether an “Agent” is pure code or a human-in-the-loop? On-chain, there is no reliable way to distinguish the two.
Strip away the narrative and what remains is this: ERC-8004 is a digital identity schema. And identity, across most domains, follows familiar patterns—no matter how creative a resume layout looks, the fundamental elements barely change.
Since they serve the same category of use cases, compatibility between them is plausible in principle. Within the EVM ecosystem, it should therefore be entirely feasible to implement Codatta DID on top of ERC-8004, preserving core identity functionality while improving DID interoperability and usability across the broader EVM landscape.
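To make the idea a bit more concrete, here is a minimal sketch of what “Codatta DID on ERC-8004” could look like at the data level. Every field name and the shape of the registration record below are illustrative assumptions — not the real Codatta DID schema and not the actual ERC-8004 interface — the point is simply that a contributor profile maps cleanly onto an agent-style identity record with an on-chain owner and an off-chain metadata document.

```typescript
// Hypothetical sketch: mapping a Codatta DID profile onto an ERC-8004-style
// identity registration. All field names are illustrative assumptions,
// not the real Codatta or ERC-8004 schemas.

// A simplified view of what a Codatta DID might carry (assumed fields).
interface CodattaDidProfile {
  did: string;              // e.g. "did:codatta:0xabc..."
  controller: string;       // EVM address that controls the identity
  domains: string[];        // data domains the contributor works in
  reputationScore: number;  // aggregate contribution quality, 0..100
}

// A simplified ERC-8004-style registration (assumed shape): an on-chain owner
// plus a pointer to an off-chain document describing the identity.
interface IdentityRegistration {
  owner: string;
  metadataUri: string;      // where the identity document is hosted
  metadata: {
    name: string;
    type: "agent" | "data-contributor";
    skills: string[];
    trust: { source: string; score: number }[];
  };
}

// Map a Codatta DID into an ERC-8004-compatible registration payload.
function toIdentityRegistration(p: CodattaDidProfile): IdentityRegistration {
  return {
    owner: p.controller,
    metadataUri: `https://example.org/identity/${encodeURIComponent(p.did)}.json`, // placeholder host
    metadata: {
      name: p.did,
      type: "data-contributor",
      skills: p.domains,
      trust: [{ source: "codatta-reputation", score: p.reputationScore }],
    },
  };
}

// Usage example
const registration = toIdentityRegistration({
  did: "did:codatta:0x1234...abcd",
  controller: "0x1234000000000000000000000000000000abcd",
  domains: ["blockchain-address-labeling", "code-review"],
  reputationScore: 87,
});
console.log(JSON.stringify(registration, null, 2));
```

If ERC-8004 keeps its light on-chain footprint, the on-chain step would then reduce to a single registry call recording the owner and the metadata pointer — the rest of the resume lives off-chain, exactly where it does today.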
Fusion
Fusion — ERC-8004 and Codatta DID enhancing each other’s completeness and trustworthiness
As noted earlier, Codatta DID is Codatta’s decentralized identity system. It approaches user identity from the data contribution perspective:
- Who contributed what data?
- In what way (Sample, Label, Validation)?
- And with what level of quality (Reputation)?
And critically, a “user” here does not have to be a human. It can also be an AI Agent.
In fact, as AI’s appetite for training data keeps accelerating — in scale, in freshness, and in diversity — relying solely on human-generated data is becoming increasingly unrealistic. Synthetic data produced by generative AI models is already becoming a standard complement to real-world data.
It is therefore reasonable to expect that, in the future, AI Agents will participate directly in Codatta’s data production workflows.
In that scenario, these Agents would hold dual identities:
- as ERC-8004–compliant AI Agents, and
- as Codatta users who contribute knowledge and help produce data assets.
For Codatta DID, compatibility with ERC-8004 means the ability to plug seamlessly into Ethereum — and, more broadly, the entire EVM ecosystem. This significantly expands Codatta DID’s reach and increases the likelihood of real adoption across diverse applications.
Conversely, from the perspective of ERC-8004, identity data coming from Codatta DID is not theoretical or self-declared — it is backed by long-term, real operational history. This gives it a level of authenticity and completeness that few identity systems can match. With such credible identity signals, ERC-8004 adoption and ecosystem growth could advance significantly faster.
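As a rough illustration of this “dual identity” idea, the sketch below combines an actor’s ERC-8004-side record with its Codatta-side contribution history into one composite trust profile. Everything here — the field names, the 50/50 weighting, the helper function — is a hypothetical sketch for illustration, not either protocol’s actual API.

```typescript
// Hypothetical sketch of the "fusion" direction: one actor, two identity views.
// Field names and the scoring formula are assumptions for illustration only.

interface Erc8004View {
  agentId: bigint;               // identifier in an ERC-8004-style registry
  agentDomain: string;           // where the agent's self-description lives
  feedbackCount: number;         // how many feedback entries it has received
  positiveFeedbackRatio: number; // 0..1
}

interface CodattaView {
  did: string;             // "did:codatta:..."
  contributions: number;   // accepted samples / labels / validations
  reputationScore: number; // 0..100, from Codatta's quality tracking
}

interface FusedTrustProfile {
  subject: string;
  erc8004: Erc8004View;
  codatta: CodattaView;
  compositeScore: number;  // 0..100, blended from both sources
}

// Blend the two signals. The equal weighting is arbitrary; a real design
// would tune it, or simply expose both signals separately to verifiers.
function fuseIdentities(a: Erc8004View, c: CodattaView): FusedTrustProfile {
  const agentSignal = a.positiveFeedbackRatio * 100;
  const dataSignal = c.reputationScore;
  return {
    subject: c.did,
    erc8004: a,
    codatta: c,
    compositeScore: 0.5 * agentSignal + 0.5 * dataSignal,
  };
}

// Usage example
const profile = fuseIdentities(
  { agentId: 42n, agentDomain: "agent.example.org", feedbackCount: 120, positiveFeedbackRatio: 0.93 },
  { did: "did:codatta:0xabcd...1234", contributions: 5400, reputationScore: 88 }
);
console.log(profile.compositeScore); // 90.5
```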
Upstream & Downstream
Upstream & Downstream: Codatta as the upstream supplier for ERC-8004 Agents, and ERC-8004 Agents as the downstream consumers of Codatta’s data — together forming a complete, fine-grained Royalty Engine loop.
Before diving in, we need one piece of background on the Royalty Engine, which is the core business model behind Codatta.
In short, users who provide meaningful data contributions — whether through Sample, Label, or Validation — may receive:
- One-time compensation from the original (primary) data consumer — the party that initiates and finances the data production, and/or
- A proportional share of data ownership, based on the value of their contribution.
With data ownership comes ongoing royalty yield from future usage of that data. At the same time, because the primary data consumer does not need to purchase the dataset at full upfront cost, the Royalty Engine significantly lowers the barrier to building data-driven products and businesses.
In effect, the Royalty Engine transforms the relationship between data contributors and primary data consumers from a traditional buyer–vendor transaction into a co-creation and shared-upside model — which is its core value proposition.
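To make the economics concrete, here is a toy example of how contribution-weighted ownership and royalty payouts could be computed. The structures, numbers, and the simple pro-rata rule are assumptions for illustration; Codatta’s actual Royalty Engine accounting may well differ.

```typescript
// Toy model of a Royalty Engine split. The structures and the simple
// pro-rata rule are illustrative assumptions, not Codatta's actual accounting.

interface Contribution {
  contributor: string;                     // DID or address
  kind: "sample" | "label" | "validation";
  valueWeight: number;                     // relative value assigned to this work
}

// Derive ownership shares pro rata from contribution weights.
function ownershipShares(contribs: Contribution[]): Map<string, number> {
  const total = contribs.reduce((sum, c) => sum + c.valueWeight, 0);
  const shares = new Map<string, number>();
  for (const c of contribs) {
    shares.set(c.contributor, (shares.get(c.contributor) ?? 0) + c.valueWeight / total);
  }
  return shares;
}

// Split a royalty payment according to those shares.
function distributeRoyalty(shares: Map<string, number>, revenue: number): Map<string, number> {
  const payouts = new Map<string, number>();
  for (const [who, share] of shares) payouts.set(who, revenue * share);
  return payouts;
}

// Usage example: three contributors, one downstream payment of 1000 units.
const shares = ownershipShares([
  { contributor: "did:codatta:alice", kind: "sample", valueWeight: 50 },
  { contributor: "did:codatta:bob", kind: "label", valueWeight: 30 },
  { contributor: "did:codatta:carol", kind: "validation", valueWeight: 20 },
]);
console.log(distributeRoyalty(shares, 1000)); // alice 500, bob 300, carol 200
```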
But, for the Royalty Engine to function as intended, two foundational requirements must be met:
- Clear, complete, and accurate records of data contribution
- Clear, complete, and accurate records of data usage.
Clear, complete, and accurate data contribution records mean that the entire lifecycle of a dataset — from its initial sampling, through labeling and validation, to finalization — can be reliably traced. Only with such end-to-end lineage can data ownership and distribution of ownership shares be determined in a manner that is fair, transparent, and verifiable.
Clear, complete, and accurate data usage records mean that every instance in which the data is consumed can be tracked with the same level of reliability. This is essential for trustworthy calculation and distribution of ongoing royalty yields generated from downstream data consumption.
Only when both conditions are satisfied simultaneously can the rights and economic interests of data owners be properly protected, enabling the Royalty Engine to operate effectively and sustainably.
Codatta is addressing the first requirement through its data lineage system, which enables end-to-end traceability of how data is sampled, transformed, labeled, and validated, and links each step to the corresponding contributors. This lineage makes it possible to establish and verify data ownership with precision.
However, on the usage side, the picture is less complete. As an open platform ecosystem, Codatta does not directly operate or control downstream applications built on top of the data. As a result, the lineage system currently has limited visibility into how data is actually consumed in real-world AI products.
This is where ERC-8004 becomes highly relevant. As a protocol standard for AI Agents, it introduces conventions for recording AI Agent execution. Since AI Agents are among the most important consumers of high-quality data products, this provides a powerful complement that can significantly enhance the effectiveness of the Royalty Engine.
Conversely, Codatta’s data lineage system can feed back into AI Agent attribution and performance optimization, providing richer and more reliable signals for evaluating Agent behavior.
Connecting Codatta’s data lineage with ERC-8004 Agent execution records sets off a self-reinforcing upward spiral — one in which both data contributors and application developers come out ahead.
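And here is a minimal sketch of what that closed loop could look like in practice: usage events, as they might be surfaced by ERC-8004-style agent execution records, are joined against lineage-derived dataset ownership to attribute royalties per contributor. Every structure and field name below is an assumption of mine; neither protocol specifies this exact join today.

```typescript
// Hypothetical sketch of the upstream/downstream loop: joining agent usage
// events with dataset ownership derived from data lineage. All shapes are
// illustrative assumptions, not defined by ERC-8004 or Codatta today.

interface AgentUsageEvent {
  agentId: bigint;   // consuming agent, as identified on-chain
  datasetId: string; // which data asset was used
  feePaid: number;   // fee attributable to that usage
}

// datasetId -> (contributor -> ownership share), as established by lineage.
type LineageOwnership = Map<string, Map<string, number>>;

// Aggregate royalties owed to each contributor across all usage events.
function attributeRoyalties(
  events: AgentUsageEvent[],
  ownership: LineageOwnership
): Map<string, number> {
  const owed = new Map<string, number>();
  for (const e of events) {
    const shares = ownership.get(e.datasetId);
    if (!shares) continue; // no lineage record: nothing to attribute
    for (const [contributor, share] of shares) {
      owed.set(contributor, (owed.get(contributor) ?? 0) + e.feePaid * share);
    }
  }
  return owed;
}

// Usage example: two agents consume the same dataset.
const ownership: LineageOwnership = new Map([
  ["dataset-7", new Map([["did:codatta:alice", 0.6], ["did:codatta:bob", 0.4]])],
]);
const owed = attributeRoyalties(
  [
    { agentId: 42n, datasetId: "dataset-7", feePaid: 100 },
    { agentId: 43n, datasetId: "dataset-7", feePaid: 50 },
  ],
  ownership
);
console.log(owed); // alice 90, bob 60
```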
Conclusion
The discussion above approaches the problem from a business and ecosystem perspective and offers a set of preliminary, directional thoughts. These ideas are not mere daydreaming: each approach has been weighed against technical feasibility, so that none of them collapses into wishful thinking or empty speculation.
Whether any of these approaches will prove effective can only be answered through real-world implementation. But at a minimum, they must all be buildable — otherwise, there’d be nothing to test. As the old saying goes: if you want to know whether it’s a horse or a mule, you eventually have to take it out for a run — not leave it on the whiteboard for endless debate.
Our standing principle remains unchanged: actions speak louder than words.
In the upcoming parts of this series, I will dive deeper into the engineering and protocol design behind each approach, covering how they might be implemented in practice. Each has its own strengths and trade-offs, and each is suited to different scenarios. This article does not attempt to declare a single best answer, nor does it claim a final conclusion. Instead, it is intended as an opening contribution—a prompt for exploration and discussion.
After all, ERC-8004 itself is still evolving. As the standard matures, there will likely be new patterns and new possibilities that none of us can fully anticipate today.
The story isn’t finished — we’re all co-authors.
Stay tuned, stay sharp, and stay shipping.