The more one writes about identity, the more the word becomes a term for something that is as unfathomable as it is all-pervasive.
So said Erik Erikson, the psychologist who coined the term “identity crisis.” It often feels true: identity is a nebulous concept with many connotations, and its meaning is highly context-dependent. Web3 is no different.
In this post I’ll try to alleviate this by offering a framework for thinking about identity on the web as primarily a tool to store, manage and retrieve information.
It won’t clear up every use and misuse of the term, but I hope it lends clarity in thinking about how identity can shape Web3, how applications can build for this, and what these choices will mean for our experience of the web.
When people talk about identity they usually mean one of three related but quite different scopes: (a) a unique identifier, (b) a holistic view of an entity, or (c) a specific piece of context about an entity.
Unique identifiers are critical in any social setting. Names are an adequate ‘identifier’ amongst friends, family, or small tribes (under Dunbar’s 150-person threshold) where familiarity can be assumed. Beyond that, more rigid identifiers help make actors ‘legible’ in a broader system. States implemented IDs to manage taxes, conscription and social programs. Web applications have user IDs in a user table to track, manage and serve their customers.
A holistic view refers to all the possible information about a user or other actor. Attempts to attach lots of data to unique identifiers can create rich information sets about people or entities. Pursuit of this is seen in the user databases of Facebook & Google, India’s Aadhaar and China’s universal reputation system, and customer data platforms like Segment and LiveRamp.
A specific piece of context can refer to any of the many subsets of the holistic view. KYC or identity verification - a many-billion-dollar industry - is about verifying that someone is the unique identifier they are claiming to be within a state system. Authentication, anti-fraud, anti-spam, and reputation algorithms are, similarly, specific services focused on a subset of information within the holistic view.
Do I contradict myself? Very well then I contradict myself, I am large, I contain multitudes. -Walt Whitman
Unique identifiers are necessary, but on their own pretty useless. They are nearly always used to route to some information. That might be a name and address in state records, a document in a file system, a password in an application database, or a token balance or transaction history on a blockchain. In every case, the identifier is useful because it conveys related information.
Many situations call for retrieving or verifying a specific piece of context linked to an identifier. For example, Gitcoin needs an ‘identity system’ that prevents attacks on its Grants platform. In practice, they need to map evidence of personhood (KYC verification, Twitter account) to a unique identifier. The more information they have on how likely that actor is unique or fraudulent, the better their platform can function.
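As a rough illustration of linking a specific piece of context to an identifier, the sketch below attaches pieces of personhood evidence to an identifier and sums them into a score. The identifiers, evidence names, weights, and threshold are all hypothetical and chosen only for illustration; this is not Gitcoin’s actual method.

```python
# Hypothetical weights: how strongly each kind of evidence suggests a unique human.
EVIDENCE_WEIGHTS = {"kyc_verification": 3, "twitter_account": 1}
UNIQUENESS_THRESHOLD = 3  # assumed cut-off, for illustration only

# Evidence collected per identifier (the "specific context" for this use case).
evidence_by_identifier = {
    "user-001": {"kyc_verification", "twitter_account"},
    "user-002": {"twitter_account"},
}


def uniqueness_score(identifier: str) -> int:
    """Sum the weights of every known piece of evidence for an identifier."""
    evidence = evidence_by_identifier.get(identifier, set())
    return sum(EVIDENCE_WEIGHTS.get(kind, 0) for kind in evidence)


def is_probably_unique(identifier: str) -> bool:
    """More evidence means a higher score and more confidence in uniqueness."""
    return uniqueness_score(identifier) >= UNIQUENESS_THRESHOLD
```

The identifier does no work on its own here; it is only the key under which the evidence is found.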
Holistic views of identity are always incomplete — just as we can never perfectly describe our ‘true selves’ in meatspace, our digital selves will never be fully consistent or comprehensive. But the more data is gathered around a single (or set of linked) identifiers, the more information we have to draw on for any given context.
The common thread is: identity systems create the ability to reliably associate information with a unique identifier. An identity system is more reliable, and therefore more useful, the more dependable, flexible, and accessible it is.
Different contexts call for different considerations for privacy and security; for trust via 3rd parties or auditability or decentralization; for consistency vs. availability. But at the most primitive layer, an identity system is stronger if it has the potential to make more information more legible more consistently.
The web, to massively oversimplify, runs on hardware, code, and data. Every website you go to has logic and rules written in code, and nearly all are populated with information encoded in data. That data - whether today’s news or your friends’ tweets or your last email drafts - has to be retrieved accurately and reliably when you arrive at the site. That’s done through identifiers.
Just as unique identifiers are not useful without attached data, data on the web isn’t very useful if it can’t be retrieved at the right time. Unique identifiers, and routing tables and logic built around them, are used throughout the web to organize the data that populates it. Who is creating those identifiers? And who is organizing data around them?
Today, it’s nearly every site you visit, product you use, or company you encounter. Each keeps identifiers in a database it has created, mostly private and siloed from every other company’s. Data is stored there, and linked there, as well. Often this is organized around a user table: each row represents a user, each column a type of data, and the table stores or points to each user’s record of that type of data.
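A minimal sketch of this arrangement, using Python’s built-in `sqlite3` module, might look like the following. The two applications, the column names, and the email addresses are hypothetical; the point is only that each app mints its own identifier for the same person in its own private table.

```python
import sqlite3


def create_app_database() -> sqlite3.Connection:
    """Each application creates its own private user table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE users ("
        " user_id INTEGER PRIMARY KEY,"  # identifier minted by this app only
        " email TEXT NOT NULL,"          # a column that stores data directly
        " avatar_ref TEXT)"              # a column that points to data elsewhere
    )
    return db


def sign_up(db: sqlite3.Connection, email: str, avatar_ref=None) -> int:
    """Insert a new row and return the app-specific user ID."""
    cursor = db.execute(
        "INSERT INTO users (email, avatar_ref) VALUES (?, ?)",
        (email, avatar_ref),
    )
    db.commit()
    return cursor.lastrowid


news_app = create_app_database()
photo_app = create_app_database()

sign_up(photo_app, "bob@example.com")
alice_in_news = sign_up(news_app, "alice@example.com")
alice_in_photos = sign_up(photo_app, "alice@example.com", "photos/alice.png")

# The same person ends up with unrelated identifiers in unrelated databases.
print(alice_in_news, alice_in_photos)
```

Neither database can see the other, so nothing links the two rows for the same person except data she had to enter twice.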
How does this identity system hold up to our criteria above?
Dependable: 👍🏽 pretty dependably available, but 👎🏽👎🏽 with no auditability, highly susceptible to hacks and errors
Flexible: 👍🏽 database types can be linked to handle all kinds of information, though it can be a bit of a mess
Accessible: 👎🏽👎🏽👎🏽 with every app needing their own identifier, information (and managing it) is incredibly fragmented, redundant and inefficient
From a macro perspective, this is a pretty bad identity system for the web - because it’s not an identity system, it’s many different ones. It fragments information, limiting its value and use for every participant. (It also creates terrible incentives to hoard and abuse user data that are beyond the scope of this article).
From a more micro perspective, in a user’s experience with any given application, it is the application that is responsible for the user’s identity - for their unique identifier, the data associated with it, and the reliable links between them. This arose simply because there was, until recently, no other option. This is intuitively wrong.
Blockchains are a form of distributed ledger technology (DLT), which is basically a shared database. A shared database seems like a great place to put a unified user table, and get rid of the archaic need for every application to create its own identity system.
This is the promise of decentralized identity and a core pillar of the Web3 vision: every user and builder is in control of their own data, value, relationships, and information. In this vision, each user becomes the unified discovery point for their own data, creating reuse and composability across applications. This creates shared network effects, native interoperability, and compounding experiences that siloed, centralized applications can’t compete with.
A primitive version of this envisions a single unified registry of users (on a single DLT) and a standard way to add information to that registry for all apps. Users are given control of their own cryptographic, sovereign address (or identifier), with which they sign all data to create the trust needed for data in an open context. We get every app to use the same registry (blockchain) and issue data using a standard format (NFTs), and we’re theoretically in identity nirvana: a web in which we bring our social graphs across apps, engage with audiences and communities seamlessly across platforms, and move easily between new products and services as soon as they become available, because they all natively interoperate.
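To make this primitive version concrete, here is a toy, in-memory model of a single shared registry in which every app records data as tokens owned by an address. The class, field names, app names, and address strings are hypothetical; real chains add signatures, consensus, and fees, none of which are modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token_id: int
    owner: str    # the user's address on this one registry
    issuer: str   # the app that issued the data
    payload: str  # the data itself, in the one standard format


class SingleRegistry:
    """One global list of tokens that every app reads from and writes to."""

    def __init__(self) -> None:
        self._tokens = []

    def issue(self, owner: str, issuer: str, payload: str) -> Token:
        """Record a new piece of data against an owner's address."""
        token = Token(len(self._tokens) + 1, owner, issuer, payload)
        self._tokens.append(token)
        return token

    def tokens_of(self, owner: str) -> list:
        """Any app can discover everything attached to an address."""
        return [t for t in self._tokens if t.owner == owner]


registry = SingleRegistry()
registry.issue("0xALICE", issuer="social-app", payload="follows:0xBOB")
registry.issue("0xALICE", issuer="blog-app", payload="post:hello-world")
registry.issue("0xBOB", issuer="social-app", payload="follows:0xALICE")

# Data issued by two different apps is discoverable from one address.
alice_data = registry.tokens_of("0xALICE")
```

The convenience depends entirely on every app agreeing on one registry and one payload format.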
However, this vision of decentralized identity, most recently relying on addresses and NFTs, quickly breaks down in practice. It is too rigid and doesn’t perform well as an identity system to manage and route to data at scale. On our rubric:
Dependable: 👎🏽 blockchains today, designed for consensus on scarce financial assets, cannot possibly scale to meet the demands of abundant data; nor can they handle off-chain (or partitioned) updates
Flexible: 👍🏽👎🏽 most on-chain ledgers enable new data structures and standards, but within fairly defined limits bound by the consensus system. This restricts the use cases and applications of this system
Accessible: 👎🏽 a single registry limits users and apps to a single DLT or blockchain, when inevitably different chains and networks will be used
We can learn from the shortfalls of primitive cryptographic identity systems to see what’s needed in a more tenable decentralized identity system. It’s clear a single registry (index), identifier standard, or data structure standard will always be too rigid.
It must work with a variety of identifiers. It must be open to a flexible, extensible set of data models and structures. It must work across contexts and networks on the web. It should be designed with the principle that identity is about managing and discovering information, and so it should put data first.
To manage data, we need a protocol that makes it easy to store, discover, and route information about an identifier. For Web3 to live up to its promise, that routing table should (a) be unified rather than siloed by application or any other boundary, and (b) be sovereign, giving control of data directly to each identifier.
This suggests a simple design: every identifier maintains a table of its own data. Unified, these identity-centric user tables form a distributed user table for the internet. This distributed user table is not an actual table but a virtual one that arises out of a few component parts that correspond to parts of a traditional user table:
Identifier: rather than an entry in an application database, a decentralized identifier should be provably unique and cryptographically controlled. Accessibility requires accepting multiple forms of identifiers across a variety of networks - a la the DID standard for decentralized identifiers.
Data Structures: similar to how application developers define their own data structures, a decentralized data layer needs to enable developers to define custom data models while ensuring that those models are reusable and stored publicly.
Index: users bring their identifier while applications define the data models. Standard indexes can combine these elements into a user table (or application table) so that when a user interacts with an application – creating data – that information is catalogued appropriately for future routing. This creates an easily-discoverable record of a user’s data — mapped to the data model and cryptographically associated to the identifier.
For this distributed user table, from the Ceramic blog: “Each user has full sovereign control over their row(s) and can bring that data with them to each app they visit. If an app wants to know what data is available and how to use it, they can reference the data model which will contain a name, description, and other metadata.”
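The three components above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not Ceramic’s API: the class and method names are hypothetical, the `did:example:` identifier is a placeholder, and the signature checks that would enforce each user’s control over their own rows are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataModel:
    """Developer-defined, publicly stored description of a data structure."""
    model_id: str
    name: str
    description: str
    fields: tuple


def is_did(identifier: str) -> bool:
    """Loose check for the general DID shape: did:<method>:<method-specific-id>."""
    parts = identifier.split(":", 2)
    return len(parts) == 3 and parts[0] == "did" and all(parts[1:])


class DistributedUserTable:
    """Virtual table: one row per identifier, one column per data model."""

    def __init__(self) -> None:
        self.models = {}  # model_id -> DataModel
        self.index = {}   # identifier -> {model_id -> record}

    def publish_model(self, model: DataModel) -> None:
        self.models[model.model_id] = model

    def write(self, identifier: str, model_id: str, record: dict) -> None:
        if not is_did(identifier):
            raise ValueError(f"not a DID: {identifier}")
        model = self.models[model_id]  # KeyError if the model was never published
        if set(record) != set(model.fields):
            raise ValueError(f"record does not match model {model.name!r}")
        self.index.setdefault(identifier, {})[model_id] = record

    def row(self, identifier: str) -> dict:
        """Everything catalogued for one identifier, keyed by data model."""
        return self.index.get(identifier, {})


table = DistributedUserTable()
table.publish_model(
    DataModel("profile-v1", "Basic profile", "Display name and bio", ("name", "bio"))
)

# One app writes the record...
table.write("did:example:alice", "profile-v1", {"name": "Alice", "bio": "Builder"})

# ...and a different app discovers it through the identifier and the shared model.
alice_row = table.row("did:example:alice")
profile_model = table.models["profile-v1"]
```

The second app needs no prior arrangement with the first: the identifier locates the row, and the model’s name, description, and fields say how to read it.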
How does this identity system, based on DIDs and data models and a distributed user table, hold up on our criteria?
Dependable: 👍🏽👍🏽 runs on a collection of public networks that anybody can participate in, including partitioned or local ones
Flexible: 👍🏽👍🏽 works with any data structure that a developer can define
Accessible: 👍🏽👍🏽 works with any open network and unique identifier
This system also has a number of additional properties that make for a highly flexible and reliable identity system. It is:
If “identity systems create the ability to reliably associate information with a unique identifier,” as described earlier, then we want an internet identity system that establishes the bare minimum protocol for managing and routing to trustworthy data, with everything else left to the ingenuity and diversity of application developers.
We want to avoid siloed systems - including specific applications, registries or blockchains - and maximize the flexibility of data types. We want an easy-to-use system that lets us build applications with rich forms of data, have that data associated to the appropriate identifier, and get maximum utility out of our identities and collective information.
This technology and model are newer, but growing rapidly. Thousands of Web3 developers are already building with this identity system using tools like Spruce and infrastructure like Ceramic. Not only will this model for identity and data help developers build more powerful and scalable Web3 apps much more quickly than ever, but it’s also how Web3 can collectively build a composable dataverse that Web2 platforms can’t possibly compete with.
Get started building with the DID datastore library or join the conversation in chat.ceramic.network.
Thanks to Mason, Jad, David and Kevin for the reviews, edits and inspirations for this post.