Identity graph data model, graph resolution algorithm, deterministic vs probabilistic matching, confidence scoring, and the D1 schema that powers it all.
A single customer uses 3.6 devices on average. They browse on their phone during their commute, research on a work laptop, and convert on a home desktop. Without cross-device identity resolution, each session looks like a different person, which inflates unique visitor counts, breaks attribution, and fractures behavioral profiles. This whitepaper details ClickStream's identity graph data model (nodes and edges), the graph resolution algorithm, the distinction between deterministic and probabilistic matching, the confidence scoring system, edge type definitions, the D1 database schema, and the complete identity resolution flow from anonymous visitor to unified customer profile.
When you view a visitor profile in the Visitors tab at einstein.clickstream.com, you see a unified timeline across all their devices — desktop, mobile, tablet — stitched together automatically. No login is required from the visitor. The platform resolves identity using deterministic signals (cookies, hashed emails) and probabilistic matching (behavioral biometrics, IP clustering). This whitepaper explains the resolution algorithms that make it work.
The average customer journey spans multiple devices and sessions. Consider this scenario: a user sees a paid search ad on their phone, clicks through and browses for 2 minutes, leaves. Later that day, they search for the brand on their work laptop, read pricing documentation for 10 minutes, and sign up for a webinar using their work email. Two days later, they visit the site on their home desktop, log in, and complete a purchase.
Without identity resolution, analytics reports this as three separate visitors with three separate journeys. The paid search ad gets no attribution. The webinar signup is unconnected to the purchase. The behavioral profile is fragmented across three incomplete records.
Cross-device identity resolution is the process of connecting these fragmented signals into a single, unified customer profile. It is the prerequisite for accurate attribution, complete behavioral profiling, and reliable customer analytics.
ClickStream models identity as a directed graph where nodes represent identity signals and edges represent observed relationships between signals. This graph-based approach is more flexible than a flat table because it naturally handles many-to-many relationships (one person can have multiple devices, and one device can be shared by multiple people).
| Node Type | Identifier | Persistence | Uniqueness | Source |
|---|---|---|---|---|
| Cookie | First-party visitor ID | 400 days | Unique per browser+device | Server-set cookie via CNAME proxy |
| Email Hash | SHA-256 of normalized email | Permanent | Unique per person (ideally) | Form submission, login, checkout |
| Phone Hash | SHA-256 of normalized phone | Permanent | Unique per person | Form submission, account creation |
| Device Signature | Hash of device attributes | Session to days | Medium (shared across similar devices) | Screen, timezone, language, platform, GPU |
| IP Cluster | IP address or /24 subnet | Dynamic | Low (shared across household/office) | Request headers |
| External ID | CRM ID, SSO ID, etc. | Permanent | Unique (from external system) | API integration, tag parameter |
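The Email Hash and Phone Hash rows above can be illustrated with a minimal sketch. The normalization rules here (trim and lowercase for email, digits only for phone) are assumptions for illustration, since the rules ClickStream actually applies are not specified, and the function names are hypothetical.

```python
import hashlib
import re


def normalize_email(raw: str) -> str:
    # Assumed normalization: strip surrounding whitespace and lowercase.
    return raw.strip().lower()


def normalize_phone(raw: str) -> str:
    # Assumed normalization: drop everything that is not a digit.
    return re.sub(r"\D", "", raw)


def sha256_hex(value: str) -> str:
    # SHA-256 over the UTF-8 bytes, returned as a 64-character hex string.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def email_hash(raw: str) -> str:
    return sha256_hex(normalize_email(raw))


def phone_hash(raw: str) -> str:
    return sha256_hex(normalize_phone(raw))
```

Normalizing before hashing is what lets the same address typed with different casing or spacing on two devices map to the same node.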
Edges connect two nodes and carry a confidence score (0.0–1.0) representing the system's belief that the two nodes belong to the same person. Edge types determine the initial confidence:
| Edge Type | From Node | To Node | Base Confidence | Description |
|---|---|---|---|---|
| login | Cookie | Email Hash | 0.99 | User logged in with this email on this cookie |
| form_submit | Cookie | Email Hash | 0.95 | User submitted a form with this email |
| checkout | Cookie | Email Hash | 0.98 | User completed checkout with this email |
| same_session | Cookie | Device Signature | 0.80 | Cookie and device signature observed in same session |
| ip_match | Cookie | IP Cluster | 0.40 | Cookie observed from this IP range |
| ip_temporal | Cookie | Cookie | 0.30 | Two cookies from same IP within short time window |
| signature_match | Cookie | Cookie | 0.60 | Two cookies with same device signature |
| external_link | Cookie | External ID | 0.95 | CRM/ad-tech system linked this cookie to external ID |
| email_match | Email Hash | Email Hash | 1.00 | Same email hash (identity anchor) |
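As a rough illustration of how the table above could be carried in code, the sketch below maps each edge type to its base confidence and builds an edge record from it. The `Edge` class, the `make_edge` helper, and the `type:value` node ID format are hypothetical names chosen for this example, not part of the ClickStream API.

```python
from dataclasses import dataclass

# Base confidences copied from the edge type table.
BASE_CONFIDENCE = {
    "login": 0.99,
    "form_submit": 0.95,
    "checkout": 0.98,
    "same_session": 0.80,
    "ip_match": 0.40,
    "ip_temporal": 0.30,
    "signature_match": 0.60,
    "external_link": 0.95,
    "email_match": 1.00,
}


@dataclass(frozen=True)
class Edge:
    from_node: str
    to_node: str
    edge_type: str
    confidence: float


def make_edge(from_node: str, to_node: str, edge_type: str) -> Edge:
    # Reject edge types that are not in the table instead of guessing a score.
    if edge_type not in BASE_CONFIDENCE:
        raise ValueError(f"unknown edge type: {edge_type}")
    return Edge(from_node, to_node, edge_type, BASE_CONFIDENCE[edge_type])
```

For example, `make_edge("cookie:abc", "email:9f2c", "login")` yields an edge with confidence 0.99.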
Deterministic matching creates high-confidence edges based on explicit identity signals. These edges are created when the visitor takes an action that directly reveals their identity:
- Login: a login edge (0.99) between the current cookie and the email hash.
- Form submission with an email: form_submit (0.95).
- Checkout with an email: checkout (0.98).
- login (0.99).

Deterministic edges are the strongest signals in the identity graph. A single deterministic edge can resolve a previously anonymous visitor into a known customer and link all their historical behavioral data to that customer profile.
Probabilistic matching creates lower-confidence edges based on statistical signals that suggest two nodes may belong to the same person:
When two cookies are observed from the same IP address within a short time window (e.g., 24 hours), they may belong to the same person using different browsers or devices. Base confidence: 0.30 (low, because IP addresses are shared across households and offices).
When two cookies share the same device signature (screen resolution, timezone, language, platform, GPU renderer), they may be the same device with cleared cookies. Base confidence: 0.60 (medium, because device signatures are not perfectly unique).
When two cookies exhibit similar behavioral patterns (same content affinity, similar navigation paths, similar engagement scores) from the same geographic region, they may be the same person. Base confidence: 0.20 (very low, used only as supporting evidence).
Probabilistic confidence increases when multiple independent signals corroborate each other:
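The exact combination rule is not given here, so the following is only one plausible sketch. It assumes the signals are independent and combines them with a noisy-OR, where the combined confidence is one minus the product of the individual miss probabilities. The function name `combine_confidences` is hypothetical.

```python
import math


def combine_confidences(confidences):
    """Noisy-OR combination: 1 - product of (1 - c) over all signals.

    Assumption: signals are statistically independent. This is an
    illustrative rule, not necessarily the one ClickStream uses.
    """
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
    return 1.0 - math.prod(1.0 - c for c in confidences)
```

Under this rule, an ip_temporal edge (0.30) plus a signature_match edge (0.60) combine to 0.72, and adding a behavioral similarity signal (0.20) raises the result to 0.776. No single weak signal is decisive, but agreement among several moves the score meaningfully.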
The graph resolution algorithm merges nodes that are connected by edges exceeding a confidence threshold. The algorithm runs incrementally on every new edge creation:
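One common way to implement threshold-gated incremental merging is a union-find (disjoint set) structure, sketched below. This is an illustration under assumptions rather than ClickStream's actual implementation: the class name, the method names, and the 0.7 threshold are all hypothetical.

```python
class IdentityGraph:
    """Union-find over identity nodes, merged only above a threshold."""

    def __init__(self, merge_threshold=0.7):
        # Hypothetical threshold; the real value is not documented here.
        self.merge_threshold = merge_threshold
        self.parent = {}

    def find(self, node):
        # Register unseen nodes as their own cluster root.
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[node] != root:
            next_node = self.parent[node]
            self.parent[node] = root
            node = next_node
        return root

    def add_edge(self, a, b, confidence):
        """Process one new edge; return True if two clusters were merged."""
        root_a, root_b = self.find(a), self.find(b)
        if confidence < self.merge_threshold or root_a == root_b:
            return False
        self.parent[root_b] = root_a
        return True

    def same_person(self, a, b):
        return self.find(a) == self.find(b)
```

With this sketch, a form_submit edge (0.95) from a laptop cookie and a login edge (0.99) from a desktop cookie to the same email hash place both cookies in one cluster, while an ip_temporal edge (0.30) alone leaves a phone cookie unmerged.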
When the algorithm detects that two previously separate identity clusters should be merged (e.g., a visitor logs in on their phone and the email hash connects their phone cookie to their desktop cookie), it executes a cluster merge:
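A cluster merge can be sketched as folding one cluster's members into another and re-pointing each node at the surviving cluster. The data layout (a dict of cluster ID to node set, plus a node-to-cluster index) and the rule that the larger cluster survives are assumptions made for this example.

```python
def merge_clusters(clusters, node_to_cluster, keep_id, drop_id):
    """Fold one cluster into another and return the surviving cluster ID.

    clusters:        dict mapping cluster ID -> set of node IDs
    node_to_cluster: dict mapping node ID -> cluster ID
    """
    if keep_id == drop_id:
        return keep_id
    # Assumption: the larger cluster survives, so fewer nodes are re-pointed.
    if len(clusters[drop_id]) > len(clusters[keep_id]):
        keep_id, drop_id = drop_id, keep_id
    for node in clusters.pop(drop_id):
        clusters[keep_id].add(node)
        node_to_cluster[node] = keep_id
    return keep_id
```

In the phone and desktop example, the phone cookie's single-node cluster is absorbed into the cluster that already holds the desktop cookie and the email hash.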
The identity graph is stored in Cloudflare D1, a SQLite database at the edge. The schema consists of three core tables:
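The table definitions are not reproduced here, so the sketch below shows one hypothetical three-table layout (clusters, nodes, edges) that would fit the model described above. All table and column names are assumptions apart from last_seen and the per-site partitioning mentioned in this whitepaper. Python's built-in sqlite3 module stands in for D1 purely for local illustration; production code would reach D1 through its own bindings.

```python
import sqlite3

# Hypothetical schema: names and columns are illustrative, not ClickStream's.
SCHEMA = """
CREATE TABLE identity_clusters (
    cluster_id TEXT PRIMARY KEY,
    site_id    TEXT NOT NULL,
    created_at INTEGER NOT NULL
);
CREATE TABLE identity_nodes (
    node_id    TEXT PRIMARY KEY,
    site_id    TEXT NOT NULL,
    node_type  TEXT NOT NULL,
    cluster_id TEXT REFERENCES identity_clusters(cluster_id),
    first_seen INTEGER NOT NULL,
    last_seen  INTEGER NOT NULL
);
CREATE TABLE identity_edges (
    from_node  TEXT NOT NULL REFERENCES identity_nodes(node_id),
    to_node    TEXT NOT NULL REFERENCES identity_nodes(node_id),
    edge_type  TEXT NOT NULL,
    confidence REAL NOT NULL CHECK (confidence BETWEEN 0.0 AND 1.0),
    created_at INTEGER NOT NULL,
    PRIMARY KEY (from_node, to_node, edge_type)
);
CREATE INDEX idx_nodes_cluster ON identity_nodes (site_id, cluster_id);
"""


def open_graph_db():
    # In-memory database so the sketch runs without any files.
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

The CHECK constraint keeps every stored confidence inside the 0.0 to 1.0 range, and the index on (site_id, cluster_id) supports loading all nodes of one cluster within a single site.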
The complete identity resolution flow executes on every incoming event at the edge:
last_seen and proceed to edge evaluation.

Some customers operate multiple domains (e.g., marketing-site.com and app.product.com). ClickStream supports cross-site identity resolution while maintaining data isolation:
When a visitor logs in on one domain and the same email hash appears on another domain (both belonging to the same customer), the identity graph creates a cross-site edge. This requires both domains to be configured under the same ClickStream account.
| Signal | Cross-Site Confidence | Notes |
|---|---|---|
| Same email hash | 0.99 | Strongest cross-site signal |
| Same phone hash | 0.95 | Strong cross-site signal |
| Same external ID | 0.95 | CRM or SSO integration |
| Same device signature + IP | 0.50 | Moderate (could be shared device) |
| Same IP only | 0.15 | Very weak (office networks) |
Cross-device identity resolution transforms fragmented analytics into unified customer intelligence. ClickStream's graph-based identity model — with typed nodes, weighted edges, confidence scoring, and incremental resolution — handles the complexity of real-world customer journeys where one person uses multiple devices, clears cookies, switches browsers, and interacts across multiple company domains.
The key architectural decisions that make this work are: first-party cookies as the primary anchor (high persistence, high confidence), deterministic matching via email hash as the cross-device bridge, probabilistic matching as a supplementary signal with conservative confidence thresholds, and incremental graph resolution that executes on every event without batch processing.
The D1 schema is designed for edge-first execution: simple, denormalized where needed for performance, and partitioned by site for data isolation. The identity graph is not a data warehouse feature that runs nightly. It is a real-time system that resolves identity on every page view, every form submission, and every login event — at the edge, in under 3ms.