The Vendor Lock-In You Don't Talk About
Every SaaS analytics platform makes the same implicit deal: give us your clickstream data, and we'll give you dashboards. It sounds reasonable until you realize what you've actually agreed to.
When you use Google Analytics, Mixpanel, Amplitude, or Heap, you are:
- Sending your raw behavioral data to a third party. Every click, scroll, page view, and conversion event flows to their servers. They store it. They process it. They own the infrastructure.
- Accepting their data model. You see the data in the dimensions and metrics they define. Want to combine signals in a way their UI doesn't support? Build a custom report -- if their API allows it.
- Paying for access to your own data. Export limits, API rate limits, data retention limits. Your data, their rules.
- Training their models with your data. Google explicitly uses GA4 data to improve ad targeting. Your behavioral data makes their ad platform better -- for their other customers, including your competitors.
If you're not paying for the product, you are the product. If you are paying for the product and they're still using your data, you're both the customer and the product.
What Data Ownership Actually Means
Data ownership isn't a philosophical concept. It has concrete, technical requirements:
| Requirement | SaaS Analytics (GA4, Mixpanel) | ClickStream |
|---|---|---|
| Data stored on your infrastructure | No -- their cloud | Yes -- managed infrastructure with full Parquet export |
| Raw event access | Limited (BigQuery export for GA4, with quotas) | Full -- Parquet files with every field |
| Data retention you control | No -- 14 months max for GA4 free | Yes -- keep data forever |
| Export without vendor permission | API rate limits, export quotas | Export anytime, your schedule, standard Parquet format |
| Encryption keys you control | No | Yes -- AES-256-GCM per site |
| Can switch vendors without losing history | Difficult -- proprietary formats | Yes -- standard Parquet |
| Vendor can't access your data | No -- they process it | Yes -- encrypted at rest |
ClickStream processes your data at the edge and stores it on managed Cloudflare R2 infrastructure. You can export your full dataset as standard Apache Parquet files at any time. You own your data. You own the encryption keys. You can read exported Parquet files with any tool that supports the format -- which is every modern data tool.
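To make "any tool can read it" concrete, here is a minimal sketch of querying an export with DuckDB's Python API. The ./exports/ path is an assumption; point it at wherever your Parquet files land.

# Minimal sketch: query a ClickStream Parquet export with DuckDB.
# The ./exports/ path is hypothetical.
import duckdb

duckdb.sql("""
    SELECT event_type,
           COUNT(*) AS events,
           AVG(intent_score) AS avg_intent
    FROM read_parquet('./exports/*.parquet')
    GROUP BY event_type
    ORDER BY events DESC
""").show()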
The Parquet Advantage
Why Parquet? Because it's the de facto standard for analytical data, and choosing it is a deliberate anti-lock-in decision.
What Parquet Gives You
- Columnar storage: Queries that touch 3 columns out of 50 only read those 3 columns. Orders of magnitude faster than row-based formats for analytical queries (see the sketch after this list).
- Compression: Parquet files are typically 75-90% smaller than equivalent CSV or JSON. A month of behavioral data for a mid-traffic site might be 500MB in Parquet vs. 5GB in JSON.
- Schema evolution: Add new fields without breaking existing queries. Old files still work when you add new behavioral scores.
- Universal compatibility: DuckDB, Apache Spark, Pandas, Polars, BigQuery, Snowflake, Databricks, Athena, Presto -- everything reads Parquet.
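The columnar claim is easy to see in code. A minimal sketch with Polars, whose lazy scanner prunes unread columns at the file level; the path is an assumption:

# Minimal sketch of column pruning. scan_parquet is lazy, so selecting
# three columns means only those three columns' chunks are read from disk.
# The file path is hypothetical.
import polars as pl

df = (
    pl.scan_parquet("./exports/2026-01.parquet")
      .select(["visitor_id", "intent_score", "timestamp"])
      .collect()
)
print(df.head())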
ClickStream's Parquet Schema
Every exported event includes the complete behavioral context:
-- ClickStream Parquet Export Schema
visitor_id STRING -- First-party cookie ID
session_id STRING -- Session identifier
timestamp TIMESTAMP -- Event time (UTC)
event_type STRING -- page_view, click, scroll, form, custom
page_url STRING -- Full URL
referrer STRING -- Previous page or external referrer
-- Identity signals
identity_signals STRING -- Pipe-delimited: cookie|hem|maid|click_ids
identity_confidence FLOAT -- 0.0-1.0 match confidence
device_type STRING -- desktop, mobile, tablet
browser STRING -- Chrome, Safari, Firefox, etc.
-- Behavioral scores (all 0-100)
intent_score INT
engagement_score INT
frustration_score INT
purchase_timing INT
churn_risk INT
content_affinity INT
velocity_score INT
loyalty_score INT
conversion_prob INT
session_quality INT
attention_score INT
nav_efficiency INT
price_sensitivity INT
social_proof_resp INT
bot_probability INT
channel_attribution INT
-- Attribution
utm_source STRING
utm_medium STRING
utm_campaign STRING
click_id STRING -- gclid|fbclid|msclkid|ttclid
click_id_type STRING -- Platform identifier
-- Metadata
site_id STRING -- Multi-tenant site identifier
exported_at TIMESTAMP -- Export batch timestamp
This schema gives you everything: raw events, behavioral scores, identity signals, and attribution data. All in a format that any data tool can read without vendor-specific connectors or proprietary SDKs.
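If you want to verify that schema against a real export without loading any rows, the Parquet footer carries it. A sketch with PyArrow; the file name is an assumption:

# Minimal sketch: read an export's schema from the Parquet footer,
# no row data loaded. The file name is hypothetical.
import pyarrow.parquet as pq

schema = pq.read_schema("./exports/events-2026-01-01.parquet")
for field in schema:
    print(f"{field.name}: {field.type}")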
Cost Comparison: SaaS Analytics vs. Owned Pipeline
The total cost of ownership (TCO) comparison is striking, especially at scale:
| Cost Category | GA4 (with BigQuery) | Mixpanel Growth | ClickStream |
|---|---|---|---|
| Base platform (100K MTU) | $0 (free tier) | $834/mo | $299/mo |
| Data export/warehouse | $200-500/mo (BigQuery) | $0 (included, limited) | $0 (Parquet export included) |
| Data retention beyond 14 months | BigQuery cost ($200-1000/mo) | Not available on Growth | R2 storage (~$15/TB/mo) |
| Identity resolution | Not included | Limited (email only) | Included (6-layer stack) |
| Behavioral scoring | Not included | Basic (3-4 metrics) | Included (26 models) |
| Data ownership | Google owns it | Mixpanel hosts it | You own it |
| Total (100K MTU, 1 year) | $2,400-18,000 | $10,008+ | $3,588 + ~$180 storage |
At 1M monthly tracked users, the gap widens further. SaaS analytics pricing scales with volume. R2 storage pricing is $0.015/GB/month. Your behavioral data for 1M users might cost $50/month to store indefinitely.
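The arithmetic behind that figure is short enough to show. In this back-of-the-envelope sketch, the per-user event volume and compressed event size are illustrative assumptions, not measurements:

# Back-of-the-envelope R2 storage cost. events_per_user and
# bytes_per_event are illustrative assumptions.
monthly_users = 1_000_000
events_per_user = 100        # assumed events per tracked user per month
bytes_per_event = 500        # assumed compressed Parquet size per event

gb_per_month = monthly_users * events_per_user * bytes_per_event / 1e9  # 50 GB
years_retained = 5
total_gb = gb_per_month * 12 * years_retained                           # 3,000 GB

monthly_cost = total_gb * 0.015  # R2 storage at $0.015/GB/month
print(f"{total_gb:,.0f} GB retained costs ${monthly_cost:.2f}/month")   # ~$45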
The AI/ML Angle: Your Data, Your Models
This is where data ownership becomes a competitive advantage, not just a cost optimization.
When you export your behavioral data as Parquet files, you can:
Train Custom Models
Use your historical behavioral data to train models specific to your business. A generic "purchase intent" model works. A model trained on your customers' purchase patterns works dramatically better. You know your domain. Generic vendor models don't.
# Example: Train a custom conversion model on your ClickStream data
import polars as pl
from sklearn.ensemble import GradientBoostingClassifier

# Read your exported ClickStream Parquet files
df = pl.read_parquet("s3://your-warehouse/clickstream/2026-01/*.parquet")

# Your behavioral scores become features
features = df.select([
    "intent_score", "engagement_score", "frustration_score",
    "purchase_timing", "velocity_score", "session_quality",
    "attention_score", "price_sensitivity"
]).to_pandas()

# "converted" is a label you derive yourself, e.g. joined in from your
# order data, since the export schema carries behavior, not outcomes.
labels = df.get_column("converted").to_numpy()

model = GradientBoostingClassifier(n_estimators=200)
model.fit(features, labels)

# Your model, trained on your data, predicting your conversions
print(f"Feature importance: {dict(zip(features.columns, model.feature_importances_))}")
Build Predictive Pipelines
Feed behavioral scores into your own ML pipelines for next-best-action recommendations, dynamic pricing, churn intervention timing, or content personalization. You can't do this with data locked in a vendor's dashboard.
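Continuing from the training sketch above (it reuses model and the same feature columns), here is roughly what the scoring half of such a pipeline looks like. The export path and the 0.7 cutoff are assumptions:

# Minimal sketch: score fresh sessions with the model trained above and
# surface likely converters for a next-best-action step. The path and
# the 0.7 cutoff are illustrative assumptions.
import polars as pl

score_cols = [
    "intent_score", "engagement_score", "frustration_score",
    "purchase_timing", "velocity_score", "session_quality",
    "attention_score", "price_sensitivity",
]

today = pl.read_parquet("s3://your-warehouse/clickstream/2026-02-01/*.parquet")
probs = model.predict_proba(today.select(score_cols).to_pandas())[:, 1]

likely = (
    today.with_columns(pl.Series("model_conv_prob", probs))
         .filter(pl.col("model_conv_prob") > 0.7)
)
print(likely.select(["visitor_id", "session_id", "model_conv_prob"]).head())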
Create Custom Audiences
Use behavioral clustering on your own data to create audience segments that no vendor's built-in segmentation can match. Combine behavioral scores with your CRM data, purchase history, and support interactions for a complete customer profile.
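A sketch of what that clustering step might look like on the exported scores. The choice of k=5 and the subset of score columns are assumptions you would tune against your own data:

# Minimal sketch: cluster visitors into behavioral audience segments.
# k=5 and the chosen score columns are assumptions to tune.
import polars as pl
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pl.read_parquet("./exports/2026-01/*.parquet")

# Average each visitor's scores so we cluster people, not events
per_visitor = df.group_by("visitor_id").agg(
    pl.col("intent_score").mean(),
    pl.col("engagement_score").mean(),
    pl.col("price_sensitivity").mean(),
    pl.col("loyalty_score").mean(),
)

X = StandardScaler().fit_transform(per_visitor.drop("visitor_id").to_pandas())
segments = KMeans(n_clusters=5).fit_predict(X)

per_visitor = per_visitor.with_columns(pl.Series("segment", segments))
print(per_visitor["segment"].value_counts())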
Run Competitive Analysis
Your behavioral data is proprietary. No competitor has it. Models trained on your specific customer behaviors give you insights that generic analytics can never provide. This is an actual competitive moat -- but only if you own the data.
The companies that will win the next decade of digital business are the ones that own their behavioral data and build proprietary models on top of it. You can't build a moat on rented land.
The Migration Path
Moving from SaaS analytics to an owned pipeline doesn't have to be a big-bang migration. Here's a pragmatic approach:
Phase 1: Parallel Collection (Week 1)
Add the ClickStream script tag alongside your existing analytics. Both systems collect data simultaneously. You lose nothing and start building your owned data lake from day one.
Phase 2: Validation (Weeks 2-4)
Compare ClickStream's behavioral data against your existing analytics. Verify that visitor counts, conversion tracking, and attribution data match. They should line up closely, with ClickStream typically showing higher visitor recognition thanks to first-party cookies.
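One way to run that comparison without leaving SQL, assuming the incumbent tool can export a daily visitors CSV. The CSV path and its "date"/"visitors" columns are hypothetical:

# Minimal sketch: compare daily unique visitors in the ClickStream export
# against a daily CSV exported from the incumbent tool. The CSV path and
# its "date"/"visitors" columns are hypothetical.
import duckdb

duckdb.sql("""
    SELECT CAST(cs.day AS DATE) AS day,
           cs.visitors AS clickstream,
           ga.visitors AS incumbent,
           cs.visitors - ga.visitors AS diff
    FROM (
        SELECT date_trunc('day', timestamp) AS day,
               COUNT(DISTINCT visitor_id) AS visitors
        FROM read_parquet('./exports/**/*.parquet')
        GROUP BY 1
    ) cs
    JOIN read_csv_auto('./incumbent_daily_visitors.csv') ga
      ON CAST(cs.day AS DATE) = ga.date
    ORDER BY day
""").show()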
Phase 3: Model Building (Weeks 4-8)
Start training custom models on your Parquet exports. Build dashboards using DuckDB or your preferred analytics tool. Create automated pipelines for your most important metrics.
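For the automated-pipeline piece, a sketch of a daily rollup that DuckDB can write straight back to Parquet. Paths and metric choices are assumptions:

# Minimal sketch: materialize a daily metrics mart from the raw export.
# The ./exports/ and ./marts/ paths and metric choices are illustrative.
import duckdb

duckdb.sql("""
    COPY (
        SELECT date_trunc('day', timestamp) AS day,
               COUNT(DISTINCT visitor_id)   AS visitors,
               AVG(engagement_score)        AS avg_engagement,
               AVG(conversion_prob)         AS avg_conversion_prob
        FROM read_parquet('./exports/**/*.parquet')
        GROUP BY 1
    ) TO './marts/daily_metrics.parquet' (FORMAT PARQUET)
""")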
Phase 4: Cutover (Week 8+)
Once your owned pipeline is producing better insights than your SaaS analytics (it will), remove the old tracking code. Your history already lives in standard Parquet files that any tool can read, so there is no painful vendor migration left to run.
Data Sovereignty and Compliance
Owning your analytics pipeline also simplifies compliance:
- GDPR Article 17 (Right to Erasure): When a user requests deletion, you delete from your storage. No waiting for a vendor's support ticket queue. No wondering if they actually deleted it (see the sketch after this list).
- GDPR Article 20 (Data Portability): Export the user's data from your Parquet files in any format. It's your data, in a standard format, on your infrastructure.
- CCPA/CPRA: Full visibility into what data you hold about each consumer. No vendor black box.
- Data Residency Requirements: ClickStream supports data residency requirements. EU data stays in EU. US data stays in US. Export your data to your own regional infrastructure at any time.
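Here is a minimal sketch of handling an Article 17 request against your own Parquet storage. The directory layout and the visitor_id value are hypothetical:

# Minimal sketch: honor an erasure request by rewriting affected Parquet
# files without the visitor's rows. Paths and the ID are hypothetical.
import glob
import polars as pl

visitor_to_erase = "v_abc123"  # hypothetical ID from the erasure request

for path in glob.glob("./exports/**/*.parquet", recursive=True):
    df = pl.read_parquet(path)
    kept = df.filter(pl.col("visitor_id") != visitor_to_erase)
    if kept.height < df.height:
        kept.write_parquet(path)  # rewrite in place, minus the erased rows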
With ClickStream, every site's data is encrypted with AES-256-GCM using per-site encryption keys. Even ClickStream can't read your data at rest. This is data sovereignty in the strictest sense: you hold the keys, you hold the data, you control the access.
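To make the per-site-key model concrete, here is an illustrative round trip with Python's cryptography library. It shows the AES-256-GCM pattern in general, not ClickStream's actual implementation:

# Illustrative AES-256-GCM round trip with a per-site key. This shows
# the pattern, not ClickStream's actual implementation.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

site_key = AESGCM.generate_key(bit_length=256)  # one 256-bit key per site
nonce = os.urandom(12)                          # unique per encryption

event = b'{"visitor_id": "v_abc123", "intent_score": 87}'
ciphertext = AESGCM(site_key).encrypt(nonce, event, None)

# Without site_key, the stored ciphertext is unreadable, including by
# whoever operates the storage layer.
assert AESGCM(site_key).decrypt(nonce, ciphertext, None) == event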
The Bottom Line
The analytics industry has normalized a model where vendors collect your data, store it on their infrastructure, and charge you to access it. They use your data to improve their products. They limit your access through API quotas and export restrictions. And when you want to leave, your historical data either stays behind or comes out in a proprietary format that's expensive to migrate.
There's a better model:
- Collect data using first-party infrastructure -- your domain, your cookies, your edge workers
- Process at the edge -- behavioral scoring in under 3ms, no origin round-trips
- Export your data anytime -- standard Parquet files for use in your own tools and warehouses
- Build your models -- train on your data, create proprietary intelligence
- Control your compliance -- your encryption keys, your data residency, your retention policies
Your analytics data is one of the most valuable assets your company produces. Stop renting access to it. Own it.