The Art Market Graph

Unified Knowledge Architecture for an Opaque Market

Version 1.0 | Last Updated: January 2026

Abstract

Problem: The art market suffers from extreme opacity—no centralized price database (unlike stocks), fragmented data across auction houses/news/museums, and identity ambiguity across dozens of name variants per artist.

Solution: Egon creates a unified data universe through a 7-tier canonical artist resolution system achieving a 99.98% link rate (4,980 of 4,981 records linked; 1 orphan) across 10+ disparate sources, enabling cross-dataset intelligence impossible elsewhere.

Innovation: Source-aware resolution routing (Getty lookups skipped for user input, required for professional sources, trading speed against accuracy), tier-aware signal weighting (Discovery-tier signals weighted significantly higher than expected Blue-chip performance), 4-dimensional orthogonal embeddings for weighted combination queries, and <300ms resolution across 30+ data sources.

Impact: Multi-dimensional search queries impossible on public web ("aesthetically like X but cheaper"), real-time cross-source momentum detection (5-signal confluence scoring), universal deduplication across all name variants.

1. The Art Market Data Problem

The art market is opaque by design. Unlike public financial markets with centralized reporting (Bloomberg Terminal, real-time quotes, mandatory disclosure), the art market operates on controlled information asymmetry: private gallery sales, paywalled auction databases, selective data release.

The same artist appears in dozens of name variants across sources. Building canonical records that deduplicate artist identities across data sources enables unified intelligence—but requires sophisticated resolution logic to balance speed, accuracy, and data quality across professional sources (news, auctions) vs. user input (autocomplete).

  • Stock Market (Bloomberg Terminal): centralized price database, real-time quotes, mandatory disclosure
  • Art Market (intentional opacity): private sales, paywalled data, selective release, gatekept attribution

The Core Challenge: Detecting emerging artist momentum requires unified intelligence across auction data, news signals, institutional validation, and collector behavior—before the broader market recognizes the pattern. This is only possible with canonical artist identities that link disparate data sources with high accuracy and sub-second latency.

2. Multi-Source Data Ingestion

Egon operates a multi-source ingestion layer: news intelligence pipeline (daily RSS extraction from curated art publications), auction data integration (hybrid scraping with fallback methods), institutional validation APIs (museum collections with two-stage false positive filtering), and authoritative identity services (Getty ULAN for biographical data and name variants).

Tier-Aware Signal Weighting

Different signals carry different predictive weight depending on artist tier. This tier-aware relevance system is the key innovation enabling early momentum detection:

Signal Type | Discovery Tier Multiplier | Blue-chip Tier Multiplier | Rationale
--- | --- | --- | ---
gallery_move | Highest | Minimal | Rare for emerging artists, highly predictive of tier elevation
museum_acquisition | Very High | Low | Career-defining for Discovery tier, routine for established artists
auction_record | Low | High | Expected for Blue-chip, signals market strength when achieved
retrospective | High | Neutral | Exceptional institutional validation for emerging artists
price_trend | Moderate-High | Moderate | Discovery price appreciation signals momentum; expected for Blue-chip
artist_milestone | Moderate | Minimal | Awards/residencies critical for emerging artists, less so for established
market_shift | Neutral | Moderate | Macro trends affect established artists more significantly

Weight Derivation: Multipliers derived from backtesting 2,000+ artist trajectories over 5-year periods. Discovery tier weights optimized for <18-month predictive horizon (tier elevation events), Blue-chip weights optimized for market timing (buy/hold/sell signals). Weights recalibrated quarterly based on prediction accuracy.

Freshness Weighting: Signal age affects relevance:

  • Breaking (0-7 days): Highest multiplier (recent events weighted more heavily)
  • Current (8-30 days): Baseline multiplier
  • Evergreen (31+ days): Reduced multiplier with progressive decay over time

Signal Extraction: Each extracted signal undergoes Claude NLP analysis to determine base strength (1-10 scale), tier relevance classification, and freshness scoring. Tier-aware multipliers and temporal decay are then applied via a proprietary weighting algorithm to generate final actionable scores.
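As a concrete illustration, the tier-aware multiplier and freshness decay described above might combine as follows. This is a minimal sketch: the numeric multipliers and the per-week decay rate are illustrative placeholders, not Egon's proprietary weights.

```python
# Illustrative sketch of tier-aware signal weighting with freshness decay.
# All multiplier values below are placeholders, not production weights.

# Qualitative labels from the table mapped to example numbers
TIER_MULTIPLIERS = {
    ("gallery_move", "discovery"): 3.0,   # "Highest"
    ("gallery_move", "blue_chip"): 0.5,   # "Minimal"
    ("auction_record", "discovery"): 0.8, # "Low"
    ("auction_record", "blue_chip"): 2.0, # "High"
}

def freshness_multiplier(signal_age_days: int) -> float:
    """Breaking > current > evergreen, with progressive decay."""
    if signal_age_days <= 7:
        return 1.5                         # breaking: weighted more heavily
    if signal_age_days <= 30:
        return 1.0                         # current: baseline
    weeks_past = (signal_age_days - 30) / 7
    return max(0.2, 0.95 ** weeks_past)    # evergreen: ~5% decay per week (illustrative)

def weighted_signal_score(base_strength: float, signal_type: str,
                          artist_tier: str, signal_age_days: int) -> float:
    """Combine base strength (1-10 scale) with tier and freshness multipliers."""
    tier_mult = TIER_MULTIPLIERS.get((signal_type, artist_tier), 1.0)
    return base_strength * tier_mult * freshness_multiplier(signal_age_days)
```

With these placeholder values, a breaking gallery move for a Discovery-tier artist scores far above the same event for a Blue-chip artist, matching the table's intent.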

Data Universe Architecture

graph TB
  subgraph Sources["Data Sources"]
    RSS["News Intelligence"]
    AUC["Auction Data"]
    MUS["Institutional Validation"]
    GET["Authoritative Identity"]
  end
  subgraph Ingestion["Ingestion Layer"]
    NF["News Processing"]
    AS["Auction Integration"]
    MD["Museum Service"]
    GS["Identity Service"]
  end
  subgraph Resolution["Canonical Resolution Pipeline"]
    T1["Tier 1: Exact Match"]
    T2["Tier 2: Alias Lookup"]
    T3["Tier 3: Fuzzy Matching"]
    T4["Tier 4: Authoritative Lookup"]
    T7["Tier 7: Create New"]
  end
  subgraph Database["Unified Database"]
    CA["Canonical Identity Hub"]
    LINKED["All Data Tables<br/>(linked via canonical ID)"]
  end
  subgraph VectorLayer["Vector Search Layer"]
    VE["Multi-Dimensional Embeddings"]
    IDX["Optimized Indexes"]
  end
  RSS --> NF
  AUC --> AS
  MUS --> MD
  GET --> GS
  NF --> T1
  AS --> T1
  MD --> T1
  GS --> T4
  T1 --> T2
  T2 --> T3
  T3 --> T4
  T4 --> T7
  T7 --> CA
  CA --> LINKED
  CA --> VE
  VE --> IDX
  style Sources fill:#e8f4f8
  style Ingestion fill:#fff4e6
  style Resolution fill:#f0f8ff
  style Database fill:#f0fff4
  style VectorLayer fill:#fff0f6

3. Canonical Artist Resolution System

The heart of Egon's data universe: a 7-tier matching pipeline achieving 99.98% link rate (4,980 AuctionArtist records linked, 1 orphan; 100% MarketSignal records linked).

Source-Aware Resolution Routing: The Latency/Accuracy Tradeoff

Authoritative verification provides high confidence but incurs latency. The key innovation: route authoritative lookups based on source type to optimize for speed (user autocomplete) or accuracy (professional data ingestion).

graph TD
  START[Input: Artist Name] --> NORM["Normalize:<br/>lowercase, strip accents, trim"]
  NORM --> T1{"Tier 1:<br/>Exact Match?"}
  T1 -->|"Yes<br/>High Confidence"| DONE[Return canonical_artist_id]
  T1 -->|No| T2{"Tier 2:<br/>Alias Match?"}
  T2 -->|"Yes<br/>High Confidence"| DONE
  T2 -->|No| T3{"Tier 3:<br/>Fuzzy Similarity<br/>Above Threshold?"}
  T3 -->|"Yes<br/>Variable Confidence"| DONE
  T3 -->|No| PROF{"Source<br/>Type?"}
  PROF -->|"Professional<br/>news/auction"| T4{"Tier 4:<br/>Authoritative<br/>Match?"}
  PROF -->|"User Input<br/>autocomplete"| SKIP_GETTY["Skip Authority<br/>speed optimization"]
  T4 -->|"Yes<br/>Authoritative Confidence"| CREATE_ALIAS["Create alias<br/>from authority variants"]
  T4 -->|No| T7["Tier 7:<br/>Create New<br/>CanonicalArtist"]
  CREATE_ALIAS --> DONE
  SKIP_GETTY --> T7
  T7 --> DONE
  style T1 fill:#d4edda
  style T2 fill:#d4edda
  style T3 fill:#fff3cd
  style T4 fill:#cfe2ff
  style T7 fill:#f8d7da
  style DONE fill:#d1ecf1
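The routing in the diagram can be sketched in Python. This is a minimal illustration: the index structures, the threshold value, and the resolver callbacks are assumptions, not Egon's implementation.

```python
# Sketch of source-aware canonical artist resolution routing.
# Index structures, threshold, and resolver callbacks are illustrative assumptions.
import unicodedata

FUZZY_THRESHOLD = 0.6  # placeholder; production threshold is tuned for >99% precision

def normalize(name: str) -> str:
    """Lowercase, strip accents, trim (the pipeline's normalization step)."""
    stripped = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return stripped.lower().strip()

def resolve_artist(name, source_type, exact_index, alias_index,
                   fuzzy_match, authority_lookup, create_new):
    key = normalize(name)
    if key in exact_index:                       # Tier 1: exact match
        return exact_index[key]
    if key in alias_index:                       # Tier 2: alias lookup
        return alias_index[key]
    fuzzy = fuzzy_match(key, FUZZY_THRESHOLD)    # Tier 3: trigram similarity
    if fuzzy is not None:
        return fuzzy
    if source_type in ("news", "auction"):       # professional sources only:
        authoritative = authority_lookup(key)    # Tier 4 authority check
        if authoritative is not None:
            return authoritative
    # User autocomplete skips the authority round-trip for latency.
    return create_new(name)                      # Tier 7: new canonical artist
```

The only branch that differs by source type is Tier 4: professional ingestion pays the authority-lookup latency, user autocomplete falls straight through to Tier 7.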

Data Freshness System (Automatic TTL Management)

Egon's sophisticated multi-tier TTL system automatically maintains data freshness across 10+ data sources without manual intervention:

Data Type | TTL Strategy | Invalidation Trigger | Rationale
--- | --- | --- | ---
News Signals | Daily extraction | Daily RSS feed harvesting (9 sources) | Market events require real-time detection (gallery moves, auction records, museum acquisitions)
Auction Resolution Cache | Extended TTL | Automatic expiration on periodic schedule | Auction artists change slowly; high cache hit rate reduces authority API calls
Museum/Getty Data | Extended TTL | Automatic expiration + periodic authoritative updates | Institutional holdings stable; authoritative biographies rarely change
Full Analysis Cache | Tiered by user subscription | Automatic TTL + preference change + market event | Balances data freshness with API efficiency; user preference changes trigger immediate invalidation
Artwork Images (CDN) | Incremental collection | Periodic LLM-powered discovery (5 strategies) + continuous market data ingestion | Image library continuously expands with LLM-selected representative artwork examples per artist; Cloudinary CDN with 4 optimized presets (thumbnail, display, analysis, archive)
Auction Context | Real-time (no cache) | Every query fetches latest | Auction performance trends are time-sensitive investment signals

Why Automatic TTL Enables Custom Deep Research:

  • Autonomous data invalidation: System determines when data is stale without manual intervention. News signals refresh daily, museum data every 90 days, analysis cache at user-tier-specific intervals.
  • On-demand LLM research: When cache expires, Claude performs extensive multi-query OSINT research + database enrichment—research depth impossible with static pre-cached data.
  • Cost-performance balance: 85-98% cache hit rates reduce API costs (Getty ULAN, museum APIs, image collection) while preserving freshness where market timing matters (news, auction context).
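A per-data-type TTL check can be sketched as follows. Only the daily news refresh and the ~90-day museum cycle are stated above; the other TTL values here, and the cache interface itself, are illustrative assumptions.

```python
# Sketch of a multi-tier TTL freshness check, with TTLs in seconds.
# Only the news (daily) and museum (~90 days) values come from the text;
# the rest are placeholders.
import time

TTL_SECONDS = {
    "news_signal": 24 * 3600,          # daily refresh
    "auction_resolution": 30 * 86400,  # extended TTL (placeholder value)
    "museum_getty": 90 * 86400,        # ~90-day authoritative refresh
    "auction_context": 0,              # real-time: never served from cache
}

def is_fresh(data_type, cached_at, now=None):
    """Return True if a cached entry of this type is still within its TTL."""
    ttl = TTL_SECONDS.get(data_type, 0)
    if ttl == 0:
        return False  # zero TTL means always refetch
    current = now if now is not None else time.time()
    return (current - cached_at) < ttl
```

A zero TTL models the "real-time (no cache)" auction-context row: every query falls through to a fresh fetch.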

Trigram Threshold Tuning (Optimized Balance)

The trigram similarity threshold balances precision (avoiding false matches) vs. recall (catching valid typos):

  • Production threshold: High precision (>99%), strong recall (>94%) — current optimized setting
  • Lower threshold: Reduced precision (~95%), higher recall (~98%) — increased false positives like "Pablo Picasso" matching "Pablo Neruda"
  • Higher threshold: Extreme precision (>99.5%), reduced recall (~89%) — misses valid typos like "Pablo Piccaso"

Why Search Canonical Names Only: Searching aliases creates false positives (e.g., "Pablo" matches both "Pablo Picasso" canonical name and "P. Picasso" alias → ambiguous). Canonical-only search ensures 1:1 matching.
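For intuition, trigram similarity can be computed in pure Python as set overlap over 3-character substrings. Note this is a simplified sketch: Postgres pg_trgm additionally pads strings before extracting trigrams, which is omitted here.

```python
# Simplified trigram (set-overlap) similarity, illustrating threshold behavior.
# pg_trgm pads strings before extracting trigrams; this sketch does not.
def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    """Shared trigrams divided by total distinct trigrams (Jaccard overlap)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Under this measure the typo "Pablo Piccaso" scores well above the unrelated "Pablo Neruda" against "Pablo Picasso", which is exactly the gap the production threshold is placed inside.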

Confidence Scoring System

Each resolution tier assigns confidence scores used by downstream logic (investment grading, user recommendations):

Tier | Confidence Range | Downstream Use
--- | --- | ---
Tier 1 (Exact) | 1.0 | No confidence penalty in investment scoring
Tier 2 (Alias) | 1.0 | Verified alias, full confidence
Tier 3 (Trigram) | Variable (threshold-based) | High similarity: full confidence; near-threshold similarity: flagged for manual review
Tier 4 (Getty) | Very High | Authoritative source (Getty ULAN), near-perfect confidence
Tier 7 (New) | Low (unverified) | New canonical artist; requires manual verification before high-confidence use

Investment Grading Impact: Confidence < 0.7 downgrades investment grade by 1 letter (A+ → A). Confidence < 0.5 blocks investment grading entirely (returns "Insufficient Data"). See Investment Science whitepaper for complete grading methodology.
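The confidence gate above can be sketched directly; the grade ladder here is an assumed example, and the complete methodology is in the Investment Science whitepaper.

```python
# Sketch of the confidence gate: <0.5 blocks grading entirely,
# <0.7 downgrades by one letter. The grade ladder is an illustrative assumption.
GRADES = ["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D"]

def apply_confidence_gate(raw_grade: str, confidence: float) -> str:
    if confidence < 0.5:
        return "Insufficient Data"          # grading blocked
    if confidence < 0.7:
        idx = GRADES.index(raw_grade)
        return GRADES[min(idx + 1, len(GRADES) - 1)]  # one-letter downgrade
    return raw_grade
```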

Performance Metrics (Production)

  • 99.98% auction link rate: 4,980/4,981 AuctionArtist records linked
  • 100% news signal link rate: all MarketSignal records linked (100% professional-source accuracy)
  • <300ms resolution speed: typical resolution time (excluding Getty; 90%+ cache hit)
  • 10+ tables universally linked: Watchlist, Collection, Analysis, Auction, News, Lifecycle

4. Cross-Dataset Intelligence

Canonical artist resolution enables sophisticated cross-dataset queries impossible with fragmented data. The key capability: single canonical ID links 10+ database tables for unified intelligence.

Universal Canonical Linking

erDiagram
  CanonicalArtist ||--o{ AuctionArtist : "canonical_artist_id"
  CanonicalArtist ||--o{ MarketSignal : "canonical_artist_id"
  CanonicalArtist ||--o{ ArtistData : "canonical_artist_id"
  CanonicalArtist ||--o{ Watchlist : "canonical_artist_id"
  CanonicalArtist ||--o{ Collection : "canonical_artist_id"
  CanonicalArtist ||--o{ AnalysisHistory : "canonical_artist_id"
  CanonicalArtist ||--o{ ArtworkLifecycleEvent : "canonical_artist_id"
  CanonicalArtist {
    int id PK
    string canonical_name
    string authority_id
    boolean verified
    jsonb structured_metadata
    vector embeddings
  }

Multi-Source Signal Aggregation

Confluence scoring requires data from 5 independent sources—impossible without canonical linking. For complete algorithm specification with weight derivation and contrasting examples, see Investment Science whitepaper Section 3.

# Conceptual illustration of multi-source confluence scoring
def calculate_confluence_score(canonical_id, artist_tier):
    # Single canonical ID enables cross-dataset queries
    signals = gather_signals_from_all_sources(canonical_id)

    # Apply proprietary tier-aware weighting for this artist's tier
    weighted_signals = apply_tier_aware_weights(signals, artist_tier)

    # Multi-source confirmation bonus
    base_score = aggregate_weighted_signals(weighted_signals)
    confirmation_bonus = calculate_multi_source_bonus(len(signals))

    return normalize_to_scale(base_score * confirmation_bonus)  # 1-10 scale

Without canonical resolution: Would require querying each table with multiple name variants ("Pablo Picasso", "PICASSO, Pablo", "P. Picasso"), manual deduplication, potential double-counting, 10-100x slower execution.

Real-Time Deduplication

Canonical linking enables automatic deduplication across user activities. For example, a collector tracking "Pablo Picasso", "PICASSO, Pablo", and "P. Picasso" in their watchlist would see 3 separate entries without canonical resolution—but with canonical IDs, these automatically merge into a single artist record with unified engagement scoring.
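A minimal sketch of this merge, assuming a resolve() function that maps any name variant to its canonical ID:

```python
# Sketch of watchlist deduplication via canonical IDs.
# The entry list and resolve() mapping are illustrative.
from collections import defaultdict

def dedupe_watchlist(entries, resolve):
    """Group raw name-variant entries under a single canonical artist ID."""
    merged = defaultdict(list)
    for name in entries:
        merged[resolve(name)].append(name)
    return dict(merged)
```

All three Picasso variants collapse into one canonical record, so engagement scoring sees a single artist rather than three.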

5. Technical Architecture

Egon's data universe creates differentiated capabilities through three core technical innovations:

1. Source-Aware Resolution Routing

  • Authority skip for autocomplete: Fast resolution for user input vs. thorough verification for professional sources
  • Tiered caching strategy: Multiple cache tiers with TTLs optimized per data type for high cache hit rates
  • Confidence scoring: Variable confidence ranges affect downstream investment grading (low confidence = grade penalty)

2. Tier-Aware Signal Weighting

  • Discovery tier amplification: Gallery moves (highest weight), museum acquisitions (very high), retrospectives (high)
  • Blue-chip tier dampening: Auction records (high weight), market shifts (moderate), milestones (minimal)
  • Backtesting validation: Weights derived from 2,000+ artist trajectories, recalibrated quarterly

3. 4-Dimensional Orthogonal Embeddings

  • Independent dimensions: Low correlation between dimensions ensures weighted queries control specific attributes
  • Weighted combination queries: "Aesthetically like X but cheaper" impossible without orthogonal embeddings
  • Sub-200ms vector search: Optimized vector indexes for fast similarity queries
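A weighted combination query can be sketched as a weighted sum of per-dimension cosine similarities. The dimension names, vectors, and weights below are illustrative, not Egon's production embedding space.

```python
# Sketch of a weighted query over orthogonal embedding dimensions.
# Dimension names, vectors, and weights are illustrative placeholders.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def weighted_similarity(query, candidate, weights):
    """Sum per-dimension cosine similarities scaled by per-dimension weights."""
    return sum(
        weights[dim] * cosine(query[dim], candidate[dim])
        for dim in weights
    )
```

Because the dimensions are near-orthogonal, a high positive weight on the aesthetic dimension plus a negative weight on the price dimension approximates the "aesthetically like X but cheaper" query without the weights interfering with each other.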
  • Link rate: 99.98% (4,980/4,981 auction records; 100% news signals)
  • Resolution speed: <300ms typical (excluding Getty; 90%+ cache hit)
  • Vector search: <200ms for weighted 4D queries across 5,000+ artists

Network Effects

More data sources → higher canonical link rate → better cross-dataset intelligence → more valuable insights → more users → more user interaction data → improved tier-aware signal weights

graph LR
  A[Add New Data Source] --> B[More Artist Name Variants]
  B --> C["Higher Canonical Link Rate<br/>99.98%"]
  C --> D["Better Cross-Dataset Intelligence<br/>5-signal confluence"]
  D --> E["More Valuable Insights<br/>Momentum detection"]
  E --> F[More Users]
  F --> G["More User Interaction Data<br/>Watchlist, Collection, Analysis"]
  G --> H["Improved Tier-Aware Weights<br/>Backtesting recalibration"]
  H --> A
  style C fill:#d4edda
  style D fill:#fff3cd
  style H fill:#cfe2ff

Key Differentiators

  • 99.98% canonical link rate (4,980/4,981 auction records; 1 orphan)
  • Source-aware resolution routing (Getty skip for autocomplete, mandatory for professional sources)
  • Tier-aware signal weighting (Discovery tier signals weighted significantly higher than Blue-chip expected performance)
  • 4-dimensional orthogonal embeddings (r < 0.2 correlation enables weighted queries)
  • Multi-dimensional semantic search ("aesthetically like X but cheaper" impossible on public web)
  • Cross-dataset intelligence (5-signal confluence scoring across 10+ tables)
  • Real-time deduplication (across all name variants via single canonical ID)