The Art Market Graph

Unified Knowledge Architecture for an Opaque Market

Version 1.0 | Last Updated: January 2026

Abstract

Problem: The art market suffers from extreme opacity—no centralized price database (unlike stocks), fragmented data across auction houses/news/museums, and identity ambiguity across dozens of name variants per artist.

Solution: Egon creates a unified data universe through a 7-tier canonical artist resolution system that links 99.98% of auction artist records (4,980 of 4,981, leaving 1 orphan) across 10+ disparate sources, enabling cross-dataset intelligence that is not possible with fragmented data.

Innovation: Source-aware resolution routing (the Getty lookup is skipped for user autocomplete, trading some accuracy for speed, and kept for professional sources), tier-aware signal weighting (signals that are rare for Discovery tier artists are weighted significantly higher than the same signals for Blue-chip artists, where they are expected), four independent embeddings per artist for weighted combination queries, and <300ms typical resolution across 30+ data sources.

Impact: Multi-dimensional search queries impossible on public web ("aesthetically like X but cheaper"), real-time cross-source momentum detection (5-signal confluence scoring), universal deduplication across all name variants.

1. The Art Market Data Problem

The art market is opaque by design. Unlike public financial markets with centralized reporting (Bloomberg Terminal, real-time quotes, mandatory disclosure), the art market operates on controlled information asymmetry: private gallery sales, paywalled auction databases, selective data release.

The same artist appears in dozens of name variants across sources. Building canonical records that deduplicate artist identities across data sources enables unified intelligence—but requires sophisticated resolution logic to balance speed, accuracy, and data quality across professional sources (news, auctions) vs. user input (autocomplete).

Stock Market (Bloomberg Terminal): Centralized price database, real-time quotes, mandatory disclosure

Art Market (Intentional Opacity): Private sales, paywalled data, selective release, gatekept attribution

The Core Challenge: Detecting emerging artist momentum requires unified intelligence across auction data, news signals, institutional validation, and collector behavior—before the broader market recognizes the pattern. This is only possible with canonical artist identities that link disparate data sources with high accuracy and sub-second latency.

2. Multi-Source Data Ingestion

Egon operates a multi-source ingestion layer: news intelligence pipeline (daily RSS extraction from curated art publications), auction data integration (hybrid scraping with fallback methods), institutional validation APIs (museum collections with two-stage false positive filtering), and authoritative identity services (Getty ULAN for biographical data and name variants).

Tier-Aware Signal Weighting

Different signals carry different predictive weight depending on artist tier. This tier-aware relevance system is the key innovation enabling early momentum detection:

Signal Type	Discovery Tier Multiplier	Blue-chip Tier Multiplier	Rationale
gallery_move	Highest	Minimal	Rare for emerging artists, highly predictive of tier elevation
museum_acquisition	Very High	Low	Career-defining for Discovery tier, routine for established artists
auction_record	Low	High	Expected for Blue-chip, signals market strength when achieved
retrospective	High	Neutral	Exceptional institutional validation for emerging artists
price_trend	Moderate-High	Moderate	Discovery price appreciation signals momentum, Blue-chip expected
artist_milestone	Moderate	Minimal	Awards/residencies critical for emerging artists, less so for established
market_shift	Neutral	Moderate	Macro trends affect established artists more significantly

Weight Derivation: Multipliers derived from backtesting 2,000+ artist trajectories over 5-year periods. Discovery tier weights optimized for <18-month predictive horizon (tier elevation events), Blue-chip weights optimized for market timing (buy/hold/sell signals). Weights recalibrated quarterly based on prediction accuracy.

Freshness Weighting: Signal age affects relevance:

Breaking (0-7 days): Highest multiplier (recent events weighted more heavily)
Current (8-30 days): Baseline multiplier
Evergreen (31+ days): Reduced multiplier with progressive decay over time

Signal Extraction: Each extracted signal undergoes Claude NLP analysis to determine base strength (1-10 scale), tier relevance classification, and freshness scoring. Tier-aware multipliers and temporal decay are then applied via a proprietary weighting algorithm to generate final actionable scores.
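The scoring flow described above (base strength, then a tier multiplier, then a freshness multiplier) can be sketched roughly as follows. This is a minimal illustration, not Egon's proprietary algorithm: every numeric value, the exponential decay shape, and the names `TIER_MULTIPLIERS`, `Signal`, `freshness_multiplier`, and `score_signal` are assumptions introduced here, since the text publishes only qualitative levels.

```python
from dataclasses import dataclass

# ASSUMPTION: every number below is a placeholder. The text gives only
# qualitative levels ("Highest", "Minimal", ...), not numeric multipliers.
TIER_MULTIPLIERS = {
    "gallery_move": {"discovery": 3.0, "blue_chip": 0.5},
    "museum_acquisition": {"discovery": 2.5, "blue_chip": 0.75},
    "auction_record": {"discovery": 0.75, "blue_chip": 2.0},
}


@dataclass
class Signal:
    signal_type: str
    base_strength: float  # 1-10, as assigned by the NLP extraction step
    age_days: int


def freshness_multiplier(age_days: int) -> float:
    """Map signal age to a multiplier using the three freshness bands."""
    if age_days <= 7:    # Breaking
        return 1.5
    if age_days <= 30:   # Current (baseline)
        return 1.0
    # Evergreen: assumed exponential decay, halving every 90 days, floor 0.1
    return max(0.1, 0.5 ** ((age_days - 30) / 90))


def score_signal(signal: Signal, artist_tier: str) -> float:
    """Final score = base strength x tier multiplier x freshness multiplier."""
    tier_mult = TIER_MULTIPLIERS[signal.signal_type][artist_tier]
    return signal.base_strength * tier_mult * freshness_multiplier(signal.age_days)


move = Signal("gallery_move", base_strength=6, age_days=3)
print(score_signal(move, "discovery"))  # 27.0
print(score_signal(move, "blue_chip"))  # 4.5
```

With these placeholder values, the same three-day-old gallery move scores six times higher for a Discovery tier artist than for a Blue-chip artist, which is the asymmetry the table describes.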

Data Universe Architecture

graph TB
    subgraph Sources["Data Sources"]
        RSS["News Intelligence"]
        AUC["Auction Data"]
        MUS["Institutional Validation"]
        GET["Authoritative Identity"]
    end
    subgraph Ingestion["Ingestion Layer"]
        NF["News Processing"]
        AS["Auction Integration"]
        MD["Museum Service"]
        GS["Identity Service"]
    end
    subgraph Resolution["Canonical Resolution Pipeline"]
        T1["Tier 1: Exact Match"]
        T2["Tier 2: Alias Lookup"]
        T3["Tier 3: Fuzzy Matching"]
        T4["Tier 4: Authoritative Lookup"]
        T7["Tier 7: Create New"]
    end
    subgraph Database["Unified Database"]
        CA["Canonical Identity Hub"]
        LINKED["All Data Tables<br/>(linked via canonical ID)"]
    end
    subgraph VectorLayer["Vector Search Layer"]
        VE["Multi-Dimensional Embeddings"]
        IDX["Optimized Indexes"]
    end
    RSS --> NF
    AUC --> AS
    MUS --> MD
    GET --> GS
    NF --> T1
    AS --> T1
    MD --> T1
    GS --> T4
    T1 --> T2
    T2 --> T3
    T3 --> T4
    T4 --> T7
    T7 --> CA
    CA --> LINKED
    CA --> VE
    VE --> IDX
    style Sources fill:#e8f4f8
    style Ingestion fill:#fff4e6
    style Resolution fill:#f0f8ff
    style Database fill:#f0fff4
    style VectorLayer fill:#fff0f6

3. Canonical Artist Resolution System

The heart of Egon's data universe: a 7-tier matching pipeline achieving 99.98% link rate (4,980 AuctionArtist records linked, 1 orphan; 100% MarketSignal records linked).

Source-Aware Resolution Routing: The Latency/Accuracy Tradeoff

Authoritative verification provides high confidence but incurs latency. The key innovation: route authoritative lookups based on source type to optimize for speed (user autocomplete) or accuracy (professional data ingestion).

graph TD
    START[Input: Artist Name] --> NORM[Normalize:<br/>lowercase, strip accents, trim]
    NORM --> T1{Tier 1:<br/>Exact Match?}
    T1 -->|Yes<br/>High Confidence| DONE[Return canonical_artist_id]
    T1 -->|No| T2{Tier 2:<br/>Alias Match?}
    T2 -->|Yes<br/>High Confidence| DONE
    T2 -->|No| T3{Tier 3:<br/>Fuzzy Similarity<br/>Above Threshold?}
    T3 -->|Yes<br/>Variable Confidence| DONE
    T3 -->|No| PROF{Source<br/>Type?}
    PROF -->|Professional<br/>news/auction| T4{Tier 4:<br/>Authoritative<br/>Match?}
    PROF -->|User Input<br/>autocomplete| SKIP_GETTY[Skip Authority<br/>speed optimization]
    T4 -->|Yes<br/>Authoritative Confidence| CREATE_ALIAS[Create alias<br/>from authority variants]
    T4 -->|No| T7[Tier 7:<br/>Create New<br/>CanonicalArtist]
    CREATE_ALIAS --> DONE
    SKIP_GETTY --> T7
    T7 --> DONE
    style T1 fill:#d4edda
    style T2 fill:#d4edda
    style T3 fill:#fff3cd
    style T4 fill:#cfe2ff
    style T7 fill:#f8d7da
    style DONE fill:#d1ecf1
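The routing in the flowchart can be approximated in a few dozen lines. The sketch below is an illustration under stated assumptions rather than Egon's implementation: the in-memory dictionaries, the `Resolver` class, the `authority_lookup` callable (a stand-in for a Getty ULAN client that returns a preferred name plus variants), the 0.85 fuzzy threshold, and the 0.95 and 0.3 confidence values are all hypothetical. Python's `difflib.SequenceMatcher` stands in for the fuzzy similarity measure, and linking an authority match to a find-or-create canonical record is one plausible reading of the "Create alias" step.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple

PROFESSIONAL_SOURCES = {"news", "auction"}

# Hypothetical authority client: normalized name -> (preferred name, variants)
AuthorityLookup = Callable[[str], Optional[Tuple[str, List[str]]]]


class Resolution(NamedTuple):
    canonical_id: int
    tier: int
    confidence: float


def normalize(name: str) -> str:
    """Lowercase, strip accents, trim and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


class Resolver:
    def __init__(self, authority_lookup: AuthorityLookup, fuzzy_threshold: float = 0.85):
        self.canonical: Dict[str, int] = {}  # normalized canonical name -> id
        self.aliases: Dict[str, int] = {}    # normalized alias -> id
        self.authority_lookup = authority_lookup
        self.fuzzy_threshold = fuzzy_threshold  # placeholder value

    def _create(self, key: str) -> int:
        new_id = len(self.canonical) + 1
        self.canonical[key] = new_id
        return new_id

    def resolve(self, raw_name: str, source_type: str) -> Resolution:
        key = normalize(raw_name)
        if key in self.canonical:  # Tier 1: exact match
            return Resolution(self.canonical[key], 1, 1.0)
        if key in self.aliases:    # Tier 2: alias match
            return Resolution(self.aliases[key], 2, 1.0)
        # Tier 3: fuzzy match against canonical names only
        scored = [(SequenceMatcher(None, key, name).ratio(), cid)
                  for name, cid in self.canonical.items()]
        if scored:
            score, cid = max(scored)
            if score >= self.fuzzy_threshold:
                return Resolution(cid, 3, score)
        # Tier 4: authority lookup, skipped for user input (autocomplete)
        if source_type in PROFESSIONAL_SOURCES:
            match = self.authority_lookup(key)
            if match is not None:
                preferred, variants = match
                pref_key = normalize(preferred)
                cid = self.canonical.get(pref_key) or self._create(pref_key)
                for variant in [raw_name, *variants]:
                    alias_key = normalize(variant)
                    if alias_key != pref_key:
                        self.aliases[alias_key] = cid
                return Resolution(cid, 4, 0.95)  # assumed confidence
        # Tier 7: nothing matched, or authority skipped: create a new record
        return Resolution(self._create(key), 7, 0.3)  # assumed confidence
```

Passing `source_type="autocomplete"` never calls `authority_lookup`, so user-facing latency stays bounded by the in-memory tiers, while news and auction ingestion pay for the extra lookup only after Tiers 1-3 have failed.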

Data Freshness System (Automatic TTL Management)

Egon's multi-tier TTL system maintains data freshness across 10+ data sources automatically, without manual intervention:

Data Type	TTL Strategy	Invalidation Trigger	Rationale
News Signals	Daily extraction	Daily RSS feed harvesting (9 sources)	Market events require real-time detection (gallery moves, auction records, museum acquisitions)
Auction Resolution Cache	Extended TTL	Automatic expiration on periodic schedule	Auction artists change slowly; high cache hit rate reduces authority API calls
Museum/Getty Data	Extended TTL	Automatic expiration + periodic authoritative updates	Institutional holdings stable; authoritative biographies rarely change
Full Analysis Cache	Tiered by user subscription	Automatic TTL + preference change + market event	Balances data freshness with API efficiency; user preference changes trigger immediate invalidation
Artwork Images (CDN)	Incremental collection	Periodic LLM-powered discovery (5 strategies) + continuous market data ingestion	Image library continuously expands with LLM-selected representative artwork examples for each artist; Cloudinary CDN with 4 optimized presets (thumbnail, display, analysis, archive)
Auction Context	Real-time (no cache)	Every query fetches latest	Auction performance trends are time-sensitive investment signals

Why Automatic TTL Enables Custom Deep Research:

Autonomous data invalidation: System determines when data is stale without manual intervention. News signals refresh daily, museum data every 90 days, analysis cache at user-tier-specific intervals.
On-demand LLM research: When cache expires, Claude performs extensive multi-query OSINT research + database enrichment—research depth impossible with static pre-cached data.
Cost-performance balance: 85-98% cache hit rates reduce API costs (Getty ULAN, museum APIs, image collection) while preserving freshness where market timing matters (news, auction context).
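A per-data-type TTL cache of the kind described above might look like the following sketch. Only the daily news cadence, the 90-day museum refresh, and the uncached auction context come from the text; the 30-day auction resolution TTL, the `TTLCache` class, and its method names are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple

DAY = 86_400  # seconds

TTL_SECONDS = {
    "news_signals": 1 * DAY,         # refreshed daily
    "museum_getty": 90 * DAY,        # refreshed every 90 days
    "auction_resolution": 30 * DAY,  # ASSUMPTION: "extended TTL" is unspecified
    "auction_context": 0,            # real-time: never cached
}


class TTLCache:
    def __init__(self, clock: Callable[[], float] = time.time):
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}
        self._clock = clock
        self.hits = 0
        self.misses = 0

    def get_or_fetch(self, data_type: str, key: str, fetch: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or call fetch() and cache its result."""
        ttl = TTL_SECONDS[data_type]
        now = self._clock()
        entry = self._store.get((data_type, key))
        if ttl > 0 and entry is not None and now < entry[0]:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = fetch()
        if ttl > 0:
            self._store[(data_type, key)] = (now + ttl, value)
        return value

    def invalidate(self, data_type: str, key: str) -> None:
        """Explicit invalidation, e.g. on a preference change or market event."""
        self._store.pop((data_type, key), None)
```

Injecting the clock keeps expiry testable without sleeping, and the `hits` and `misses` counters are what a cache hit rate would be computed from.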

Trigram Threshold Tuning (Optimized Balance)

The trigram similarity threshold balances precision (avoiding false matches) vs. recall (catching valid typos):

Production threshold: High precision (>99%), strong recall (>94%) — current optimized setting
Lower threshold: Reduced precision (~95%), higher recall (~98%) — increased false positives like "Pablo Picasso" matching "Pablo Neruda"
Higher threshold: Extreme precision (>99.5%), reduced recall (~89%) — misses valid typos like "Pablo Piccaso"

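Why Search Canonical Names Only: Fuzzy matching runs against canonical names, not aliases. Fuzzy-searching aliases as well creates ambiguous candidates: an input such as "Pablo" would score against both the canonical name "Pablo Picasso" and the alias "P. Picasso", producing two hits for one artist. Restricting fuzzy search to canonical names yields at most one candidate per artist; aliases are still covered by the exact alias lookup in Tier 2.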
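To make the precision/recall tradeoff concrete, here is a small pure-Python trigram similarity modelled loosely on the padding scheme used by PostgreSQL's pg_trgm extension. The text does not say which trigram implementation or threshold Egon uses, so both the scheme and the 0.5 cut-off are assumptions chosen only to illustrate the idea.

```python
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Pad each word with two leading spaces and one trailing space, then slice."""
    grams: Set[str] = set()
    for word in text.lower().split():
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


THRESHOLD = 0.5  # hypothetical; the production value is not published

typo = trigram_similarity("Pablo Piccaso", "Pablo Picasso")
other = trigram_similarity("Pablo Neruda", "Pablo Picasso")
print(typo, other)  # 0.625 0.3
```

Under this scheme the typo shares 10 of 16 distinct trigrams with the canonical name, while "Pablo Neruda" shares only the 6 trigrams of "Pablo" out of 20. A threshold dropped to 0.3 or below would admit the Neruda false positive, and one raised above 0.625 would reject the valid typo.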

Confidence Scoring System

Each resolution tier assigns confidence scores used by downstream logic (investment grading, user recommendations):

Tier	Confidence Range	Downstream Use
Tier 1 (Exact)	1.0	No confidence penalty in investment scoring
Tier 2 (Alias)	1.0	Verified alias, full confidence
Tier 3 (Trigram)	Variable (threshold-based)	High similarity: full confidence. Low similarity (near threshold): flagged for manual review
Tier 4 (Getty)	Very High	Authoritative source (Getty ULAN), near-perfect confidence
Tier 7 (New)	Low (unverified)	New canonical artist, requires manual verification before high-confidence use

Investment Grading Impact: Confidence < 0.7 downgrades the investment grade by one step (e.g., A+ → A). Confidence < 0.5 blocks investment grading entirely (returns "Insufficient Data"). See Investment Science whitepaper for complete grading methodology.
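The two confidence cut-offs translate into a simple guard in front of the grading step. The sketch below uses the 0.7 and 0.5 thresholds and the "Insufficient Data" result from the text; the grade ladder itself and the function name `apply_confidence` are hypothetical, since the full scale is defined in the Investment Science whitepaper rather than here.

```python
# ASSUMPTION: illustrative ladder, best to worst. Only "A+ -> A" is from the text.
GRADE_LADDER = ["A+", "A", "B", "C", "D"]


def apply_confidence(grade: str, confidence: float) -> str:
    """Adjust an investment grade for identity-resolution confidence."""
    if confidence < 0.5:
        return "Insufficient Data"  # grading blocked entirely
    if confidence < 0.7:
        # Downgrade one step, but never past the bottom of the ladder
        index = GRADE_LADDER.index(grade)
        return GRADE_LADDER[min(index + 1, len(GRADE_LADDER) - 1)]
    return grade


print(apply_confidence("A+", 1.0))   # A+
print(apply_confidence("A+", 0.65))  # A
print(apply_confidence("A+", 0.3))   # Insufficient Data
```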

Performance Metrics (Production)

Auction Link Rate: 99.98% (4,980/4,981 AuctionArtist records linked)
News Signal Link Rate: 100% (all MarketSignal records linked; 100% professional source accuracy)
Resolution Speed: <300ms typical resolution time (excluding Getty, 90%+ cache hit)
Universal Linking: 10+ tables (Watchlist, Collection, Analysis, Auction, News, Lifecycle)

4. Cross-Dataset Intelligence

Canonical artist resolution enables cross-dataset queries that are impossible with fragmented data. The key capability: a single canonical ID links 10+ database tables for unified intelligence.

Universal Canonical Linking

erDiagram
    CanonicalArtist ||--o{ AuctionArtist : "canonical_artist_id"
    CanonicalArtist ||--o{ MarketSignal : "canonical_artist_id"
    CanonicalArtist ||--o{ ArtistData : "canonical_artist_id"
    CanonicalArtist ||--o{ Watchlist : "canonical_artist_id"
    CanonicalArtist ||--o{ Collection : "canonical_artist_id"
    CanonicalArtist ||--o{ AnalysisHistory : "canonical_artist_id"
    CanonicalArtist ||--o{ ArtworkLifecycleEvent : "canonical_artist_id"
    CanonicalArtist {
        int id PK
        string canonical_name
        string authority_id
        boolean verified
        jsonb structured_metadata
        vector embeddings
    }

Multi-Source Signal Aggregation

Confluence scoring requires data from 5 independent sources—impossible without canonical linking. For complete algorithm specification with weight derivation and contrasting examples, see Investment Science whitepaper Section 3.

# Conceptual illustration of multi-source confluence scoring.
# The helper functions stand in for proprietary logic and are not defined here.
def calculate_confluence_score(canonical_id, artist_tier):
    # Single canonical ID enables cross-dataset queries
    signals = gather_signals_from_all_sources(canonical_id)

    # Apply proprietary tier-aware weighting for the artist's tier
    # (e.g. Discovery or Blue-chip), supplied by the caller
    weighted_signals = apply_tier_aware_weights(signals, artist_tier)

    # Multi-source confirmation bonus
    base_score = aggregate_weighted_signals(weighted_signals)
    confirmation_bonus = calculate_multi_source_bonus(len(signals))

    return normalize_to_scale(base_score * confirmation_bonus)  # 1-10 scale

Without canonical resolution: Each query would have to hit every table with multiple name variants ("Pablo Picasso", "PICASSO, Pablo", "P. Picasso"), then deduplicate manually, risking double-counting and running 10-100x slower.

Real-Time Deduplication

Canonical linking enables automatic deduplication across user activities. For example, a collector tracking "Pablo Picasso", "PICASSO, Pablo", and "P. Picasso" in their watchlist would see 3 separate entries without canonical resolution—but with canonical IDs, these automatically merge into a single artist record with unified engagement scoring.
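The watchlist example can be sketched as a group-by on the canonical ID. This is an illustration only: `merge_watchlist`, the `resolve` callable, and the choice to sum engagement scores are assumptions, as the text does not say how unified engagement scoring is computed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def merge_watchlist(
    entries: Iterable[Tuple[str, float]],
    resolve: Callable[[str], int],
) -> Dict[int, dict]:
    """Group (raw_name, engagement) entries by canonical artist ID."""
    merged = defaultdict(lambda: {"names": set(), "engagement": 0.0})
    for raw_name, engagement in entries:
        record = merged[resolve(raw_name)]
        record["names"].add(raw_name)
        record["engagement"] += engagement  # ASSUMPTION: scores are summed
    return dict(merged)


# Hypothetical resolver backed by a fixed lookup table
KNOWN = {"Pablo Picasso": 1, "PICASSO, Pablo": 1, "P. Picasso": 1, "Joan Miró": 2}
watchlist = [("Pablo Picasso", 2.0), ("PICASSO, Pablo", 1.0),
             ("P. Picasso", 0.5), ("Joan Miró", 1.0)]
merged = merge_watchlist(watchlist, KNOWN.__getitem__)
print(len(merged))  # 2
```

Four watchlist rows collapse into two artist records, with the three Picasso variants sharing one engagement total.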

5. Multi-Dimensional Vector Search

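Egon creates 4 independent high-dimensional embeddings per canonical artist, one per dimension (aesthetic, market, institutional, investment). The innovation: because the dimensions are largely uncorrelated, they can be combined with per-query weights, which the single relevance score of public web search cannot do.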

4-Dimensional Embedding Architecture

Dimension	Training Data	Orthogonality Validation	Use Case
Aesthetic	Visual style, medium, movement descriptions, artwork imagery	Low correlation with market	"Find artists aesthetically similar to Basquiat"
Market	Auction prices, gallery tier, collector base, transaction volume	Low correlation with aesthetic	"Find artists at similar price points"
Institutional	Museum collections, retrospectives, awards, critical reception	Low correlation with investment	"Find artists with similar institutional validation"
Investment	Price trends, momentum signals, tier classification, liquidity	Low correlation with market	"Find artists with similar investment profiles"

Orthogonality Validation: Statistical correlation measured across thousands of artist pairs. Low correlation between dimensions confirms they capture independent artist attributes. Example: Aesthetic and market dimensions are largely independent—a lower-priced artist can have Blue-chip aesthetic quality (the basis for Discovery tier alpha opportunities).
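One plausible way to run such a check, offered here as an assumption rather than a description of Egon's method, is to compute the cosine similarity of every artist pair in two embedding spaces and then take the Pearson correlation of the two similarity lists. The function names and the toy vectors below are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs: List[float], ys: List[float]) -> float:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def dimension_correlation(
    emb_a: Dict[str, Sequence[float]],
    emb_b: Dict[str, Sequence[float]],
) -> float:
    """Correlate pairwise artist similarities across two embedding spaces."""
    ids = sorted(set(emb_a) & set(emb_b))
    pairs = list(combinations(ids, 2))
    if len(pairs) < 2:
        raise ValueError("need at least three shared artists")
    sims_a = [cosine(emb_a[i], emb_a[j]) for i, j in pairs]
    sims_b = [cosine(emb_b[i], emb_b[j]) for i, j in pairs]
    return pearson(sims_a, sims_b)


# Toy 2-D vectors; real embeddings are high-dimensional
aesthetic = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
market = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.1], "d": [0.1, 1.0]}
r = dimension_correlation(aesthetic, market)
```

A value of `r` near zero would indicate that knowing two artists are aesthetically similar says little about whether they are similar in market terms, which is the property weighted queries rely on.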

Weighted Combination Queries

# Example 1: "Find artists aesthetically like Basquiat but cheaper"
search_results = vector_search_service.search_multi_dimensional(
    canonical_artist_id=basquiat_id,
    weights={
        'aesthetic': 'primary',    # Prioritize aesthetic similarity
        'market': 'secondary'      # Secondary factor: market positioning (cheaper = lower tier)
    },
    limit=10
)
# Returns: Artists with similar style (graffiti, neo-expressionism)
# but avg_hammer_price < $50K (vs Basquiat $500K+)

# Example 2: "Find Discovery tier artists with Blue-chip aesthetic"
search_results = vector_search_service.search_multi_dimensional(
    canonical_artist_id=blue_chip_artist_id,
    weights={
        'aesthetic': 'primary',      # Prioritize aesthetic match
        'investment': 'tertiary'      # Lower weight for investment profile (Discovery tier filter)
    },
    filters={'tier': 'Discovery'},
    limit=10
)
# Returns: Emerging artists ($5K-$25K) with established aesthetic appeal
# (undervalued opportunities before market recognition)

# Example 3: "Find artists with institutional validation but undervalued"
search_results = vector_search_service.search_multi_dimensional(
    canonical_artist_id=reference_artist_id,
    weights={
        'institutional': 'primary',    # Prioritize institutional similarity
        'market': 'secondary'          # Secondary factor: market position (undervalued filter)
    },
    filters={'avg_hammer_price__lt': 100000},
    limit=10
)
# Returns: Museum-validated artists (MoMA, Met, Tate) under $100K
# (institutional quality at accessible prices)

Why This Is Impossible on Public Web: Google/Bing cannot answer "aesthetically like X but cheaper" because they lack separate aesthetic and market embeddings. Web search collapses all dimensions into a single relevance score. Egon's 4-dimensional architecture enables precise control over which attributes matter for a given query.

Performance: Sub-200ms for weighted multi-dimensional search across 5,000+ canonical artists using optimized vector indexes.
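A weighted combination query can be approximated as a weighted average of per-dimension cosine similarities. The brute-force sketch below is illustrative only: it uses numeric weights where the examples above use labels such as 'primary', it scans every artist instead of using a vector index, and `weighted_search`, its `keep` predicate, and the toy data are names and values invented here.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# artist id -> dimension name -> embedding vector
Embeddings = Dict[str, Dict[str, Sequence[float]]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_search(
    query_id: str,
    embeddings: Embeddings,
    weights: Dict[str, float],
    keep: Optional[Callable[[str], bool]] = None,
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Rank artists by the weighted average of per-dimension similarity."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    query = embeddings[query_id]
    results = []
    for artist_id, dims in embeddings.items():
        if artist_id == query_id or (keep is not None and not keep(artist_id)):
            continue
        score = sum(w * cosine(query[d], dims[d]) for d, w in weights.items()) / total
        results.append((artist_id, score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:limit]


# Toy data: "a" matches the reference on style, "b" matches on market position
artists = {
    "ref": {"aesthetic": [1.0, 0.0], "market": [1.0, 0.0]},
    "a": {"aesthetic": [0.9, 0.1], "market": [0.0, 1.0]},
    "b": {"aesthetic": [0.0, 1.0], "market": [1.0, 0.0]},
}
prices = {"a": 40_000, "b": 600_000}  # hypothetical average hammer prices
style_first = weighted_search("ref", artists, {"aesthetic": 0.8, "market": 0.2})
cheaper = weighted_search("ref", artists, {"aesthetic": 1.0},
                          keep=lambda artist_id: prices[artist_id] < 50_000)
```

Here `style_first` ranks "a" ahead of "b", and swapping the two weights reverses the order. The `keep` predicate plays the role of the price and tier filters in the examples above.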

6. Technical Architecture

Egon's data universe creates differentiated capabilities through three core technical innovations:

1. Source-Aware Resolution Routing

Authority skip for autocomplete: Fast resolution for user input vs. thorough verification for professional sources
Tiered caching strategy: Multiple cache tiers with TTLs optimized per data type for high cache hit rates
Confidence scoring: Variable confidence ranges affect downstream investment grading (low confidence = grade penalty)

2. Tier-Aware Signal Weighting

Discovery tier amplification: Gallery moves (highest weight), museum acquisitions (very high), retrospectives (high)
Blue-chip tier dampening: Auction records (high weight), market shifts (moderate), milestones (minimal)
Backtesting validation: Weights derived from 2,000+ artist trajectories, recalibrated quarterly

3. 4-Dimensional Orthogonal Embeddings

Independent dimensions: Low correlation between dimensions ensures weighted queries control specific attributes
Weighted combination queries: "Aesthetically like X but cheaper" impossible without orthogonal embeddings
Sub-200ms vector search: Optimized vector indexes for fast similarity queries

Link Rate: 99.98% (4,980/4,981 auction records, 100% news signals)
Resolution Speed: <300ms typical (excluding Getty, 90%+ cache hit)
Vector Search: <200ms (weighted 4D queries across 5,000+ artists)

Network Effects

More data sources → higher canonical link rate → better cross-dataset intelligence → more valuable insights → more users → more user interaction data → improved tier-aware signal weights

graph LR
    A[Add New Data Source] --> B[More Artist Name Variants]
    B --> C[Higher Canonical Link Rate<br/>99.98%]
    C --> D[Better Cross-Dataset Intelligence<br/>5-signal confluence]
    D --> E[More Valuable Insights<br/>Momentum detection]
    E --> F[More Users]
    F --> G[More User Interaction Data<br/>Watchlist, Collection, Analysis]
    G --> H[Improved Tier-Aware Weights<br/>Backtesting recalibration]
    H --> A
    style C fill:#d4edda
    style D fill:#fff3cd
    style H fill:#cfe2ff

Key Differentiators

✅ 99.98% canonical link rate (4,980 of 4,981 auction records)
✅ Source-aware resolution routing (Getty skipped for autocomplete, consulted for professional sources when Tiers 1-3 find no match)
✅ Tier-aware signal weighting (signals that are rare for Discovery tier artists weighted significantly higher than the same signals for Blue-chip artists, where they are expected)
✅ 4-dimensional orthogonal embeddings (r < 0.2 correlation enables weighted queries)
✅ Multi-dimensional semantic search ("aesthetically like X but cheaper" impossible on public web)
✅ Cross-dataset intelligence (5-signal confluence scoring across 10+ tables)
✅ Real-time deduplication (across all name variants via single canonical ID)
