The Art Market Graph
Unified Knowledge Architecture for an Opaque Market
Version 1.0 | Last Updated: January 2026
Abstract
Problem: The art market suffers from extreme opacity—no centralized price database (unlike stocks), fragmented data across auction houses/news/museums, and identity ambiguity across dozens of name variants per artist.
Solution: Egon creates a unified data universe through a 7-tier canonical artist resolution system that achieves a 99.98% link rate (4,980 auction artist records linked, 1 orphan) across 10+ disparate sources, enabling cross-dataset intelligence impossible elsewhere.
Innovation: Source-aware resolution routing (skipping the Getty lookup where speed matters more than authoritative accuracy), tier-aware signal weighting (signals that are exceptional for a Discovery-tier artist are weighted significantly higher than the same signals when they are expected of a Blue-chip artist), 4-dimensional orthogonal embeddings for weighted combination queries, and <300ms resolution across 30+ data sources.
Impact: Multi-dimensional search queries impossible on public web ("aesthetically like X but cheaper"), real-time cross-source momentum detection (5-signal confluence scoring), universal deduplication across all name variants.
1. The Art Market Data Problem
The art market is opaque by design. Unlike public financial markets with centralized reporting (Bloomberg Terminal, real-time quotes, mandatory disclosure), the art market operates on controlled information asymmetry: private gallery sales, paywalled auction databases, selective data release.
The same artist appears in dozens of name variants across sources. Building canonical records that deduplicate artist identities across data sources enables unified intelligence—but requires sophisticated resolution logic to balance speed, accuracy, and data quality across professional sources (news, auctions) vs. user input (autocomplete).
The Core Challenge: Detecting emerging artist momentum requires unified intelligence across auction data, news signals, institutional validation, and collector behavior—before the broader market recognizes the pattern. This is only possible with canonical artist identities that link disparate data sources with high accuracy and sub-second latency.
2. Multi-Source Data Ingestion
Egon operates a multi-source ingestion layer: news intelligence pipeline (daily RSS extraction from curated art publications), auction data integration (hybrid scraping with fallback methods), institutional validation APIs (museum collections with two-stage false positive filtering), and authoritative identity services (Getty ULAN for biographical data and name variants).
Tier-Aware Signal Weighting
Different signals carry different predictive weight depending on artist tier. This tier-aware relevance system is the key innovation enabling early momentum detection:
| Signal Type | Discovery Tier Multiplier | Blue-chip Tier Multiplier | Rationale |
|---|---|---|---|
| gallery_move | Highest | Minimal | Rare for emerging artists, highly predictive of tier elevation |
| museum_acquisition | Very High | Low | Career-defining for Discovery tier, routine for established artists |
| auction_record | Low | High | Expected for Blue-chip, signals market strength when achieved |
| retrospective | High | Neutral | Exceptional institutional validation for emerging artists |
| price_trend | Moderate-High | Moderate | Discovery price appreciation signals momentum, Blue-chip expected |
| artist_milestone | Moderate | Minimal | Awards/residencies critical for emerging artists, less so for established |
| market_shift | Neutral | Moderate | Macro trends affect established artists more significantly |
Weight Derivation: Multipliers derived from backtesting 2,000+ artist trajectories over 5-year periods. Discovery tier weights optimized for <18-month predictive horizon (tier elevation events), Blue-chip weights optimized for market timing (buy/hold/sell signals). Weights recalibrated quarterly based on prediction accuracy.
Freshness Weighting: Signal age affects relevance:
- Breaking (0-7 days): Highest multiplier (recent events weighted more heavily)
- Current (8-30 days): Baseline multiplier
- Evergreen (31+ days): Reduced multiplier with progressive decay over time
Signal Extraction: Each extracted signal undergoes Claude NLP analysis to determine base strength (1-10 scale), tier relevance classification, and freshness scoring. Tier-aware multipliers and temporal decay are then applied via a proprietary weighting algorithm to generate final actionable scores.
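As a rough sketch of how base strength, tier multiplier, and freshness might combine: the actual weighting algorithm is proprietary and the multipliers are given only qualitatively, so every numeric value below, the `Signal` class, and the function names are illustrative assumptions rather than Egon's implementation.

```python
from dataclasses import dataclass

# Assumed numeric stand-ins for the qualitative multipliers in the table above.
TIER_MULTIPLIERS = {
    ("gallery_move", "discovery"): 2.0,     # "Highest"
    ("gallery_move", "blue_chip"): 0.5,     # "Minimal"
    ("auction_record", "discovery"): 0.75,  # "Low"
    ("auction_record", "blue_chip"): 1.5,   # "High"
}


@dataclass
class Signal:
    signal_type: str
    base_strength: float  # 1-10 scale from NLP analysis
    age_days: int


def freshness_multiplier(age_days: int) -> float:
    if age_days <= 7:    # Breaking: highest multiplier (assumed 1.5)
        return 1.5
    if age_days <= 30:   # Current: baseline
        return 1.0
    # Evergreen: reduced, then halving every 90 days (assumed decay curve).
    return 0.8 * 0.5 ** ((age_days - 30) / 90)


def final_score(signal: Signal, artist_tier: str) -> float:
    # Unknown (signal, tier) pairs fall back to a neutral multiplier of 1.0.
    tier_mult = TIER_MULTIPLIERS.get((signal.signal_type, artist_tier), 1.0)
    return signal.base_strength * tier_mult * freshness_multiplier(signal.age_days)
```

With these assumed values, a three-day-old gallery move of base strength 6 scores 18.0 for a Discovery-tier artist and 4.5 for a Blue-chip artist, which is the asymmetry the table describes.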
Data Universe Architecture
[Diagram: Data Universe Architecture. Sources (RSS news feeds, auction data, museum collections, Getty ULAN) feed their respective ingestion services. News, auction, and museum records enter resolution at Tier 1 and proceed through Tiers 2, 3, 4, and 7 as needed; Getty data feeds Tier 4 directly. Resolution produces the canonical artist record, which links the database tables via canonical ID and feeds the Vector Search Layer (Multi-Dimensional Embeddings → Optimized Indexes).]
3. Canonical Artist Resolution System
The heart of Egon's data universe: a 7-tier matching pipeline achieving 99.98% link rate (4,980 AuctionArtist records linked, 1 orphan; 100% MarketSignal records linked).
Source-Aware Resolution Routing: The Latency/Accuracy Tradeoff
Authoritative verification provides high confidence but incurs latency. The key innovation: route authoritative lookups based on source type to optimize for speed (user autocomplete) or accuracy (professional data ingestion).
Resolution flow (diagram summary):
- Normalize the input name: lowercase, strip accents, trim.
- Tier 1, exact match: if found, return canonical_artist_id (high confidence).
- Tier 2, alias match: if found, return canonical_artist_id (high confidence).
- Tier 3, fuzzy similarity above threshold: if found, return canonical_artist_id (variable confidence).
- Source type check: professional sources (news/auction) continue to Tier 4; user input (autocomplete) skips the authority lookup as a speed optimization and goes straight to Tier 7.
- Tier 4, authoritative match: if found, create an alias from the authority's name variants and return canonical_artist_id (authoritative confidence).
- Tier 7: create a new CanonicalArtist and return its canonical_artist_id.
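The source-aware routing in this section can be sketched roughly as follows. This is a minimal illustration, not Egon's code: the `Resolver` class, the tier labels, the `"professional"` source-type string, and the two injected callables (`fuzzy_match` for Tier 3, `authority_lookup` for Tier 4) are all hypothetical, and the sketch assumes an authority hit only links when one of its name variants already exists as a canonical name.

```python
from __future__ import annotations

import unicodedata
from typing import Callable, Optional


def normalize(name: str) -> str:
    """Lowercase, strip accents, trim (and collapse inner whitespace)."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


class Resolver:
    def __init__(
        self,
        fuzzy_match: Callable[[str], Optional[int]],
        authority_lookup: Callable[[str], Optional[list[str]]],
    ) -> None:
        self.canonical: dict[str, int] = {}  # normalized canonical name -> id
        self.aliases: dict[str, int] = {}    # normalized alias -> id
        self.fuzzy_match = fuzzy_match            # Tier 3 stand-in
        self.authority_lookup = authority_lookup  # Tier 4 stand-in

    def resolve(self, raw_name: str, source_type: str) -> tuple[int, str]:
        key = normalize(raw_name)
        if key in self.canonical:                  # Tier 1: exact match
            return self.canonical[key], "tier1_exact"
        if key in self.aliases:                    # Tier 2: alias match
            return self.aliases[key], "tier2_alias"
        fuzzy_id = self.fuzzy_match(key)           # Tier 3: fuzzy similarity
        if fuzzy_id is not None:
            return fuzzy_id, "tier3_fuzzy"
        if source_type == "professional":          # Tier 4: news/auction only
            for variant in self.authority_lookup(key) or []:
                artist_id = self.canonical.get(normalize(variant))
                if artist_id is not None:
                    self.aliases[key] = artist_id  # alias from authority variant
                    return artist_id, "tier4_authority"
        new_id = len(self.canonical) + 1           # Tier 7: create new record
        self.canonical[key] = new_id
        return new_id, "tier7_new"
```

The design point is that the slow authority call sits behind the source-type check, so an autocomplete request never pays for it, while professional ingestion gets the extra verification before a new canonical record is created.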
Data Freshness System (Automatic TTL Management)
Egon's sophisticated multi-tier TTL system automatically maintains data freshness across 10+ data sources without manual intervention:
| Data Type | TTL Strategy | Invalidation Trigger | Rationale |
|---|---|---|---|
| News Signals | Daily extraction | Daily RSS feed harvesting (9 sources) | Market events require real-time detection (gallery moves, auction records, museum acquisitions) |
| Auction Resolution Cache | Extended TTL | Automatic expiration on periodic schedule | Auction artists change slowly; high cache hit rate reduces authority API calls |
| Museum/Getty Data | Extended TTL | Automatic expiration + periodic authoritative updates | Institutional holdings stable; authoritative biographies rarely change |
| Full Analysis Cache | Tiered by user subscription | Automatic TTL + preference change + market event | Balances data freshness with API efficiency; user preference changes trigger immediate invalidation |
| Artwork Images (CDN) | Incremental collection | Periodic LLM-powered discovery (5 strategies) + continuous market data ingestion | Image library continuously expands with LLM-selected representative artwork examples for each artist; Cloudinary CDN with 4 optimized presets (thumbnail, display, analysis, archive) |
| Auction Context | Real-time (no cache) | Every query fetches latest | Auction performance trends are time-sensitive investment signals |
Why Automatic TTL Enables Custom Deep Research:
- Autonomous data invalidation: System determines when data is stale without manual intervention. News signals refresh daily, museum data every 90 days, analysis cache at user-tier-specific intervals.
- On-demand LLM research: When cache expires, Claude performs extensive multi-query OSINT research + database enrichment—research depth impossible with static pre-cached data.
- Cost-performance balance: 85-98% cache hit rates reduce API costs (Getty ULAN, museum APIs, image collection) while preserving freshness where market timing matters (news, auction context).
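A per-data-type TTL cache along these lines could look like the sketch below. Only the daily news refresh, the 90-day museum refresh, and the uncached auction context come from the text; the 30-day auction resolution TTL, the `TTLCache` class, and its method names are assumptions made for illustration.

```python
import time
from typing import Any, Callable

DAY = 86_400  # seconds

TTL_BY_TYPE = {
    "news_signals": 1 * DAY,
    "museum_data": 90 * DAY,
    "auction_resolution": 30 * DAY,  # assumed value for the "extended" TTL
    "auction_context": 0,            # real-time: never served from cache
}


class TTLCache:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._store: dict = {}  # (data_type, key) -> (stored_at, value)

    def get_or_fetch(self, data_type: str, key: str, fetch: Callable[[], Any]) -> Any:
        ttl = TTL_BY_TYPE[data_type]
        now = self._clock()
        entry = self._store.get((data_type, key))
        if entry is not None and ttl > 0 and now - entry[0] < ttl:
            return entry[1]  # cache hit: still fresh
        value = fetch()      # missing or stale: refresh from the source
        if ttl > 0:
            self._store[(data_type, key)] = (now, value)
        return value

    def invalidate(self, data_type: str, key: str) -> None:
        """Event-driven invalidation, e.g. on a user preference change."""
        self._store.pop((data_type, key), None)
```

Injecting the clock keeps expiry testable without sleeping, and a TTL of zero gives the "every query fetches latest" behavior without a separate code path.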
Trigram Threshold Tuning (Optimized Balance)
The trigram similarity threshold balances precision (avoiding false matches) vs. recall (catching valid typos):
- Production threshold: High precision (>99%), strong recall (>94%) — current optimized setting
- Lower threshold: Reduced precision (~95%), higher recall (~98%) — increased false positives like "Pablo Picasso" matching "Pablo Neruda"
- Higher threshold: Extreme precision (>99.5%), reduced recall (~89%) — misses valid typos like "Pablo Piccaso"
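Why Search Canonical Names Only: Fuzzy search over aliases creates false positives (e.g., "Pablo" matches both the canonical name "Pablo Picasso" and the alias "P. Picasso", which is ambiguous). Restricting fuzzy search to canonical names ensures 1:1 matching; known aliases are already handled by the Tier 2 alias match.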
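To make the threshold tradeoff concrete, here is a small sketch using a common set-based trigram similarity (shared trigrams divided by total distinct trigrams). The padding scheme, the function names, and the example thresholds 0.25, 0.5, and 0.7 are assumptions; the production similarity function and threshold are not stated in this document.

```python
from typing import Optional


def trigrams(text: str) -> set[str]:
    # Pad so that word boundaries contribute trigrams too.
    padded = f"  {text.strip().lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def best_match(query: str, canonical_names: list[str], threshold: float) -> Optional[str]:
    """Return the most similar canonical name, or None if below threshold."""
    best = max(canonical_names, key=lambda name: similarity(query, name), default=None)
    if best is not None and similarity(query, best) >= threshold:
        return best
    return None
```

Under this measure the typo "Pablo Piccaso" scores about 0.65 against "Pablo Picasso", while "Pablo Neruda" scores about 0.29. A threshold of 0.5 accepts the typo and rejects Neruda, 0.25 lets the Neruda false positive through, and 0.7 rejects the valid typo.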
Confidence Scoring System
Each resolution tier assigns confidence scores used by downstream logic (investment grading, user recommendations):
| Tier | Confidence Range | Downstream Use |
|---|---|---|
| Tier 1 (Exact) | 1.0 | No confidence penalty in investment scoring |
| Tier 2 (Alias) | 1.0 | Verified alias, full confidence |
| Tier 3 (Trigram) | Variable (threshold-based) | High similarity: full confidence. Low similarity (near threshold): flagged for manual review |
| Tier 4 (Getty) | Very High | Authoritative source (Getty ULAN), near-perfect confidence |
| Tier 7 (New) | Low (unverified) | New canonical artist, requires manual verification before high-confidence use |
Investment Grading Impact: Confidence < 0.7 downgrades the investment grade by one step (e.g., A+ → A). Confidence < 0.5 blocks investment grading entirely (returns "Insufficient Data"). See the Investment Science whitepaper for the complete grading methodology.
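The two confidence cutoffs can be expressed as a small function. The 0.7 and 0.5 thresholds, the A+ → A example, and the "Insufficient Data" result come from the text; the full grade ladder and the function name are assumptions, since the grading scale itself is defined in the Investment Science whitepaper.

```python
# Assumed grade ladder, best to worst; only the A+ -> A step is from the source.
GRADE_LADDER = ["A+", "A", "A-", "B+", "B", "B-", "C"]


def apply_confidence(grade: str, confidence: float) -> str:
    if confidence < 0.5:
        return "Insufficient Data"  # grading blocked entirely
    if confidence < 0.7:
        idx = GRADE_LADDER.index(grade)
        # Move one step down, but never past the lowest grade.
        return GRADE_LADDER[min(idx + 1, len(GRADE_LADDER) - 1)]
    return grade
```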
Performance Metrics (Production)
4. Cross-Dataset Intelligence
Canonical artist resolution enables sophisticated cross-dataset queries that are impossible with fragmented data. The key capability: a single canonical ID links 10+ database tables for unified intelligence.
Universal Canonical Linking
Multi-Source Signal Aggregation
Confluence scoring requires data from 5 independent sources—impossible without canonical linking. For complete algorithm specification with weight derivation and contrasting examples, see Investment Science whitepaper Section 3.
# Conceptual illustration of multi-source confluence scoring.
# The helper functions called below are placeholders for proprietary logic.
def calculate_confluence_score(canonical_id, artist_tier):
    # Single canonical ID enables cross-dataset queries
    signals = gather_signals_from_all_sources(canonical_id)
    # Apply proprietary tier-aware weighting
    weighted_signals = apply_tier_aware_weights(signals, artist_tier)
    # Multi-source confirmation bonus
    base_score = aggregate_weighted_signals(weighted_signals)
    confirmation_bonus = calculate_multi_source_bonus(len(signals))
    return normalize_to_scale(base_score * confirmation_bonus)  # 1-10 scale
Without canonical resolution: Would require querying each table with multiple name variants ("Pablo Picasso", "PICASSO, Pablo", "P. Picasso"), manual deduplication, potential double-counting, 10-100x slower execution.
Real-Time Deduplication
Canonical linking enables automatic deduplication across user activities. For example, a collector tracking "Pablo Picasso", "PICASSO, Pablo", and "P. Picasso" in their watchlist would see 3 separate entries without canonical resolution—but with canonical IDs, these automatically merge into a single artist record with unified engagement scoring.
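The watchlist example can be sketched as a group-by on canonical ID. The `merge_watchlist` function, the entry shape, and the summed engagement score are hypothetical; `resolve` stands in for whatever maps a raw name to its canonical ID.

```python
from collections import defaultdict
from typing import Callable


def merge_watchlist(
    entries: list[tuple[str, int]],
    resolve: Callable[[str], int],
) -> dict[int, dict]:
    """Collapse (raw_name, engagement) entries into one record per canonical ID."""
    merged: dict[int, dict] = defaultdict(lambda: {"names": [], "engagement": 0})
    for raw_name, engagement in entries:
        record = merged[resolve(raw_name)]
        record["names"].append(raw_name)     # keep the variants for display
        record["engagement"] += engagement   # unified engagement scoring
    return dict(merged)
```

Three Picasso variants that resolve to the same ID therefore end up as one record whose engagement is the sum of the three entries.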
5. Multi-Dimensional Vector Search
Egon creates 4 independent high-dimensional embeddings per canonical artist. The innovation: orthogonal dimensions enable weighted combination queries impossible on public web search.
4-Dimensional Embedding Architecture
| Dimension | Training Data | Orthogonality Validation | Use Case |
|---|---|---|---|
| Aesthetic | Visual style, medium, movement descriptions, artwork imagery | Low correlation with market | "Find artists aesthetically similar to Basquiat" |
| Market | Auction prices, gallery tier, collector base, transaction volume | Low correlation with aesthetic | "Find artists at similar price points" |
| Institutional | Museum collections, retrospectives, awards, critical reception | Low correlation with investment | "Find artists with similar institutional validation" |
| Investment | Price trends, momentum signals, tier classification, liquidity | Low correlation with market | "Find artists with similar investment profiles" |
Orthogonality Validation: Statistical correlation is measured across thousands of artist pairs. Low correlation between dimensions (r < 0.2) confirms that they capture largely independent artist attributes. Example: the aesthetic and market dimensions are largely independent, so a lower-priced artist can have Blue-chip aesthetic quality (the basis for Discovery tier alpha opportunities).
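One plausible way to run such a check, shown here as an assumption rather than Egon's documented method, is to compute the similarity of every artist pair in two embedding spaces and take the Pearson correlation of the two similarity lists. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def dimension_correlation(emb_a: dict, emb_b: dict) -> float:
    """Correlate pairwise artist similarities across two embedding dimensions."""
    pairs = list(combinations(sorted(emb_a), 2))
    sims_a = [cosine(emb_a[p], emb_a[q]) for p, q in pairs]
    sims_b = [cosine(emb_b[p], emb_b[q]) for p, q in pairs]
    return pearson(sims_a, sims_b)
```

A value near zero means that knowing two artists are close in one dimension says little about their closeness in the other, which is the property weighted queries rely on.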
Weighted Combination Queries
# Example 1: "Find artists aesthetically like Basquiat but cheaper"
search_results = vector_search_service.search_multi_dimensional(
    canonical_artist_id=basquiat_id,
    weights={
        'aesthetic': 'primary',  # Prioritize aesthetic similarity
        'market': 'secondary'    # Secondary factor: market positioning (cheaper = lower tier)
    },
    limit=10
)
# Returns: Artists with similar style (graffiti, neo-expressionism)
# but avg_hammer_price < $50K (vs Basquiat $500K+)

# Example 2: "Find Discovery tier artists with Blue-chip aesthetic"
search_results = vector_search_service.search_multi_dimensional(
    canonical_artist_id=blue_chip_artist_id,
    weights={
        'aesthetic': 'primary',   # Prioritize aesthetic match
        'investment': 'tertiary'  # Lower weight for investment profile (Discovery tier filter)
    },
    filters={'tier': 'Discovery'},
    limit=10
)
# Returns: Emerging artists ($5K-$25K) with established aesthetic appeal
# (undervalued opportunities before market recognition)

# Example 3: "Find artists with institutional validation but undervalued"
search_results = vector_search_service.search_multi_dimensional(
    canonical_artist_id=reference_artist_id,
    weights={
        'institutional': 'primary',  # Prioritize institutional similarity
        'market': 'secondary'        # Secondary factor: market position (undervalued filter)
    },
    filters={'avg_hammer_price__lt': 100000},
    limit=10
)
# Returns: Museum-validated artists (MoMA, Met, Tate) under $100K
# (institutional quality at accessible prices)
Why This Is Impossible on Public Web: Google/Bing cannot answer "aesthetically like X but cheaper" because they lack separate aesthetic and market embeddings. Web search collapses all dimensions into a single relevance score. Egon's 4-dimensional architecture enables precise control over which attributes matter for a given query.
Performance: Sub-200ms for weighted multi-dimensional search across 5,000+ canonical artists using optimized vector indexes.
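The idea behind a weighted combination query can be illustrated with a brute-force sketch: score each candidate by a weighted average of per-dimension cosine similarities, then rank. This is not the `vector_search_service` API; `weighted_search`, the numeric weights (standing in for the 'primary' and 'secondary' labels), and the `predicate` filter are assumptions, and a production system would use vector indexes rather than a linear scan.

```python
import math
from typing import Callable, Optional


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def weighted_search(
    reference: dict[str, list[float]],
    candidates: dict[str, dict[str, list[float]]],
    weights: dict[str, float],
    limit: int = 10,
    predicate: Optional[Callable[[str], bool]] = None,
) -> list[str]:
    """Rank candidates by weighted per-dimension similarity to the reference."""
    total = sum(weights.values())
    scored = []
    for name, embeddings in candidates.items():
        if predicate is not None and not predicate(name):
            continue  # e.g. a price or tier filter
        score = sum(w * cosine(reference[dim], embeddings[dim]) for dim, w in weights.items())
        scored.append((score / total, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:limit]]
```

Because each dimension has its own embedding, changing the weights reorders the results: a candidate that matches the reference aesthetically but not on market position ranks first under aesthetic-heavy weights and last under market-heavy ones.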
6. Technical Architecture
Egon's data universe creates differentiated capabilities through three core technical innovations:
1. Source-Aware Resolution Routing
- Authority skip for autocomplete: Fast resolution for user input vs. thorough verification for professional sources
- Tiered caching strategy: Multiple cache tiers with TTLs optimized per data type for high cache hit rates
- Confidence scoring: Variable confidence ranges affect downstream investment grading (low confidence = grade penalty)
2. Tier-Aware Signal Weighting
- Discovery tier amplification: Gallery moves (highest weight), museum acquisitions (very high), retrospectives (high)
- Blue-chip tier weighting: Auction records (high weight), market shifts (moderate), milestones (minimal)
- Backtesting validation: Weights derived from 2,000+ artist trajectories, recalibrated quarterly
3. 4-Dimensional Orthogonal Embeddings
- Independent dimensions: Low correlation between dimensions ensures weighted queries control specific attributes
- Weighted combination queries: "Aesthetically like X but cheaper" impossible without orthogonal embeddings
- Sub-200ms vector search: Optimized vector indexes for fast similarity queries
Network Effects
More data sources → higher canonical link rate (99.98%) → better cross-dataset intelligence (5-signal confluence) → more valuable insights (momentum detection) → more users → more user interaction data (watchlist, collection, analysis) → improved tier-aware signal weights (backtesting recalibration) → more data sources, closing the loop.
Key Differentiators
- ✅ 99.98% canonical link rate (4,980 auction artist records linked, 1 orphan)
- ✅ Source-aware resolution routing (Getty skip for autocomplete, mandatory for professional sources)
- ✅ Tier-aware signal weighting (signals that are exceptional for a Discovery-tier artist are weighted significantly higher than the same signals when they are expected of a Blue-chip artist)
- ✅ 4-dimensional orthogonal embeddings (r < 0.2 correlation enables weighted queries)
- ✅ Multi-dimensional semantic search ("aesthetically like X but cheaper" impossible on public web)
- ✅ Cross-dataset intelligence (5-signal confluence scoring across 10+ tables)
- ✅ Real-time deduplication (across all name variants via single canonical ID)
Questions? Contact us through Egon