Database Design

Overview

SupaScan uses ClickHouse as its primary analytical database, chosen for its columnar storage and its performance on time-series and real-time analytical workloads. The system combines multiple data sources (RPC nodes, archival nodes, and custom indexing) to provide comprehensive Solana blockchain analysis.

Architecture Overview

Data Sources and Storage Strategy

graph TD
    A[Solana RPC Nodes] --> B[Real-time Indexer]
    C[Archival Nodes] --> D[Historical Data Processor]
    E[Custom Parsers] --> F[Specialized Data Extractor]
    
    B --> G[ClickHouse Cluster]
    D --> G
    F --> G
    
    G --> H[Hot Storage - Recent Data]
    G --> I[Warm Storage - Historical Data]
    G --> J[Cold Storage - Archived Data]
    
    H --> K[Real-time Analytics]
    I --> L[Historical Analysis]
    J --> M[Long-term Storage]

ClickHouse Cluster Configuration

graph TD
    A[Load Balancer] --> B[ClickHouse Shard 1]
    A --> C[ClickHouse Shard 2]
    A --> D[ClickHouse Shard 3]
    
    B --> E[Replica 1.1]
    B --> F[Replica 1.2]
    C --> G[Replica 2.1]
    C --> H[Replica 2.2]
    D --> I[Replica 3.1]
    D --> J[Replica 3.2]
    
    K[ZooKeeper] --> B
    K --> C
    K --> D

Schema Design

Core Tables Structure

erDiagram
    TRANSACTIONS ||--o{ TOKEN_TRANSFERS : contains
    TRANSACTIONS ||--o{ TOKEN_SWAPS : contains
    TOKENS ||--o{ TOKEN_TRANSFERS : involved_in
    TOKENS ||--o{ TOKEN_SWAPS : traded_in
    WALLETS ||--o{ TRANSACTIONS : participates_in
    DEX_POOLS ||--o{ TOKEN_SWAPS : facilitates
    
    TRANSACTIONS {
        string signature PK
        uint64 slot
        uint64 block_time
        uint64 fee
        uint8 success
        array accounts
        array instructions
        array logs
        string signer
        string program_id
    }
    
    TOKEN_TRANSFERS {
        string signature FK
        string token_mint
        string from_address
        string to_address
        uint256 amount
        uint8 decimals
        float64 ui_amount
    }
    
    TOKEN_SWAPS {
        string signature FK
        string dex_protocol
        string token_in
        string token_out
        uint256 amount_in
        uint256 amount_out
        float64 price_impact
    }
    
    TOKENS {
        string mint_address PK
        string symbol
        string name
        uint8 decimals
        uint256 total_supply
        string metadata_uri
    }
    
    WALLETS {
        string address PK
        uint64 first_seen
        uint64 last_activity
        uint64 transaction_count
        float64 total_volume
    }

Table Definitions

1. transactions

Core table storing all Solana transactions with optimized partitioning.

CREATE TABLE transactions (
    signature String,
    slot UInt64,
    block_time UInt64,
    fee UInt64,
    success UInt8,
    accounts Array(String),
    instructions Array(String),
    logs Array(String),
    signer String,
    program_id String,
    compute_units_consumed UInt32,
    -- Materialized date column used for monthly partitioning
    date Date MATERIALIZED toDate(toDateTime(block_time))
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/transactions', '{replica}')
PARTITION BY toYYYYMM(date)
ORDER BY (slot, signature)
SETTINGS index_granularity = 8192;

Purpose: Stores all Solana transactions with full details.

Key Points:

  • Partitioned by month for efficient time-based queries
  • Ordered by slot for optimal range scans
  • Replicated across cluster for high availability
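Because the partition key is derived from `block_time`, the partition a row lands in can be computed ahead of time on the ingestion side. A minimal sketch (the `partitionKey` helper is illustrative, not part of SupaScan's codebase), assuming `block_time` is a unix timestamp in seconds:

```typescript
// Compute the toYYYYMM(date) partition key for a transaction row,
// mirroring the MATERIALIZED date column (date = toDate(block_time)).
function partitionKey(blockTime: number): number {
  const d = new Date(blockTime * 1000); // unix seconds -> JS Date (UTC)
  return d.getUTCFullYear() * 100 + (d.getUTCMonth() + 1); // e.g. 202401
}
```

This is useful, for example, when deciding which monthly partitions a backfill job will touch before issuing any inserts.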

2. token_transfers

SPL token transfer events with decimal handling.

CREATE TABLE token_transfers (
    signature String,
    slot UInt64,
    block_time UInt64,
    token_mint String,
    from_address String,
    to_address String,
    amount UInt256,
    decimals UInt8,
    ui_amount Float64,
    date Date MATERIALIZED toDate(toDateTime(block_time))
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/token_transfers', '{replica}')
PARTITION BY toYYYYMM(date)
ORDER BY (token_mint, block_time, signature)
SETTINGS index_granularity = 8192;

Purpose: Tracks all SPL token transfers with proper decimal handling.
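The `ui_amount` column is the human-readable value derived from the raw `amount` and the mint's `decimals`. A minimal sketch of that conversion (the `toUiAmount` name is illustrative), using `bigint` because raw SPL amounts can exceed `Number.MAX_SAFE_INTEGER`:

```typescript
// Convert a raw SPL token amount to its UI value: amount / 10^decimals.
// The result is a double, so extremely large amounts lose precision,
// matching the Float64 ui_amount column.
function toUiAmount(amount: bigint, decimals: number): number {
  const divisor = 10n ** BigInt(decimals);
  const whole = amount / divisor;   // integer part, exact in bigint
  const frac = amount % divisor;    // fractional remainder
  return Number(whole) + Number(frac) / Number(divisor);
}
```

Keeping the raw `amount` as UInt256 and the derived `ui_amount` as Float64 lets exact accounting and convenient display queries coexist in the same table.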

3. token_swaps

DEX swap events across multiple protocols.

CREATE TABLE token_swaps (
    signature String,
    slot UInt64,
    block_time UInt64,
    dex_protocol String,
    token_in String,
    token_out String,
    amount_in UInt256,
    amount_out UInt256,
    price_impact Float64,
    pool_address String,
    date Date MATERIALIZED toDate(toDateTime(block_time))
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/token_swaps', '{replica}')
PARTITION BY toYYYYMM(date)
ORDER BY (dex_protocol, block_time, signature)
SETTINGS index_granularity = 8192;

Purpose: Tracks DEX swaps from Raydium, Meteora, PumpFun, etc.
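The exact formula SupaScan's parsers use to populate `price_impact` is not specified here; one standard definition, sketched below as an assumption, is the relative deviation of the effective execution price from the pool's pre-trade spot price:

```typescript
// Price impact as deviation of the effective price from the spot price:
// effective = amountOut / amountIn, impact = 1 - effective / spot.
// spotPrice is the pool's pre-trade price of token_in quoted in token_out.
function priceImpact(amountIn: number, amountOut: number, spotPrice: number): number {
  const effectivePrice = amountOut / amountIn;
  return 1 - effectivePrice / spotPrice;
}
```

Under this definition a swap receiving 75 out-tokens for 100 in-tokens against a 1:1 spot price has a 25% impact; a swap filled exactly at spot has zero impact.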

4. tokens

Token metadata and supply information.

CREATE TABLE tokens (
    mint_address String,
    symbol String,
    name String,
    decimals UInt8,
    total_supply UInt256,
    metadata_uri String,
    created_at UInt64,
    creator String,
    is_verified UInt8 DEFAULT 0
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/tokens', '{replica}')
ORDER BY mint_address
SETTINGS index_granularity = 8192;

Purpose: Token registry with metadata and verification status.

5. wallets

Wallet activity and statistics.

CREATE TABLE wallets (
    address String,
    first_seen UInt64,
    last_activity UInt64,
    transaction_count UInt64,
    total_volume Float64,
    unique_tokens UInt32,
    is_contract UInt8 DEFAULT 0
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/wallets', '{replica}')
ORDER BY address
SETTINGS index_granularity = 8192;

Purpose: Wallet profiles and activity metrics.

6. webhook_subscriptions

User webhook configurations.

CREATE TABLE webhook_subscriptions (
    webhook_id String,
    user_id String,
    webhook_url String,
    filters String, -- JSON filters
    status String,
    created_at DateTime DEFAULT now(),
    last_triggered DateTime
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/webhooks', '{replica}')
ORDER BY (user_id, created_at)
SETTINGS index_granularity = 8192;

Purpose: Manages user webhook subscriptions and filters.
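The JSON `filters` column implies a matching step at delivery time. A minimal sketch of applying stored filters to an event (the filter shape with `accounts` and `programs` keys is an assumption for illustration, not SupaScan's actual format):

```typescript
interface WebhookFilters {
  accounts?: string[];   // match if the tx touches any of these accounts
  programs?: string[];   // match if the tx invoked any of these programs
}

interface TxEvent {
  accounts: string[];
  programId: string;
}

// A subscription fires only when every filter that is present is satisfied;
// an empty filter object matches everything.
function matches(filters: WebhookFilters, tx: TxEvent): boolean {
  if (filters.accounts && !filters.accounts.some(a => tx.accounts.includes(a))) return false;
  if (filters.programs && !filters.programs.includes(tx.programId)) return false;
  return true;
}
```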

Data Sources and Integration

RPC Node Strategy

SupaScan uses multiple RPC endpoints for redundancy and performance:

const rpcConfig = {
  primary: 'https://api.mainnet-beta.solana.com',
  fallbacks: [
    'https://solana-api.projectserum.com',
    'https://rpc.ankr.com/solana',
    'https://solana-mainnet.g.alchemy.com/v2/demo'
  ],
  archival: [
    'https://api.mainnet-beta.solana.com', // Full historical data
    'https://solana-mainnet.phantom.app'  // Extended history
  ]
};

Key Benefits:

  • Load Distribution: Spread requests across multiple providers
  • Fault Tolerance: Automatic failover if primary fails
  • Rate Limit Management: Distribute load to avoid limits
  • Historical Access: Dedicated archival nodes for old data
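The failover behavior above can be sketched as an ordered walk over the endpoint list (the `rpcCall` callback is a stand-in for an actual JSON-RPC request):

```typescript
// Try the primary endpoint first, then each fallback in order.
// Throws only if every endpoint fails.
async function withFailover<T>(
  endpoints: string[],
  rpcCall: (endpoint: string) => Promise<T>
): Promise<T> {
  let lastError: unknown;
  for (const endpoint of endpoints) {
    try {
      return await rpcCall(endpoint);
    } catch (err) {
      lastError = err; // remember the failure and move to the next endpoint
    }
  }
  throw lastError;
}
```

A production version would typically add per-endpoint timeouts and error-rate tracking so a consistently failing provider is demoted rather than retried first on every call.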

Data Ingestion Pipeline

sequenceDiagram
    participant RPC as RPC Nodes
    participant Parser as Data Parser
    participant Validator as Data Validator
    participant ClickHouse as ClickHouse Cluster
    participant Cache as Redis Cache
    
    RPC->>Parser: Raw block data
    Parser->>Parser: Extract transactions
    Parser->>Parser: Parse token transfers
    Parser->>Parser: Detect DEX swaps
    Parser->>Validator: Structured data
    Validator->>Validator: Validate schema
    Validator->>Validator: Check business logic
    Validator->>ClickHouse: Insert batch
    ClickHouse->>Cache: Update cache
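The validator stage in the diagram can be sketched as a pure filtering step between parsing and the batch insert (field set and checks here are simplified placeholders; real Solana parsing and validation involve far more):

```typescript
interface ParsedTx { signature: string; slot: number; success: number }

// Validator stage: schema checks (shape and ranges) plus a business-logic
// check (success must be a 0/1 flag, matching the UInt8 column).
function validate(tx: ParsedTx): boolean {
  const schemaOk = tx.signature.length > 0 && Number.isInteger(tx.slot) && tx.slot >= 0;
  const businessOk = tx.success === 0 || tx.success === 1;
  return schemaOk && businessOk;
}

// Only rows that pass validation reach the ClickHouse insert batch.
function filterValid(txs: ParsedTx[]): ParsedTx[] {
  return txs.filter(validate);
}
```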

ClickHouse Optimizations

Compression and Storage

-- Compression codecs are applied per column, not per table
ALTER TABLE transactions MODIFY COLUMN logs Array(String) CODEC(LZ4HC);
ALTER TABLE token_transfers MODIFY COLUMN from_address String CODEC(ZSTD);

-- Raise the wide-part thresholds so small parts stay in the compact format
ALTER TABLE transactions MODIFY SETTING 
  min_bytes_for_wide_part = 100000000,
  min_rows_for_wide_part = 1000000;

Materialized Views for Analytics

-- Real-time token volume aggregation
CREATE MATERIALIZED VIEW token_volume_1h
ENGINE = SummingMergeTree()
ORDER BY (token_mint, hour)
AS SELECT
    token_mint,
    toStartOfHour(toDateTime(block_time)) as hour,
    sum(ui_amount) as volume,  -- in token units; USD conversion needs a price feed
    count() as transfer_count
FROM token_transfers
GROUP BY token_mint, hour;

-- Top wallets by activity
CREATE MATERIALIZED VIEW top_wallets_daily
ENGINE = SummingMergeTree()  -- sums the partial tx_count/total_fees from each insert block
ORDER BY (date, address)
AS SELECT
    toDate(toDateTime(block_time)) as date,
    signer as address,
    count() as tx_count,
    sum(fee) as total_fees
FROM transactions
GROUP BY date, address;

Sparse Indexes for Performance

-- Bloom filter for token lookups
ALTER TABLE token_transfers ADD INDEX idx_token_mint token_mint TYPE bloom_filter GRANULARITY 1;

-- MinMax index for time ranges
ALTER TABLE transactions ADD INDEX idx_block_time block_time TYPE minmax GRANULARITY 1;

-- Set index for array columns
ALTER TABLE transactions ADD INDEX idx_accounts accounts TYPE set(100) GRANULARITY 1;

Query Patterns and Performance

Common Analytics Queries

-- Most liquid tokens (24h)
SELECT 
    token_mint,
    sum(ui_amount) as volume_24h,
    count() as transfer_count
FROM token_transfers
WHERE date >= today() - 1  -- prune partitions first
  AND block_time > toUnixTimestamp(now() - INTERVAL 24 HOUR)
GROUP BY token_mint
ORDER BY volume_24h DESC
LIMIT 100;

-- Wallet trading patterns
SELECT 
    signer,
    count() as tx_count,
    countIf(success = 1) as successful_txs,
    sum(fee) as total_fees,
    uniqExact(program_id) as unique_programs
FROM transactions
WHERE date >= today() - 7  -- prune partitions first
  AND block_time > toUnixTimestamp(now() - INTERVAL 7 DAY)
GROUP BY signer
HAVING tx_count > 100
ORDER BY tx_count DESC;

Time-Series Optimizations

-- Efficient time-based queries using partitioning
SELECT 
    toStartOfHour(toDateTime(block_time)) as hour,
    count() as tx_count,
    sum(fee) as total_fees
FROM transactions
WHERE date >= today() - 7  -- Uses partition pruning
GROUP BY hour
ORDER BY hour;

-- Token price impact analysis
SELECT 
    dex_protocol,
    avg(price_impact) as avg_impact,
    quantile(0.95)(price_impact) as p95_impact
FROM token_swaps
WHERE date >= today() - 1  -- prune partitions first
  AND block_time > toUnixTimestamp(now() - INTERVAL 24 HOUR)
GROUP BY dex_protocol;

Data Retention and Archival

TTL Policies

-- Automatic data retention (TTL expressions must evaluate to Date/DateTime,
-- so the UInt64 block_time is cast first)
ALTER TABLE token_transfers MODIFY TTL toDateTime(block_time) + INTERVAL 1 YEAR;
ALTER TABLE token_swaps MODIFY TTL toDateTime(block_time) + INTERVAL 6 MONTH;

-- transactions: move to cold storage after 1 year, delete after 2.
-- A table has a single TTL clause, so both rules go in one statement.
ALTER TABLE transactions MODIFY TTL
    toDateTime(block_time) + INTERVAL 1 YEAR TO VOLUME 'cold_storage',
    toDateTime(block_time) + INTERVAL 2 YEAR DELETE;

Sharding Strategy

-- A single Distributed table routes writes across shards via a sharding key;
-- slot % 3 spreads consecutive slots round-robin over the 3 shards
CREATE TABLE transactions_distributed AS transactions
ENGINE = Distributed('supascan_cluster', 'supascan', 'transactions', slot % 3);
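With a `slot % N` sharding key, the shard a row lands on can also be computed client-side, e.g. to route reads or pre-partition batches (the helper below is illustrative):

```typescript
// Map a slot to its shard index under a slot % N sharding key.
// Consecutive slots round-robin across the cluster, so ingestion
// load stays balanced as the chain advances.
function shardForSlot(slot: number, shardCount: number): number {
  return slot % shardCount;
}
```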

Monitoring and Maintenance

Health Checks

-- Monitor cluster topology and per-replica error counts
SELECT 
    cluster,
    shard_num,
    replica_num,
    host_name,
    errors_count
FROM system.clusters
WHERE cluster = 'supascan_cluster';

-- Check replication lag
SELECT 
    table,
    replica_name,
    is_leader,
    log_pointer,
    absolute_delay
FROM system.replicas
WHERE database = 'supascan';

Performance Monitoring

-- Query performance analysis
SELECT 
    query,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage
FROM system.query_log
WHERE event_date = today() AND type = 'QueryFinish'  -- skip QueryStart duplicates
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Table size monitoring
SELECT 
    table,
    formatReadableSize(sum(bytes)) as size,
    count() as parts
FROM system.parts
WHERE database = 'supascan' AND active  -- exclude parts already merged away
GROUP BY table
ORDER BY sum(bytes) DESC;

Best Practices

1. Use Appropriate Data Types

  • UInt64 for slots and timestamps
  • UInt256 for large token amounts
  • String for addresses and signatures
  • Array(String) for account lists

2. Optimize for Time-Series

  • Always partition by date
  • Use time-based ORDER BY clauses
  • Leverage partition pruning in WHERE clauses

3. Batch Operations

  • Insert data in batches of 1000-10000 rows
  • Use async inserts for better throughput
  • Compress data before transmission
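The batch-size guidance above can be sketched as a chunking helper feeding ClickHouse inserts (the `insertBatch` callback stands in for a real client call, e.g. an HTTP INSERT):

```typescript
// Flush rows in fixed-size batches; ClickHouse performs far better with
// a few large inserts than with many single-row ones.
async function insertInBatches<Row>(
  rows: Row[],
  batchSize: number,
  insertBatch: (batch: Row[]) => Promise<void>
): Promise<number> {
  let batches = 0;
  for (let i = 0; i < rows.length; i += batchSize) {
    await insertBatch(rows.slice(i, i + batchSize));
    batches++;
  }
  return batches; // number of inserts issued
}
```

For example, 2,500 rows with a batch size of 1,000 results in three inserts (1,000 + 1,000 + 500) rather than 2,500.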

4. Monitor Resource Usage

  • Track memory consumption per query
  • Monitor disk I/O patterns
  • Set appropriate limits for concurrent queries

5. Backup and Recovery

  • Regular automated backups
  • Test recovery procedures
  • Maintain multiple backup copies