
Benchmark Guide: Running ClickHouse for Observability on Different Cloud Providers

2026-03-05
11 min read

A practical 2026 benchmark plan for ClickHouse on AWS, GCP, and Azure VMs — workloads, metrics, and configurations for observability teams.

Why your observability ClickHouse benchmark must be objective and repeatable

Rising cloud bills, unpredictable query latency, and noisy, high-cardinality logs are the exact problems ops teams told us keep them up at night in 2026. If you plan to run ClickHouse for observability (logs and metrics) on cloud VMs, you need a pragmatic, repeatable benchmark plan that compares real-world workloads across AWS, GCP and Azure VM types — not vendor slides or anecdotal numbers.

Executive summary: What this guide delivers

This guide gives operations teams a complete, production-oriented benchmark plan for ClickHouse in 2026 across the three major clouds. You will get:

  • A clear benchmark matrix (VM families, disk options, cluster topologies, replication)
  • Three realistic observability workloads (high-cardinality logs, time-series metrics, mixed OLAP queries)
  • Concrete metrics to collect (ingest rate, P50/P95/P99 query latency, storage IO, CPU, memory, cost/throughput)
  • Practical tuning and configuration suggestions for ClickHouse and the cloud VMs
  • How to interpret results and make an informed choice for logs vs metrics

Context: Why 2025–2026 matters for ClickHouse benchmarks

ClickHouse continues to accelerate adoption for observability. In early 2026 the company closed a large funding round, underlining broad enterprise traction. That growth brings faster release cycles, new storage engine optimizations, and tighter integrations for cloud deployments. At the same time, cloud providers expanded high-IO instance types and ultra-low-latency block storage in late 2024–2025 — which changes the cost/performance equation for single-node NVMe vs distributed clusters.

"ClickHouse's rapid product and ecosystem growth in 2025–2026 means benchmarks must test realistic, multi-tenant workloads and include cost normalization."

Benchmark design principles

  1. Realism over synthetic extremes — use realistic message sizes, cardinality and query mixes reflective of production observability (logs + metrics).
  2. Isolate variables — change one factor at a time: VM family, disk type, or cluster size.
  3. Repeatability — use automation (Terraform, Ansible, or the cloud CLIs) and record software versions and configs.
  4. Measure cost alongside performance — report $/ingest, $/query, and $/storage.
  5. Warm-up and steady-state — run a warm-up phase until merges stabilize, then collect steady-state metrics over multiple hours.
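
The steady-state criterion in principle 5 can be automated rather than eyeballed: poll a merge-related signal (e.g., the active part count from system.parts) once a minute and declare steady state when the trend flattens. A minimal sketch of such a check — the window size and tolerance are arbitrary choices for illustration, not ClickHouse recommendations:

```python
def is_steady_state(samples, window=10, tolerance=0.05):
    """Return True when the last `window` samples (e.g., active part
    counts polled from system.parts once a minute) all sit within
    `tolerance` relative deviation of their mean, i.e., merge activity
    has stabilized enough to start collecting steady-state metrics."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    mean = sum(recent) / window
    if mean == 0:
        return True
    return all(abs(s - mean) / mean <= tolerance for s in recent)
```

Feed it the rolling series from your monitoring exporter and gate the measurement phase of the harness on it returning True.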

Environment matrix: VM types and storage options to include

Start with a common cross-cloud matrix that targets the same role profile: write-heavy nodes for ingestion and memory-optimized nodes for query. Example families to test in 2026:

  • AWS: i4i/i4 (NVMe), r6i (memory), c7i (compute), with storage variants: local NVMe, EBS gp3, EBS io2/Block Express
  • GCP: C2D (compute), M2 (memory-optimized), N2D with local SSDs, Persistent SSD (pd-ssd) and Extreme Persistent Disks
  • Azure: Lsv3 (storage NVMe), M-series (memory), Dv5 (general purpose) / Fsv2 (compute) families and Ultra Disk / Ephemeral OS Disk

Topology matrix (start small and expand):

  • Single-node ClickHouse using local NVMe — baseline for max single-node throughput.
  • 3-node replicated cluster (ReplicatedMergeTree + ClickHouse Keeper) with local NVMe for ingestion and remote disks for replicas.
  • 6-node sharded cluster (2 shards × 3 replicas) to measure distributed query scaling and cross-node network impact.
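
For the 2-shard × 3-replica topology, the remote_servers section of the ClickHouse server config would look roughly like this (hostnames and the cluster name are placeholders; a sketch of the layout, not a complete server config):

```xml
<remote_servers>
    <bench_cluster>
        <shard>
            <internal_replication>true</internal_replication>
            <replica><host>ch-s1-r1</host><port>9000</port></replica>
            <replica><host>ch-s1-r2</host><port>9000</port></replica>
            <replica><host>ch-s1-r3</host><port>9000</port></replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <replica><host>ch-s2-r1</host><port>9000</port></replica>
            <replica><host>ch-s2-r2</host><port>9000</port></replica>
            <replica><host>ch-s2-r3</host><port>9000</port></replica>
        </shard>
    </bench_cluster>
</remote_servers>
```

With internal_replication enabled, inserts go to one replica per shard and ReplicatedMergeTree handles fan-out, which is the behavior you want to benchmark.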

Workloads: Three observability scenarios

Each workload contains ingest patterns, query patterns and cardinality assumptions. Use a data generator (guidance below) or tools like clickhouse-benchmark and Vector/Fluent Bit for realistic ingestion.

Workload A — High-cardinality logging (write-heavy)

Purpose: Validate sustained ingest throughput and storage IO behavior for logs with many tags and high cardinality.

  • Row format: JSONEachRow or Native with 20–40 fields (timestamp, service, host, level, message, 10 tag keys)
  • Message size: 400–2,000 bytes (avg ~800B)
  • Cardinality: 50k distinct tag values across hosts/services; heavy label cardinality (simulate microservices)
  • Target ingest rates: 50k, 200k, 500k events/sec (scale by cluster size)
  • Insert pattern: many small batch writers (100–10k rows per insert) to reflect agents
  • Queries (concurrent with writes): recent-window tail queries (last 5–15 minutes), group by tags with topN and P95 latency
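
A generator for Workload A events can be small; the sketch below emits JSONEachRow lines matching the shape above (service/host/tag names and counts are illustrative assumptions, sized to hit the stated cardinality):

```python
import json
import random
import time

SERVICES = [f"svc-{i}" for i in range(200)]      # simulated microservices
HOSTS = [f"host-{i}" for i in range(500)]
TAG_VALUES = [f"v{i}" for i in range(50_000)]    # ~50k distinct tag values

def make_log_event():
    """One synthetic log row in the Workload A shape (JSONEachRow)."""
    event = {
        "timestamp": time.time(),
        "service": random.choice(SERVICES),
        "host": random.choice(HOSTS),
        "level": random.choice(["INFO", "WARN", "ERROR"]),
        # pad the message into the 400-2,000 byte range
        "message": "request completed " + "x" * random.randint(200, 1500),
    }
    # ten tag keys, each drawn from a high-cardinality value space
    for i in range(10):
        event[f"tag_{i}"] = random.choice(TAG_VALUES)
    return json.dumps(event)

# Example: one batch of 1,000 rows, ready to pipe into
#   clickhouse-client --query='INSERT INTO logs FORMAT JSONEachRow'
batch = "\n".join(make_log_event() for _ in range(1000))
```

Run many such writers in parallel with small, staggered batches to approximate agent behavior rather than one large firehose.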

Workload B — Time-series metrics (write + read)

Purpose: Measure high-rate metric ingestion and low-latency aggregated queries.

  • Row format: metric_name, timestamp (ms), labels map, value
  • Message size: 100–300 bytes
  • Cardinality: medium (10k–50k series)
  • Target ingest rates: 100k–1M samples/sec (batched inserts)
  • Queries: rollups (avg, sum), range queries across 1m/1h windows, label-based filters, and series retrievals (single-series read latencies)
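
The Workload B row shape translates directly into a sample generator; this sketch uses assumed metric names and a bounded series-id space to stay inside the 10k–50k series target:

```python
import random
import time

METRICS = ["cpu_usage", "mem_bytes", "http_requests_total"]  # illustrative names

def make_sample(series_id):
    """One metric sample in the Workload B row shape:
    (metric_name, timestamp_ms, labels, value)."""
    return (
        random.choice(METRICS),
        int(time.time() * 1000),
        {"series": f"s{series_id}", "region": random.choice(["us", "eu"])},
        random.random() * 100,
    )

# Bound the series-id space to ~30k to land in the medium-cardinality range
samples = [make_sample(random.randrange(30_000)) for _ in range(5000)]
```

Convert batches of these tuples to whatever wire format your inserter uses (Native or RowBinary will keep per-sample overhead near the 100–300 byte target).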

Workload C — Mixed OLAP (historical analysis + ad hoc)

Purpose: Simulate analytics queries run by SREs for incident analysis and capacity planning on historical data.

  • Includes both logs and metrics data streams ingested according to A/B
  • Queries: heavy GROUP BY on high-cardinality columns, top-k over 24h windows, distributed JOINs and DISTINCT counts
  • Measure concurrent query throughput (QPS) and tail latencies while ingestion continues

Schema and ClickHouse configuration (practical defaults)

Use MergeTree family tables tuned for observability:

  • Partition tables by day (PARTITION BY toDate(timestamp)), with an ORDER BY key of (service, timestamp) for metrics and (service, host, timestamp) for logs.
  • Use ReplicatedMergeTree in clusters with ClickHouse Keeper for replication and failover.
  • Compression: test LZ4 (fast) and ZSTD at levels 1–3 for a CPU vs IO tradeoff.
  • Use index granularity and the ORDER BY key to control part sizes: aim for part sizes of 256MB–2GB after merges.
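
The first bullet translates into DDL along these lines for the logs table (column names and types are illustrative assumptions, not a prescribed schema):

```sql
-- Illustrative logs table; swap in ReplicatedMergeTree for clusters, e.g.
--   ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/logs', '{replica}')
CREATE TABLE logs
(
    timestamp  DateTime64(3),
    service    LowCardinality(String),
    host       LowCardinality(String),
    level      LowCardinality(String),
    message    String,
    tags       Map(String, String)
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (service, host, timestamp);
```

LowCardinality on the dimension columns is usually a win for observability data; verify it against your actual tag cardinality during the benchmark rather than assuming it.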

Key server settings to tune (start values):

  • max_memory_usage: set per-node (e.g., 60–80% of available RAM)
  • max_threads: set to number of vCPUs or slightly higher
  • max_insert_threads: 4–8 (for parallel insert pipelines)
  • max_insert_block_size: 1048576 (or tune to your batch sizes)
  • min_rows_for_wide_part (MergeTree table setting): adjust to force wide vs compact parts depending on the read/write workload

Ingest tooling and generators

Realistic ingestion is critical. Options:

  • Vector or Fluent Bit as the agent — forward logs to Kafka or directly to ClickHouse via HTTP or native TCP.
  • Kafka as a buffer for high ingestion bursts. Use a few producers to simulate many agents.
  • Generators: use clickhouse-benchmark to drive synthetic queries and inserts, plus custom Go/Node producers for realistic payloads.

Batching guidance: batch 1k–10k rows per insert for metrics; 100–2k rows for logs (depending on agent behavior). Smaller batches raise request overhead; larger batches increase memory spikes.
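
The batching trade-off above is easy to get right with a small chunking helper in whatever producer you write (the sizes in the guidance are targets, not hard rules):

```python
from itertools import islice

def batches(rows, batch_size):
    """Group an iterable of rows into lists of at most batch_size,
    so each list becomes one INSERT (e.g., 5000 rows for metrics,
    1000 for logs). The final batch may be short."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk
```

Because it consumes a plain iterator, the same helper works whether rows come from a generator, a Kafka consumer, or a file.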

Metrics to collect

Collect the following metrics for each test iteration and export them to a central monitoring system (Prometheus/Grafana):

  • Ingest: rows/sec, bytes/sec, insert latency (P50/P95/P99), failed inserts
  • Query: request QPS, latency P50/P95/P99, plan time, read rows/bytes
  • Storage I/O: disk throughput (MB/s), IOPS, disk latency (ms), queue depth
  • CPU/Memory: CPU usage per core, system context switches, memory usage, swap events
  • Network: bandwidth, retransmits, % utilization between nodes
  • ClickHouse internals: merges/sec, parts count, background pool wait times, number of active merges
  • Cost: hourly VM + storage + network charges; calculate $/ingest and $/query
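
When reporting the latency percentiles above, compute them from raw samples rather than averages of pre-aggregated buckets, which hide tail spikes. A minimal nearest-rank sketch (sample values are illustrative):

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in 0-100) over raw latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest-rank: ceil(p/100 * n), as a 1-based rank
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]

# e.g., query latencies in ms collected during one steady-state window
latencies_ms = [12, 15, 11, 300, 14, 13, 16, 12, 14, 500]
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```

Note how two outliers dominate P95/P99 while barely moving P50 — exactly the behavior that averaged dashboards smooth away.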

Benchmark procedure: step-by-step

  1. Provision: Automate VM and disk provisioning with Terraform. Use identical OS images and kernel tuning across clouds.
  2. Baseline storage test: Run fio on each disk type to capture raw IOPS and throughput baseline.
  3. Deploy ClickHouse: Use the same ClickHouse version across tests (record version). Configure users, profiles and quotas identically.
  4. Warm-up: Start ingestion at 10% of target and increase over 30–60 minutes until merges stabilize.
  5. Steady-state run: Run target ingest + query patterns for 2–6 hours. Collect all metrics.
  6. Scale variations: Repeat for different batch sizes, replication factors, and shard counts.
  7. Repeatability: Run each scenario 3 times and report median and 95th percentile results.

Interpreting results: what to look for

When you compare clouds and VM types, normalize results in three ways:

  • Performance per vCPU — helps compare differently sized instance families.
  • Performance per dollar — essential to evaluate trade-offs between low-latency NVMe and cheaper networked SSDs.
  • Operational complexity — local NVMe single-node setups give high throughput but complicate scaling and failover; replicated clusters cost more but provide reliability.

Key failure modes to watch for:

  • High write latencies during merges when many small parts exist.
  • IO saturation leading to long tail query latencies (P99 spikes).
  • Network bottlenecks during distributed aggregations and merges.
  • Excessive CPU due to high compression levels causing lower ingestion throughput.

Practical tuning notes (common wins)

  • Prefer larger merge part targets (256MB–1GB) to reduce background merge pressure for write-heavy logs.
  • Use faster compression codecs (LZ4) on hot data, and ZSTD for older partitions to save storage.
  • Pin shards to NUMA nodes where applicable, and test single-threaded read speed: reads of individual parts depend heavily on single-core performance.
  • When using cloud block storage, prefer high IOPS options (EBS io2/io2 Block Express, GCP Extreme PD, Azure Ultra Disk) for write-heavy nodes.
  • Leverage ClickHouse Keeper (lightweight consensus) for replication instead of ZooKeeper for simpler ops on VMs.
  • Monitor merge queue depth and tune the background merge pool (background_pool_size) to avoid saturated disks.

Example quick run: single-node NVMe baseline (AWS i4i)

To establish a single-node baseline, deploy an AWS i4i.8xlarge (example) with local NVMe, install ClickHouse, and run the following simple test:

# Insert generator (pseudo)
cat events.json | clickhouse-client --query='INSERT INTO logs FORMAT JSONEachRow'

# Drive concurrent queries with clickhouse-benchmark (reads queries from stdin)
echo "SELECT service, count() FROM logs WHERE event_time > now() - 3600 GROUP BY service" \
  | clickhouse-benchmark --concurrency 8

Record rows/sec, disk latency, CPU. Repeat with gp3/io2 backed nodes to measure how much local NVMe buys you vs replicated EBS.

Interpreting a hypothetical result set (what decisions to make)

Suppose tests show:

  • Single-node NVMe: 600k inserts/sec, P95 query 120ms, storage footprint 45GB/day
  • EBS io2: 200k inserts/sec, P95 query 350ms, storage footprint 48GB/day
  • 3-node replica cluster on NVMe: 180k inserts/sec sustained with replication, P95 query 200ms

Conclusions:

  • If the goal is raw ingestion throughput (short-lived or ephemeral data) and you can accept single-node risk, NVMe single-node is best.
  • If durability + HA matters, a replicated cluster with NVMe gives balanced throughput with redundancy at higher cost per ingest.
  • Where budget is constrained but HA is required, use networked high-IO block storage and optimize merges/part sizes.

Recent trends in late 2025 and early 2026 you must account for:

  • ClickHouse projects and providers expanded cloud-native integrations and offered managed cloud versions — but managed vs self-hosted cost/performance differs and needs separate benchmarking.
  • Cloud providers launched more granular high-IO storage tiers and cheaper zonal NVMe instances — which shifts the ROI threshold for single-node NVMe.
  • Query acceleration and vectorized execution in ClickHouse continue to improve; measure across ClickHouse minor versions as optimizations can materially change results.

Recommendation: include at least one test per cloud of their latest high-IO instance family and one memory-optimized family, and rerun benchmarks after each ClickHouse minor release or cloud storage change.

Cost normalization and reporting

Report results with a simple set of normalized KPIs:

  • Rows/sec per $/hour
  • Avg query latency per $/hour
  • Storage bytes per row (uncompressed and compressed)
  • 99th percentile tail latency and its $ impact (e.g., SRE time and incident cost)
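
The first two KPIs reduce to simple arithmetic once you have steady-state numbers. A sketch, using the hypothetical single-node NVMe figures from earlier with a placeholder hourly price (not a quoted rate):

```python
def cost_kpis(rows_per_sec, hourly_cost_usd, p99_latency_ms):
    """Normalize steady-state throughput and tail latency by hourly spend."""
    return {
        "rows_per_sec_per_dollar_hour": rows_per_sec / hourly_cost_usd,
        # rows/hour in millions, divided into the hourly cost
        "dollars_per_million_rows": hourly_cost_usd / (rows_per_sec * 3600 / 1e6),
        "p99_ms": p99_latency_ms,
    }

# 600k inserts/sec at an assumed $2.74/hour VM + storage rate
kpis = cost_kpis(rows_per_sec=600_000, hourly_cost_usd=2.74, p99_latency_ms=120)
```

Compute the same dictionary for every cell of the matrix and the cross-cloud comparison becomes a sortable table instead of an argument.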

Include a short lifecycle-cost projection: index rebuilds, cluster upgrades, and snapshot storage costs. These are often overlooked but can double TCO over 12–24 months for observability stores.

Operational checklist before you run

  • Version-lock ClickHouse and record build IDs.
  • Ensure clocks are synced (chrony/ntp) across nodes to avoid timestamp skew.
  • Enable detailed OS-level metrics collection (iostat, vmstat, sar) and eBPF tracing for tail-case investigation.
  • Have an automated teardown to avoid surprises in cloud bills.

Limitations and what this guide does not cover

This guide focuses on VM-based ClickHouse deployments. If you use managed ClickHouse cloud offerings, run a separate set of tests to include provider-managed features such as auto-scaling, backup retention and integrated ingestion pipelines. Also, highly specialized hardware (bare metal racks, proprietary NICs) requires a different bench approach.

Actionable takeaways

  • Design your matrix in terms of roles (ingest-heavy vs query-heavy) and test matching VM families across clouds — this is more valuable than comparing instance names.
  • Always include storage IO baselines (fio) before ClickHouse tests — disk is a dominant factor for logs.
  • Measure cost per throughput and tail latencies, not just peak rows/sec.
  • Run steady-state tests for hours to observe merge behavior and tail latency spikes.
  • Automate everything — infrastructure, test runs, metric collection, and reports.

Next steps and call-to-action

Ready to benchmark? Start with our recommended baseline: one NVMe single-node, one 3-node replicated cluster on NVMe, and one cost-optimized network-disk cluster. Automate the matrix with Terraform modules and a standard test harness (we publish an example repo you can clone and run).

Want our benchmark kit (Terraform + ClickHouse configs + workload generators)? Download the repo, run the baseline, and compare results. Need help interpreting outcomes or running large-scale tests across clouds? Reach out to our benchmarking team at whata.cloud for hands-on assistance.
