StarRocks: High-Performance OLAP Database

StarRocks is a high-performance, distributed OLAP database designed for real-time analytics with sub-second query latency. Its MPP (Massively Parallel Processing) architecture is optimized for analytical workloads with high concurrency.

Overview

StarRocks is an open-source, distributed analytical database that provides:

  • MPP Architecture: Massively parallel processing for high performance
  • High Concurrency: Support for multiple concurrent queries
  • Real-time Ingestion: Stream data processing capabilities
  • SQL Compliance: Full SQL support for easier migration
  • Columnar Storage: Optimized for analytical queries
  • Secondary Indexes: Advanced indexing capabilities

Key Features

🚀 Performance Features

  • MPP Architecture: Distributed query processing across multiple nodes
  • Columnar Storage: Optimized for analytical workloads
  • Sub-second Queries: Designed for fast analytical queries
  • High Concurrency: Support for multiple simultaneous queries
  • Vectorized Execution: SIMD-optimized query execution

🔧 Operational Features

  • Real-time Ingestion: Stream data processing with low latency
  • SQL Compliance: Full ANSI SQL support
  • Secondary Indexes: Advanced indexing for complex queries
  • ACID Transactions: Atomic data loading with transactional guarantees
  • HDFS Integration: Can be integrated with HDFS storage

📊 Data Model Features

  • Star Schema Support: Optimized for data warehouse workloads
  • Materialized Views: Pre-computed aggregations
  • Partitioning: Efficient data partitioning strategies
  • Compression: Advanced data compression algorithms
  • Caching: Multi-level caching for performance

Architecture

Core Components

┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│   Frontend      │   │   Frontend      │   │   Frontend      │
│   (FE) Node     │   │   (FE) Node     │   │   (FE) Node     │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │
                 ┌─────────────┴─────────────┐
                 │       Load Balancer       │
                 └─────────────┬─────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
│    Backend      │   │    Backend      │   │    Backend      │
│   (BE) Node     │   │   (BE) Node     │   │   (BE) Node     │
└─────────────────┘   └─────────────────┘   └─────────────────┘

Component Roles

  • Frontend (FE): Query planning, metadata management, coordination
  • Backend (BE): Data storage, query execution, data ingestion
  • Load Balancer: Request distribution and failover

Installation

Docker Installation

# Pull StarRocks image
docker pull starrocks/starrocks:latest

# Run FE node
docker run -d --name starrocks-fe \
-p 9030:9030 \
-p 8030:8030 \
-p 9020:9020 \
-p 9010:9010 \
-e FE_SERVERS="starrocks-fe:9010" \
starrocks/starrocks:latest fe

# Run BE node
docker run -d --name starrocks-be \
-p 9060:9060 \
-p 8040:8040 \
-e FE_SERVERS="starrocks-fe:9010" \
starrocks/starrocks:latest be

Native Installation

# Download StarRocks
wget https://downloads.starrocks.com/starrocks-2.5.0.tar.gz
tar -xzf starrocks-2.5.0.tar.gz
cd starrocks-2.5.0

# Start FE
cd fe && bin/start_fe.sh --daemon

# Start BE
cd ../be && bin/start_be.sh --daemon

Configuration

FE Configuration (conf/fe.conf)

# Network configuration
priority_networks = 192.168.1.0/24
http_port = 8030
rpc_port = 9020
query_port = 9030
edit_log_port = 9010

# Memory configuration
JVM_XMX = 8g
JVM_XMS = 8g

# Metadata configuration
meta_dir = /opt/starrocks/fe/meta
log_dir = /opt/starrocks/fe/log

# Query configuration
max_conn_per_be = 1024
max_query_timeout = 300

BE Configuration (conf/be.conf)

# Network configuration
priority_networks = 192.168.1.0/24
be_port = 9060
webserver_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060

# Storage configuration
storage_root_path = /opt/starrocks/be/storage
log_dir = /opt/starrocks/be/log

# Memory configuration
mem_limit = 80%

# Query configuration
max_compaction_threads = 10
max_tablet_num_per_shard = 1024

Data Modeling

Database and Table Creation

-- Create database
CREATE DATABASE analytics_db;

-- Use database
USE analytics_db;

-- Create table with partitioning
CREATE TABLE sales_facts (
sale_id BIGINT,
customer_id INT,
product_id INT,
sale_date DATE,
sale_amount DECIMAL(10,2),
region VARCHAR(50)
)
DUPLICATE KEY(sale_id)
PARTITION BY RANGE(sale_date) (
PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
PARTITION p2024 VALUES LESS THAN ('2025-01-01')
)
DISTRIBUTED BY HASH(customer_id) BUCKETS 32
PROPERTIES (
"replication_num" = "3",
"storage_format" = "DEFAULT"
);
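The DISTRIBUTED BY HASH(customer_id) BUCKETS 32 clause assigns each row to one of 32 tablets by hashing the distribution key. The sketch below illustrates the idea only; StarRocks uses its own internal hash function, so real tablet assignments will differ:

```python
# Illustrative only: StarRocks uses its own internal hash function,
# so actual tablet assignments will not match this sketch.
NUM_BUCKETS = 32  # matches BUCKETS 32 in the CREATE TABLE above

def bucket_for(customer_id: int) -> int:
    """Map a distribution-key value to one of NUM_BUCKETS tablets."""
    return hash(customer_id) % NUM_BUCKETS

# Rows sharing a customer_id always land in the same bucket, which keeps
# per-customer aggregations local to a single tablet.
assert bucket_for(1001) == bucket_for(1001)
```

Because rows for one customer colocate in one bucket, GROUP BY customer_id queries avoid cross-node shuffles.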

Materialized Views

-- Create materialized view for aggregations
CREATE MATERIALIZED VIEW sales_summary AS
SELECT
region,
DATE_TRUNC('month', sale_date) as month,
COUNT(*) as total_sales,
SUM(sale_amount) as total_revenue
FROM sales_facts
GROUP BY region, DATE_TRUNC('month', sale_date);

-- Create materialized view for customer analytics
CREATE MATERIALIZED VIEW customer_analytics AS
SELECT
customer_id,
COUNT(*) as purchase_count,
SUM(sale_amount) as total_spent,
AVG(sale_amount) as avg_purchase
FROM sales_facts
GROUP BY customer_id;
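The customer_analytics view precomputes a count, sum, and average per customer so queries read the small aggregate instead of scanning the fact table. A plain-Python sketch of the same aggregation, using hypothetical sample rows:

```python
from collections import defaultdict

# Hypothetical sample rows: (customer_id, sale_amount)
sales = [(1, 100.0), (1, 50.0), (2, 200.0)]

totals = defaultdict(lambda: [0, 0.0])  # customer_id -> [count, sum]
for customer_id, amount in sales:
    totals[customer_id][0] += 1
    totals[customer_id][1] += amount

# Same three measures the materialized view maintains incrementally
analytics = {
    cid: {"purchase_count": n, "total_spent": s, "avg_purchase": s / n}
    for cid, (n, s) in totals.items()
}
# analytics[1] -> {"purchase_count": 2, "total_spent": 150.0, "avg_purchase": 75.0}
```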

SQL Operations

Basic Queries

-- Simple SELECT
SELECT * FROM sales_facts WHERE sale_date >= '2024-01-01';

-- Aggregation queries
SELECT
region,
COUNT(*) as sales_count,
SUM(sale_amount) as total_revenue
FROM sales_facts
WHERE sale_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;

-- Window functions
SELECT
customer_id,
sale_date,
sale_amount,
SUM(sale_amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS UNBOUNDED PRECEDING
) as running_total
FROM sales_facts
ORDER BY customer_id, sale_date;
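The window function above computes a per-customer running total: partition by customer_id, order by sale_date, then sum all rows up to and including the current one. The same logic in plain Python, on hypothetical pre-sorted rows:

```python
import itertools

# Hypothetical rows (customer_id, sale_date, sale_amount),
# already sorted by customer_id, then sale_date
rows = [
    (1, "2024-01-01", 100.0),
    (1, "2024-01-02", 50.0),
    (2, "2024-01-01", 200.0),
]

result = []
for cid, group in itertools.groupby(rows, key=lambda r: r[0]):  # PARTITION BY customer_id
    running = 0.0
    for _, date, amount in group:        # ORDER BY sale_date within the partition
        running += amount                # ROWS UNBOUNDED PRECEDING through current row
        result.append((cid, date, amount, running))
# result -> [(1, '2024-01-01', 100.0, 100.0),
#            (1, '2024-01-02', 50.0, 150.0),
#            (2, '2024-01-01', 200.0, 200.0)]
```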

Advanced Analytics

-- Time-series analysis
SELECT
DATE_TRUNC('day', sale_date) as day,
COUNT(*) as daily_sales,
AVG(sale_amount) as avg_sale,
STDDEV(sale_amount) as sale_stddev
FROM sales_facts
WHERE sale_date >= '2024-01-01'
GROUP BY DATE_TRUNC('day', sale_date)
ORDER BY day;

-- Customer segmentation
SELECT
customer_id,
CASE
WHEN total_spent >= 10000 THEN 'Premium'
WHEN total_spent >= 5000 THEN 'Gold'
WHEN total_spent >= 1000 THEN 'Silver'
ELSE 'Bronze'
END as customer_tier
FROM (
SELECT
customer_id,
SUM(sale_amount) as total_spent
FROM sales_facts
GROUP BY customer_id
) customer_totals;
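The CASE expression walks the thresholds top-down, so a customer matches the first tier whose floor they meet. The equivalent tiering function in Python, using the same thresholds:

```python
def customer_tier(total_spent: float) -> str:
    """Same top-down threshold logic as the CASE expression above."""
    if total_spent >= 10000:
        return "Premium"
    if total_spent >= 5000:
        return "Gold"
    if total_spent >= 1000:
        return "Silver"
    return "Bronze"

assert customer_tier(12000) == "Premium"
assert customer_tier(999) == "Bronze"
```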

Data Ingestion

Broker Load

-- Load a file from HDFS via a broker
LOAD LABEL analytics_db.label_20240101
(
DATA INFILE('hdfs://namenode:9000/data/sales_20240101.csv')
INTO TABLE sales_facts
COLUMNS TERMINATED BY ','
(sale_id, customer_id, product_id, sale_date, sale_amount, region)
)
WITH BROKER 'broker_name'
PROPERTIES (
'timeout' = '3600',
'max_filter_ratio' = '0.1'
);

Routine Load

-- Create routine load for continuous ingestion
CREATE ROUTINE LOAD analytics_db.sales_stream_load
ON sales_facts
COLUMNS(sale_id, customer_id, product_id, sale_date, sale_amount, region)
PROPERTIES (
'desired_concurrent_number' = '3',
'max_batch_interval' = '20',
'max_batch_rows' = '300000',
'max_batch_size' = '209715200'
)
FROM KAFKA (
'kafka_broker_list' = 'kafka1:9092,kafka2:9092',
'kafka_topic' = 'sales_topic',
'kafka_partitions' = '0,1,2',
'kafka_offsets' = 'OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_BEGINNING'
);
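Each routine-load batch is capped by whichever limit is hit first: 300,000 rows or 209715200 bytes (200 MiB). A quick calculation of the implied per-row budget, useful when sizing these two properties against your average row width:

```python
# Values from the routine load PROPERTIES above
max_batch_rows = 300_000
max_batch_size = 209_715_200  # bytes

# 209715200 bytes is exactly 200 MiB
assert max_batch_size == 200 * 1024 * 1024

# Average row size at which the byte cap and the row cap bind simultaneously;
# rows wider than this make the size limit the effective cap.
bytes_per_row_budget = max_batch_size / max_batch_rows
print(f"{bytes_per_row_budget:.0f} bytes/row")
```

If your rows are much smaller than this budget, the row cap governs batch size; if larger, consider raising max_batch_size or lowering max_batch_rows.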

Performance Optimization

Query Optimization

-- Use appropriate partition pruning
SELECT * FROM sales_facts
WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';

-- Use materialized views
SELECT region, SUM(sale_amount) as total_revenue
FROM sales_summary
WHERE month >= '2024-01-01'
GROUP BY region;

-- Use proper distribution keys
SELECT customer_id, COUNT(*) as purchase_count
FROM sales_facts
WHERE customer_id IN (1001, 1002, 1003)
GROUP BY customer_id;
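Partition pruning works because each RANGE partition owns a half-open interval of sale_date values; a filter that overlaps only one interval scans only that partition. A sketch of the pruning decision for the two partitions defined in the CREATE TABLE above:

```python
import datetime as dt

# Upper bounds from the CREATE TABLE (VALUES LESS THAN ...); each partition
# covers [previous bound, its own bound).
partitions = {
    "p2023": dt.date(2024, 1, 1),
    "p2024": dt.date(2025, 1, 1),
}

def pruned_partitions(lo: dt.date, hi: dt.date) -> list:
    """Return partitions whose range overlaps the filter interval [lo, hi)."""
    selected, prev = [], dt.date.min
    for name, bound in partitions.items():
        if lo < bound and hi > prev:  # intervals overlap
            selected.append(name)
        prev = bound
    return selected

# The January-2024 filter from the query above touches only p2024:
assert pruned_partitions(dt.date(2024, 1, 1), dt.date(2024, 2, 1)) == ["p2024"]
```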

Indexing Strategies

-- Create bitmap index for low-cardinality columns
CREATE INDEX idx_region ON sales_facts (region) USING BITMAP;

-- Create bloom filter for high-cardinality columns
CREATE INDEX idx_customer_id ON sales_facts (customer_id) USING BLOOM_FILTER;

-- Create inverted index for text search
CREATE INDEX idx_product_name ON products (product_name) USING INVERTED;

Monitoring and Maintenance

System Queries

-- Check cluster status
SHOW PROC '/backends';

-- Monitor query performance
SHOW PROC '/queries';

-- Check table statistics
SHOW PROC '/statistic';

-- Monitor resource usage
SHOW PROC '/resources';

Health Checks

# Check FE status
curl -u root: http://localhost:8030/api/show_proc?path=/backends

# Check BE status
curl -u root: http://localhost:8040/api/health

# Check cluster health
mysql -h localhost -P 9030 -u root -p -e "SHOW PROC '/backends';"

Integration Examples

Python Integration

import pymysql
import pandas as pd

# Connect to StarRocks
connection = pymysql.connect(
host='localhost',
port=9030,
user='root',
password='password',
database='analytics_db'
)

# Execute query
query = """
SELECT region, SUM(sale_amount) as total_revenue
FROM sales_facts
WHERE sale_date >= '2024-01-01'
GROUP BY region
"""

df = pd.read_sql(query, connection)
print(df)

connection.close()

Java Integration

import java.sql.*;

public class StarRocksExample {
public static void main(String[] args) {
String url = "jdbc:mysql://localhost:9030/analytics_db";
String user = "root";
String password = "password";

try (Connection conn = DriverManager.getConnection(url, user, password)) {
String sql = "SELECT region, SUM(sale_amount) as total_revenue " +
"FROM sales_facts " +
"WHERE sale_date >= '2024-01-01' " +
"GROUP BY region";

try (Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(sql)) {

while (rs.next()) {
String region = rs.getString("region");
double revenue = rs.getDouble("total_revenue");
System.out.println(region + ": " + revenue);
}
}
} catch (SQLException e) {
e.printStackTrace();
}
}
}

Best Practices

Data Modeling

  1. Choose Appropriate Distribution Keys: Use columns with good cardinality
  2. Use Partitioning: Partition by time or other frequently filtered columns
  3. Create Materialized Views: Pre-compute common aggregations
  4. Optimize Data Types: Use appropriate data types for storage efficiency

Performance

  1. Use Prepared Statements: Avoid query parsing overhead
  2. Monitor Query Plans: Use EXPLAIN to understand query execution
  3. Use Appropriate Indexes: Create indexes based on query patterns
  4. Batch Operations: Group related operations for better performance

Operations

  1. Regular Maintenance: Monitor and optimize table statistics
  2. Backup Strategy: Implement regular backup procedures
  3. Resource Monitoring: Track CPU, memory, and disk usage
  4. Query Optimization: Regularly review and optimize slow queries

Troubleshooting

Common Issues

  1. Connection Issues

    # Check if StarRocks is running
    ps aux | grep starrocks

    # Check network connectivity
    telnet localhost 9030
  2. Performance Issues

    -- Check slow queries
    SHOW PROC '/queries';

    -- Check resource usage
    SHOW PROC '/resources';
  3. Data Consistency Issues

    -- Check table health
    SHOW PROC '/statistic';

    -- Repair a table's tablet replicas
    ADMIN REPAIR TABLE sales_facts;

Resources and References

Official Resources

  • StarRocks Manager: Cluster management tool
  • StarRocks Studio: Web-based management interface
  • StarRocks Connectors: Various data source connectors

Learning Resources

  • SQL Reference: Complete SQL documentation
  • Performance Tuning Guide: Optimization best practices
  • Architecture Guide: Deep dive into StarRocks architecture