StarRocks: High-Performance OLAP Database
StarRocks is a high-performance, distributed OLAP database designed for real-time analytics and sub-second query performance. It features an MPP (Massively Parallel Processing) architecture optimized for analytical workloads with high concurrency support.
Overview
StarRocks is an open-source, distributed analytical database that provides:
- MPP Architecture: Massively parallel processing for high performance
- High Concurrency: Support for multiple concurrent queries
- Real-time Ingestion: Stream data processing capabilities
- SQL Compliance: Standard SQL support with MySQL protocol compatibility for easier migration
- Columnar Storage: Optimized for analytical queries
- Secondary Indexes: Advanced indexing capabilities
Key Features
🚀 Performance Features
- MPP Architecture: Distributed query processing across multiple nodes
- Columnar Storage: Optimized for analytical workloads
- Sub-second Queries: Typical analytical queries return in well under a second
- High Concurrency: Designed to serve many simultaneous queries
- Vectorized Execution: SIMD-optimized query execution
🔧 Operational Features
- Real-time Ingestion: Stream data processing with low latency
- SQL Compliance: Broad ANSI SQL support over the MySQL wire protocol
- Secondary Indexes: Advanced indexing for complex queries
- ACID Transactions: Atomic load jobs; a batch either fully succeeds or is fully rolled back
- HDFS Integration: Queries and loads data from HDFS and other external storage
📊 Data Model Features
- Star Schema Support: Optimized for data warehouse workloads
- Materialized Views: Pre-computed aggregations
- Partitioning: Efficient data partitioning strategies
- Compression: Advanced data compression algorithms
- Caching: Multi-level caching for performance
Architecture
Core Components
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Frontend   │   │  Frontend   │   │  Frontend   │
│  (FE) Node  │   │  (FE) Node  │   │  (FE) Node  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
              ┌──────────┴──────────┐
              │    Load Balancer    │
              └──────────┬──────────┘
                         │
       ┌─────────────────┼─────────────────┐
       │                 │                 │
┌──────┴──────┐   ┌──────┴──────┐   ┌──────┴──────┐
│   Backend   │   │   Backend   │   │   Backend   │
│  (BE) Node  │   │  (BE) Node  │   │  (BE) Node  │
└─────────────┘   └─────────────┘   └─────────────┘
Component Roles
- Frontend (FE): Query planning, metadata management, coordination
- Backend (BE): Data storage, query execution, data ingestion
- Load Balancer: Request distribution and failover
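New nodes join the cluster through SQL statements issued on any FE. A minimal sketch, assuming placeholder host names and the default ports from the configuration section below:
-- Register a BE with the cluster (port = heartbeat_service_port)
ALTER SYSTEM ADD BACKEND "be-host-1:9050";
-- Add a follower FE for metadata high availability (port = edit_log_port)
ALTER SYSTEM ADD FOLLOWER "fe-host-2:9010";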
Installation
Docker Installation
# Pull StarRocks image (image names and entrypoints vary by release;
# check the official starrocks repositories on Docker Hub for current tags)
docker pull starrocks/starrocks:latest
# Run FE node
docker run -d --name starrocks-fe \
-p 9030:9030 \
-p 8030:8030 \
-p 9020:9020 \
-p 9010:9010 \
-e FE_SERVERS="starrocks-fe:9010" \
starrocks/starrocks:latest fe
# Run BE node
docker run -d --name starrocks-be \
-p 9060:9060 \
-p 8040:8040 \
-e FE_SERVERS="starrocks-fe:9010" \
starrocks/starrocks:latest be
Native Installation
# Download StarRocks
wget https://downloads.starrocks.com/starrocks-2.5.0.tar.gz
tar -xzf starrocks-2.5.0.tar.gz
cd starrocks-2.5.0
# Start FE
cd fe && bin/start_fe.sh --daemon
# Start BE
cd ../be && bin/start_be.sh --daemon
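Once both processes are up, you can verify the deployment from any MySQL-compatible client connected to the FE query port (9030); the Alive column should read true for each node:
-- Confirm FE and BE registration (run from a MySQL client)
SHOW FRONTENDS;
SHOW BACKENDS;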
Configuration
FE Configuration (conf/fe.conf)
# Network configuration
priority_networks = 192.168.1.0/24
http_port = 8030
rpc_port = 9020
query_port = 9030
edit_log_port = 9010
# Memory configuration
# The FE JVM heap is set through JAVA_OPTS
JAVA_OPTS = "-Xms8g -Xmx8g"
# Metadata configuration
meta_dir = /opt/starrocks/fe/meta
log_dir = /opt/starrocks/fe/log
# Query configuration
max_conn_per_be = 1024
max_query_timeout = 300
BE Configuration (conf/be.conf)
# Network configuration
priority_networks = 192.168.1.0/24
be_port = 9060
webserver_port = 8040
heartbeat_service_port = 9050
brpc_port = 8060
# Storage configuration
storage_root_path = /opt/starrocks/be/storage
log_dir = /opt/starrocks/be/log
# Memory configuration
mem_limit = 80%
# Query configuration
max_compaction_threads = 10
max_tablet_num_per_shard = 1024
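Most FE parameters can be inspected, and many adjusted, at runtime without a restart. A hedged sketch using the admin statements (the LIKE pattern and value are illustrative; runtime changes are not persisted unless also written to fe.conf):
-- Inspect effective FE configuration
ADMIN SHOW FRONTEND CONFIG LIKE '%timeout%';
-- Adjust a mutable parameter at runtime
ADMIN SET FRONTEND CONFIG ("max_query_timeout" = "600");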
Data Modeling
Database and Table Creation
-- Create database
CREATE DATABASE analytics_db;
-- Use database
USE analytics_db;
-- Create table with partitioning
CREATE TABLE sales_facts (
sale_id BIGINT,
customer_id INT,
product_id INT,
sale_date DATE,
sale_amount DECIMAL(10,2),
region VARCHAR(50)
)
DUPLICATE KEY(sale_id)
PARTITION BY RANGE(sale_date) (
PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
PARTITION p2024 VALUES LESS THAN ('2025-01-01')
)
DISTRIBUTED BY HASH(customer_id) BUCKETS 32
PROPERTIES (
"replication_num" = "3",
"storage_format" = "DEFAULT"
);
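Instead of pre-declaring yearly partitions, time-based partitions can also be created automatically with dynamic partition properties. A sketch with illustrative values (keep 30 days of history, pre-create 3 days ahead):
-- Enable dynamic daily partitioning on the table above
ALTER TABLE sales_facts SET (
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-30",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p"
);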
Materialized Views
-- Create materialized view for aggregations
CREATE MATERIALIZED VIEW sales_summary AS
SELECT
region,
DATE_TRUNC('month', sale_date) as month,
COUNT(*) as total_sales,
SUM(sale_amount) as total_revenue
FROM sales_facts
GROUP BY region, DATE_TRUNC('month', sale_date);
-- Create materialized view for customer analytics
CREATE MATERIALIZED VIEW customer_analytics AS
SELECT
customer_id,
COUNT(*) as purchase_count,
SUM(sale_amount) as total_spent,
AVG(sale_amount) as avg_purchase
FROM sales_facts
GROUP BY customer_id;
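Synchronous materialized views like these are applied transparently: queries keep targeting the base table, and the optimizer rewrites matching aggregations. EXPLAIN can confirm whether a query actually hits the view (look for the view name in the plan output):
-- Verify that the optimizer rewrites this aggregation to sales_summary
EXPLAIN
SELECT region, DATE_TRUNC('month', sale_date) as month, SUM(sale_amount)
FROM sales_facts
GROUP BY region, DATE_TRUNC('month', sale_date);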
SQL Operations
Basic Queries
-- Simple SELECT
SELECT * FROM sales_facts WHERE sale_date >= '2024-01-01';
-- Aggregation queries
SELECT
region,
COUNT(*) as sales_count,
SUM(sale_amount) as total_revenue
FROM sales_facts
WHERE sale_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC;
-- Window functions
SELECT
customer_id,
sale_date,
sale_amount,
SUM(sale_amount) OVER (
PARTITION BY customer_id
ORDER BY sale_date
ROWS UNBOUNDED PRECEDING
) as running_total
FROM sales_facts
ORDER BY customer_id, sale_date;
Advanced Analytics
-- Time-series analysis
SELECT
DATE_TRUNC('day', sale_date) as day,
COUNT(*) as daily_sales,
AVG(sale_amount) as avg_sale,
STDDEV(sale_amount) as sale_stddev
FROM sales_facts
WHERE sale_date >= '2024-01-01'
GROUP BY DATE_TRUNC('day', sale_date)
ORDER BY day;
-- Customer segmentation
SELECT
customer_id,
CASE
WHEN total_spent >= 10000 THEN 'Premium'
WHEN total_spent >= 5000 THEN 'Gold'
WHEN total_spent >= 1000 THEN 'Silver'
ELSE 'Bronze'
END as customer_tier
FROM (
SELECT
customer_id,
SUM(sale_amount) as total_spent
FROM sales_facts
GROUP BY customer_id
) customer_totals;
Data Ingestion
Broker Load
-- Batch-load a file from HDFS through a broker process
LOAD LABEL analytics_db.label_20240101
(
DATA INFILE('hdfs://namenode:9000/data/sales_20240101.csv')
INTO TABLE sales_facts
COLUMNS TERMINATED BY ','
(sale_id, customer_id, product_id, sale_date, sale_amount, region)
)
WITH BROKER 'broker_name'
PROPERTIES (
'timeout' = '3600',
'max_filter_ratio' = '0.1'
);
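Broker Load runs asynchronously, and the label is the handle for tracking it. A quick status check using the label from the example above:
-- Check the state of the asynchronous load job
SHOW LOAD WHERE LABEL = 'label_20240101';
-- Cancel the job if it is stuck
CANCEL LOAD WHERE LABEL = 'label_20240101';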
Routine Load
-- Create routine load for continuous ingestion
CREATE ROUTINE LOAD analytics_db.sales_stream_load
ON sales_facts
COLUMNS(sale_id, customer_id, product_id, sale_date, sale_amount, region)
PROPERTIES (
'desired_concurrent_number' = '3',
'max_batch_interval' = '20',
'max_batch_rows' = '300000',
'max_batch_size' = '209715200'
)
FROM KAFKA (
'kafka_broker_list' = 'kafka1:9092,kafka2:9092',
'kafka_topic' = 'sales_topic',
'kafka_partitions' = '0,1,2',
'kafka_offsets' = 'OFFSET_BEGINNING,OFFSET_BEGINNING,OFFSET_BEGINNING'
);
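Routine Load jobs run continuously and have their own management statements; for the job created above:
-- Inspect job state, consumption progress, and error rows
SHOW ROUTINE LOAD FOR analytics_db.sales_stream_load;
-- Pause during maintenance, then resume
PAUSE ROUTINE LOAD FOR analytics_db.sales_stream_load;
RESUME ROUTINE LOAD FOR analytics_db.sales_stream_load;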
Performance Optimization
Query Optimization
-- Use appropriate partition pruning
SELECT * FROM sales_facts
WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';
-- Use materialized views (synchronous MVs are applied transparently:
-- query the base table and the optimizer rewrites to sales_summary)
SELECT region, SUM(sale_amount) as total_revenue
FROM sales_facts
WHERE sale_date >= '2024-01-01'
GROUP BY region;
-- Use proper distribution keys
SELECT customer_id, COUNT(*) as purchase_count
FROM sales_facts
WHERE customer_id IN (1001, 1002, 1003)
GROUP BY customer_id;
Indexing Strategies
-- Create bitmap index for low-cardinality columns
CREATE INDEX idx_region ON sales_facts (region) USING BITMAP;
-- Enable a Bloom filter on a high-cardinality column
-- (Bloom filters are a table property, not a CREATE INDEX statement)
ALTER TABLE sales_facts SET ("bloom_filter_columns" = "customer_id");
-- Create a full-text index for text search (GIN indexes; availability
-- and syntax vary by StarRocks version)
CREATE INDEX idx_product_name ON products (product_name) USING GIN("parser" = "english");
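Indexes only pay off when they match real query patterns, so review them periodically; a sketch using the tables above:
-- List the indexes defined on a table
SHOW INDEX FROM sales_facts;
-- Drop an index that no longer matches the workload
DROP INDEX idx_region ON sales_facts;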
Monitoring and Maintenance
System Queries
-- Check cluster status
SHOW PROC '/backends';
-- Monitor running queries
SHOW PROC '/current_queries';
-- Check table statistics
SHOW PROC '/statistic';
-- Monitor resource usage
SHOW PROC '/resources';
Health Checks
# Check FE status
curl -u root: "http://localhost:8030/api/show_proc?path=/backends"
# Check BE status
curl -u root: http://localhost:8040/api/health
# Check cluster health
mysql -h localhost -P 9030 -u root -p -e "SHOW PROC '/backends';"
Integration Examples
Python Integration
import pymysql
import pandas as pd
# Connect to StarRocks
connection = pymysql.connect(
host='localhost',
port=9030,
user='root',
password='password',
database='analytics_db'
)
# Execute query
query = """
SELECT region, SUM(sale_amount) as total_revenue
FROM sales_facts
WHERE sale_date >= '2024-01-01'
GROUP BY region
"""
df = pd.read_sql(query, connection)
print(df)
connection.close()
Java Integration
import java.sql.*;
public class StarRocksExample {
public static void main(String[] args) {
String url = "jdbc:mysql://localhost:9030/analytics_db";
String user = "root";
String password = "password";
try (Connection conn = DriverManager.getConnection(url, user, password)) {
String sql = "SELECT region, SUM(sale_amount) as total_revenue " +
"FROM sales_facts " +
"WHERE sale_date >= '2024-01-01' " +
"GROUP BY region";
try (Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(sql)) {
while (rs.next()) {
String region = rs.getString("region");
double revenue = rs.getDouble("total_revenue");
System.out.println(region + ": " + revenue);
}
}
} catch (SQLException e) {
e.printStackTrace();
}
}
}
Best Practices
Data Modeling
- Choose Appropriate Distribution Keys: Use high-cardinality columns with even value distribution to avoid data skew
- Use Partitioning: Partition by time or other frequently filtered columns
- Create Materialized Views: Pre-compute common aggregations
- Optimize Data Types: Use appropriate data types for storage efficiency
Performance
- Use Prepared Statements: Avoid query parsing overhead
- Monitor Query Plans: Use EXPLAIN to understand query execution
- Use Appropriate Indexes: Create indexes based on query patterns
- Batch Operations: Group related operations for better performance
Operations
- Regular Maintenance: Monitor and optimize table statistics
- Backup Strategy: Implement regular backup procedures (see the sketch after this list)
- Resource Monitoring: Track CPU, memory, and disk usage
- Query Optimization: Regularly review and optimize slow queries
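As a sketch of the backup bullet above, StarRocks takes snapshot-based backups into a registered repository; the repository name, broker, location, and credentials below are placeholders:
-- One-time setup: register a repository on shared storage
CREATE REPOSITORY sales_repo
WITH BROKER 'broker_name'
ON LOCATION 'hdfs://namenode:9000/backups/starrocks'
PROPERTIES ('username' = 'hdfs_user', 'password' = 'hdfs_pass');
-- Take a snapshot of a single table
BACKUP SNAPSHOT analytics_db.sales_backup_20240101
TO sales_repo
ON (sales_facts);
-- Track progress
SHOW BACKUP FROM analytics_db;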
Troubleshooting
Common Issues
Connection Issues
# Check if StarRocks processes are running
ps aux | grep starrocks
# Check network connectivity to the FE query port
telnet localhost 9030
Performance Issues
-- Check running queries
SHOW PROC '/current_queries';
-- Check resource usage
SHOW PROC '/resources';
Data Consistency Issues
-- Check table health
SHOW PROC '/statistic';
-- Repair table replicas
ADMIN REPAIR TABLE sales_facts;
Resources and References
Official Resources
- Documentation: docs.starrocks.io
- GitHub: github.com/StarRocks/starrocks
- Community: community.starrocks.io
Related Tools
- StarRocks Manager: Cluster management tool
- StarRocks Studio: Web-based management interface
- StarRocks Connectors: Various data source connectors
Learning Resources
- SQL Reference: Complete SQL documentation
- Performance Tuning Guide: Optimization best practices
- Architecture Guide: Deep dive into StarRocks architecture