ScyllaDB: High-Performance NoSQL Database

ScyllaDB is a high-performance, distributed NoSQL database that's compatible with Apache Cassandra and Amazon DynamoDB. It's designed to provide ultra-low latency and high throughput for modern applications that require massive scale.

Overview

ScyllaDB is a drop-in replacement for Apache Cassandra. Its C++ implementation and shard-per-core, shared-nothing architecture avoid JVM overhead and cross-core locking, which is where most of its performance advantage comes from. It's designed for applications that need:

  • Ultra-low latency: Sub-millisecond response times
  • High throughput: Millions of operations per second
  • Linear scalability: Add nodes to increase capacity
  • Fault tolerance: Built-in replication and consistency
  • Cassandra compatibility: Drop-in replacement for existing Cassandra applications

Key Features

🚀 Performance Features

  • C++ Implementation: Native performance without JVM overhead
  • Shared-Nothing Architecture: Each CPU core (shard) owns its own memory and slice of the data, with no resources shared between cores
  • Async I/O: Non-blocking operations for maximum concurrency
  • Memory-First Design: Optimized for modern hardware
  • Smart Caching: Intelligent data caching strategies

🔧 Operational Features

  • Cassandra Compatibility: Drop-in replacement for existing applications
  • DynamoDB Compatibility: DynamoDB-compatible API (Alternator), also available in ScyllaDB Cloud
  • Multi-DC Support: Geographic distribution and disaster recovery
  • Backup & Restore: Point-in-time recovery capabilities
  • Monitoring: Built-in metrics and observability

📊 Data Model Features

  • Wide-Column Store: Flexible schema design
  • Time-Series Support: Optimized for time-based data
  • JSON Support: INSERT JSON and SELECT JSON statements for writing and reading rows as JSON
  • Counter Support: Distributed counters
  • TTL Support: Automatic data expiration

Architecture

Core Components

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Client Node   │    │   Client Node   │    │   Client Node   │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                  ┌─────────────┴─────────────┐
                  │       Load Balancer       │
                  └─────────────┬─────────────┘
                                │
         ┌──────────────────────┼──────────────────────┐
         │                      │                      │
┌────────▼────────┐    ┌────────▼────────┐    ┌────────▼────────┐
│  ScyllaDB Node  │    │  ScyllaDB Node  │    │  ScyllaDB Node  │
│  (Data Center)  │    │  (Data Center)  │    │  (Data Center)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Data Distribution

  • Partitioning: Consistent hashing for data distribution
  • Replication: Configurable replication factor per keyspace
  • Consistency Levels: Tunable consistency for CAP theorem trade-offs
  • Snitch: Network topology awareness for optimal routing
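
As a rough illustration of how partitioning and replication interact, the sketch below builds a toy token ring. It is not ScyllaDB's implementation: MD5 stands in for the Murmur3 partitioner, there is no rack or data center awareness, and the `TokenRing` class, node names, and vnode count are made up for the example.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a 64-bit token (MD5 is a stand-in for Murmur3)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Toy consistent-hash ring in which each node owns several tokens."""

    def __init__(self, nodes, vnodes_per_node=8):
        self._ring = sorted(
            (token_for(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._tokens = [token for token, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, key: str, replication_factor: int):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        wanted = min(replication_factor, self._node_count)
        start = bisect.bisect_left(self._tokens, token_for(key))
        owners = []
        for offset in range(len(self._ring)):
            _, node = self._ring[(start + offset) % len(self._ring)]
            if node not in owners:
                owners.append(node)
                if len(owners) == wanted:
                    break
        return owners


ring = TokenRing(["node-a", "node-b", "node-c"])
print(ring.replicas("sensor-42", replication_factor=2))
```

The same key always maps to the same replicas, and adding a node only changes ownership of the token ranges that the new node takes over.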

Installation

Docker Installation

# Pull ScyllaDB image
docker pull scylladb/scylla:latest

# Run single-node cluster
docker run --name scylla-node \
-p 9042:9042 \
-p 7000:7000 \
-p 7001:7001 \
-p 9160:9160 \
-p 9180:9180 \
-p 10000:10000 \
scylladb/scylla:latest \
--smp 1 --memory 750M --overprovisioned 1

Multi-Node Cluster

# Create a user-defined network so the containers can resolve each other by name
docker network create scylla-net

# Node 1 (seed)
docker run -d --name scylla-node1 --network scylla-net \
  -p 9042:9042 \
  -p 7000:7000 \
  -p 7001:7001 \
  -p 9160:9160 \
  -p 9180:9180 \
  -p 10000:10000 \
  scylladb/scylla:latest \
  --seeds scylla-node1 --smp 1 --memory 750M --overprovisioned 1

# Node 2 (joins through node 1; host ports are shifted to avoid conflicts)
docker run -d --name scylla-node2 --network scylla-net \
  -p 9043:9042 \
  -p 7002:7000 \
  -p 7003:7001 \
  -p 9161:9160 \
  -p 9181:9180 \
  -p 10001:10000 \
  scylladb/scylla:latest \
  --seeds scylla-node1 --smp 1 --memory 750M --overprovisioned 1

Native Installation (Ubuntu/Debian)

# Add ScyllaDB repository
curl -fsSL https://downloads.scylladb.com/deb/ubuntu/scylladb-2023.1.key | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/scylladb-2023.1.gpg
echo "deb [arch=amd64] https://downloads.scylladb.com/deb/ubuntu jammy scylladb-2023.1" | sudo tee /etc/apt/sources.list.d/scylladb.list

# Install ScyllaDB
sudo apt update
sudo apt install scylla

# Configure and start
sudo scylla_setup
sudo systemctl start scylla-server
sudo systemctl enable scylla-server

Configuration

Basic Configuration (/etc/scylla/scylla.yaml)

# Cluster configuration
cluster_name: "MyCluster"
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.168.1.10,192.168.1.11,192.168.1.12"

# Network configuration
listen_address: 192.168.1.10
rpc_address: 192.168.1.10
broadcast_address: 192.168.1.10
broadcast_rpc_address: 192.168.1.10

# Performance tuning
num_tokens: 256
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
data_file_directories:
- /var/lib/scylla/data
commitlog_directory: /var/lib/scylla/commitlog
saved_caches_directory: /var/lib/scylla/saved_caches

# Memory configuration
memtable_total_space_in_mb: 2048
memtable_flush_writers: 2

# Compaction configuration
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: false

# Security
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer

Advanced Configuration

# Performance optimizations
# (cassandra.yaml-style settings; ScyllaDB tunes most of these automatically and may ignore them)
concurrent_reads: 32
concurrent_writes: 32
concurrent_counter_writes: 32
concurrent_materialized_view_writes: 32

# Memory management
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048

# Caching
key_cache_size_in_mb: 100
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0

# Logging
# ScyllaDB has no JVM and does not use logback; logs go to syslog/journald (see Log Analysis)

Data Modeling

Keyspace Creation

-- Create keyspace with replication strategy
CREATE KEYSPACE my_keyspace
WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3,
'datacenter2': 2
};

-- Use the keyspace
USE my_keyspace;
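
When keyspaces are created from deployment scripts, the statement above can be generated from a mapping of data center name to replication factor. The helper below is a minimal sketch under that assumption; `create_keyspace_cql` is a hypothetical name, not part of any driver, and it only builds the CQL string without executing it.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def create_keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid keyspace name: {name!r}")
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in replicas_per_dc.items():
        # Reject quotes in data center names and non-positive factors.
        if "'" in dc or not isinstance(rf, int) or rf < 1:
            raise ValueError(f"invalid replication setting: {dc!r}={rf!r}")
        options.append(f"'{dc}': {rf}")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{{', '.join(options)}}};"
    )


print(create_keyspace_cql("my_keyspace", {"datacenter1": 3, "datacenter2": 2}))
```

The data center names must match what the snitch reports for the cluster, otherwise replicas will not be placed where expected.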

Table Design

-- User profiles table
CREATE TABLE users (
user_id uuid PRIMARY KEY,
username text,
email text,
first_name text,
last_name text,
created_at timestamp,
updated_at timestamp
);

-- User sessions with clustering key
CREATE TABLE user_sessions (
user_id uuid,
session_id uuid,
login_time timestamp,
logout_time timestamp,
ip_address inet,
user_agent text,
PRIMARY KEY (user_id, session_id)
);

-- Time-series data
CREATE TABLE sensor_readings (
sensor_id text,
timestamp timestamp,
temperature double,
humidity double,
pressure double,
PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

Secondary Indexes

-- Create secondary index
CREATE INDEX ON users (email);
CREATE INDEX ON users (username);

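-- ScyllaDB does not support SASI custom indexes. A local secondary index,
-- scoped to the base table's partition key, is a ScyllaDB-specific alternative
CREATE INDEX user_sessions_ip_idx ON user_sessions ((user_id), ip_address);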

CQL (Cassandra Query Language)

Basic Operations

-- Insert data
INSERT INTO users (user_id, username, email, first_name, last_name, created_at)
VALUES (uuid(), 'john_doe', 'john@example.com', 'John', 'Doe', toTimestamp(now()));

-- Select data
SELECT * FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Update data
UPDATE users
SET email = 'john.doe@example.com', updated_at = toTimestamp(now())
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Delete data
DELETE FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

Batch Operations

-- Batch insert
BEGIN BATCH
INSERT INTO users (user_id, username, email) VALUES (uuid(), 'user1', 'user1@example.com');
INSERT INTO users (user_id, username, email) VALUES (uuid(), 'user2', 'user2@example.com');
INSERT INTO users (user_id, username, email) VALUES (uuid(), 'user3', 'user3@example.com');
APPLY BATCH;

Aggregation Queries

-- Count users
SELECT COUNT(*) FROM users;

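-- Group by with aggregation
-- (restricting timestamp without a sensor_id scans every partition, so ALLOW FILTERING is required)
SELECT sensor_id, AVG(temperature) AS avg_temp, MAX(temperature) AS max_temp
FROM sensor_readings
WHERE timestamp > '2023-01-01'
GROUP BY sensor_id
ALLOW FILTERING;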

Application Integration

Python with ScyllaDB

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import uuid
from datetime import datetime

# Connect to ScyllaDB
cluster = Cluster(['localhost'], port=9042)
session = cluster.connect()

# Create keyspace and table
session.execute("""
CREATE KEYSPACE IF NOT EXISTS my_app
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("USE my_app")

session.execute("""
CREATE TABLE IF NOT EXISTS users (
user_id uuid PRIMARY KEY,
username text,
email text,
created_at timestamp
)
""")

# Insert data
user_id = uuid.uuid4()
session.execute("""
    INSERT INTO users (user_id, username, email, created_at)
    VALUES (%s, %s, %s, %s)
""", (user_id, 'john_doe', 'john@example.com', datetime.now()))

# Query data
rows = session.execute("SELECT * FROM users WHERE user_id = %s", (user_id,))
for row in rows:
    print(f"User: {row.username}, Email: {row.email}")

# Release connections when done
cluster.shutdown()

Java with ScyllaDB

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;
import java.util.UUID;

public class ScyllaDBExample {
    public static void main(String[] args) {
        // Connect to ScyllaDB (defaults to 127.0.0.1:9042);
        // try-with-resources closes the session on exit
        try (CqlSession session = CqlSession.builder()
                .withKeyspace("my_app")
                .build()) {

            // Insert data
            UUID userId = UUID.randomUUID();
            PreparedStatement insertStmt = session.prepare(
                "INSERT INTO users (user_id, username, email, created_at) VALUES (?, ?, ?, ?)"
            );

            session.execute(insertStmt.bind(
                userId,
                "john_doe",
                "john@example.com",
                java.time.Instant.now()
            ));

            // Query data
            PreparedStatement selectStmt = session.prepare(
                "SELECT * FROM users WHERE user_id = ?"
            );

            ResultSet rs = session.execute(selectStmt.bind(userId));
            for (Row row : rs) {
                System.out.println("User: " + row.getString("username"));
            }
        }
    }
}

Node.js with ScyllaDB

const cassandra = require('cassandra-driver');
const { v4: uuidv4 } = require('uuid');

// Connect to ScyllaDB
const client = new cassandra.Client({
  contactPoints: ['localhost'],
  localDataCenter: 'datacenter1',
  keyspace: 'my_app'
});

async function main() {
  await client.connect();

  // Insert data
  const userId = uuidv4();
  const query = 'INSERT INTO users (user_id, username, email, created_at) VALUES (?, ?, ?, ?)';
  await client.execute(query, [userId, 'john_doe', 'john@example.com', new Date()], { prepare: true });

  // Query data
  const selectQuery = 'SELECT * FROM users WHERE user_id = ?';
  const result = await client.execute(selectQuery, [userId], { prepare: true });

  result.rows.forEach(row => {
    console.log(`User: ${row.username}, Email: ${row.email}`);
  });

  await client.shutdown();
}

main().catch(console.error);

Performance Optimization

Query Optimization

-- Use ALLOW FILTERING sparingly
SELECT * FROM users WHERE email = 'john@example.com' ALLOW FILTERING;

-- Use IN clause for multiple partition keys
SELECT * FROM users WHERE user_id IN (uuid1, uuid2, uuid3);

-- Use LIMIT for pagination
SELECT * FROM user_sessions
WHERE user_id = ?
ORDER BY session_id
LIMIT 10;

Indexing Strategies

-- Create materialized views for complex queries
CREATE MATERIALIZED VIEW user_by_email AS
SELECT user_id, username, email, created_at
FROM users
WHERE email IS NOT NULL AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);

-- ScyllaDB does not support SASI indexes. For pattern matching, the LIKE
-- operator can be combined with ALLOW FILTERING (this scans rows, so keep it
-- to small tables or a single partition)
SELECT * FROM users WHERE username LIKE '%doe%' ALLOW FILTERING;

Consistency Levels

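from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Strong consistency (the consistency level is set on the statement,
# and non-prepared statements use %s placeholders)
insert_quorum = SimpleStatement(
    "INSERT INTO users (user_id, username) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(insert_quorum, (user_id, username))

# Eventual consistency for better performance
insert_one = SimpleStatement(
    "INSERT INTO users (user_id, username) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE)
session.execute(insert_one, (user_id, username))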
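
The trade-off between these levels comes down to simple arithmetic: a read is guaranteed to see the latest acknowledged write when the replicas contacted for the write and for the read must overlap, that is, when W + R > RF. The sketch below works this out for a few levels; the function names are hypothetical, and data-center-local levels such as LOCAL_QUORUM are left out for brevity.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    table = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


def reads_see_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when the write and read replica sets must overlap (W + R > RF)."""
    w = replicas_required(write_level, rf)
    r = replicas_required(read_level, rf)
    return w + r > rf


for write_level, read_level in [("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ALL", "ONE")]:
    print(write_level, read_level, reads_see_latest_write(write_level, read_level, rf=3))
```

With a replication factor of 3, QUORUM writes plus QUORUM reads overlap on at least one replica, while ONE plus ONE does not.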

Monitoring and Maintenance

Health Checks

# Check node status
nodetool status

# Check cluster information
nodetool info

# Check table statistics
nodetool tablestats my_keyspace.users

# Check compaction status
nodetool compactionstats
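
For automated checks, the output of nodetool status can be scanned for nodes that are not in the Up/Normal (UN) state. The parser below is a minimal sketch; the sample text is illustrative rather than captured from a real cluster, and `parse_nodetool_status` is a hypothetical helper name.

```python
SAMPLE_STATUS = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load     Tokens  Owns  Host ID                               Rack
UN  192.168.1.10  1.2 GB   256     ?     11111111-1111-1111-1111-111111111111  rack1
DN  192.168.1.11  1.1 GB   256     ?     22222222-2222-2222-2222-222222222222  rack1
UJ  192.168.1.12  300 MB   256     ?     33333333-3333-3333-3333-333333333333  rack1
"""


def parse_nodetool_status(output: str):
    """Extract status, state, and address from each node line."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2 or len(fields[0]) != 2:
            continue
        status, state = fields[0][0], fields[0][1]
        # Node lines start with a two-letter code such as UN, DN, or UJ.
        if status in "UD" and state in "NLJM":
            nodes.append({"up": status == "U", "state": state, "address": fields[1]})
    return nodes


unhealthy = [n["address"] for n in parse_nodetool_status(SAMPLE_STATUS)
             if not (n["up"] and n["state"] == "N")]
print(unhealthy)
```

In practice the text would come from running nodetool status through `subprocess.run` on a cluster node.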

Backup and Restore

# Create snapshot
nodetool snapshot my_keyspace

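# Enable incremental backups on this node
nodetool enablebackup

# Restore from snapshot (the path should end in <keyspace>/<table>; -d names a live node to stream to)
sstableloader -d 192.168.1.10 /path/to/snapshot/files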
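
Snapshots are written next to the live data, one directory per table, under `<data_dir>/<keyspace>/<table>-<id>/snapshots/<tag>`. A backup script first has to locate those directories; the sketch below does that with pathlib. `find_snapshot_dirs` is a hypothetical helper, and the data directory and tag in the usage comment are assumptions taken from the configuration shown earlier.

```python
from pathlib import Path


def find_snapshot_dirs(data_dir, keyspace, tag):
    """Return every table's snapshot directory for the given snapshot tag."""
    keyspace_dir = Path(data_dir) / keyspace
    return sorted(
        path for path in keyspace_dir.glob(f"*/snapshots/{tag}") if path.is_dir()
    )


# Example usage on a node, after `nodetool snapshot -t nightly my_keyspace`:
#   for path in find_snapshot_dirs("/var/lib/scylla/data", "my_keyspace", "nightly"):
#       print(path)
```

Each returned directory can then be copied to external storage; the tag passed to nodetool snapshot with -t keeps separate backup runs apart.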

Performance Monitoring

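-- Check for oversized partitions (per-table metrics are not queryable through CQL; they are exported in Prometheus format on port 9180)
SELECT * FROM system.large_partitions;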

-- Monitor query performance
SELECT * FROM system_traces.sessions;

-- Check cluster health
SELECT * FROM system.local;
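
ScyllaDB nodes also export metrics in Prometheus text format on port 9180, the port published in the Docker examples above. The sketch below parses that format into a dictionary; the sample lines are illustrative, `parse_prometheus_text` is a hypothetical helper, and the parser assumes samples carry no trailing timestamp.

```python
SAMPLE_METRICS = """\
# HELP scylla_reactor_utilization CPU utilization
# TYPE scylla_reactor_utilization gauge
scylla_reactor_utilization{shard="0"} 12.5
scylla_reactor_utilization{shard="1"} 7
"""


def parse_prometheus_text(text: str):
    """Map each 'name{labels}' series to its float value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # Ignore lines that do not end in a number.
    return samples


for series, value in parse_prometheus_text(SAMPLE_METRICS).items():
    print(series, value)
```

On a live node the text could be fetched with `urllib.request.urlopen("http://localhost:9180/metrics")`, assuming the port is reachable from where the script runs.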

Security

Authentication and Authorization

-- Create user
CREATE USER john_doe WITH PASSWORD 'secure_password';

-- Grant permissions
GRANT ALL PERMISSIONS ON KEYSPACE my_keyspace TO john_doe;
GRANT SELECT ON TABLE my_keyspace.users TO john_doe;

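-- Create role
CREATE ROLE app_user;
-- CQL grants one permission per statement; MODIFY covers INSERT, UPDATE, and DELETE
GRANT SELECT ON TABLE my_keyspace.users TO app_user;
GRANT MODIFY ON TABLE my_keyspace.users TO app_user;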

SSL/TLS Configuration

# Enable SSL (ScyllaDB reads PEM files rather than Java keystores; paths are examples)
server_encryption_options:
  internode_encryption: all
  certificate: /etc/scylla/db.crt
  keyfile: /etc/scylla/db.key
  truststore: /etc/scylla/ca.pem

client_encryption_options:
  enabled: true
  certificate: /etc/scylla/db.crt
  keyfile: /etc/scylla/db.key

Troubleshooting

Common Issues

  1. Connection Issues

    # Check if ScyllaDB is running
    sudo systemctl status scylla-server

    # Check network connectivity
    telnet localhost 9042
  2. Performance Issues

    # Check memory usage
    nodetool info | grep "Heap Memory"

    # Check disk usage
    df -h /var/lib/scylla/

    # Check compaction status
    nodetool compactionstats
  3. Data Consistency Issues

    # Repair data
    nodetool repair my_keyspace

    # Rebuild SSTables, discarding corrupted data
    nodetool scrub my_keyspace
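
The connectivity check from item 1 can also be scripted, which helps when a container or service needs a while before it accepts CQL connections. The sketch below polls a TCP port until it opens or a timeout expires; `wait_for_port` is a hypothetical helper, and an open port only shows that something is listening, not that the node has finished joining the cluster.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection succeeds, False after the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out; retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage:
#   if not wait_for_port("localhost", 9042, timeout=60):
#       raise SystemExit("ScyllaDB is not accepting CQL connections on port 9042")
```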

Log Analysis

# Check ScyllaDB logs
sudo tail -f /var/log/scylla/scylla.log

# Check system logs
sudo journalctl -u scylla-server -f

# Show only errors (ScyllaDB is written in C++, so there is no JVM garbage collector and no GC log)
sudo journalctl -u scylla-server -p err

Best Practices

Data Modeling

  1. Design for Queries: Model data based on access patterns
  2. Avoid Wide Partitions: Keep partition sizes manageable
  3. Use Appropriate Data Types: Choose efficient data types
  4. Plan for Growth: Design for future data volume
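
To make "Avoid Wide Partitions" concrete: one common approach for time-series tables such as sensor_readings is to add a time bucket to the partition key, so that a sensor's rows are spread over one partition per day or hour instead of a single ever-growing one. The sketch below computes such a bucket on the client; the function names and the one-day default are assumptions for illustration, and the table would need a matching bucket column in its partition key.

```python
from datetime import datetime, timezone


def bucket_start(ts: datetime, bucket_seconds: int = 86400) -> datetime:
    """Round a timezone-aware timestamp down to the start of its bucket (UTC)."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % bucket_seconds, tz=timezone.utc)


def partition_key(sensor_id: str, ts: datetime, bucket_seconds: int = 86400):
    """Composite partition key: (sensor_id, bucket) instead of sensor_id alone."""
    return (sensor_id, bucket_start(ts, bucket_seconds))


reading_time = datetime(2023, 1, 1, 13, 45, tzinfo=timezone.utc)
print(partition_key("sensor-42", reading_time))
print(partition_key("sensor-42", reading_time, bucket_seconds=3600))
```

Queries then name the bucket explicitly, and a time range that spans several buckets becomes several single-partition reads.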

Performance

  1. Use Prepared Statements: Avoid query parsing overhead
  2. Batch Operations: Group related operations
  3. Monitor Metrics: Track performance indicators
  4. Tune Consistency: Balance consistency vs performance
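
One way to read "group related operations" is to batch only rows that share a partition key, since a batch that spans many partitions adds work for the coordinator instead of saving it. The sketch below plans such batches in plain Python without touching the driver; `plan_batches` and the batch size limit are hypothetical choices for illustration.

```python
from collections import defaultdict


def plan_batches(rows, partition_key, max_batch_size=50):
    """Group rows by partition key, then split each group into bounded batches."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row[partition_key]].append(row)
    batches = []
    for group in by_partition.values():
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches


sessions = [
    {"user_id": "u1", "session_id": 1},
    {"user_id": "u2", "session_id": 2},
    {"user_id": "u1", "session_id": 3},
    {"user_id": "u1", "session_id": 4},
]
for batch in plan_batches(sessions, "user_id", max_batch_size=2):
    print(batch)
```

Each planned batch could then be sent as one BEGIN BATCH ... APPLY BATCH statement, or as a driver batch object, against a single partition.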

Operations

  1. Regular Backups: Implement automated backup strategies
  2. Monitor Health: Set up comprehensive monitoring
  3. Plan Scaling: Design for horizontal scaling
  4. Test Recovery: Regularly test backup and restore procedures

Resources and References

Official Resources

  • ScyllaDB Cloud: Managed ScyllaDB service
  • ScyllaDB Manager: Cluster management tool
  • ScyllaDB Monitoring: Prometheus- and Grafana-based monitoring stack
  • ScyllaDB Tools: Utility tools for operations

Learning Resources

  • CQL Reference: Complete CQL documentation
  • Performance Tuning Guide: Optimization best practices
  • Migration Guide: Cassandra to ScyllaDB migration
  • Architecture Guide: Deep dive into ScyllaDB architecture