Skip to main content

Apache Pulsar

Apache Pulsar is a distributed, open-source messaging and streaming platform designed for high-performance, scalability, and data consistency. It combines the features of traditional message queues with those of publish-subscribe systems, providing unified platform for real-time event streaming and messaging.

Key Concepts:

  • Topics: Named channels through which messages are published and consumed. Like Kafka topics, Pulsar topics are used to categorize messages, but Pulsar's architecture allows for more flexible topic management.

  • Partitions: Topics can be partitioned to allow parallel processing. Each partition is handled by a single broker.

  • Namespaces: A logical grouping of topics. They are the administrative units within a Pulsar tenant and used for resource quota management and access control.

  • Tenants: The highest-level administrative unit in Pulsar. Tenants can span multiple clusters.

  • Brokers: Servers that handle the storage and delivery of messages. Pulsar brokers are stateless and rely on Apache BookKeeper for durable storage.

  • BookKeeper: A distributed, durable, and consistent ledger storage system used by Pulsar to store messages persistently. BookKeeper ensures that messages are durably stored and replicated across multiple bookies (storage nodes).

  • Bookies: Storage nodes in the BookKeeper cluster. They are responsible for storing the actual message data.

  • Producers: Applications that publish messages to Pulsar topics.

  • Consumers: Applications that subscribe to Pulsar topics and receive messages.

  • Subscriptions: A named configuration, associated with a topic, that determines how consumers receive messages. Pulsar offers several subscription modes:

    • Exclusive: Only one consumer can subscribe to a topic at a time.
    • Shared: Multiple consumers can subscribe to a topic, and messages are distributed among them in a round-robin fashion.
    • Failover: Multiple consumers can subscribe to a topic, but only one consumer is active at a time. If the active consumer fails, another consumer takes over.
    • Key-Shared: Messages with the same key are delivered to the same consumer. This guarantees message ordering for specific keys.
  • Message Acknowledgement: Consumers must acknowledge messages after they have been successfully processed. Pulsar supports both cumulative and individual acknowledgements. Cumulative acknowledgement acknowledges all messages up to a certain point, while individual acknowledgement acknowledges each message separately.

Architecture:

Pulsar adopts a layered architecture:

  1. Producers publish messages to Topics.
  2. Topics are managed within Namespaces and Tenants.
  3. Brokers receive messages from producers and route them to BookKeeper ensembles.
  4. BookKeeper stores messages persistently on Bookies, ensuring fault tolerance through replication.
  5. Consumers subscribe to Topics via different Subscription Modes.

Key Features and Benefits:

  • High Performance: Pulsar is designed for high throughput and low latency, suitable for real-time data streams.

  • Scalability: Pulsar can be easily scaled horizontally by adding more brokers and bookies to the cluster.

  • Fault Tolerance: Pulsar replicates data across multiple bookies to ensure data durability and availability in case of node failures. BookKeeper's architecture allows for tolerance of both broker and storage node failures.

  • Data Consistency: BookKeeper provides strong data consistency guarantees, ensuring that messages are delivered reliably and in order.

  • Flexible Messaging Models: Pulsar supports various messaging models, including publish-subscribe, queuing, and streaming, allowing it to support a wide range of applications.

  • Multi-Tenancy: Pulsar supports multi-tenancy, allowing multiple independent applications or organizations to share the same cluster.

  • Geo-Replication: Pulsar supports geo-replication, allowing data to be replicated across multiple geographic regions for disaster recovery and low-latency access.

  • Tiered Storage: Pulsar provides tiered storage capabilities, allowing older data to be stored on cheaper storage tiers for cost optimization.

  • Schema Registry: Pulsar has a built-in schema registry for managing message schemas and ensuring data consistency.

  • SQL-Based Querying (Pulsar SQL): Allows querying data stored in Pulsar topics using SQL.

Use Cases:

  • Real-Time Analytics: Processing and analyzing real-time data streams for insights and decision-making.

  • IoT Data Ingestion: Ingesting and processing data from IoT devices at scale.

  • Financial Transactions: Processing financial transactions in real-time for fraud detection and risk management.

  • Log Aggregation: Collecting and aggregating logs from multiple applications and servers for centralized monitoring and analysis.

  • Event-Driven Microservices: Building event-driven microservices architectures where services communicate through events.

  • Gaming: Real-time gaming applications requiring low latency and high throughput.

  • Personalization: Providing personalized experiences based on real-time user behavior and preferences.

Comparison with other Message Queues:

FeatureApache PulsarApache KafkaRabbitMQ
ArchitectureLayered (Brokers + BookKeeper)Distributed, Log-basedCentralized, Broker-based
StorageSeparated Compute and StorageIntegrated Compute and StorageIntegrated Compute and Storage
ScalabilityHigh ScalabilityHigh ScalabilityScalable with clustering
ThroughputHigh ThroughputHigh ThroughputModerate Throughput
Data ConsistencyStrong ConsistencyEventual ConsistencyConfigurable
Multi-TenancyNative SupportLimited SupportLimited Support
Geo-ReplicationNative SupportSupported via MirrorMakerSupported via plugins
Tiered StorageNative SupportNot SupportedNot Supported
Messaging ModelsPub-Sub, Queuing, StreamingPub-SubQueuing
ComplexityMore Complex to Setup and ManageMore Complex to Setup and ManageEasier to Setup and Manage

Getting Started:

  1. Download Pulsar: Download the latest version of Pulsar from the Apache Pulsar website.
  2. Start Pulsar: Start the Pulsar cluster using the pulsar standalone command (for a single-node deployment) or by configuring a distributed cluster. This will automatically start BookKeeper as well.
  3. Create a Tenant: Create a tenant using the pulsar-admin tenants create command.
  4. Create a Namespace: Create a namespace within the tenant using the pulsar-admin namespaces create command.
  5. Create a Topic: Create a topic within the namespace using the pulsar-admin topics create command.
  6. Start a Producer: Start a producer using the Pulsar client library (available for various languages) and send messages to the topic.
  7. Start a Consumer: Start a consumer using the Pulsar client library and subscribe to the topic to receive messages.

Example Command (Create a Topic):

./bin/pulsar-admin topics create persistent://my-tenant/my-namespace/my-topic -p 1

Conclusion:

Apache Pulsar is a modern messaging and streaming platform that offers high performance, scalability, data consistency, and flexible messaging models. Its layered architecture, separation of compute and storage, and support for multi-tenancy and geo-replication make it well-suited for building modern, cloud-native applications that require real-time data processing and analysis. While it may be more complex to set up and manage than some simpler message queues, its advanced features and capabilities make it a compelling choice for demanding use cases. When deciding between Pulsar and Kafka, consider if you need features like strong consistency, multi-tenancy, tiered storage, and built-in geo-replication. Pulsar's architecture, with its separation of compute and storage, can also offer advantages in terms of resource utilization and scaling.