Entire website open-sourced at: https://github.com/grokit/website_grokit_ca/. Feel free to make pull requests if you find mistakes!

# Apache Hadoop Explained: Kafka, ZooKeeper, HDFS and Cassandra.

Apache Hadoop is a suite of open-source components which serve as the building blocks of large distributed systems.

Some of the world's best apps and websites are powered by Hadoop. If you want to understand how to build world-scale web services, understanding the principal Hadoop components is a good place to start.

## Kafka

Kafka is a publish-subscribe distributed messaging system implemented on a log abstraction. Publishers write to topics, any number of clients can subscribe to that topic and get notified when a new event happen.

The base abstraction in Kafka is a distributed log. This log allows to define an authoritative order for the events in the system as well as allowing buffering between producer consumer (a producer can produce N messages, the consumer does not have to immediately consume them). However, that log has a maximum size and retention period, so if the events are not consumed within a period of time, they vanish.

Kafka is organized by topics, which are a name that producer and consumer agree to use for a specific event stream. For example, you could think of a topic being "Picture7564_Comments". The producers write to that topic when a user comments on the picture, and clients subscribed to that topic get a notification whenever a new comment is added on that picture.

The topics are broken-down by partition. Partitions are used to provide scaling, parallelism and ordering guarantees. Messages in the same partition are totally ordered on the server, and deliver in-order to the client. However, between partitions, message on a topic have no ordering defined. A partition is attributed to a server, so having only one partition per topic limits the throughput.

Kafka replicates the data on multiple servers. Replication factor is a configurable setting, setting it to one will yield the best performance but risk permanent message loss on a hard-disk failure. When the replication factor is N, it guarantees that unless N-1 nodes fail at the same time, the message will not be lost and be readable by the consumer.

Kafka is designed to be a real-time system; the delay between an event being published and consumed is expected to be less than 100ms.

Similar system: Rabbit MQ

## ZooKeeper

ZooKeeper is a distributed coordination service which is used when nodes in a distributed system need a single source of truth. Most of Hadoop's other components (e.g. Kafka, HDFS, ...) use ZooKeeper internally.

It is implemented as a single (movable) master with N coordinated nodes. A majority of nodes (n/2+1) must agree on a change before the change is accepted. The write requests have to be processed by the master (so the write rate is severely limited), but the read requests can be answered by any of the coordinated nodes.

The basic unit of interaction are znodes which are file-like entities addressable under a path:

ZooKeeper provides a low-level API that allows to do just a few fundamental operations on those znodes. For example (but not limited to):

• getData(path, watch)

• setData(path, data, version)

Guarantees: The clients are not guaranteed to all have the same view at the same time (because of network delays and network segmentation, some clients might have information from a previous state). But the clients are guaranteed to all view the changes in the same order.

Example Use-Cases: distributed lock, group membership and configuration management.

ZooKeeper provides very low-level abstraction, and implementing scenarios like leader election can be really hard to do well. Usually people will use the Curator library which implements higher-level concept on top of the ZooKeeper API.

Similar system: Google's Chubby.

## HDFS

Distributed data-store which is highly available and guarantees no data-loss even in case of multiple server and hard-disk failures. Think of HDFS as a distributed file store that guarantees high-throughput for read and writes and well as durability. HDFS gives a file-like interface: you can read, write and delete files which are organized under an hierarchical folder structure.

Internally HDFS uses a master / coordinated nodes design. The master node is called the NameNode and the coordinates nodes are called DataNode. Files stored in HDFS are broken down in blocks (up to \~64MB), and the blocks bytes are stored in the DataNodes. The NameNode is responsible for managing the filesystem namespace and regulating access to files by clients. It stores the block to DataNode relation. DataNode are responsible for storing the data and directly handling read / write streams. So when a client wants to read a file, his request has to first go to the MasterNode to know which DataNode hold the blocks for this file. Then the client connects directly to the DataNodes and reads the data.

It may seen odd that all read / writes need to go through the unique NameNode, this introduces an obvious performance bottleneck. However, the operations that the NameNode has to do (looking up file -> block information from RAM and filesystem metadata changes) are light, and the costly operations (actually reading / writing the data) are done by the DataNodes, so the bottleneck is not that severe. The NameNode uses a synchronously-replicated transaction log for metadata changes so that HDFS can recover from a NameNode permanent failure.

HDFS maintains N (usually 3) copy of all blocks at all times. It verifies that N copies are always healthy, and if some of the copies have vanished (e.g. hard-disk failure) it will create new copies of the data until the replication factor goes back to N.

Similar systems: GFS, Amazon S3, Azure Blob storage

## Cassandra

Cassandra is a distributed database which scales linearly as nodes are added to the cluster. Cassandra has more structure than a simple key-value store but less structure and guarantess than a full-on relational database. It provides an SQL-like language (called CQL) for data-operations. Cassandra is optimized for fast writes (reads are more expensive) and is eventually consistent. There is support for atomic operations for data in the same node. In contrast to most storage systems, all nodes are homogeneous -- there is only one type of node in Cassandra and no underlying dependency on HDFS or ZooKeeper.

Data in Cassandra is organized in tables, which are themselves broken down in rows and columns. A (table, key, columnName) represents the address of data. Unlike a relational database, Cassandra does not force all keys in a table to have the same set of columns. Tables in Cassandra are sharded by primary key (which identifies a row of data) and clustered by clustering key. The shard defines which machine owns the row. Clustering defines the order the data is written in. Rows can have time limited data (set TTL on write) and specify triggers (action that happen when data changes).

Cassandra is designed for high write throughput, at the cost of read speed. This is due to using a log-structured merge-tree (LSM-Tree) as opposed to B+ tree which is ubiquitous in most other databases (e.g. MySQL). Briefly, a LSMT database strive to write data sequentially since sequential write is much faster than random write for both HDD and SSD. Data is kept in a cache and a set of files which contains sorted data. When the number of files gets too large, compaction merges N-files together. The cache can return read operations for most recently used data, but reading colder data is costly since it must be binary-searched in the sorted files. Writing is very fast because it is mostly just appending data to a file.

Cassandra allows fine-tuning of data replication settings on a per-table basis. The default mode is quorum, where a R + W > N consensus is necessary in order to accept a write or server a read. Lightweight transactions are also possible through the IF clause, but (of course, there is no free lunch) at the cost of giving-up availability.

Unlike Amazon Dynamo, Cassandra does not use vector clocks. Rather, it relies on a last-write-wins scheme and hopes that clocks skew is minimal. This is a weak point of Cassandra, in the case of network partition, two clients can end-up writing a key on different nodes. If the nodes do not have a synchronized clock, last-write wins means that one of the two client (not necessarily the one that wrote data last) will have his data overwritten and not be aware of it.

Similar system: Amazon Dynamo, Google BigTable, Azure Table, HBase, Riak.

A few distinctions between the systems:

• Cassandra came 2 years after Amazon Dynamo and can be considered an hybrid between BigTable and Dynamo.
• Amazon Dynamo is offered as a platform (par-per-use), Cassandra and HBase are Apache open-source projects: you can pick a provider for pay-per-use or deploy to your own data-center.
• Both Cassandra and HBase are partition-tolerant systems, but Cassandra focuses on availability at the cost of consistency; HBase focuses on consistency at the cost of availability. In CAP-parlance, Cassandra is an AP-system and HBase is an AC-system.

## Honorable Mentions (May Add Sections for Those Later)

• Thrift: RPC framework.
• Spark & Flink: mapreduce-type analytics.