Introduction

Cassandra is a distributed, no single point of failure, continuously available and scalable. NoSQL database that manages a large amount of data across many data centres and cloud servers. It offers both operation simplicity and capacity to scale linearly. While MySQL is the world’s most popular, cost-effective, high-performance relational database(Kumar, 2016). It comes with a range of suits of tools and drivers to manage business critical applications. It can also be used as an operational data store for Cassandra. This technical report discusses the different characteristics, architecture and transaction management between the two databases. It also details setting up the benchmark in OpenStack to find out which database is suitable for an application. The results show that Cassandra is suitable for situations where fast read or write performance is required. It is also suitable for reliable cross datacentre replication with the ability to scale as required. MySQL is suitable for atomic, consistent, isolated and durable transactions with unrestricted row-level locking and server-enforced referential integrity. However, it does not scale as quickly as Cassandra and have a rigid schema.

Key Characteristics

Cassandra is a linearly and horizontally scalable, wide-row-store database that uses a highly denormalized model designed to capture and query data. Its data modelling techniques combine properties of a key-value database with the column-oriented one.

It implements a peer-to-peer architecture in which data is distributed across clusters based on a single token. Unlike a relational database that penalizes the use of many columns in a table, it executes tables that have thousands or even tens of thousands of columns. It provides helpful data modelling abstractions to make this model approachable for the developer. It also uses partition key to distribute data across the cluster and avoid bottlenecks. For successful implementation, right data model and hardware is required. (Datastax Academy, 2017).

On the other hand, MySQL is a storage-engine architecture and suitable for a wide range of purposes such as web applications, online transaction processing(OLTP) and data warehouses. The design is separated into data storage, query processing and data retrieval.

Cassandra	MySQL
Manages data from multiple locations with high velocity	Manages data from one or few locations with moderate velocity
Handles all type of data with high volume	Mostly handles structured data with moderate volume
Constant uptime with horizontal scaling and decentralized deployments	Single points of failure with vertical scaling and centralized deployment
Sustains simple transactions	Maintains nested or complex transactions
Supports read and write scalability	Supports read scalability

Table 1: Different Characteristics between Cassandra and MySQL

Architecture

Cassandra

Cassandra has a shared-nothing or master less “ring” architecture that results in its ability to scale, perform and continuous uptime. The data is partitioned evenly across several “rings” or clusters of nodes using an internal component called a partitioner that uses a consistent hashing algorithm. The hashing algorithm determines which node holds a specific data and maps row keys to physical nodes. It also assigns a token range.

All nodes have an identical role and communicate with each other via a “gossip” protocol. The protocol helps in failure detection because each node share information about their and maximum of 3 other nodes state with each other over a period. To add more capacity new nodes can be added without taking an existing cluster down(Mehra, 2016).

Figure 1: Cassandra Ring(Mehra, 2016)

Cassandra also provides users an ability to configure the consistency level, several replicas in a cluster for successful read or write operations in a data centre (centralized place to house servers and networking systems).

It creates copies of data to avoid single point of failure according to either Simple or Network Topology Strategy. The simple strategy places successive copies on the next node in a clockwise manner. While Network Topology strategy uses “snitches” to ensure replicas use efficient inter-node requests and do not store data on the same rack in the datacentre(Tinoco, 2017).

It has one or more keyspaces which are like schemas in MySQL. Each keyspace has different replication model with a strategy and factor (number of nodes that the data replicates). Each keyspace also has a column family which is a sorted map with a row providing access to a set of columns. A row or partition key has several columns associated with it.

Figure 2: Cassandra Keyspace (Zaforas, 2017)

MySQL

MySQL architecture is a tiered architecture consisting of client- server system. The applications connecting to it are clients and the database is the server. The top layer is application layer which provides services to most servers need such as:

Connection handling: A client gets its own thread when the client connects to a server. All queries are executed within the same specific thread. The thread is also cached by the server.
Authentication is performed on the server side based on the username, host of the client and password of the client user.
Security through checking if the client has the correct rights to issue certain queries.(Vadim Tkachenko, Peter Zaitsev, 2012)

Figure 3: MySQL Architecture(Kumar, 2016)

The second layer is the server layer which handles all the logical functionalities including query parsing, optimization analysis, built-in functions (e.g. dates, times) and caching. This layer can be divided into sub-components which are:

Services and utilities for administration and maintenance including Backup & recovery, replication, partitioning, security and workbench.
Interface is used for data definition and manipulation with functionality provided across storage engines such as views, stored procedures and triggers. Structured Query Language (SQL) is a tool used to query server and client.
Parser behaves like a single pass compiler and is used to create an internal structure with lexical and syntactic analysis.
Optimizer applies different techniques such as ordering of scanning tables, rewriting the query and choosing the right index.
Caches & buffers stores the complete output(MySql, 2018).

The third layer contains storage engines for various requirements and situations. This layer is responsible for recovering and storing data and responding to requests from the server. It is not responsible for parsing or communicating with each other. Some of the supported engines are:

Comma Separated Values (CSV)
InnoDB
Blackhole
MyISAM
Memory(Kumar, 2016)

Transaction Management

A transaction is a group of queries that are treated as a single unit of work. Both Cassandra and MySQL manage transactions differently.

Cassandra support AID (Atomic, Isolated and Durable) transactions with acceptable consistency. There is no concept of joins or foreign keys or referential integrity or rollback or locking mechanism. The user can decide on the strength or eventual consistency for a transaction by selecting either complete the query when one node or multiple nodes respond is marked complete.

It supports transactional atomicity and isolation at the row-level but prefers fast write preference and high availability. Insertions or updates of two or more rows in the same partition are treated as one write operation. Further, a delete operation is atomic at partition level. It uses client-side timestamps to establish the most recent update to a column even though multiple client sessions update the same columns in a row concurrently. It also performs write and delete operations with full row-level isolation. All write, updates in a batch operation belonging to only a single given partition key is only visible to the client performing operations. The writes are durable with replica nodes being logged in memory and in commit log before documented as success. In case of a crash or failure, the commit logs are replayed on restart.It also delivers compare-and-set mechanism or lightweight transactions. It offers an isolation level similar to the relational database’s serializable level by using and extending the Paxos consensus (Datastax, 2017).

MySQL manages ACID transactions by the underlying storage engines rather than at the server level. It functions in AUTOCOMMIT mode by default. This mode automatically executes each query in a separate transaction and can be either disabled or enabled by setting a variable. There are several keywords such as BEGIN WORK, COMMIT, and ROLLBACK to allow transactions. The server knows that the declarations are part of the transaction with BEGIN WORK till either a COMMIT or ROLLBACK statement is executed. COMMIT writes the changes to the database and the data cannot be changed without another COMMIT command. No concurrently run SQL statements or crashes affect the data. ROLLBACK statement tells the database to revert to the point before the start of the transactions. All non-committed transactions are rolled back automatically in event of a crash. Each transaction is isolated from the other by using a locking scheme. Locking prevents all access to the data of a single client and prevents other clients or threads from altering data.

MySQL provides two transactional engines apart from several other third-party engines: InnoDB and NDB Cluster. InnoDB supports all four ANSI standard isolation levels. It provides row-level locking and locks view of the data at a specific point in time. It supports READ COMMITTED, READ UNCOMMITTED, REPEATABLE READ and SERIALIZABLE. However, NDB Cluster only supports READ COMMITED on a per-row basis. Also, the data which is uncommitted is never returned(MySql, 2018).

Performance Test Set up

For the analysis, Yahoo! Cloud Serving Benchmark (YCSB) was set up to evaluate and compare the performance of MySQL and Cassandra databases. YCSB is an open source program which helps to evaluate and compare the maintenance and recovery capabilities. It consists of two parts: a data generator and a set of performance tests consisting of read and insert operations. It includes a set of core test scenarios called workloads that define a set of features for different systems to understand the performance trade-offs. The core workloads consist of six different workloads:

Workload A has a mix of 50/50 reads and writes. It is also update heavy. An application it can be suitable for is storing recorded recent events.
Workload B is Read mostly and has 95/5 reads/write mix. An application it can be suitable for is reading photo tags from Instagram.
Workload C is Read-only. An application it can be suitable for is user profile cache.
Workload D is Read latest and new records where the most popular inserted records are the most recent. An application it can be suitable for is user status updates.
Workload E is when short ranges of records are queried, instead of individual records. An application it can be suitable for is threaded conversations.
Workload F is Read-modify-write. The client read, modifies and write back the changes to the record(Cooper, 2016).

Figure 4: YCSB(Cooper et al., 2010)

All six workloads have a similar data set with workloads D and E inserting records during the test run.

The focus of the analysis was to find the performance of the two databases under different workloads. The test was performed on 10000,20000,40000 and 50000 records and operations. All the tests were executed on OpenStack with m1. medium flavour,2 VCPUs, 4GB RAM, 40GB of disk space. The operating system was Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-119-generic x86_64).

During the experimental evaluation, these were the versions which were tested:

Apache Cassandra: 3.11.2
MySQL: 14.14 Distrib 5.7.21, for Linux (x86_64) using EditLine wrapper
Java: 1.8.0_161
YCSB: 0.12.0

Workload Method

For each iteration of the test, the following steps were run:

Use database (MySQL) or keyspace (Cassandra)
Create an empty table
Load with a count of the records
Run the OPS load with the same operation count
Clear the database by removing all data files and commit logs

Cassandra’s keyspace was created with simple strategy and a replication factor of 3. Further, it was run on: Datacenter: datacenter1; Host: localhost/127.0.0.1; Rack: rack1

MySQL was run on: Shard node URL: jdbc: mysql://localhost:3306/BenchTest

Evaluation and Results

The output from YCSB shows separate time series for different run of loads and operations. An example is the sample output shown below which can be interpreted as the total execution time for 10,000 operations was 0.23 minutes. The average throughput was 720.9 across all threads. There were 4889 read and 5111 update operations which were successful. The table shows average, minimum, maximum, 99th and 95th Percentile latency in microseconds. Latency is the amount of time for network latency to the server and time of execution for interface and database.773.71 microseconds and 690.06 microseconds is the end to end latency for read and update operations against the database respectively(Cooper, 2016).

[OVERALL]	RunTime(ms)	13871
[OVERALL]	Throughput(ops/sec)	720.928556
[TOTAL_GCs]	Count	11
[TOTAL_GC_TIME]	Time(ms)	71
[TOTAL_GC_TIME_%]	Time(%)	0.511859275
[READ]	Operations	4889
[READ]	AverageLatency(us)	773.7181428
[READ]	MinLatency(us)	440
[READ]	MaxLatency(us)	75455
[READ]	95thPercentileLatency(us)	1420
[READ]	99thPercentileLatency(us)	2953
[READ]	Return=OK	4889
[UPDATE]	Operations	5111
[UPDATE]	AverageLatency(us)	690.0610448
[UPDATE]	MinLatency(us)	397
[UPDATE]	MaxLatency(us)	14031
[UPDATE]	95thPercentileLatency(us)	1274
[UPDATE]	99thPercentileLatency(us)	2557
DATE]	Return=OK	5111

Table 2: Sample Output after running Workload A in Cassandra

Figure 5: Runtime for Cassandra (Load)

Runtime is the length of time a program takes to execute(Endpoint, 2015). The plots for loading data into Cassandra shows that the runtime increases as the record count increases. Workload A and F take the most time. The time is also very similar for the two workloads. Workload C takes the least time to load. There is an anomaly in the plot for 50000 records in which workload D takes 38683.0 milliseconds to load.

Figure 6: Runtime for MySQL (Load)

The plots show that the runtime for loading data into MySQL is significantly higher than Cassandra. For Example, it takes 20 times longer to load data for workload A in MySQL than Cassandra. Further, the runtime increases almost 3 times the number of records to load initially. Runtime for workload A, B, C, D, E is almost the same for 40,000 records. However, runtime increases significantly for workload A, B, C as the records increases to 50,000. It follows the same pattern for workload E and D.

Figure 7: Runtime for Cassandra (Run)

The runtime plots for performing operations in Cassandra demonstrate that only workload A is consistent with the number increase. It takes 720.9 ms,17086.0 ms,30152.0 ms, 37356.0 ms for 10000,20000,40000 and 50000 operations respectively. For workload B, C, D, larger operation counts result in a longer runtime. Workload E takes the most time to run. Hence, Cassandra is suitable for applications which are update heavy.

Figure 8: Runtime for MySQL (Run)

The runtime for performing the number of operations follows the same pattern for all the workloads. Workload A and F take the most time while C and D take the least amount of time. It also takes almost twice for the operation to perform when the number increase from 20000 to 40000. It also takes almost 17 times more for MySQL than Cassandra to run for Workload A with 600,000 milliseconds and 35,000 milliseconds respectively.

Figure 9: Throughput Vs Average latency for Read Operation (Cassandra)

Figure 10: Throughput Vs Average latency for Read Operation (MySQL)

Throughput is defined as the number of units of information the database can process over time. The plots show the average latency for read operation with the different throughput for different workloads and number of operations. For small number of operations, workload A and F (MySQL)and F(Cassandra) have no effect on average latency for read operations as the throughput increases. For other workloads, latency decreases as throughput increases.

Figure 11: Throughput Vs Average latency for Inserting Records (Cassandra)

Figure 12: Throughput Vs Average latency for Inserting Records (MySQL)

The latency for inserting records for MYSQL is the same even though the throughput increases. Conversely, latency for insertion decreases for Cassandra as the throughput increases.

Figure 13: Throughput Vs Average latency for Update Operation (Cassandra)

Figure 14: Throughput Vs Average latency for Update Operation (MySQL)

Latency for update operation for workload A, B and F in Cassandra decreases as the throughput increases. However, there is no effect of throughput on latency for MySQL.

Figure 15: Throughput for Cassandra and MySQL (Load)

The above plot shows that Cassandra is better than MySQL for any workload when loading data. Cassandra’s throughput increases as the number of records are increased. Conversely, MySQL performance decreases as the number of records increases.

Figure 16: Throughput for Cassandra and MySQL (Run)

The plots show that Cassandra has a higher throughput for Workload A, B and F than MySQL. Conversely, MySQL has higher throughput for other three workloads. For workload C and 50000 operations, MySQL has a significantly higher throughput. Hence, it is more efficient in read-only operations than Cassandra.

Figure 17: Average Latency for Inserting Records

The plots confirm that MySQL takes considerably long time to load data than Cassandra. The time taken to load data increases for Cassandra for each workload. However, for MySQL, workload D takes less time as the records increase.

Figure 18: Average Latency for Read and Scan Operation

The plot shows that Cassandra has a smaller delay to transfer the data for the read operation. It is also faster for workload A, B, C, D and F as the operation count increases while MySQL’s average latency for the instruction stays the same. Therefore, Cassandra scales better than MySQL.

Figure 19: Average Latency for Update and Read-Modify-Write Operation

The update operation is available only for the three workloads. The plots show that there is a huge difference for average latency between the two databases for updating. Still, increase in the number of instructions did not affect the latency for both. Plus, the latency for read-write-modify operation for Cassandra follows a similar pattern as update operation while MySQL’s latency increases slightly. Insert operation failed for Cassandra failed running the experiment.

Figure 20: Insert-Failed (Cassandra)

Conclusion

Cassandra is continuously available with peer-to-peer or master-less architecture. Its “ring” architecture allows linear scalability in throughput and nodes addition without complex decisions. It can manage data through data centres in multiple locations and failure is handled through gossip protocol. The data can be both structured or unstructured. It does not need special hardware or caching layer. Durability is ensured due to append-only commit logs.

Conversely, MySQL is suitable for ACID complaint transactions with referential integrity. The data is mostly structured with low to moderate volumes. Further, it is suitable for architectures which are primarily read with no need to write to multiple servers. MySQL is suffering from the rapid increase of big data. The master-slave architecture is preventing it from being available to receive data from different geological locations and hence lacks massive scalability. It is also time-consuming to manually correct failback situations.

Therefore, the two databases provide a choice appropriate on the use case as well as requirements of the business.

References

Cooper, B. F. et al. (2010) ‘Benchmarking Cloud Serving Systems with YCSB’. Available at: https://www2.cs.duke.edu/courses/fall13/cps296.4/838-CloudPapers/ycsb.pdf (Accessed: 18 March 2018).

Cooper, B. F. (2016) YCSB – Yahoo! Cloud Serving Benchmark, GitHub. Available at: https://github.com/brianfrankcooper/YCSB (Accessed: 21 April 2018).

Datastax (2017) How are Cassandra transactions different from RDBMS transactions? | Apache Cassandraâ„¢ 3.0, Datastax. Available at: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlTransactionsDiffer.html (Accessed: 15 April 2018).

Datastax Academy (2017) A Brief Introduction to Apache Cassandra | DataStax Academy: Free Cassandra Tutorials and Training, Datastax Academy. Available at: https://academy.datastax.com/resources/brief-introduction-apache-cassandra (Accessed: 3 March 2018).

Endpoint (2015) Benchmarking Top NoSQL Databases. Available at: http://www.endpoint.com/ (Accessed: 15 April 2018).

Kumar, R. (2016) Understanding MySQL Architecture ~ Database blog by Rathish kumar, Rathish Kumar. Available at: http://www.rathishkumar.in/2016/04/understanding-mysql-architecture.html (Accessed: 3 March 2018).

Mehra, A. (2016) Introduction to Apache Cassandra’s Architecture – DZone Database, DZone. Available at: https://dzone.com/articles/introduction-apache-cassandras (Accessed: 3 March 2018).

MySql (2018) MySQL :: MySQL 5.7 Reference Manual :: 15.11 Overview of MySQL Storage Engine Architecture, MySql. Available at: https://dev.mysql.com/doc/refman/5.7/en/pluggable-storage-overview.html (Accessed: 3 March 2018).

Tinoco, B. R. (2017) Set up a basic Apache Cassandra architecture, IBM developer works. Available at: https://www.ibm.com/developerworks/library/ba-set-up-apache-cassandra-architecture/index.html (Accessed: 3 March 2018).

Vadim Tkachenko, Peter Zaitsev, B. S. (2012) ‘1. MySQL Architecture and History – High Performance MySQL, 3rd Edition [Book]’, in High Performance MySQL. Available at: https://www.safaribooksonline.com/library/view/high-performance-mysql/9781449332471/ch01.html (Accessed: 3 March 2018).

Zaforas, M. (2017) Cassandra, the lady of NoSQL databases – Paradigma Digital, Paradigma. Available at: https://en.paradigmadigital.com/dev/cassandra-lady-nosql-databases/ (Accessed: 3 March 2018).

Tagged Data Engineering, Data Warehousing

A Comparison between Cassandra and MySQL