
Solutions to Support Real-Time Data Analytics

Organisations embracing big data use non-traditional strategies and technologies to gather, organise and process large datasets and to extract insights from them. However, many of these solutions do not support real-time analytics. Real-time analytics requires technology that can handle data generated at high velocity and sent by many sources simultaneously in small payloads. The data must be processed sequentially and incrementally, either on a record-by-record basis or over sliding time windows. This article surveys the popular storage options and processing frameworks available for time-sensitive analytics.

Introduction

Information derived from real-time analysis gives companies visibility into many aspects of their business and customer activity. It is reported that a 10 percent boost in data storage and management can result in a return of $65 million in revenue for Fortune 1000 companies (Ring, 2016). However, the problem with disk-based storage devices and many NoSQL databases is that they do not handle low-latency data access well, and so they cannot deliver the IOPS (input/output operations per second) required to feed data to analytics tools.

Data Storage

The technologies available for storing data for real-time analytics can be considered from two perspectives: the logical data model and the physical storage.

Following are some of the models available for real-time analytics:

Wide Column Stores hold data in records with dynamic columns. The column names and keys are not fixed, so the store is effectively schema-free. An example of a wide column store suitable for real-time analytics is Apache Cassandra, a distributed NoSQL database management system (DBMS). It is scalable and constantly available, handling all types of data arriving from many data centres, and it supports simple transactions with constant uptime in decentralised, horizontally scaled-out deployments (Datastax Academy, 2017).
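
The wide column idea can be illustrated with a minimal in-process sketch (this is not Cassandra's API; the row keys and column names are invented for illustration): a row key maps to a dynamic set of named columns, so different rows can carry different columns with no fixed schema.

```python
from collections import defaultdict

# Sketch of the wide column model: row key -> {column name: value}.
# Columns are created on write; no schema fixes them in advance.
table = defaultdict(dict)

table["user:1"]["name"] = "Ada"
table["user:1"]["city"] = "London"
table["user:2"]["name"] = "Grace"
table["user:2"]["last_login"] = "2018-03-19"   # column absent from user:1

# Rows are retrieved by key; the columns present vary per row.
assert set(table["user:1"]) == {"name", "city"}
assert set(table["user:2"]) == {"name", "last_login"}
```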

Relational DBMSs support the table-oriented, or relational, model. A relation consists of a set of uniform entities; an entity is a row in the table and holds a value for each attribute. The table name together with a fixed number of attributes (columns) of fixed data types constitutes the schema (DB_engines, 2018). A few examples are Apache Hive, MariaDB and MemSQL (Estarda, 2018): Hive is built on Hadoop, MariaDB supports replication clusters, and MemSQL is a distributed in-memory database system.

Key-value Stores hold pairs of keys and values and retrieve a value when its key is known. This simplicity is sufficient even for demanding applications such as high-performance in-process databases. An example is Redis (Estarda, 2018), an in-memory data structure store with built-in replication and automatic partitioning for distributed deployments (Redis, 2018).
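
A minimal sketch of the key-value model (Redis-like set/get semantics modelled in plain Python, not the actual Redis client API; the key names are invented) shows why it is so fast: values are opaque and the only access path is by key.

```python
# In-process sketch of a key-value store: opaque values, lookup by key only.
store = {}

def set_value(key, value):
    """Store a value under a key, overwriting any previous value."""
    store[key] = value

def get_value(key, default=None):
    """Retrieve the value for a key, or a default when the key is absent."""
    return store.get(key, default)

set_value("session:42", {"user": "ada", "ttl": 300})
assert get_value("session:42")["user"] == "ada"
assert get_value("missing") is None
```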

Time Series DBMSs are specially optimised for time series data, where each high-volume transaction is tied to a timestamp. An example is Riak (Estarda, 2018), a distributed, highly reliable and scalable NoSQL database optimised for the Internet of Things (IoT).

Document Stores are schema-free systems in which the types of the values of individual columns can differ for each record, or a value can be an array or a nested structure. They use internal notations such as JSON (JavaScript Object Notation). A few examples are Couchbase (a multi-model NoSQL database with a shared-nothing architecture, optimised for interactive applications (Couchbase, 2018)), Amazon DynamoDB (a proprietary, fully managed NoSQL document-oriented database) and MongoDB (a cross-platform NoSQL database that stores JSON-like documents with flexible schemas (MongoDB, 2018)).
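
The document model can be sketched directly with JSON (this is generic illustration, not any one product's API; the field names are invented): each record is a self-describing document, so two documents in the same collection can differ in structure.

```python
import json

# Two documents in one "collection"; their structures differ freely.
collection = [
    json.loads('{"_id": 1, "name": "Ada", "tags": ["math", "computing"]}'),
    json.loads('{"_id": 2, "name": "Grace", "address": {"city": "Arlington"}}'),
]

# A query must cope with fields that may or may not exist on a document.
with_tags = [doc["name"] for doc in collection if "tags" in doc]
assert with_tags == ["Ada"]
```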

Graph DBMSs represent data as graph structures consisting of nodes, edges and the relationships between nodes. Usually these do not offer indexes on all nodes. An example is Neo4j, an ACID (atomicity, consistency, isolation, durability) compliant transactional database (Neo4J, 2018).

Following are the physical storage solutions available for real-time analytics:

Direct-Attached Storage (DAS) consists of individual disk drives, external to the server, that are directly attached through SCSI (Small Computer System Interface), SATA (Serial Advanced Technology Attachment) or SAS (Serial Attached SCSI) interfaces. It can provide high performance because reads and writes do not cross the network (DB_engines, 2018).

Externalized HDFS (Hadoop Distributed File System), exposed as an API or protocol, allows processing clusters to be built from compute nodes while storage is provided by SAN (storage area network) or NAS (network-attached storage) devices. NAS is a file storage device that provides access to data over a local area network through an Ethernet connection and can serve the same data to multiple clients. A SAN is a high-speed network that interconnects and presents pools of storage to multiple servers; a host sends block-based access requests to the storage device. Data access via NFS (Network File System) or HDFS can scale out to meet capacity or increased compute requirements and uses a parallel file system distributed across many storage nodes. Example products are DDN and EMC Isilon (Rubens, 2016).

Software-defined storage (SDS) solutions such as EMC's ViPR software and DriveScale allow data storage resources and functionality to be managed in software, without dependencies on the physical hardware, and to scale across a cluster on the fly. SDS is also driving hyperconverged infrastructure (which bundles storage with compute and networking), such as the hScaler appliance from DDN (Rubens, 2016).

Virtualization and Containerization

Hyperconverged infrastructure enables virtualised environments by spinning up a compute node or cluster for each user or each application, while keeping the nodes running on the same physical hardware and implementing QoS (quality of service) for shared data. This helps Hadoop keep latency minimal when traffic between nodes stays local (DB_engines, 2018).

Containers offer isolated run-time environments. They are lightweight, use fewer resources and are quicker to spin up. Unlike virtual machines, they are immutable and stateless, and they are easy to create, kill and restart as long as storage is not part of the container, an approach supported by tools such as Flocker. Another option is to move container setup and management to the cloud while keeping storage containers on-premise, as with Galactic Exchange (Rubens, 2016).

An All-Flash Array (AFA), or Solid-State Array (SSA), is storage infrastructure built entirely from flash memory, with no moving parts. Flash is non-volatile, and solid-state drives (SSDs) transfer data faster than other drives (Rubens, 2016).

Object Storage is a method of manipulating storage as discrete units. These units, or objects, are kept in a single flat repository rather than nested; each object is given a unique identifier, and the data and its location are indexed (Adshead, 2013).

Different solutions have different capabilities, and no out-of-the-box configuration fits every workload. It is important to stay flexible and experiment to find the technology that best suits the use case.

Processing Streaming Data

Processing streaming data requires a model or framework that handles each individual item of the "unbounded" (ever-growing, nearly unlimited) dataset as it passes through the system, maintaining minimal state between records. Two layers are needed: storage and processing (Akidau, 2015). The storage layer maintains consistency and record order to provide inexpensive, fast streams of data. The processing layer consumes data from the storage layer, processes it, and then notifies the storage layer to delete data that has already been processed.
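
The record-by-record, sliding-window style of processing described above can be sketched in a few lines (a generic illustration, not any framework's API; the sensor readings are invented): the only state kept between records is the contents of the window itself.

```python
from collections import deque

def sliding_window_averages(stream, window_size):
    """Handle each record as it arrives, keeping only the current window as state."""
    window = deque(maxlen=window_size)   # minimal state between records
    for record in stream:                # sequential, incremental processing
        window.append(record)            # oldest item drops out automatically
        yield sum(window) / len(window)

readings = [10, 20, 30, 40]
averages = list(sliding_window_averages(readings, window_size=2))
assert averages == [10.0, 15.0, 25.0, 35.0]
```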

Stream Processing Systems

Apache Storm is a distributed real-time computation system that can handle large quantities of data with minimal delay and concentrates on extremely low latency. It has many applications, such as real-time analytics, machine learning and continuous computation. It is highly scalable and has been benchmarked at processing over one million tuples per second per node. It provides processing guarantees by making sure each message is processed at least once, and it is often used when processing time directly affects the user's experience (Elingwood, 2016). It runs on top of Hadoop YARN and can be used with Flume to store data on HDFS (Shahrivari and Jalili, 2014). However, its throughput is comparatively low, and it guarantees neither that messages will be processed in order nor that duplicates will be avoided after certain failures (Zapletal Petr, 2016).

Trident is an abstraction built on top of Storm that trades higher latency for stateful processing. It is a higher-level micro-batch system rather than an item-by-item pure streaming engine, and it adds operations such as windowing, state management and aggregations that are not normally found in Storm. This gives users greater flexibility to shape the tool to their intended use. It offers exactly-once delivery and ordering between batches, though not within them, in contrast to Storm's at-least-once guarantee. It is often used when a system cannot tolerate duplicate messages and needs to maintain state between items (Elingwood, 2016).

Apache Samza is a distributed stream processing framework that offers a high-level abstraction for storing state through a fault-tolerant checkpointing system implemented as a local key-value store. It deals with immutable streams and is partitioned at every level. It is built on Apache Kafka and YARN (Yet Another Resource Negotiator), using Kafka's architecture to provide ordered, replayable, buffered, fault-tolerant streams with state storage. It provides an at-least-once delivery guarantee while eliminating the problems of back-pressure and data loss. Data is held for long periods, so components can process it at their own pace and can be restarted without penalty. Intermediate results are written back to Kafka and can be consumed by downstream stages independently. However, because data can be delivered more than once, it does not offer exact recovery of aggregated state after a failure. It uses YARN for resource negotiation, transparently migrating tasks from one machine to another (Zheng et al., 2015).

Flume is a distributed and reliable system for efficiently collecting, aggregating and moving sizeable amounts of event data such as network traffic data, social-media-generated data and email messages. It consists of multiple agents, each running in a separate Java Virtual Machine (JVM). An agent consists of three parts:

  • Source collects incoming data as events
  • Sink writes events out
  • Channel connects the source and sink. It can be file-based or memory-based, and it functions as the data buffer in case of failure or shutdown (Elingwood, 2016)
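
The three parts above can be sketched in-process (an illustrative model, not Flume's actual API; the event names are invented): the source feeds events into a memory-based channel, and the sink drains them.

```python
from queue import Queue

channel = Queue()            # memory-based channel acting as the buffer

def source(events):
    """Source: collects incoming data as events and puts them on the channel."""
    for event in events:
        channel.put(event)

def sink():
    """Sink: drains buffered events from the channel and writes them out."""
    drained = []
    while not channel.empty():
        drained.append(channel.get())
    return drained

source(["login", "click", "logout"])
assert sink() == ["login", "click", "logout"]   # FIFO order is preserved
```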

Batch Processing

Batch processing is when a set of data is collected over time and fed into an analytics system at a scheduled time or on an as-needed basis; a typical batch job is a program that reads a large file and generates a report. Batch systems are used for queries over whole datasets, or when legacy systems are unable to deliver data in streams; an example is the data generated by mainframes. The dataset is normally "bounded", that is, it represents a finite collection of data. Batch processing is also useful when the analytics are not time-sensitive and processing large volumes of data matters more than fast results.

Apache Hadoop

Apache Hadoop is a batch processing framework that reimplemented the algorithms and component stack to make large-scale batch processing more accessible. A job involves reading from and writing to disk multiple times. Hadoop has the following components, or layers, that work together to process data:

HDFS is the distributed filesystem layer. It is used as the source of data and to store intermediate and final calculated results, coordinates storage and replication across the cluster nodes, and ensures that data remains available despite host failures.

YARN is the cluster-coordinating component, responsible for managing the underlying resources and scheduling the jobs to be run. It acts as an interface to the cluster resources and makes it possible to run diverse workloads.

MapReduce is the native batch processing engine, built around mapping, shuffling and reducing operations on key-value pairs. The dataset is read from HDFS and divided into chunks, which are distributed among the available nodes. Each node applies the computation to its chunk and writes intermediate results to HDFS, after which the results are redistributed and grouped by key (the shuffle). The values for each key are then reduced by summarising and combining the results calculated by the individual nodes, and the final results are written back to HDFS.

Hadoop is normally used when massive datasets must be processed on inexpensive hardware. It is highly scalable because it does not attempt to keep everything in memory, and it integrates with other software that can reuse HDFS and the YARN resource manager. However, it can be slow and has a steep learning curve, which works against standing up a cluster quickly.

Hybrid Processing Systems: Batch and Stream Processors

Some frameworks manage both batch and stream workloads, for both structured and unstructured data, by exposing related or similar components and APIs. These frameworks offer general-purpose data processing rather than targeting one specific use case, and they ship with their own integrations, libraries and tools for tasks such as machine learning, graph analysis and interactive querying (Shahrivari and Jalili, 2014).

Apache Spark is a batch processing framework with stream processing capabilities. It focuses on speeding up workloads through an in-memory computation strategy, advanced DAG (directed acyclic graph) scheduling and processing optimisation. A single cluster can handle multiple processing styles. It is multipurpose and can be deployed either as a standalone cluster or integrated with an existing Hadoop cluster as an alternative to the MapReduce engine.

However, it can cost more than a disk-based system because its in-memory design relies on RAM. Resource scarcity can also arise when Spark is deployed on shared clusters: it uses more resources than Hadoop, and its tasks can interfere with other workloads. Spark Streaming is also not suitable for very low-latency processing (Elingwood, 2016).

Batch Processing Model

Spark uses the storage layer at the start of a job to load the data and at the end to save the results. In between, it handles the data in memory only, using a model called the Resilient Distributed Dataset (RDD). RDDs are immutable collections of data; operations on an RDD produce new RDDs. This creates a hierarchy in which each RDD can trace its lineage back to the storage layer, which is how Spark maintains fault tolerance without writing back to disk after every operation. Spark also optimises execution by analysing the complete set of tasks as a DAG, which captures the operations on the datasets and the relationships between them. Adapting this batch model for streaming involves buffering data as it enters the system; buffering lets the system handle a high volume of incoming data, with increased throughput but also increased latency.
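
The RDD idea of immutable datasets with lazy, lineage-tracked transformations can be sketched in plain Python (this is an illustrative model, not Spark's actual API or class names): nothing is computed until an action such as `collect` is called, and each dataset remembers the parent that produced it.

```python
class SketchRDD:
    """Toy RDD-like dataset: immutable, lazy, with a lineage pointer."""

    def __init__(self, compute, parent=None):
        self._compute = compute      # function producing this dataset's items
        self.parent = parent         # lineage: the RDD this one was derived from

    @staticmethod
    def from_list(items):
        data = list(items)
        return SketchRDD(lambda: iter(data))

    def map(self, fn):
        # Returns a NEW dataset; the original is never modified.
        return SketchRDD(lambda: (fn(x) for x in self._compute()), parent=self)

    def filter(self, pred):
        return SketchRDD(lambda: (x for x in self._compute() if pred(x)), parent=self)

    def collect(self):
        # Action: only now is the chain of transformations evaluated.
        return list(self._compute())

nums = SketchRDD.from_list([1, 2, 3, 4, 5])
result = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 1).collect()
assert result == [1, 9, 25]
```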

Stream Processing Model

Spark is designed around batch-oriented workloads. For stream processing, it buffers incoming data into micro-batches at sub-second increments. This produces small, fixed datasets that fit the native semantics of a batch processing engine.
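
Micro-batching can be sketched as slicing an unbounded stream into small fixed-size batches and handing each one to a batch-style function (a generic illustration, not Spark Streaming's API; the batch size and event values are invented):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Slice an (unbounded) stream into small fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        yield batch

# A "batch job" applied to each micro-batch: here, summing its events.
stream = range(10)                       # stand-in for an unbounded event stream
results = [sum(b) for b in micro_batches(stream, 3)]
assert results == [3, 12, 21, 9]         # batches [0,1,2], [3,4,5], [6,7,8], [9]
```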

Impala is a real-time analytics system for Hadoop: a scalable, interactive, ad-hoc query system for data analytics that aims to provide the fastest time-to-insight. It provides a SQL-like engine to execute queries against data in HDFS and HBase. Its daemons run on all nodes and process data in memory, so it does not offer fault tolerance. It uses Hive's metastore to save metadata and provides a unified platform for both batch-oriented and real-time queries. Unlike Hive and MapReduce, it is normally used for executing SQL queries on large datasets for interactive analytics.

Apache Beam runs on any supported execution engine for both batch and streaming data processing, offering the choice to swap real-time compute engines in and out. It includes abstractions for streams, sources, data pipelines, transformations and targets.

Apache Flink is a stream-first framework that also handles batch tasks. The stream-first approach is also called the Kappa, or fast data, architecture. It is a simplification of the Lambda architecture in which the batch processing system is removed and data is simply fed through the streaming system.

Lambda Architecture is a hybrid model that uses a batch layer (MapReduce on Hadoop) for large-scale analytics over historical data, a speed layer (Storm or a similar stream processing engine) for low-latency processing of newly arrived data (often with approximate results), and a serving layer (Cassandra or a similar NoSQL database) that provides a query/view capability unifying the batch and speed layers. As soon as data flows into the system it is processed by the speed layer and made available for queries by the serving layer; any errors are corrected later by batch processing. The architecture suits applications requiring high throughput, low latency, fault tolerance and data accuracy, and it avoids a separate ETL step. However, such systems are very complex, less responsive, and expensive to operate and maintain; keeping three systems running is a substantial task for developers and administrators.
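
The serving layer's core job can be sketched as merging a (possibly stale) batch view with the speed layer's incremental results for data that arrived since the last batch run (an illustrative model with invented page-view counts, not any specific product's API):

```python
# Precomputed by the batch layer over historical data.
batch_view = {"page_a": 100, "page_b": 40}

# Approximate counts from the speed layer, covering data since the last batch run.
speed_view = {"page_a": 3, "page_c": 1}

def serve(key):
    """Serving layer: unify the batch and speed views for a query."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

assert serve("page_a") == 103   # historical + fresh
assert serve("page_c") == 1     # only seen since the last batch run
```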

Therefore, the Lambda architecture uses batching as the primary processing method, with streams offering early but unrefined results, while the Kappa architecture handles everything as streams, treating batches simply as data streams with finite boundaries. Batch processing thus becomes a subset of stream processing (Akidau, 2015).

Flink offers high throughput, low latency and true entry-by-entry processing. For performance reasons it manages its own memory rather than relying on the native Java garbage collection mechanism. It does not need manual adjustment when the characteristics of the data change, and it handles caching and data partitioning automatically. It analyses its work and optimises tasks by mapping out the most effective way to fulfil them, parallelising stages where possible while bringing data together for blocking stages, and performing computation on the nodes where the data is stored. It can be used within the Hadoop stack and fits well with HDFS, YARN and Kafka. It maintains stateful applications and timestamps events, so applications can be replayed or rolled back, and it uses the same runtime for both processing models (Davies, 2017).

Batch Processing Model

The model reads a "bounded" dataset from the storage layer as a stream, with optimisations applied: for example, snapshotting is dropped for batch loads because the data is backed by persistent storage. As a result, the computation completes faster.

Stream Processing Model

The model handles incoming data on an item-by-item basis while taking snapshots at set points to use for recovery in case of failure. It also captures the exact time an event occurred, enabling guaranteed ordering and grouping. Its basic components are sources, the entry points through which streams (immutable, unbounded datasets) enter the system; operations applied to streams to produce new streams; and sinks (a database or connector), through which streams leave the system (Zapletal Petr, 2016).

Conclusion

There are several choices available to support real-time data analytics within a big data system. Apache Cassandra is one of the most balanced choices for handling real-time data alongside other database solutions. On the physical and virtual side, DAS, SAN, SDS, virtualisation, cloud and containers are some of the options to choose from.

Storm, Trident, Samza and Flume are good for stream-only workloads. Storm offers very low latency but may deliver duplicates and does not guarantee ordering. Trident offers higher latency than Storm but adds state processing. Samza offers flexibility, replication, state management and multi-team usage, while Flume is a reliable distributed system for event management. Spark and Flink provide both batch and stream processing. Spark has wide community support, libraries and tooling; Flink, on the other hand, is heavily optimised for true stream processing with batch processing support.

Therefore, the best option for storing and processing data depends on the organisation's needs, the state of the data to be processed, time sensitivity and the results required. New, innovative and collaborative solutions also come to the market every day.

References

Adshead, A. (2013) Big data storage: Defining big data and the type of storage it needs. Available at: http://www.computerweekly.com/podcast/Big-data-storage-Defining-big-data-and-the-type-of-storage-it-needs (Accessed: 19 March 2018).

Akidau, T. (2015) The world beyond batch: Streaming 101 – O’Reilly Media, Oreilly. Available at: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 (Accessed: 2 March 2018).

Couchbase (2018) NoSQL Engagement Database | Couchbase. Available at: https://www.couchbase.com/ (Accessed: 19 March 2018).

Datastax Academy (2017) A Brief Introduction to Apache Cassandra | DataStax Academy: Free Cassandra Tutorials and Training, Datastax Academy. Available at: https://academy.datastax.com/resources/brief-introduction-apache-cassandra (Accessed: 3 March 2018).

Davies, B. (2017) 5 Best Data Processing Frameworks, Knowledgehut. Available at: https://www.knowledgehut.com/blog/information-technology/5-best-data-processing-frameworks (Accessed: 2 March 2018).

DB_engines (2018) Document Stores – DB-Engines Encyclopedia. Available at: https://db-engines.com/en/article/Document+Stores (Accessed: 19 March 2018).

Elingwood, J. (2016) Hadoop, Storm, Samza, Spark, and Flink: Big Data Frameworks Compared | DigitalOcean, Digital Ocean. Available at: https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared (Accessed: 2 March 2018).

Estarda, R. (2018) From big data to fast data – O’Reilly Media, Oreilly. Available at: https://www.oreilly.com/ideas/from-big-data-to-fast-data (Accessed: 19 March 2018).

MongoDB (2018) What Is MongoDB? | MongoDB. Available at: https://www.mongodb.com/what-is-mongodb (Accessed: 19 March 2018).

Neo4J (2018) The Neo4j Graph Platform – The #1 Platform for Connected Data. Available at: https://neo4j.com/ (Accessed: 19 March 2018).

Redis (2018) Redis, Redis. Available at: https://redis.io/ (Accessed: 19 March 2018).

Ring, J. (2016) How object storage manages big data in the IoT | ITProPortal. Available at: https://www.itproportal.com/2016/08/19/how-object-storage-manages-big-data-in-the-iot/ (Accessed: 19 March 2018).

Rubens, P. (2016) Big Data Storage Solutions: Options Abound, Infostr. Available at: http://www.infostor.com/storage-management/big-data-storage-solutions-options-abound.html (Accessed: 2 March 2018).

Shahrivari, S. and Jalili, S. (2014) ‘Beyond Batch Processing: Towards Real-Time and Streaming Big Data’, Arxiv. Available at: https://arxiv.org/ftp/arxiv/papers/1403/1403.3375.pdf (Accessed: 2 March 2018).

Zapletal Petr (2016) Comparison of Apache Stream Processing Frameworks: Part 1, CakeSolutions. Available at: https://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1 (Accessed: 2 March 2018).

Zheng, Z. et al. (2015) ‘Real-Time Big Data Processing Framework: Challenges and Solutions’, Appl. Math. Inf. Sci, 9(6), pp. 3169–3190. doi: 10.12785/amis/090646.
