Category Archives: Big Data

Architecture Overview to handle high volume data use cases

The architecture takes into account a real time IoT scenario.

Events are raised by Industrial Assets/Sensors on a periodic basis. These events will be intercepted using any standard MQTT broker [RabbitMQ, Apache Kafka, etc ] The Message Broker through its subscribing mechanism will ingest the streaming data in a Cassandra Database using Apache SPARK Stream. The Apache SPARK Streaming will ingest the data on a periodic basis the data coming from the assets/sensors. Here as a data store – Hbase can also be utilized instead of Cassandra.

Based on the frequency and the amount of data that these sensors emit, the volume of data ingested can be significantly high. To handle such huge volume of data, an archival system will be developed. Map Reduce Batch Jobs along with custom Java application will push to Hadoop’s File System (HDFS) from the Cassandra. These Map Reduce jobs will be scheduled on a periodic basis and various scheduling tools like Apache Oozie can be utilized. Since we are planning to develop in Cloudera environment, Cloudera Navigator can also be utilized in this regard.

A reporting system will be there to view the health and another Predictive information of the data coming from these industrial assets. The reporting system can query either the Cassandra data store or the HDFS to show the necessary information to the user. The reports queried from the Cassandra system will be of low latency ones while those queried from the HDFS will be of high latency ones.  Hive (HQL) / Pig / Map Reduce Jobs will be used to display the reports to the end users. For real time reporting Cassandra Query Language (CQL) can be utilized.

The entire workflow’s Security, Meta-Data Management will be handled by a Data Governance system. Tools like Apache ATLAS/ FALCON will be utilized to create this data governance system.

The cloud environment used here will be AWS.

The architecture diagram below provides more insights of the entire system:


Architecture Overview to handle low to medium volume data use cases

Data can be raised from various sources like an IoT Device, IP/Non IP enabled industrial Asset, Streaming sources (stock exchange/Social Media) or from any COTS [ ]  source. These streaming source of data can be intercepted by standard Stream Brokers like KAFKA, RabbitMQ or by Apache Flume. The Stream Broker systems will pass the control to a SPARK System for realtime  data processing. The Stream Broker System at the same time will help to store the structured/unstructured data in a Hadoop system. Different database systems like Apache Parquet, Apache Avro, S3 or a Blob store can be used for different use cases. The SPARK system can be utilized to run any complex analytics and store the rolled up/ aggregated data in the Hadoop system for analytical processing. The processed data can be saved in a data reservoir for faster data access.  

***** Apache TEZ can be used instead of the Apache SPARK system


The architecture diagram below provides more insights of the entire system: