Big Data at Flipkart

The Flipkart Data Platform (FDP) is a service-oriented architecture capable of processing both batch and streaming data. The platform comprises various microservices that improve the user experience through efficient product listings and price optimization, and it maintains several types of data stores – Redis, HBase, SQL, etc. FDP stores around 35 petabytes of data and manages more than 800 Hadoop nodes. This is just a brief overview of how Big Data is helping Flipkart; below is a detailed explanation of the Flipkart data platform architecture that will help you understand the process better.

The Architecture of Flipkart Data Platform

To understand how Flipkart uses Big Data, you first need to understand the flow of data through Flipkart’s data platform architecture, which is explained in the flow chart below.



How Is Big Data Helping Flipkart?

Let’s take a tour of the complete process of how Flipkart works with Big Data, starting with the FDP ingestion system –

1. FDP Ingestion System

A Big Data ingestion system is the first place where all the variables begin their journey into the data system. Ingestion is the process of importing data and storing it in a database, either as batches or as real-time streams. Simply speaking, a batch is a collection of data points grouped within a specific time interval, whereas streaming deals with a continuous flow of data. Batch data therefore has far greater latency than streaming data, which is processed in sub-seconds. Ingestion can be performed in three ways –

  • Specter – a Java library used to send payloads to Kafka.
  • Dart Service – a REST service that allows payloads to be sent over HTTP.
  • File Ingestor – a CLI tool used to dump data into HDFS.

The user then creates a schema, for which a corresponding Kafka topic is created. Using Specter, data is ingested into the FDP, and payloads landing in HDFS are stored as Hive tables.
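The schema step above can be sketched in a few lines of Python. Specter’s real API and the FDP schema format are not public, so the schema, field names, and helper functions here are purely illustrative – the point is that a payload is validated against a user-defined schema before being serialized into the bytes a Kafka producer would send.

```python
import json

# Hypothetical event schema: field name -> expected Python type.
# The real FDP schema format is not public; this is only an illustration.
ORDER_SCHEMA = {"order_id": str, "amount": float, "timestamp": int}

def validate_payload(payload: dict, schema: dict) -> bool:
    """Check that the payload has exactly the schema's fields and types."""
    if set(payload) != set(schema):
        return False
    return all(isinstance(payload[k], t) for k, t in schema.items())

def to_kafka_record(payload: dict, schema: dict) -> bytes:
    """Serialize a validated payload into the bytes a Kafka producer sends."""
    if not validate_payload(payload, schema):
        raise ValueError("payload does not match schema")
    return json.dumps(payload).encode("utf-8")

record = to_kafka_record(
    {"order_id": "OD-1001", "amount": 499.0, "timestamp": 1700000000},
    ORDER_SCHEMA,
)
```

In a real deployment the resulting bytes would be handed to a Kafka producer keyed by topic; validating at the edge like this keeps malformed events out of the downstream Hive tables.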

2. Batch Compute

This part of the big data ecosystem computes and processes data that arrives in batches. Batch compute is an efficient method for processing large-scale data, such as transactions collected over a period of time. These batches can be computed at the end of the day, when the data has accumulated in large volumes, so that it is processed only once.
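A minimal sketch of the end-of-day idea: transactions accumulate all day, then a single pass aggregates them. Production systems would run this on Hadoop or Spark over HDFS data; the sample transactions and field names below are invented for illustration.

```python
from collections import defaultdict

# Illustrative transactions collected over one day: (product_id, amount, unix ts).
transactions = [
    ("P1", 250.0, 1700000000),
    ("P2", 120.0, 1700003600),
    ("P1", 250.0, 1700007200),
]

def batch_revenue_by_product(txns):
    """End-of-day batch job: total revenue per product in one pass."""
    totals = defaultdict(float)
    for product_id, amount, _ts in txns:
        totals[product_id] += amount
    return dict(totals)

report = batch_revenue_by_product(transactions)
# report == {"P1": 500.0, "P2": 120.0}
```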

3. Streaming Platform

Streaming platforms process data that is generated in sub-seconds. Apache Flink is one of the most popular real-time streaming platforms used to produce fast-paced analytical results. It provides distributed, fault-tolerant, and scalable data streaming capabilities that industries can use to process millions of transactions at a time with very low latency.
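One of the core operations in engines like Flink is windowing: assigning each event to a fixed-size time window and aggregating per window. The plain-Python sketch below shows the tumbling-window idea on a toy event list (it omits real streaming concerns such as watermarks and out-of-order events, and the event data is invented).

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign (timestamp, key) events to fixed-size tumbling windows and
    count occurrences per (window_start, key) - the core idea behind
    windowed aggregation in stream processors like Flink."""
    windows = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # floor to window boundary
        windows[(window_start, key)] += 1
    return dict(windows)

events = [(0, "click"), (3, "click"), (7, "buy"), (12, "click")]
counts = tumbling_window_counts(events, window_seconds=5)
# counts == {(0, "click"): 2, (5, "buy"): 1, (10, "click"): 1}
```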

4. Messaging Queue

A messaging queue acts as a buffer, or temporary storage for messages, when the destination is busy or not connected. A message can be a plain message, a byte array with headers, or a prompt that commands the messaging queue to process a task. The messaging queue architecture has two components – a producer and a consumer. The producer generates messages and delivers them to the queue; the consumer is the message’s end destination, where the message is processed.
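The producer/consumer pattern can be shown with Python’s in-process `queue.Queue`; real brokers such as Kafka or RabbitMQ follow the same shape across machines. The message contents and sentinel convention below are illustrative choices, not part of any specific broker’s API.

```python
import queue
import threading

message_queue = queue.Queue(maxsize=100)  # bounded buffer between the two sides
SENTINEL = None  # tells the consumer there are no more messages

def producer(messages):
    """Generate messages and deliver them to the queue."""
    for msg in messages:
        message_queue.put(msg)      # blocks if the buffer is full
    message_queue.put(SENTINEL)

def consumer(results):
    """Receive and process messages until the sentinel arrives."""
    while True:
        msg = message_queue.get()   # blocks until a message is available
        if msg is SENTINEL:
            break
        results.append(msg.upper())  # stand-in for real processing

results = []
p = threading.Thread(target=producer, args=(["order placed", "order shipped"],))
c = threading.Thread(target=consumer, args=(results,))
p.start(); c.start()
p.join(); c.join()
# results == ["ORDER PLACED", "ORDER SHIPPED"]
```

Because the queue buffers messages, the producer can keep working even while the consumer is busy, which is exactly the decoupling described above.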