Netflix spent $16 billion on content production in 2020. In January 2021, the Netflix mobile app (iOS and Android) was downloaded 19 million times and a month later, the company announced that it had hit 203.66 million subscribers worldwide. It’s safe to assume that the scale of data the company collects and processes is massive. The question is:
How does Netflix process billions of data records and events to make critical business decisions?
With an annual content budget of $16 billion, decision-makers at Netflix aren’t going to make content-related decisions based on intuition. Instead, their content curators use cutting-edge technology to make sense of massive amounts of data on subscriber behavior, user content preferences, content production costs, types of content that work, and more.
Netflix users spend an average of 3.2 hours a day on their platform and are constantly fed with the latest recommendations by Netflix’s proprietary recommendation engine. This ensures that subscriber churn is low and entices new subscribers to sign up. Data-driven content delivery is at the front and center of this.
So, what lies under the hood from a data processing perspective?
In other words, how did Netflix build a technology backbone that enabled data-driven decision-making at such a massive scale? How does one make sense of the user behavior of 203 million subscribers?
Netflix uses what it calls the Keystone Data Pipeline. In 2016, this pipeline was processing 500 billion events per day. These events included error logs, user viewing activities, UI activities, troubleshooting events and many other valuable data sets.
According to Netflix, as published in its tech blog:
The Keystone pipeline is a unified event publishing, collection, and routing infrastructure for both batch and stream processing.
Kafka clusters are a core part of the Keystone Data Pipeline at Netflix. In 2016, the Netflix pipeline used 36 Kafka clusters to process billions of messages per day.
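To put the 2016 figure in perspective, a quick back-of-the-envelope calculation converts 500 billion events per day into an average per-second rate. This is only a rough average: it assumes events arrive evenly through the day, which real traffic does not, and the function name is just for illustration.

```python
# Rough average rate implied by the 2016 figure quoted above.
EVENTS_PER_DAY = 500_000_000_000  # 500 billion events per day
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds


def average_events_per_second(events_per_day: int) -> float:
    """Average rate, assuming events are spread evenly across the day."""
    return events_per_day / SECONDS_PER_DAY


rate = average_events_per_second(EVENTS_PER_DAY)
print(f"{rate:,.0f} events per second on average")
```

That works out to roughly 5.8 million events per second on average, with peaks presumably higher.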
So, what is Apache Kafka? And, why has it become so popular?
Apache Kafka is an open-source streaming platform that enables the development of applications that ingest a high volume of real-time data. It was originally built by the geniuses at LinkedIn and is now used at Netflix, Pinterest and Airbnb to name a few.
Kafka specifically does four things:
- It enables applications to publish or subscribe to data or event streams
- It stores data records durably and is highly fault-tolerant
- It is capable of real-time, high-volume data processing
- It can take in and process trillions of data records per day when the cluster is sized appropriately
Software development teams are able to leverage Kafka’s capabilities with the following APIs:
- Producer API: This API enables a microservice or application to publish a data stream to a particular Kafka Topic. A Kafka topic is a log that stores data and event records in the order in which they occurred.
- Consumer API: This API allows an application to subscribe to data streams from a Kafka topic. Using the consumer API, applications can ingest and process the data stream, which will serve as input to the specified application.
- Streams API: This API is critical for sophisticated data and event streaming applications. It consumes data streams from one or more Kafka topics and processes or transforms them as needed. The resulting stream is then published to another Kafka topic for downstream use, or written back to transform an existing topic.
- Connector API: Modern applications constantly need to reuse producers or consumers and to integrate data sources with a Kafka cluster automatically. Kafka Connect removes the need to write that integration code by hand by providing reusable connectors that link Kafka to external systems.
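The roles of the Producer, Consumer, and Streams APIs described above can be illustrated with a toy, in-memory model. This is a minimal sketch and not the Kafka client API: the `Topic`, `Producer`, `Consumer`, and `stream_transform` names, as well as the topic names, are hypothetical and exist only to show the append-only log, per-consumer offsets, and the consume-transform-produce pattern.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Topic:
    """Toy stand-in for a Kafka topic: an append-only, ordered log."""
    name: str
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Store a record and return its offset (position in the log)."""
        self.records.append(record)
        return len(self.records) - 1


class Producer:
    """Publishes records to a topic (the role of the Producer API)."""

    def send(self, topic: Topic, record: Any) -> int:
        return topic.append(record)


class Consumer:
    """Reads a topic in order and tracks its own offset (Consumer API role)."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offset = 0  # index of the next record to read

    def poll(self) -> List[Any]:
        """Return all records not yet seen and advance the offset."""
        batch = self.topic.records[self.offset:]
        self.offset += len(batch)
        return batch


def stream_transform(source: Topic, sink: Topic, fn: Callable[[Any], Any]) -> None:
    """Consume a topic, transform each record, and publish to another topic."""
    consumer, out = Consumer(source), Producer()
    for record in consumer.poll():
        out.send(sink, fn(record))


views = Topic("viewing-events")        # hypothetical topic name
titles = Topic("viewed-titles-upper")  # hypothetical topic name
producer = Producer()
producer.send(views, {"user": "u1", "title": "dark"})
producer.send(views, {"user": "u2", "title": "ozark"})

stream_transform(views, titles, lambda r: r["title"].upper())
print(Consumer(titles).poll())  # ['DARK', 'OZARK']
```

Because each consumer keeps its own offset, several applications can read the same topic independently without affecting one another or the producer.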
Key Benefits of Kafka
According to the Kafka website, 80% of all Fortune 100 companies use Kafka. One of the biggest reasons for this is that it fits in well with mission-critical applications.
Major companies are using Kafka for the following reasons:
- It allows the decoupling of data streams and systems with ease
- It is designed to be distributed, resilient and fault-tolerant
- The horizontal scalability of Kafka is one of its biggest advantages. A cluster can scale to hundreds of brokers and millions of messages per second
- It enables high-performance real-time data streaming, a critical need in large scale, data-driven applications
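One common way this kind of horizontal scaling is achieved is by splitting a topic into partitions that can be spread across brokers, with each record's key deciding which partition it lands on. The sketch below illustrates key-based partitioning only. It assumes CRC32 as a stand-in hash (Kafka's own default partitioner uses a different hash function, murmur2), and the function names and record keys are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition.

    Assumption: CRC32 stands in for the hash here; Kafka's default
    partitioner uses murmur2 instead.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(keys: List[str], num_partitions: int) -> Dict[int, List[str]]:
    """Group record keys by the partition they would land on."""
    partitions: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        partitions[choose_partition(key, num_partitions)].append(key)
    return dict(partitions)


user_ids = [f"user-{i}" for i in range(1000)]  # hypothetical record keys
for count in (3, 6):
    layout = distribute(user_ids, count)
    sizes = [len(layout.get(p, [])) for p in range(count)]
    print(count, "partitions ->", sizes)
```

The same key always maps to the same partition for a fixed partition count, so records for one user stay in order while the overall load is shared out.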
Ways Kafka is used to optimise data processing
Kafka is being used across industries for a variety of purposes, including but not limited to the following:
- Real-time Data Processing: In addition to its use in technology companies, Kafka is an integral part of real-time data processing in the manufacturing industry, where high-volume data comes from a large number of IoT devices and sensors
- Website Monitoring At Scale: Kafka is used for tracking user behavior and site activity in high-traffic websites. It helps with real-time monitoring, processing, connecting with Hadoop, and offline data warehousing
- Tracking Key Metrics: As Kafka can be used to aggregate data from different applications to a centralized feed, it facilitates the monitoring of high-volume operational data
- Log Aggregation: It allows logs from multiple sources to be aggregated into a single stream that many distributed consumers can read
- Messaging system: It automates large-scale message processing applications
- Stream Processing: Raw data is consumed from Kafka topics at various stages of a processing pipeline, then aggregated, enriched, or otherwise transformed and published to new topics for further consumption or processing
- De-coupling system dependencies
- Integrations with Spark, Flink, Storm, Hadoop, and other Big Data technologies
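The stream processing use case in the list above (enrich, then aggregate) can be sketched in a few lines of plain Python. This is an illustration of the idea rather than Kafka Streams code: the events, the lookup table, and the `enrich` and `count_by` helpers are all hypothetical, and in a real pipeline the input and output would be Kafka topics instead of in-memory lists.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical raw page-view events, as they might arrive on an input topic.
raw_events: List[Dict[str, str]] = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/pricing"},
    {"user": "u1", "page": "/pricing"},
]

# Hypothetical lookup table used to enrich each event.
user_country: Dict[str, str] = {"u1": "UK", "u2": "IN"}


def enrich(events: List[Dict[str, str]], lookup: Dict[str, str]) -> List[Dict[str, str]]:
    """Add a country field to each event; unknown users get 'unknown'."""
    return [{**e, "country": lookup.get(e["user"], "unknown")} for e in events]


def count_by(events: List[Dict[str, str]], field: str) -> Dict[str, int]:
    """Aggregate: count events per distinct value of the given field."""
    return dict(Counter(e[field] for e in events))


enriched = enrich(raw_events, user_country)  # would be published to a new topic
views_per_page = count_by(enriched, "page")  # aggregate for downstream use
print(views_per_page)  # {'/home': 1, '/pricing': 2}
```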
Companies that use Kafka to process data
As a result of its versatility and functionality, Kafka is used by some of the world’s fastest-growing technology companies for various purposes:
- Uber – Gathers user, taxi, and trip data in real time to forecast demand and compute surge pricing
- LinkedIn – Prevents spam and collects user interactions to make better connection recommendations in real-time
- Twitter – Part of its Storm stream processing infrastructure
- Spotify – Part of its log delivery system
- Pinterest – Part of its log collection pipeline
- Airbnb – Event pipeline, exception tracking, etc.
- Cisco – For OpenSOC (Security Operations Center)
Merit Group’s Expertise in Kafka
At Merit Group, we work with some of the world’s leading B2B intelligence companies like Wilmington, Dow Jones, Glenigan, and Haymarket. Our data and engineering teams work closely with our clients to build data products and business intelligence tools. Our work directly impacts business growth by helping our clients to identify high-growth opportunities.
Our specific services include high-volume data collection, data transformation using AI and ML, web watching, and customized application development.
Our team also brings to the table deep expertise in building real-time data streaming and data processing applications. Our expertise in Kafka is especially useful in this context.