How Apache Kafka takes streaming data mainstream

Image: iStockphoto/Pinkypills

One of the most exciting open source projects to emerge from the big data movement is Apache Kafka. Originally hatched at LinkedIn, Kafka is now an increasingly mainstream part of a broad open source development community. In fact, Kafka has reached a pivotal moment: it is being used as a central platform for managing streaming data in organizations, with use cases that include IoT operations, fraud detection and security in financial services, and store inventory tracking in retail, among others.

Kafka is one example of how LinkedIn became a poster child for shepherding internal code into vibrant open source communities.

Neha Narkhede, co-founder and CTO of Confluent, and former lead of streams infrastructure at LinkedIn, spoke with TechRepublic about enterprise adoption of Kafka and optimal ways to manage streaming data.

TechRepublic: How has Apache Kafka been going mainstream in enterprises?

Narkhede: According to a recent Kafka community survey, 68% of Kafka users plan to incorporate more stream processing over the next six to 12 months, and 65% of responding organizations plan to hire employees with Kafka skills in the next 12 months as the number of applications using Kafka continues to grow.

SEE: Apache Kafka is booming, but should you use it? (TechRepublic)

At the recent Kafka Summit, we heard from companies like Uber, Netflix, Dropbox, HomeAway, and Goldman Sachs, all of which are using Kafka to make business decisions in real time.

For example, Uber evolved its stream processing system to handle a number of use cases in Uber Marketplace, and Kafka played an important role in building a robust and efficient data pipeline. One of the best-known examples is surge pricing. Imagine gathering all the data needed to make that happen in real time, from rider demand to the number of cars on the road, and deciding minute by minute what the price should be.

This is a great example of a real time data pipeline in action.
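
To make that concrete, here is a minimal sketch of the producer side of such a pipeline, written against Kafka's standard Java client. The topic name, partition key, and JSON payload are illustrative assumptions, not details of Uber's actual system:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RideRequestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish one supply/demand snapshot; a real pipeline would emit these continuously.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by city routes every event for a market to the same partition,
            // so downstream pricing logic sees that market's events in order.
            producer.send(new ProducerRecord<>("ride-requests", "san-francisco",
                    "{\"riders_waiting\": 42, \"cars_available\": 17}"));
        }
    }
}

A consumer, or a stream processor like the one sketched later in this piece, would read ride-requests and recompute each market's price as snapshots arrive.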

HomeAway is another great example. As the leader in vacation rentals, it has over one million listings (and growing). With Kafka, HomeAway connects disparate data sources, enabling a variety of use cases, including SLA monitoring, A/B testing, visitor segmentation, fraud detection, real-time ETL, and more.

Confluent, the company I left LinkedIn to co-found, is focused on extending Apache Kafka with Confluent Platform to meet the needs of enterprises that must manage data at scale and speed. This includes tools like Kafka Streams, Kafka Connect, and Control Center, which brings a new level of visibility and operational control to a Kafka cluster at scale.

TechRepublic: What’s the situation where Kafka is absolutely the best fit as a framework? Which use case?

Narkhede: The most common Kafka use cases are for real-time data transport, integration, and real-time stream processing.

For data transport and integration, users apply Kafka Connect to link data systems to applications so all systems have access to the most up-to-date data. This includes things like log data, database changes, sensor and device data, monitoring streams, call data records, and stock ticker data.
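
As an illustration, a Kafka Connect pipeline is typically defined by configuration rather than code. This sketch uses the FileStreamSource connector that ships with Kafka to stream new lines of an application log into a topic; the connector name, file path, and topic name are hypothetical:

name=app-log-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app/events.log
topic=app-events

Passing a file like this to Kafka's connect-standalone.sh script (along with the worker configuration) starts the connector; sink connectors that push topics into other systems are configured the same way.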

SEE: Could Concord topple Apache Spark from its big data throne? (TechRepublic)

For real-time stream processing, Kafka Streams is an extension of the Kafka core that allows an application developer to write continuous queries, transformations, event-triggered alerts, and similar functions without requiring a dedicated stream processing framework. These functions are often used in security monitoring, real-time operations (like Uber), and asynchronous applications such as inventory checks for a retailer.
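
As a sketch of that retail inventory case, the following uses the Kafka Streams Java API to turn a stream of stock-count updates into event-triggered alerts, with no separate stream processing framework involved. The topic names, message format, and threshold are assumptions for illustration:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LowStockAlerts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "inventory-alerts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Keys are item IDs; values are assumed to be plain stock counts.
        KStream<String, String> updates = builder.stream("inventory-updates");
        // Forward only items whose stock has dropped below the threshold.
        updates.filter((item, count) -> Integer.parseInt(count) < 10)
               .to("low-stock-alerts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Because Kafka Streams is a library, this runs as a plain Java application; there is no dedicated stream processing cluster to operate, which is Narkhede's point above.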

TechRepublic: How important is data locality when you’re running these types of real-time data pipelines? I’ve heard that by running on DC/OS, Kafka can read data locally when deployed alongside Cassandra. How would you describe the opportunity to run complementary frameworks on the same cluster? Is that why the industry is moving toward the abstractions made possible by Mesosphere DC/OS?

Narkhede: Managing services at data center scale presents a lot of optimization opportunities that are much more difficult to access when you’re managing each service individually. Though the ability to co-locate related services is an obvious benefit, there are cases where co-location does not make sense; instead, what you need is the ability to allocate dedicated resources to stateful applications for isolation. That is the requirement when deploying stateful applications like Kafka and Cassandra, and Mesos has added support for expressing these advanced deployment needs, which is essential for managing stateful applications at scale.

We made sure the Mesos deployment of Confluent Platform maintains data locality where required (by the brokers themselves). Our components Kafka REST Proxy and Schema Registry are effectively stateless and can run in those types of frameworks, while stateful services like the Kafka brokers must be managed differently. Both classes of service are required for the complete Confluent Platform, and supporting the whole platform gives customers more flexibility.

TechRepublic: What’s the importance of Mesosphere DC/OS’s two-level scheduler? Why is it better positioned to attract partner and ecosystem support (like from the Confluent/DataStax crowd)?

Narkhede: Different services have different requirements for cluster resources and deployment. The two-level scheduler meets the deployment needs of stateful applications like Kafka, where you want to optimize for data locality whenever possible to save network and I/O bandwidth. This offers a better operational experience to customers without sacrificing the performance that Kafka offers.
