
Kafka for End to End Analytics

Sion Smith 12 July 2023

Kafka and real-time analytics are a perfect match, and we want to share our experience of using Apache Kafka to enhance analytics capabilities. We’re going to cover the challenges we faced, the solutions we implemented, and the lessons learned along the way, and explain why we use Kafka for end-to-end analytics.


Kafka for End to End Analytics: Why did we use Apache Kafka 

Why did we use Apache Kafka for end-to-end analytics? At OSO, reliability and efficiency are crucial when it comes to data analytics pipelines. In an effort to improve our analytics infrastructure and business processes, and because we love to build everything on Apache Kafka, we explored how we could build a self-service platform for everyone in the business. As a distributed streaming platform that allows for the ingestion, storage, and processing of real-time data streams, Kafka presented itself as the perfect tool to boost our analytics capabilities and easily connect all our data sources in one centralised place.
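To make the idea concrete, here is a minimal sketch of what the consuming side of such a pipeline can look like with the plain Kafka Java client. The topic name, bootstrap address and consumer group are illustrative assumptions, not our production configuration.

// Minimal sketch of a consumer reading a real-time event stream for analytics.
// Topic name, bootstrap address and group id are illustrative placeholders.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AnalyticsEventReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-loader");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real pipeline this is where events would be shaped for the warehouse.
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}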

Kafka for End to End Analytics: Challenges of making event data visible 

Turning raw data into meaningful decisions is a genuinely hard problem to solve, especially at scale. The four key pillars of success we initially focused on were:

  1. Permissioning and UI

One challenge encountered was permissioning. It was sometimes difficult to determine whether the data was coming from Snowflake or Confluent Cloud. The user interface (UI) also posed difficulties, as it was not always clear which platform was being worked with. This led to challenges in navigation and troubleshooting issues.

  2. Testing

Testing posed another set of challenges. Time constraints prevented the implementation of unit testing and a schema registry, both of which are important for ensuring the quality and consistency of the data pipeline. Given more time, these testing practices would certainly be incorporated to enhance the reliability of the pipeline. The open-source Testcontainers project and its Kafka module helped tremendously here.
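As an illustration of the kind of test this enables, below is a minimal sketch of an integration test that spins up a throwaway Kafka broker with the Testcontainers Kafka module and round-trips a single event. The image tag, topic name and payload are illustrative assumptions.

// Minimal integration-test sketch using the Testcontainers Kafka module (org.testcontainers:kafka).
// The broker image tag, topic name and payload are illustrative placeholders.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import static org.junit.jupiter.api.Assertions.assertEquals;

class PipelineIntegrationTest {

    @Test
    void roundTripsAnEventThroughKafka() throws Exception {
        // Spin up a disposable single-node Kafka broker in Docker for the duration of the test.
        try (KafkaContainer kafka = new KafkaContainer(
                DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();

            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafka.getBootstrapServers());
            producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("page-views", "user-1", "{\"page\":\"/home\"}")).get();
            }

            Properties consumerProps = new Properties();
            consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafka.getBootstrapServers());
            consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
            consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(List.of("page-views"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
                assertEquals(1, records.count());
            }
        }
    }
}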

  3. Documentation and Version Control

Documentation and version control were essential aspects of the project. Everything was documented in the DBT project and version controlled by connecting to a GitHub repository. This method allowed for change tracking, collaboration with team members, and assurance that everyone was working with the latest version of the code.

  4. Scalability

Scalability was an important consideration for the project. While the pipeline was relatively small, the appropriate Snowflake warehouse size was utilised to handle incoming data. There was also flexibility to dynamically change the warehouse size if needed. As the project grows and more data is added, scalability will be a key factor to consider.
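As a rough illustration of that flexibility, a Snowflake warehouse can be resized with a single ALTER WAREHOUSE statement. The sketch below issues one through the Snowflake JDBC driver; the account URL, credentials and warehouse name are placeholders rather than our actual setup.

// Illustrative sketch only: dynamically resizing a Snowflake warehouse over JDBC.
// The account URL, credentials and warehouse name are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class ResizeWarehouse {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", System.getenv("SNOWFLAKE_USER"));
        props.put("password", System.getenv("SNOWFLAKE_PASSWORD"));

        // Placeholder account locator; substitute your own Snowflake account URL.
        String url = "jdbc:snowflake://myaccount.snowflakecomputing.com/";

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement()) {
            // Scale the warehouse up ahead of a heavy load; scale it back down afterwards.
            stmt.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'MEDIUM'");
        }
    }
}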

Kafka for End to End Analytics: OSO core architecture principles

As we are a small Apache Kafka consultancy, it was important to publish the valuable lessons we gleaned throughout this project:

  1. Version Control Infrastructure: Version controlling infrastructure is essential for maintaining consistency and reproducibility. It enables easy tracking of changes and collaboration with team members.
  2. Testing is Crucial: Implementing thorough testing practices, such as unit testing and a schema registry, is important for ensuring the quality and reliability of the data pipeline. It helps catch errors and inconsistencies early on (see the sketch after this list).
  3. Documentation is Key: Documenting the pipeline and its components is crucial for knowledge sharing and maintaining a clear understanding of the infrastructure. It aids team members in staying on the same page and troubleshooting issues more effectively.
  4. Scalability Considerations: As the project grows and more data is added, scalability becomes a critical factor. It’s important to choose appropriate resources and infrastructure to handle the increasing workload.
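To illustrate the second point, below is a minimal sketch of a producer that serialises events with Confluent’s Avro serialiser against a schema registry, so incompatible schema changes are caught before they reach downstream analytics. The topic, schema and registry URL are illustrative assumptions.

// Minimal sketch: producing Avro records validated against a schema registry.
// The topic name, registry URL and schema are illustrative placeholders.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroEventProducer {
    private static final String EVENT_SCHEMA = "{"
            + "\"type\":\"record\",\"name\":\"PageView\","
            + "\"fields\":[{\"name\":\"userId\",\"type\":\"string\"},"
            + "{\"name\":\"page\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // The Avro serialiser registers/validates the schema with the registry on first use.
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(EVENT_SCHEMA);
        GenericRecord event = new GenericData.Record(schema);
        event.put("userId", "user-1");
        event.put("page", "/home");

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user-1", event));
        }
    }
}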

Kafka for End to End Analytics: Key takeaways from using Apache Kafka in analytics platforms

In our experience, Apache Kafka proves to be a valuable tool for analysts. It provides a better understanding of dependencies and allows for more control over the data pipeline. Analysts can own the business logic and work with developers to ensure the pipeline’s integrity. Additionally, technical tools and knowledge, such as SQL and command-line skills, are beneficial for effectively parsing and exploring data.

However, it’s crucial to note that analysts should not be solely responsible for owning the infrastructure. Data engineering work is essential and should be handled by data engineers. Analysts can contribute by owning the business logic and ensuring data contracts are in place.

To overcome blockers and work effectively as a team, communication and collaboration are key. Analysts need the ability to communicate and raise issues, and teams need to work together to make decisions and prevent disruptions to the pipeline.

Overall, the experience with Apache Kafka was valuable and enhanced our analytics capabilities. By leveraging the power of Kafka, you can build reliable and efficient data pipelines that support your analytics needs.

For more content:

How to take your Kafka projects to the next level with a Confluent preferred partner

Event driven Architecture: A Simple Guide

Watch Our Kafka Summit Talk: Offering Kafka as a Service in Your Organisation

Successfully Reduce AWS Costs: 4 Powerful Ways

Protecting Kafka Cluster

Get started with OSO professional services for Apache Kafka

Have a conversation with a Kafka expert to discover how we can help you adopt Apache Kafka in your business.

Contact Us