Platform Engineering Playbook: A Kafka Journey on Automation, Efficiency, and Developer Experience

October 3, 2024

Platform engineering provides self-service infrastructure automation tools that improve the developer experience and accelerate the delivery of business value securely and efficiently. Its importance will only grow as we scale artificial intelligence (AI) for the business, because platform engineering provides the automation, tooling, and scalability that MLOps requires to deploy and iterate efficiently with the reliability that business operations demand. 
 
At Target, we’ve embraced platform thinking and established ourselves as strong practitioners in this space. By adopting and enhancing open-source technologies with centralized management, we reduce complexity and streamline the developer experience. Operating at the massive scale of a general retailer, we understand the importance of efficient systems. Within a couple of years of investing in platform engineering, we drastically improved platform reliability and cut the team's deployment support hours by an impressive 90%. In this post, we share our transformative journey managing Kafka as a platform service in a hybrid cloud environment, as a practical playbook that illustrates platform engineering in action.  
 
Kafka Platform in Target  
 
Kafka is a high-throughput, low-latency stream processing and messaging platform, and one of the most widely adopted open-source technologies. Our Kafka platform team manages many Kafka and mirroring clusters in our data centers and in GCP to meet different business criticality, security, and compliance needs with a 24/7 uptime guarantee. These clusters handle the scale we need during retail peak season, when traffic typically rises to many times its regular level. In addition to adopting the open-source technology, our platform team built additional capabilities, such as a management portal and control-plane APIs that let developers self-service their infrastructure needs. We also designed and implemented a platform control plane to centralize automation and infrastructure management. 
 
Our team encountered many challenges:  
  • Kafka's rapid adoption meant we managed the infrastructure with manual scripts, which were labor-intensive and unreliable.  
  • When we started automating, we realized that the open-source distribution binary lacked the detailed observability that automation requires.  
  • Our large user base needed a more effective way to scale best practices.  
  • We also needed to give developers resource utilization feedback so they could size infrastructure economically for their business needs.  
 
We decided to focus on three strategic initiatives:   
  • Improve observability and adopt test-driven development in platform automation.  
  • Take a strong opinion on best practices and create tools for easy adoption.  
  • Make resource utilization visible to equip our developers to make the right choices.   
  
Methodology – Deep Dive 
 
Improve observability, adopt test-driven development, and implement comprehensive exception handling 
 
First, we transformed manually executed batch scripts into Java or Kotlin code for testability. Next, we enhanced observability to capture all platform health metrics and strengthen self-healing capability. Afterward, we initiated our platform automation control plane by reorganizing the well-tested maintenance scripts into an API for modularity, which is easier to extend and maintain in the long term.  
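As a sketch of this transformation, a maintenance task such as planning a partition reassignment becomes a pure function that can be unit tested, rather than a shell script run by hand. The class and method names here are hypothetical, not our actual control-plane code:

```java
import java.util.*;

// Hypothetical sketch (not our actual control-plane code): a maintenance
// task rewritten as a pure, unit-testable function instead of a manual
// script. Given broker IDs, plan a round-robin replica assignment.
class ReassignmentPlanner {
    /** Returns partition -> ordered replica broker IDs. */
    public static Map<Integer, List<Integer>> plan(List<Integer> brokers, int partitions, int replicationFactor) {
        if (brokers.size() < replicationFactor) {
            throw new IllegalArgumentException("not enough brokers for replication factor " + replicationFactor);
        }
        Map<Integer, List<Integer>> assignment = new LinkedHashMap<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Rotate the starting broker per partition to spread leadership.
                replicas.add(brokers.get((p + r) % brokers.size()));
            }
            assignment.put(p, replicas);
        }
        return assignment;
    }
}
```

Because the logic is a pure function of its inputs, edge cases (such as too few brokers) can be covered by fast unit tests instead of being discovered in production.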
 
To further enhance release quality, we developed comprehensive, end-to-end functional tests. All new functionality must pass this rigorous automated testing suite before we promote the updated image to production, ensuring no breaking changes in provisioning and upgrade operations.
A diagram showing functional testing steps
To ensure the platform is available during change, we check critical health metrics during and after deployment and implement automated retry and error handling.  
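The health gating described above can be sketched as a retry loop around a health probe. The probe, attempt count, and backoff policy below are illustrative assumptions:

```java
import java.util.function.Supplier;

// Hypothetical sketch: gate each deployment step on a cluster health probe,
// retrying with linear backoff before declaring the rollout unhealthy.
class HealthGate {
    public static boolean awaitHealthy(Supplier<Boolean> healthCheck, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (healthCheck.get()) {
                return true; // e.g. zero under-replicated partitions
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis * attempt); // back off before retrying
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false; // caller halts the rollout and raises an alert
    }
}
```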
 
Zookeeper is a key dependency because it stores Kafka cluster metadata, such as broker membership and topic state. However, we discovered a critical flaw in the open-source Zookeeper quorum management. To remediate the problem, we redesigned the ensemble discovery process and persisted the quorum configuration's source of truth in a database, which eliminated the Zookeeper data corruption and significantly improved system reliability. 
A diagram showing a new process to eliminate zookeeper data corruption
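A minimal sketch of the idea, assuming a hypothetical database table of quorum members keyed by server ID: the ensemble configuration is rendered from the persisted rows rather than rediscovered from live nodes.

```java
import java.util.*;

// Hypothetical sketch of the redesigned approach: quorum membership lives in
// a database table (the source of truth) and is rendered into Zookeeper's
// ensemble configuration, rather than rediscovered from live nodes.
class QuorumConfig {
    /** members: server ID -> host, as read from the database. */
    public static List<String> render(SortedMap<Integer, String> members, int peerPort, int electionPort) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, String> e : members.entrySet()) {
            // Standard Zookeeper ensemble entry: server.<id>=<host>:<peerPort>:<electionPort>
            lines.add("server." + e.getKey() + "=" + e.getValue() + ":" + peerPort + ":" + electionPort);
        }
        return lines;
    }
}
```

Because every node renders its ensemble from the same persisted rows, the quorum can no longer drift into an inconsistent view during discovery.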
Have a strong opinion on best practices and create tools for easy adoption 
 
In platform engineering, it is important to take an opinionated approach, prioritizing consistency and standardization for development speed. We created a Kafka metrics library that adopts our enterprise metrics standards, so developers get client performance metrics and default dashboards without instrumenting the code themselves. We also published recommended Kafka topic and client library configurations for higher reliability and data durability. Instead of asking developers to adopt them manually from checklists, we built client libraries that automate configuration validation and rolled out topic configuration standardization on the brokers.  
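The kind of validation such a client wrapper performs can be sketched like this. The keys `acks` and `enable.idempotence` are standard Kafka producer settings, but the recommended values and class name are illustrative, not our actual library:

```java
import java.util.*;

// Hypothetical sketch of client-side config validation: compare an
// application's producer config against recommended durability settings.
// "acks" and "enable.idempotence" are standard Kafka producer configs;
// the recommended values and class name are illustrative.
class ConfigValidator {
    private static final Map<String, String> RECOMMENDED = Map.of(
            "acks", "all",
            "enable.idempotence", "true");

    /** Returns a human-readable message for each deviating setting. */
    public static List<String> violations(Map<String, String> clientConfig) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> rec : RECOMMENDED.entrySet()) {
            String actual = clientConfig.get(rec.getKey());
            if (!rec.getValue().equals(actual)) {
                out.add(rec.getKey() + ": expected " + rec.getValue() + ", found " + actual);
            }
        }
        return out;
    }
}
```

Running this check at client startup turns a best-practices checklist into an automated guardrail.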
 
We use mTLS and certificates to secure Kafka topics. Even though we provide a self-service tool in our custom-built portal, configuring ACLs to exactly match the certificate subject name introduced a significant operational load and troubleshooting effort for developers. To solve this problem, we introduced application identity: each application cluster receives a certificate at creation with a deterministic certificate subject name. The developer simply uses the application identity in the portal, and the platform transforms it into the deterministic certificate subject name behind the scenes to configure the ACLs that grant the application access to its topics. By abstracting away this complexity, we greatly simplified developers' configuration effort.  
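The identity-to-subject mapping can be sketched as a deterministic function; the naming pattern below is purely illustrative, not Target's actual format:

```java
// Hypothetical sketch of the identity-to-certificate mapping; the subject
// name pattern is illustrative, not Target's actual format.
class AppIdentity {
    /** Derive a deterministic certificate subject name from an application identity. */
    public static String subjectName(String org, String appId) {
        // Normalize so the same identity always yields the same subject.
        String id = appId.toLowerCase().replaceAll("[^a-z0-9-]", "-");
        return "CN=" + id + ".apps.internal,O=" + org;
    }
}
```

Because the mapping is deterministic, the portal can generate matching ACL entries without developers ever inspecting a certificate.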
 
Furthermore, we enhanced the Kafka Portal to encapsulate technical details behind behavior descriptions. For example, if a user marks a topic as "lossy," the platform sets the replication factor to 1 to maximize speed.   
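The behavior-to-configuration mapping amounts to a simple lookup; the "durable" tier and the exact settings below are illustrative assumptions, with only the "lossy" case taken from the example above:

```java
import java.util.Map;

// Hypothetical sketch: translate a user-facing behavior choice into concrete
// topic settings. The "durable" tier and exact values are illustrative.
class TopicDefaults {
    public static Map<String, String> forBehavior(String behavior) {
        switch (behavior) {
            case "lossy":   // speed over durability: a single replica
                return Map.of("replication.factor", "1", "min.insync.replicas", "1");
            case "durable": // durability over speed
                return Map.of("replication.factor", "3", "min.insync.replicas", "2");
            default:
                throw new IllegalArgumentException("unknown behavior: " + behavior);
        }
    }
}
```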
 
Make resource utilization transparent to promote efficiency 
 
Resource management has been a pain point for self-service platforms: developers often over-allocate or unintentionally abandon resources. Providing showback data is a critical capability for Target infrastructure platforms. For Kafka, we chose to bill the most cost-sensitive resource, disk usage. Guided by the bill, developers can make conscious decisions about data retention policies and replication factors, or change data ingestion frequency, to balance business needs against cost. In addition to providing a leader-oriented view, we are creating an app-centric view in the Target Cloud Platform console so that resource efficiency stays top of mind for application teams. 
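As a hedged illustration of why disk usage is the lever, a monthly showback estimate scales with ingest rate, retention, and replication factor. The formula and rate are illustrative, not our actual billing model:

```java
// Hypothetical sketch of a disk-based showback estimate: the steady-state
// footprint grows with ingest rate, retention, and replication factor.
// The formula and rate are illustrative, not our actual billing model.
class Showback {
    public static double monthlyCost(double gbIngestedPerDay, int retentionDays,
                                     int replicationFactor, double dollarsPerGbMonth) {
        // Steady-state disk footprint: every retained day is stored on every replica.
        double storedGb = gbIngestedPerDay * retentionDays * replicationFactor;
        return storedGb * dollarsPerGbMonth;
    }
}
```

Even this toy model shows the trade-off developers control: halving retention or dropping one replica directly shrinks the bill.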
 
Outcome  
 
Our results are substantial: reliability and observability improved significantly, operational tasks became more streamlined, and the team's deployment support hours fell by an impressive 90%. Additionally, redeployment time dropped by 50%, and overall production issues were cut in half. Over 500 teams adopted our platform libraries, saving developers time. The showback data gives the platform team new insights to prioritize resource allocation conversations. More importantly, our platform team's morale has risen significantly: platform engineers feel more innovative because they spend more time designing and writing code to improve the platform than manually executing scripts and watching for success or failure.   
 
In addition to product health and adoption, there are substantial benefits in talent and culture: we upskilled our engineers from sysadmins to infrastructure software engineers. Platform engineers are more creative than before and have more bandwidth to solve developers' problems. Our team members are more motivated because they are no longer bogged down by repetitive work or burned out by incidents caused by platform reliability challenges.   
 
Learning and Takeaways 
 
In addition to our three strategic approaches, we have two more takeaways to share:   
 
The focus on engineering maturity is fundamental.  
 
We took additional time to train and coach our platform team on the importance of engineering fundamentals:  
  • Automated unit and functional testing 
  • Prioritized observability 
  • Thorough exception handling 
During rollout, the team addressed the observability pain points first, prioritized building the functional test suite, and added exception handling and test scenarios whenever an edge case was discovered in a lower environment. Solid engineering practice ensured a smooth production rollout, so automation did not introduce more risk.  
 
Community collaboration is the key to success.  
 
We established strong partnerships with our user community and recruited several community advocates to vet our opinionated approach. This way, we maintained a balanced view of both the platform and user perspectives. These strong allies also helped promote the platform.   
 
Our journey in platform engineering, using Kafka management as a use case, can serve as a real-world playbook for other practitioners to start or refine their own platform engineering approach. 
