GitHub has recently reworked its push processing system to make it more efficient and more reliable for developers. The update addresses several issues that previously hindered developers' workflow, according to The GitHub Blog.
The Problem
Historically, GitHub's push processing was managed by a single, massive background job known as RepositoryPushJob. This job encompassed over 60 different pieces of logic owned by 20 different services, leading to various problems:
- Complexity: The job's size made it difficult to retry specific tasks, often resulting in the entire process being repeated from the start.
- Retries: Due to its complexity, retries were largely avoided, leading to crucial parts of push processing occasionally being skipped.
- Dependency Issues: The tight coupling of many tasks increased the risk of widespread issues if any single component failed.
- Latency: The sequential nature of tasks led to unnecessary delays, impacting user-facing tasks like pull request synchronization.
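The failure mode described above can be illustrated with a minimal sketch. It is not GitHub's code: the step names (other than pull request synchronization, which the article mentions) and the simulated failure are assumptions made up for illustration. It shows how, in a single sequential job that is not retried, one failing step causes every later step to be skipped.

```python
# Hypothetical illustration only: step names and the failure are invented.

def run_monolithic_job(steps):
    """Run steps in order inside one job; stop at the first failure.

    Returns (completed, skipped), where skipped includes the failing step
    and every step that was queued behind it.
    """
    completed, skipped = [], []
    for i, (name, step) in enumerate(steps):
        try:
            step()
        except Exception:
            # No retry: the failing step and everything after it never run.
            skipped = [n for n, _ in steps[i:]]
            break
        completed.append(name)
    return completed, skipped


def ok():
    pass


def failing():
    raise RuntimeError("downstream service unavailable")


steps = [
    ("sync_pull_requests", ok),
    ("update_search_index", failing),  # hypothetical step that fails
    ("deliver_webhooks", ok),          # hypothetical step, never reached
]

completed, skipped = run_monolithic_job(steps)
print("completed:", completed)
print("skipped:", skipped)
```

In this sketch the unrelated final step is lost purely because of its position in the sequence, which is the coupling problem the list above describes.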
New Approach
To address these challenges, GitHub has restructured its push processing system into multiple isolated, parallel processes using Kafka. The new approach involves:
- Publishing an event for each push to a new Kafka topic.
- Grouping tasks by owning service or logical relationships and creating new background jobs with appropriate retry configurations.
- Configuring these jobs to be enqueued in response to Kafka events using an internal system at GitHub.
This new architecture required several investments, such as a reliable Kafka event publisher, a dedicated pool of job workers, improved observability, and a system for consistent feature flagging.
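The pattern described in this section can be sketched roughly as follows. This is a simplified, in-memory stand-in rather than GitHub's implementation: the InMemoryTopic class, the job names, and the retry limits are all hypothetical, no real Kafka client is used, and the jobs run one after another here, whereas the article describes them running in parallel on a dedicated worker pool. The point of the sketch is only that each job reacts to the same push event independently and has its own retry configuration.

```python
from collections import defaultdict


def run_with_retries(handler, event, max_attempts):
    """Call handler up to max_attempts times; report the outcome and attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return ("ok", attempt)
        except Exception:
            continue
    return ("failed", max_attempts)


class InMemoryTopic:
    """Hypothetical stand-in for a Kafka topic that fans each event out to jobs."""

    def __init__(self):
        self.subscribers = []  # (job_name, handler, max_attempts)

    def subscribe(self, job_name, handler, max_attempts):
        self.subscribers.append((job_name, handler, max_attempts))

    def publish(self, event):
        # Each job is run in isolation: one job failing does not stop the others.
        return {
            job_name: run_with_retries(handler, event, max_attempts)
            for job_name, handler, max_attempts in self.subscribers
        }


calls = defaultdict(int)


def deliver_webhooks(event):  # hypothetical job that is fully down
    raise RuntimeError("outage")


def update_search_index(event):  # hypothetical job with a transient failure
    calls["search"] += 1
    if calls["search"] < 2:
        raise RuntimeError("transient error")


def sync_pull_requests(event):
    calls["pr"] += 1


topic = InMemoryTopic()
topic.subscribe("deliver_webhooks", deliver_webhooks, max_attempts=2)
topic.subscribe("update_search_index", update_search_index, max_attempts=3)
topic.subscribe("sync_pull_requests", sync_pull_requests, max_attempts=1)

results = topic.publish({"repo": "octo/example", "ref": "refs/heads/main"})
print(results)
```

Even though the first job exhausts its retries, the other two still complete, and the job with a transient error succeeds on its second attempt because its retry limit allows it.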
Results
The improvements have yielded several benefits:
- Reduced Blast Radius: Issues with one piece of logic no longer impact the entire process, reducing dependencies and improving system resilience.
- Lower Latency: Parallel processing of jobs has significantly decreased the time required for tasks, particularly for pull request synchronization.
- Improved Observability: Breaking tasks into smaller jobs has enhanced monitoring capabilities, allowing for quicker identification and resolution of issues.
- Increased Reliability: The new system allows for more appropriate retry configurations, ensuring pushes are processed more reliably. The fully processed push rate has improved from 99.897% to 99.999%.
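To put the reliability figures in perspective, the short calculation below converts the two reported rates into the share of pushes that were not fully processed. It assumes the rates are simple percentages of all pushes; the per-million figures are an illustrative rescaling, not numbers reported by GitHub.

```python
from decimal import Decimal

# Fully processed push rates reported in the article, in percent.
rate_before = Decimal("99.897")
rate_after = Decimal("99.999")

# Share of pushes that were NOT fully processed, in percent.
missed_before = Decimal("100") - rate_before
missed_after = Decimal("100") - rate_after

# How many times smaller the unprocessed share became.
reduction_factor = missed_before / missed_after

# Illustrative rescaling: unprocessed pushes per one million pushes.
per_million_before = missed_before / 100 * 1_000_000
per_million_after = missed_after / 100 * 1_000_000

print(f"missed before: {missed_before}%  after: {missed_after}%")
print(f"reduction factor: {reduction_factor}")
print(f"per million pushes: {per_million_before} -> {per_million_after}")
```

Under that assumption, the share of incompletely processed pushes fell from 0.103% to 0.001%, roughly a hundredfold reduction.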
Conclusion
GitHub's enhancements to push processing mark a significant step forward in improving developer interactions with the platform. By decoupling and parallelizing push tasks, GitHub has created a more efficient and reliable system, ensuring that developers' pushes are handled more effectively.