Scale Readiness
Building an eCommerce platform at scale is a substantial undertaking that requires deep engineering effort across Performance Engineering, Site Reliability and Incident Management. This section describes our approach to building and operating large-scale retail platforms on JC.
Traffic modelling
We start by drawing up simulation models of all possible user journeys, funnels and conversion percentages for an eCommerce storefront.
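A funnel-based traffic model of this kind can be sketched as a chain of stages, each with a conversion fraction relative to the previous stage. The following is a minimal illustration only: the stage names, conversion fractions and entry rate are hypothetical values chosen for the example, not figures from our platform.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    # Fraction of users from the previous stage who reach this stage.
    conversion: float


# Hypothetical storefront funnel; names and fractions are illustrative only.
FUNNEL = [
    Stage("home", 1.00),
    Stage("product_listing", 0.60),
    Stage("product_detail", 0.50),
    Stage("cart", 0.20),
    Stage("checkout", 0.50),
]


def stage_arrival_rates(entry_rate_per_sec: float, funnel: list[Stage]) -> dict[str, float]:
    """Propagate an entry arrival rate through the funnel, stage by stage."""
    rates: dict[str, float] = {}
    rate = entry_rate_per_sec
    for stage in funnel:
        rate *= stage.conversion
        rates[stage.name] = rate
    return rates


# Assumed entry rate of 1000 users per second at the top of the funnel.
rates = stage_arrival_rates(1000.0, FUNNEL)
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} users/sec")
```

The per-stage rates give the load that each downstream service needs to be tested against, which is the input a load generator would use.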
Scalability Scenarios
The traffic model is then augmented with scenarios that simulate actual user behavior.
Scale Readiness Example
The example shows how these techniques are applied by our SRE teams. We use the traffic model to generate synthetic traffic, which is executed against a replica of the production environment.
Scenario: Checkout flow server-side key performance metrics
Scenario: Checkout flow user arrival rate
Scenario: Checkout flow response time
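Arrival rate and response time are linked by Little's law: the average number of requests in flight equals the arrival rate multiplied by the average response time. The sketch below applies that relationship. The function name and the numbers are assumptions made for illustration, not measurements from the checkout scenario.

```python
def concurrent_requests(arrival_rate_per_sec: float, avg_response_time_sec: float) -> float:
    """Little's law: average requests in flight = arrival rate x average time in system."""
    return arrival_rate_per_sec * avg_response_time_sec


# Hypothetical checkout load: 200 requests/sec at a 0.25 s average response time.
in_flight = concurrent_requests(200.0, 0.25)
print(in_flight)  # 50.0
```

If response time degrades under load while the arrival rate stays constant, the number of in-flight requests grows in proportion, which is why the two charts are read together.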
Monitoring and Observability Tech Stack
Our monitoring tool-chain helps us watch and understand the system’s state using a predefined set of metrics.
Metrics Pipeline Architecture
Our metrics pipeline is central to our approach and uses proven open-source components that record real-time metrics at scale. This helps teams stay on top of issues and get proactive feedback.
Examples of metrics we monitor
- CPU Utilization - Percentage of CPU in use
- Database Connections - Number of client network connections to the instance
- Read + Write IOPS - Average number of disk read and write operations per second
- Disk Queue Depth - Number of read/write I/O requests waiting for disk access
Alerting Layer
The alerting tool-chain notifies teams about critical events and breached threshold limits.
Alert Examples
- Grafana: Slack and PagerDuty alerts; condition: threshold breach, evaluated per 1 min
- Code error rate: Sentry sends alerts via Slack when errors exceed 10 per 1 min
- Business KPIs: order loss per 1 min; OMS processing lag > 1k per 5 min
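A threshold-per-window rule such as "more than 10 errors in 1 minute" can be sketched as a sliding-window counter. This is an illustrative stand-alone sketch, not the way Grafana or Sentry evaluate rules internally; the `RateAlert` class and its parameters are hypothetical names introduced for the example.

```python
from collections import deque


class RateAlert:
    """Fires when more than `threshold` events fall inside the last `window_sec` seconds."""

    def __init__(self, threshold: int, window_sec: float):
        self.threshold = threshold
        self.window_sec = window_sec
        self._events = deque()  # event timestamps, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one event and return True if the alert condition now holds."""
        self._events.append(timestamp)
        # Drop events that have aged out of the sliding window.
        while self._events and self._events[0] <= timestamp - self.window_sec:
            self._events.popleft()
        return len(self._events) > self.threshold


# Hypothetical rule mirroring "error rate > 10 errors per 1 min".
alert = RateAlert(threshold=10, window_sec=60.0)
# Twelve errors, one per second: the alert fires from the 11th error onward.
fired = [alert.record(float(t)) for t in range(12)]
print(fired.index(True))  # 10
```

A sliding window avoids the blind spot of fixed one-minute buckets, where a burst that straddles a bucket boundary can stay under the threshold in both buckets.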
Traffic Flow Debugging
We’ve also created a custom blueprint that speeds up analysis and helps pinpoint the exact issue between multiple points in the infrastructure. Here’s an example -
Production Readiness
We define Production Readiness as the process that ensures each feature push is production-ready for live customers. We follow a checklist that audits each platform service and the cloud infrastructure. The high-level Production Readiness (PR) process is outlined below -
Platform Services
- Monitoring Layer
- Observability Layer
- Alerting Layer
- Pre-warming Data
- Regression Test Report
- Define backup plan
Infrastructure
- Monitoring Layer
- Observability Layer
- Alerting Layer
- Auditing
Peak Event Planning
Make sure systems are scaled sufficiently to handle the desired traffic
- SLA calculation for 5x the expected traffic
- Conducting Performance Engineering cycles
- Chaos Engineering
- Production resource upgrades
- Pre-warming data
- War room setup with a 24x7 support team
- Set up a sale dashboard for a unified view of events
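One way to read the 5x sizing item in the list above is as a capacity calculation: multiply the expected peak by the safety factor, then divide by the throughput a single instance can sustain at a target utilization. The sketch below works under that assumption. The function, its parameters and all numbers are hypothetical and would need to come from real performance test results.

```python
import math


def instances_needed(
    expected_peak_rps: float,
    traffic_multiplier: float,
    per_instance_rps: float,
    target_utilization: float,
) -> int:
    """Instances required to serve the multiplied peak without exceeding target utilization."""
    if per_instance_rps <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("per_instance_rps must be > 0 and target_utilization in (0, 1]")
    design_rps = expected_peak_rps * traffic_multiplier
    usable_rps_per_instance = per_instance_rps * target_utilization
    return math.ceil(design_rps / usable_rps_per_instance)


# Assumed inputs: 400 req/s expected peak, 5x multiplier,
# 150 req/s measured per instance, run at no more than 60% utilization.
print(instances_needed(400.0, 5.0, 150.0, 0.6))  # 23
```

Keeping the target utilization below 100% leaves headroom for instance failures and for traffic spikes within the event.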