AI-Assisted Software Development
Software engineers spend their time across a wide range of tasks, from firefighting incidents at 3 a.m. to carefully designing complex features over weeks. AI tooling is increasingly present across all of these, but its usefulness varies greatly depending on the nature of the task. This page is a structured analysis of where AI assistance fits into the day-to-day reality of software engineering, what the current friction looks like, and where the real opportunities lie.
TODO
- Outline of different kinds of tasks a software engineer does
- Expand section 1.1: 24x7 Incident Response
- Expand section 1.2: Support Issues
- Expand section 1.3: Bug Fixes
- Expand section 1.4: Load Tests and Other Systemic Testing
- Expand section 2.1: Complex Features (> 2 weeks)
- Expand section 2.2: Smaller Features (< 2 weeks)
- Jot down some high-level industry practices that have been adopted
Outline of Different Kinds of Software Engineering Tasks
1. Operational Excellence
1.1 24x7 Incident Response
| Name | Description | Current Reality and Impediments | Opportunity Gain | Solution Complexity | Existing Solutions in Zalando | Existing Industrial Solutions |
|---|---|---|---|---|---|---|
| Paging or alerting | A system degradation triggers an automated alert or page, attempting to reach the on-call incident responder. | Few impediments; this step is already automated. | LOW | LOW | | AWS CloudWatch Alarms + SNS, Amazon DevOps Guru, PagerDuty |
| Incident responder analyzes the issue | The responder, often woken up, investigates relevant dashboards and monitoring systems to identify the likely root cause of the degradation. | Requires deep familiarity with the systems. The responder must manually correlate data across Scalyr logs, monitoring graphs, and traces, while also cross-referencing known playbooks to form a judgement. | HIGH | HIGH | | Amazon DevOps Guru (ML-based anomaly detection and root cause analysis), Amazon Q Developer, Datadog AI, Dynatrace Davis AI |
| Fix for the issue applied | The responder consults relevant playbooks to apply a known fix, or escalates to the appropriate team if the issue falls outside their scope. | Fixes often involve running manual scripts, monitoring the system post-fix to confirm recovery, and keeping stakeholders informed throughout. | MEDIUM | MEDIUM | | AWS Systems Manager Automation (runbooks as code), AWS Incident Manager, PagerDuty Automation Actions |
| Collecting data for future analysis | The responder communicates the incident to the broader team, providing a summary and explanation of the issue the following day. | Drafting clear communication takes effort, as does collating the relevant data. For significant incidents, this also involves writing a post-mortem, which is time-consuming but critical. | HIGH | LOW | | AWS Incident Manager (incident timelines + post-incident reports), Amazon Q (draft summarisation), FireHydrant, Blameless |
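
The analysis and post-mortem steps in the table above both depend on gathering events from several sources around the time of the alert. As a minimal sketch of that idea in Python, the following keeps only the events within a time window of the alert and renders them as a plain-text timeline for a draft summary. The `Event` type, its fields, the 30-minute window, and the sample data are all illustrative assumptions, not an existing Zalando or vendor API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One observation from a monitoring source (all fields are hypothetical)."""
    timestamp: datetime
    source: str   # e.g. "alert", "log", "deploy"
    message: str


def build_timeline(events, alert_time, window=timedelta(minutes=30)):
    """Return the events within `window` of the alert, oldest first."""
    start, end = alert_time - window, alert_time + window
    nearby = [e for e in events if start <= e.timestamp <= end]
    return sorted(nearby, key=lambda e: e.timestamp)


def format_timeline(timeline):
    """Render the timeline as plain text for a post-mortem draft."""
    return "\n".join(
        f"{e.timestamp:%H:%M} [{e.source}] {e.message}" for e in timeline
    )


# Hypothetical sample data: a page at 03:00 with surrounding events.
alert_time = datetime(2024, 1, 1, 3, 0)
events = [
    Event(datetime(2024, 1, 1, 3, 2), "log", "error rate above threshold"),
    Event(datetime(2024, 1, 1, 2, 50), "deploy", "service rollout finished"),
    Event(datetime(2024, 1, 1, 1, 0), "log", "unrelated warning"),
    Event(alert_time, "alert", "page sent to on-call"),
]
print(format_timeline(build_timeline(events, alert_time)))
```

In this sketch the 01:00 warning falls outside the window and is dropped, while the deploy, the page, and the error log appear in chronological order. A real setup would pull these events from the logging and monitoring systems instead of a hard-coded list.
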
1.2 Support Issues
| Name | Description | Current Reality and Impediments | Opportunity Gain | Solution Complexity | Existing Solutions in Zalando | Existing Industrial Solutions |
|---|---|---|---|---|---|---|
| Issue intake and triage | A support request arrives via Google Chat or a ticketing system and needs to be understood, prioritised, and routed to the support engineer of the week. | Requests are often vague and lack necessary context. Triaging across fragmented channels (Google Chat, GHE Issues, Jira) is time-consuming and easy to drop. | HIGH | LOW | GHE Issues, Jira (may be replaced by Linear) | Amazon Q (support triage), Linear Asks (AI-assisted triage), Jira AI (auto-classification and routing) |
| Reproducing the issue | The engineer attempts to reproduce the reported problem in a lower environment or by tracing the exact user journey. | Reproduction often requires specific data states, environment configurations, or customer-specific context that is difficult to replicate reliably. | MEDIUM | HIGH | | AWS CodeBuild (environment replication), Datadog Session Replay, LaunchDarkly |
| Root cause investigation | The engineer digs into logs, traces, and code history to identify the underlying cause of the reported issue. | Requires broad knowledge across systems. Unlike incidents, support issues may lack urgency cues, making it easy to underinvest in deep investigation. | HIGH | HIGH | | Amazon DevOps Guru, Amazon Q Developer, Datadog AI, Elastic Observability |
| Resolution and communication | The engineer applies a fix or workaround and communicates the outcome back to the requester, then closes the ticket. | Writing clear, user-friendly responses is effortful. Documentation of the resolution for future reference is frequently skipped under time pressure. | HIGH | LOW | GHE Issues, Jira (may be replaced by Linear) | Amazon Q (response drafting), GitHub Copilot (code fix suggestions), Linear (linked issues to track resolution) |
| Aggregating support data for planning | Periodically analysing the volume and patterns of support issues to identify recurring problem areas and prioritise automation or product investment. | Data is often spread across multiple tools and rarely aggregated systematically. This step is frequently skipped, meaning systemic issues go unaddressed. | HIGH | MEDIUM | GHE Issues, Jira (may be replaced by Linear) | Linear (analytics and issue tagging for trend analysis), AWS QuickSight (dashboarding support data), Metabase |
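
The last row above describes aggregating support issues across tools to spot recurring problem areas. A minimal Python sketch of that aggregation follows, assuming tickets from each tool have already been exported and normalised into a common shape. The `Ticket` type, its fields, the component names, and the `min_count` threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """A support issue normalised from any tool (fields are hypothetical)."""
    tool: str       # e.g. "GHE Issues", "Jira", "Google Chat"
    component: str  # system or area the issue was attributed to
    title: str


def recurring_components(tickets, min_count=2):
    """Return (component, count) pairs seen at least `min_count` times,
    most frequent first."""
    counts = Counter(t.component for t in tickets)
    return [(c, n) for c, n in counts.most_common() if n >= min_count]


# Hypothetical sample data spanning several intake channels.
tickets = [
    Ticket("Jira", "checkout", "Payment page times out"),
    Ticket("GHE Issues", "checkout", "Order confirmation missing"),
    Ticket("Google Chat", "search", "Filters reset unexpectedly"),
    Ticket("Jira", "checkout", "Voucher not applied"),
    Ticket("GHE Issues", "search", "Slow results for long queries"),
    Ticket("Jira", "login", "Password reset email delayed"),
]
for component, count in recurring_components(tickets):
    print(f"{component}: {count}")
```

Counting by a single attribute is the simplest possible grouping. The harder part in practice is the normalisation step that this sketch assumes has already happened, since each tool labels and categorises issues differently.
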
1.3 Bug Fixes
| Name | Description | Current Reality and Impediments | Opportunity Gain | Solution Complexity | Existing Solutions in Zalando | Existing Industrial Solutions |
|---|---|---|---|---|---|---|
1.4 Load Tests and Other Systemic Testing
| Name | Description | Current Reality and Impediments | Opportunity Gain | Solution Complexity | Existing Solutions in Zalando | Existing Industrial Solutions |
|---|---|---|---|---|---|---|
2. Software Development
2.1 Complex Features (> 2 weeks)
| Name | Description | Current Reality and Impediments | Opportunity Gain | Solution Complexity | Existing Solutions in Zalando | Existing Industrial Solutions |
|---|---|---|---|---|---|---|
2.2 Smaller Features (< 2 weeks)
| Name | Description | Current Reality and Impediments | Opportunity Gain | Solution Complexity | Existing Solutions in Zalando | Existing Industrial Solutions |
|---|---|---|---|---|---|---|