Senior SRE · AI-Augmented Operations · Agentic Systems
Production systems I designed and shipped — not side projects.
AI agent that autonomously scans Slack incident channels, queries live Grafana dashboards via MCP, reads incident tickets, and produces structured postmortem reports, eliminating hours of manual post-incident work per P1.
Agentic AI system querying live signals across 13 microservices simultaneously for real-time health analysis and risk identification. Reduced MTTD by 75% on P1/IMF incidents for infrastructure and network-layer issues.
AI agent processing 45+ Opsgenie alerts/month — sorting by severity and blast radius, cross-referencing Confluence runbooks, and proposing remediation steps with a self-improving feedback loop.
AI-powered tool ingesting service configs and existing alert definitions, identifying observability blind spots, and auto-generating dashboard scaffolding and alert recommendations across the Governance & Compliance stack.
Suite of agentic Claude Code skills automating high-toil SRE tasks: capacity analysis, security vulnerability scanning, ticket validation, automated Jira updates, and Grafana dashboard generation.
End-to-end metrics pipeline sourcing from the company Datalake via AWS Athena into QuickSight dashboards — connecting infrastructure telemetry to product and revenue outcomes used by leadership.
Owned end-to-end reliability across 3 teams on a cloud-native SASE platform serving 12M+ enterprise customers. Pioneered GenAI in SRE — shipping 4 production AI agents for autonomous RCA, triage, and troubleshooting. Managed SLOs, error budgets, full observability stack (OTel, VictoriaMetrics, Prometheus, Grafana), multi-team on-call, and Kubernetes capacity planning.
Maintained 24x7x365 stability of enterprise security infrastructure for Apple's global product ecosystem, including the JMET platform. Mentored 9 engineers, led cross-timezone reliability initiatives, built zero-downtime CI/CD pipelines with Jenkins, Spinnaker, and Ansible, and defined SLIs/SLAs across multiple product lines.
Contributed to Apple's Identity Management & Provisioning System. Reduced critical defect backlog by 40% and led end-to-end onboarding of a new enterprise system.
Built file processing pipelines for Bank of America at Accenture, and developed inventory reconciliation modules for a Telecom Lifecycle Management System at Turning Point Global.
Recommendations from colleagues and managers, also visible on LinkedIn.
As Director of the Platform Engineering SRE team, I frequently collaborated with Soma in his role within the Engineering SRE team. I always found Soma to be strong technically, knowledgeable in his domain, diligent in his work, pleasant to work with, and an excellent communicator. Soma is an asset to his organization and I would be thrilled to hire a candidate of his caliber for an open role on my team.
17+ years of SRE experience, agentic AI systems, and platform reliability — one document.
Exploring Senior / Staff SRE and AI Infrastructure roles in the Bay Area.
Remote-friendly roles considered.