Open to Senior / Staff SRE & AI Infrastructure · Bay Area, CA

Somasundar
Venkatesh

Senior SRE  ·  AI-Augmented Operations  ·  Agentic Systems

soma@sre ~ %
$ ./deploy
MTTD reduced by 75% · 12M+ customers protected · 4 AI agents shipped
17+ Years Experience
75% MTTD Reduction
12M+ Enterprise Customers
4 AI Agents Shipped
Previously at 🍎 Apple (9 years)  ·  ☁️ Netskope (4 years)
Featured Work

AI Agents & Platform Projects

Production systems I designed and shipped — not side projects.

🔍
13 Microservices

Platform Troubleshooting System

Agentic AI system querying live signals across 13 microservices simultaneously for real-time health analysis and risk identification. Reduced MTTD by 75% on P1/IMF incidents for infrastructure and network-layer issues.

LangChain · RAG · VictoriaMetrics · Prometheus · Python
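
A minimal sketch of the fan-out idea behind this system, not the production code. It assumes each service exposes an error-rate signal; the service names, canned values, 5% threshold, and the `fetch_error_rate` helper are all hypothetical stand-ins for live metric queries.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical canned signals standing in for live metric queries.
# Service names and values are illustrative only.
FAKE_ERROR_RATES = {"auth": 0.002, "gateway": 0.08, "policy": 0.0}


def fetch_error_rate(service: str) -> float:
    """Stand-in for a real metrics query against the monitoring backend."""
    return FAKE_ERROR_RATES[service]


def check_all(services: list[str], threshold: float = 0.05) -> dict[str, str]:
    """Query every service in parallel and label each healthy or at_risk."""
    # One worker per service so all queries are in flight at once.
    with ThreadPoolExecutor(max_workers=len(services) or 1) as pool:
        rates = list(pool.map(fetch_error_rate, services))
    return {
        service: ("at_risk" if rate > threshold else "healthy")
        for service, rate in zip(services, rates)
    }
```

Calling `check_all(["auth", "gateway", "policy"])` labels only `gateway` as `at_risk` under these assumed values.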
🤖
45+ Alerts/Month

On-Call Triage Bot

AI agent processing 45+ Opsgenie alerts/month — sorting by severity and blast radius, cross-referencing Confluence runbooks, and proposing remediation steps with a self-improving feedback loop.

Opsgenie API · Confluence · Claude · Python · Feedback Loop
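
A minimal sketch of the severity and blast-radius ordering, under stated assumptions. The `Alert` shape, the `affected_services` field as a proxy for blast radius, and the tie-breaking rule are hypothetical and not taken from the production bot.

```python
from dataclasses import dataclass

# Lower rank means more urgent. P1 to P5 follows the usual priority labels.
SEVERITY_RANK = {"P1": 1, "P2": 2, "P3": 3, "P4": 4, "P5": 5}


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str           # one of the keys in SEVERITY_RANK
    affected_services: int  # hypothetical proxy for blast radius


def triage(alerts: list[Alert]) -> list[Alert]:
    """Order alerts by severity first, then by widest blast radius."""
    return sorted(
        alerts,
        key=lambda a: (SEVERITY_RANK[a.severity], -a.affected_services),
    )
```

With this ordering, two P1 alerts are ranked by how many services they touch, and any P1 still outranks a P3 regardless of radius.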
📊
3 Teams Onboarded

Monitoring Gap Analysis Tool

AI-powered tool ingesting service configs and existing alert definitions, identifying observability blind spots, and auto-generating dashboard scaffolding and alert recommendations across the Governance & Compliance stack.

Prometheus · Grafana · OTel · Python · SLO Design
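
A minimal sketch of the blind-spot check, reduced to a set difference. It assumes each alert definition carries a `service` label; that field name, the function name, and the sample data are hypothetical, and the real tool does considerably more.

```python
def find_unmonitored(services: list[str], alert_rules: list[dict]) -> list[str]:
    """Return services that no alert rule references, sorted by name."""
    # Collect every service that at least one alert rule covers.
    covered = {rule["service"] for rule in alert_rules if "service" in rule}
    return sorted(set(services) - covered)
```

Given services `auth`, `billing`, and `gateway` and a single rule labeled `auth`, the function reports `billing` and `gateway` as gaps.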
🛠️
Toil → Zero

Claude Code SRE Skills Suite

Suite of agentic Claude Code skills automating high-toil SRE tasks: capacity analysis, security vulnerability scanning, ticket validation, automated Jira updates, and Grafana dashboard generation.

Claude Code · Python · Jira API · Grafana · AWS
📈
C-Suite Dashboards

Business Metrics Pipeline

End-to-end metrics pipeline sourcing from the company Datalake via AWS Athena into QuickSight dashboards — connecting infrastructure telemetry to product and revenue outcomes used by leadership.

AWS Athena · QuickSight · Datalake · SQL · Python
Technical Stack

Skills & Tools

🤖

AI & Automation

  • Agentic AI Systems
  • LangChain / LangGraph
  • Claude Code & MCP
  • GenAI Workflow Design
  • LLM-powered Tooling
  • Prompt Engineering
📡

SRE & Observability

  • Grafana / Prometheus / PromQL
  • VictoriaMetrics / OTel / Telegraf
  • PagerDuty / Opsgenie / Alertmanager
  • Splunk / Sumo Logic
  • SLO/SLI & Error Budget Design
  • Incident Management & RCA
☁️

Infrastructure

  • Kubernetes & Docker
  • AWS (Athena, QuickSight)
  • HAProxy / Nginx
  • Ansible / Jenkins / Spinnaker
  • Linux / Bash / CI/CD
  • Production Readiness
💾

Data & Languages

  • Python / Bash / Java / SQL
  • Kafka / Cassandra / MongoDB
  • MariaDB / Couchbase / Redis
  • ClickHouse
  • AWS Athena / QuickSight
  • Git / Scrum / CI/CD
Career

Experience

Feb 2022 – Present Santa Clara, CA
Netskope

Senior Site Reliability Engineer

Owned end-to-end reliability across 3 teams on a cloud-native SASE platform serving 12M+ enterprise customers. Pioneered GenAI in SRE — shipping 4 production AI agents for autonomous RCA, triage, and troubleshooting. Managed SLOs, error budgets, full observability stack (OTel, VictoriaMetrics, Prometheus, Grafana), multi-team on-call, and Kubernetes capacity planning.

Agentic AI · SLO/SLI · Kubernetes · Observability · AWS
Nov 2015 – Feb 2022 Sunnyvale, CA
Apple

Site Reliability Engineer / Technology Lead

Maintained 24x7x365 stability of enterprise security infrastructure for Apple's global product ecosystem, including the JMET platform. Mentored 9 engineers, led cross-timezone reliability initiatives, built zero-downtime CI/CD pipelines with Jenkins, Spinnaker, and Ansible, and defined SLIs/SLAs across multiple product lines.

JMET Platform · CI/CD · SLI/SLA · Mentorship · IAM
Dec 2012 – Oct 2015 Mangalore, India
Apple

Technology Analyst

Contributed to Apple's Identity Management & Provisioning System. Reduced critical defect backlog by 40% and led end-to-end onboarding of a new enterprise system.

Identity Management · IAM · Enterprise Systems
2009 – 2012 India
Accenture & Turning Point

Software Engineer

Built file processing pipelines for Bank of America at Accenture, and developed inventory reconciliation modules for a Telecom Lifecycle Management System at Turning Point Global.

Java · File Processing · Telecom
Social Proof

Testimonials

Recommendations from colleagues and managers — also visible on LinkedIn ↗

"

As Director of the Platform Engineering SRE team, I frequently collaborated with Soma in his role within the Engineering SRE team. I always found Soma to be strong technically, knowledgeable in his domain, diligent in his work, pleasant to work with, and an excellent communicator. Soma is an asset to his organization and I would be thrilled to hire a candidate of his caliber for an open role on my team.

👤
Gerald Keller
Director of the Platform Engineering SRE team · Netskope
Download

Full Resume

17+ years of SRE experience, agentic AI systems, and platform reliability — one document.

Get in Touch

Open to Opportunities

Exploring Senior / Staff SRE and AI Infrastructure roles in the Bay Area.
Remote-friendly roles considered.