Title: Lead Architect
Area(s) of responsibility
A lead architect (hands-on only) in the SRE practice (Observability and DevOps). Below are the key skills.
- AWS architecture: VPC, Subnets, Routing, NAT, Security Groups, NACLs, Transit Gateway
- Compute & container orchestration: EC2, ECS, EKS (Kubernetes), Fargate, Lambda
- Storage & data: S3, EBS/EFS, RDS/Aurora, DynamoDB, ElastiCache
- Networking & edge: ALB/NLB, API Gateway, Route 53, CloudFront, Global Accelerator
- Identity & access: IAM policies/roles, STS, Organizations, Control Tower, SCPs
- Reliability patterns: multi-AZ/region HA, DR (Pilot Light/Warm Standby/Active-Active), backup/restore automation
- AWS stack: CloudWatch (metrics/logs/alarms), CloudTrail (audit), X-Ray (tracing), Config (drift/compliance)
- Metrics & tracing: Prometheus, Grafana, Jaeger, OpenTelemetry (OTLP, SDKs, collectors)
- Log aggregation & search: ELK/Elastic Stack (Elasticsearch, Logstash, Kibana), Fluentd/Fluent Bit, Splunk
- APM tools: Datadog, New Relic, AppDynamics, Dynatrace (bonus)
- SLO/SLI/SLA design, error budgets, golden signals, alert hygiene & runbook quality
- Pipeline design: GitHub Actions, GitLab CI, Jenkins, AWS CodePipeline/CodeBuild/CodeDeploy
- Deployment strategies: Blue/Green, Canary, Rolling, Feature Flags; automated rollbacks
- Artifact & dependency management: Docker registries (ECR), SBOM, supply chain security
- Release governance: trunk-based development, GitOps (Argo CD/Flux), approvals & gates
- Terraform (modules, workspaces, remote state, data sources)
- AWS CloudFormation/CDK (TypeScript/Python), nested stacks, custom resources
- Ansible (playbooks, roles, vault), Packer (AMI pipelines), Helm charts for Kubernetes
- Cluster lifecycle: node groups, CNI (Amazon VPC CNI/Calico), storage classes, ingress controllers
- Service mesh: Istio/Linkerd (optional), mTLS, traffic policies, sidecars
- Workload ops: HPA/VPA, pod disruption budgets, resource quotas/requests/limits
- Observability for K8s: kube-state-metrics, Prometheus Operator, Grafana dashboards
- Multi-tenancy, namespaces, RBAC, network policies; admission controllers & policy-as-code
- Incident management: on-call practices, escalation, blameless postmortems, RCA depth
- Chaos engineering: fault injection (chaos-mesh/litmus), game days, resilience scoring
- Capacity planning & performance tuning: autoscaling, throughput/latency profiling, caching strategies
- Availability engineering: circuit breakers, retries/backoff, bulkheads, graceful degradation
- Cloud security: IAM least privilege, Secrets Manager/Parameter Store, KMS, VPC endpoints
- Container security: image scanning (Trivy/Grype), runtime policies (Falco), admission controls
- Policy-as-code: AWS Config rules, GuardDuty, Security Hub
- Compliance: audit trails, encryption in transit/at rest, CIS/NIST/ISO mappings
- Cost governance: tagging standards, cost allocation, savings plans/reserved instances, rightsizing
- Strong programming skills for tooling/automation: Python/Go (preferred), Bash
- Event-driven ops: Lambda/Step Functions for remediation; webhooks & bots (ChatOps)
- API-first mindset: AWS SDK/CLI, tool integrations, custom exporters/collectors