Title: Manager, Cloud & Infrastructure Engineering
Area(s) of responsibility
Job Description:
Mandatory Skills:
- Datadog
- Terraform
- Python / Scripting
- Previous SRE experience
- Lead experience
- Great communicator
Nice-to-have Skills:
- AppDynamics or Dynatrace
- Chef / Ansible
- Previous DevOps pipeline buildout experience
Roles and Responsibilities:
- Design, build, and maintain highly available and scalable infrastructure and services to support critical applications.
- Monitor and analyze system performance, identifying and addressing bottlenecks and potential issues to ensure optimal performance and uptime.
- Implement and maintain automated monitoring, alerting, and incident response systems to detect and resolve issues promptly.
Qualifications:
- Proven experience as a Site Reliability Engineer with a focus on managing complex, distributed systems.
- Strong expertise in cloud platforms such as AWS and Azure, and experience with Infrastructure as Code (IaC) tools such as Terraform.
- Proficiency in at least one programming or scripting language for automation and tooling (e.g., Python, Go, Ruby, or shell).
- Experience with containerization and container orchestration technologies (e.g., Docker, Kubernetes).
- In-depth knowledge of monitoring and logging tools, such as Prometheus, Grafana, ELK Stack, or similar.
- Solid understanding of networking protocols, security principles, and best practices.
- Strong problem-solving skills and the ability to work well under pressure during incidents.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.