Title: Technical Lead - App Development
Area(s) of responsibility
Reliability Engineer
- Ensure the dependability and scalability of enterprise-scale applications.
- Work as a member of DevOps teams that share responsibility for the reliability of these applications and their associated platforms.
- Develop the reliability process to ensure the highest level of systems availability, stability, security and performance, including maintenance and support, root cause analysis, systems validation, performance tuning and capacity management.
- Dedicate time to building software that improves the reliability of production systems, fixing issues, and responding to incidents and on-call events.
- Help design and implement monitoring, alerting and metrics that track and report adherence to service SLOs and SLAs, as well as performance and operational efficiency.
- Drive technical innovation and efficiency in application and infrastructure operations via simplification and automation.
- Coordinate between infrastructure, platform and application subject matter experts to promote reliability efforts through communication and best practice sharing.
- Support a root cause analysis program that will lead to reduced downtime, increased resiliency and a culture of continuous improvement.
Technology Skills:
- Knowledge of APM tools (New Relic) – service maps, tracing, dashboards, custom events and queries, alerting, and monitoring of resources, JVM heap, thread pools, etc.
- Knowledge of Spring Boot application development, monitoring and tuning – JVM parameters, application parameters, etc.
- Knowledge of Splunk, including writing Splunk queries and creating Splunk dashboards
- DevOps pipelines – Bitbucket, CloudBees, Amazon cloud
- Knowledge of distributed tracing frameworks such as Jaeger
Skills with M/O flag are part of Specialization