Title: Consultant Specialist
JD for Reliability Engineer – 4C/5A
- Monitoring and Automation: Proactively monitor software systems to prevent incidents and automate routine tasks.
- Effective Monitoring: Build monitoring systems that alert on early symptoms before they escalate into outages.
- Application Performance Monitoring (APM): Implement and utilize APM tools such as New Relic or Dynatrace to monitor application performance, identify bottlenecks, and optimize resource usage.
- Log Analysis with Splunk: Analyze logs using Splunk to troubleshoot issues, detect anomalies, and improve system reliability.
- Dashboards Preparation: Create informative dashboards to visualize system health, performance, and key metrics.
- Alerts Setup: Configure alerts based on thresholds and anomalies to promptly address issues.
- Reports Scheduling: Set up regular reports to provide insights into system performance and reliability.
- Reliability Metrics: Establish and track reliability metrics (e.g., SLOs, SLIs, error budgets) to measure system performance.
- Observability Skills: Proficiency in observability practices, including distributed tracing, logging, and metrics collection.
- Collaboration: Partner with development and support teams to improve services through rigorous testing and release procedures.
- Capacity Planning: Participate in system design consulting and capacity planning.
- Debugging and Incident Response: Understand debugging information, handle incidents, and roll back faulty software pushes.
- Mentoring L1/L2 Support Teams: Provide guidance and mentorship to establish best practices for monitoring and observability.
- Infrastructure Management: Run and manage infrastructure using tools like Chef, Ansible, Terraform, GitLab CI/CD, and Kubernetes.
- Documentation: Document processes and procedures to avoid duplicated effort.
- Enthusiastic Attitude: Approach challenges with enthusiasm and a proactive mindset.