Country/Region:  MX
Requisition ID:  18513
Work Model:  Remote
Position Type:  Permanent
Salary Range: 
Location:  MEXICO CUST SITE

Title:  Sr Technical Lead

Description: 

10+ years of experience

We are hiring a Databricks Administrator and Site Reliability Engineer who will contribute to building and supporting reliable, high-capacity, high-performing infrastructure in AWS and Databricks that supports our mission to reimagine learning for millions of students worldwide. We are looking for someone who already has experience with Databricks as a data transformation solution and who can help us support it, keep it reliable and efficient, and reduce its costs. We use AWS to back our Databricks jobs, so experience with AWS is also a must.
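
To give a concrete flavor of the reliability work involved, here is a minimal sketch (one possible shape of such a check, not a prescription) that uses the Databricks Jobs API 2.1 to surface recently failed job runs. The workspace URL and token environment variable names are assumptions.

```python
import os
import requests

# Hypothetical environment variables for this sketch; adjust to your setup.
HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://dbc-xxxx.cloud.databricks.com
TOKEN = os.environ["DATABRICKS_TOKEN"]  # a personal access token

def recent_failed_runs(limit: int = 25) -> list[dict]:
    """Return recently completed job runs whose result state is not SUCCESS."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"limit": limit, "completed_only": "true"},
        timeout=30,
    )
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    return [r for r in runs if r.get("state", {}).get("result_state") != "SUCCESS"]

if __name__ == "__main__":
    for run in recent_failed_runs():
        print(run["run_id"], run["state"].get("result_state"), run.get("run_page_url"))
```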


About you:

• You are a strong communicator, both verbally and in writing

• You thrive on solving problems

• You take ownership of the quality of your individual efforts

• You have proven experience in all aspects of Databricks E2

• You have experience as an SRE/DevOps professional working on n-tier AWS architectures

• You are adept at authoring, reviewing, testing, and deploying Terraform-defined infrastructure, including Terraform modules (a plan-review sketch follows this list)

• You are well-versed in Docker and containerization strategies for application code

• You treat people with respect, have a good sense of humor, and are non-abrasive

• You are constantly reading and learning, ever a consumer of knowledge

• You follow through, follow up, and do not leave tasks unfinished
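
By way of illustration for the Terraform review work mentioned above, here is a minimal sketch of a plan-review step: it renders a saved plan as JSON with `terraform show -json` and flags resources slated for deletion. The plan file name is an assumption, and this is one possible shape of such a check, not a prescribed tool.

```python
import json
import subprocess

def destructive_changes(plan_file: str = "plan.tfplan") -> list[str]:
    """Render a saved Terraform plan as JSON and return addresses of
    resources slated for deletion.

    Assumes `terraform plan -out=plan.tfplan` has already been run in
    the current working directory (the file name is an assumption).
    """
    out = subprocess.run(
        ["terraform", "show", "-json", plan_file],
        capture_output=True, text=True, check=True,
    ).stdout
    plan = json.loads(out)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc["change"]["actions"]
    ]

if __name__ == "__main__":
    for addr in destructive_changes():
        print("would destroy:", addr)
```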

Qualifications

• Extensive experience with Databricks E2 is a must

o Unity Catalog

o event lake / data lake

o Delta tables / lakehouse

o Databricks E2 workspaces, jobs, clusters

o Spark clusters

• Experience with Terraform is a must

• Strong familiarity with telemetry systems like New Relic and/or Datadog, including skills in querying logs and metric data. We will be working on making Databricks environments “observable,” and the candidate will be involved in that effort (a metric-query sketch follows this list).

• Strong problem-solving, triage, root-cause analysis, and systems engineering skills

• Demonstrated expertise building and managing highly scaled production infrastructure in the cloud

• Understanding of Git fundamentals

• Understanding of AWS IAM roles and permissions

• Bash, Python, or Scala scripting experience
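
To illustrate the kind of metric querying the telemetry bullet above refers to, here is a minimal sketch against Datadog's v1 metrics query endpoint. The environment-variable names and the metric name are assumptions; an equivalent New Relic NRQL query would serve the same purpose.

```python
import os
import time
import requests

# API and application keys are assumptions for this sketch.
API_KEY = os.environ["DD_API_KEY"]
APP_KEY = os.environ["DD_APP_KEY"]

def query_metric(query: str, window_s: int = 3600) -> dict:
    """Query a Datadog timeseries over the last `window_s` seconds."""
    now = int(time.time())
    resp = requests.get(
        "https://api.datadoghq.com/api/v1/query",
        headers={"DD-API-KEY": API_KEY, "DD-APPLICATION-KEY": APP_KEY},
        params={"from": now - window_s, "to": now, "query": query},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Hypothetical metric name; adjust to whatever your Databricks integration emits.
    series = query_metric("avg:databricks.cluster.cpu.utilization{*}")
    for s in series.get("series", []):
        print(s["metric"], s["pointlist"][-1] if s["pointlist"] else None)
```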

Nice to Have

• BS degree in Computer Science or a related technical field, or equivalent industry experience

• Hands-on Apache Spark experience

Your contributions would include:

• Ensure repeatability, traceability, and transparency of our infrastructure automation (infrastructure-as-code, monitoring-as-code)

• Participate in continual learning of the AWS ecosystem

• Support on-call rotations for operational duties that have not been addressed with automation, with an eye toward correcting issues that result in on-call alarms

• Maintain telemetry that improves visibility into our applications' performance and business metrics, and keep operational workload in check

• Develop, communicate, collaborate on, and monitor standard processes to promote the long-term health and sustainability of operational development tasks

• Support healthy software development practices, including following agile software development methodology, performing code reviews, packaging work, and practicing continuous delivery

• Observe and document steady-state production levels and growth patterns

• Continuously monitor costs of the Databricks platform and advise on possible paths to reduce cost (a cost-report sketch follows this list)

• Coordinate improvements of existing software and infrastructure to meet resiliency goals

• Improve our knowledge base around Databricks E2
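
As one possible shape for the cost monitoring described above, here is a minimal boto3 sketch that pulls daily unblended cost grouped by AWS service via Cost Explorer; filtering to Databricks-tagged resources would be the natural refinement. Credentials are assumed to come from the environment.

```python
import datetime as dt
import boto3

def daily_cost_by_service(days: int = 7) -> list[tuple[str, str, float]]:
    """Return (date, service, unblended USD cost) rows for the last `days` days.

    Note: Cost Explorer treats the End date as exclusive, so today's
    partial data is not included.
    """
    end = dt.date.today()
    start = end - dt.timedelta(days=days)
    # Cost Explorer is a global service served out of us-east-1.
    ce = boto3.client("ce", region_name="us-east-1")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    rows = []
    for day in resp["ResultsByTime"]:
        for group in day["Groups"]:
            rows.append((
                day["TimePeriod"]["Start"],
                group["Keys"][0],
                float(group["Metrics"]["UnblendedCost"]["Amount"]),
            ))
    return rows

if __name__ == "__main__":
    for date, service, usd in daily_cost_by_service():
        print(date, service, f"${usd:,.2f}")
```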

Our cloud stack includes:

• AWS Cloud tech: AWS (S3, EC2, ECS, EKS, SQS, ElastiCache Redis, ElastiCache Memcached, RDS, Aurora RDS, OpenSearch, DynamoDB, Athena, Route 53, WAF, IAM, CloudFront, load balancing, ACM, VPCs, Lambda, API Gateway, VPC endpoints, routing tables, SES, SNS, Bedrock, Glue, and more)

• Data Analytics: Databricks E2

• Infrastructure as Code: Terraform mainly with some occasional Cloudformation

• RDBMS: MySQL, PostgreSQL, Aurora, including connection-pooling technologies

• Caching: Redis, Memcached, DynamoDB

• Programming: Python, Bash, Scala, Terraform

• Container orchestration: AWS ECS

• Telemetry: New Relic, CloudWatch, Datadog

• Build/Run: GitHub Actions, Artifactory, CircleCI, GitHub Enterprise

• Security: Rapid7 InsightAppSec, CrowdStrike, Kentik, 1Password, DivvyCloud (InsightCloudSec), Security Scorecard

• Task Management: JIRA

• Documentation: Confluence, Office 365, SharePoint