Senior Site Reliability Engineer

Taos
09/28/2020

Full time

Job Description

Taos is looking for an experienced, enthusiastic and articulate Senior Site Reliability Engineer who is comfortable and efficient in a fast-paced enterprise IT environment.

Job Duties and Responsibilities

The Site Reliability Engineer (SRE) will be responsible for both uplifting and maintaining our evolving technology platforms, infrastructure and technology controls. The role includes both oversight of production operations for our systems and development/engineering of solutions to maximize system reliability and automation. The role will address three dimensions:

• Tools Coverage - Assess tools coverage and ensure sufficient monitoring is in place to enable mature observability and data-driven decision making

• Process and Education - Define processes, procedures, guardrails and best practices, and educate Engineering teams on them

• Culture - Instill a culture of high-performing teams and promote the adoption of SRE-influenced ways of working

The SRE will work with a global team responsible for a mission-critical business function, and will partner with Infrastructure, DevOps and Core practices (such as Security, Identity, ProdOps, Cloud platform and Tools) teams to identify and implement automation opportunities that drive down toil, reduce technical debt and improve system reliability.

What you'll be doing

Own the infrastructure and APM, and work with DevOps teams to build, release, monitor and run services to improve service reliability
Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python
Work with Ansible, Puppet, Chef, Terraform or another config management / orchestration suite, know where it is broken, work towards fixing it and explore new alternatives
Define and accelerate implementation of support processes, tools and best practices
Maintain services once they are live by measuring and monitoring availability, latency and overall system reliability
Handle cross-team performance issues end to end: identify the cause, determine the areas for improvement and drive the resulting actions to closure
Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology and engineering practices
Define, measure and improve reliability metrics (SLO/SLI), observability (monitoring, logging and tracing solutions) and Ops processes (Incident and Problem Management), and streamline and automate release management
Be a strong believer in automation as a means of sustained continuous improvement: automate toil and runbooks, and improve the ability of applications to self-heal, leading to improved reliability

What you'll bring with you

Knowledge in one or more of the following key areas: Ops maturity (performance testing, monitoring, operations - SIP), APM, Performance Benchmarking, Software Design and lifecycle (planning - discovery to provision), Infosec (including compliance, security)
Good understanding of, and implementation experience with, 12-factor app principles
Experience building monitoring/metrics and alerting tools (APM tools) and custom dashboards for each application stack in each supported environment
Expertise with Python-related Technologies and Frameworks
Experience with Unix/Linux OS internals and administration, or networking, and subject matter expertise in at least one cloud computing infrastructure - GCP / Azure / AWS

Skills - Requirements

The successful candidate will have the following attributes/qualifications:
Bachelor's/Master's degree and 10+ years of Development and Operations related experience and/or training; or equivalent combination of education and experience
Relevant experience as an SRE would be an added advantage
Good understanding of how to uplift the maturity of app engineering practices and Ops
Understanding of software delivery lifecycles, particularly Agile/Lean & DevOps
Proven experience in handling large scale and growing infrastructure across Data Centres and heterogeneous Cloud platforms
Experience as a service owner managing large, geographically diverse stakeholder groups
Ability to work with a creative, fast-growing engineering team and motivate them to deliver their best work
History of driving innovation

Familiarity with handling

Containerization - Kubernetes, Docker, Rancher, etc.
Kafka, YARN, Elasticsearch, etc.
Source code management and implementation of security best practices
Tech Stack - Python, Falcon, Elasticsearch, MongoDB, AWS (SQS, S3), MapReduce
Data science (AI/ ML) and analytics to be able to predict failures / operational issues
Be a subject matter expert, able to upskill / cross skill engineering teams on SRE principles, tools and execution
Troubleshoot, debug, and diagnose operational issues and drive them to closure.
Monitor the health of Dish-Sling services, and define as well as track reliability metrics

Who is Taos?

Taos helps today's enterprises and rapidly growing businesses harness the power of the cloud and DevOps with digital transformation and optimization solutions. From Executive Leadership to our delivery teams, Taos listens, understands, and delivers best-in-class work. Our deep technical expertise and solutions-driven approach help address our clients' biggest business challenges and opportunities. As a Global Leader of Cloud and DevOps, Taos continues to solve What's Next.

Talent at our Core

Taos Consultants are adaptable problem-solvers, growth-minded doers, and lifelong learners.

Thanks to this mindset, we have helped thousands of clients achieve their goals and solve their challenges. From Cloud Architects to Security Analysts to DevOps Engineers, Taos is always seeking the best and brightest technical talent. Joining Taos gives you the opportunity to work with national enterprises and innovative Silicon Valley companies. Our model provides the support and benefits of full-time employment while giving you exposure to a variety of environments and technologies to sharpen your skills and deepen your technical expertise. These advantages combined with competitive benefits, continuous training and education, and a clear career progression path make Taos a great place to work.

Referrals:

We love referrals so much that we pay for them! If you know someone that you would recommend, send an email to or Contact Us and we will do the rest! We'll make sure that you receive the $1000 referral bonus after they are employed with us.

Compensation:

Our compensation package includes a competitive salary, medical and dental insurance, 401k, paid vacation, sick time and holiday pay, plus loads of free training (Puppet, Chef, Nagios, LAMP Stack, PMP, ITIL, Python, etc.)!

Equal Opportunity:

Taos Mountain, LLC is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, national origin, sex, age, status as a protected veteran, or status as a qualified individual with a disability.

Veterans are encouraged to apply!

E-Verify Participant:

This employer will provide the Social Security Administration (SSA) and, if necessary, the Department of Homeland Security (DHS), with information from each new employee's Form I-9 to confirm work authorization. Please go to and review the E-Verify Participant and Right to Work links for more information.
