As Director of SRE, you will build and lead the Site Reliability Engineering (SRE) practice. You will be responsible not only for transforming the existing organization to embrace SRE principles but also for hiring talented engineers to strengthen the practice. The role is responsible for uplifting and maintaining our evolving technology platforms, as well as defining and implementing modern infrastructure and technology controls. As the head of SRE, you will oversee production operations of our systems and the development of solutions that maximize system reliability and automation.
This is a global role, with teams across different geographies, responsible for operating mission-critical business functions. You will partner with the Infrastructure, DevOps, and Core practices (Security, Identity, ProdOps, Cloud platform, and Tools) teams to identify and implement automation opportunities that drive down toil, reduce technical debt, and improve system reliability.
Key Responsibilities:
- Relentless focus on repeatable processes, practices, and automation.
- Create reusable solutions that continuously improve the reliability, scalability, observability, security, and availability of our service.
- Articulate a vision for establishing SRE practices and execute the vision by working closely with technology leadership across the enterprise.
- Drive the initiative to transform the existing Engineering and Operations organizations to embrace modern SRE principles.
- Overall accountability for end-to-end system reliability, scalability, and performance across multiple applications.
- Work with broader technology, operational, and business teams to continuously improve the end-to-end service experience and the associated cost of delivery.
- Help build a culture of trust, collaboration, and ownership across the organization.
- Provide stable, secure, and compliant infrastructure environments.
- Focus on understanding business and customer engagement, service performance and continuous service improvement.
- Demonstrate passion to learn and experiment with emerging technologies.
- Handle cross-team performance issues, from identifying the cause and determining areas for improvement to driving the resulting actions to closure.
- Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology, and engineering practices.
- Hire, build and lead a talented organization of SRE professionals.
Experience Required:
- Demonstrated ability to articulate vision and strategy, and to execute on them in collaboration with Engineering and Business leadership.
- Extensive experience operating successful large-scale distributed production systems serving millions of customers, preferably in the OTT/media industry.
- Experience running applications across the board at four-nines (99.99%) and five-nines (99.999%) availability.
- Experience in leading SRE and implementing its concepts, including defining and measuring SLIs/SLOs, error budgets, data-driven decision making, and balancing reliability against innovation.
- Good understanding of and experience in operating applications in public cloud, hybrid cloud, and on-prem infrastructure.
- Experience in several of the following key areas: Operations maturity (performance testing, monitoring, operations - SIP), APM, Performance Benchmarking, Software Design and lifecycle (planning - discovery to provision), Infosec (including compliance, security).
- Experience in systems, storage, networking, security and databases is strongly desirable.
- Familiarity with handling:
  - Containerization: Kubernetes, Docker, Rancher, etc.
  - Kafka, YARN, Elasticsearch, etc.
  - Source code management and implementation of security best practices.
  - Python, Falcon, MongoDB, AWS (SQS, S3), MapReduce
  - Data science (AI/ML) and analytics to predict failures and operational issues
  - Native OTT applications that run on partner platforms: Roku, Android, iOS/tvOS, smart TVs.
- Experience with private cloud (OpenStack, VMware) and public clouds (AWS, GCP).
- Be a subject matter expert, able to upskill and cross-skill engineering teams on SRE principles, tools, and execution.
- Monitor the health of services and define as well as track reliability metrics.
The successful candidate will have the following attributes/qualifications:
- At least 10 years of experience in a leadership role, including at least 5 years leading SRE and/or Development engineers and teams at different levels
- Bachelor's/Master's degree and 15+ years of Development and Operations related experience, or an equivalent combination of education and experience
- Relevant experience as a hands-on SRE would be an added advantage
- Proficiency with continuous integration and continuous delivery tooling and practices.
- Understanding of modern software engineering practices, and good exposure to Agile/Lean & DevOps
- Proven experience in handling large-scale, growing infrastructure across data centers and heterogeneous cloud platforms
- Experience as a service owner managing large, geographically diverse stakeholder groups
- Ability to work with creative, fast-growing engineering teams and motivate them to deliver their best work
- History of driving innovative ideas and implementing solutions
Provided by Dice