Senior Service Reliability Engineer Job

Job Overview

Location
Toronto, Ontario
Job Type
Full Time
Salary / Compensation
Details Not Provided
Date Posted
1 year ago

Additional Details

Extensive Exp. Required (9+ Years)

Job Description

Reporting to the Manager, Service Reliability Engineering, the Senior SRE will be the technical lead of the team responsible for supporting 24x7 uptime and availability of mission-critical, customer-facing production cloud services distributed across multiple regions. You'll help to create more consistent, automated, push-button environments across all tiers; proactively test and tune all aspects of the infrastructure; monitor and respond to system notifications and alerts; and continually work to optimize and improve the performance, security, and reliability of our systems.

Core Accountabilities

  • Provide technical direction and subject matter expertise to all other SREs
  • Assist teams with escalations, ensuring our critical production environments are functioning as intended
  • Lead troubleshooting, restoration of services, and post-incident learnings
  • Contribute to initiatives aimed at refining our plans to deploy practices that improve stability, reliability, and security
  • Drive Value Stream Mapping to ensure we are focused on reporting on and automating the most valuable work
  • Provide technical progress updates to the various managers in Datacenter Services, including data analysis to gauge our service trends and to recommend and drive improvements
  • Apply automation and software to any tasks or parts of the system that would benefit from it or are performed manually
  • Troubleshoot complicated, cross-platform issues spanning OS, networking, and database layers in a cloud-based SaaS environment; handle live production incidents; debug application and infrastructure issues; and follow and implement SRE best practices
  • Monitor application performance, take steps to improve overall application performance and stability, and follow through with implementation
  • Conduct system analysis and configuration management, and develop improvements for system software performance, availability, and reliability
  • Design, write, ship, and motivate the creation of software and systems to increase observability, product reliability and organizational efficiency
  • Work closely with software engineers and testers to ensure the system is responding properly to non-functional requirements such as performance, security, and availability
  • Document your system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available
  • Maintain and monitor the deployment and orchestration of servers, containers, databases, and general backend infrastructure
  • Develop standards and maintain self-provisioning infrastructure using tools like Ansible, Terraform, and Docker

Minimum Qualification and Skills

  • Post-secondary degree or diploma in Business and/or Computer Science (or related education)
  • 8+ years’ experience as an SRE/DevOps Engineer
  • Experience working with engineering teams to understand their product requirements and how they build/test/deploy their software applications
  • 3+ years’ experience in containerization and orchestration
  • Experience with distributed storage technologies like NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
  • Basic programming and scripting skills (preferably Go, Bash or other shell scripting, Python)
  • Ability to provide advice, best practices, and recommendations for the operation and deployment of Microsoft Azure
  • Experience in monitoring and analyzing infrastructure performance using standard performance monitoring tools (BHOM, New Relic, Perfmon, PerfView, ProcDump, DebugDiag)
  • Familiarity with Linux and UNIX systems (e.g., CentOS, Red Hat) and command-line system administration with tools such as Bash, Vim, and SSH
  • Hands-on experience in configuration management of server farms (using tools such as Puppet, Chef, Ansible)
  • Knowledge of network routing, load balancing, and networking protocols, including a base knowledge of TCP/IP and an understanding of HTTP and DNS

