Job Description
About Us
Founded in January 2011 by IIT Bombay alumni Bhavish Aggarwal and Ankit Bhati, Ola (formerly Olacabs) is India’s most popular mobile app for personal transportation. Ola integrates city transportation for customers and driver partners onto a mobile technology platform, ensuring convenient, transparent and quick service fulfillment. Ola is committed to its mission of building mobility for a billion people. Using the Ola mobile app, users across 110 cities can book from over 6,00,000 vehicles across cabs, auto-rickshaws and taxis. Ola has also introduced a range of shared mobility services on its platform, such as Ola Shuttle and Ola Share, for commuting and ride-sharing respectively. The app is available on Windows, Android and iOS platforms.
We are looking for a Principal Engineer DevOps to help us build and enhance platforms to achieve availability, scalability and operational effectiveness. The right individual will embrace the opportunity to tackle challenging problems and use their influence to drive continual improvement. You will also work on the cutting edge of technology, leveraging Kong, Repose, Docker, Mesos/Kubernetes, Jenkins, Chef, HAProxy, Nginx, GitLab, MySQL, Scylla, Aerospike, Service Mesh (Istio/Linkerd), Prometheus, etc.
Roles and Responsibilities
- Managing availability, performance, and capacity of infrastructure and applications.
- Building and implementing observability for application health, performance, and capacity.
- Optimizing On-call rotations and processes.
- Documenting “tribal” knowledge.
- Managing infra platforms such as Mesos/Kubernetes, CI/CD, observability (Prometheus/New Relic/ELK), cloud platforms (AWS/Azure), databases, and data platform infrastructure.
- Providing help in onboarding new services with the production readiness review process.
- Providing reports on service SLOs, error budgets, alerts, and operational overhead.
- Working with Dev and Product teams to define SLO/Error Budgets/Alerts.
- Working with the Dev team to have an in-depth understanding of the application architecture and its bottlenecks.
- Identifying observability gaps in product services and infrastructure, and working with stakeholders to fix them.
- Managing outages, doing detailed RCAs with developers, and identifying ways to avoid a recurrence.
- Managing/Automating upgrades of the infrastructure services.
- Automating toil.
Experience & Skills:
- Experience as an SRE/DevOps/Infrastructure Engineer on large scale microservices and infrastructure.
- A collaborative spirit with the ability to work across disciplines to influence, learn, and deliver.
- A deep understanding of computer science, software development, and networking principles.
- Demonstrated experience with languages such as Python, Java, and Golang.
- Extensive experience with Linux administration and a good understanding of the various Linux kernel subsystems (memory, storage, network etc).
- Extensive experience in DNS, TCP/IP, UDP, gRPC, routing, and load balancing.
- Expertise in GitOps, Infrastructure as Code tools such as Terraform, and configuration management tools such as Chef, Puppet, SaltStack, and Ansible.
- Expertise in Amazon Web Services (AWS) and/or other relevant cloud infrastructure solutions such as Microsoft Azure or Google Cloud.
- Experience in building CI/CD solutions with tools such as Jenkins, GitLab, Spinnaker, Argo, etc.
- Experience in managing and deploying containerized environments using Docker, Mesos/Kubernetes is a plus.
- Experience with multiple datastores is a plus (MySQL, PostgreSQL, Aerospike, Couchbase, Scylla, Cassandra, Elasticsearch).
