Senior Site Reliability Engineer
Job Summary
The core premise of SRE is to treat operations as a software problem, where operations are concerned with availability, scalability, latency and efficiency. The SRE is tasked with engineering solutions to complex problems, which requires a strong aptitude for developing software systems that minimise human labour (i.e. through automation) and increase system and service reliability. Ultimately, fundamental software engineering skills coupled with strong systems and networking knowledge will guide the SRE to create more reliable systems and services that are highly available, scale with growth, and are efficient and latency-sensitive.
Key responsibilities:
- Design, develop and implement systems software that improves the stability, scalability, availability and latency of the products;
- Take ownership of one or more services and have the freedom to do what is best for our business and customers;
- Solve problems occurring with our highly available production systems and build solutions and automation to prevent them from happening again;
- Build effective monitoring to track the health of your systems, and jump in to handle outages;
- Build and run capacity tests to manage the growth of your systems;
- Plan for reliability by designing systems to work across our multinational data centres.
Key requirements:
- Experience with design, development, testing, and monitoring of large-scale and data-intensive systems;
- Experience in designing, implementing and testing integrations between ServiceNow and different third-party systems using REST, web services, etc., while applying
- Solid experience with Java (10+ years);
- Experience with MySQL, Docker and Kubernetes, Graphite and Grafana for metrics, dashboards and alerting, GitLab CI/CD, and Elasticsearch, plus knowledge of Bash scripting and Linux environments;
- Well versed in creating and extending REST APIs, and integrating with internal APIs and services;
- Expected to participate in the operational shift during the day (reacting to outages) and to provide customer support (ticket work).
