Senior Site Reliability Engineer

Job Summary

The core premise of SRE is to treat operations as a software problem, where operations are concerned with availability, scalability, latency and efficiency. The SRE is tasked with engineering efforts to solve complex problems, which requires a strong aptitude for developing software systems that minimise human labour through automation and increase system and service reliability. Ultimately, fundamental software engineering skills coupled with strong systems and networking knowledge will guide the SRE to create more reliable systems and services that are highly available, scale with growth, and are efficient and latency-sensitive.

Key responsibilities:

  • Design, develop and implement systems software that improves the stability, scalability, availability and latency of the products;
  • Take ownership of one or more services and have the freedom to do what is best for our business and customers;
  • Solve problems occurring with our highly available production systems and build solutions and automation to prevent them from happening again;
  • Build effective monitoring to track the health of your system, and jump in to handle outages;
  • Build and run capacity tests to manage the growth of your systems;
  • Plan for reliability by designing systems to work across our multinational data centres.

Key requirements:

  • Experience with design, development, testing, and monitoring of large-scale and data-intensive systems;

  • Experience in designing, implementing and testing integrations between ServiceNow and different third-party systems using REST, WebServices, etc., while applying

  • Strong experience with Java (10+ years)

  • Experience with MySQL, Docker & Kubernetes, Graphite & Grafana for metrics, dashboards and alerting, GitLab CI/CD and Elasticsearch, plus knowledge of bash scripting and Linux environments

  • Well versed in creating & extending REST APIs, and integrating with internal APIs and services

  • Expected to participate in the operational shift during the day (reacting to outages) and provide customer support (ticket work)
