Job description
Site Reliability Engineer for Logging & Monitoring

Our client develops and operates state-of-the-art logging and monitoring platforms that collect application behaviour information, detect and limit service disruption, and provide the associated reporting capabilities. These platforms help application and platform owners identify growing risks, gain a clear understanding of their SLAs, reduce the mean time to resolution and stay ahead of the curve with regard to long-term trends.

As a Site Reliability Engineer you will be:
- Developing and operating state-of-the-art logging, monitoring and event management platforms that help application and platform owners better understand their workloads running on multi-cloud platforms.
- Providing consultancy services in logging and monitoring to application, product and service owners as well as developers.
- Part of a squad of very dynamic, highly motivated and diverse engineers.
- Working closely with developers and application owners to improve the stability and resilience of our cloud infrastructure and applications.

About the team

The mission of the new Service Insights Centre is hugely ambitious: they want to provide autonomous IT operations capabilities (e.g. with AI) to recognize and resolve serious issues faster and with greater accuracy than humans can today. They will provide monitoring and predictive insights about how all IT services are meeting current, and will meet future, service level objectives (stability, availability, security etc.).
They will support and enable business and technical domains to deliver the data required for our vision.

The mission of the Service Insights squad is to:
- Develop, maintain and provide a continuous (24/7) overview of IT health (current issues, incidents, deployments, possible future issues, cyber threats etc.).
- Provide insights to quickly determine the cause of, and a possible resolution path for, incidents.
- Predict incidents before they happen, trigger automation to resolve them and strive for autonomous operation of systems.
- Provide insights and transparency to application owners and other stakeholders on stability, impacts, pain points, areas for improvement etc.
- Contribute insights to all stakeholders in support of making our IT services stable and reliable and increasing transparency about outages and incidents.
- Provide critical metrics and insights to the CTO and senior management in order to steer the technology strategy and investments.

About you
- 5+ years of software development, continuous integration/deployment and system engineering experience in cloud-native ecosystems.
- Experience with a container orchestration system (e.g. Kubernetes), with solid security and network skills.
- Experience in a modern language (e.g. Golang, Java) and in scripting languages (Shell, PowerShell, Python).
- Experience with open-source application and infrastructure monitoring tools, e.g. Elastic stack (ELK), Influx stack (TICK), Prometheus and Grafana.
- Experience with Azure cloud and multiple Azure services, including but not limited to AKS, Azure Monitor, App-Insights and Application Services.
- Passion for sharing knowledge and creating technical documentation.
- Strong analytical and problem-solving skills, as well as the ability to focus on details without losing track of the bigger picture.
- Excellent oral and written English skills; additional language skills are a plus.

We look forward to receiving your application if this opportunity sparks your interest!