Valtech is looking for a Site Reliability Engineer (SRE). Are you passionate about Site Reliability Engineering, do you have an eye for SLIs, SLOs, and automation, do you like eliminating toil, and does it excite you to get things done in close collaboration with people around the globe? Would you like the freedom to work from the comfort of your home and also have the opportunity to visit any of our offices close to you? Then you might be the person we’re looking for! Keep reading to find out.
Valtech and Site Reliability Engineering
Over the last few years, experience and commerce platforms have evolved drastically into complex ecosystems that tie together multiple services from multiple vendors, also known as the MACH architecture. As a founding member of the MACH Alliance, a group that educates enterprises on best-of-breed Microservices, APIs, Cloud, and Headless (MACH) technology, Valtech is a pioneer in properly building and managing those complex ecosystems. Site reliability engineering is at the core of our vision of how this modern-day distributed ecosystem should and can be managed.
A day in the life of a Site Reliability Engineer
As a Site Reliability Engineer (SRE), you fulfil an essential role. You will mainly be responsible for the continuity and reliability of our clients' commerce and experience platforms in production, in continuous collaboration with developers, QA engineers and cloud engineers. You will work with our multidisciplinary teams in a DevOps way of working, where your main responsibility is to keep everyone focused on production while creating the facilities to do so.
Your responsibilities will be:
- Monitor performance, availability, and security of applications and services in cloud environments
- Enhance and maintain CI/CD pipelines
- Analyze and troubleshoot issues in development and production environments
- Support teams in testing and improving logging
- Define and maintain SLOs to ensure system reliability
- Maintain runbooks
- Set up backup and disaster recovery processes
- Provide proactive support/maintenance
- Collaborate with development teams and provide insights for improvements
You and the role
You are someone with 5+ years of experience in the field of Site Reliability Engineering. Leading up to that, you have gained a profound level of expertise in either cloud engineering, DevOps engineering or software engineering. Taking the lead is something that you feel comfortable doing. In your current role, people come to you for advice on what to look for to determine the robustness of their production environments, advice for reliable deployment procedures, assistance in the analysis of failure scenarios and ideas on how to mitigate or remediate those.
We would love it if you have
- Good communication skills, capable of taking the lead and collaborating with the development team to make the right choices
- A deep understanding of IT service management in a DevOps environment
- Experience with incident management in a production environment of a public-facing online service with high business value and preferably high traffic in a 24x7 fashion
- Experience working in corporate environments
- Experience with programming and scripting, e.g. Java, Python, MySQL
- Knowledge of serverless services in one or more public cloud providers (AWS, Azure, GCP) and experience configuring, managing, operating, and maintaining AWS infrastructure as code (IaC)
- Extensive knowledge of and experience with various monitoring systems, including APM systems such as Datadog, PagerDuty, New Relic, Dynatrace, Prometheus, Grafana
- Knowledge of and experience with various pipelining tools, such as GitHub, Azure DevOps, GitLab, Jenkins
- Knowledge of and experience with microservices-related technology: Docker, Kubernetes, Helm
- Experience in using GitOps tools such as Argo CD/Flux CD, Harbor
- Use of Jira, Confluence
- Good conceptual understanding of software architecture and system thinking
- Experience in debugging, optimizing, and proposing changes to application code for scalability, resilience, and monitoring
- Familiarity with the automation of routine tasks
- Strong problem-solving skills, effective communication, and a proactive attitude
- An excellent command of English (C1 or above)
What you can expect from us
Apart from the benefits listed above, we have more to offer. Our growth development program, for example, helps you excel in your existing profession or enables you to explore another. There are also plenty of internal initiatives for you to take part in. Whether it's about improving Valtech as a business or contributing to the world around us, we encourage our employees to pursue their professional and personal ambitions.
Join Valtech
Not only can you help us lead the experience revolution, but you can lead the change. The journey Valtech is on is ambitious and therefore opens doors. The more we grow, the more opportunities there are to take responsibility, implement your creative ideas, and be the innovator and driver to help move us, and our clients, forward.
Does it excite you to join Valtech on this journey? And would you like to become our next Site Reliability Engineer?
Then apply today. We cannot wait to hear from you!
Why do we state “all genders” in the title of this job description?
As per the AGG (General Equal Treatment Act) in Germany and its equivalent in France, jobs must be advertised in a gender-neutral manner. We use the statement (all genders) to make it clear that this position is open to all genders in these countries, even if the job title itself can be translated or interpreted as "masculine" in the French or German language.