Site Reliability Engineer
Additional responsibilities will include, but are not limited to:
- Cloud Infrastructure Management: Writing pull requests (PRs) to make changes that improve and optimize our AWS+Terraform+Kubernetes setup, centring around ensuring its high availability, scalability, and resilience.
- Security & Compliance: Implementing security measures, auditing the cloud environment, and ensuring adherence to compliance standards.
- Tool Development: Expanding our internal tool base, focusing on Infrastructure as Code and configuration management improvements.
- Issue Resolution: Collaborating with teams to identify and resolve infrastructure-related issues swiftly, minimizing any impact on product performance.
- Cloud Strategy Advocacy: Championing cloud strategies that align with and advance our business objectives, especially during pitch cycles and other planning meetings.
- Knowledge Sharing: Connecting with Cloud Engineers, Site Reliability Engineers, and application engineers, documenting key decisions where possible and making sure critical knowledge isn't siloed in a single spot in the organization.
What you can expect your first 12 months to look like:
- Infrastructure Knowledge: Within six months, acquire an expert understanding of each of the following technologies, and submit an approved peer-reviewed pull request (APRPR) for each: Terraform, Flux, Kustomize, and Argo.
- Stability Improvements: In the first 6-9 months, deliver a proof of concept (POC) for a technology improvement centred around improving or maintaining uptime, reducing reliance on single points of failure, or reducing the Time to Recovery after an incident.
- Signal and Metrics Improvement: Within six months, contribute to at least one cycle of signal and metrics improvement, and show that the overall number of alerts decreased in the following cycle and/or that a requested metric or set of metrics has been made available for use.
- Security and Compliance: In the first 12 months, contribute to at least one of the following: AWS Product and Architecture Review, SOC 2 compliance review, Disaster Recovery (DR) plan review and drill, Security Penetration Test (Pen Test) review and remediation.
A little bit about you:
- Cloud Infrastructure Management: Proficiency in managing cloud infrastructures, especially AWS, along with associated tools like Terraform and Kubernetes, ensuring high availability, scalability, and resilience.
- Experience with Infrastructure as Code (IaC): Hands-on experience with IaC tools and techniques, including configuration management and cloud provisioning.
- Software Development: Basic programming skills in at least one language, such as Python, for tool development and automation tasks.
- Security Best Practices: Knowledge of security protocols and compliance requirements specific to cloud environments, with experience in implementing security measures.
- Troubleshooting & Issue Resolution: Experience in diagnosing and resolving infrastructure-related issues, working closely with development and support teams.
- Monitoring and Metrics: Familiarity with cloud monitoring tools and performance metrics to continuously evaluate and improve the infrastructure.
- CI/CD Practices: Understanding of continuous integration and continuous deployment practices for efficient and reliable product releases.
- Documentation & Communication: Ability to document technical processes clearly and effectively communicate architectural decisions and changes to various stakeholders.
Just some of the reasons to join Graylog:
- Management team with deep programming, technical, and product experience.
- Opportunity to work with a globally distributed and diverse team.
- Grow and develop professionally and personally in a fast-growing environment.
- Choice of the latest equipment to help you succeed.
- Monthly allowance to support your commute costs and the outfitting of your work-from-home environment.