Description of the job
We're looking for a thoughtful and driven Senior Cloud Operations & Reliability Engineer to join our team. You're someone who's
naturally curious, notices when things aren't quite right, and takes initiative to investigate and improve them. You care deeply about
building reliable systems and understand the importance of working collaboratively and responsibly.
In this role, you'll be part of a team that values ownership, continuous learning, and a strong sense of purpose. You'll have the
autonomy to act when something needs fixing, and the judgment to know when to bring others in. If you're passionate about cloud
technologies, enjoy solving complex problems, and thrive in a supportive, high-trust environment—this could be the perfect fit.
Key Responsibilities
- Own and operate cloud infrastructure (EC2, ECS, Lambda, S3, etc.) with a focus on reliability and performance.
- Proactively monitor systems using Datadog and CloudWatch—if something looks off, you're already investigating.
- Automate deployments and infrastructure using AWS CDK, Terraform, and CloudFormation.
- Maintain centralised logging and observability to ensure no issue goes unnoticed.
- Drive security compliance and manage access controls with a strong sense of responsibility.
- Handle service requests and documentation with clarity and precision.
- Mentor junior engineers and provide technical leadership within the team.
- Collaborate with cross-functional teams to drive strategic initiatives and improvements.
- Contribute to architectural decisions and influence the evolution of our cloud infrastructure strategy.
- Participate in project-based work and work from the Cloud team backlog.
- Participate in the on-call roster for the Cloud Operations team.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 5+ years of experience in cloud operations, DevOps, or site reliability engineering.
- Proven experience with AWS services and cloud-native architectures.
- Excellent understanding of the software development life cycle (SDLC).
- Deep experience maintaining CI/CD pipelines and infrastructure automation.
- Strong understanding of Agile practices and incident management workflows.
Technology Stack
- Languages: Java, Golang, TypeScript, Python, Bash
- IaC Tools: AWS CDK, Terraform, CloudFormation
- DevOps & Monitoring: GitLab, Datadog, Apigee
- Cloud Platform: AWS
Skills
- Cloud Expertise: Deep knowledge of AWS services and cloud-native design patterns.
- Automation: Proficiency in scripting and infrastructure automation.
- Language: Excellent skills in at least one scripting language (Python/TypeScript).
- Monitoring & Observability: Extensive experience designing, implementing, and maintaining observability solutions using Datadog, CloudWatch, and centralised logging platforms.
- Security & Compliance: Deep understanding of IAM, KMS, WAF, and audit logging.
- Collaboration: Strong communication and teamwork skills across technical and non-technical teams.
- Incident Management: Ability to lead and resolve high-impact incidents efficiently.
- Leadership: Proven ability to mentor and guide junior engineers.
Personal Attributes
- Proactive: You don't wait for instructions; you investigate, fix, and improve.
- Strategic: You think beyond the immediate fix and consider long-term impact and scalability.
- Curious: You're driven by a need to understand how things work and why they break.
- Responsible: You take initiative thoughtfully and know when to seek support.
- Collaborative: You work well across teams and communicate clearly under pressure.
- Detail-Oriented: You document, test, and validate your work to ensure long-term stability.
- Leader: You inspire and mentor your team, fostering a culture of continuous improvement.
- Available: You are able to be on-call.