![Kontakt.io logo](https://static.remoteliz.com/static/companies/company-kontakt.io-logo.jpg)
Senior Site Reliability Engineer
Kontakt.io
Job Summary
Kontakt.io is seeking a Senior Site Reliability Engineer to ensure the scalability, availability, and security of its cloud-based, AI-driven healthcare platform. The ideal candidate will have 3+ years of experience in SRE, expertise in Kubernetes, Docker, and container orchestration, and knowledge of machine learning infrastructure and healthcare compliance. As an SRE at Kontakt.io, you will collaborate with software, data, and infrastructure teams to build highly resilient, automated systems that let hospitals and care facilities operate without downtime. You will design and maintain cloud infrastructure, implement SLOs, SLIs, and SLAs, and participate in a 24/7 on-call rotation. Kontakt.io offers a competitive salary, a stock option plan, flexible remote work options, and a collaborative environment.
Key Responsibilities:
- Design and maintain highly available, fault-tolerant, and scalable cloud infrastructure.
- Implement SLOs, SLIs, and SLAs to track system reliability and optimize uptime.
- Participate in a 24/7 on-call rotation.
- Oversee production platform deployments.
- Monitor latency, traffic, errors, and system health using modern observability tools.
- Conduct root cause analysis (RCA) and post-mortems to continuously improve system resilience.
- Automate infrastructure provisioning using Terraform, Ansible, or Pulumi.
- Implement CI/CD pipelines to ensure seamless and safe deployments.
- Enable self-healing mechanisms using Kubernetes operators, auto-scaling, and fault detection.
- Ensure compliance with HIPAA, GDPR, and other healthcare data regulations.
- Define and execute disaster recovery (DR) and business continuity plans.
- Manage and optimize AWS environments for cost-efficiency and performance.
- Deploy and manage observability tools and build real-time alerting and response frameworks.
- Establish best practices for logging, debugging, and performance monitoring.
- Improve incident response automation through runbooks, AI-based anomaly detection, and predictive analytics.
What You Bring:
- 3+ years of experience as an SRE.
- Strong expertise in Kubernetes, Docker, and container orchestration.
- Experience managing cloud-native environments (AWS).
- Experience with event-driven architectures, Kafka, or real-time data streaming.
- Knowledge of machine learning infrastructure.
- Previous experience in healthcare, compliance (HIPAA), and highly regulated environments.
- Proficiency in Infrastructure as Code (IaC) using Terraform.
- Deep knowledge of networking, DNS, load balancing, and security best practices.
- Experience with CI/CD pipelines (Jenkins, CI, or ArgoCD).
- Hands-on experience with monitoring and logging tools (Prometheus, Grafana, ELK, OpenTelemetry).
- Strong programming skills in Python, Golang, or Bash for automation.
We offer:
- Work on a mission-driven platform that improves healthcare operations and patient outcomes.
- B2B contract or an employment agreement.
- Competitive salary and stock option plan.
- Collaborate with top engineers, data scientists, and AI experts.
- Flexible remote or hybrid work options (office in Krakow).
- Collaborative and self-organized environment.
- Private medical care and a cafeteria system.