![Binance logo](https://static.remoteliz.com/static/companies/company-binance-logo.jpeg)
Senior DevOps Engineer (Monitoring - Grafana, Prometheus)
Binance
Job Summary
We are seeking a Senior DevOps Engineer with expertise in AWS and in monitoring tools such as Grafana and Prometheus. The ideal candidate will design, implement, and manage comprehensive monitoring solutions to ensure high availability and performance of microservices infrastructure and applications. They will collaborate with the infra team to integrate monitoring solutions into the CI/CD pipeline and conduct performance analysis, capacity planning, and scalability testing. Our Senior DevOps Engineer will lead incident response and troubleshooting efforts, using monitoring data to quickly resolve operational issues. With a strong emphasis on monitoring and observability in cloud-native environments, they will work with a talented team to create innovative AI solutions that make the world programmable. We offer flexible remote work options, a competitive salary, and company benefits.
Responsibilities:
- Design, implement, and manage comprehensive monitoring solutions to ensure high availability and performance of our microservices infrastructure and applications.
- Utilize advanced monitoring tools and scripting to automate the monitoring of our cloud environments, focusing on AWS.
- Develop and maintain robust logging and alerting mechanisms to identify and mitigate potential issues proactively.
- Collaborate with the infra team to integrate monitoring solutions into the CI/CD pipeline, ensuring seamless deployments and operations.
- Conduct performance analysis, capacity planning, and scalability testing to ensure our systems meet current and future demands.
- Lead incident response and troubleshooting efforts, utilizing monitoring data to quickly resolve operational issues.
Requirements:
- Minimum of 5 years of hands-on experience with Kubernetes, Elasticsearch, Prometheus, Grafana and AWS, with a strong emphasis on monitoring and observability in cloud-native environments.
- Proficiency in programming languages (such as Python, Go or Rust) for automation of monitoring tasks.
- Experience with infrastructure as code (IaC) tools and a strong understanding of CI/CD principles, including experience with Docker and Kubernetes for container orchestration.
- Deep knowledge of monitoring tools (such as Prometheus, Grafana or the ELK stack) and strategies for large-scale environments.
- Proven track record in managing and troubleshooting large-scale distributed systems, with an emphasis on performance tuning and optimization.
- Excellent problem-solving skills, with a focus on delivering high-quality, reliable, and scalable infrastructure solutions.
- Strong communication and teamwork skills, with the ability to work effectively in a fast-paced, collaborative environment.