Job Description
**Monitoring and Alerting:**
- Implement and maintain monitoring systems that detect potential issues and alert engineers before users are affected.
- Respond to incidents and outages, diagnose problems, and implement solutions to minimize downtime and restore service.
- Automate repetitive tasks and processes to improve efficiency and reduce manual effort.
- Identify and address performance bottlenecks to keep systems running efficiently.
- Manage and maintain the underlying infrastructure, including servers, networks, and cloud resources.
- Plan for future capacity needs to ensure systems can handle anticipated workloads.
- Develop and maintain processes for deploying software updates and releases.
- Work closely with developers, operations teams, and other stakeholders to ensure system reliability and availability.
- Maintain clear and concise documentation of systems, processes, and procedures.
- Identify areas for improvement and implement changes to enhance system reliability and performance.
- Cloud Platform (AWS, Microsoft Azure)
- Automation (DevOps, CI/CD, Terraform)
- Operating System (Windows, Linux)
- Scripting (Shell Scripting, Python, PowerShell)
- Database (MySQL, Oracle, SQL database management)
- Application Deployment (WildFly, JBoss, Apache Tomcat)
- Container Services (Kubernetes, Docker, Helm)