Duties and Responsibilities:
- Design, implement, and manage systems and infrastructure to ensure high availability and performance.
- Develop and maintain monitoring and alerting systems to detect and address issues proactively.
- Automate repetitive tasks and streamline operations through scripting and configuration management.
- Collaborate with development and operations teams to improve system reliability and scalability.
- Respond to incidents and conduct post-incident analysis to resolve issues and prevent recurrence.
- Conduct capacity planning and performance tuning to meet operational demands.
- Develop and maintain documentation for system operations and reliability practices.
Requirements and Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- Proven experience in systems administration, operations, and reliability engineering.
- Proficiency with orchestration, configuration management, and infrastructure-as-code tools (e.g., Kubernetes, Ansible, Terraform).
- Strong understanding of system monitoring, performance tuning, and incident management.
- Experience with cloud platforms (e.g., AWS, Google Cloud, Azure) and containerization technologies.
- Excellent problem-solving skills and the ability to work collaboratively with cross-functional teams.