About this role
Key Areas of Responsibility
• Own and support monitoring and SRE operations, ensuring system reliability, availability, and performance.
• Build, enhance, and maintain monitoring solutions using ITRS Geneos, Prometheus, Victoria‑Metrics, Elasticsearch, and Grafana.
• Develop, optimize, and maintain alerting rules, dashboards, and observability pipelines.
• Troubleshoot and resolve complex issues during major incidents, providing clear and timely communication.
• Troubleshoot Linux servers (RHEL 7/8/9), including upgrades, configurations, patching, and maintenance, while determining appropriate monitoring requirements for system changes.
• Analyze logs, investigate issues, and perform fault finding to identify performance exceptions.
• Collaborate with engineering, application, and infrastructure teams to improve system resilience, stability, security, efficiency, and scalability.
• Contribute to automation strategies, deployment processes, and continuous operational improvements.
• Participate in on‑call rotations, including off‑hours and scheduled weekend support.
• Participate in Disaster Recovery (DR) and Business Continuity Planning (BCP) drills.
• Continuously research and adopt modern monitoring and SRE tools and practices.
Requirements
• Bachelor’s degree in computer science or engineering.
• Minimum of 8 years’ experience in IT or investment banking.
• Strong experience with monitoring and observability platforms, including ITRS Geneos, Prometheus, Victoria‑Metrics, Elasticsearch, Grafana, and Kibana.
• Hands-on experience building and implementing Prometheus pipelines, including exporters, scraping configurations, relabelling, metric routing, and integrations with long‑term storage (e.g., Victoria‑Metrics).
• Experience building and maintaining Logstash pipelines, including ingestion, parsing, filtering, enrichment, and routing of logs into Elasticsearch.
• Ability to design, build, and maintain Grafana and Kibana dashboards for metrics, logs, and performance analytics across distributed systems.
• Solid understanding of metrics, logging, alerting, dashboards, and observability pipelines.
• Strong Linux administration skills (RHEL 7/8/9), including troubleshooting, upgrades, configuration, patching, and performance optimization.
• Good understanding of SRE principles, high availability, scalability, incident management, and Disaster Recovery (DR) and Business Continuity Planning (BCP) activities.
• Experience with automation (e.g., Bash, Python, Ansible, CI/CD tools) is an advantage.
• Understanding of networking fundamentals, performance tuning, and troubleshooting distributed systems.
• Prior experience in Production Support, SRE, Monitoring Engineering, or Shared Services Operations, with participation in on‑call rotations, including after-hours and weekend support.
• Strong analytical, problem‑solving, and communication skills, with the ability to work collaboratively under pressure.
• Self-motivated and adaptable, able to prioritize, learn continuously, and manage multiple responsibilities effectively.
• Fluent in English.