Qureos


Position Summary

Serve as the Lead for the team, ensuring smooth operation of a Linux cluster of 300+ GPU/CPU compute nodes, including parallel filesystems and a high-performance network. This is a partly technical, partly people-leadership role involving the supervision of 3-4 experienced HPC system administrators. The role covers the development, implementation, and supervision of standard operating procedures for both the system and the team.


Major Responsibilities

  • System operation and upgrade planning to meet laboratory and customer requirements
  • Workload scheduler policy development and implementation
  • Support of high-performance filesystems
  • Network infrastructure management including TCP/IP and HPC networks
  • Use of scripting languages for node automation and configuration management
  • Hardware failure and spare-part management
  • Build effective relationships with staff, faculty and students through the Core Labs
  • Manage multiple or significant projects, which may require sophisticated project-planning techniques
  • Plan, schedule, conduct, or coordinate detailed phases of a major project, or an entire project of moderate scope
  • Identify technical training needs for staff attached to the area
  • Serve as a resource and team member when responding to security and safety incidents
  • Create opportunities to enhance technical methodology or content by expanding existing efforts or developing new ones; may extend technology into new application areas; contribute to or lead major intellectual development activities
  • Provide innovative problem-solving approaches to enhance organizational capabilities; use a peer network to expand technical capabilities and identify new research opportunities
  • Understand broad strategic objectives and contribute to them; nurture and maintain relationships with major customers
  • May initiate new project concepts; develop technical proposals and present them to potential customers
  • Supervise several scientists, engineers, or technicians on assigned work; provide major input to the staffing of overall project teams; build teams and staff to optimize efficiency and cost effectiveness
  • Identify and evaluate candidates for open positions; mentor and train staff in developing technical, project, and business-development skills
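One of the responsibilities above is scripting-based node automation. As a hedged illustration only (the helper name, node prefix, and ranges are hypothetical, not taken from the posting), a small Bash helper that expands a zero-padded node range into hostnames — the kind of building block used when driving actions across many cluster nodes — might look like:

```shell
#!/usr/bin/env bash
# Illustrative sketch: expand a padded node range (gpu001..gpu003) into
# hostnames. All names here are hypothetical examples.
expand_nodes() {
    local prefix=$1 first=$2 last=$3
    local i
    for (( i = first; i <= last; i++ )); do
        # %03d zero-pads the node index to three digits, e.g. gpu007
        printf '%s%03d\n' "$prefix" "$i"
    done
}

# Dry-run usage: print the command that would be run on each node.
for node in $(expand_nodes gpu 1 3); do
    echo "would run: ssh $node uptime"
done
```

In practice such a helper would feed a parallel executor or a configuration-management tool rather than a plain ssh loop; the dry-run echo keeps the sketch safe to run anywhere.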


Competencies

  • SLURM workload manager including GPU scheduling
  • Parallel filesystems (WekaIO, Lustre)
  • TCP/IP and high-performance networks (InfiniBand)
  • Proficient in scripting languages (e.g. Bash, Python, Ruby)
  • Familiar with configuration management tools (Puppet)
  • Proficient documentation skills
  • Will have working-level contact with users and suppliers
  • Demonstrates an analytical and systematic approach to problem solving
  • Takes the initiative in identifying and negotiating appropriate development opportunities
  • Demonstrates effective communication skills in written and oral English
  • Works effectively with other teams in the Supercomputing Laboratory
  • Plans, schedules, and monitors own work (and that of others) competently, to tight deadlines and in line with relevant legislation and procedures
  • Ability to work successfully in a highly collaborative research environment
  • Uses discretion in identifying and resolving complex problems and assignments
  • Performs a broad range of work, sometimes complex and non-routine, in a variety of environments
  • Maintains expert-level knowledge of most of the laboratory's systems, including high-performance computing systems administration, high-performance storage administration, and high-performance network administration
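For context on the first competency, SLURM typically exposes GPU scheduling through generic resources (GRES). A minimal batch script might look like the following; the partition name, resource counts, and workload are purely illustrative and site-specific, not part of the posting:

```shell
#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=gpu          # partition name is site-specific (illustrative)
#SBATCH --nodes=1
#SBATCH --gres=gpu:2             # request 2 GPUs on the allocated node
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

# With GRES device constraints enabled, SLURM typically restricts the job
# to its allocated GPUs (e.g. via CUDA_VISIBLE_DEVICES).
srun nvidia-smi
```

Scheduler policy work of the kind described above then layers partitions, QOS levels, and fair-share accounting on top of such resource requests.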


Qualifications and Experience

  • Bachelor of Science (or equivalent) in a relevant discipline plus 10 years’ experience, OR Master of Science (or equivalent) in a relevant discipline plus 7 years’ experience OR Doctor of Philosophy (or equivalent) in a relevant discipline plus 5 years’ experience.


© 2026 Qureos. All rights reserved.