Position Summary
Serve as the lead of the team that ensures smooth operation of the Linux cluster, which consists of 300+ GPU/CPU compute nodes together with parallel filesystems and a high-performance network. This is partly a technical role and partly a people-leadership role, involving supervision of 3-4 experienced HPC system administrators. The role also involves developing, implementing and supervising standard operating procedures for the system and the team.
Major Responsibilities
- System operation and upgrade planning to meet laboratory and customer requirements
- Workload scheduler policy development and implementation
- Support of high-performance filesystems
- Network infrastructure management, including TCP/IP and HPC networks
- Use of scripting languages for node automation and configuration management
- Management of hardware failures and spare parts
- Builds effective relationships with staff, faculty and students through the Core Labs
- Manages multiple or significant projects which may require the use of sophisticated project planning techniques
- Plans, schedules, conducts, or coordinates detailed phases of the work in a major project or in a complete project of moderate scope
- Identifies technical training needs for staff attached to the area
- Serves as a resource and team member in responding to security and safety incidents
- Creates opportunities to enhance technical methodology or content through expansion of existing efforts or development of new ones; may extend technology into new application areas; contributes to or leads major intellectual development activities
- Provides innovative problem-solving approaches to enhance organizational capabilities; uses peer network to expand technical capabilities and identify new research opportunities
- Understands broad strategic objectives and contributes to them; nurtures and maintains relationships with major customers
- May initiate new project concepts; develops technical proposals and makes presentations to potential customers
- Supervises several scientists, engineers or technicians on assigned work; provides major input to staffing of overall project teams; builds teams and staff to optimize efficiency and cost effectiveness
- Identifies and evaluates candidates for open positions; mentors and trains staff in development of technical, project and business development skills
Competencies
- SLURM workload manager, including GPU scheduling
- Parallel filesystems (WekaIO, Lustre)
- TCP/IP and high-performance networks (InfiniBand)
- Proficient in scripting languages (e.g. Bash, Python, Ruby)
- Familiar with configuration management tools (Puppet)
- Strong documentation skills
- Will have working-level contact with users and suppliers
- Demonstrates an analytical and systematic approach to problem solving
- Takes the initiative in identifying and negotiating appropriate development opportunities
- Demonstrates effective communication skills in written and oral English
- Works effectively with other teams in the Supercomputing Laboratory
- Plans, schedules and monitors own work (and that of others) competently within limited deadlines and according to relevant legislation and procedures
- Ability to work successfully in a highly collaborative research environment
- Uses discretion in identifying and resolving complex problems and assignments
- Performs a broad range of work, sometimes complex and non-routine, in a variety of environments
- Maintains expert-level knowledge of most of the laboratory systems, including high-performance computing systems administration, high-performance storage administration, or high-performance network administration
Qualifications and Experience
- Bachelor of Science (or equivalent) in a relevant discipline plus 10 years’ experience, OR Master of Science (or equivalent) in a relevant discipline plus 7 years’ experience, OR Doctor of Philosophy (or equivalent) in a relevant discipline plus 5 years’ experience.