Administer and maintain the Amazon SageMaker platform, including Studio, Notebooks, Training Jobs, Inference Endpoints, and Pipelines.
Implement and enforce security best practices, including IAM roles, VPC configuration, encryption, and access controls.
Automate environment setup, user onboarding, and resource provisioning using CloudFormation, Terraform, or AWS CDK.
Monitor platform usage, performance, cost, and health using CloudWatch, AWS Cost Explorer, and related tools.
Collaborate with data scientists to troubleshoot platform issues, optimize compute usage (e.g., instance selection, Spot vs. On-Demand), and improve model deployment workflows.
Maintain custom Docker images, lifecycle configurations, and shared datasets.
Integrate SageMaker with other AWS services (e.g., ECR, S3, EFS, Secrets Manager, CodePipeline, CloudTrail).
Ensure the platform complies with enterprise governance and security policies.
Drive improvements in platform usability, documentation, automation, and overall ML developer experience.