Overview
About Business Unit:
SaaSOps leads post-production support and the overall experience of Epsilon PeopleCloud products for our global clients. This function is responsible for product support, incident management, managed operations and the automation of processes. The team has successfully incubated and mainstreamed Site Reliability Engineering (SRE) as a practice to ensure reliable product operations on a global scale. The team is also actively leading the adoption of AI in operations (AIOps) and recently launched AI-driven self-service capabilities to enhance operational efficiency and improve client experiences.
Responsibilities
- This is a senior individual contributor (IC) role responsible for driving strong operations engineering practices in SaaS product operations.
- The role will drive incident triage practices, implement effective monitoring and observability tools, and help build SRE competence in the team.
- The role will work closely with the product operations team to deep-dive into production issues, identify their root causes, and work with the relevant teams on permanent fixes for recurring issues.
- The role will identify automation opportunities to streamline repetitive tasks.
- The role will contribute to the evolution of the AIOps strategy: identify use cases and develop AI / agentic autonomous solutions.
Qualifications
- 15+ years of hands-on experience in SRE.
- The candidate will be a hands-on technology leader with proven experience working as an SRE leader in a SaaS product setup.
- The candidate should have a deep understanding of monitoring tools (New Relic, Prometheus) and observability practices.
- Prior experience working with ServiceNow, JIRA, Bitbucket and Confluence is required.
- The candidate should be proficient at designing effective Ops dashboards, especially for peak traffic events in a SaaS environment.
- The candidate should have prior experience handling communications with leadership across an organization for peak traffic events.
- The ideal candidate should have a strong full-stack engineering background with Cloud Engineering, L1-L3 Operations and AI / Gen AI experience.
- Must have strong development skills in at least two of Python, Java and C#; strong DB skills (RDBMS, NoSQL, cloud DBs); and experience with containers / orchestration and cloud infrastructure.
- Highly proficient in at least one hyperscaler cloud (AWS, GCP, Azure).
- Demonstrated real-world experience deploying traditional ML and Gen AI use cases in production.
- The candidate should have experience working closely with Engineering and Operations teams, and must have strong DevOps, incident management, release management and change management experience.
- Prior experience with at least one AIOps solution is preferred.
- Must have proven skills in collaboration and getting things done.
- ITIL certification and experience working in an ITIL environment will be a plus.
Additional Information
Our pillars aren't just words. They're how we show up every day.
- People centricity: We focus on employee well-being in an environment where colleagues truly care about each other.
- Collaboration: We work together, support one another and collectively achieve goals.
- Growth: There are endless opportunities for growth through learning, development and career advancement.
- Innovation: We drive progress through cutting-edge solutions and forward-thinking approaches.
- Flexibility: We've created a balance between work and personal life, and we encourage adaptability to solve problems creatively.
Our values guide us to create value for our clients, our people and consumers.
- Act with integrity
- Work together to win together
- Innovate with purpose
- Respect all voices
- Empower with accountability
These pillars and values are our foundation, shaping our culture, guiding our decisions and uniting us in common purpose.