cFocus Software seeks a System Reliability Engineer to join our program supporting the Executive Office of the President. This position is remote and requires a TS/SCI clearance.
Qualifications:
- 5+ years of experience and a Bachelor's degree in Computer Programming, Science, Engineering, or a related technical discipline, or an equivalent combination of education, technical training, or work/military experience, including:
- 3+ years of related systems programming experience
- Experience maintaining an operational environment and using monitoring tools and dashboard interfaces (e.g., Kibana, Grafana)
- Experience working with container images and platforms (Kubernetes/Docker)
- Strong understanding of DevOps and software/application development processes
- Understanding of GitLab, Jenkins, ArgoCD, and other DevOps/Continuous Integration tools for Kubernetes
- Understanding of microservice design and architectural pattern best practices
- Understanding of Python, Bash, and shell scripting
- Knowledge of network technologies, common infrastructure components, load balancers, firewalls, and virtual and physical infrastructure design
- Problem-solving and troubleshooting skills
- Communication and interpersonal skills
- Must possess excellent time management skills and the drive to work unsupervised
- Experience deploying to on-premises/data center infrastructure
- Experience using Jira and Confluence on a daily basis
- Experience building processes for deploying to a Kubernetes-based environment using GitLab and Helm
- Understanding of access management and security groups (e.g., IAM, S3 buckets, SSH, VPN)
- Ability to write and use unit and functional tests
- Technical Skills: Proficiency in programming languages (such as Python, Go, or Bash) is essential for scripting and automation tasks. Knowledge of Linux/Unix systems is also crucial, as SREs often work in these environments.
- Problem-Solving: Analytical and problem-solving skills are necessary to diagnose and resolve complex system issues effectively.
- Understanding of SRE Principles: Familiarity with key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets is important for measuring and maintaining system reliability.
- Reliability and Availability: SRE practices help ensure that services are consistently available and reliable, which is critical for user satisfaction and business success.
- Scalability: SREs implement strategies that allow systems to scale efficiently as demand increases, ensuring that performance remains optimal even under heavy load.
- Cost Management: By optimizing resource usage and reducing downtime, SREs contribute to cost savings for organizations.
- Programming and Scripting: Proficiency in languages like Python, Go, or Ruby is crucial for automating tasks and managing infrastructure.
- Operating Systems: A strong understanding of Linux/Unix systems is essential for troubleshooting and managing servers.
- Cloud Computing: Familiarity with cloud platforms like AWS, Azure, or Google Cloud is vital for deploying and managing applications in distributed environments.
- Containers & Orchestration: Understanding containerization tools like Docker and managing containerized workloads with Kubernetes is crucial for cloud-native applications.
- Monitoring and Logging: Proficiency in tools like Prometheus, Grafana, or Elasticsearch, Logstash, and Kibana (ELK) Stack is necessary for tracking metrics, setting up alerts, and analyzing logs.
- Networking: Knowledge of networking protocols and configurations is essential for maintaining system health and performance.
- Configuration Management: Skills in managing and maintaining system configurations are critical for ensuring system reliability.
- Incident Response: Ability to respond quickly and effectively to incidents, including documenting and learning from them.
- Security Best Practices: Understanding security protocols and best practices to protect systems from vulnerabilities.
- These skills are essential for SREs to maintain high availability and performance, balancing the demands of development and operations.
- Support required during core business hours of 8am – 5pm, Monday through Friday.
- On-call for evenings or weekends, if needed for outages, application upgrades, security patches or other unplanned activities.
Duties:
- Monitor system health, availability, and performance using centralized monitoring and logging tools.
- Administer accounts (role-based access and rights).
- Manage accessibility to the application through EOP’s authentication systems.
- Manage the workflow templates to ensure consistent and predictable task flows.
- Configure new workflows or adjust existing ones based on user requests, while adhering to EOP template standards.
- Maintain configurations and configurable fields for users and workflows.
- Maintain the test environment to mimic production, and conduct test and evaluation in that environment prior to deployments.
- Design and maintain secure, reliable backups, ensuring High Availability (HA) and resiliency.
- Develop a Disaster Recovery (DR) or Incident Response (IR) plan for specific applications and services in the event of a disaster or unexpected downtime.
- Maintain unique instances that support various offices.
- Configure and support integrations with complementary systems.
- Establish and improve system monitoring while maintaining established security protocols within development, test, and production systems.
- Architect, build, and maintain on-premises and/or cloud infrastructure to support team and customer initiatives.
- Maintain and improve existing infrastructure (build out autoscaling, support new services, optimize for cost efficiencies/authentication/search, etc.).
- Administer production, staging, and development environments.
- Manage and aggregate server logs and monitor for security and system-related incidents.
- Monitor and analyze system performance, such as server load and resource usage.
- Maintain and improve existing build and deployment processes using CI/CD tools.
- Apply configuration management disciplines to maintain software revisions, security patches, hardening, and documentation.
- Enforce best practices for security and reliability, and drive security initiatives such as access control and vulnerability testing.
- Maintain up-to-date documentation of designs/configurations, ensuring team members have continuity of recurring tasks.
- Maintain status of operations at all times: perform after-action reporting on all outages and work with engineering teams to determine the root cause and solution. Present findings to management for prioritization and tasking.
- Create and determine required metrics for dashboards and service health.
- Follow up on engineering tasks for operational solutions, and validate completion.
- Manage the operational readiness board: present at weekly meetings and determine whether development services are ready for automation based on best practices and maintainability.
- Track and ensure routine operations maintenance tasks are completed in a timely manner.
- Align to the customer's strategies for configuration of workflows, without compromising the integrity of the workflow tool and templates.
- Build, maintain, and utilize the customer's enterprise Development, Security, and Operations (DevSecOps) pipeline.
- Work with other service providers to support areas of common interest.
- On-call support may be required.