Details of the role:
Site Reliability Engineer.
Key responsibilities:
· Monitor computer systems and build alerts for the operational issues they can experience. You will work with our IT team to ensure our organization can continue to deliver its products and services reliably in our computer system environment.
· Participate in our on-call rotation to assess, mitigate, or escalate incidents, and help continuously improve reliability.
Required Skills:
· SRE is a support role, so a candidate or site reliability engineer who is currently working on a cloud transformation project would qualify for this role.
· Exposure to and experience with the following skills: AWS, Google Cloud, system migration projects, Python, Kubernetes, tooling and API interactions.
· Good experience with AWS and/or Google Cloud.
· Experience in application performance engineering, specifically in the data reliability space.
· Experience migrating from on-premises infrastructure to Azure using Kubernetes (K8s).
· Comfortable working with Kubernetes (K8s).
· Good communication skills and a positive attitude.
· Python experience, including simple tooling and API interactions.
· Good understanding of service level indicators (SLIs).
· Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems.
· Monitoring and logging: Linux, Prometheus, Grafana, ELK.
· SRE ensures that Google Cloud's services, both our internally critical and our externally visible systems, have the reliability and uptime appropriate to customers' needs, and a fast rate of improvement.