SRE Specialized in HPC/AI - M/F

The position

Our mission

Founded in 1999, Scaleway, the cloud of choice, helps developers and businesses build, deploy and scale applications on any infrastructure. Located in Paris, Amsterdam and Warsaw, Scaleway’s complete cloud ecosystem is used by 25,000+ businesses, including European startups, who choose Scaleway for its multi-AZ redundancy, smooth developer experience, carbon-neutral data centers and native tools for managing multi-cloud architectures. With fully managed offerings for bare metal, containerization and serverless architectures, Scaleway brings choice to the world of cloud computing, offering customers the ability to choose where their customers’ data resides, to choose what architecture works best for their business, and to choose a more responsible way to scale.

Our journey

We want all our actions and decisions to bring us closer to achieving our vision: building and scaling technologies that make sense to us, to our customers and their end users. Scaleway is the challenger nobody’s expecting. As our business scales, the customers and developers we serve are increasingly diverse and global. Giving them an unbeatable experience is central to our business strategy and value proposition. We've found that the best way to understand them, and to deliver the highest value and performance, is to build a well-rounded team that leverages diverse perspectives, knowledge, skills, and cross-cultural understanding.

Our values

Singularity: We do it our own way.

Community: One company, one culture.

Adventure: Level up if you dare, never stop innovating.

Leadership: Be the leader you want to follow.

Excellence: We want to be customers' first choice as a cloud provider.

Rock Solid: You can always count on us!

About the job

With teraflops of computing power available for Scaleway customers, we are looking for an SRE to join our new team specialized in HPC (High Performance Computing).

We are deploying several clusters; a single one of them can rank among the top 15 HPC systems listed in the Top500 (https://www.top500.org).

Reporting to our Engineering Manager Emerick Mounoury, you will be responsible for ensuring the deployment and health of the components of our multiple HPC clusters, which are built on Nvidia hardware.

We expect you to have a strong background in HPC environments and system administration, along with some DevOps experience and familiarity with SRE best practices.

Our systems evolve constantly and the tools we use to monitor and ensure their resilience need to evolve accordingly.

Desired profile

Experience in system programming using at least one of these languages: Python, Bash, Go, etc.
Demonstrated ability to troubleshoot production system failures
A positive mindset and desire to work with a team
Passion for automation and incremental improvements on tooling
Experience with Linux systems based on Debian and CentOS derivatives
Experience with batch job schedulers like Slurm, OAR, SGE
Good understanding of computer networks: TCP/IP, DNS, load balancing, IPv6, firewalls, InfiniBand, VLANs/partitions, …
Storage knowledge: large pools, NAS, S3, …
Experience with Nvidia, CUDA, MPI
Good command of English

Nice to have

Ability to meticulously identify and solve any kind of bug in any codebase
Experience with infrastructure-as-code and continuous deployment
Experience dealing with physical hardware automation
Experience with monitoring & logging systems
Experience handling account management (LDAP)
Knowledge of at least one cloud platform and related use-cases
Experience as an OSS contributor and/or maintainer
Knowledge of AI / LLM / ML / neural networks

Responsibilities

Create new tools and documentation, or optimize existing ones, to help identify, diagnose, and solve production incidents, automating as much as possible
Troubleshoot high-impact issues by working with multiple Engineering teams (Storage, Network, Hardware)
Take on-call responsibilities, mitigate issues encountered in production and answer our customers in real time
Ensure a high quality of service for our customers by leveraging observability and monitoring technologies
Manage the life cycle of HPC clusters in production and take part in escalating hardware and software issues to our suppliers
Empower your teammates to swiftly integrate and deploy software components across our systems
Help implement best practices for stability, resiliency, scalability, security, and performance across our systems

Tech stack

Python/Bash
MySQL
S3 API, Lustre, NAS
Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana
Ansible, Salt
GitLab, Nexus
Ubuntu, Debian, CentOS
Nvidia hardware and software
MPI, Module, AI software
Slurm
K8s
Jira, Confluence, Slack, GSuite

Location

This position is based in our Paris or Lille (France) offices; partial remote work is possible.

Recruitment Process

Screening call - 30 mins with the recruiter

Manager Interview - 45 mins

Technical Interviews - 1h 30 mins

HR Interview - 45 mins

Offer sent - 48 hours

On average, our process lasts 2-3 weeks, and offers usually follow within 48 hours.

Important note: if you don't see yourself ticking all the boxes, don't hesitate to apply anyway. Don't limit yourself to a job description, you never know!

To learn more about us: Scaleway website | Scaleway’s blog | Scaleway on Twitter

The company

Powered by talented and passionate people working hard on democratizing the cloud, Scaleway, the 2nd leading European infrastructure cloud provider, is a multicultural company, rapidly growing into a global brand. We are present in 160 countries, with more than 300 employees of 18 nationalities.

We are a cloud computing pioneer delivering the innovative capabilities of modern multi-cloud, covering a full spectrum of services for professionals: public cloud services with Scaleway Elements, private infrastructures and colocation with Scaleway Datacenter, and bare metal infrastructures with Scaleway Dedibox.

We place people at the heart of our purpose as an enabler of the internet. Our organization encourages responsibility, autonomy, commitment and thought leadership from our collaborators. Our premises are open spaces, conducive to exchange and interaction between individuals.

We believe it is our responsibility to be a positive force in society and to collectively design new systems for a better future. We want to increase access to the digital and technology industry. As our business scales, the customers we serve are increasingly diverse and global. Giving them an unbeatable experience is central to our business strategy. To better understand our customers and partners, we need a workforce that’s as diverse as they are.
Our Diversity, Equity and Inclusion (DE&I) strategy is a strategic asset for nurturing our future business growth, highly visible to our customers and partners. Scaleway has committed to taking a proactive approach to developing the rich skills and competencies of our entire workforce and to opening up professional opportunities in creative and flexible ways, so that we can truly enjoy the rewards of working in a highly diverse, inclusive and global team, regardless of gender, religious beliefs or ethnicity.

Join a community of more than 300 passionate people and become part of a growing company rooted in the world of tomorrow.

Required application materials

To validate your application, please provide the following items. You will need to upload the requested documents directly when you register.

Incomplete applications will not be processed.

Document(s):

Curriculum Vitæ

Practical details

Permanent contract (CDI)
Full-time
Tech & Digital

Required experience level

Experience required

Required education level

No degree required

Driving licence required

Licence not required

Start date

As soon as possible

Salary type

To be determined based on profile

Locations

75000 Paris, Île-de-France
