Data Engineer

Scienta Lab is hiring!

About

Scienta Lab is a deeptech company harnessing artificial intelligence to transform the drug discovery and development process in immunology and inflammation.
With its unique and proprietary EVA foundation model dedicated to immune-mediated diseases, Scienta Lab leverages multimodal data to bridge the translational research gap and to accelerate both the validation of new therapeutic targets and the development of personalized treatments. The company’s research activities, led in partnership with top-tier academic institutions across Europe, are regularly featured in medical journals and at international congresses.
Scienta Lab is based in Biolabs Hôtel Dieu and was selected for the 2023 edition of the Future 40 program, which rewards the most promising startups of Station F each year. In December 2023, the company announced a €4M seed round from CentraleSupélec Venture and a team of world-class business angels, and it is now seeking top talent to accelerate its development. In June 2025, Scienta Lab was named a laureate of the EIC Accelerator program, which selected it as one of the most innovative deeptech startups across Europe and provides significant funding to the company.
We are a team with diverse backgrounds in AI, Computational Biology, Immunology, and the pharma industry, and half of the company holds a PhD. Join us and be a part of our exciting journey to unravel the mysteries of the immune system!

Discover who we are and what we value here.

Job Description

As a Biomedical Data Engineer at Scienta Lab, you will be the backbone of our data infrastructure, designing and maintaining robust, scalable systems that power our EVA foundation model and enable breakthrough discoveries in immunology and inflammation research. You will work with complex, multi-modal biological datasets including histology, transcriptomics, proteomics, and clinical data, ensuring their quality, accessibility, and compliance with healthcare regulations.

Your role is critical to our mission of transforming drug discovery through AI. You will build the data pipelines that fuel our machine learning models, working closely with computational biologists, AI researchers, and pharmaceutical partners to ensure seamless data flow from raw biological samples to actionable therapeutic insights.

You will collaborate with interdisciplinary teams including biologists, clinicians, AI researchers, and regulatory specialists, ensuring that our data infrastructure meets both scientific rigor and industry compliance standards.

We value engineering excellence and data integrity, and we aim to build world-class data systems that enable reproducible science and accelerate therapeutic development. This position offers excellent opportunities to work with cutting-edge biological datasets and contribute to life-changing medical breakthroughs.

Main Missions:

  • Design, implement, and maintain robust ETL/ELT pipelines for complex biological datasets including RNA-seq, single-cell sequencing, proteomics, clinical trial data, and real-world evidence datasets

  • Build systems to harmonize and integrate diverse data types (genomics, transcriptomics, proteomics, histology, clinical metadata) while preserving biological relationships and experimental context

  • Implement comprehensive data quality frameworks including automated validation, anomaly detection, and quality control metrics specific to biological data characteristics (batch effects, technical artifacts, missing values)

  • Build and maintain comprehensive data catalogs, metadata management systems, and data lineage tracking to ensure reproducibility and traceability in scientific research

  • Develop robust APIs and data services that enable seamless access to biological datasets for ML training, research analysis, and client applications

Preferred Experience

Who we are looking for

  • Bachelor's or Master's degree in Computer Science, Data Engineering, Bioinformatics, or a related technical field

  • 5+ years of experience in data engineering with strong proficiency in Python, SQL, and modern data orchestration tools

  • Extensive experience with at least one major cloud platform (AWS, GCP, Azure) including data services, storage solutions, and compute resources

  • Proficiency with both SQL and NoSQL databases, including experience with time-series databases and graph databases for biological data relationships

How to stand out

  • You have high agency, look for solutions rather than problems, and are a team player willing to go beyond your job description

  • Experience working with at least one type of omics data (genomics, transcriptomics, proteomics, metabolomics), including an understanding of file formats and databases (FASTQ, HDF5, AnnData, TileDB-SOMA)

  • Familiarity with bioinformatics workflows and tools (Nextflow, Snakemake, Galaxy) and biological databases (NCBI, Ensembl, UniProt)

  • Experience building data pipelines that feed machine learning models, including feature stores and model training infrastructure

  • You have experience working in dynamic, fast-changing environments such as startups

Additional Information

  • Contract Type: Full-Time
  • Location: Paris
  • Occasional remote work authorized