Merck India

Genetics Data Engineer

Bengaluru, KA

Job Description


Senior Principal Data Engineer - Genetics

Your Role

You will advance our human quantitative genetics strategy by providing the data engineering fundamentals that enable downstream analysis. You will collaborate with quantitative geneticists, data scientists, data engineers, platform experts, IT, and others to ensure that data availability and quality are never the bottlenecks for our analyses. In that collaboration, you will provide the vision and implementation for how our FAIR data environment works across internal and external platforms such as biobank trusted research environments (TREs).

You will work with other scientists to:

  • Bring software tools, reference datasets, and genetic data into internal environments.

  • Bring tools, containers, and reference data into TREs (e.g., UK Biobank Research Analysis Platform, All of Us Researcher Workbench) and manage their deployment and versioning.

  • Ingest and maintain connections to genetic reference databases (OpenTargets, GWAS Catalog, ClinVar, dbSNP, OMIM, HGMD, ChEMBL, DrugBank) and integrate them with the internal knowledge graph (Synaptix) and analytics platforms.

  • Perform and automate data QC for diverse genomic data types, including SNP arrays, whole-exome sequencing (WES), whole-genome sequencing (WGS), and GWAS summary statistics.

  • Develop, test, and execute reproducible analysis pipelines using workflow managers (e.g., Nextflow, WDL, Snakemake) and containerized environments (Docker, Singularity) for deployment within TREs.

  • Return results from TREs to our internal platforms in accordance with each biobank's privacy and data protection policies.

  • Optimize query performance and pipeline execution to support rapid-turnaround target assessments and in-licensing due diligence (20-25 targets per year requiring fast genetic evaluation).

  • Contribute to the design and implementation of agentic AI workflows for automated genetic evidence generation, integrating genetics pipelines with the broader agentic AI platform.

  • Build and maintain interactive dashboards and data services that expose genetic evidence to project teams, leadership, and due diligence committees.

  • Link genetic data to our AI tools and platforms, ensuring seamless data flow between genetic analyses and downstream decision-support systems.

  • Automate routine analyses, including standard safety assessments, target-disease association lookups, and genetic evidence reports, to minimize the time geneticists spend on repetitive tasks.

  • Manage cloud compute budgets and optimize resource usage within TREs to maximize analytical throughput within allocated funding.

Who You Are

You have substantial expertise in data engineering for scientific and genomic data and are comfortable working on both strategic questions and hands-on implementation. You have:

  • Bachelor's or Master's degree in computer science, data engineering, bioinformatics, or a related field.

  • A minimum of 5 years of relevant experience in data engineering, with significant exposure to genomics or bioinformatics data.

  • Strong experience building production-grade data pipelines for genetic and genomic datasets, including familiarity with common formats (VCF, PLINK/BED/BIM/FAM, BGEN, GWAS summary statistics).

  • Hands-on experience with biobank trusted research environments such as UK Biobank (DNAnexus), All of Us Researcher Workbench, or similar platforms.

  • Strong expertise in Python and R for data ingestion, processing, cleaning, and pipeline orchestration.

  • Experience with genomics-specific tools and frameworks (e.g., Hail, PLINK, bcftools, samtools, liftOver) and workflow managers (Nextflow, WDL, Snakemake).

  • Expertise in working with large, complex datasets in cloud or HPC environments, including tools such as Spark, S3, and cloud-native compute platforms (AWS, GCP, Azure).

  • Experience with containerization (Docker, Singularity) and infrastructure-as-code practices for reproducible deployments in secure environments.

  • Solid understanding of data modeling, versioning, and reproducibility principles.

  • Experience with methods and requirements for medical and genetic data privacy, including biobank data governance and controlled-access data handling.

Preferred Qualifications

  • Experience with additional biobank platforms or multi-ethnic datasets (e.g., BioBank Japan, Galatea, FinnGen).

  • Familiarity with agentic AI frameworks or experience building LLM-integrated data pipelines and automated reporting tools.

  • Experience building dashboards or visualization tools (e.g., Shiny, Streamlit, Plotly Dash) for scientific audiences.

  • Background in pharmaceutical or biotech R&D environments, particularly supporting genetics or genomics teams.

  • Experience with API development (REST/GraphQL) and MCP for serving analytical results to downstream applications.