Arman Kazmi

Founding Engineer · SoulBio

ML infrastructure and data systems for precision oncology

Applied ML and data engineering for drug discovery. I work on production systems—data pipelines, cloud infrastructure, and ML deployment—used by biotech teams in practice. BS-MS from IISER Bhopal.

About

I build ML and data engineering systems for drug discovery, currently focused on oncology. At SoulBio, I work closely with biotech teams to design and deploy end-to-end infrastructure—data pipelines, models, cloud systems, and internal tools—that help them move faster from raw data to usable biological insight.

I began in NLP research, but moving into biological data changed how I think about ML. In drug discovery, the data is messy, context matters, and models don’t live in isolation. A lot of the real work is in the pipelines, the assumptions, and the handoff between computation and biology. That’s where I tend to focus.

Alongside client work, I build internal tools at SoulBio, including AI agents and bioinformatics workflows (especially around RNA-seq), to reduce repetitive analysis and help scientists move faster. I enjoy working on problems where good engineering choices make complex work feel simpler and more reliable.

My background is in applied machine learning and data engineering rather than deep biology, and I like building technology that simplifies complex workflows through automation and better tooling.

Experience

Feb 2024 – Present

Founding Engineer

SoulBio

  • Client-facing systems
  • Developed and deployed open-source drug targetability and selectivity models, reducing hypothesis validation time by ~40% for oncology-focused drug discovery teams.
  • Re-architected bulk and single-cell RNA-seq ingestion and computation workflows on a ~1 TB Postgres database, optimizing large-scale table operations involving millions of daily row updates and reducing processing time by ~10× (from ~1 day to a few hours).
  • Built and productionized RNA-seq pipelines, internal APIs, and CI/CD automation to support large-scale biological data workflows and downstream analysis.
  • Built and deployed an internal visualization and analysis platform over in-house data to support computationally heavy downstream analyses.
  • Maintained AWS cloud infrastructure for all deployed applications, including reliability, scaling, and cost considerations.
  • Authored technical documentation and PRDs for internal platforms and new client projects.
  • Internal platform & research
  • Co-authored a research paper analyzing GPT-based cell type annotations and highlighting their limitations in real biological settings.
  • Co-developed and deployed an LLM-based agent enabling natural language search across ~250,000 GEO datasets.
  • Proposed and built an AI agent to automate complex RNA-seq bioinformatics workflows, reducing manual overhead and allowing scientists to focus on interpretation rather than pipeline orchestration.
  • Wrote technical blog posts and whitepapers for external communication and outreach.

May 2022 – Dec 2023

Data Analyst

Elucidata

  • Curated large-scale biomedical datasets using data-centric models, enabling FAIR data generation and ML-based research workflows
  • Implemented information extraction models and pipelines for biological databases (GEO and PubMed), improving accuracy by ~20%
  • Co-authored a multi-task learning approach achieving 3× faster biomedical NER inference
  • Fine-tuned a proprietary 3B-parameter language model using PEFT techniques, resulting in a 40% increase in performance on custom instructions tailored for biomedical data curation tasks
  • Built a GPT-powered ontology normalization tool achieving 96% mapping accuracy

Dec 2021 – Apr 2022

Python Developer Intern

MindBowser

  • Built and tested REST APIs in Python
  • Optimized data-processing scripts and improved code quality through refactoring and modularization

Education

Aug 2017 – Apr 2022

BS-MS in Electrical Engineering & Computer Science

Indian Institute of Science Education & Research (IISER), Bhopal, India

CGPA: 7.34/10.00

Apr 2015 – Apr 2016

Senior Secondary, ISC (Class XII)

HBEC, Kanpur, India

Percentage: 89.2/100

Research Projects

May 2021 – Apr 2022 Bhopal, India

MS Thesis: Identifying Manipulative Writing Style From Shorter Texts

Advisors: Dr. Arpit Sharma & Dr. Rajakrishnan Rajkumar

  • Conducted research on detecting manipulative writing in paragraph-level texts, revealing a resemblance between manipulative texts and fictional writing styles
  • Proposed 3 novel syntactic features from parse trees addressing word order variations, syntactic complexity, and argument-adjunct patterns from psycholinguistics literature
  • Achieved 92% accuracy with a classical ML classifier and state-of-the-art 96% accuracy using a fine-tuned BERT model
  • Published thesis work at COLING 2022

Jan 2022 – Apr 2022 Bhopal, India

Extracting Causality from Natural Language

Advisors: Dr. Arpit Sharma & Dr. Rajakrishnan Rajkumar

  • Conducted extensive research on extracting cause-effect pairs from English sentences
  • Proposed and defined rules based on dependency relations of causal sentences for causality extraction

Jan 2021 – Apr 2021 Bhopal, India

BS Thesis: Importance of POS Tags in Document Level Genre Classification

Advisor: Dr. Kushal Shah

  • Implemented Markov Chains for genre identification using transition matrices built over letters and POS tags
  • Evaluated the importance of POS tags in predicting fiction vs non-fiction text genres
  • Quantified the significance of adverbs, adjectives, and pronouns in determining the fictional nature of text

May 2019 – Jul 2019 Bhopal, India

Writing Style of News Articles

Advisor: Dr. Kushal Shah

  • Researched sentiments in news articles and their writing styles
  • Developed and deployed the backend for the NewsChase app, which categorizes articles as Reliable, Imaginative, or Alright

Publications & Whitepapers

bioRxiv 2025 Co-Author

DeepGEOSearch: LLM-Powered Schemaless Retrieval for Biomedical Data Discovery

Deepshikha Singh, Shashank Jatav, Shefali Lathwal, Arman Kazmi, Soumya Luthra.

LLM-powered schemaless retrieval system enabling semantic search across diverse biomedical datasets without predefined schemas.

Abstract

Discovering and accessing relevant biomedical data across heterogeneous repositories remains a significant bottleneck in research workflows. Traditional schema-based retrieval systems require predefined metadata structures, limiting their applicability to diverse data sources. We present DeepGEOSearch, an LLM-powered retrieval system that enables semantic search and discovery of biomedical datasets through natural language queries without requiring explicit schema definitions. Our approach leverages large language models for flexible data interpretation and ranking, allowing researchers to discover datasets based on experimental intent rather than rigid schema constraints. We demonstrate the system's effectiveness on large-scale biomedical data repositories, showing improved discoverability and usability compared to traditional keyword-based approaches.

bioRxiv 2024 First Author

Beyond the Hype: The Complexity of Automated Cell Type Annotations with GPT-4

Arman Kazmi, Deepshikha Singh, Shashank Jatav, Soumya Luthra.

Comprehensive evaluation of GPT-4 for cell type annotation, revealing limitations and introducing improved methods with RAG enhancement.

Abstract

Recent research has shown the impressive capability of large language models like GPT-4 in various downstream tasks in single-cell data analysis. Among these tasks, cell type annotation remains particularly challenging, with researchers exploring various methods to improve accuracy and efficiency. While recent studies on GPT-like models have demonstrated annotation performance comparable to manual annotations, a significant gap remains in understanding their limitations and generalizability. In this work, we compare and evaluate the annotation performance of the GPT-4 model against traditional methods on nine randomly selected public single-cell RNA-seq datasets from cellxgene, covering diverse tissue types. Our evaluation highlights the complexity of annotating cell types in single-cell data, revealing key differences between automated and manual approaches. We found specific cases where GPT-4 underperforms, demonstrating its limitations in certain contexts. We further introduce an automated approach to incorporate literature search using a RAG approach which enhances and outperforms GPT-4 cell type annotation when compared to traditional methods. We also introduce metrics based on taxonomic distance in the ontology tree to evaluate the granularity of the cell type annotations. To support future research, we also release an open-source Python package that enables fully automated cell-type annotation of single-cell data using GPT-4 alongside other methods. The pipeline can take a paper as input and perform cell type annotation on its own.

Whitepaper 2024

ReMAP: Repurposing through Multi-Omics Analysis and Prediction

Multi-omics machine learning approach for predicting drug responses and identifying new therapeutic indications in oncology.

Abstract

The complexity and high costs of traditional drug development have driven increased interest in drug repurposing, a strategy that explores new therapeutic uses for existing drugs. This approach leverages established safety profiles and can significantly accelerate the drug development process. Recent advances in computational methods, particularly those that utilize multi-omics data, have further enhanced the potential for systemic drug repurposing. In this study, we present ReMAP (Repurposing through Multi-Omics Analysis and Prediction), a model designed for predicting drug responses, with a focus on applications in cancer treatment. By integrating somatic mutations, copy number aberrations, and gene expression data, ReMAP outperforms traditional single-omics and early integration approaches in predictive accuracy. Utilizing data from the PRISM database, we identify potential new indications for drugs, including Dacomitinib for Head & Neck Squamous cell cancer. This work demonstrates the potential of multi-omics integration and machine learning to revolutionize drug repurposing in oncology, bridging the gap between computational predictions and clinical applications.

Whitepaper 2023

Leveraging Machine Learning for Robust Cell Type Annotation: A Data-Driven Perspective

Benchmarking study of cell-type annotation methods with recommendations for improving quality and reproducibility in scRNA-seq analysis.

Abstract

Cell-type annotation of scRNA-seq data is a complex data-driven process that can be impacted by user bias. Reliable cell-type annotation is crucial, and we at Elucidata have been actively working towards building high-quality pipelines to accurately and reproducibly annotate cell types. When compared to author-assigned annotations, automated methods for cell-type identification in scRNA-seq data show limited agreement, indicating substantial variability in published cell annotations. The choice of reference data has a more pronounced impact on computational cell-type predictions than the specific algorithm employed, underscoring the data-centric nature of this problem. The whitepaper explores available cell-type annotation methods, shares the results of an extensive in-house benchmarking study, and introduces Elucidata's approach to improving quality and reproducibility.

COLING 2022 First Author

Linguistically motivated features for classifying shorter text into fiction and non-fiction genre

Arman Kazmi, Sidharth Ranjan, Arpit Sharma, and Rajakrishnan Rajkumar.

Linguistically motivated syntactic features enabling interpretable genre classification of short texts, with BERT reaching 97–98% accuracy.

Abstract

This work deploys linguistically motivated features to classify paragraph-level text into fiction and non-fiction genre using a logistic regression model and infers lexical and syntactic properties that distinguish the two genres. Previous works have focused on classifying document-level text into fiction and non-fiction genres, while in this work, we deal with shorter texts which are closer to real-world applications like sentiment analysis of tweets. Going beyond simple POS tag ratios proposed in Qureshi et al. (2019) for document-level classification, we extracted multiple linguistically motivated features belonging to four categories: Lexical features, POS ratio features, Syntactic features and Raw features. For the task of short-text classification, a model containing 28 best-features (selected via Recursive feature elimination with cross-validation; RFECV) confers an accuracy jump of 15.56% over a baseline model consisting of 2 POS-ratio features found effective in previous work (cited above). The efficacy of the above model containing a linguistically motivated feature set also transfers over to another dataset viz., Baby BNC corpus. We also compared the classification accuracy of the logistic regression model with two deep-learning models. A 1D CNN model gives an increase of 2% accuracy over the logistic regression classifier on both corpora. And the BERT-base-uncased model gives the best classification accuracy of 97% on Brown corpus and 98% on Baby BNC corpus. Although both the deep learning models give better results in terms of classification accuracy, the problem of interpreting these models remains unsolved. In contrast, regression model coefficients revealed that fiction texts tend to have more character-level diversity and have lower lexical density (quantified using content-function word ratios) compared to non-fiction texts. Moreover, subtle differences in word order exist between the two genres, i.e., in fiction texts Verbs precede Adverbs (inter-alia).

ICON 2022 Co-Author

Reducing inference time of biomedical NER tasks using multi-task learning

Mukund Chaudhry, Arman Kazmi, Shashank Jatav, Akhilesh Verma, Vishal Samal, Kristopher Paul, Ashutosh Modi.

Multi-task learning framework achieving 3× faster biomedical NER inference without sacrificing accuracy.

Abstract

Recently, fine-tuned transformer-based models (e.g., PubMedBERT, BioBERT) have shown state-of-the-art performance on a number of BioNLP tasks, such as Named Entity Recognition (NER). However, transformer-based models are complex and have millions of parameters, and, consequently, are relatively slow during inference. In this paper, we address the time complexity limitations of the BioNLP transformer models. In particular, we propose a Multi-Task Learning based framework for jointly learning three different biomedical NER tasks. Our experiments show a reduction in inference time by a factor of three without any reduction in prediction accuracy.