Arman Kazmi

Founding Engineer · SoulBio

ML infrastructure and data systems for precision oncology

Applied ML and data engineering for drug discovery. I work on production systems—data pipelines, cloud infrastructure, and ML deployment—used by biotech teams in practice. BS-MS from IISER Bhopal.

About

I build ML and data engineering systems for drug discovery, currently focused on oncology. At SoulBio, I work closely with biotech teams to design and deploy end-to-end infrastructure—data pipelines, models, cloud systems, and internal tools—that help them move faster from raw data to usable biological insight.

I began in NLP research, but moving into biological data changed how I think about ML. In drug discovery, the data is messy, context matters, and models don’t live in isolation. A lot of the real work is in the pipelines, the assumptions, and the handoff between computation and biology. That’s where I tend to focus.

Alongside client work, I build internal tools at SoulBio—AI agents and bioinformatics workflows, especially around RNA-seq—that reduce repetitive analysis and help scientists move faster. I enjoy working on problems where good engineering choices make complex work feel simpler and more reliable.

My background is in applied machine learning and data engineering rather than deep biology, and I like building technology that simplifies complex workflows through automation and better tooling.

Experience

Feb 2024 – Present

Founding Engineer

SoulBio

  • Leading full-stack ML, computational biology, and cloud engineering for enterprise oncology clients
  • Built open-source drug targetability model reducing hypothesis-validation time by ~80%
  • Designed AWS Batch + FastAPI pipeline achieving 10× faster high-throughput data processing
  • Developed internal visualization platform (PCA/UMAP, differential expression, GSEA, tumor plasticity) enabling rapid biological insights
  • Maintain CI/CD pipelines, infrastructure automation, and production system reliability
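
The visualization platform's dimensionality-reduction views reduce to projections like the sketch below — a minimal PCA via SVD, shown only to illustrate the underlying math (the platform itself wraps standard libraries; the function name is mine):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project samples onto the top principal components via SVD —
    the projection behind a PCA scatter view."""
    Xc = X - X.mean(axis=0)             # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T     # coordinates in PC space

# Toy stand-in for an expression matrix: 100 samples x 50 genes.
rng = np.random.default_rng(0)
coords = pca_project(rng.normal(size=(100, 50)))
```

By construction, the first component captures at least as much variance as the second, which is what makes the 2-D scatter a faithful summary.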
May 2022 – Dec 2023

Data Analyst

Elucidata

  • Curated large-scale biomedical datasets enabling ML-based research workflows
  • Enhanced GEO/PubMed extraction models improving accuracy by 20%
  • Co-authored multi-task learning approach achieving 3× faster biomedical NER inference
  • Fine-tuned 3B-parameter LLM via PEFT, improving performance by 40%
  • Built GPT-powered ontology normalization tool achieving 96% mapping accuracy
Dec 2021 – Apr 2022

Python Developer Intern

MindBowser

  • Built and tested REST APIs in Python
  • Optimized data-processing scripts and improved code quality through refactoring and modularization

Education

Aug 2017 – Apr 2022

BS-MS in Electrical Engineering & Computer Science

Indian Institute of Science Education & Research (IISER), Bhopal, India

CGPA: 7.34/10.00

Apr 2015 – Apr 2016

Senior Secondary, ISC (Class XII)

HBEC, Kanpur, India

Percentage: 89.2/100

Research Projects

May 2021 – Apr 2022 Bhopal, India

MS Thesis: Identifying Manipulative Writing Style From Shorter Texts

Advisors: Dr. Arpit Sharma & Dr. Rajakrishnan Rajkumar

  • Researched detection of manipulative writing in paragraph-level texts, showing that manipulative texts resemble fictional writing styles
  • Proposed 3 novel syntactic features from parse trees capturing word-order variation, syntactic complexity, and argument-adjunct patterns from the psycholinguistics literature
  • Achieved 92% accuracy with a classical ML classifier and 96% state-of-the-art accuracy with fine-tuned BERT
  • Published thesis work at COLING 2022
Jan 2022 – Apr 2022 Bhopal, India

Extracting Causality from Natural Language

Advisors: Dr. Arpit Sharma & Dr. Rajakrishnan Rajkumar

  • Researched extraction of cause-effect pairs from English sentences
  • Proposed rules based on dependency relations in causal sentences for causality extraction
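
To give a flavour of the rule-based framing, here is a heavily simplified sketch: the actual rules operate over dependency parses, while this stand-in matches surface causal connectives, and all names are illustrative:

```python
import re

# Simplified stand-in for dependency-based rules: match surface causal
# connectives and return a (cause, effect) pair.
RULES = [
    re.compile(r"^(?P<effect>.+?) because (?P<cause>.+?)\.?$", re.IGNORECASE),
    re.compile(r"^(?P<cause>.+?) leads to (?P<effect>.+?)\.?$", re.IGNORECASE),
]

def extract_causality(sentence):
    """Return (cause, effect) if a causal pattern matches, else None."""
    for rule in RULES:
        m = rule.match(sentence.strip())
        if m:
            return m.group("cause"), m.group("effect")
    return None

pair = extract_causality("The match was cancelled because it rained.")
```

Dependency relations make such rules robust to word-order variation in a way surface patterns are not, which is why the project defined them over parses.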
Jan 2021 – Apr 2021 Bhopal, India

BS Thesis: Importance of POS Tags in Document Level Genre Classification

Advisor: Dr. Kushal Shah

  • Implemented Markov Chains for genre identification using transition matrices built over letters and POS tags
  • Evaluated the importance of POS tags in predicting fiction vs non-fiction text genres
  • Quantified the significance of adverbs, adjectives, and pronouns in determining the fictitious nature of a text — Report
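
The transition-matrix idea can be sketched in a few lines (letter-level only here; the thesis also used POS-tag transitions, and the smoothing choice is mine):

```python
from collections import defaultdict
import math

def transitions(text):
    """Count letter-bigram transitions in one genre's training text."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def log_likelihood(text, counts, alpha=1.0, vocab=27):
    """Score text under a genre's transition matrix (add-alpha smoothing)."""
    score = 0.0
    for a, b in zip(text, text[1:]):
        total = sum(counts[a].values())
        score += math.log((counts[a][b] + alpha) / (total + alpha * vocab))
    return score

def classify(text, fiction, nonfiction):
    """Assign the genre whose Markov chain explains the text better."""
    return ("fiction"
            if log_likelihood(text, fiction) >= log_likelihood(text, nonfiction)
            else "non-fiction")
```

Each genre gets its own chain; classification is just a likelihood comparison, which is what makes the approach interpretable.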
May 2019 – Jul 2019 Bhopal, India

Writing Style of News Articles

Advisor: Dr. Kushal Shah

  • Studied sentiment and writing style in news articles
  • Developed and deployed the backend for NewsChase app categorizing articles as Reliable, Imaginative, or Alright — App

Publications

bioRxiv 2025 Co-Author

DeepGEOSearch: LLM-Powered Schemaless Retrieval for Biomedical Data Discovery

Deepshikha Singh, Shashank Jatav, Shefali Lathwal, Arman Kazmi, Soumya Luthra.

LLM-powered schemaless retrieval system enabling semantic search across diverse biomedical datasets without predefined schemas.

Abstract

Discovering and accessing relevant biomedical data across heterogeneous repositories remains a significant bottleneck in research workflows. Traditional schema-based retrieval systems require predefined metadata structures, limiting their applicability to diverse data sources. We present DeepGEOSearch, an LLM-powered retrieval system that enables semantic search and discovery of biomedical datasets through natural language queries without requiring explicit schema definitions. Our approach leverages large language models for flexible data interpretation and ranking, allowing researchers to discover datasets based on experimental intent rather than rigid schema constraints. We demonstrate the system's effectiveness on large-scale biomedical data repositories, showing improved discoverability and usability compared to traditional keyword-based approaches.

bioRxiv 2024 First Author

Beyond the Hype: The Complexity of Automated Cell Type Annotations with GPT-4

Arman Kazmi, Deepshikha Singh, Shashank Jatav, Soumya Luthra.

Comprehensive evaluation of GPT-4 for cell type annotation, revealing limitations and introducing improved methods with RAG enhancement.

Abstract

Recent research has shown the impressive capability of large language models like GPT-4 in various downstream tasks in single-cell data analysis. Among these tasks, cell type annotation remains particularly challenging, with researchers exploring various methods to improve accuracy and efficiency. While recent studies on GPT-like models have demonstrated annotation performance comparable to manual annotations, a significant gap remains in understanding their limitations and generalizability. In this work, we compare and evaluate the annotation performance of the GPT-4 model against traditional methods on nine randomly selected public single-cell RNA-seq datasets from cellxgene, covering diverse tissue types. Our evaluation highlights the complexity of annotating cell types in single-cell data, revealing key differences between automated and manual approaches. We found specific cases where GPT-4 underperforms, demonstrating its limitations in certain contexts. We further introduce an automated approach that incorporates literature search via RAG, which enhances GPT-4 cell type annotation and outperforms it in comparisons against traditional methods. We also introduce metrics based on taxonomic distance in the ontology tree to evaluate the granularity of cell type annotations. To support future research, we release an open-source Python package that enables fully automated cell-type annotation of single-cell data using GPT-4 alongside other methods; the pipeline can take a paper as input and perform cell type annotation on its own.
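
The taxonomic-distance metric reduces to path length through the lowest common ancestor in the ontology tree. A minimal sketch over a child-to-parent mapping (the cell-type names below are illustrative, not the paper's evaluation set):

```python
def taxonomic_distance(parent, a, b):
    """Number of edges between two ontology terms, via their lowest
    common ancestor in a child -> parent mapping."""
    def lineage(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain
    depth_a = {n: i for i, n in enumerate(lineage(a))}
    for steps_b, n in enumerate(lineage(b)):
        if n in depth_a:                 # lowest common ancestor
            return depth_a[n] + steps_b
    raise ValueError("terms share no common ancestor")

# Toy ontology fragment (child -> parent).
ontology = {
    "CD4-positive T cell": "T cell",
    "CD8-positive T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
}
```

Under this metric, a coarse annotation like "lymphocyte" for a true "CD4-positive T cell" scores distance 2, so differences in annotation granularity become measurable rather than just right/wrong.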

ICON 2022 Co-Author

Reducing inference time of biomedical NER tasks using multi-task learning

Mukund Chaudhry, Arman Kazmi, Shashank Jatav, Akhilesh Verma, Vishal Samal, Kristopher Paul, Ashutosh Modi

Multi-task learning framework achieving 3× faster biomedical NER inference without sacrificing accuracy.

Abstract

Recently, fine-tuned transformer-based models (e.g., PubMedBERT, BioBERT) have shown state-of-the-art performance on a number of BioNLP tasks, such as Named Entity Recognition (NER). However, transformer-based models are complex, have millions of parameters, and are consequently relatively slow during inference. In this paper, we address the time-complexity limitations of BioNLP transformer models. In particular, we propose a Multi-Task Learning based framework for jointly learning three different biomedical NER tasks. Our experiments show a reduction in inference time by a factor of three without any reduction in prediction accuracy.
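
The inference saving comes from running one shared encoder pass and reusing it across task heads. A NumPy sketch of that structure (the real system uses a transformer encoder; shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
label_sizes = [5, 7, 3]                    # tag-set sizes for three NER tasks

W_shared = rng.normal(size=(d_model, d_model))        # stand-in for the encoder
heads = [rng.normal(size=(d_model, k)) for k in label_sizes]

def forward(tokens):
    """Encode once, then apply one cheap projection per task — roughly
    why joint inference beats running three separate models."""
    h = np.tanh(tokens @ W_shared)          # shared computation, done once
    return [h @ W for W in heads]           # per-task logits

logits = forward(rng.normal(size=(10, d_model)))      # 10 tokens
```

Since the encoder dominates the cost and the heads are tiny, serving three tasks costs roughly one model's inference instead of three.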

COLING 2022 First Author

Linguistically motivated features for classifying shorter text into fiction and non-fiction genre

Arman Kazmi, Sidharth Ranjan, Arpit Sharma, and Rajakrishnan Rajkumar.

Linguistically-motivated syntactic features enabling interpretable genre classification of short texts with 97-98% accuracy.

Abstract

This work deploys linguistically motivated features to classify paragraph-level text into fiction and non-fiction genres using a logistic regression model and infers lexical and syntactic properties that distinguish the two genres. Previous works have focused on classifying document-level text into fiction and non-fiction genres, while in this work, we deal with shorter texts, which are closer to real-world applications like sentiment analysis of tweets. Going beyond the simple POS tag ratios proposed in Qureshi et al. (2019) for document-level classification, we extracted multiple linguistically motivated features belonging to four categories: lexical features, POS ratio features, syntactic features, and raw features. For the task of short-text classification, a model containing the 28 best features (selected via recursive feature elimination with cross-validation; RFECV) confers an accuracy jump of 15.56% over a baseline model consisting of 2 POS-ratio features found effective in previous work (cited above). The efficacy of the above model containing a linguistically motivated feature set also transfers to another dataset, viz. the Baby BNC corpus. We also compared the classification accuracy of the logistic regression model with two deep-learning models. A 1D CNN model gives an increase of 2% accuracy over the logistic regression classifier on both corpora, and the BERT-base-uncased model gives the best classification accuracy of 97% on the Brown corpus and 98% on the Baby BNC corpus. Although both deep-learning models give better results in terms of classification accuracy, the problem of interpreting these models remains unsolved. In contrast, regression model coefficients revealed that fiction texts tend to have more character-level diversity and lower lexical density (quantified using content-function word ratios) compared to non-fiction texts. Moreover, subtle differences in word order exist between the two genres, i.e., in fiction texts verbs precede adverbs (inter alia).
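
The RFECV selection step maps directly onto scikit-learn's `RFECV` with a logistic regression estimator — sketched here on synthetic stand-in features (the actual study used the linguistic features described above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the linguistic feature matrix.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)

# Recursively drop the weakest feature, scoring each subset by CV accuracy.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
kept = [i for i, keep in enumerate(selector.support_) if keep]
```

Because the estimator is a linear model, the surviving features' coefficients stay directly interpretable — the property the paper trades deep-learning accuracy for.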