Data copilots for scientific discovery
PhD Student, Biomedical Informatics @ Stanford AI Lab
gmachi [AT] {stanford, cs.stanford} [DOT] edu he/him
Bio
I am a final-year PhD candidate in Biomedical Informatics in Stanford University's Department of Biomedical Data Science. My work centers on building data copilots: AI systems that augment our capabilities in (1) making rational and evidence-based decisions, (2) discovering and acquiring knowledge, and (3) generating new data given our design preferences. I am especially motivated to develop data copilots to understand the natural world, particularly biological and physical systems. I am extremely fortunate to be co-advised and mentored by Parag Mallick (Radiology) and Christopher Ré (CS).
My name is pronounced most like Batman's city of residence, "Gotham" (i.e. GAW-thum). There are many intonation variants of my Sanskrit-originating name across India, but this pronunciation is the closest to the North Indian variant. I also go by "Machi" if that's easier.
I am on the industrial job market for ML Research Scientist roles, particularly for Interpretability/Safety, AI4Science, AI4Bio, or AI4Med teams. Please feel free to contact me about opportunities!
Research
My research focuses on building data copilots, or AI systems that have the following interrelated capabilities:
(1) Evidence-based & rational decision-making: pinpointing what is important (evidence) within inputs for a model's outputs, as well as why and how those outputs were created (rationale). Our work has built automated evidence-verification systems with synthetic data [ECCV'22] and new equippable, GPT-inspired architectural blocks that reveal AI evidence and rationale in text documents, pathology images, and protein sequences & structures [IMLH'23, ICML'24].
(2) Knowledge discovery & acquisition: identifying new information and ensuring it is human-comprehensible. We used model rationale from capability (1) in scientific deployments to discover tumor regions associated with aggressive cancers via spatial proteomics data and single-cell encoders [upcoming preprint]. We are also making deployments in global health and justice, adapting vision-language models to mine covariates from satellite imagery to combat human labor trafficking in the Amazon Rainforest. Finally, we are using protein language models to learn the "functional language" of synthetic and natural proteoforms for viral capsids, metal binders, and circulating biomarkers of chronic diseases.
(3) Controlled generation: using weak labels and machine-extracted knowledge (e.g. memory banks of salient motifs) to guide LLMs toward aligned and safe data generation. Primary applications are rooted in natural language, chess gameplay, and protein sequence design.
I work with foundation models (FMs) ranging from multimodal embedding models to autoregressive models (LLMs, agents). We have found success in fostering capabilities (1)-(3) by building compound AI systems that adapt FMs for novel forms of inference & reasoning. Crucially, these systems must be data- and time-efficient so they can reason over the relatively limited sets of high-dimensional, unstructured data often found in biomedicine. Finally, our methods aim to increase AI interpretability, so that we can determine what our AI systems have learned and how they function.
Projects & Artifacts
Stay tuned for future publications and writings on:
A multimodal FM that is trained on public data and passes medical board exams
A review on spatial statistics for spatial biology
Data copilots for spatial biology: in silico discovery of biomarkers of aggressive cancers
A call to action for evidence-based and rational AI systems in biomedicine
A new form of controlled generation that incorporates weak supervision, motivated by chess and protein design
Adapting remote sensing FMs that teach us the drivers of human labor trafficking in the Amazon Rainforest
A lecture introducing modern ML to life scientists: the history, developments, and opportunities
Lawrence Berkeley National Lab, July 2013 – May 2015
Undergraduate Researcher: computational biophysics & structural biology of enzymes
Berkeley BioLabs, Summer 2014
Bioengineering Intern: gene transfer experimentation
University of California, Berkeley, 2012 – 2016
BA Student: Applied Mathematics (emphasis in Mathematical Biology), Minor in Bioengineering
Process
Fun: On weekends, you can find me and my partner grabbing some sun in urban-suburban East Bay. Or we're unplugging at one of California's numerous regional, state, or national parks to spend time on the water and trails. Outside of hiking and camping, my hobbies include painting, gardening, and hosting regular themed dinner and cocktail parties. I also love city trekking with friends and popping into cafés and roasteries, museums, used bookstores (hunting for vintage maths sections), and outdoor pubs with live music sessions. Reach out if you'd like to grab a coffee!
Design: One of my favorite aspects of research is thinking about aesthetics and design when communicating technical ideas. This drive to understand ideas by visually communicating them (often to myself) was sparked when I was a dyslexic Maths undergraduate. Despite my many interests in Maths, I struggled to parse and conceptualize the blocks of textual abstraction typical of modern mathematical presentation and standard teaching materials. I thus relied heavily on intuition and visual proofs as mental anchors. Thanks in part to training as a CIR Scholar at Stanford's Hasso Plattner Institute of Design, I cartoon-ify almost everything I work on and will carve out time to reflect on, mock up, and refine any discussed concepts.
Selected
Detecting salient objects
toy examples to show class-differential features, presented in our ICML IMLH 2023 paper
Systems for Foundation Models
describing advances on chip design + distributed training
New architecture to learn salient objects
depicting graphical data structures behind our ICML IMLH 2023 paper
Nash Equilibria for joint optimizers
multiple choice grid depicted for each training step
Mpox mobile surveillance
graphical abstract behind our Nature Medicine (2023) paper
Increasing context lengths
comparison of lab's architectures
Evaluating model explanations
graphical abstract behind our ECCV 2022 paper
Vision for instructive AI
talking points for where models could help humans
Global health monitoring
graphical abstract behind remote sensing for health surveillance
Hallmarks of cancer
background information on cancer progression
Visual grounding
desiderata to create more expressive Foundation Models
Physics-based model of cancer
graphical abstract for our Nature Sci Reports (2021) paper