OpenTSLM by StanfordBDHG

Time-Series Language Models for medical data reasoning

Created 5 months ago
948 stars

Top 38.7% on SourcePulse

Project Summary

OpenTSLM introduces a family of Time Series Language Models (TSLMs) that address a key gap in Large Language Models (LLMs): the ability to process sequential numeric data. TSLMs let LLMs natively ingest and reason over multivariate time-series data alongside text, a critical capability for medical applications where rich temporal patient information is common. The project targets researchers and engineers who need to extract actionable insights from complex medical datasets, offering benefits such as richer data synthesis and a foundation for advanced digital health tools.

How It Works

OpenTSLM integrates time-series data as a native modality into pretrained LLMs, such as Llama and Gemma, creating TSLMs. This approach allows for natural-language prompting and reasoning over multiple time series of arbitrary length, overcoming a significant limitation of standard LLMs. The models are trained using a multi-stage curriculum learning process, progressively tackling tasks from multiple-choice questions on time series data to chain-of-thought reasoning for human activity recognition, sleep staging, and ECG analysis. This staged approach aims to build robust temporal reasoning capabilities.
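Conceptually, the staged curriculum can be viewed as an ordered sequence of fine-tuning tasks, with each stage starting from the previous stage's weights. The sketch below is illustrative only: `CURRICULUM`, `train_stage`, and the state dictionary are hypothetical stand-ins for the project's actual training pipeline (the real entry point is `curriculum_learning.py`).

```python
# Minimal sketch of the staged curriculum described above. Stage names follow
# the progression in the text; train_stage is a hypothetical stand-in for the
# project's actual per-stage trainer, not OpenTSLM's API.

CURRICULUM = [
    "time_series_mcq",         # multiple-choice QA over raw series
    "time_series_captioning",  # describe trends in natural language
    "cot_har",                 # chain-of-thought human activity recognition
    "cot_sleep_staging",       # chain-of-thought sleep staging
    "cot_ecg",                 # chain-of-thought ECG analysis
]

def train_stage(model_state: dict, stage: str) -> dict:
    """Hypothetical per-stage trainer: fine-tunes on the stage's dataset
    and returns updated weights, which seed the next stage."""
    print(f"training stage: {stage}")
    model_state["completed"].append(stage)
    return model_state

state = {"backbone": "meta-llama/Llama-3.2-1B", "completed": []}
for stage in CURRICULUM:
    state = train_stage(state, stage)  # each stage builds on the last
```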

Quick Start & Requirements

Installation and setup:

  • Clone the repository with submodules: git clone https://github.com/StanfordBDHG/OpenTSLM.git --recurse-submodules
  • Install dependencies: pip install -r requirements.txt
  • Model access: OpenTSLM supports Llama (e.g., meta-llama/Llama-3.2-1B) and Gemma backbones; request access on Hugging Face and authenticate via huggingface-cli login
  • Start a full curriculum run: python curriculum_learning.py --model OpenTSLMFlamingo

CUDA is preferred for training and inference, owing to compatibility issues with Apple's MPS backend (see Limitations & Caveats).
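If you prefer authenticating from Python rather than the CLI, the huggingface_hub library ships a login() helper; how you supply the token is up to your environment, and this snippet is a general-purpose sketch rather than project code.

```python
# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

# Prompts interactively for a Hugging Face token; you can also pass
# token="hf_..." directly (e.g., read from an environment variable).
login()
```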

Highlighted Details

  • Enables LLMs to reason over multivariate time series of any length.
  • Supports Llama and Gemma LLM backends, with meta-llama/Llama-3.2-1B as the default.
  • Features a multi-stage curriculum learning framework covering tasks like time series question answering, captioning, and chain-of-thought reasoning on HAR, sleep staging, and ECG data.
  • Generates natural-language findings, captions, and rationales from time-series inputs (see the inference sketch after this list).
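As a rough illustration of that inference pattern, the sketch below pairs a text prompt with two raw time series. The generate() function, its signature, and the sample values are assumptions for illustration, not OpenTSLM's actual interface.

```python
# Hypothetical sketch: a TSLM consuming multiple time series plus a text
# prompt and returning a natural-language answer. Stubbed so it runs as-is.

from typing import List

def generate(prompt: str, series: List[List[float]]) -> str:
    """Stub standing in for a TSLM's multimodal generate() method."""
    return f"[model output for {len(series)} series, prompt: {prompt[:40]}...]"

heart_rate = [62.0, 64.0, 90.0, 118.0, 121.0, 95.0]  # bpm (made-up values)
accel_mag  = [0.1, 0.1, 0.9, 1.4, 1.3, 0.4]          # g (made-up values)

answer = generate(
    "Given the heart-rate and accelerometer traces, what activity is the "
    "patient most likely performing? Explain your reasoning step by step.",
    [heart_rate, accel_mag],
)
print(answer)
```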

Maintenance & Community

The project is a collaborative effort with numerous co-authors from institutions including Stanford University, Google Research, and ETH Zurich. Contributions are welcomed, with guidelines and a code of conduct available. The project also highlights research opportunities for students and potential collaboration avenues for researchers and partners via digitalhealthresearch@stanford.edu.

Licensing & Compatibility

This project is released under the MIT License, a permissive open-source license that allows for broad use, modification, and distribution, including for commercial purposes and integration into closed-source projects.

Limitations & Caveats

The most significant caveat is the repository's reported MPS/CUDA compatibility warning: users on Apple's MPS backend may encounter issues, and checkpoints trained with CUDA (NVIDIA GPUs) may not be fully compatible, or may yield degraded results, when run on MPS. CUDA is recommended for best performance and compatibility.
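Given this caveat, a defensive pattern is to prefer CUDA, fall back to MPS or CPU, and remap checkpoint tensors at load time. This is a minimal PyTorch sketch under those assumptions, not project code; the checkpoint filename is hypothetical.

```python
# Prefer CUDA, fall back to MPS, then CPU, and remap CUDA-trained
# checkpoint tensors onto whatever device is available.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # may hit the compatibility issues noted above
else:
    device = torch.device("cpu")

# map_location remaps tensors saved on CUDA onto the selected device,
# avoiding load errors on machines without NVIDIA GPUs.
state_dict = torch.load("opentslm_checkpoint.pt", map_location=device)
print(f"loaded checkpoint onto {device}")
```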

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 9
  • Issues (30d): 4
  • Star history: 953 stars in the last 30 days
