OpenTSLM by StanfordBDHG

Time-Series Language Models for medical data reasoning

Created 5 months ago
948 stars

Top 38.7% on SourcePulse

Project Summary

OpenTSLM introduces a family of Time Series Language Models (TSLMs) that address a key gap in Large Language Models (LLMs): the ability to process sequential numeric data. TSLMs let LLMs natively ingest and reason over multivariate time-series data alongside text, a critical capability for medical applications where rich temporal patient information is common. The project targets researchers and engineers who need to extract actionable insights from complex medical datasets, offering benefits such as richer data synthesis and a foundation for advanced digital health tools.

How It Works

OpenTSLM integrates time-series data as a native modality into pretrained LLMs, such as Llama and Gemma, creating TSLMs. This approach allows for natural-language prompting and reasoning over multiple time series of arbitrary length, overcoming a significant limitation of standard LLMs. The models are trained using a multi-stage curriculum learning process, progressively tackling tasks from multiple-choice questions on time series data to chain-of-thought reasoning for human activity recognition, sleep staging, and ECG analysis. This staged approach aims to build robust temporal reasoning capabilities.
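Conceptually, the staged curriculum can be viewed as an ordered sequence of fine-tuning tasks, with each stage starting from the previous stage's weights. The sketch below is illustrative only: `CURRICULUM`, `train_stage`, and the state dictionary are hypothetical stand-ins for the project's actual training pipeline (the real entry point is `curriculum_learning.py`).

```python
# Minimal sketch of the staged curriculum described above. Stage names follow
# the progression in the text; train_stage is a hypothetical stand-in for the
# project's actual per-stage trainer, not OpenTSLM's API.

CURRICULUM = [
    "time_series_mcq",         # multiple-choice QA over raw series
    "time_series_captioning",  # describe trends in natural language
    "cot_har",                 # chain-of-thought human activity recognition
    "cot_sleep_staging",       # chain-of-thought sleep staging
    "cot_ecg",                 # chain-of-thought ECG analysis
]

def train_stage(model_state: dict, stage: str) -> dict:
    """Hypothetical per-stage trainer: fine-tunes on the stage's dataset
    and returns updated weights, which seed the next stage."""
    print(f"training stage: {stage}")
    model_state["completed"].append(stage)
    return model_state

state = {"backbone": "meta-llama/Llama-3.2-1B", "completed": []}
for stage in CURRICULUM:
    state = train_stage(state, stage)  # each stage builds on the last
```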

Quick Start & Requirements

Installation and setup:

  • Clone the repository with submodules: git clone https://github.com/StanfordBDHG/OpenTSLM.git --recurse-submodules
  • Install dependencies: pip install -r requirements.txt
  • Model access: OpenTSLM supports Llama (e.g., meta-llama/Llama-3.2-1B) and Gemma backbones; request access on Hugging Face and authenticate via huggingface-cli login
  • Start a full curriculum run: python curriculum_learning.py --model OpenTSLMFlamingo

CUDA is preferred for training and inference, owing to compatibility issues with Apple's MPS backend (see Limitations & Caveats).
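If you prefer authenticating from Python rather than the CLI, the huggingface_hub library ships a login() helper; how you supply the token is up to your environment, and this snippet is a general-purpose sketch rather than project code.

```python
# Programmatic alternative to `huggingface-cli login`.
from huggingface_hub import login

# Prompts interactively for a Hugging Face token; you can also pass
# token="hf_..." directly (e.g., read from an environment variable).
login()
```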

Highlighted Details

  • Enables LLMs to reason over multivariate time series of any length.
  • Supports Llama and Gemma LLM backends, with meta-llama/Llama-3.2-1B as the default.
  • Features a multi-stage curriculum learning framework covering tasks like time series question answering, captioning, and chain-of-thought reasoning on HAR, sleep staging, and ECG data.
  • Generates natural-language findings, captions, and rationales from time-series inputs (see the inference sketch after this list).
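As a rough illustration of that inference pattern, the sketch below pairs a text prompt with two raw time series. The generate() function, its signature, and the sample values are assumptions for illustration, not OpenTSLM's actual interface.

```python
# Hypothetical sketch: a TSLM consuming multiple time series plus a text
# prompt and returning a natural-language answer. Stubbed so it runs as-is.

from typing import List

def generate(prompt: str, series: List[List[float]]) -> str:
    """Stub standing in for a TSLM's multimodal generate() method."""
    return f"[model output for {len(series)} series, prompt: {prompt[:40]}...]"

heart_rate = [62.0, 64.0, 90.0, 118.0, 121.0, 95.0]  # bpm (made-up values)
accel_mag  = [0.1, 0.1, 0.9, 1.4, 1.3, 0.4]          # g (made-up values)

answer = generate(
    "Given the heart-rate and accelerometer traces, what activity is the "
    "patient most likely performing? Explain your reasoning step by step.",
    [heart_rate, accel_mag],
)
print(answer)
```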

Maintenance & Community

The project is a collaborative effort with numerous co-authors from institutions including Stanford University, Google Research, and ETH Zurich. Contributions are welcomed, with guidelines and a code of conduct available. The project also highlights research opportunities for students and potential collaboration avenues for researchers and partners via digitalhealthresearch@stanford.edu.

Licensing & Compatibility

This project is released under the MIT License, a permissive open-source license that allows for broad use, modification, and distribution, including for commercial purposes and integration into closed-source projects.

Limitations & Caveats

The most significant caveat is the repository's reported MPS/CUDA compatibility warning: users on Apple's MPS backend may encounter issues, and checkpoints trained with CUDA (NVIDIA GPUs) may not be fully compatible, or may yield degraded results, when run on MPS. CUDA is recommended for best performance and compatibility.
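Given this caveat, a defensive pattern is to prefer CUDA, fall back to MPS or CPU, and remap checkpoint tensors at load time. This is a minimal PyTorch sketch under those assumptions, not project code; the checkpoint filename is hypothetical.

```python
# Prefer CUDA, fall back to MPS, then CPU, and remap CUDA-trained
# checkpoint tensors onto whatever device is available.
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # may hit the compatibility issues noted above
else:
    device = torch.device("cpu")

# map_location remaps tensors saved on CUDA onto the selected device,
# avoiding load errors on machines without NVIDIA GPUs.
state_dict = torch.load("opentslm_checkpoint.pt", map_location=device)
print(f"loaded checkpoint onto {device}")
```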

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull requests (30d): 9
  • Issues (30d): 4
  • Star history: 953 stars in the last 30 days
