locomo by snap-research

Evaluating long-term conversational memory in LLM agents

Created 1 year ago
284 stars

Top 92.0% on SourcePulse

Project Summary

This repository introduces LoCoMo, a benchmark dataset and evaluation framework for assessing the very long-term conversational memory of LLM agents. It targets researchers and developers, enabling rigorous testing of agent recall, coherence, and RAG capabilities over extended dialogs to understand long-term context maintenance.

How It Works

LoCoMo features 10 annotated, very long conversations structured into sessions with timestamps, speakers, and dialog turns (including image URLs/metadata). The framework provides scripts for generating synthetic conversations using LLM agents with defined personas and for evaluating LLMs on Question Answering (QA) and Event Summarization. Generated 'observations' and 'session summaries' serve as RAG databases.
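The session/turn structure described above might look like the sketch below. This is a hypothetical illustration only: the field names (`sessions`, `date_time`, `turns`, `img_url`, `blip_caption`) are assumptions for exposition, not the dataset's documented schema.

```shell
# Hypothetical sketch of one LoCoMo-style conversation entry (illustrative
# field names, NOT the actual dataset schema).
cat <<'EOF' > sample_conversation.json
{
  "speakers": ["speaker_a", "speaker_b"],
  "sessions": [
    {
      "date_time": "1 May 2023, 3:15 pm",
      "turns": [
        {"speaker": "speaker_a", "text": "How was the photography trip?"},
        {"speaker": "speaker_b", "text": "Great! I caught the sunrise.",
         "img_url": "https://example.com/sunrise.jpg",
         "blip_caption": "a sunrise over mountains"}
      ]
    }
  ]
}
EOF

# Count dialog turns (lines with a per-turn "speaker" field).
grep -c '"speaker":' sample_conversation.json
```

Turns that reference images carry a URL and a BLIP caption rather than the image itself, which matches the dataset's stated limitation that raw images are not distributed.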

Quick Start & Requirements

Configuration is handled via scripts/env.sh. Conversation generation uses bash scripts/generate_conversations.sh, supporting custom personas or sampling from the MSC dataset. Evaluation scripts (bash scripts/evaluate_gpts.sh, etc.) cover various LLM providers. RAG data can be regenerated with bash scripts/generate_observations.sh and bash scripts/generate_session_summaries.sh. API keys for the chosen providers may be required.
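The commands above can be chained into a minimal workflow sketch. It assumes the repository URL shown below and that scripts/env.sh has been filled in with the required API keys; script behavior and ordering beyond what the README states are assumptions.

```shell
# Minimal workflow sketch -- assumes this repository URL and that provider
# API keys have been added to scripts/env.sh (both are assumptions).
git clone https://github.com/snap-research/locomo.git
cd locomo
source scripts/env.sh                       # API keys, paths, model settings

# Generate synthetic long-term conversations between persona-driven agents.
bash scripts/generate_conversations.sh

# Regenerate the RAG databases: observations and per-session summaries.
bash scripts/generate_observations.sh
bash scripts/generate_session_summaries.sh

# Evaluate LLMs on QA over the conversations (OpenAI models in this script).
bash scripts/evaluate_gpts.sh
```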

Highlighted Details

  • LoCoMo Benchmark: 10 high-quality, very long conversations annotated for QA and Event Summarization.
  • LLM Agent Evaluation: Facilitates comprehensive assessment of long-term memory, context retention, and RAG performance.
  • Generative Framework: Creates synthetic, long-term dialogs with customizable agent personas.
  • RAG Data: Offers generated 'observations' and 'session summaries' as distinct databases for RAG model evaluation.

Maintenance & Community

The provided README lacks specific details on community channels, project roadmaps, or notable contributors and sponsorships.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This omission is a potential adoption blocker, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

Images are not included; only web URLs, BLIP captions, and search queries are provided. The current dataset is a subset of 10 conversations, selected for evaluation cost-effectiveness. Event summarization and multimodal dialog generation evaluation features are marked as "Coming soon."

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 58 stars in the last 30 days

