locomo by snap-research

Evaluating long-term conversational memory in LLM agents

Created 1 year ago
284 stars

Top 92.0% on SourcePulse

Project Summary

This repository introduces LoCoMo, a benchmark dataset and evaluation framework for assessing the very long-term conversational memory of LLM agents. It targets researchers and developers, enabling rigorous testing of agent recall, coherence, and RAG capabilities over extended dialogs to understand long-term context maintenance.

How It Works

LoCoMo features 10 annotated, very long conversations structured into sessions with timestamps, speakers, and dialog turns (including image URLs/metadata). The framework provides scripts for generating synthetic conversations using LLM agents with defined personas and for evaluating LLMs on Question Answering (QA) and Event Summarization. Generated 'observations' and 'session summaries' serve as RAG databases.
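The session/turn structure described above might look like the sketch below. This is a hypothetical illustration only: the field names (`sessions`, `date_time`, `turns`, `img_url`, `blip_caption`) are assumptions for exposition, not the dataset's documented schema.

```shell
# Hypothetical sketch of one LoCoMo-style conversation entry (illustrative
# field names, NOT the actual dataset schema).
cat <<'EOF' > sample_conversation.json
{
  "speakers": ["speaker_a", "speaker_b"],
  "sessions": [
    {
      "date_time": "1 May 2023, 3:15 pm",
      "turns": [
        {"speaker": "speaker_a", "text": "How was the photography trip?"},
        {"speaker": "speaker_b", "text": "Great! I caught the sunrise.",
         "img_url": "https://example.com/sunrise.jpg",
         "blip_caption": "a sunrise over mountains"}
      ]
    }
  ]
}
EOF

# Count dialog turns (lines with a per-turn "speaker" field).
grep -c '"speaker":' sample_conversation.json
```

Turns that reference images carry a URL and a BLIP caption rather than the image itself, which matches the dataset's stated limitation that raw images are not distributed.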

Quick Start & Requirements

Configuration is handled via scripts/env.sh. Conversation generation uses bash scripts/generate_conversations.sh, supporting custom personas or sampling from the MSC dataset. Evaluation scripts (bash scripts/evaluate_gpts.sh, etc.) cover various LLM providers. RAG data can be regenerated with bash scripts/generate_observations.sh and bash scripts/generate_session_summaries.sh. API keys for the chosen providers may be required.
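The commands above can be chained into a minimal workflow sketch. It assumes the repository URL shown below and that scripts/env.sh has been filled in with the required API keys; script behavior and ordering beyond what the README states are assumptions.

```shell
# Minimal workflow sketch -- assumes this repository URL and that provider
# API keys have been added to scripts/env.sh (both are assumptions).
git clone https://github.com/snap-research/locomo.git
cd locomo
source scripts/env.sh                       # API keys, paths, model settings

# Generate synthetic long-term conversations between persona-driven agents.
bash scripts/generate_conversations.sh

# Regenerate the RAG databases: observations and per-session summaries.
bash scripts/generate_observations.sh
bash scripts/generate_session_summaries.sh

# Evaluate LLMs on QA over the conversations (OpenAI models in this script).
bash scripts/evaluate_gpts.sh
```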

Highlighted Details

  • LoCoMo Benchmark: 10 high-quality, very long conversations annotated for QA and Event Summarization.
  • LLM Agent Evaluation: Facilitates comprehensive assessment of long-term memory, context retention, and RAG performance.
  • Generative Framework: Creates synthetic, long-term dialogs with customizable agent personas.
  • RAG Data: Offers generated 'observations' and 'session summaries' as distinct databases for RAG model evaluation.

Maintenance & Community

The provided README lacks specific details on community channels, project roadmaps, or notable contributors and sponsorships.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. This omission is a potential adoption blocker, especially for commercial use or integration into closed-source projects.

Limitations & Caveats

Images are not included; only web URLs, BLIP captions, and search queries are provided. The current dataset is a subset of 10 conversations, selected for evaluation cost-effectiveness. Event summarization and multimodal dialog generation evaluation features are marked as "Coming soon."

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 4
  • Star History: 58 stars in the last 30 days

