ghostwriter  by superzhang21

Linguistic feature datasets for LLM enhancement

Created 9 months ago
260 stars

Top 97.6% on SourcePulse

Summary

ghostwriter (影子作家) is a collection of JSON files containing structured linguistic and stylistic features extracted from real individuals and fictional characters. It targets researchers, writers, and AI developers who want to analyze or replicate specific authorial voices, or to condition language models on them, so that AI-generated text follows a chosen voice more closely.

How It Works

Linguistic features are extracted from diverse sources (social media, literature, speeches) and organized into JSON files named by source and subject (e.g., Weibo_Hu.json). The structured data is meant to be supplied to long-context large language models (LLMs) so that they can adopt or analyze a specific writing style.
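The naming convention can be illustrated with a small Python sketch. It assumes every file follows a `<Source>_<Subject>.json` pattern, which is inferred only from the example names on this page; the helper `parse_feature_filename` and the `FeatureFileName` type are hypothetical and not part of the project.

```python
from pathlib import Path
from typing import NamedTuple


class FeatureFileName(NamedTuple):
    """Hypothetical container for the two parts of a feature file name."""
    source: str   # where the text came from, e.g. "Weibo"
    subject: str  # whose voice the file describes, e.g. "Hu"


def parse_feature_filename(filename: str) -> FeatureFileName:
    """Split '<Source>_<Subject>.json' at the first underscore.

    The pattern is an assumption based on names such as Weibo_Hu.json.
    """
    path = Path(filename)
    if path.suffix.lower() != ".json":
        raise ValueError(f"expected a .json file, got {filename!r}")
    source, sep, subject = path.stem.partition("_")
    if not sep or not source or not subject:
        raise ValueError(f"{filename!r} does not match '<Source>_<Subject>.json'")
    return FeatureFileName(source, subject)
```

For example, `parse_feature_filename("Tiandao_DingYuanying.json")` yields a source of `Tiandao` and a subject of `DingYuanying`. Splitting at the first underscore keeps any later underscores inside the subject part.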

Quick Start & Requirements

The JSON files reside in the data/ directory. To use them, supply a file's contents to a compatible long-context LLM. The README does not specify installation commands, prerequisites (GPU, CUDA, Python version), or setup time.
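Since the README gives no setup steps, the following Python sketch shows one possible way to turn a feature file into prompt text. The JSON schema is not described here, so the file is treated as opaque and embedded verbatim; the function names and the prompt wording are assumptions, and the call to an actual LLM is deliberately left out.

```python
import json
from pathlib import Path


def list_feature_files(data_dir: Path) -> list[Path]:
    """Return the JSON files found directly under the data directory, sorted by name."""
    return sorted(data_dir.glob("*.json"))


def load_feature_file(path: Path) -> object:
    """Parse one feature file; no particular schema is assumed."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def build_style_prompt(features: object, task: str) -> str:
    """Embed the features as pretty-printed JSON in a plain-text prompt.

    ensure_ascii=False keeps Chinese text readable instead of escaping it.
    The prompt wording is illustrative, not taken from the project.
    """
    features_text = json.dumps(features, ensure_ascii=False, indent=2)
    return (
        "The JSON below describes one author's linguistic and stylistic features.\n"
        "Write in that voice.\n\n"
        f"{features_text}\n\n"
        f"Task: {task}"
    )
```

A caller might run `build_style_prompt(load_feature_file(Path("data/Weibo_Hu.json")), "Comment on today's news.")` and pass the resulting string to whichever long-context model it uses.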

Highlighted Details

  • Features extracted from public figures (Hu Xijin, Lu Xun, Lei Jun), fictional characters (Ding Yuanying, Lin Daiyu, Li Yunlong), and TV series characters (Lv Ziqiao).
  • Data origins include Weibo, novels, speeches, and TV series, capturing diverse linguistic nuances.
  • The dataset is continuously updated, with recent entries dated May 2025, and includes "feature framework versions" (e.g., 1.0, 2.1).
  • Examples: Weibo_Hu.json (Hu Xijin from Weibo), Tiandao_DingYuanying.json (Ding Yuanying from "Tiandao"), Public_LuXun.json (Lu Xun from public data, potentially differing from popular perception).

Maintenance & Community

Direct contributions (PRs) are not accepted. Suggestions/issues can be raised but lack guaranteed response or action. Users can submit data for specific feature extraction. Contact: null@linux.do. No community channels are listed.

Licensing & Compatibility

Licensed under CC BY-NC-ND 4.0. Requires attribution, prohibits commercial use, and forbids distribution of modified versions. Suitable for non-commercial research and analysis.

Limitations & Caveats

Direct contributions are not accepted, which limits community involvement, and user-submitted suggestions or issues may go unaddressed. The "NoDerivatives" clause of the license prohibits distributing modified or derivative versions of the data.

Health Check
Last Commit: 8 months ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 9 stars in the last 30 days
