entity-fishing by kermitt2

ML tool for entity recognition and disambiguation

created 9 years ago
264 stars

Top 96.7% on SourcePulse

Project Summary

Entity-fishing is a machine learning tool for entity recognition and disambiguation against Wikidata, supporting 15 languages and various text formats including raw text, document-level PDF analysis, and search queries. It is designed for researchers and developers needing efficient and accurate entity linking, offering a faster and lighter alternative to models like BLINK for specific datasets.

How It Works

Entity-fishing employs a query DSL for disambiguation and leverages a large knowledge base derived from Wikidata, encompassing millions of entities and embeddings. Its architecture is optimized for speed, enabling high token processing rates on a single server, and includes an in-house Named Entity Recognizer for English and French.
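
As a rough illustration of what a disambiguation query might look like, the sketch below builds a small JSON payload in Python. The field names (`text`, `language`, `entities`) and the helper `build_disambiguation_query` are assumptions made for illustration and are not taken from this summary; check them against the project's own documentation before relying on them.

```python
import json


def build_disambiguation_query(text, lang="en", entities=None):
    """Build a JSON-serializable query for text disambiguation.

    The field names are assumptions modeled on a typical query DSL;
    verify them against the entity-fishing documentation.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return {
        "text": text,
        "language": {"lang": lang},
        # Optional pre-identified mentions; an empty list leaves
        # mention detection to the service.
        "entities": list(entities or []),
    }


query = build_disambiguation_query("Austria invaded and fought the Serbian army.")
payload = json.dumps(query)  # string form, ready to send to the service
```

The serialized `payload` would then be sent to the service's disambiguation endpoint; the endpoint URL is not given in this summary, so it is left out here.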

Quick Start & Requirements

Highlighted Details

  • Reports an F-score of 0.765 on general named entity recognition, and scores higher than BLINK on the AQUAINT (0.891 vs. 0.8588) and MSNBC (0.867 vs. 0.8509) datasets.
  • Processes 1,000-1,500 tokens/sec on a single server (up to 5,000 tokens/sec with concurrency).
  • Handles PDF documents at 4.5 pages/sec (single client) to 18.2 pages/sec (6 clients).
  • Supports 15 languages with an in-house NER for English and French.
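
To put the throughput figures above in perspective, here is a small worked calculation in Python. The one-million-token corpus is a hypothetical workload chosen for illustration; the rates are the ones quoted in the list, and the helper names are not part of the project.

```python
def processing_time_seconds(units, rate_per_second):
    """Time needed to process `units` items at a steady rate."""
    if rate_per_second <= 0:
        raise ValueError("rate_per_second must be positive")
    return units / rate_per_second


def scaling_efficiency(single_rate, concurrent_rate, clients):
    """Return (speedup, per-client efficiency) for a concurrent run."""
    speedup = concurrent_rate / single_rate
    return speedup, speedup / clients


# Hypothetical corpus of one million tokens at the quoted single-server rates.
slow = processing_time_seconds(1_000_000, 1000)  # 1000 s at 1,000 tokens/sec
fast = processing_time_seconds(1_000_000, 1500)  # about 667 s at 1,500 tokens/sec

# PDF throughput: 4.5 pages/sec with one client, 18.2 pages/sec with six.
speedup, efficiency = scaling_efficiency(4.5, 18.2, 6)
print(f"{slow:.0f} s to {fast:.0f} s; speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On the quoted PDF figures, six clients give roughly a fourfold speedup, so throughput grows with concurrency but not linearly.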

Maintenance & Community

  • Maintained by SCIENCE-MINER.
  • Development started in 2015, first public release in 2016.
  • Received support from Kairntech and contributions from Inria Paris.
  • A spaCy wrapper is available.

Licensing & Compatibility

  • Distributed under the Apache 2.0 license.
  • Dependencies are also Apache 2.0 or compatible.
  • Suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is described as a "work-in-progress side project" and is currently at version 0.0.6. While benchmarks show strong performance, the F1-score for disambiguation is noted as needing improvement in future versions. Expect the server to take 15-30 seconds to start.
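
Because of the start-up delay, a client may want to wait for the server before sending its first request. The following is a minimal Python sketch of such a wait loop, not something provided by the project: `wait_until_ready` and `http_probe` are hypothetical helpers, and the URL to probe is an assumption left to the caller, since this summary does not name a health endpoint.

```python
import time
import urllib.request


def wait_until_ready(probe, timeout=60.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused and similar errors mean "not up yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


def http_probe(url):
    """Build a probe that checks whether `url` answers with HTTP 200.

    The URL is supplied by the caller; it is not taken from this summary.
    """
    def probe():
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    return probe
```

A timeout of about 60 seconds leaves comfortable headroom over the 15-30 second start-up time quoted above.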

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 2
  • Star history: 1 star in the last 30 days
