entity-fishing by kermitt2

ML tool for entity recognition and disambiguation

Created 9 years ago
263 stars

Top 97.0% on SourcePulse

Project Summary

Entity-fishing is a machine learning tool that recognizes entities and disambiguates them against Wikidata. It supports 15 languages and several input types: raw text, whole PDF documents, and search queries. It targets researchers and developers who need efficient, accurate entity linking, and it offers a faster, lighter alternative to models such as BLINK, which it also outperforms on some datasets (AQUAINT and MSNBC).

How It Works

Entity-fishing accepts disambiguation requests through a query DSL and resolves them against a large knowledge base derived from Wikidata, which contains millions of entities and embeddings. The architecture is optimized for speed, sustaining high token-processing rates on a single server, and it includes an in-house Named Entity Recognizer for English and French.
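
As a rough illustration of what a disambiguation query might look like, the sketch below builds a JSON payload in Python. The field names (`text`, `language`, `mentions`, `entities`, `nbest`) and the helper `build_disambiguation_query` are assumptions for illustration; they are not stated in this summary and should be checked against the project's official documentation before use.

```python
import json


def build_disambiguation_query(text, lang="en"):
    """Build a query payload for text disambiguation.

    All field names here are assumed, not taken from this summary.
    """
    return {
        "text": text,                      # raw text to analyze
        "language": {"lang": lang},        # ISO language code (assumed shape)
        "mentions": ["ner", "wikipedia"],  # assumed mention-detection methods
        "entities": [],                    # no pre-annotated entities
        "nbest": False,                    # ask for the single best candidate
    }


query = build_disambiguation_query(
    "Austria invaded and fought the Serbian army at the Battle of Cer."
)

# Serialize the query; a client would send this string to the running service.
payload = json.dumps(query, ensure_ascii=False)
print(payload)
```

Building the payload in a small function keeps the query shape in one place, so it is easy to adjust once the real field names are confirmed.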

Highlighted Details

  • Reports an F-score of 0.765 on general named entity recognition, and surpasses BLINK on the AQUAINT (0.891 vs. 0.8588) and MSNBC (0.867 vs. 0.8509) datasets.
  • Processes 1,000-1,500 tokens/sec on a single server (up to 5,000 tokens/sec with concurrency).
  • Handles PDF documents at 4.5 pages/sec (single client) to 18.2 pages/sec (6 clients).
  • Supports 15 languages with an in-house NER for English and French.
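
To put the throughput figures above in perspective, here is a small back-of-the-envelope calculation. The corpus size of one million tokens is a hypothetical value chosen for illustration; the rates are the ones quoted in the list.

```python
def seconds_to_process(units, rate_per_sec):
    """Time needed to process `units` items at a steady rate."""
    return units / rate_per_sec


corpus_tokens = 1_000_000  # hypothetical corpus size, not from the source

# Token rates quoted above (tokens/sec).
rates = {
    "single server, low": 1000,
    "single server, high": 1500,
    "with concurrency": 5000,
}

for label, rate in rates.items():
    minutes = seconds_to_process(corpus_tokens, rate) / 60
    print(f"{label}: {minutes:.1f} min")

# PDF throughput: 4.5 pages/sec with one client, 18.2 pages/sec with six.
pdf_speedup = 18.2 / 4.5
parallel_efficiency = pdf_speedup / 6
print(f"PDF speedup with 6 clients: {pdf_speedup:.2f}x")
print(f"Parallel efficiency: {parallel_efficiency:.0%}")
```

Under these assumptions, a million tokens would take roughly 11 to 17 minutes without concurrency and a little over 3 minutes at the peak rate. Six PDF clients yield about a fourfold speedup over one, so scaling is clearly sublinear.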

Maintenance & Community

  • Maintained by SCIENCE-MINER.
  • Development started in 2015, first public release in 2016.
  • Received support from Kairntech and contributions from Inria Paris.
  • A spaCy wrapper is available.

Licensing & Compatibility

  • Distributed under the Apache 2.0 license.
  • Dependencies are also Apache 2.0 or compatible.
  • Suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The authors describe the project as a "work-in-progress side project"; the current version is 0.0.6. Although the benchmarks show strong performance, the disambiguation F1-score is noted as an area for improvement in future versions. Expect the server to take 15-30 seconds to start up.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days
