Class: Candle::NER
- Inherits: Object
- Defined in: lib/candle/ner.rb
Overview
Named Entity Recognition (NER) for token classification
This class provides methods to extract named entities from text using pre-trained BERT-based models. It supports standard NER labels like PER (person), ORG (organization), LOC (location), and can be extended with custom entity types.
Class Method Summary
- .from_pretrained(model_id, device: nil, tokenizer: nil) ⇒ NER
  Load a pre-trained NER model from HuggingFace.
- .suggested_models ⇒ Object
  Popular pre-trained models for different domains.
Instance Method Summary
- #_extract_entities ⇒ Object
  Create an alias for the native method.
- #analyze(text, confidence_threshold: 0.9) ⇒ Hash
  Analyze text and return both entities and token predictions.
- #entity_types ⇒ Array<String>
  Get available entity types.
- #extract_entities(text, confidence_threshold: 0.9) ⇒ Array<Hash>
  Extract entities from text.
- #extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ Array<Hash>
  Extract entities of a specific type.
- #format_entities(text, confidence_threshold: 0.9) ⇒ String
  Get a formatted string representation of entities.
- #inspect ⇒ String (also: #to_s)
  Get model information.
- #supports_entity?(entity_type) ⇒ Boolean
  Check if model supports a specific entity type.
Class Method Details
.from_pretrained(model_id, device: nil, tokenizer: nil) ⇒ NER
Load a pre-trained NER model from HuggingFace
# File 'lib/candle/ner.rb', line 36
def from_pretrained(model_id, device: nil, tokenizer: nil)
  new(model_id, device, tokenizer)
end
.suggested_models ⇒ Object
Popular pre-trained models for different domains
# File 'lib/candle/ner.rb', line 41
def suggested_models
  {
    general: {
      model: "Babelscape/wikineural-multilingual-ner",
      note: "Has tokenizer.json"
    },
    general_alt: {
      model: "dslim/bert-base-NER",
      tokenizer: "bert-base-cased",
      note: "Requires separate tokenizer"
    },
    multilingual: {
      model: "Davlan/bert-base-multilingual-cased-ner-hrl",
      note: "Check tokenizer availability"
    },
    biomedical: {
      model: "dmis-lab/biobert-base-cased-v1.2",
      note: "May require specific tokenizer"
    },
    clinical: {
      model: "emilyalsentzer/Bio_ClinicalBERT",
      note: "May require specific tokenizer"
    },
    scientific: {
      model: "allenai/scibert_scivocab_uncased",
      note: "May require specific tokenizer"
    }
  }
end
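A minimal sketch of how one entry of this hash might be turned into arguments for .from_pretrained. The SUGGESTED constant copies two entries from the listing so the snippet runs without the gem, and pretrained_args is a hypothetical helper, not part of Candle::NER.

```ruby
# Two entries copied from the suggested_models listing, so this runs
# without the gem installed.
SUGGESTED = {
  general: {
    model: "Babelscape/wikineural-multilingual-ner",
    note: "Has tokenizer.json"
  },
  general_alt: {
    model: "dslim/bert-base-NER",
    tokenizer: "bert-base-cased",
    note: "Requires separate tokenizer"
  }
}.freeze

# Hypothetical helper: splits an entry into the positional model id and
# the keyword options that from_pretrained(model_id, device: nil,
# tokenizer: nil) accepts. Entries without :tokenizer yield nil, which
# matches the method's default.
def pretrained_args(entry)
  [entry.fetch(:model), { tokenizer: entry[:tokenizer] }]
end

model_id, options = pretrained_args(SUGGESTED[:general_alt])
puts "model: #{model_id}, tokenizer: #{options[:tokenizer]}"
```

With the gem loaded, the pair could presumably be passed on as Candle::NER.from_pretrained(model_id, **options); that call is not exercised here because it would download a model.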
Instance Method Details
#_extract_entities ⇒ Object
Create an alias for the native method
# File 'lib/candle/ner.rb', line 73
alias_method :_extract_entities, :extract_entities
#analyze(text, confidence_threshold: 0.9) ⇒ Hash
Analyze text and return both entities and token predictions
# File 'lib/candle/ner.rb', line 123
def analyze(text, confidence_threshold: 0.9)
  {
    entities: extract_entities(text, confidence_threshold: confidence_threshold),
    tokens: predict_tokens(text)
  }
end
#entity_types ⇒ Array<String>
Get available entity types
# File 'lib/candle/ner.rb', line 88
def entity_types
  return @entity_types if @entity_types

  label_config = labels
  @entity_types = label_config["label2id"].keys
    .reject { |l| l == "O" }
    .map { |l| l.sub(/^[BI]-/, "") }
    .uniq
    .sort
end
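To make the label transformation concrete, here is a sketch that applies the same pipeline to a hand-written label configuration. SAMPLE_LABELS and entity_types_from are illustrative names; a real model's "label2id" mapping may contain different tags.

```ruby
# Hypothetical label configuration in the shape entity_types reads:
# a "label2id" hash keyed by BIO-style tags.
SAMPLE_LABELS = {
  "label2id" => {
    "O" => 0,
    "B-PER" => 1, "I-PER" => 2,
    "B-ORG" => 3, "I-ORG" => 4,
    "B-LOC" => 5, "I-LOC" => 6
  }
}.freeze

# Same steps as entity_types, without the memoization.
def entity_types_from(label_config)
  label_config["label2id"].keys
    .reject { |l| l == "O" }          # drop the "outside" tag
    .map { |l| l.sub(/^[BI]-/, "") }  # strip the B-/I- prefix
    .uniq                             # B-PER and I-PER collapse to PER
    .sort
end

types = entity_types_from(SAMPLE_LABELS)
p types                          # => ["LOC", "ORG", "PER"]
p types.include?("per".upcase)   # => true, the check supports_entity? performs
```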
#extract_entities(text, confidence_threshold: 0.9) ⇒ Array<Hash>
Extract entities from text
# File 'lib/candle/ner.rb', line 80
def extract_entities(text, confidence_threshold: 0.9)
  # Call the native method with positional arguments
  _extract_entities(text, confidence_threshold)
end
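The alias on line 73 runs before this redefinition on line 80, so _extract_entities keeps pointing at the original native method while extract_entities becomes a keyword-argument wrapper. The sketch below reproduces that pattern with a stand-in class; FakeNER and its return value are invented for illustration and say nothing about what the native method returns.

```ruby
# Stand-in for the natively implemented class. Per the source comment,
# the native method takes positional arguments.
class FakeNER
  def extract_entities(text, confidence_threshold)
    [{ "text" => text, "threshold" => confidence_threshold }]
  end
end

# Reopen the class, as the Ruby file does on top of the native extension.
class FakeNER
  # Capture the original method under a new name before redefining it.
  alias_method :_extract_entities, :extract_entities

  # Keyword-argument wrapper that forwards to the captured original.
  def extract_entities(text, confidence_threshold: 0.9)
    _extract_entities(text, confidence_threshold)
  end
end

ner = FakeNER.new
p ner.extract_entities("hello")
p ner.extract_entities("hello", confidence_threshold: 0.5)
```

Because alias_method copies the method body at the time it is called, the wrapper does not recurse into itself.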
#extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ Array<Hash>
Extract entities of a specific type
# File 'lib/candle/ner.rb', line 113
def extract_entity_type(text, entity_type, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  entities.select { |e| e["label"] == entity_type.upcase }
end
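The filtering step can be tried on hand-written data. In this sketch, SAMPLE_ENTITIES and select_type are hypothetical; the "label", "start", "end", and "confidence" keys come from the listings on this page, while the "text" key is an assumption.

```ruby
# Hypothetical entities with string keys, as the listings on this page use.
# The "text" key is assumed for readability.
SAMPLE_ENTITIES = [
  { "text" => "Ada Lovelace", "label" => "PER", "start" => 0,  "end" => 12, "confidence" => 0.99 },
  { "text" => "London",       "label" => "LOC", "start" => 22, "end" => 28, "confidence" => 0.97 }
].freeze

# Same comparison as extract_entity_type: the requested type is upcased
# and matched against each entity's "label" string.
def select_type(entities, entity_type)
  entities.select { |e| e["label"] == entity_type.upcase }
end

p select_type(SAMPLE_ENTITIES, "per").map { |e| e["text"] }  # => ["Ada Lovelace"]
p select_type(SAMPLE_ENTITIES, :per).length                  # => 0
```

The second call shows a consequence of the comparison: Symbol#upcase returns a Symbol, which never equals a String label, so the type is best passed as a string.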
#format_entities(text, confidence_threshold: 0.9) ⇒ String
Get a formatted string representation of entities
# File 'lib/candle/ner.rb', line 135
def format_entities(text, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  return text if entities.empty?

  # Sort by start position (reverse for easier insertion)
  entities.sort_by! { |e| -e["start"] }

  result = text.dup
  entities.each do |entity|
    label = "[#{entity['label']}:#{entity['confidence'].round(2)}]"
    result.insert(entity["end"], label)
  end
  result
end
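The reverse sort matters because each insertion lengthens the string: working from the last entity backwards leaves the offsets of earlier entities valid. The sketch below runs the same insertion loop on hand-written entities. annotate, TAGGED_TEXT, and TAGGED_ENTITIES are illustrative names, the offsets are assumed to be character indices, and sort_by is used instead of sort_by! so the input array is left untouched.

```ruby
# Insert a "[LABEL:confidence]" tag after each entity, starting with the
# entity that begins last so earlier offsets are not shifted.
def annotate(text, entities)
  result = text.dup
  entities.sort_by { |e| -e["start"] }.each do |entity|
    tag = "[#{entity['label']}:#{entity['confidence'].round(2)}]"
    result.insert(entity["end"], tag)
  end
  result
end

TAGGED_TEXT = "Ada Lovelace lived in London."

# Hypothetical entities; "end" is the index just past the last character.
TAGGED_ENTITIES = [
  { "label" => "PER", "start" => 0,  "end" => 12, "confidence" => 0.994 },
  { "label" => "LOC", "start" => 22, "end" => 28, "confidence" => 0.973 }
].freeze

puts annotate(TAGGED_TEXT, TAGGED_ENTITIES)
# => Ada Lovelace[PER:0.99] lived in London[LOC:0.97].
```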
#inspect ⇒ String Also known as: to_s
Get model information
# File 'lib/candle/ner.rb', line 154
def inspect
  "#<Candle::NER #{model_info}>"
end
#supports_entity?(entity_type) ⇒ Boolean
Check if model supports a specific entity type
# File 'lib/candle/ner.rb', line 103
def supports_entity?(entity_type)
  entity_types.include?(entity_type.upcase)
end