Class: Candle::NER
- Inherits: Object
  - Object
  - Candle::NER
- Defined in: lib/candle/ner.rb
Overview
Named Entity Recognition (NER) for token classification
This class provides methods to extract named entities from text using pre-trained BERT-based models. It supports standard NER labels like PER (person), ORG (organization), LOC (location), and can be extended with custom entity types.
Class Method Summary
- .from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil) ⇒ NER
  Load a pre-trained NER model from HuggingFace.
- .suggested_models ⇒ Object
  Popular pre-trained models for different domains.
Instance Method Summary
- #_extract_entities ⇒ Object
  Create an alias for the native method.
- #analyze(text, confidence_threshold: 0.9) ⇒ Hash
  Analyze text and return both entities and token predictions.
- #entity_types ⇒ Array<String>
  Get available entity types.
- #extract_entities(text, confidence_threshold: 0.9) ⇒ Array<Hash>
  Extract entities from text.
- #extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ Array<Hash>
  Extract entities of a specific type.
- #format_entities(text, confidence_threshold: 0.9) ⇒ String
  Get a formatted string representation of entities.
- #inspect ⇒ String (also: #to_s)
  Get model information.
- #supports_entity?(entity_type) ⇒ Boolean
  Check if model supports a specific entity type.
Class Method Details
.from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil) ⇒ NER
Load a pre-trained NER model from HuggingFace
    # File 'lib/candle/ner.rb', line 39
    def from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil)
      new(model_id, device, tokenizer)
    end
.suggested_models ⇒ Object
Popular pre-trained models for different domains
    # File 'lib/candle/ner.rb', line 44
    def suggested_models
      {
        general: {
          model: "Babelscape/wikineural-multilingual-ner",
          note: "Has tokenizer.json"
        },
        general_alt: {
          model: "dslim/bert-base-NER",
          tokenizer: "bert-base-cased",
          note: "Requires separate tokenizer"
        },
        multilingual: {
          model: "Davlan/bert-base-multilingual-cased-ner-hrl",
          note: "Check tokenizer availability"
        },
        biomedical: {
          model: "dmis-lab/biobert-base-cased-v1.2",
          note: "May require specific tokenizer"
        },
        clinical: {
          model: "emilyalsentzer/Bio_ClinicalBERT",
          note: "May require specific tokenizer"
        },
        scientific: {
          model: "allenai/scibert_scivocab_uncased",
          note: "May require specific tokenizer"
        }
      }
    end
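Every entry above has a :model and a :note, and only some carry a :tokenizer. As a rough sketch of how an entry could be turned into arguments for .from_pretrained, the snippet below copies two entries so it runs without the gem. The constant SUGGESTED_SUBSET and the helper load_args are hypothetical names, not part of Candle::NER.

```ruby
# Two entries copied from the hash above so the sketch runs without the gem.
SUGGESTED_SUBSET = {
  general: {
    model: "Babelscape/wikineural-multilingual-ner",
    note: "Has tokenizer.json"
  },
  general_alt: {
    model: "dslim/bert-base-NER",
    tokenizer: "bert-base-cased",
    note: "Requires separate tokenizer"
  }
}.freeze

# Hypothetical helper (not part of Candle::NER): returns the positional
# model id and the keyword arguments that from_pretrained accepts.
# Entries without :tokenizer yield nil, which matches the method's default.
def load_args(entry)
  [entry[:model], { tokenizer: entry[:tokenizer] }]
end

model_id, kwargs = load_args(SUGGESTED_SUBSET[:general_alt])
puts "#{model_id} with tokenizer #{kwargs[:tokenizer].inspect}"

# With the gem loaded, the call would presumably look like:
#   Candle::NER.from_pretrained(model_id, **kwargs)
```

Passing tokenizer: nil for entries that omit the key keeps a single call shape for all entries.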
Instance Method Details
#_extract_entities ⇒ Object
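Alias that preserves the native extract_entities method, which takes positional arguments, before #extract_entities is redefined with a keyword argument.
    # File 'lib/candle/ner.rb', line 76
    alias_method :_extract_entities, :extract_entities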
#analyze(text, confidence_threshold: 0.9) ⇒ Hash
Analyze text and return a Hash with two keys: :entities (from #extract_entities, using confidence_threshold) and :tokens (from predict_tokens, which is not passed the threshold)
    # File 'lib/candle/ner.rb', line 126
    def analyze(text, confidence_threshold: 0.9)
      {
        entities: extract_entities(text, confidence_threshold: confidence_threshold),
        tokens: predict_tokens(text)
      }
    end
#entity_types ⇒ Array<String>
Get available entity types
    # File 'lib/candle/ner.rb', line 91
    def entity_types
      return @entity_types if @entity_types

      label_config = labels
      @entity_types = label_config["label2id"].keys
        .reject { |l| l == "O" }
        .map { |l| l.sub(/^[BI]-/, "") }
        .uniq
        .sort
    end
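The method above drops the "O" (outside) label, strips the B-/I- prefixes, and returns the unique types sorted. A minimal standalone sketch of that pipeline follows. The label map is a made-up example in the shape the method reads from labels["label2id"], and entity_types_from is a hypothetical helper name.

```ruby
# Hypothetical label map; a real model's label2id may differ.
SAMPLE_LABEL2ID = {
  "O" => 0,
  "B-PER" => 1, "I-PER" => 2,
  "B-ORG" => 3, "I-ORG" => 4,
  "B-LOC" => 5, "I-LOC" => 6
}.freeze

# Same chain as #entity_types, minus the memoization in @entity_types.
def entity_types_from(label2id)
  label2id.keys
          .reject { |l| l == "O" }          # "O" marks non-entity tokens
          .map { |l| l.sub(/^[BI]-/, "") }  # "B-PER" and "I-PER" become "PER"
          .uniq
          .sort
end

puts entity_types_from(SAMPLE_LABEL2ID).inspect
# prints ["LOC", "ORG", "PER"]
```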
#extract_entities(text, confidence_threshold: 0.9) ⇒ Array<Hash>
Extract entities from text
    # File 'lib/candle/ner.rb', line 83
    def extract_entities(text, confidence_threshold: 0.9)
      # Call the native method with positional arguments
      _extract_entities(text, confidence_threshold)
    end
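The alias-then-redefine pattern used by #_extract_entities and #extract_entities can be illustrated without the native extension. In the sketch below, KeywordWrapperDemo and its sample data are hypothetical stand-ins, and the >= comparison is an assumption about how a threshold might be applied, not a statement about the native code.

```ruby
class KeywordWrapperDemo
  # Made-up entities for "Alice in Paris".
  SAMPLE = [
    { label: "PER", start: 0, end: 5, confidence: 0.98 },
    { label: "LOC", start: 9, end: 14, confidence: 0.62 }
  ].freeze

  # Stand-in for the native method: positional arguments only.
  def extract_entities(_text, confidence_threshold)
    SAMPLE.select { |e| e[:confidence] >= confidence_threshold }
  end

  # Keep the positional version reachable under a private-looking name.
  alias_method :_extract_entities, :extract_entities

  # Redefine the public name with a keyword argument and a default.
  def extract_entities(text, confidence_threshold: 0.9)
    _extract_entities(text, confidence_threshold)
  end
end

demo = KeywordWrapperDemo.new
puts demo.extract_entities("Alice in Paris").length
puts demo.extract_entities("Alice in Paris", confidence_threshold: 0.5).length
```

The alias has to be created before the redefinition; otherwise _extract_entities would point at the wrapper and recurse.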
#extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ Array<Hash>
Extract entities of a specific type. The entity_type argument is upcased before comparison, so "per" and "PER" select the same label
    # File 'lib/candle/ner.rb', line 116
    def extract_entity_type(text, entity_type, confidence_threshold: 0.9)
      entities = extract_entities(text, confidence_threshold: confidence_threshold)
      entities.select { |e| e[:label] == entity_type.upcase }
    end
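The filtering step can be tried on plain hashes. This is a small sketch with made-up entities. Only the :label, :start, :end, and :confidence keys are taken from the source on this page, and select_entity_type is a hypothetical helper name.

```ruby
# Hypothetical entities in the hash shape the methods on this page read.
SAMPLE_ENTITIES = [
  { label: "PER", start: 0,  end: 10, confidence: 0.99 },
  { label: "ORG", start: 20, end: 25, confidence: 0.95 },
  { label: "PER", start: 30, end: 33, confidence: 0.91 }
].freeze

# Same select as #extract_entity_type, applied to an existing array.
def select_entity_type(entities, entity_type)
  entities.select { |e| e[:label] == entity_type.upcase }
end

people = select_entity_type(SAMPLE_ENTITIES, "per")
puts people.length
# prints 2
```

Because only the argument is upcased, labels that a model reports in lowercase or mixed case would not match.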
#format_entities(text, confidence_threshold: 0.9) ⇒ String
Get a formatted string representation of entities
    # File 'lib/candle/ner.rb', line 138
    def format_entities(text, confidence_threshold: 0.9)
      entities = extract_entities(text, confidence_threshold: confidence_threshold)

      return text if entities.empty?

      # Sort by start position (reverse for easier insertion)
      entities.sort_by! { |e| -e[:start] }

      result = text.dup
      entities.each do |entity|
        label = "[#{entity[:label]}:#{entity[:confidence].round(2)}]"
        result.insert(entity[:end], label)
      end

      result
    end
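The method above inserts each label at the entity's :end offset, working from the last entity to the first so that earlier offsets stay valid after each insertion. A standalone sketch of that step follows, with made-up entities and offsets. annotate is a hypothetical helper, and it uses a non-mutating sort_by where the original uses sort_by!.

```ruby
# Hypothetical helper mirroring the insertion loop in #format_entities.
def annotate(text, entities)
  return text if entities.empty?

  # Process right to left: inserting at a later offset never shifts
  # the offsets of entities that start earlier in the string.
  entities.sort_by { |e| -e[:start] }.each_with_object(text.dup) do |entity, result|
    label = "[#{entity[:label]}:#{entity[:confidence].round(2)}]"
    result.insert(entity[:end], label)
  end
end

text = "Alice works at Acme in Paris."
entities = [
  { label: "PER", start: 0,  end: 5,  confidence: 0.98 },
  { label: "ORG", start: 15, end: 19, confidence: 0.954 },
  { label: "LOC", start: 23, end: 28, confidence: 0.97 }
]

puts annotate(text, entities)
# prints Alice[PER:0.98] works at Acme[ORG:0.95] in Paris[LOC:0.97].
```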
#inspect ⇒ String Also known as: to_s
Get model information
    # File 'lib/candle/ner.rb', line 157
    def inspect
      opts = rescue {}

      parts = ["#<Candle::NER"]
      parts << "model=#{opts["model_id"] || "unknown"}"
      parts << "device=#{opts["device"] || "unknown"}"
      parts << "labels=#{opts["num_labels"]}" if opts["num_labels"]

      if opts["entity_types"] && !opts["entity_types"].empty?
        types = opts["entity_types"].sort.join(",")
        parts << "types=#{types}"
      end

      parts.join(" ") + ">"
    end
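To show what the assembled string looks like, here is a sketch of the same assembly applied to a plain Hash. describe_model is a hypothetical helper, and the option values ("cpu", 9, the entity type list) are made up for illustration. How the real method obtains opts is not shown here.

```ruby
# Hypothetical helper mirroring the string assembly in #inspect.
def describe_model(opts)
  parts = ["#<Candle::NER"]
  parts << "model=#{opts["model_id"] || "unknown"}"
  parts << "device=#{opts["device"] || "unknown"}"
  parts << "labels=#{opts["num_labels"]}" if opts["num_labels"]

  types = opts["entity_types"]
  parts << "types=#{types.sort.join(",")}" if types && !types.empty?

  parts.join(" ") + ">"
end

# With no options, the fallbacks apply and the optional parts are omitted.
puts describe_model({})
# prints #<Candle::NER model=unknown device=unknown>

puts describe_model({
  "model_id" => "dslim/bert-base-NER",
  "device" => "cpu",
  "num_labels" => 9,
  "entity_types" => %w[PER ORG LOC MISC]
})
# prints #<Candle::NER model=dslim/bert-base-NER device=cpu labels=9 types=LOC,MISC,ORG,PER>
```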
#supports_entity?(entity_type) ⇒ Boolean
Check if model supports a specific entity type
    # File 'lib/candle/ner.rb', line 106
    def supports_entity?(entity_type)
      entity_types.include?(entity_type.upcase)
    end