Class: Candle::NER

Inherits:
Object
Defined in:
lib/candle/ner.rb

Overview

Named Entity Recognition (NER) for token classification

This class provides methods to extract named entities from text using pre-trained BERT-based models. It supports standard NER labels like PER (person), ORG (organization), LOC (location), and can be extended with custom entity types.

Examples:

Load a pre-trained NER model

ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner")

Load a model with a specific tokenizer

ner = Candle::NER.from_pretrained("dslim/bert-base-NER", tokenizer: "bert-base-cased")

Extract entities from text

entities = ner.extract_entities("Apple Inc. was founded by Steve Jobs in Cupertino.")
# => [
#   { text: "Apple Inc.", label: "ORG", start: 0, end: 10, confidence: 0.99 },
#   { text: "Steve Jobs", label: "PER", start: 26, end: 36, confidence: 0.98 },
#   { text: "Cupertino", label: "LOC", start: 40, end: 49, confidence: 0.97 }
# ]

Get token-level predictions

tokens = ner.predict_tokens("John works at Google")
# Returns detailed token-by-token predictions with confidence scores

Class Method Details

.from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil) ⇒ NER

Load a pre-trained NER model from HuggingFace

Parameters:

  • model_id (String)

    HuggingFace model ID (e.g., “dslim/bert-base-NER”)

  • device (Device) (defaults to: Candle::Device.best)

    Device to run on (defaults to best available)

  • tokenizer (String, nil) (defaults to: nil)

    Tokenizer model ID to use (defaults to same as model_id)

Returns:

  • (NER)

    NER instance



# File 'lib/candle/ner.rb', line 39

def from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil)
  new(model_id, device, tokenizer)
end

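.suggested_models ⇒ Object

Popular pre-trained models for different domains, returned as a Hash keyed by domain. Each entry has a :model ID and a :note; an entry also carries a :tokenizer ID when a separate tokenizer is required.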



# File 'lib/candle/ner.rb', line 44

def suggested_models
  {
    general: {
      model: "Babelscape/wikineural-multilingual-ner",
      note: "Has tokenizer.json"
    },
    general_alt: {
      model: "dslim/bert-base-NER",
      tokenizer: "bert-base-cased",
      note: "Requires separate tokenizer"
    },
    multilingual: {
      model: "Davlan/bert-base-multilingual-cased-ner-hrl",
      note: "Check tokenizer availability"
    },
    biomedical: {
      model: "dmis-lab/biobert-base-cased-v1.2",
      note: "May require specific tokenizer"
    },
    clinical: {
      model: "emilyalsentzer/Bio_ClinicalBERT",
      note: "May require specific tokenizer"
    },
    scientific: {
      model: "allenai/scibert_scivocab_uncased",
      note: "May require specific tokenizer"
    }
  }
end
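As a rough sketch of how these entries might be consumed, the snippet below turns one entry into the arguments that from_pretrained accepts. The `load_args` helper and the local `suggested` hash (a copy of two entries above) are illustrative assumptions, not part of Candle::NER, and no model is loaded.

```ruby
# Two entries copied from the hash returned by suggested_models.
suggested = {
  general: {
    model: "Babelscape/wikineural-multilingual-ner",
    note: "Has tokenizer.json"
  },
  general_alt: {
    model: "dslim/bert-base-NER",
    tokenizer: "bert-base-cased",
    note: "Requires separate tokenizer"
  }
}

# Hypothetical helper: build [model_id, keyword_args] for from_pretrained.
# The :tokenizer keyword is included only when the entry names one.
def load_args(entry)
  kwargs = {}
  kwargs[:tokenizer] = entry[:tokenizer] if entry[:tokenizer]
  [entry[:model], kwargs]
end

model_id, kwargs = load_args(suggested[:general_alt])
# With the gem loaded, this would become:
#   Candle::NER.from_pretrained(model_id, **kwargs)
```

Leaving :tokenizer out of the keyword arguments lets from_pretrained fall back to its documented default of using the model ID as the tokenizer ID.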

Instance Method Details

#_extract_entities ⇒ Object

Alias that preserves the native extract_entities method before it is wrapped; #extract_entities calls it with positional arguments.



# File 'lib/candle/ner.rb', line 76

alias_method :_extract_entities, :extract_entities

#analyze(text, confidence_threshold: 0.9) ⇒ Hash

Analyze text and return both entities and token predictions

Parameters:

  • text (String)

    The text to analyze

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence for entities

Returns:

  • (Hash)

    Hash with :entities and :tokens keys



# File 'lib/candle/ner.rb', line 126

def analyze(text, confidence_threshold: 0.9)
  {
    entities: extract_entities(text, confidence_threshold: confidence_threshold),
    tokens: predict_tokens(text)
  }
end

#entity_types ⇒ Array<String>

Get available entity types

Returns:

  • (Array<String>)

    List of entity types (without B-/I- prefixes)



# File 'lib/candle/ner.rb', line 91

def entity_types
  return @entity_types if @entity_types
  
  label_config = labels
  @entity_types = label_config["label2id"].keys
    .reject { |l| l == "O" }
    .map { |l| l.sub(/^[BI]-/, "") }
    .uniq
    .sort
end
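To make the prefix stripping concrete, here is the same chain applied to a hand-written label map. The `label2id` hash below is a hypothetical example in the shape the method reads from `labels["label2id"]`; a real model's map may differ.

```ruby
# Hypothetical BIO label map; real models supply their own.
label2id = {
  "O" => 0,
  "B-PER" => 1, "I-PER" => 2,
  "B-ORG" => 3, "I-ORG" => 4,
  "B-LOC" => 5, "I-LOC" => 6
}

types = label2id.keys
  .reject { |l| l == "O" }          # drop the "outside" tag
  .map { |l| l.sub(/^[BI]-/, "") }  # strip the B-/I- prefix
  .uniq                             # B-PER and I-PER collapse to PER
  .sort

p types
# => ["LOC", "ORG", "PER"]
```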

#extract_entities(text, confidence_threshold: 0.9) ⇒ Array<Hash>

Extract entities from text

Parameters:

  • text (String)

    The text to analyze

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence score

Returns:

  • (Array<Hash>)

    Array of entity hashes with :text, :label, :start, :end, and :confidence keys



# File 'lib/candle/ner.rb', line 83

def extract_entities(text, confidence_threshold: 0.9)
  # Call the native method with positional arguments
  _extract_entities(text, confidence_threshold)
end

#extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ Array<Hash>

Extract entities of a specific type

Parameters:

  • text (String)

    The text to analyze

  • entity_type (String)

    Entity type to extract (e.g., “PER”, “ORG”); upcased before it is compared with each entity's label

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence score

Returns:

  • (Array<Hash>)

    Filtered entities of the specified type



# File 'lib/candle/ner.rb', line 116

def extract_entity_type(text, entity_type, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  entities.select { |e| e[:label] == entity_type.upcase }
end
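The filtering step can be tried without loading a model. In this sketch the entity hashes are hand-written in the shape shown in the Overview, and `select_type` is a hypothetical stand-in for the `select` call above.

```ruby
# Hand-written entities in the shape shown in the Overview example.
entities = [
  { text: "Apple Inc.", label: "ORG", start: 0,  end: 10, confidence: 0.99 },
  { text: "Steve Jobs", label: "PER", start: 26, end: 36, confidence: 0.98 },
  { text: "Cupertino",  label: "LOC", start: 40, end: 49, confidence: 0.97 }
]

# Hypothetical stand-in for the filter inside extract_entity_type:
# the requested type is upcased, the entity labels are left as they are.
def select_type(entities, entity_type)
  entities.select { |e| e[:label] == entity_type.upcase }
end

people = select_type(entities, "per")
p people.map { |e| e[:text] }
# => ["Steve Jobs"]
```

Because only the argument is upcased, a model whose labels are not upper case would never match under this logic.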

#format_entities(text, confidence_threshold: 0.9) ⇒ String

Get a formatted string representation of entities

Parameters:

  • text (String)

    The text to analyze

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence score

Returns:

  • (String)

    The input text with a [LABEL:confidence] tag inserted after each entity (the text is returned unchanged if no entities are found)



# File 'lib/candle/ner.rb', line 138

def format_entities(text, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  return text if entities.empty?
  
  # Sort by start position (reverse for easier insertion)
  entities.sort_by! { |e| -e[:start] }
  
  result = text.dup
  entities.each do |entity|
    label = "[#{entity[:label]}:#{entity[:confidence].round(2)}]"
    result.insert(entity[:end], label)
  end
  
  result
end
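The reverse sort matters because each insertion shifts every later character offset; working from the end of the string backwards keeps the remaining offsets valid. A minimal sketch of that loop, using a hypothetical `annotate` helper and hand-written entities rather than model output:

```ruby
# Hypothetical helper mirroring the insertion loop in format_entities.
def annotate(text, entities)
  result = text.dup
  # Highest start offset first, so earlier offsets stay valid after each insert.
  entities.sort_by { |e| -e[:start] }.each do |entity|
    tag = "[#{entity[:label]}:#{entity[:confidence].round(2)}]"
    result.insert(entity[:end], tag)
  end
  result
end

text = "Apple Inc. was founded by Steve Jobs in Cupertino."
entities = [
  { text: "Apple Inc.", label: "ORG", start: 0,  end: 10, confidence: 0.99 },
  { text: "Steve Jobs", label: "PER", start: 26, end: 36, confidence: 0.98 },
  { text: "Cupertino",  label: "LOC", start: 40, end: 49, confidence: 0.97 }
]

puts annotate(text, entities)
# prints: Apple Inc.[ORG:0.99] was founded by Steve Jobs[PER:0.98] in Cupertino[LOC:0.97].
```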

#inspect ⇒ String Also known as: to_s

Get model information

Returns:

  • (String)

    Model description



# File 'lib/candle/ner.rb', line 157

def inspect
  opts = options rescue {}
  
  parts = ["#<Candle::NER"]
  parts << "model=#{opts["model_id"] || "unknown"}"
  parts << "device=#{opts["device"] || "unknown"}"
  parts << "labels=#{opts["num_labels"]}" if opts["num_labels"]
  
  if opts["entity_types"] && !opts["entity_types"].empty?
    types = opts["entity_types"].sort.join(",")
    parts << "types=#{types}"
  end
  
  parts.join(" ") + ">"
end
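To show the shape of the resulting string, the sketch below runs the same assembly on a hand-written options hash. The `describe` helper and the values in `opts` (including the "cpu" device string) are assumptions for illustration; the real hash comes from the native `options` method.

```ruby
# Hypothetical helper repeating the string assembly from inspect.
def describe(opts)
  parts = ["#<Candle::NER"]
  parts << "model=#{opts["model_id"] || "unknown"}"
  parts << "device=#{opts["device"] || "unknown"}"
  parts << "labels=#{opts["num_labels"]}" if opts["num_labels"]

  if opts["entity_types"] && !opts["entity_types"].empty?
    parts << "types=#{opts["entity_types"].sort.join(",")}"
  end

  parts.join(" ") + ">"
end

# Assumed option values, for illustration only.
opts = {
  "model_id" => "dslim/bert-base-NER",
  "device" => "cpu",
  "num_labels" => 9,
  "entity_types" => %w[PER ORG LOC MISC]
}

puts describe(opts)
# prints: #<Candle::NER model=dslim/bert-base-NER device=cpu labels=9 types=LOC,MISC,ORG,PER>
puts describe({})
# prints: #<Candle::NER model=unknown device=unknown>
```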

#supports_entity?(entity_type) ⇒ Boolean

Check if model supports a specific entity type

Parameters:

  • entity_type (String)

    Entity type to check (e.g., “GENE”, “PER”); upcased before the lookup

Returns:

  • (Boolean)

    Whether the model recognizes this entity type



# File 'lib/candle/ner.rb', line 106

def supports_entity?(entity_type)
  entity_types.include?(entity_type.upcase)
end