Class: Candle::NER

Inherits:
Object
Defined in:
lib/candle/ner.rb

Overview

Named Entity Recognition (NER) for token classification

This class provides methods to extract named entities from text using pre-trained BERT-based models. It supports standard NER labels like PER (person), ORG (organization), LOC (location), and can be extended with custom entity types.

Examples:

Load a pre-trained NER model

ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner")

Load a model with a specific tokenizer

ner = Candle::NER.from_pretrained("dslim/bert-base-NER", tokenizer: "bert-base-cased")

Extract entities from text

entities = ner.extract_entities("Apple Inc. was founded by Steve Jobs in Cupertino.")
# => [
#   { "text" => "Apple Inc.", "label" => "ORG", "start" => 0, "end" => 10, "confidence" => 0.99 },
#   { "text" => "Steve Jobs", "label" => "PER", "start" => 26, "end" => 36, "confidence" => 0.98 },
#   { "text" => "Cupertino", "label" => "LOC", "start" => 40, "end" => 49, "confidence" => 0.97 }
# ]

Get token-level predictions

tokens = ner.predict_tokens("John works at Google")
# Returns detailed token-by-token predictions with confidence scores


Class Method Details

.from_pretrained(model_id, device: nil, tokenizer: nil) ⇒ NER

Load a pre-trained NER model from the Hugging Face Hub

Parameters:

  • model_id (String)

    Hugging Face model ID (e.g., “dslim/bert-base-NER”)

  • device (Device, nil) (defaults to: nil)

    Device to run on (defaults to best available)

  • tokenizer (String, nil) (defaults to: nil)

    Tokenizer model ID to use (defaults to model_id)

Returns:

  • (NER)

    NER instance



# File 'lib/candle/ner.rb', line 36

def from_pretrained(model_id, device: nil, tokenizer: nil)
  new(model_id, device, tokenizer)
end

.suggested_models ⇒ Object

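Popular pre-trained models for different domains, returned as a Hash keyed by domain. Each entry has a :model ID and a :note, plus a :tokenizer ID when the model needs a separate tokenizer.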



# File 'lib/candle/ner.rb', line 41

def suggested_models
  {
    general: {
      model: "Babelscape/wikineural-multilingual-ner",
      note: "Has tokenizer.json"
    },
    general_alt: {
      model: "dslim/bert-base-NER",
      tokenizer: "bert-base-cased",
      note: "Requires separate tokenizer"
    },
    multilingual: {
      model: "Davlan/bert-base-multilingual-cased-ner-hrl",
      note: "Check tokenizer availability"
    },
    biomedical: {
      model: "dmis-lab/biobert-base-cased-v1.2",
      note: "May require specific tokenizer"
    },
    clinical: {
      model: "emilyalsentzer/Bio_ClinicalBERT",
      note: "May require specific tokenizer"
    },
    scientific: {
      model: "allenai/scibert_scivocab_uncased",
      note: "May require specific tokenizer"
    }
  }
end
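
A minimal sketch of turning one of these entries into arguments for from_pretrained. The entry is copied from the hash above; pretrained_args is a hypothetical helper, not part of Candle, and the final call is left as a comment because it would download a model.

```ruby
# Hypothetical helper (not part of Candle): converts a suggested_models entry
# into the positional and keyword arguments that from_pretrained accepts.
def pretrained_args(entry)
  kwargs = {}
  # Only pass tokenizer: when the entry names a separate tokenizer.
  kwargs[:tokenizer] = entry[:tokenizer] if entry[:tokenizer]
  [entry[:model], kwargs]
end

# Entry copied from the suggested_models hash shown above.
general_alt = {
  model: "dslim/bert-base-NER",
  tokenizer: "bert-base-cased",
  note: "Requires separate tokenizer"
}

model_id, kwargs = pretrained_args(general_alt)
# With the gem loaded, the model would then be created with:
#   ner = Candle::NER.from_pretrained(model_id, **kwargs)
```

Entries without a :tokenizer key produce an empty keyword hash, so from_pretrained falls back to its default tokenizer lookup.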

Instance Method Details

#_extract_entities ⇒ Object

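Alias that keeps the native extract_entities method reachable as _extract_entities, so the Ruby wrapper of the same name can call it with positional arguments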



# File 'lib/candle/ner.rb', line 73

alias_method :_extract_entities, :extract_entities

#analyze(text, confidence_threshold: 0.9) ⇒ Hash

Analyze text and return both entities and token predictions

Parameters:

  • text (String)

    The text to analyze

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence for entities (not passed on to predict_tokens)

Returns:

  • (Hash)

    Hash with :entities and :tokens keys



# File 'lib/candle/ner.rb', line 123

def analyze(text, confidence_threshold: 0.9)
  {
    entities: extract_entities(text, confidence_threshold: confidence_threshold),
    tokens: predict_tokens(text)
  }
end

#entity_types ⇒ Array<String>

Get available entity types

Returns:

  • (Array<String>)

    List of entity types (without B-/I- prefixes)



# File 'lib/candle/ner.rb', line 88

def entity_types
  return @entity_types if @entity_types
  
  label_config = labels
  @entity_types = label_config["label2id"].keys
    .reject { |l| l == "O" }
    .map { |l| l.sub(/^[BI]-/, "") }
    .uniq
    .sort
end
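
The prefix stripping can be tried without loading a model. This sketch runs the same chain over a sample label configuration; the label set is an assumption for illustration, since a real model's labels come from its own config.

```ruby
# Sample label configuration (an assumption for illustration).
label_config = {
  "label2id" => {
    "O" => 0, "B-PER" => 1, "I-PER" => 2,
    "B-ORG" => 3, "I-ORG" => 4, "B-LOC" => 5, "I-LOC" => 6
  }
}

entity_types = label_config["label2id"].keys
  .reject { |l| l == "O" }          # drop the "outside any entity" label
  .map { |l| l.sub(/^[BI]-/, "") }  # strip the B-/I- prefix
  .uniq                             # B-PER and I-PER collapse to PER
  .sort

p entity_types
# => ["LOC", "ORG", "PER"]
```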

#extract_entities(text, confidence_threshold: 0.9) ⇒ Array<Hash>

Extract entities from text

Parameters:

  • text (String)

    The text to analyze

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence score

Returns:

  • (Array<Hash>)

    Array of entity hashes with text, label, start, end, confidence



# File 'lib/candle/ner.rb', line 80

def extract_entities(text, confidence_threshold: 0.9)
  # Call the native method with positional arguments
  _extract_entities(text, confidence_threshold)
end
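
The alias-then-wrap pattern used here can be sketched in isolation. FakeNER below is a hypothetical stand-in for the native class, and its return value is made up for illustration; only the alias_method and keyword-forwarding steps mirror the source.

```ruby
# Stand-in for the native class (hypothetical). The "native" method accepts
# positional arguments only.
class FakeNER
  def extract_entities(text, threshold)
    [{ "text" => text, "confidence" => 0.95 }]
      .select { |e| e["confidence"] >= threshold }
  end
end

# Reopen the class: keep the original under a new name, then redefine the
# public name with a keyword argument that forwards positionally.
class FakeNER
  alias_method :_extract_entities, :extract_entities

  def extract_entities(text, confidence_threshold: 0.9)
    _extract_entities(text, confidence_threshold)
  end
end

ner = FakeNER.new
p ner.extract_entities("Ada").length                              # => 1
p ner.extract_entities("Ada", confidence_threshold: 0.99).length  # => 0
```

The alias has to be created before the redefinition; otherwise _extract_entities would point at the wrapper and the call would recurse.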

#extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ Array<Hash>

Extract entities of a specific type

Parameters:

  • text (String)

    The text to analyze

  • entity_type (String)

    Entity type to extract (e.g., “PER”, “ORG”); upcased before comparison

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence score

Returns:

  • (Array<Hash>)

    Filtered entities of the specified type



# File 'lib/candle/ner.rb', line 113

def extract_entity_type(text, entity_type, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  entities.select { |e| e["label"] == entity_type.upcase }
end
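
A small sketch of the filtering step on hand-written data. The sample entities and the select_type helper are hypothetical; the string keys are an assumption based on the e["label"] access in the method above.

```ruby
# Sample entities shaped like an extract_entities result (assumed shape).
entities = [
  { "text" => "Apple Inc.", "label" => "ORG", "start" => 0,  "end" => 10, "confidence" => 0.99 },
  { "text" => "Steve Jobs", "label" => "PER", "start" => 26, "end" => 36, "confidence" => 0.98 },
  { "text" => "Cupertino",  "label" => "LOC", "start" => 40, "end" => 49, "confidence" => 0.97 }
]

# Hypothetical standalone version of the filter: the requested type is
# upcased, so "per" and "PER" select the same entities.
def select_type(entities, entity_type)
  entities.select { |e| e["label"] == entity_type.upcase }
end

people = select_type(entities, "per")
p people.map { |e| e["text"] }
# => ["Steve Jobs"]
```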

#format_entities(text, confidence_threshold: 0.9) ⇒ String

Get a formatted string representation of entities

Parameters:

  • text (String)

    The text to analyze

  • confidence_threshold (Float) (defaults to: 0.9)

    Minimum confidence score

Returns:

  • (String)

    The text with a [LABEL:confidence] tag inserted after each entity, or the text unchanged when no entities are found



# File 'lib/candle/ner.rb', line 135

def format_entities(text, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  return text if entities.empty?
  
  # Sort by start position (reverse for easier insertion)
  entities.sort_by! { |e| -e["start"] }
  
  result = text.dup
  entities.each do |entity|
    label = "[#{entity['label']}:#{entity['confidence'].round(2)}]"
    result.insert(entity["end"], label)
  end
  
  result
end
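
The reverse sort matters because each insertion shifts every later offset. A sketch of the same loop on hand-written entities (sample data, with string keys assumed from the method above) shows the resulting format:

```ruby
text = "Apple Inc. was founded by Steve Jobs in Cupertino."

# Sample entities (assumed shape); offsets index into the text above.
entities = [
  { "label" => "ORG", "start" => 0,  "end" => 10, "confidence" => 0.99 },
  { "label" => "PER", "start" => 26, "end" => 36, "confidence" => 0.98 },
  { "label" => "LOC", "start" => 40, "end" => 49, "confidence" => 0.97 }
]

result = text.dup
# Insert from the last entity to the first so that earlier offsets stay valid.
entities.sort_by { |e| -e["start"] }.each do |entity|
  tag = "[#{entity['label']}:#{entity['confidence'].round(2)}]"
  result.insert(entity["end"], tag)
end

puts result
# Apple Inc.[ORG:0.99] was founded by Steve Jobs[PER:0.98] in Cupertino[LOC:0.97].
```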

#inspect ⇒ String Also known as: to_s

Get model information

Returns:

  • (String)

    Model description



# File 'lib/candle/ner.rb', line 154

def inspect
  "#<Candle::NER #{model_info}>"
end

#supports_entity?(entity_type) ⇒ Boolean

Check if model supports a specific entity type

Parameters:

  • entity_type (String)

    Entity type to check (e.g., “GENE”, “PER”); upcased before comparison

Returns:

  • (Boolean)

    Whether the model recognizes this entity type



# File 'lib/candle/ner.rb', line 103

def supports_entity?(entity_type)
  entity_types.include?(entity_type.upcase)
end