Class: Candle::NER

Inherits: Object

Defined in: lib/candle/ner.rb

Overview

Named Entity Recognition (NER) for token classification

This class provides methods to extract named entities from text using pre-trained BERT-based models. It supports standard NER labels like PER (person), ORG (organization), LOC (location), and can be extended with custom entity types.

Examples:

Load a pre-trained NER model

ner = Candle::NER.from_pretrained("Babelscape/wikineural-multilingual-ner")

Load a model with a specific tokenizer

ner = Candle::NER.from_pretrained("dslim/bert-base-NER", tokenizer: "bert-base-cased")

Extract entities from text

entities = ner.extract_entities("Apple Inc. was founded by Steve Jobs in Cupertino.")
# => [
#   { text: "Apple Inc.", label: "ORG", start: 0, end: 10, confidence: 0.99 },
#   { text: "Steve Jobs", label: "PER", start: 26, end: 36, confidence: 0.98 },
#   { text: "Cupertino", label: "LOC", start: 40, end: 49, confidence: 0.97 }
# ]

Get token-level predictions

tokens = ner.predict_tokens("John works at Google")
# Returns detailed token-by-token predictions with confidence scores

Class Method Summary

.from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil) ⇒ NER

Load a pre-trained NER model from HuggingFace.

.suggested_models ⇒ Object

Popular pre-trained models for different domains.

Instance Method Summary

#_extract_entities ⇒ Object

Create an alias for the native method.

#analyze(text, confidence_threshold: 0.9) ⇒ Hash

Analyze text and return both entities and token predictions.

#entity_types ⇒ Array<String>

Get available entity types.

#extract_entities(text, confidence_threshold: 0.9) ⇒ Array<Hash>

Extract entities from text.

#extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ Array<Hash>

Extract entities of a specific type.

#format_entities(text, confidence_threshold: 0.9) ⇒ String

Get a formatted string representation of entities.

#inspect ⇒ String (also: #to_s)

Get model information.

#supports_entity?(entity_type) ⇒ Boolean

Check if model supports a specific entity type.

Class Method Details

.from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil) ⇒ `NER`

Load a pre-trained NER model from HuggingFace

Parameters:

model_id (String) —

HuggingFace model ID (e.g., “dslim/bert-base-NER”)
device (Device) (defaults to: Candle::Device.best) —

Device to run on (defaults to best available)
tokenizer (String, nil) (defaults to: nil) —

Tokenizer model ID to use (defaults to same as model_id)

Returns:

(NER) —

NER instance




# File 'lib/candle/ner.rb', line 39

def from_pretrained(model_id, device: Candle::Device.best, tokenizer: nil)
  new(model_id, device, tokenizer)
end

.suggested_models ⇒ `Object`

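Popular pre-trained models for different domains, returned as a Hash keyed by domain (:general, :general_alt, :multilingual, :biomedical, :clinical, :scientific). Each value holds the :model ID, a :note about tokenizer availability, and, where one is needed, a separate :tokenizer ID.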

# File 'lib/candle/ner.rb', line 44

def suggested_models
  {
    general: {
      model: "Babelscape/wikineural-multilingual-ner",
      note: "Has tokenizer.json"
    },
    general_alt: {
      model: "dslim/bert-base-NER",
      tokenizer: "bert-base-cased",
      note: "Requires separate tokenizer"
    },
    multilingual: {
      model: "Davlan/bert-base-multilingual-cased-ner-hrl",
      note: "Check tokenizer availability"
    },
    biomedical: {
      model: "dmis-lab/biobert-base-cased-v1.2",
      note: "May require specific tokenizer"
    },
    clinical: {
      model: "emilyalsentzer/Bio_ClinicalBERT",
      note: "May require specific tokenizer"
    },
    scientific: {
      model: "allenai/scibert_scivocab_uncased",
      note: "May require specific tokenizer"
    }
  }
end
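
Only some entries carry a :tokenizer key, so a caller has to forward it when present and fall back to the default otherwise. A minimal sketch of that lookup follows, using a two-entry copy of the hash above. `loader_args` is a hypothetical helper and not part of Candle::NER; the commented-out last line assumes the gem is loaded and the model can be downloaded.

```ruby
# Two entries copied from the hash returned by suggested_models.
suggested = {
  general: {
    model: "Babelscape/wikineural-multilingual-ner",
    note: "Has tokenizer.json"
  },
  general_alt: {
    model: "dslim/bert-base-NER",
    tokenizer: "bert-base-cased",
    note: "Requires separate tokenizer"
  }
}

# Hypothetical helper: turn one entry into the positional and keyword
# arguments that from_pretrained accepts. An entry without a :tokenizer
# key yields tokenizer: nil, which is from_pretrained's own default.
def loader_args(entry)
  [entry.fetch(:model), { tokenizer: entry[:tokenizer] }]
end

model_id, kwargs = loader_args(suggested.fetch(:general_alt))
puts model_id            # dslim/bert-base-NER
puts kwargs[:tokenizer]  # bert-base-cased

# With the gem loaded, the call would then be:
# ner = Candle::NER.from_pretrained(model_id, **kwargs)
```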

Instance Method Details

#_extract_entities ⇒ `Object`

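Create an alias for the native method. The alias keeps the native, positional-argument implementation reachable after the keyword-argument wrapper #extract_entities is defined under the same name.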

# File 'lib/candle/ner.rb', line 76

alias_method :_extract_entities, :extract_entities

#analyze(text, confidence_threshold: 0.9) ⇒ `Hash`

Analyze text and return both entities and token predictions

Parameters:

text (String) —

The text to analyze
confidence_threshold (Float) (defaults to: 0.9) —

Minimum confidence for entities

Returns:

(Hash) —

Hash with :entities and :tokens keys

# File 'lib/candle/ner.rb', line 126

def analyze(text, confidence_threshold: 0.9)
  {
    entities: extract_entities(text, confidence_threshold: confidence_threshold),
    tokens: predict_tokens(text)
  }
end

#entity_types ⇒ `Array<String>`

Get available entity types

Returns:

(Array<String>) —

List of entity types (without B-/I- prefixes)

# File 'lib/candle/ner.rb', line 91

def entity_types
  return @entity_types if @entity_types
  
  label_config = labels
  @entity_types = label_config["label2id"].keys
    .reject { |l| l == "O" }
    .map { |l| l.sub(/^[BI]-/, "") }
    .uniq
    .sort
end
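
To make the prefix stripping concrete, the sketch below runs the same pipeline on a sample label map. The map is hypothetical: it uses the common CoNLL-style BIO tag set in the shape of a model's "label2id" config and is not read from any particular model. `entity_types_from` is likewise a name invented for this example.

```ruby
# Hypothetical label map in the shape of a model's "label2id" config.
label2id = {
  "O" => 0,
  "B-PER" => 1, "I-PER" => 2,
  "B-ORG" => 3, "I-ORG" => 4,
  "B-LOC" => 5, "I-LOC" => 6,
  "B-MISC" => 7, "I-MISC" => 8
}

# Same steps as entity_types, written as a standalone function.
def entity_types_from(label2id)
  label2id.keys
    .reject { |l| l == "O" }          # drop the "outside" tag
    .map { |l| l.sub(/^[BI]-/, "") }  # strip the B-/I- position prefix
    .uniq                             # B-PER and I-PER collapse to PER
    .sort
end

p entity_types_from(label2id)
# => ["LOC", "MISC", "ORG", "PER"]
```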

#extract_entities(text, confidence_threshold: 0.9) ⇒ `Array<Hash>`

Extract entities from text

Parameters:

text (String) —

The text to analyze
confidence_threshold (Float) (defaults to: 0.9) —

Minimum confidence score (default: 0.9)

Returns:

(Array<Hash>) —

Array of entity hashes with text, label, start, end, confidence

# File 'lib/candle/ner.rb', line 83

def extract_entities(text, confidence_threshold: 0.9)
  # Call the native method with positional arguments
  _extract_entities(text, confidence_threshold)
end
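
Because every returned hash carries its own :confidence, a caller can extract once with a permissive threshold and tighten it afterwards without running the model again. A sketch of that post-filtering follows, using the sample result from the overview rather than real model output. `at_least` is a hypothetical helper, and its `>=` comparison is a choice made for this example; this page does not show how the native method compares scores against confidence_threshold.

```ruby
# Sample result in the documented shape (values copied from the
# overview example, not produced by running a model here).
entities = [
  { text: "Apple Inc.", label: "ORG", start: 0,  end: 10, confidence: 0.99 },
  { text: "Steve Jobs", label: "PER", start: 26, end: 36, confidence: 0.98 },
  { text: "Cupertino",  label: "LOC", start: 40, end: 49, confidence: 0.97 }
]

# Hypothetical helper: keep entities whose confidence is at least `min`.
def at_least(entities, min)
  entities.select { |e| e[:confidence] >= min }
end

p at_least(entities, 0.98).map { |e| e[:text] }
# => ["Apple Inc.", "Steve Jobs"]
```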

#extract_entity_type(text, entity_type, confidence_threshold: 0.9) ⇒ `Array<Hash>`

Extract entities of a specific type

Parameters:

text (String) —

The text to analyze
entity_type (String) —

Entity type to extract (e.g., “PER”, “ORG”)
confidence_threshold (Float) (defaults to: 0.9) —

Minimum confidence score

Returns:

(Array<Hash>) —

Filtered entities of the specified type

# File 'lib/candle/ner.rb', line 116

def extract_entity_type(text, entity_type, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  entities.select { |e| e[:label] == entity_type.upcase }
end
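
As the source shows, each call to extract_entity_type runs a full extract_entities pass and then filters it. When several types are needed from the same text, one alternative is to extract once and group the result by label. The sketch below illustrates this on sample data from the overview; `by_type` is a hypothetical helper, not part of the class.

```ruby
# Sample result in the documented shape (copied from the overview).
entities = [
  { text: "Apple Inc.", label: "ORG", start: 0,  end: 10, confidence: 0.99 },
  { text: "Steve Jobs", label: "PER", start: 26, end: 36, confidence: 0.98 },
  { text: "Cupertino",  label: "LOC", start: 40, end: 49, confidence: 0.97 }
]

# Hypothetical helper: index entities by their label in a single pass.
def by_type(entities)
  entities.group_by { |e| e[:label] }
end

grouped = by_type(entities)

# Upcase the query, as extract_entity_type does, and default to an
# empty list for labels that did not occur in the text.
people = grouped.fetch("per".upcase, [])
genes  = grouped.fetch("GENE", [])

p people.map { |e| e[:text] }  # => ["Steve Jobs"]
p genes                        # => []
```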

#format_entities(text, confidence_threshold: 0.9) ⇒ `String`

Get a formatted string representation of entities

Parameters:

text (String) —

The text to analyze
confidence_threshold (Float) (defaults to: 0.9) —

Minimum confidence score

Returns:

(String) —

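A copy of the text with a [LABEL:confidence] tag inserted directly after each entity; the original text is returned unchanged when no entities are found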

# File 'lib/candle/ner.rb', line 138

def format_entities(text, confidence_threshold: 0.9)
  entities = extract_entities(text, confidence_threshold: confidence_threshold)
  return text if entities.empty?
  
  # Sort by descending start position: inserting from the end of the text
  # backwards leaves the offsets of the remaining (earlier) entities valid
  entities.sort_by! { |e| -e[:start] }
  
  result = text.dup
  entities.each do |entity|
    label = "[#{entity[:label]}:#{entity[:confidence].round(2)}]"
    result.insert(entity[:end], label)
  end
  
  result
end
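
The effect of the back-to-front insertion is easiest to see on the overview sentence. The sketch below is a standalone version of the loop that takes the entities as an argument instead of running a model; `annotate` is a name invented for this example and the entity values are the sample ones from the overview.

```ruby
# Standalone version of the insertion loop. Entities are processed from
# the last start offset to the first, so inserting a tag never shifts
# the offsets of entities that are still waiting to be processed.
def annotate(text, entities)
  result = text.dup
  entities.sort_by { |e| -e[:start] }.each do |entity|
    tag = "[#{entity[:label]}:#{entity[:confidence].round(2)}]"
    result.insert(entity[:end], tag)
  end
  result
end

text = "Apple Inc. was founded by Steve Jobs in Cupertino."
entities = [
  { text: "Apple Inc.", label: "ORG", start: 0,  end: 10, confidence: 0.99 },
  { text: "Steve Jobs", label: "PER", start: 26, end: 36, confidence: 0.98 },
  { text: "Cupertino",  label: "LOC", start: 40, end: 49, confidence: 0.97 }
]

puts annotate(text, entities)
# Apple Inc.[ORG:0.99] was founded by Steve Jobs[PER:0.98] in Cupertino[LOC:0.97].
```

Inserting in ascending order instead would require adding the length of every earlier tag to each later offset.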

#inspect ⇒ `String` Also known as: to_s

Get model information

Returns:

(String) —

Model description

# File 'lib/candle/ner.rb', line 157

def inspect
  opts = options rescue {}
  
  parts = ["#<Candle::NER"]
  parts << "model=#{opts["model_id"] || "unknown"}"
  parts << "device=#{opts["device"] || "unknown"}"
  parts << "labels=#{opts["num_labels"]}" if opts["num_labels"]
  
  if opts["entity_types"] && !opts["entity_types"].empty?
    types = opts["entity_types"].sort.join(",")
    parts << "types=#{types}"
  end
  
  parts.join(" ") + ">"
end

#supports_entity?(entity_type) ⇒ `Boolean`

Check if model supports a specific entity type

Parameters:

entity_type (String) —

Entity type to check (e.g., “GENE”, “PER”)

Returns:

(Boolean) —

Whether the model recognizes this entity type




# File 'lib/candle/ner.rb', line 106

def supports_entity?(entity_type)
  entity_types.include?(entity_type.upcase)
end
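
One consequence of the implementation above is worth illustrating: the query is upcased, but the stored types keep whatever case the model's labels use. The sketch below mirrors that comparison with plain arrays. `supports?` and both type lists are hypothetical, and the second list assumes a model whose labels are not all upper-case.

```ruby
# Hypothetical check mirroring supports_entity?: the query is upcased,
# the stored types are compared as they are.
def supports?(entity_types, query)
  entity_types.include?(query.upcase)
end

# Upper-case labels: the query's case does not matter.
conll_types = ["LOC", "MISC", "ORG", "PER"]
puts supports?(conll_types, "per")      # true

# Mixed-case labels: even an exact-case query fails, because
# "Disease".upcase is "DISEASE", which is not in the list.
mixed_types = ["Chemical", "Disease"]
puts supports?(mixed_types, "Disease")  # false
```

With such a model, checking `entity_types.include?("Disease")` directly would sidestep the upcasing.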
