Class: Candle::Tokenizer

Inherits:
Object
Defined in:
lib/candle/tokenizer.rb

Overview

Tokenizer for converting text into tokens and back

This class provides methods to encode text into tokens and decode tokens back to text. It supports both single text and batch processing, with options for special tokens, padding, and truncation.

Examples:

Create a tokenizer from a pretrained model

tokenizer = Candle::Tokenizer.from_pretrained("bert-base-uncased")

Encode and decode text

tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)

Batch encoding

texts = ["Hello", "World", "Test"]
batch_tokens = tokenizer.encode_batch(texts)

Configure padding and truncation

padded_tokenizer = tokenizer.with_padding(length: 128)
truncated_tokenizer = tokenizer.with_truncation(512)

Instance Method Details

#_native_decode ⇒ Object

# File 'lib/candle/tokenizer.rb', line 51

alias_method :_native_decode, :decode

#_native_encode ⇒ Object

The native methods take positional arguments. Each one is aliased under a _native_ prefix so that the public method can be redefined with keyword arguments, which is more idiomatic Ruby, and then forward to the native version positionally.

# File 'lib/candle/tokenizer.rb', line 46

alias_method :_native_encode, :encode
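
The alias-then-redefine pattern described above can be sketched in plain Ruby. This is a minimal illustration, not the library's source: ToyNativeTokenizer and its fake word-length "token IDs" are hypothetical stand-ins for a class whose methods are defined natively.

```ruby
# Hypothetical stand-in for a natively defined class. The "token IDs" here
# are just word lengths, and 101/102 are made-up special-token IDs.
class ToyNativeTokenizer
  # "Native" signature: both arguments are positional.
  def encode(text, add_special_tokens)
    ids = text.split.map(&:length)
    add_special_tokens ? [101, *ids, 102] : ids
  end
end

# Reopen the class, keep the original reachable under a _native_ name,
# then redefine the public method with a keyword argument.
class ToyNativeTokenizer
  alias_method :_native_encode, :encode

  def encode(text, add_special_tokens: true)
    # Forward to the original implementation positionally.
    _native_encode(text, add_special_tokens)
  end
end

toy = ToyNativeTokenizer.new
with_special    = toy.encode("hello big world")
without_special = toy.encode("hello big world", add_special_tokens: false)
```

The alias must be created before the redefinition. alias_method binds the new name to the method body that exists at that moment, so the wrapper can call it without recursing into itself.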

#_native_encode_batch ⇒ Object

# File 'lib/candle/tokenizer.rb', line 49

alias_method :_native_encode_batch, :encode_batch

#_native_encode_batch_to_tokens ⇒ Object

# File 'lib/candle/tokenizer.rb', line 50

alias_method :_native_encode_batch_to_tokens, :encode_batch_to_tokens

#_native_encode_to_tokens ⇒ Object

# File 'lib/candle/tokenizer.rb', line 47

alias_method :_native_encode_to_tokens, :encode_to_tokens

#_native_encode_with_tokens ⇒ Object

# File 'lib/candle/tokenizer.rb', line 48

alias_method :_native_encode_with_tokens, :encode_with_tokens

#_native_get_vocab ⇒ Object

# File 'lib/candle/tokenizer.rb', line 52

alias_method :_native_get_vocab, :get_vocab

#_native_vocab_size ⇒ Object

# File 'lib/candle/tokenizer.rb', line 53

alias_method :_native_vocab_size, :vocab_size

#_native_with_padding ⇒ Object

# File 'lib/candle/tokenizer.rb', line 54

alias_method :_native_with_padding, :with_padding

#decode(token_ids, skip_special_tokens: true) ⇒ String

Decode token IDs with convenient keyword arguments

Parameters:

  • token_ids (Array<Integer>)

    The token IDs to decode

  • skip_special_tokens (Boolean) (defaults to: true)

    Whether to skip special tokens (default: true)

Returns:

  • (String)

    Decoded text



# File 'lib/candle/tokenizer.rb', line 106

def decode(token_ids, skip_special_tokens: true)
  _native_decode(token_ids, skip_special_tokens)
end

#encode(text, add_special_tokens: true) ⇒ Array<Integer>

Encode text with convenient keyword arguments

Parameters:

  • text (String)

    The text to encode

  • add_special_tokens (Boolean) (defaults to: true)

    Whether to add special tokens (default: true)

Returns:

  • (Array<Integer>)

    Token IDs



# File 'lib/candle/tokenizer.rb', line 61

def encode(text, add_special_tokens: true)
  _native_encode(text, add_special_tokens)
end

#encode_batch(texts, add_special_tokens: true) ⇒ Array<Array<Integer>>

Encode multiple texts with convenient keyword arguments

Parameters:

  • texts (Array<String>)

    The texts to encode

  • add_special_tokens (Boolean) (defaults to: true)

    Whether to add special tokens (default: true)

Returns:

  • (Array<Array<Integer>>)

    Arrays of token IDs



# File 'lib/candle/tokenizer.rb', line 88

def encode_batch(texts, add_special_tokens: true)
  _native_encode_batch(texts, add_special_tokens)
end

#encode_batch_to_tokens(texts, add_special_tokens: true) ⇒ Array<Array<String>>

Encode multiple texts into token strings

Parameters:

  • texts (Array<String>)

    The texts to encode

  • add_special_tokens (Boolean) (defaults to: true)

    Whether to add special tokens (default: true)

Returns:

  • (Array<Array<String>>)

    Arrays of token strings



# File 'lib/candle/tokenizer.rb', line 97

def encode_batch_to_tokens(texts, add_special_tokens: true)
  _native_encode_batch_to_tokens(texts, add_special_tokens)
end

#encode_to_tokens(text, add_special_tokens: true) ⇒ Array<String>

Encode text into token strings (words/subwords)

Parameters:

  • text (String)

    The text to encode

  • add_special_tokens (Boolean) (defaults to: true)

    Whether to add special tokens (default: true)

Returns:

  • (Array<String>)

    Token strings



# File 'lib/candle/tokenizer.rb', line 70

def encode_to_tokens(text, add_special_tokens: true)
  _native_encode_to_tokens(text, add_special_tokens)
end

#encode_with_tokens(text, add_special_tokens: true) ⇒ Hash

Encode text and return both IDs and token strings

Parameters:

  • text (String)

    The text to encode

  • add_special_tokens (Boolean) (defaults to: true)

    Whether to add special tokens (default: true)

Returns:

  • (Hash)

    Hash with :ids and :tokens arrays



# File 'lib/candle/tokenizer.rb', line 79

def encode_with_tokens(text, add_special_tokens: true)
  _native_encode_with_tokens(text, add_special_tokens)
end
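
Because the result holds two parallel arrays, a common follow-up is to pair each token string with its ID. The sketch below is one way to do that. The helper name pair_tokens_with_ids is hypothetical, the sample hash is hand-written with made-up IDs rather than real tokenizer output, and the fallback to string keys is an assumption in case the hash is not keyed by symbols.

```ruby
# Pair each token string with its ID from an encode_with_tokens-style result.
# Assumption: keys are :ids/:tokens; string keys are tried as a fallback.
def pair_tokens_with_ids(result)
  ids    = result[:ids]    || result["ids"]
  tokens = result[:tokens] || result["tokens"]
  raise ArgumentError, "expected ids and tokens entries" if ids.nil? || tokens.nil?
  raise ArgumentError, "ids and tokens differ in length" unless ids.length == tokens.length

  tokens.zip(ids)
end

# Hand-written sample with made-up IDs, not real tokenizer output.
sample = { ids: [0, 1, 2], tokens: ["[CLS]", "hello", "[SEP]"] }
pairs = pair_tokens_with_ids(sample)
```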

#get_vocab(with_added_tokens: true) ⇒ Hash<String, Integer>

Get vocabulary with convenient keyword arguments

Parameters:

  • with_added_tokens (Boolean) (defaults to: true)

    Include added tokens (default: true)

Returns:

  • (Hash<String, Integer>)

    Token to ID mapping



# File 'lib/candle/tokenizer.rb', line 114

def get_vocab(with_added_tokens: true)
  _native_get_vocab(with_added_tokens)
end
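
Since the returned hash maps token strings to IDs, a reverse lookup (ID to token) can be built with Hash#invert. The vocabulary below is a hand-written stand-in with made-up entries, used only to show the shape of the data.

```ruby
# Hand-written stand-in for the Hash<String, Integer> returned by get_vocab.
vocab = { "[PAD]" => 0, "hello" => 1, "world" => 2 }

# Invert the token => ID mapping into an ID => token mapping.
id_to_token = vocab.invert

# Look up token strings for a list of IDs; fetch raises KeyError on unknown IDs.
tokens = [1, 2].map { |id| id_to_token.fetch(id) }
```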

#vocab_size(with_added_tokens: true) ⇒ Integer

Get vocabulary size with convenient keyword arguments

Parameters:

  • with_added_tokens (Boolean) (defaults to: true)

    Include added tokens (default: true)

Returns:

  • (Integer)

    Vocabulary size



# File 'lib/candle/tokenizer.rb', line 122

def vocab_size(with_added_tokens: true)
  _native_vocab_size(with_added_tokens)
end

#with_padding(**options) ⇒ Tokenizer

Create a new tokenizer with padding configuration

Parameters:

  • options (Hash)

    Padding options

Options Hash (**options):

  • :length (Integer)

    Pad every sequence to this fixed length

  • :max_length (Boolean)

    Pad to the length of the longest sequence in the batch

  • :direction (String)

    Padding direction ("left" or "right")

  • :pad_id (Integer)

    Padding token ID

  • :pad_token (String)

    Padding token string

Returns:

  • (Tokenizer)

    New tokenizer instance with padding enabled



# File 'lib/candle/tokenizer.rb', line 135

def with_padding(**options)
  _native_with_padding(options)
end
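
To make the options concrete, here is a pure-Ruby illustration of what fixed-length and batch-longest padding do to arrays of token IDs. It is a sketch of the idea only, not the library's implementation. The helper names pad_ids and pad_batch_to_longest, the default pad_id of 0, and the choice to leave over-long sequences untouched are all assumptions.

```ruby
# Pad one sequence of IDs to a fixed length (illustration only).
# Assumptions: pad_id defaults to 0; sequences already at or over the
# target length are returned unchanged (padding does not truncate).
def pad_ids(ids, length:, pad_id: 0, direction: "right")
  return ids.dup if ids.length >= length

  filler = Array.new(length - ids.length, pad_id)
  case direction
  when "right" then ids + filler
  when "left"  then filler + ids
  else raise ArgumentError, 'direction must be "left" or "right"'
  end
end

# Pad every sequence in a batch to the length of the longest one.
def pad_batch_to_longest(batch, pad_id: 0, direction: "right")
  longest = batch.map(&:length).max || 0
  batch.map { |ids| pad_ids(ids, length: longest, pad_id: pad_id, direction: direction) }
end

right_padded = pad_ids([5, 6, 7], length: 5)
left_padded  = pad_ids([5, 6, 7], length: 5, direction: "left")
batch_padded = pad_batch_to_longest([[1], [2, 3, 4]])
```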