Class: Candle::Tokenizer

Inherits: Object

Defined in: lib/candle/tokenizer.rb

Overview

Tokenizer class for text tokenization

This class provides methods to encode text into tokens and decode tokens back to text. It supports both single text and batch processing, with options for special tokens, padding, and truncation.

Examples:

Create a tokenizer from a pretrained model

tokenizer = Candle::Tokenizer.from_pretrained("bert-base-uncased")

Encode and decode text

tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)

Batch encoding

texts = ["Hello", "World", "Test"]
batch_tokens = tokenizer.encode_batch(texts)

Configure padding and truncation

padded_tokenizer = tokenizer.with_padding(length: 128)
truncated_tokenizer = tokenizer.with_truncation(512)

Instance Method Summary

#_native_decode ⇒ Object
#_native_encode ⇒ Object

The native methods accept positional arguments, but we provide keyword argument interfaces for better Ruby ergonomics.
#_native_encode_batch ⇒ Object
#_native_encode_batch_to_tokens ⇒ Object
#_native_encode_to_tokens ⇒ Object
#_native_encode_with_tokens ⇒ Object
#_native_get_vocab ⇒ Object
#_native_vocab_size ⇒ Object
#_native_with_padding ⇒ Object
#decode(token_ids, skip_special_tokens: true) ⇒ String

Decode token IDs with convenient keyword arguments.
#encode(text, add_special_tokens: true) ⇒ Array<Integer>

Encode text with convenient keyword arguments.
#encode_batch(texts, add_special_tokens: true) ⇒ Array<Array<Integer>>

Encode multiple texts with convenient keyword arguments.
#encode_batch_to_tokens(texts, add_special_tokens: true) ⇒ Array<Array<String>>

Encode multiple texts into token strings.
#encode_to_tokens(text, add_special_tokens: true) ⇒ Array<String>

Encode text into token strings (words/subwords).
#encode_with_tokens(text, add_special_tokens: true) ⇒ Hash

Encode text and return both IDs and token strings.
#get_vocab(with_added_tokens: true) ⇒ Hash<String, Integer>

Get vocabulary with convenient keyword arguments.
#vocab_size(with_added_tokens: true) ⇒ Integer

Get vocabulary size with convenient keyword arguments.
#with_padding(**options) ⇒ Tokenizer

Create a new tokenizer with padding configuration.

Instance Method Details

#_native_decode ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 51

alias_method :_native_decode, :decode

#_native_encode ⇒ `Object`

The native methods accept positional arguments, but we provide keyword argument interfaces for better Ruby ergonomics. Each keyword wrapper therefore calls its `_native_` alias with positional arguments.

# File 'lib/candle/tokenizer.rb', line 46

alias_method :_native_encode, :encode
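
The alias-then-wrap pattern described here can be sketched in plain Ruby. This is a minimal illustration, not the gem's code: `NativeLikeTokenizer` and its toy word-length encoding are hypothetical stand-ins for a class whose `encode` is defined natively with positional arguments.

```ruby
# Hypothetical stand-in for a class whose methods come from a native
# extension and take positional arguments only.
class NativeLikeTokenizer
  # "Native" signature: both arguments are positional.
  def encode(text, add_special_tokens)
    ids = text.split.map(&:length) # toy encoding: one ID per word (its length)
    add_special_tokens ? [0, *ids, 99] : ids
  end

  # Keep a handle on the positional version before the name is redefined.
  alias_method :_native_encode, :encode

  # Public keyword interface that forwards to the alias positionally.
  def encode(text, add_special_tokens: true)
    _native_encode(text, add_special_tokens)
  end
end

tok = NativeLikeTokenizer.new
with_special = tok.encode("hello big world")
without_special = tok.encode("hello big world", add_special_tokens: false)
```

Because `alias_method` captures the method body at the point it is called, the later `def encode` replaces only the public name and the alias still reaches the positional version.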

#_native_encode_batch ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 49

alias_method :_native_encode_batch, :encode_batch

#_native_encode_batch_to_tokens ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 50

alias_method :_native_encode_batch_to_tokens, :encode_batch_to_tokens

#_native_encode_to_tokens ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 47

alias_method :_native_encode_to_tokens, :encode_to_tokens

#_native_encode_with_tokens ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 48

alias_method :_native_encode_with_tokens, :encode_with_tokens

#_native_get_vocab ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 52

alias_method :_native_get_vocab, :get_vocab

#_native_vocab_size ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 53

alias_method :_native_vocab_size, :vocab_size

#_native_with_padding ⇒ `Object`

# File 'lib/candle/tokenizer.rb', line 54

alias_method :_native_with_padding, :with_padding

#decode(token_ids, skip_special_tokens: true) ⇒ `String`

Decode token IDs with convenient keyword arguments

Parameters:

token_ids (Array<Integer>) —

The token IDs to decode
skip_special_tokens (Boolean) (defaults to: true) —

Whether to skip special tokens (default: true)

Returns:

(String) —

Decoded text



# File 'lib/candle/tokenizer.rb', line 106

def decode(token_ids, skip_special_tokens: true)
  _native_decode(token_ids, skip_special_tokens)
end

#encode(text, add_special_tokens: true) ⇒ `Array<Integer>`

Encode text with convenient keyword arguments

Parameters:

text (String) —

The text to encode
add_special_tokens (Boolean) (defaults to: true) —

Whether to add special tokens (default: true)

Returns:

(Array<Integer>) —

Token IDs



# File 'lib/candle/tokenizer.rb', line 61

def encode(text, add_special_tokens: true)
  _native_encode(text, add_special_tokens)
end

#encode_batch(texts, add_special_tokens: true) ⇒ `Array<Array<Integer>>`

Encode multiple texts with convenient keyword arguments

Parameters:

texts (Array<String>) —

The texts to encode
add_special_tokens (Boolean) (defaults to: true) —

Whether to add special tokens (default: true)

Returns:

(Array<Array<Integer>>) —

Arrays of token IDs



# File 'lib/candle/tokenizer.rb', line 88

def encode_batch(texts, add_special_tokens: true)
  _native_encode_batch(texts, add_special_tokens)
end

#encode_batch_to_tokens(texts, add_special_tokens: true) ⇒ `Array<Array<String>>`

Encode multiple texts into token strings

Parameters:

texts (Array<String>) —

The texts to encode
add_special_tokens (Boolean) (defaults to: true) —

Whether to add special tokens (default: true)

Returns:

(Array<Array<String>>) —

Arrays of token strings



# File 'lib/candle/tokenizer.rb', line 97

def encode_batch_to_tokens(texts, add_special_tokens: true)
  _native_encode_batch_to_tokens(texts, add_special_tokens)
end

#encode_to_tokens(text, add_special_tokens: true) ⇒ `Array<String>`

Encode text into token strings (words/subwords)

Parameters:

text (String) —

The text to encode
add_special_tokens (Boolean) (defaults to: true) —

Whether to add special tokens (default: true)

Returns:

(Array<String>) —

Token strings



# File 'lib/candle/tokenizer.rb', line 70

def encode_to_tokens(text, add_special_tokens: true)
  _native_encode_to_tokens(text, add_special_tokens)
end

#encode_with_tokens(text, add_special_tokens: true) ⇒ `Hash`

Encode text and return both IDs and token strings

Parameters:

text (String) —

The text to encode
add_special_tokens (Boolean) (defaults to: true) —

Whether to add special tokens (default: true)

Returns:

(Hash) —

Hash with :ids and :tokens arrays



# File 'lib/candle/tokenizer.rb', line 79

def encode_with_tokens(text, add_special_tokens: true)
  _native_encode_with_tokens(text, add_special_tokens)
end
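
Since the returned Hash carries two parallel arrays, a common next step is to pair each token string with its ID. The sketch below assumes the two arrays have equal length. `token_id_pairs` is a hypothetical helper and the sample Hash is hand-written, not real tokenizer output. The documentation above names the keys `:ids` and `:tokens`; the helper also accepts string keys as a defensive assumption, since the key type may depend on the native layer.

```ruby
# Pair each token string with its ID from an encode_with_tokens-style Hash.
# token_id_pairs is a hypothetical helper, not part of Candle::Tokenizer.
def token_id_pairs(result)
  # Accept symbol or string keys (assumption: check your installed version).
  ids = result[:ids] || result["ids"]
  tokens = result[:tokens] || result["tokens"]
  raise ArgumentError, "expected ids and tokens arrays" unless ids && tokens
  raise ArgumentError, "ids and tokens differ in length" unless ids.length == tokens.length

  tokens.zip(ids)
end

# Hand-written sample data for illustration only.
sample = { ids: [101, 7592, 102], tokens: ["[CLS]", "hello", "[SEP]"] }
pairs = token_id_pairs(sample)
```

With a real tokenizer, the argument would be the return value of `tokenizer.encode_with_tokens(text)`.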

#get_vocab(with_added_tokens: true) ⇒ `Hash<String, Integer>`

Get vocabulary with convenient keyword arguments

Parameters:

with_added_tokens (Boolean) (defaults to: true) —

Include added tokens (default: true)

Returns:

(Hash<String, Integer>) —

Token to ID mapping



# File 'lib/candle/tokenizer.rb', line 114

def get_vocab(with_added_tokens: true)
  _native_get_vocab(with_added_tokens)
end
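
The returned mapping goes from token string to ID. When the reverse lookup is needed, the Hash can be inverted. The sketch below uses a small hand-written vocabulary in place of a real `get_vocab` result and assumes each ID belongs to exactly one token.

```ruby
# Hand-written stand-in for the Hash<String, Integer> returned by get_vocab.
vocab = { "[PAD]" => 0, "[UNK]" => 1, "hello" => 2, "world" => 3 }

# Reverse index: ID => token string (assumes IDs are unique).
id_to_token = vocab.invert

# Look up tokens for a list of IDs, with a marker for IDs not in the vocabulary.
ids = [2, 3, 42]
tokens = ids.map { |id| id_to_token.fetch(id, "<missing>") }
```

Building the inverted Hash once and reusing it avoids scanning the vocabulary for every lookup.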

#vocab_size(with_added_tokens: true) ⇒ `Integer`

Get vocabulary size with convenient keyword arguments

Parameters:

with_added_tokens (Boolean) (defaults to: true) —

Include added tokens (default: true)

Returns:

(Integer) —

Vocabulary size



# File 'lib/candle/tokenizer.rb', line 122

def vocab_size(with_added_tokens: true)
  _native_vocab_size(with_added_tokens)
end

#with_padding(**options) ⇒ `Tokenizer`

Create a new tokenizer with padding configuration

Parameters:

options (Hash) —

Padding options

Options Hash (**options):

:length (Integer) —

Fixed length padding
:max_length (Boolean) —

Use batch longest padding
:direction (String) —

Padding direction (“left” or “right”)
:pad_id (Integer) —

Padding token ID
:pad_token (String) —

Padding token string

Returns:

(Tokenizer) —

New tokenizer instance with padding enabled



# File 'lib/candle/tokenizer.rb', line 135

def with_padding(**options)
  _native_with_padding(options)
end
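
To make the `:length`, `:direction`, and `:pad_id` options concrete, here is a conceptual illustration of fixed-length padding in plain Ruby. `pad_ids` is a hypothetical helper, not the library's implementation, and it assumes that padding never shortens a sequence that is already long enough.

```ruby
# Conceptual illustration only: pad a list of token IDs to a fixed length.
# pad_ids is a hypothetical helper, not part of Candle::Tokenizer.
def pad_ids(ids, length:, pad_id: 0, direction: "right")
  missing = length - ids.length
  return ids.dup if missing <= 0 # already long enough; this sketch never truncates

  filler = Array.new(missing, pad_id)
  case direction
  when "right" then ids + filler
  when "left"  then filler + ids
  else raise ArgumentError, "direction must be \"left\" or \"right\""
  end
end

right = pad_ids([7, 8, 9], length: 5)
left  = pad_ids([7, 8, 9], length: 5, direction: "left")
```

With the real class, the equivalent configuration would be passed as keywords, for example `tokenizer.with_padding(length: 5, direction: "left")`, and the padding would be applied by the returned tokenizer during encoding.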
