Class: Candle::Tokenizer
- Inherits: Object
  - Object
    - Candle::Tokenizer
- Defined in: lib/candle/tokenizer.rb
Overview
Tokenizer class for text tokenization
This class provides methods to encode text into tokens and decode tokens back to text. It supports both single text and batch processing, with options for special tokens, padding, and truncation.
Instance Method Summary
- #_native_decode ⇒ Object
- #_native_encode ⇒ Object
  The native methods accept positional arguments, but we provide keyword argument interfaces for better Ruby ergonomics.
- #_native_encode_batch ⇒ Object
- #_native_encode_batch_to_tokens ⇒ Object
- #_native_encode_to_tokens ⇒ Object
- #_native_encode_with_tokens ⇒ Object
- #_native_get_vocab ⇒ Object
- #_native_vocab_size ⇒ Object
- #_native_with_padding ⇒ Object
- #decode(token_ids, skip_special_tokens: true) ⇒ String
  Decode token IDs with convenient keyword arguments.
- #encode(text, add_special_tokens: true) ⇒ Array<Integer>
  Encode text with convenient keyword arguments.
- #encode_batch(texts, add_special_tokens: true) ⇒ Array<Array<Integer>>
  Encode multiple texts with convenient keyword arguments.
- #encode_batch_to_tokens(texts, add_special_tokens: true) ⇒ Array<Array<String>>
  Encode multiple texts into token strings.
- #encode_to_tokens(text, add_special_tokens: true) ⇒ Array<String>
  Encode text into token strings (words/subwords).
- #encode_with_tokens(text, add_special_tokens: true) ⇒ Hash
  Encode text and return both IDs and token strings.
- #get_vocab(with_added_tokens: true) ⇒ Hash<String, Integer>
  Get vocabulary with convenient keyword arguments.
- #vocab_size(with_added_tokens: true) ⇒ Integer
  Get vocabulary size with convenient keyword arguments.
- #with_padding(**options) ⇒ Tokenizer
  Create a new tokenizer with padding configuration.
Instance Method Details
#_native_decode ⇒ Object
# File 'lib/candle/tokenizer.rb', line 51
alias_method :_native_decode, :decode
#_native_encode ⇒ Object
The native methods accept positional arguments, but we provide keyword argument interfaces for better Ruby ergonomics. We need to call the native methods with positional args.
# File 'lib/candle/tokenizer.rb', line 46
alias_method :_native_encode, :encode
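The alias-then-wrap pattern described above can be sketched in plain Ruby. This is a minimal illustration, not the library's code: `SketchTokenizer` and its toy positional `encode` (word lengths as IDs, 101 and 102 as special tokens) are hypothetical stand-ins for the native implementation.

```ruby
# Hypothetical stand-in class; the real positional methods live in the
# native extension, not in Ruby.
class SketchTokenizer
  # Pretend "native" method: takes its flag as a positional argument.
  def encode(text, add_special_tokens)
    ids = text.split.map(&:length)
    add_special_tokens ? [101, *ids, 102] : ids
  end

  # Keep a handle on the positional version before it is redefined.
  alias_method :_native_encode, :encode

  # Keyword-argument wrapper that forwards to the positional version.
  def encode(text, add_special_tokens: true)
    _native_encode(text, add_special_tokens)
  end
end
```

The alias has to be created before the keyword version is defined, otherwise `_native_encode` would point at the wrapper and recurse.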
#_native_encode_batch ⇒ Object
# File 'lib/candle/tokenizer.rb', line 49
alias_method :_native_encode_batch, :encode_batch
#_native_encode_batch_to_tokens ⇒ Object
# File 'lib/candle/tokenizer.rb', line 50
alias_method :_native_encode_batch_to_tokens, :encode_batch_to_tokens
#_native_encode_to_tokens ⇒ Object
# File 'lib/candle/tokenizer.rb', line 47
alias_method :_native_encode_to_tokens, :encode_to_tokens
#_native_encode_with_tokens ⇒ Object
# File 'lib/candle/tokenizer.rb', line 48
alias_method :_native_encode_with_tokens, :encode_with_tokens
#_native_get_vocab ⇒ Object
# File 'lib/candle/tokenizer.rb', line 52
alias_method :_native_get_vocab, :get_vocab
#_native_vocab_size ⇒ Object
# File 'lib/candle/tokenizer.rb', line 53
alias_method :_native_vocab_size, :vocab_size
#_native_with_padding ⇒ Object
# File 'lib/candle/tokenizer.rb', line 54
alias_method :_native_with_padding, :with_padding
#decode(token_ids, skip_special_tokens: true) ⇒ String
Decode token IDs with convenient keyword arguments.
# File 'lib/candle/tokenizer.rb', line 106
def decode(token_ids, skip_special_tokens: true)
  _native_decode(token_ids, skip_special_tokens)
end
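As a rough usage sketch, `#encode` and `#decode` can be combined into a round trip. The helper name `round_trip` is hypothetical; it only assumes an object that responds to the two signatures documented here, and it does not assume the decoded text is byte-identical to the input for every tokenizer.

```ruby
# Hypothetical helper: encode text to IDs, then decode the IDs back to text.
# `tokenizer` is any object exposing the #encode and #decode signatures
# documented on this page.
def round_trip(tokenizer, text)
  ids = tokenizer.encode(text, add_special_tokens: true)
  # Drop special tokens again so they do not appear in the decoded string.
  tokenizer.decode(ids, skip_special_tokens: true)
end
```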
#encode(text, add_special_tokens: true) ⇒ Array<Integer>
Encode text with convenient keyword arguments.
# File 'lib/candle/tokenizer.rb', line 61
def encode(text, add_special_tokens: true)
  _native_encode(text, add_special_tokens)
end
#encode_batch(texts, add_special_tokens: true) ⇒ Array<Array<Integer>>
Encode multiple texts with convenient keyword arguments.
# File 'lib/candle/tokenizer.rb', line 88
def encode_batch(texts, add_special_tokens: true)
  _native_encode_batch(texts, add_special_tokens)
end
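One possible use of the batch result is to find the longest encoded sequence, for example when deciding on a padding length. The helper `longest_sequence_length` below is a hypothetical sketch that relies only on the documented return type, `Array<Array<Integer>>`.

```ruby
# Hypothetical helper: length of the longest ID sequence in a batch.
# Returns 0 for an empty batch.
def longest_sequence_length(tokenizer, texts)
  batches = tokenizer.encode_batch(texts, add_special_tokens: true)
  batches.map(&:length).max || 0
end
```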
#encode_batch_to_tokens(texts, add_special_tokens: true) ⇒ Array<Array<String>>
Encode multiple texts into token strings.
# File 'lib/candle/tokenizer.rb', line 97
def encode_batch_to_tokens(texts, add_special_tokens: true)
  _native_encode_batch_to_tokens(texts, add_special_tokens)
end
#encode_to_tokens(text, add_special_tokens: true) ⇒ Array<String>
Encode text into token strings (words/subwords).
# File 'lib/candle/tokenizer.rb', line 70
def encode_to_tokens(text, add_special_tokens: true)
  _native_encode_to_tokens(text, add_special_tokens)
end
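For inspection, the token strings can be paired with their IDs. The sketch below uses a hypothetical helper, `token_id_pairs`, and assumes that `#encode_to_tokens` and `#encode` return sequences of equal length in the same order for the same input and flag; that assumption is not stated on this page.

```ruby
# Hypothetical helper: pair each token string with its integer ID.
# Assumes both methods tokenize the text identically (same length, same order).
def token_id_pairs(tokenizer, text, add_special_tokens: true)
  tokens = tokenizer.encode_to_tokens(text, add_special_tokens: add_special_tokens)
  ids = tokenizer.encode(text, add_special_tokens: add_special_tokens)
  tokens.zip(ids)
end
```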
#encode_with_tokens(text, add_special_tokens: true) ⇒ Hash
Encode text and return both IDs and token strings.
# File 'lib/candle/tokenizer.rb', line 79
def encode_with_tokens(text, add_special_tokens: true)
  _native_encode_with_tokens(text, add_special_tokens)
end
#get_vocab(with_added_tokens: true) ⇒ Hash<String, Integer>
Get vocabulary with convenient keyword arguments.
# File 'lib/candle/tokenizer.rb', line 114
def get_vocab(with_added_tokens: true)
  _native_get_vocab(with_added_tokens)
end
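Because `#get_vocab` is documented as returning `Hash<String, Integer>`, a reverse lookup from ID to token can be built with `Hash#invert`. The helper name `id_to_token_table` is hypothetical, and the sketch assumes each ID maps to a single token string.

```ruby
# Hypothetical helper: build an ID => token lookup table from the vocabulary.
# Hash#invert keeps only one key per value, so duplicate IDs would collapse.
def id_to_token_table(tokenizer)
  tokenizer.get_vocab(with_added_tokens: true).invert
end
```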
#vocab_size(with_added_tokens: true) ⇒ Integer
Get vocabulary size with convenient keyword arguments.
# File 'lib/candle/tokenizer.rb', line 122
def vocab_size(with_added_tokens: true)
  _native_vocab_size(with_added_tokens)
end
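The `with_added_tokens` flag suggests a simple way to count tokens added on top of the base vocabulary. The helper `added_token_count` is a hypothetical sketch; it assumes that `with_added_tokens: false` reports the base vocabulary size only.

```ruby
# Hypothetical helper: number of tokens added on top of the base vocabulary.
def added_token_count(tokenizer)
  with_added = tokenizer.vocab_size(with_added_tokens: true)
  base_only = tokenizer.vocab_size(with_added_tokens: false)
  with_added - base_only
end
```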
#with_padding(**options) ⇒ Tokenizer
Create a new tokenizer with padding configuration.
# File 'lib/candle/tokenizer.rb', line 135
def with_padding(**options)
  _native_with_padding(options)
end