Tokenizers
Tokenizers are designed to tokenize a group of similar objects, for example a column of strings in a dataframe. These objects should share the same type or pattern.
UniTok provides a list of built-in tokenizers, which are all derived from the BaseTokenizer class.
You can also create your own tokenizer by extending the BaseTokenizer class. Please refer to the Custom Tokenizer page for more information.
Built-in Tokenizers
You can click on the tokenizer name to see the details of the tokenizer.
Tokenizer | Object Example | Return Example | Description |
---|---|---|---|
EntityTokenizer | 'apple' | 0 | The object is a string, and is treated as a single token. |
EntitiesTokenizer | ['apple', 'orange'] | [0, 1] | The object is a list of strings, and each string is treated as a single token. |
SplitTokenizer | 'apple orange' | [0, 1] | The object is a string, and is split by a delimiter. |
DigitTokenizer | 123 | 123 | The object is an integer, and is returned as is. |
DigitsTokenizer | [123, 456] | [123, 456] | The object is a list of integers, and each integer is returned as is. |
BertTokenizer | 'apple orange' | [6207, 4589] | The object is a string, and is tokenized by the BertTokenizer from Hugging Face. |
TransformersTokenizer | 'apple orange' | [26163, 24841] | The object is a string, and is tokenized by a pretrained tokenizer from Hugging Face. |
GloVeTokenizer | 'apple orange' | [3292, 3200] | The object is a string, and is tokenized by the GloVe tokenizer. |
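The vocabulary-based rows of the table can be illustrated with a small, library-free sketch. It does not use UniTok: `ToyVocab` and the four helper functions are hypothetical names invented here, and the assumption that indices are assigned in first-seen order starting at 0 is taken only from the return examples in the table.

```python
from typing import Dict, List


class ToyVocab:
    """Hypothetical stand-in for a vocabulary: indices are assigned in first-seen order."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}

    def append(self, token: str) -> int:
        # Reuse the index of a known token, otherwise assign the next free one.
        if token not in self._index:
            self._index[token] = len(self._index)
        return self._index[token]


def entity(vocab: ToyVocab, obj: str) -> int:
    # The whole string is one token.
    return vocab.append(obj)


def entities(vocab: ToyVocab, obj: List[str]) -> List[int]:
    # Each string in the list is one token.
    return [vocab.append(item) for item in obj]


def split(vocab: ToyVocab, obj: str, sep: str = ' ') -> List[int]:
    # The string is cut at the delimiter and each piece is one token.
    return [vocab.append(piece) for piece in obj.split(sep)]


def digit(obj: int) -> int:
    # Integers bypass the vocabulary and are returned unchanged.
    return obj


vocab = ToyVocab()
entity_result = entity(vocab, 'apple')
entities_result = entities(vocab, ['apple', 'orange'])
split_result = split(vocab, 'apple orange')
digit_result = digit(123)
print(entity_result, entities_result, split_result, digit_result)
```

Because the three string helpers share one `ToyVocab`, 'apple' keeps index 0 and 'orange' keeps index 1 across all calls.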
Constructor
CLASS
unitok.BaseTokenizer()
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
vocab | N/A | Union[unitok.Vocab, str] | tokens | The vocabulary to use. If a string is provided, a new vocabulary is created if none with that name exists in the current space. |
tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (auto_ |
Please carefully compare the tokenizers <e, f> with the tokenizers <h, i> below. The declaration order is very important. When a vocabulary from another space is used, it is registered in the current space.
from unitok import EntityTokenizer  # BaseTokenizer is an abstract class, so EntityTokenizer is used for demonstration
from unitok import UniTok, Vocab, VocabHub

vegetables = Vocab(name='vegetables')
music = Vocab(name='music')
tokenizer_a = EntityTokenizer(vocab='animals', tokenizer_id='entity-animals')  # OK, creates a new vocabulary 'animals' in the default space

with UniTok() as ut:
    fruits = Vocab(name='fruits')
    tokenizer_b = EntityTokenizer(vocab=fruits)  # OK, will use the existing vocabulary 'fruits' with a random tokenizer ID
    tokenizer_c = EntityTokenizer(vocab='fruits', tokenizer_id='entity-fruits')  # OK, same as tokenizer_b, will use the existing vocabulary 'fruits'
    tokenizer_d = EntityTokenizer(vocab='berries', tokenizer_id='entity-fruits')  # ValueError: Conflict object declaration, the ID 'entity-fruits' is already used
    tokenizer_e = EntityTokenizer(vocab='vegetables')  # OK, creates a new vocabulary named 'vegetables' with a random tokenizer ID
    tokenizer_f = EntityTokenizer(vocab=vegetables)  # ValueError: Conflict object declaration, the 'vegetables' vocabularies from the default space and from tokenizer_e are different
    tokenizer_g = EntityTokenizer(vocab='animals', tokenizer_id='entity-animals')  # OK, a new 'animals' vocabulary is created in the ut space, and the tokenizer_id does not conflict with the one in the default space
    tokenizer_h = EntityTokenizer(vocab=music)  # OK, will use the existing vocabulary 'music' from the default space with a random tokenizer ID
    tokenizer_i = EntityTokenizer(vocab='music')  # OK, will use the existing vocabulary 'music' from the default space with a random tokenizer ID

with UniTok() as another_ut:
    tokenizer_j = EntityTokenizer(vocab='music')  # OK, creates a new vocabulary named 'music' in the another_ut space with a random tokenizer ID
    VocabHub.add(music)  # manually add the 'music' vocabulary from the default space to the another_ut space
    tokenizer_k = EntityTokenizer(vocab='music')  # OK, will use the existing vocabulary 'music' with a random tokenizer ID
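Why the declaration order matters can be seen in a toy model of a space. This is only a sketch of the behaviour described in the comments above, not UniTok's implementation: `ToyVocab`, `ToySpace` and `resolve` are hypothetical names, and the rule that a name is looked up in the current space only is an assumption drawn from the e/f and h/i cases.

```python
from typing import Dict, Union


class ToyVocab:
    """Hypothetical vocabulary that only carries a name."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToySpace:
    """Hypothetical registry mapping vocabulary names to vocabulary objects."""

    def __init__(self) -> None:
        self.vocabs: Dict[str, ToyVocab] = {}

    def resolve(self, vocab: Union[ToyVocab, str]) -> ToyVocab:
        if isinstance(vocab, str):
            # A name is looked up in this space only; it is created if missing.
            if vocab not in self.vocabs:
                self.vocabs[vocab] = ToyVocab(vocab)
            return self.vocabs[vocab]
        existing = self.vocabs.get(vocab.name)
        if existing is not None and existing is not vocab:
            # A different object already owns this name in this space.
            raise ValueError(f'Conflict object declaration: {vocab.name!r}')
        # An object from elsewhere is registered in this space on first use.
        self.vocabs[vocab.name] = vocab
        return vocab


vegetables = ToyVocab('vegetables')  # lives outside the toy space
music = ToyVocab('music')            # lives outside the toy space
ut = ToySpace()

# Order <e, f>: the name comes first, so a fresh 'vegetables' is created,
# and passing the outside object afterwards conflicts with it.
created = ut.resolve('vegetables')
try:
    ut.resolve(vegetables)
    conflict = False
except ValueError:
    conflict = True

# Order <h, i>: the object comes first and is registered,
# so the later lookup by name finds that same object.
registered = ut.resolve(music)
reused = ut.resolve('music')
print(conflict, reused is music)
```

In this model, passing the vocabulary object before referring to it by name is what lets the name resolve to the shared vocabulary.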
Attributes
Attribute | Type | Description |
---|---|---|
vocab | unitok.Vocab | The vocabulary used by the tokenizer. |
_tokenizer_id | str | The ID of the tokenizer. |
return_list | bool | Whether the tokenizer should return a list of tokens. |
param_list | List[str] | The list of parameters for the constructor. |
Magic Methods
METHOD
__call__() -> Union[int, List[int]]
Tokenize the object and return the token or a list of tokens.
Parameter | Type | Example | Description |
---|---|---|---|
obj | Any | 'apple' | The object to tokenize. |
from unitok import EntityTokenizer, EntitiesTokenizer
entity = EntityTokenizer(vocab='fruits')
print(entity('apple')) # 0
print(entity('orange')) # 1
entities = EntitiesTokenizer(vocab='fruits')
print(entities(['banana', 'orange'])) # [2, 1]
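The indices in the example above follow from both tokenizers sharing the 'fruits' vocabulary. The sketch below mimics that with plain Python so it can run without UniTok. `ToyTokenizer`, `ToyListTokenizer` and the `_TOY_VOCABS` registry are hypothetical, and the way `return_list` switches between a single index and a list of indices is an assumption based on the attribute description, not the library's code.

```python
from typing import Any, Dict, List, Union

# Hypothetical module-level registry, so that tokenizers naming the same
# vocabulary share one index table.
_TOY_VOCABS: Dict[str, Dict[str, int]] = {}


class ToyTokenizer:
    return_list = False  # scalar tokenizers return a single index

    def __init__(self, vocab: str) -> None:
        self.vocab = _TOY_VOCABS.setdefault(vocab, {})

    def _index(self, token: str) -> int:
        # Known tokens keep their index; new tokens get the next free one.
        return self.vocab.setdefault(token, len(self.vocab))

    def __call__(self, obj: Any) -> Union[int, List[int]]:
        if self.return_list:
            return [self._index(item) for item in obj]
        return self._index(obj)


class ToyListTokenizer(ToyTokenizer):
    return_list = True  # list tokenizers return one index per element


entity = ToyTokenizer(vocab='fruits')
entities = ToyListTokenizer(vocab='fruits')

# Tokenizing a column means calling the tokenizer once per object.
column = ['apple', 'orange', 'apple']
column_ids = [entity(value) for value in column]

# 'banana' is new and gets the next index; 'orange' is already known.
list_ids = entities(['banana', 'orange'])
print(column_ids, list_ids)
```

With this toy model the column maps to [0, 1, 0] and the list maps to [2, 1], matching the pattern of the example above.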
Methods
METHOD
get_tokenizer_id() -> str
Return the ID of the tokenizer.
from unitok import EntityTokenizer
entity = EntityTokenizer(vocab='fruits')
print(entity.get_tokenizer_id()) # 'auto_mKVTdn'
METHOD
get_classname() -> str
Return the class name of the tokenizer.
from unitok import EntityTokenizer
entity = EntityTokenizer(vocab='fruits')
print(entity.get_classname()) # 'Entity'
METHOD
json() -> Dict[str, Any]
Return the metadata of the tokenizer.