Tokenizers

Tokenizers are designed to tokenize a group of similar objects, for example a column of strings in a dataframe. The objects in a group should share the same type or pattern. UniTok provides a set of built-in tokenizers, all of which are derived from the BaseTokenizer class.

You can also create your own tokenizer by extending the BaseTokenizer class. Please refer to the Custom Tokenizer page for more information.

Built-in Tokenizers

You can click on the tokenizer name to see the details of the tokenizer.

Tokenizer	Object Example	Return Example	Description
EntityTokenizer	`'apple'`	`0`	The object is a string, and is treated as a single token.
EntitiesTokenizer	`['apple', 'orange']`	`[0, 1]`	The object is a list of strings, and each string is treated as a single token.
SplitTokenizer	`'apple orange'`	`[0, 1]`	The object is a string, and is split by a delimiter.
DigitTokenizer	`123`	`123`	The object is an integer, and is returned unchanged.
DigitsTokenizer	`[123, 456]`	`[123, 456]`	The object is a list of integers, and each integer will be returned.
BertTokenizer	`'apple orange'`	`[6207, 4589]`	The object is a string, and is tokenized by the BertTokenizer from Hugging Face.
TransformersTokenizer	`'apple orange'`	`[26163, 24841]`	The object is a string, and is tokenized by a pretrained tokenizer from Hugging Face.
GloVeTokenizer	`'apple orange'`	`[3292, 3200]`	The object is a string, and is tokenized by the GloVe tokenizer.
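
The difference between the single-token, list, and pass-through rows in the table can be illustrated with a small stand-alone sketch. It does not use UniTok: `ToyVocab` and the four functions are hypothetical stand-ins, and the assumption that a vocabulary hands out indices in order of first appearance is inferred from the return examples above.

```python
from typing import Dict, List


class ToyVocab:
    """Hypothetical stand-in for a vocabulary (not unitok.Vocab)."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}

    def append(self, token: str) -> int:
        # Reuse the index of a known token, otherwise assign the next free one.
        if token not in self._index:
            self._index[token] = len(self._index)
        return self._index[token]


def entity(vocab: ToyVocab, obj: str) -> int:
    # The whole string is one token.
    return vocab.append(obj)


def entities(vocab: ToyVocab, obj: List[str]) -> List[int]:
    # Each string in the list is one token.
    return [vocab.append(item) for item in obj]


def split(vocab: ToyVocab, obj: str, sep: str = ' ') -> List[int]:
    # The string is cut at the delimiter; each piece is one token.
    return [vocab.append(piece) for piece in obj.split(sep)]


def digit(obj: int) -> int:
    # Integers bypass the vocabulary and are returned unchanged.
    return obj


vocab = ToyVocab()
print(entity(vocab, 'apple'))                # 0
print(entities(vocab, ['apple', 'orange']))  # [0, 1]
print(split(vocab, 'apple orange'))          # [0, 1]
print(digit(123))                            # 123
```

The pretrained tokenizers (BertTokenizer, TransformersTokenizer, GloVeTokenizer) are left out of the sketch because their indices come from an external vocabulary rather than from the order in which tokens are seen.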

Constructor

`CLASS` unitok.BaseTokenizer()

Parameter	Default	Type	Example	Description
`vocab`	`N/A`	Union[unitok.Vocab, str]	`tokens`	The vocabulary to use. If a string is provided, a new vocabulary with that name is created if one does not already exist in the current space.
`tokenizer_id`	`None`	Optional[str]	`'tkn-abcd'`	The ID of the tokenizer. If not provided, a random ID prefixed with auto_ is generated. It must be unique within each space.

Please carefully compare tokenizers <e, f> with tokenizers <h, i> below: the declaration order matters. When a vocabulary from another space is used, it is registered in the current space.

from unitok import EntityTokenizer  # BaseTokenizer is an abstract class, we will use EntityTokenizer for demonstration
from unitok import UniTok, Vocab, VocabHub

vegetables = Vocab(name='vegetables')
music = Vocab(name='music')
tokenizer_a = EntityTokenizer(vocab='animals', tokenizer_id='entity-animals')  # OK, create a new vocabulary named 'animals' in the default space

with UniTok() as ut:
    fruits = Vocab(name='fruits')
    tokenizer_b = EntityTokenizer(vocab=fruits)  # OK, will use the existing vocabulary 'fruits' with a random tokenizer ID
    tokenizer_c = EntityTokenizer(vocab='fruits', tokenizer_id='entity-fruits')  # OK, same as tokenizer_b, will use the existing vocabulary 'fruits'
    tokenizer_d = EntityTokenizer(vocab='berries', tokenizer_id='entity-fruits')  # ValueError: Conflict object declaration, the ID 'entity-fruits' is already used
    tokenizer_e = EntityTokenizer(vocab='vegetables')  # OK, create a new vocabulary named 'vegetables' with a random tokenizer ID
    tokenizer_f = EntityTokenizer(vocab=vegetables)  # ValueError: Conflict object declaration, the 'vegetables' vocabularies from the default space and the tokenizer_e are different
    tokenizer_g = EntityTokenizer(vocab='animals', tokenizer_id='entity-animals')  # OK, a new 'animals' vocabulary will be created in the ut space, and the tokenizer_id will not conflict with the one in the default space
    tokenizer_h = EntityTokenizer(vocab=music)  # OK, will use the existing vocabulary 'music' from the default space with a random tokenizer ID
    tokenizer_i = EntityTokenizer(vocab='music')  # OK, will use the existing vocabulary 'music' from the default space with a random tokenizer ID

with UniTok() as another_ut:
    tokenizer_j = EntityTokenizer(vocab='music')  # OK, create a new vocabulary named 'music' in the another_ut space with a random tokenizer ID
    VocabHub.add(music)  # manually add the 'music' vocabulary from the default space to the another_ut space
    tokenizer_k = EntityTokenizer(vocab='music')  # OK, will use the existing vocabulary 'music' with a random tokenizer ID
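
Why the order of declarations matters can be seen in a toy model of a space. This is a simplified sketch, not UniTok's implementation: `ToyVocab`, `ToySpace`, and `resolve` are hypothetical names, and the model assumes only that a space holds one vocabulary object per name. Vocabularies created outside the toy space stand in for those from the default space.

```python
from typing import Dict, Union


class ToyVocab:
    """Hypothetical stand-in for a vocabulary; only the name matters here."""

    def __init__(self, name: str) -> None:
        self.name = name


class ToySpace:
    """Hypothetical model of a space: at most one vocabulary object per name."""

    def __init__(self) -> None:
        self.vocabs: Dict[str, ToyVocab] = {}

    def resolve(self, vocab: Union[ToyVocab, str]) -> ToyVocab:
        if isinstance(vocab, str):
            # A name reuses the registered vocabulary, or creates a new one.
            if vocab not in self.vocabs:
                self.vocabs[vocab] = ToyVocab(vocab)
            return self.vocabs[vocab]
        # An object is registered under its name unless a different
        # object already owns that name in this space.
        registered = self.vocabs.setdefault(vocab.name, vocab)
        if registered is not vocab:
            raise ValueError(f'Conflict object declaration: {vocab.name!r}')
        return vocab


vegetables = ToyVocab('vegetables')  # created outside the space
music = ToyVocab('music')            # created outside the space

space = ToySpace()

space.resolve('vegetables')          # like tokenizer_e: name first, new vocabulary
try:
    space.resolve(vegetables)        # like tokenizer_f: object second, conflict
except ValueError as error:
    print(error)                     # Conflict object declaration: 'vegetables'

space.resolve(music)                 # like tokenizer_h: object first, registered
print(space.resolve('music') is music)  # like tokenizer_i: True
```

Passing the object first registers it, so a later lookup by name finds the same vocabulary. Passing the name first creates a fresh vocabulary, which then clashes with the outside object of the same name.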

Attributes

Attribute	Type	Description
`vocab`	unitok.Vocab	The vocabulary used by the tokenizer.
`_tokenizer_id`	str	The ID of the tokenizer.
`return_list`	bool	Whether the tokenizer should return a list of tokens.
`param_list`	List[str]	The list of parameters for the constructor.

Magic Methods

`METHOD` __call__() -> Union[int, List[int]]

Tokenize the object and return the token or a list of tokens.

Parameter	Type	Example	Description
`obj`	Any	`'apple'`	The object to tokenize.

from unitok import EntityTokenizer, EntitiesTokenizer

entity = EntityTokenizer(vocab='fruits')
print(entity('apple'))  # 0
print(entity('orange'))  # 1

entities = EntitiesTokenizer(vocab='fruits')
print(entities(['banana', 'orange']))  # [2, 1]
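
The indices in the example above depend on both tokenizers writing to the same 'fruits' vocabulary: 'orange' keeps index 1 when it is seen again, and 'banana' receives the next free index. A minimal stand-alone sketch of that behavior follows. It assumes indices are assigned in order of first appearance, and the `fruits` dictionary and the three functions are hypothetical stand-ins rather than UniTok objects.

```python
from typing import Dict, List

# Stand-in for the shared 'fruits' vocabulary: token -> index.
fruits: Dict[str, int] = {}


def index_of(token: str) -> int:
    # Known tokens keep their index; new tokens get the next free one.
    return fruits.setdefault(token, len(fruits))


def entity(obj: str) -> int:
    return index_of(obj)


def entities(obj: List[str]) -> List[int]:
    return [index_of(item) for item in obj]


print(entity('apple'))                 # 0
print(entity('orange'))                # 1
print(entities(['banana', 'orange']))  # [2, 1]
```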

Methods

`METHOD` get_tokenizer_id() -> str

Return the ID of the tokenizer.

from unitok import EntityTokenizer

entity = EntityTokenizer(vocab='fruits')
print(entity.get_tokenizer_id())  # 'auto_mKVTdn'

`METHOD` get_classname() -> str

Return the class name of the tokenizer.

from unitok import EntityTokenizer

entity = EntityTokenizer(vocab='fruits')
print(entity.get_classname())  # 'Entity'

`METHOD` json() -> Dict[str, Any]

Return the metadata of the tokenizer.

from unitok import EntityTokenizer

entity = EntityTokenizer(vocab='fruits')
print(entity.json())  # {'tokenizer_id': 'auto_mKVTdn', 'vocab': 'fruits', 'classname': 'Entity', 'params': {}}
