EntityTokenizer Series
EntityTokenizer and EntitiesTokenizer are designed for tokenizing and indexing (a list of) unsplittable objects. EntityTokenizer accepts a single object and returns a single index, while EntitiesTokenizer accepts a list of objects and returns a list of corresponding indices.
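The difference between the two call shapes can be illustrated with a minimal pure-Python sketch. It does not use unitok and is not how the library is implemented; `ToyVocab`, `tokenize_entity`, and `tokenize_entities` are hypothetical names introduced here only to show that each object is indexed whole, in order of first appearance.

```python
from typing import Dict, Hashable, List


class ToyVocab:
    """Hypothetical stand-in for a vocabulary: maps each distinct object to an index."""

    def __init__(self) -> None:
        self._index: Dict[Hashable, int] = {}

    def append(self, obj: Hashable) -> int:
        # Assign the next free index the first time an object is seen.
        if obj not in self._index:
            self._index[obj] = len(self._index)
        return self._index[obj]


def tokenize_entity(vocab: ToyVocab, obj: Hashable) -> int:
    # One object in, one index out; the object is never split.
    return vocab.append(obj)


def tokenize_entities(vocab: ToyVocab, objs: List[Hashable]) -> List[int]:
    # A list of objects in, a list of indices out, in the same order.
    return [vocab.append(obj) for obj in objs]


vocab = ToyVocab()
single = tokenize_entity(vocab, "Tour de ski")
many = tokenize_entities(vocab, ["Langrenn", "Tour de ski"])
print(single)  # 0
print(many)    # [1, 0]
```

Note that "Tour de ski" receives a single index even though it contains spaces, which is what "unsplittable" means in this context.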
Constructor
CLASS unitok.EntityTokenizer()
CLASS unitok.EntitiesTokenizer()
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
vocab | N/A | Union[unitok.Vocab, str] | tokens | The vocabulary to use. If a string is provided, a new vocabulary with that name is created if one does not already exist in the current space. |
tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (auto_
Data Examples
User age column in the Movielens dataset:
* Although the dataset already uses meaningful integers to represent each age range, these integers are not contiguous and thus cannot be used directly as indices.
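To see why non-contiguous codes make poor indices, consider the following rough sketch in plain Python. It uses the seven age codes that appear in the vocabulary comment of the Usage section; the variable names are illustrative, and the rank-based mapping is only one possible way to obtain contiguous indices, not necessarily what unitok does internally.

```python
# Age codes as listed in the vocabulary example of this page.
age_codes = [1, 18, 25, 35, 45, 50, 56]

# Using the raw code as an index would need a lookup table of max(code) + 1 rows,
# most of which would never be used.
rows_if_raw = max(age_codes) + 1

# Mapping each code to its rank gives contiguous indices 0..6 instead.
code_to_index = {code: idx for idx, code in enumerate(sorted(age_codes))}
rows_if_mapped = len(code_to_index)

print(rows_if_raw, rows_if_mapped)  # 57 7
print(code_to_index[35])            # 3
```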
News keywords column in the Adressa dataset:
[
{"keywords": ["Ferie", "Fritid", "Tr\u00f8ndelag", "Trondheim", "Skole"]},
{"keywords": ["Tour de ski", "Langrenn"]},
{"keywords": ["utenriks", "innenriks", "trondheim", "E6" , "midtbyen", "bybrann", "bilulykker"]}
]
* The original keywords column stores the keywords as a single string joined by the , symbol, so the SplitTokenizer is a better fit for the raw data. The pre-split lists here are used for illustration only.
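For reference, a comma-joined keyword string can be turned into the list form shown above with plain Python. This is only a sketch of the preprocessing idea, not the SplitTokenizer itself; `raw_keywords` is a hypothetical cell value built from the second record above.

```python
# Hypothetical raw cell: keywords joined by "," in a single string.
raw_keywords = "Tour de ski,Langrenn"

# Split on the separator, strip surrounding whitespace, and drop empty pieces.
keywords = [kw.strip() for kw in raw_keywords.split(",") if kw.strip()]

print(keywords)  # ['Tour de ski', 'Langrenn']
```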
Usage
from unitok import EntityTokenizer, Vocab
# the age index depends on the order of first appearance
age_tokenizer = EntityTokenizer(vocab='age')
print(age_tokenizer('35')) # 0
print(age_tokenizer(35)) # 0, the input is automatically converted to a string
print(age_tokenizer('1')) # 1
print(age_tokenizer('56')) # 2
# or declare the ages in the vocab first, so that indices follow the sorted order
age_list: list = ...
age_list = list(map(int, set(age_list)))
age_list.sort()
age_vocab = Vocab(name='ordered_age')
age_vocab.extend(age_list) # {0: '1', 1: '18', 2: '25', 3: '35', 4: '45', 5: '50', 6: '56'}
age_vocab.deny_edit() # assert no other age values
age_tokenizer = EntityTokenizer(vocab=age_vocab)
...
from unitok import EntitiesTokenizer
keyword_tokenizer = EntitiesTokenizer(vocab='keyword')
print(keyword_tokenizer(["ferie", "fritid", "tr\u00f8ndelag", "trondheim", "skole"])) # [0, 1, 2, 3, 4]
print(keyword_tokenizer(["tour de ski", "langrenn"])) # [5, 6]
print(keyword_tokenizer(["utenriks", "innenriks", "trondheim", "E6" , "midtbyen", "bybrann", "bilulykker"])) # [7, 8, 3, 9, 10, 11, 12]