DigitTokenizer Series
DigitTokenizer and DigitsTokenizer are designed for objects (or lists of objects) that are already indexed. DigitTokenizer accepts a single integer and returns it unchanged, while DigitsTokenizer accepts a list of integers and returns the list unchanged.
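As a rough mental model, the two classes behave like identity functions over integers. The sketch below is plain Python, not the library's implementation; digit_tokenize and digits_tokenize are hypothetical helpers written only to illustrate the single-value versus list contract described above.

```python
from typing import List


def digit_tokenize(value: int) -> int:
    """Hypothetical stand-in for DigitTokenizer: the integer is its own index."""
    if not isinstance(value, int):
        raise TypeError(f"expected an int, got {type(value).__name__}")
    return value


def digits_tokenize(values: List[int]) -> List[int]:
    """Hypothetical stand-in for DigitsTokenizer: same rule, applied per element."""
    return [digit_tokenize(v) for v in values]


print(digit_tokenize(3))           # 3
print(digits_tokenize([4, 0, 2]))  # [4, 0, 2]
```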
Constructor
CLASS
unitok.DigitTokenizer()
CLASS
unitok.DigitsTokenizer()
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
vocab | N/A | Union[unitok.Vocab, str] | tokens | The vocabulary to use. If a string is provided, a new vocabulary is created unless one with that name already exists in the current space. |
tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (auto_ |
vocab_size | None | Optional[int] | 2 | Limits the vocabulary range to [0, vocab_size-1] if provided. |
Data Examples
The click label column found in many recommendation datasets:
[
{"user_id": "U37448", "item_id": "I7223", "click": 0},
{"user_id": "U95112", "item_id": "I6345", "click": 1},
{"user_id": "U23945", "item_id": "I5496", "click": 1},
{"user_id": "U98348", "item_id": "I1232", "click": 0},
{"user_id": "U56502", "item_id": "I9328", "click": 1}
]
user_id and item_id can be tokenized by the EntityTokenizer class.
The user occupation column in the MovieLens dataset:
0: "other" or not specified
1: "academic/educator"
2: "artist"
3: "clerical/admin"
4: "college/grad student"
5: "customer service"
6: "doctor/health care"
7: "executive/managerial"
8: "farmer"
9: "homemaker"
10: "K-12 student"
11: "lawyer"
12: "programmer"
13: "retired"
14: "sales/marketing"
15: "scientist"
16: "self-employed"
17: "technician/engineer"
18: "tradesman/craftsman"
19: "unemployed"
20: "writer"
Unlike the user age column, which needs EntityTokenizer to map its integers to indices, the occupation values are contiguous and can serve directly as indices.
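To make the contrast concrete, the sketch below uses plain Python rather than unitok. The age codes are the ones published with the MovieLens-1M release (an assumption about which MovieLens release is meant), and build_index is a hypothetical helper standing in for the kind of value-to-index mapping an entity-style tokenizer performs; the library's actual index assignment may differ.

```python
from typing import Dict, List


def build_index(values: List[int]) -> Dict[int, int]:
    """Hypothetical helper: assign consecutive indices in order of first appearance."""
    index: Dict[int, int] = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return index


# Age codes are sparse, so they need a lookup table before they can act as indices.
age_codes = [1, 18, 25, 35, 45, 50, 56]
age_index = build_index(age_codes)
print(age_index[25])                       # 2
print(max(age_codes) + 1, len(age_index))  # 57 7

# Occupation codes already cover 0..20 with no gaps, so each value is its own index.
occupation_codes = list(range(21))
print(build_index(occupation_codes) == {v: v for v in occupation_codes})  # True
```

Using the raw age codes directly would reserve 57 index slots for only 7 distinct values, which is why a mapping step is worthwhile there and unnecessary for occupations.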
Usage
from unitok import DigitTokenizer
label_tokenizer = DigitTokenizer(vocab='label')
label_tokenizer(1) # 1
label_tokenizer(0) # 0
label_tokenizer(2) # 2; the vocabulary is extended when an unexpected value appears
occupation_tokenizer = DigitTokenizer(vocab='occupation', vocab_size=21)
occupation_tokenizer(10) # 10
occupation_tokenizer(5) # 5
occupation_tokenizer(30) # raises ValueError: 30 is outside [0, 20]
# The vocabulary is created in full by the constructor, so its size equals vocab_size
# even though no value larger than 10 was tokenized above.
print(len(occupation_tokenizer.vocab)) # 21
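The two behaviours in the usage above (open-ended growth versus a fixed range) can be modelled in a few lines of plain Python. This is only a sketch under stated assumptions, not unitok code: SketchDigitTokenizer and its size attribute are hypothetical, growth is assumed to extend the range up to the largest value seen, and rejecting negative values is likewise an assumption.

```python
from typing import Optional


class SketchDigitTokenizer:
    """Hypothetical model of the documented vocab_size behaviour; not unitok code."""

    def __init__(self, vocab_size: Optional[int] = None):
        self.vocab_size = vocab_size
        # With a fixed size, every index in [0, vocab_size - 1] exists up front.
        self.size = vocab_size if vocab_size is not None else 0

    def __call__(self, value: int) -> int:
        if value < 0:
            # Assumption: negative integers are never valid indices.
            raise ValueError(f"{value} is negative")
        if self.vocab_size is not None:
            if value >= self.vocab_size:
                raise ValueError(f"{value} is outside [0, {self.vocab_size - 1}]")
            return value
        # Assumption: without a limit, the range grows so that `value` is a valid index.
        self.size = max(self.size, value + 1)
        return value


label = SketchDigitTokenizer()
print(label(1), label(0), label(2))  # 1 0 2
print(label.size)                    # 3

occupation = SketchDigitTokenizer(vocab_size=21)
print(occupation(10), occupation.size)  # 10 21
try:
    occupation(30)
except ValueError as err:
    print(err)  # 30 is outside [0, 20]
```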