SplitTokenizer

SplitTokenizer is designed for tokenizing a string that can be split into atomic parts by a delimiter.
Constructor

CLASS unitok.SplitTokenizer()
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
vocab | N/A | Union[unitok.Vocab, str] | tokens | The vocabulary to use. If a string is provided, a new vocabulary will be created if one does not already exist in the current space. |
tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (`auto_...`) will be generated. |
sep | N/A | str | ',' | The delimiter used to split the string. |
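For illustration, here is a minimal sketch of constructing the tokenizer with these parameters. The import path follows the class path above; the vocabulary name `tokens` and the tokenizer ID `tkn-abcd` are just the example values from the table, not requirements.

```python
from unitok import SplitTokenizer  # class path as documented above

# Split comma-separated strings; "tokens" names (or creates) the vocabulary.
tokenizer = SplitTokenizer(vocab='tokens', sep=',')

# The tokenizer ID can also be fixed explicitly instead of using a generated one.
named_tokenizer = SplitTokenizer(vocab='tokens', tokenizer_id='tkn-abcd', sep=',')
```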
Data Examples

User history in the MIND dataset, where each entry is a space-separated list of news IDs:
```json
[
    {"history": "N106403 N71977 N97080 N102132 N97212 N121652"},
    {"history": "N97080 N192384 N43572 N7324 N71977"}
]
```
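A space is the natural delimiter for this field, so `sep=' '` would be used. The sketch below is illustrative only: the vocabulary name `nid` is an assumption, and it assumes the tokenizer can be called directly on a single field value; in a full UniTok pipeline the tokenizer would normally be registered against the `history` column instead.

```python
from unitok import SplitTokenizer

# Assumed vocabulary name "nid" for news IDs; any name would do.
history_tokenizer = SplitTokenizer(vocab='nid', sep=' ')

row = {"history": "N106403 N71977 N97080 N102132 N97212 N121652"}

# Assumption: the tokenizer is callable on a raw field value and yields
# one vocabulary index per space-separated news ID.
indices = history_tokenizer(row["history"])
print(indices)
```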