SplitTokenizer

SplitTokenizer tokenizes a string that can be split into several atomic parts by a delimiter.

Constructor

CLASS unitok.SplitTokenizer()

| Parameter | Default | Type | Example | Description |
|---|---|---|---|---|
| vocab | N/A | Union[unitok.Vocab, str] | tokens | The vocabulary to use. If a string is provided, a new vocabulary with that name is created if one does not already exist in the current space. |
| tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (auto_) is generated. It must be unique within each space. |
| sep | N/A | str | ',' | The delimiter used to split the input string. |

Data Examples

User history in the MIND dataset

[
  {"history": "N106403 N71977 N97080 N102132 N97212 N121652"},
  {"history": "N97080 N192384 N43572 N7324 N71977"}
]

Usage

from unitok import SplitTokenizer

history_tokenizer = SplitTokenizer(vocab='item_id', sep=' ')
history_tokenizer("N106403 N71977 N97080 N102132 N97212 N121652")  # [0, 1, 2, 3, 4, 5]
history_tokenizer("N97080 N192384 N43572 N7324 N71977")  # [2, 6, 7, 8, 1]
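
To make the indices in the comments above easier to follow, here is a minimal sketch of the split-then-look-up idea in plain Python. It is not the unitok implementation: `ToySplitTokenizer` and its `token_to_id` dictionary are hypothetical names used only for illustration, and the sketch assumes the vocabulary starts empty and assigns the next free index to each unseen part.

```python
class ToySplitTokenizer:
    """Hypothetical stand-in for illustration; not part of unitok."""

    def __init__(self, sep):
        self.sep = sep
        # Assumption: the vocabulary starts empty and grows on first sight of a part.
        self.token_to_id = {}

    def __call__(self, text):
        ids = []
        for part in text.split(self.sep):
            if part not in self.token_to_id:
                # Assign the next free index to a part not seen before.
                self.token_to_id[part] = len(self.token_to_id)
            ids.append(self.token_to_id[part])
        return ids


toy = ToySplitTokenizer(sep=' ')
first = toy("N106403 N71977 N97080 N102132 N97212 N121652")
second = toy("N97080 N192384 N43572 N7324 N71977")
print(first)   # [0, 1, 2, 3, 4, 5]
print(second)  # [2, 6, 7, 8, 1]
```

Because both calls share one vocabulary, items that recur in the second history (N97080 and N71977) reuse the indices they received in the first, while unseen items get new ones.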