TextTokenizer Series

Text tokenizers are designed to tokenize a single string object, the most common type of data in text processing. Specifically, GloVeTokenizer and TransformersTokenizer are the two built-in tokenizers for text tokenization, and BertTokenizer is a specialized TransformersTokenizer with its key fixed to bert-base-uncased.
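
For example, the following two constructions should be interchangeable (a minimal sketch; the vocabulary name 'bert' is taken from the Usage section below):

from unitok import TransformersTokenizer, BertTokenizer

# BertTokenizer fixes the Hugging Face model key to 'bert-base-uncased' ...
bert_tokenizer = BertTokenizer(vocab='bert')

# ... so it should behave like this explicit construction
bert_tokenizer = TransformersTokenizer(vocab='bert', key='bert-base-uncased')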

Constructor

CLASS unitok.GloVeTokenizer()

| Parameter | Default | Type | Example | Description |
|---|---|---|---|---|
| vocab | N/A | unitok.Vocab | glove | The vocabulary to use. It cannot be a string here; a pre-defined GloVe vocab is required. |
| tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (prefixed with auto_) will be generated. It must be unique within each space. |
| language | 'english' | str | 'english' | The language passed to the nltk.tokenize.word_tokenize method. |

Please refer to the FAQ for GloVe vocabulary construction.
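
A minimal construction sketch using the parameters above; glove_vocab stands for a pre-built GloVe vocabulary, and the tokenizer_id value 'tkn-glove' is an illustrative assumption:

from unitok import GloVeTokenizer

glove_vocab = ...  # a pre-defined GloVe vocab; see the FAQ for construction
glove_tokenizer = GloVeTokenizer(
    vocab=glove_vocab,         # must be a unitok.Vocab, not a string
    tokenizer_id='tkn-glove',  # omit to get an auto-generated ID
    language='english',        # passed to nltk.tokenize.word_tokenize
)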

CLASS unitok.TransformersTokenizer()

| Parameter | Default | Type | Example | Description |
|---|---|---|---|---|
| vocab | N/A | Union[unitok.Vocab, str] | tokens | The vocabulary to use. If a string is provided, a new vocabulary will be created if one does not already exist in the current space. |
| tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (prefixed with auto_) will be generated. It must be unique within each space. |
| key | N/A | str | 'bert-base-uncased' | The model key on Hugging Face. |
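
A minimal sketch using a string vocab; the vocabulary name and model key are taken from the Usage section below:

from unitok import TransformersTokenizer

# Passing a string: a vocabulary named 'llama' is created in the current
# space if it does not already exist
llama_tokenizer = TransformersTokenizer(
    vocab='llama',
    key='huggyllama/llama-7b',  # model key on Hugging Face
)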

CLASS unitok.BertTokenizer()

| Parameter | Default | Type | Example | Description |
|---|---|---|---|---|
| vocab | N/A | Union[unitok.Vocab, str] | tokens | The vocabulary to use. If a string is provided, a new vocabulary will be created if one does not already exist in the current space. |
| tokenizer_id | None | Optional[str] | 'tkn-abcd' | The ID of the tokenizer. If not provided, a random ID (prefixed with auto_) will be generated. It must be unique within each space. |
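
Since the model key is fixed, only the vocabulary (and optionally a tokenizer ID) needs to be supplied; a minimal sketch, where 'tkn-bert' is an illustrative ID:

from unitok import BertTokenizer

# Equivalent to a TransformersTokenizer with key='bert-base-uncased'
bert_tokenizer = BertTokenizer(vocab='bert', tokenizer_id='tkn-bert')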

Data Examples

News titles and abstracts in the MIND dataset

[
  {"title": "The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By", "abstract": "Shop the notebooks, jackets, and more that the royals can't live without."},
  {"title": "Walmart Slashes Prices on Last-Generation iPads", "abstract": "Apple's new iPad releases bring big deals on last year's models."},
  {"title": "50 Worst Habits For Belly Fat", "abstract": "hese seemingly harmless habits are holding you back and keeping you from shedding that unwanted belly fat for good."}
]
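
Since a text tokenizer handles one string at a time, tokenizing such records means applying the tokenizer to each text field separately. A minimal sketch, reusing the BertTokenizer construction from the Usage section below on one of the records above:

from unitok import BertTokenizer

bert_tokenizer = BertTokenizer(vocab='bert')

samples = [
    {"title": "Walmart Slashes Prices on Last-Generation iPads",
     "abstract": "Apple's new iPad releases bring big deals on last year's models."},
]

# Apply the tokenizer field by field, one string per call
for sample in samples:
    tokenized_title = bert_tokenizer(sample['title'])
    tokenized_abstract = bert_tokenizer(sample['abstract'])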

Usage

from unitok import GloVeTokenizer, TransformersTokenizer, BertTokenizer

glove_vocab = ...  # a pre-defined GloVe vocab, please refer to the FAQ for construction
glove_tokenizer = GloVeTokenizer(vocab=glove_vocab)
bert_tokenizer = BertTokenizer(vocab='bert')
llama1_tokenizer = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

print(glove_tokenizer('The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'))
print(bert_tokenizer('The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'))
print(llama1_tokenizer('The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By'))