# UniTok in CLI
> **Download the required tiny dataset for demonstration!**
> Please download the UniTok-tokenized tiny dataset and unzip it before running the code snippets.
UniTok also provides a `unitok` command-line interface (CLI) for interacting with the library from the shell. You can use `unitok` to get an overview of a tokenized data directory, add new tokenization jobs, or remove existing jobs.

`unitok <path>` is the main command for interacting with a tokenized data directory, where the `<path>` argument is the path to that directory.
## Action: Summarize
To get an overview of the tokenized data directory, run `unitok <path>` without any additional arguments. It is identical to `unitok <path> --action summarize`.
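For example, assuming the tiny dataset was unzipped to `sample-ut/item` (the path used throughout the examples below), the following command prints the summary table shown next:

```bash
unitok sample-ut/item
```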
| Tokenizer | Tokenizer ID | Column Mapping | Vocab | Max Length |
|---|---|---|---|---|
| TransformersTokenizer | auto_2VN5Ko | abstract -> abstract@llama | llama (size=32024) | 50 |
| EntityTokenizer | auto_C0b9Du | subcategory -> subcategory | subcategory (size=8) | N/A |
| TransformersTokenizer | auto_2VN5Ko | title -> title@llama | llama (size=32024) | 20 |
| EntityTokenizer | auto_4WQYxo | category -> category | category (size=4) | N/A |
| BertTokenizer | auto_Y9tADT | abstract -> abstract@bert | bert (size=30522) | 46 |
| BertTokenizer | auto_Y9tADT | title -> title@bert | bert (size=30522) | 16 |
| EntityTokenizer | auto_qwQALc | nid -> nid | nid (size=10) | N/A |
## Action: Integrate
You can directly define a new job to tokenize the original data and integrate it into the tokenized data directory.
| Argument | Default | Type | Description |
|---|---|---|---|
| `--file` / `-f` | N/A | str | The path to the original data file, in `.parquet` format, or in `.csv`/`.tsv` format with `\t` delimiter. |
| `--column` / `-c` | N/A | str | The column name to tokenize. |
| `--name` / `-n` | N/A | str | The tokenized column name. |
| `--vocab` / `-v` | None | str | The vocabulary name. |
| `--tokenizer` / `-t` | None | str | The tokenizer classname. |
| `--tokenizer_id` | None | str | The tokenizer ID. |
| `--truncate` | None | int | The maximum length of the tokenized sequence. |
| `--lib` | None | str | The directory path of the custom tokenizers. |
### Example: Integrate with Built-in Tokenizers
```bash
unitok \
    sample-ut/item \
    --file item.parquet \
    --column title \
    --name title@llama2 \
    --vocab llama2 \
    --tokenizer transformers \
    --tokenizer.key meta-llama/llama-2-7b-hf \
    --tokenizer_id llama2 \
    --truncate 20 \
    --action integrate
```
```bash
# Passing --tokenizer_id llama2 reuses the tokenizer created above,
# so --tokenizer and --tokenizer.key are not needed again.
unitok \
    sample-ut/item \
    --file item.parquet \
    --column abstract \
    --name abstract@llama2 \
    --vocab llama2 \
    --tokenizer_id llama2 \
    --truncate 50 \
    --action integrate
```
### Example: Integrate with Custom Tokenizers
`tokenizers/CounterTokenizer.py`:

```python
import string

from unitok import DigitTokenizer


class CounterTokenizer(DigitTokenizer):
    def __call__(self, obj: str):
        obj = obj.lower()
        # count how many distinct ASCII letters appear in the string,
        # then tokenize that count as a digit
        count = 0
        for c in string.ascii_lowercase:
            count += c in obj
        return super().__call__(count)
```
Then run `unitok` with `--lib tokenizers` pointing to the directory that contains the custom tokenizer, so the `CounterTokenizer` class can be resolved by name:

```bash
unitok \
    sample-ut/item \
    --file item.parquet \
    --column title \
    --name title@counter \
    --vocab counter \
    --tokenizer CounterTokenizer \
    --tokenizer_id counter \
    --action integrate \
    --lib tokenizers
```
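Once the job completes, running `unitok sample-ut/item` again should list the new `title@counter` job in the summary table.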
## Action: Remove
You can remove an existing job from the tokenized data directory by specifying the column name.
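A minimal sketch of the command, assuming the remove action identifies the job via the `--name` argument documented above (check `unitok --help` for the exact flags):

```bash
# hypothetical invocation: --name is assumed here to select the job to remove
unitok \
    sample-ut/item \
    --name title@counter \
    --action remove
```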