UniTok
Download the required tiny dataset for demonstration!
Please download the UniTok-tokenized tiny dataset and unzip the file before running the code snippets.
The UniTok class is the main class of the UniTok library.
It is used to manage the tokenization process for a specific dataframe.
UniTok has three states: Initialized, Tokenized, and Organized.
- Initialized: The initial state after creating a UniTok instance.
- Tokenized: Achieved after applying tokenization to the dataframe or loading a tokenized dataframe.
- Organized: Reached after combining multiple dataframes via operations like union and filter.
- The disk state is not a real state but a representation of the saved tokenized dataframe.
- An organized UniTok can move back to the Tokenized state only by loading the saved dataframe from disk.
The following graph illustrates the transitions between these states:
graph LR
A[Initialized] --> |tokenize| B[Tokenized];
B --> |union/filter| C[Organized];
C --> |save| D[Disk];
B --> |save| D;
D --> |UniTok.load| B;
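For instance, a typical lifecycle walks through all of these states. The sketch below is illustrative only: the toy dataframe and the 'sample-ut/demo' directory are assumptions, while the methods used (add_job, tokenize, filter, save, load) are documented later in this page.
import pandas as pd
from unitok import UniTok, EntityTokenizer

df = pd.DataFrame({'nid': ['N1', 'N2', 'N3']})  # toy dataframe (assumption)

with UniTok() as ut:  # Initialized
    ut.add_job(tokenizer=EntityTokenizer(vocab='nid'), column='nid', key=True)
    ut.tokenize(df)  # Initialized -> Tokenized
    ut.filter(lambda x: x > 0, job='nid')  # Tokenized -> Organized
    ut.save('sample-ut/demo')  # Organized -> Disk

with UniTok.load('sample-ut/demo') as ut:  # Disk -> Tokenized
    print(ut.status)  # <tokenized>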
Constructor
CLASS
unitok.UniTok()
Initialize an empty UniTok object.
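For example:
from unitok import UniTok

ut = UniTok()  # an empty UniTok in the Initialized state
In practice, a UniTok is usually opened as a context manager (see __enter__ below), so that vocabularies, tokenizers, and jobs are registered in its space.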
Attributes
Attribute | Type | Description |
---|---|---|
data | Dict[str, list] | The tokenized dataframe. |
meta | unitok.Meta | The metadata of the UniTok object, including the vocabularies, tokenizers, and jobs. |
key_job | unitok.Job | The key job of the UniTok object. |
save_dir | str | The directory to save the tokenized dataframe. |
_legal_indices | List[int] | The legal indices of the tokenized dataframe. |
_legal_flags | List[bool] | The legal flags of the tokenized dataframe. |
_indices_is_init | bool | Whether the indices are initialized. |
_sample_size | int | The real sample size (including both legal and illegal samples) of the tokenized dataframe. |
_union_type | unitok.Symbol | The type of the union operation; can be either Symbols.soft or Symbols.hard. |
_soft_unions | Dict[str, Set[unitok.UniTok]] | The soft unions of the UniTok object. |
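After tokenization or loading, the public attributes can be inspected directly. A quick sketch on the sample item UniTok (the exact values depend on the dataset):
from unitok import UniTok

with UniTok.load('sample-ut/item') as ut:
    print(ut.data.keys())  # dict_keys(['nid', 'category', 'subcategory', 'title@bert', ...])
    print(ut.key_job.tokenizer.vocab[0])  # 'N88753', the key value of the first item
    print(ut.save_dir)  # the directory where this UniTok is saved / was loaded from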
Properties
PROPERTY
is_soft_union -> bool
Whether the UniTok object is a soft union.
from unitok import UniTok
with UniTok() as ut:
    print(ut.is_soft_union) # False, the initial _union_type is None
    ut.set_union_type(soft_union=True)
    print(ut.is_soft_union) # True
PROPERTY
is_hard_union -> bool
Whether the UniTok object is a hard union.
from unitok import UniTok
with UniTok() as ut:
    print(ut.is_hard_union) # False, the initial _union_type is None
    ut.set_union_type(soft_union=False)
    print(ut.is_hard_union) # True
PROPERTY
filepath -> str
The filepath of the tokenized dataframe.
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    print(ut.filepath) # 'sample-ut/item/data.pkl'
    # an initialized unitok has no filepath until it is saved
Magic Methods
METHOD
__enter__() -> unitok.UniTok
METHOD
__exit__() -> None
Enter or exit the context manager, i.e., the UniTok space.
The vocabularies, tokenizers, and jobs defined inside with UniTok() as ut: will be registered in the ut space.
The next time you want to pass vocabularies, tokenizers, or jobs to UniTok methods, you can pass vocabulary names, tokenizer IDs, and job names instead.
The UniTok context manager cannot be nested.
from unitok import UniTok, Vocab, EntityTokenizer
with UniTok() as ut:
    vocab = Vocab(name='fruits') # 'fruits' vocab has been registered in the ut space
    vocab.extend(['apple', 'banana', 'cherry'])
    fruit_tokenizer = EntityTokenizer(vocab='fruits') # same as EntityTokenizer(vocab=vocab)
    print(list(fruit_tokenizer.vocab)) # ['apple', 'banana', 'cherry']

with UniTok() as another_ut:
    another_fruit_tokenizer = EntityTokenizer(vocab='fruits') # will create a new vocabulary named 'fruits' in the another_ut space
    print(list(another_fruit_tokenizer.vocab)) # []

    with UniTok() as nested_ut: # ValueError: Space is already locked to another_ut
        ...
METHOD
__getitem__() -> dict
State Required
Tokenized Organized Initialized
Index an item, select its attributes (or jobs), and return the data in a dict format.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
index | N/A | Please refer to the example | 0 | The index or slice of the tokenized dataframe. |
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    # 1. Select all the attributes from the first item using item index
    print(ut[0])
    # {
    #   'category': 0,
    #   'abstract@bert': [4497, 1996, 14960, ...],
    #   'title@llama': [450, 1771, 4167, ...],
    #   'title@bert': [1996, 9639, 3035, ...],
    #   'abstract@llama': [1383, 459, 278, ...],
    #   'subcategory': 0,
    #   'nid': 0
    # }

    # 2. Select all the attributes from the first item using item key value
    print(ut.key_job.tokenizer.vocab[0]) # 'N88753', nid of the first item
    print(ut['N88753']) # same as ut[0]

    # 3. Select the 'title@bert' attribute from the first item
    print(ut[0, 'title@bert']) # {'title@bert': [1996, 9639, 3035, ...]}
    print(ut['N88753', 'title@bert']) # same as ut[0, 'title@bert']

    # 4. Select the attribute tokenized by the 'title@bert' job from the first item
    job = ut.meta.jobs['title@bert']
    print(ut[0, job]) # same as ut[0, 'title@bert']

    # 5. Select the 'title@bert' and 'abstract@bert' attributes from the first item
    print(ut[0, ('title@bert', 'abstract@bert')]) # {'title@bert': [1996, 9639, 3035, ...], 'abstract@bert': [4497, 1996, 14960, ...]}

    # 6. Select the attributes tokenized by a specific tokenizer from the first item
    tokenizer = job.tokenizer
    print(ut[0, tokenizer]) # same as ut[0, ('title@bert', 'abstract@bert')]
METHOD
__len__() -> int
Return the legal sample size of the tokenized dataframe.
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    print(len(ut)) # 10
    print(ut._sample_size) # 10
    ut.filter(lambda x: x % 2 == 0, job='nid') # filter the even indices
    print(len(ut)) # 5
    print(ut._sample_size) # 10
METHOD
__iter__() -> Iterator
Iterate over the samples at the legal indices of the tokenized dataframe.
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    for sample in ut:
        print(sample) # {'nid': ..., 'category': ..., 'title@bert': ..., ...}
Methods
METHOD
set_union_type() -> None
Set the union type of the UniTok object. The union type of each UniTok can only be set once.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
soft_union | N/A | bool | True | Whether the union type is soft or hard. |
from unitok import UniTok
with UniTok() as ut:
    print(ut.is_soft_union) # False, the initial _union_type is None
    ut.set_union_type(soft_union=True)
    print(ut.is_soft_union) # True
    ut.set_union_type(soft_union=False) # ValueError: Union type is already set
METHOD
init_indices() -> None
State Required
Tokenized Organized Initialized
Initialize the indices of the tokenized dataframe. Legal indices will be reset.
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut: # init_indices will be automatically called
    print(len(ut)) # 10
    ut.filter(lambda x: x % 2 == 0, job='nid') # filter the even indices
    print(len(ut)) # 5
    ut.init_indices()
    print(len(ut)) # 10
METHOD
load() -> None
State Transfer
Disk -> Tokenized
Load the tokenized dataframe from disk.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
save_dir | N/A | str | 'sample-ut/item' | The directory where the tokenized dataframe was saved. |
tokenizer_lib | None | str | None | When using custom tokenizers, set this to the directory containing the tokenizer files so that they can be reconstructed. If not set, unrecognized tokenizers will be set to UnknownTokenizer. |
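For example (the tokenizer_lib path in the comment is illustrative):
from unitok import UniTok

with UniTok.load('sample-ut/item') as ut:
    print(len(ut))  # 10

# when custom tokenizers were used, point tokenizer_lib to their files, e.g.:
# ut = UniTok.load('sample-ut/item', tokenizer_lib='path/to/tokenizers')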
METHOD
save() -> None
State Required
Tokenized Organized Initialized
Save the tokenized dataframe to disk.
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    ut.save('sample-ut/another-item')

with UniTok.load('sample-ut/another-item') as another_ut:
    ...
METHOD
add_job() -> None
State Required
Initialized Tokenized Organized
Add a tokenization job to the UniTok object.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
tokenizer | N/A | Union[unitok.Tokenizer, str] | bert_tokenizer | The tokenizer to tokenize the current column. If a string (tokenizer_id) is provided, the current space will be searched for the tokenizer. |
column | N/A | str | 'title' | The column name to tokenize in the dataframe. |
name | None | Optional[str] | 'title@bert' | The tokenized column name. If not provided, the column name will be used. It should be unique in the current UniTok space. |
truncate | None | Optional[int] | 50 | The maximum length of the tokenized sequence. When the current tokenizer does not return a list, set it to None. To keep the full sequence, set it to 0. To truncate from the end, set it to a negative value. |
key | False | bool | False | Whether to use the tokenized column as the key/primary column. The key column will be used as the index of the data. |
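A minimal sketch of adding jobs before tokenization (the columns mirror the tokenize() example below):
from unitok import UniTok, BertTokenizer, EntityTokenizer

with UniTok() as ut:
    bert_tokenizer = BertTokenizer(vocab='bert')
    ut.add_job(tokenizer=EntityTokenizer(vocab='item_id'), column='nid', key=True)  # key/primary column
    ut.add_job(tokenizer=bert_tokenizer, column='title', name='title@bert', truncate=20)
    ut.add_job(tokenizer=bert_tokenizer, column='abstract', name='abstract@bert', truncate=50)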
METHOD
tokenize() -> self
State Required
Initialized Tokenized Organized
State Transfer
Initialized -> Tokenized
Tokenize the dataframe based on the added jobs. The key job should be set before tokenization.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
df | N/A | pd.DataFrame | data | The dataframe to tokenize. |
import pandas as pd
from unitok import UniTok, BertTokenizer, TransformersTokenizer, EntityTokenizer
item = pd.read_csv(
    filepath_or_buffer='news-sample.tsv',
    sep='\t',
    names=['nid', 'category', 'subcategory', 'title', 'abstract', 'url', 't_ent', 'a_ent'],
    usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('') # Handle missing values

with UniTok() as item_ut:
    bert_tokenizer = BertTokenizer(vocab='bert')
    llama_tokenizer = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

    item_ut.add_job(tokenizer=EntityTokenizer(vocab='item_id'), column='nid', key=True)
    item_ut.add_job(tokenizer=bert_tokenizer, column='title', name='title@bert', truncate=20)
    item_ut.add_job(tokenizer=llama_tokenizer, column='title', name='title@llama', truncate=20)
    item_ut.add_job(tokenizer=bert_tokenizer, column='abstract', name='abstract@bert', truncate=50)
    item_ut.add_job(tokenizer=llama_tokenizer, column='abstract', name='abstract@llama', truncate=50)
    item_ut.add_job(tokenizer=EntityTokenizer(vocab='category'), column='category')
    item_ut.add_job(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')

    item_ut.tokenize(item)
    print(item_ut.status) # <tokenized>
METHOD
union() -> None
State Required
Tokenized Organized Initialized
State Transfer
Tokenized -> Organized
Soft Union and Hard Union
- Soft Union: Fast. The union UniToks will be saved in the _soft_unions attribute. When retrieving data via ut[index], the data will be combined on the fly; ut.data will not be changed.
- Hard Union: Slow. ut.data will be expanded to include the union data.

Both soft and hard unions will update ut.meta.vocabularies. All the tokenizers from the other UniTok will be switched into the UnionTokenizer class, and all the union jobs will be tagged with job.from_union = True.
However, there is no difference between the two union types after saving the UniTok and loading it again.
Union the current UniTok object with another UniTok object.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
other | N/A | unitok.UniTok | other_ut | The other UniTok object to union with the current UniTok object. |
soft_union | True | bool | True | Whether to perform a soft union. |
union_key | None | str | None | The column in the current UniTok that is used as the index (key_job) in the other UniTok. If not set, the key job of the other UniTok will be used. |
from unitok import UniTok
user_ut = UniTok.load('sample-ut/user')
inter_ut = UniTok.load('sample-ut/interaction')
# => {'uid': 0, 'history': [0, 1, 2]}
print(user_ut[0])
# => {'uid': 0, 'nid': 7, 'index': 0, 'click': 1}
print(inter_ut[0])
with inter_ut:
    inter_ut.union(user_ut)
    # => {'index': 0, 'click': 1, 'uid': 0, 'nid': 7, 'history': [0, 1, 2]}
    print(inter_ut[0])
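A hard union works the same way but expands ut.data in place. Since the union type can only be set once per UniTok, the sketch below is an alternative to the soft union above, not a follow-up:
from unitok import UniTok

user_ut = UniTok.load('sample-ut/user')
inter_ut = UniTok.load('sample-ut/interaction')

with inter_ut:
    inter_ut.union(user_ut, soft_union=False)  # hard union: the user columns are copied into inter_ut.data
    print(inter_ut[0])  # same combined sample as with a soft union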
METHOD
summarize() -> None
State Required
Tokenized Organized Initialized
Summarize the tokenized dataframe.
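For example, calling it on the sample item UniTok:
from unitok import UniTok

with UniTok.load('sample-ut/item') as ut:
    ut.summarize()
which produces a summary like the following: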
Tokenizer | Tokenizer ID | Column Mapping | Vocab | Max Length |
---|---|---|---|---|
TransformersTokenizer | auto_2VN5Ko | abstract -> abstract@llama | llama (size=32024) | 50 |
EntityTokenizer | auto_C0b9Du | subcategory -> subcategory | subcategory (size=8) | N/A |
TransformersTokenizer | auto_2VN5Ko | title -> title@llama | llama (size=32024) | 20 |
EntityTokenizer | auto_4WQYxo | category -> category | category (size=4) | N/A |
BertTokenizer | auto_Y9tADT | abstract -> abstract@bert | bert (size=30522) | 46 |
BertTokenizer | auto_Y9tADT | title -> title@bert | bert (size=30522) | 16 |
EntityTokenizer | auto_qwQALc | nid -> nid | nid (size=10) | N/A |
METHOD
pack() -> dict
State Required
Tokenized Organized Initialized
Pack a single sample of the tokenized dataframe into a dictionary, addressed by its absolute index.
Comparison with ut[index]
The index of the pack method is the absolute index: ut._sample_size > index >= 0.
The index of the ut[index] method is the legal index: len(ut) > index >= 0.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
index | N/A | int | 0 | The absolute index of the tokenized dataframe. |
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    ut.filter(lambda x: x % 2 == 0, job='nid') # filter the even indices

    print(ut[1])
    # {
    #   'abstract@bert': [2122, 9428, 19741, ...],
    #   'subcategory': 2,
    #   'title@llama': [29871, 29945, 29900, ...],
    #   'nid': 2,
    #   'abstract@llama': [4525, 2833, 11687, ...],
    #   'category': 2,
    #   'title@bert': [2753, 5409, 14243, ...]
    # }

    print(ut.pack(1))
    # {
    #   'abstract@bert': [6207, 1005, 1055, ...],
    #   'subcategory': 1,
    #   'title@llama': [5260, 28402, 14866, ...],
    #   'nid': 1,
    #   'abstract@llama': [12113, 29915, 29879, ...],
    #   'category': 1,
    #   'title@bert': [24547, 22345, 18296, ...]
    # }
METHOD
select() -> dict
State Required
Tokenized Organized Initialized
Select the attributes (or jobs) of the given sample, and return the data in a dict format.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
sample | N/A | dict | sample | The sample to select the attributes from. |
selector | N/A | Please refer to the example | ('nid', 'title@bert') | The attributes to select. |
* More usage examples can be found in the __getitem__ method.
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    sample = ut[0]
    print(ut.select(sample, ('nid', 'title@bert')))
    # {'nid': 0, 'title@bert': [1996, 9639, 3035, ...]}
METHOD
filter() -> None
State Required
Tokenized Organized Initialized
State Transfer
Tokenized -> Organized
Filter the tokenized dataframe based on the condition.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
func | N/A | Any | lambda x: x % 2 == 0 | The filter function. The function should return True to keep the sample, and False to remove it. |
job | None | str | 'nid' | The column name to apply the filter to. If not set, the input of the filter function will be the whole sample dictionary. |
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    print(len(ut)) # 10
    ut.filter(lambda x: x % 2 == 0, job='nid') # filter the even indices
    print(len(ut)) # 5

    ut.init_indices()
    print(len(ut)) # 10
    ut.filter(lambda x: x['nid'] <= 2) # filter the nids less than or equal to 2
    print(len(ut)) # 3
METHOD
retruncate() -> None
State Required
Tokenized Organized Initialized
Re-truncate the tokenized sequences based on the new truncate value.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
job | N/A | Union[str, unitok.Job] | 'title@bert' | The job name or job object to re-truncate. |
truncate | N/A | int | 10 | The new truncate value. |
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    print(ut[0, 'title@bert']) # {'title@bert': [1996, 9639, 3035, ...]}
    ut.retruncate('title@bert', 2)
    print(ut[0, 'title@bert']) # {'title@bert': [1996, 9639]}
METHOD
remove_job() -> None
Remove a job from the UniTok object. The corresponding tokenized column is also removed from the data.
Parameter | Default | Type | Example | Description |
---|---|---|---|---|
job | N/A | Union[str, Job] | 'title@bert' | The job name or job object to remove. |
from unitok import UniTok
with UniTok.load('sample-ut/item') as ut:
    print(ut.data.keys()) # dict_keys(['nid', 'category', 'subcategory', 'title@bert', 'abstract@llama', 'title@llama', 'abstract@bert'])
    ut.remove_job('title@bert')
    print(ut.data.keys()) # dict_keys(['nid', 'category', 'subcategory', 'abstract@llama', 'title@llama', 'abstract@bert'])