UniTok: A Machine Learning Toolkit for Tokenization and Indexing
Welcome to the UniTok v4 Handbook! This library provides a unified preprocessing solution for machine learning datasets, handling diverse data types like text, categorical features, and numerical values.
To install the latest version, use the following command:
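Assuming the package is published on PyPI under the project's own name, `unitok` (an assumption, not confirmed by this page), the install command would typically be:

```shell
# Assumes the PyPI distribution name is "unitok"; adjust if the project publishes under another name.
pip install unitok
```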
More installation options are described in the handbook at https://unitok.qijiong.work.
A quick example of using UniTok is available in the handbook.
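To make the idea of "tokenization and indexing" concrete, here is a minimal plain-Python sketch of the concept. It does not use UniTok's API: `Vocab`, `preprocess`, and the sample rows are hypothetical names and data introduced only for illustration. It shows how text, categorical, and numerical columns might each be turned into model-ready values.

```python
from dataclasses import dataclass, field


@dataclass
class Vocab:
    """Hypothetical helper: maps each distinct token to a stable integer index."""
    token_to_index: dict = field(default_factory=dict)

    def index(self, token):
        # Assign the next free index the first time a token is seen.
        if token not in self.token_to_index:
            self.token_to_index[token] = len(self.token_to_index)
        return self.token_to_index[token]


def preprocess(rows):
    """Turn raw rows into index-based features (illustrative only, not UniTok's API)."""
    word_vocab, category_vocab = Vocab(), Vocab()
    processed = []
    for row in rows:
        processed.append({
            # Text: split on whitespace, then index each lowercased word.
            "title": [word_vocab.index(w) for w in row["title"].lower().split()],
            # Categorical: one index per distinct label.
            "category": category_vocab.index(row["category"]),
            # Numerical: kept as a float, no vocabulary needed.
            "price": float(row["price"]),
        })
    return processed, word_vocab, category_vocab


rows = [
    {"title": "Red running shoes", "category": "sports", "price": "59.9"},
    {"title": "Blue running jacket", "category": "sports", "price": "120"},
    {"title": "Red wine glass", "category": "kitchen", "price": "15"},
]
processed, word_vocab, category_vocab = preprocess(rows)
print(processed[1])
# → {'title': [3, 1, 4], 'category': 0, 'price': 120.0}
```

Each column type gets its own treatment here: text becomes a list of word indices, a categorical label becomes a single index, and a number passes through unchanged. A toolkit like UniTok aims to unify these steps behind one interface; the handbook describes the actual API.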
Call for Contribution
We welcome contributions to UniTok! We appreciate your feedback, bug reports, and pull requests.
Citation
If you use UniTok in your research, please cite the following paper:
@online{unitok,
  author = {Jyonn},
  title  = {UniTok v4: A Machine Learning Toolkit for Tokenization and Indexing},
  year   = {2025},
  url    = {https://unitok.qijiong.work}
}
License
This project is licensed under the MIT License.