
UniTok: A Machine Learning Toolkit for Tokenization and Indexing

Welcome to the UniTok v4 Handbook! This library provides a unified preprocessing solution for machine learning datasets, handling diverse data types like text, categorical features, and numerical values.
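To make "tokenization and indexing" concrete, the sketch below shows the general idea in plain Python: each text or categorical value is mapped to integer indices through a vocabulary. It is an illustration only and does not use UniTok's API; the `Vocab` class, the helper functions, and the sample rows are all hypothetical names invented for this example.

```python
# Illustrative sketch in plain Python. This is NOT UniTok's API;
# Vocab, tokenize_text, and tokenize_category are hypothetical helpers.


class Vocab:
    """Maps each distinct token to a stable integer index."""

    def __init__(self):
        self.token_to_index = {}

    def index(self, token):
        # Assign the next free index the first time a token is seen.
        if token not in self.token_to_index:
            self.token_to_index[token] = len(self.token_to_index)
        return self.token_to_index[token]


def tokenize_text(text, vocab):
    # Lowercase, split on whitespace, and index every word.
    return [vocab.index(word) for word in text.lower().split()]


def tokenize_category(value, vocab):
    # A categorical feature maps to a single index.
    return vocab.index(value)


# Made-up sample rows mixing a text column and a categorical column.
rows = [
    {"title": "Deep learning for text", "category": "ml"},
    {"title": "Learning to rank", "category": "ir"},
]

word_vocab, category_vocab = Vocab(), Vocab()
indexed = [
    {
        "title": tokenize_text(row["title"], word_vocab),
        "category": tokenize_category(row["category"], category_vocab),
    }
    for row in rows
]

print(indexed)
```

Each column gets its own vocabulary here, so the word "learning" shares one index across both titles while category indices stay independent of word indices.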


To install the latest version, use the following command:

pip install unitok

More installation options are described in the installation section of this handbook.

A quick example of using UniTok is provided elsewhere in this handbook.


Call for Contribution

We welcome contributions to UniTok! We appreciate your feedback, bug reports, and pull requests.

Citation

If you use UniTok in your research, please cite it as follows:

@online{unitok,
  author = {Jyonn},
  title = {UniTok v4: A Machine Learning Toolkit for Tokenization and Indexing},
  year = 2025,
  url = {https://unitok.qijiong.work}
}

License

This project is licensed under the MIT License.