Quick Start
Data Preparation
Base on the MIND dataset, we construct a tiny dataset for demonstration purposes, including
- news-sample.tsv, a sample of news articles.
- user-sample.tsv, a sample of user profiles.
- interaction-sample.tsv, a sample of user-item interactions.
- sample-ut.zip, the tokenized data using UniTok. (Please unzip it before using)
news-sample.tsv
nid | category | subcategory | title | abstract | url | t_ent | a_ent |
---|---|---|---|---|---|---|---|
N88753 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By | Shop the notebooks, jackets, and more that the royals can't live without. | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}] | [] |
N45436 | news | newsscienceandtechnology | Walmart Slashes Prices on Last-Generation iPads | Apple's new iPad releases bring big deals on last year's models. | https://assets.msn.com/labs/mind/AABmf2I.html | [{"Label": "IPad", "Type": "J", "WikidataId": "Q2796", "Confidence": 0.999, "OccurrenceOffsets": [42], "SurfaceForms": ["iPads"]}] | [{"Label": "IPad", "Type": "J", "WikidataId": "Q2796", "Confidence": 0.999, "OccurrenceOffsets": [12], "SurfaceForms": ["iPad"]}] |
N23144 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding you back and keeping you from shedding that unwanted belly fat for good. | https://assets.msn.com/labs/mind/AAB19MK.html | [{"Label": "Adipose tissue", "Type": "C", "WikidataId": "Q193583", "Confidence": 1.0, "OccurrenceOffsets": [20], "SurfaceForms": ["Belly Fat"]}] | [{"Label": "Adipose tissue", "Type": "C", "WikidataId": "Q193583", "Confidence": 1.0, "OccurrenceOffsets": [97], "SurfaceForms": ["belly fat"]}] |
N86255 | health | medical | Dispose of unwanted prescription drugs during the DEA's Take Back Day | https://assets.msn.com/labs/mind/AAISxPN.html | [{"Label": "Drug Enforcement Administration", "Type": "O", "WikidataId": "Q622899", "Confidence": 0.992, "OccurrenceOffsets": [50], "SurfaceForms": ["DEA"]}] | [] | |
N93187 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches of Ukraine's War | Lt. Ivan Molchanets peeked over a parapet of sand bags at the front line of the war in Ukraine. Next to him was an empty helmet propped up to trick snipers, already perforated with multiple holes. | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{"Label": "Ukraine", "Type": "G", "WikidataId": "Q212", "Confidence": 0.946, "OccurrenceOffsets": [87], "SurfaceForms": ["Ukraine"]}] |
N75236 | health | voices | I Was An NBA Wife. Here's How It Affected My Mental Health. | I felt like I was a fraud, and being an NBA wife didn't help that. In fact, it nearly destroyed me. | https://assets.msn.com/labs/mind/AACk2N6.html | [] | [{"Label": "National Basketball Association", "Type": "O", "WikidataId": "Q155223", "Confidence": 1.0, "OccurrenceOffsets": [40], "SurfaceForms": ["NBA"]}] |
N99744 | health | medical | How to Get Rid of Skin Tags, According to a Dermatologist | They seem harmless, but there's a very good reason you shouldn't ignore them. The post How to Get Rid of Skin Tags, According to a Dermatologist appeared first on Reader's Digest. | https://assets.msn.com/labs/mind/AAAKEkt.html | [{"Label": "Skin tag", "Type": "C", "WikidataId": "Q3179593", "Confidence": 1.0, "OccurrenceOffsets": [18], "SurfaceForms": ["Skin Tags"]}] | [{"Label": "Skin tag", "Type": "C", "WikidataId": "Q3179593", "Confidence": 1.0, "OccurrenceOffsets": [105], "SurfaceForms": ["Skin Tags"]}] |
N5771 | health | cardio | Check Houston traffic map for current road conditions | https://assets.msn.com/labs/mind/AAEKnO1.html | [{"Label": "Houston", "Type": "G", "WikidataId": "Q16555", "Confidence": 0.941, "OccurrenceOffsets": [6], "SurfaceForms": ["Houston"]}] | [] | |
N124534 | sports | football_nfl | Should NFL be able to fine players for criticizing officiating? | Several fines came down against NFL players for criticizing officiating this week. It's a very bad look for the league. | https://assets.msn.com/labs/mind/AAJ4lap.html | [{"Label": "National Football League", "Type": "O", "WikidataId": "Q1215884", "Confidence": 1.0, "OccurrenceOffsets": [7], "SurfaceForms": ["NFL"]}] | [{"Label": "National Football League", "Type": "O", "WikidataId": "Q1215884", "Confidence": 1.0, "OccurrenceOffsets": [32], "SurfaceForms": ["NFL"]}] |
N51947 | news | newsscienceandtechnology | How to record your screen on Windows, macOS, iOS or Android | The easiest way to record what's happening on your screen, whichever device you're using. | https://assets.msn.com/labs/mind/AADlomf.html | [{"Label": "Microsoft Windows", "Type": "J", "WikidataId": "Q1406", "Confidence": 1.0, "OccurrenceOffsets": [29], "SurfaceForms": ["Windows"]}] | [] |
user-sample.tsv
uid | history |
---|---|
U12 | N88753,N45436,N23144 |
U34 | N86255,N93187,N75236,N99744 |
U56 | N88753,N93187,N5771 |
U78 | N124534,N51947,N99744 |
interaction-sample.tsv
uid | nid | click |
---|---|---|
U12 | N5771 | 1 |
U12 | N93187 | 1 |
U12 | N124534 | 0 |
U34 | N23144 | 1 |
U34 | N88753 | 0 |
U56 | N86255 | 1 |
U56 | N99744 | 0 |
U78 | N88753 | 1 |
U78 | N75236 | 0 |
Loading Data
import pandas as pd
item = pd.read_csv(
filepath_or_buffer='news-sample.tsv',
sep='\t',
names=['nid', 'category', 'subcategory', 'title', 'abstract', 'url', 't_ent', 'a_ent'],
usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('') # Handle missing values
user = pd.read_csv(
filepath_or_buffer='user-sample.tsv',
sep='\t',
names=['uid', 'history'],
)
interaction = pd.read_csv(
filepath_or_buffer='interaction-sample.tsv',
sep='\t',
names=['uid', 'nid', 'click'],
)
Defining Tokenization Jobs
from unitok import UniTok, Vocab
from unitok.tokenizer import BertTokenizer, TransformersTokenizer, EntityTokenizer, SplitTokenizer, DigitTokenizer
item_vocab = Vocab(name='nid') # will be used across datasets
user_vocab = Vocab(name='uid') # will be used across datasets
with UniTok() as item_ut:
bert_tokenizer = BertTokenizer(vocab='bert')
llama_tokenizer = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')
item_ut.add_job(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid', key=True)
item_ut.add_job(tokenizer=bert_tokenizer, column='title', name='title@bert', truncate=20)
item_ut.add_job(tokenizer=llama_tokenizer, column='title', name='title@llama', truncate=20)
item_ut.add_job(tokenizer=bert_tokenizer, column='abstract', name='abstract@bert', truncate=50)
item_ut.add_job(tokenizer=llama_tokenizer, column='abstract', name='abstract@llama', truncate=50)
item_ut.add_job(tokenizer=EntityTokenizer(vocab='category'), column='category')
item_ut.add_job(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')
with UniTok() as user_ut:
user_ut.add_job(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
user_ut.add_job(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)
with UniTok() as inter_ut:
inter_ut.add_index_job(name='index')
inter_ut.add_job(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid')
inter_ut.add_job(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid')
inter_ut.add_job(tokenizer=DigitTokenizer(vocab='click', vocab_size=2), column='click')
Please refer to the Vocabulary, Tokenizer, and Job pages for more information.
Tokenizing and Saving Data
item_ut.tokenize(item).save('sample-ut/item')
item_vocab.deny_edit() # will raise an error if new items are detected in the user or interaction datasets
user_ut.tokenize(user).save('sample-ut/user')
inter_ut.tokenize(interaction).save('sample-ut/interaction')
Combining UniToks
# => {'category': 0, 'nid': 0, 'title@bert': [1996, 9639, 3035, 3870, ...], 'title@llama': [450, 1771, 4167, 10470, ...], 'abstract@bert': [4497, 1996, 14960, 2015, ...], 'abstract@llama': [1383, 459, 278, 451, ...], 'subcategory': 0}
print(item_ut[0])
# => {'uid': 0, 'history': [0, 1, 2]}
print(user_ut[0])
# => {'uid': 0, 'nid': 7, 'index': 0, 'click': 1}
print(inter_ut[0])
with inter_ut:
inter_ut.union(user_ut)
# => {'index': 0, 'click': 1, 'uid': 0, 'nid': 7, 'history': [0, 1, 2]}
print(inter_ut[0])
Please refer to the UniTok page for more information.
Glance at the Terminal
Tokenizer | Tokenizer ID | Column Mapping | Vocab | Max Length |
---|---|---|---|---|
TransformersTokenizer | auto_2VN5Ko | abstract -> abstract@llama | llama (size=32024) | 50 |
EntityTokenizer | auto_C0b9Du | subcategory -> subcategory | subcategory (size=8) | N/A |
TransformersTokenizer | auto_2VN5Ko | title -> title@llama | llama (size=32024) | 20 |
EntityTokenizer | auto_4WQYxo | category -> category | category (size=4) | N/A |
BertTokenizer | auto_Y9tADT | abstract -> abstract@bert | bert (size=30522) | 46 |
BertTokenizer | auto_Y9tADT | title -> title@bert | bert (size=30522) | 16 |
EntityTokenizer | auto_qwQALc | nid -> nid | nid (size=10) | N/A |
Please refer to the UniTok in CLI page for more information.