
How to Use the Hugging Face Tokenizers Library to Preprocess Text Data
Image by Author

 

If you have studied NLP, you might have heard about the term “tokenization.” It is a crucial step in text preprocessing, where we transform our textual data into something that machines can understand. It does so by breaking down a sentence into smaller chunks, known as tokens. These tokens can be words, subwords, or even characters, depending on the tokenization algorithm being used. In this article, we will see how to use the Hugging Face Tokenizers library to preprocess our text data.

 

Setting Up the Hugging Face Tokenizers Library

 

To start using the Hugging Face Tokenizers library, you’ll need to install it first. You can do this using pip:
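pip install tokenizers

The examples below also import BertTokenizer from the transformers library, so install that package as well if you don’t already have it:

pip install transformers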

 

The Hugging Face library supports various tokenization algorithms, but the three main types are listed below (a minimal BPE training sketch follows the list):

  • Byte-Pair Encoding (BPE): Iteratively merges the most frequent pairs of characters or subwords, creating a compact vocabulary. It is used by models like GPT-2.
  • WordPiece: Similar to BPE, but it uses probabilistic merges (it does not pick the most frequent pair, but the one that maximizes the likelihood of the corpus once merged). It is commonly used by models like BERT.
  • SentencePiece: A more flexible tokenizer that can handle different languages and scripts, often used with models like ALBERT, XLNet, or the Marian framework. It treats spaces as characters rather than word separators.
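Although this article uses a pre-trained tokenizer, the Tokenizers library also lets you train one of these algorithms from scratch. Here is a minimal, illustrative sketch of training a small BPE tokenizer; the tiny corpus and vocab_size are made up purely for demonstration:

# Train a small BPE tokenizer from scratch (illustrative sketch only)
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

bpe_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
bpe_tokenizer.train_from_iterator(
    ["Tokenization is a crucial step in NLP.", "Tokenizers build vocabularies from text."],
    trainer=trainer,
)

# Inspect how the trained tokenizer splits a sentence into subword tokens
print(bpe_tokenizer.encode("Tokenization is crucial.").tokens)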

The Hugging Face Transformers library provides an AutoTokenizer class that can automatically select the best tokenizer for a given pre-trained model. This is a convenient way to use the correct tokenizer for a specific model and can be imported from the transformers library. However, for the sake of our discussion regarding the Tokenizers library, we won’t follow this approach.
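For reference, the AutoTokenizer approach would look roughly like this (a sketch only; we won’t use it in the rest of the article):

# Let AutoTokenizer pick the matching tokenizer class from the model name
from transformers import AutoTokenizer

auto_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(auto_tokenizer))  # typically a BertTokenizerFast instance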

We’ll use the pre-trained BERT-base-uncased tokenizer. This tokenizer was trained on the same data and using the same techniques as the BERT-base-uncased model, which means it can be used to preprocess text data compatible with BERT models:

# Import the necessary components
from tokenizers import Tokenizer
from transformers import BertTokenizer

# Load the pre-trained BERT-base-uncased tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

 

Single Sentence Tokenization

 

Now, let’s encode a simple sentence using this tokenizer:

# Tokenize a single sentence
encoded_input = tokenizer.encode_plus("This is sample text to test tokenization.")
print(encoded_input)

 

Output:

{'input_ids': [101, 2023, 2003, 7099, 3793, 2000, 3231, 19204, 3989, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

 

To ensure correctness, let’s decode the tokenized input:

tokenizer.decode(encoded_input["input_ids"])

 

Output:

[CLS] this is sample text to test tokenization. [SEP]

 

In this output, you can see two special tokens. [CLS] marks the start of the input sequence, and [SEP] marks the end, indicating a single sequence of text.
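If you want to check these special tokens (and the padding token used later) directly, the tokenizer exposes them as attributes, for example:

# Inspect BERT's special tokens and their IDs
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102
print(tokenizer.pad_token, tokenizer.pad_token_id)  # [PAD] 0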

 

Batch Tokenization

 

Now, let’s tokenize a corpus of text instead of a single sentence using batch_encode_plus:

corpus = [
    "Hello, how are you?",
    "I am learning how to use the Hugging Face Tokenizers library.",
    "Tokenization is a crucial step in NLP."
]
encoded_corpus = tokenizer.batch_encode_plus(corpus)
print(encoded_corpus)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

 

For better understanding, let’s decode the batch-encoded corpus as we did in the case of a single sentence. This will show the original sentences, tokenized appropriately.

tokenizer.batch_decode(encoded_corpus["input_ids"])

 

Output:

['[CLS] hello, how are you? [SEP]',
 '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
 '[CLS] tokenization is a crucial step in nlp. [SEP]']

 

Padding and Truncation

 

When preparing data for machine learning models, it is often necessary to ensure that all input sequences have the same length. Two techniques to accomplish this are:

 

1. Padding

Padding works by adding the special token [PAD] at the end of the shorter sequences to match the length of the longest sequence in the batch, or the maximum length supported by the model if max_length is defined. You can do this by:

encoded_corpus_padded = tokenizer.batch_encode_plus(corpus, padding=True)
print(encoded_corpus_padded)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}

 

Now, you can see that extra 0s are appended, but for better understanding, let’s decode to see where the tokenizer has placed the [PAD] tokens:

tokenizer.batch_decode(encoded_corpus_padded["input_ids"], skip_special_tokens=False)

 

Output:

['[CLS] hello, how are you? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
 '[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
 '[CLS] tokenization is a crucial step in nlp. [SEP] [PAD] [PAD] [PAD] [PAD]']

 

2. Truncation

Many NLP models have a maximum input sequence length, and truncation works by cutting off the end of longer sequences to meet this maximum length. It reduces memory usage and prevents the model from being overwhelmed by very long input sequences.

encoded_corpus_truncated = tokenizer.batch_encode_plus(corpus, truncation=True, max_length=5)
print(encoded_corpus_truncated)

 

Output:

{'input_ids': [[101, 7592, 1010, 2129, 102], [101, 1045, 2572, 4083, 102], [101, 19204, 3989, 2003, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

 

Now, you could also use the batch_decode method, but for better understanding, let’s print this information in a different way:

for i, sentence in enumerate(corpus):
    print(f"Unique sentence: {sentence}")
    print(f"Token IDs: {encoded_corpus_truncated['input_ids'][i]}")
    print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded_corpus_truncated['input_ids'][i])}")
    print()

 

Output:

Original sentence: Hello, how are you?
Token IDs: [101, 7592, 1010, 2129, 102]
Tokens: ['[CLS]', 'hello', ',', 'how', '[SEP]']

Original sentence: I am learning how to use the Hugging Face Tokenizers library.
Token IDs: [101, 1045, 2572, 4083, 102]
Tokens: ['[CLS]', 'i', 'am', 'learning', '[SEP]']

Original sentence: Tokenization is a crucial step in NLP.
Token IDs: [101, 19204, 3989, 2003, 102]
Tokens: ['[CLS]', 'token', '##ization', 'is', '[SEP]']
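In practice, padding and truncation are often combined in a single call so that every sequence in the batch ends up with exactly the same length. Here is a short sketch (max_length=8 is arbitrary; you could also pass return_tensors="pt" to get PyTorch tensors ready for a model):

# Pad and truncate every sequence to the same length in one call
batch = tokenizer.batch_encode_plus(
    corpus,
    padding="max_length",
    truncation=True,
    max_length=8,
)
print(batch["input_ids"])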

 

This article is part of our series on Hugging Face, so if you want to explore more about this topic, the rest of the series is a good place to start.

 
 

Kanwal Mehreen
Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
