How to lexically process textual data?

Ankit Gupta
7 min read · Sep 21, 2022

We all converse. From the time we wake up to the time we go back to sleep, we use speech to convey our thoughts and ideas. In the contemporary world of digital connectivity, every passing moment we generate huge volumes of textual data. Twitter, Facebook and other social media platforms, along with online shopping sites, generate humongous amounts of relevant data. These data are gold mines for governments, corporations and businesses, which can leverage them to create useful and actionable insights.

In our journey, we will start with the basics of regular expressions (regex) and then proceed to other text-processing methods such as stemming and lemmatization.

Regular Expression

A regular expression is a pattern describing the string, or set of strings, that we are searching for. It allows us to easily search for any specific pattern in a huge corpus of text data.

Quantifiers

Quantifiers specify the number of instances of a character or character class that must be present in the input for a match to be found.

The following code snippet shows some of the quantifiers and examples of their application.
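A minimal sketch of some common quantifiers (the sample text and patterns here are illustrative, not from the original snippet):

#importing regex library
import re
text = 'aaab ab b color colour'
#'+' matches one or more of the preceding token
print(re.findall(r'a+b', text)) #['aaab', 'ab']
#'*' matches zero or more of the preceding token
print(re.findall(r'a*b', text)) #['aaab', 'ab', 'b']
#'?' makes the preceding token optional
print(re.findall(r'colou?r', text)) #['color', 'colour']
#'{m,n}' matches between m and n repetitions
print(re.findall(r'a{2,3}', text)) #['aaa']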

Adding to the capability of quantifiers, we can specify the kind of character we are looking for using meta-sequences such as \d (digit), \w (word character) and \s (whitespace). An example is provided below.
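A short sketch of meta-sequences, reusing the temperature example that appears again below:

#importing regex library
import re
data = 'It is very bright and sunny today. The temperature outside is 38 C.'
#\d matches any digit, so \d{2} finds the two-digit temperature value
print(re.search(r'\d{2}', data))
#\w+ matches runs of word characters, i.e. the individual words
print(re.findall(r'\w+', data)[:3]) #['It', 'is', 'very']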

We can also write the above code using an elimination approach. So instead of writing a regex that looks for a two-digit number, we can write it such that it eliminates all letters, whitespace and periods.

#importing regex library
import re
#text data example
data = 'It is very bright and sunny today. The temperature outside is 38 C.'
#pattern that we want to search: two consecutive characters that are
#neither letters, nor whitespace, nor a period
pat = r'[^A-Za-z\s.]{2}'
#so it'll give the two-digit temperature value as output
print(re.search(pat, data))

Searching for a phone number in text data.
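As an illustration, a hedged sketch for a 10-digit phone number (the format and sample text are assumed):

#importing regex library
import re
text = 'You can reach the helpdesk at 987-654-3210 or 9123456789.'
#a 10-digit phone number with optional '-' separators (assumed format)
pat = r'\b\d{3}-?\d{3}-?\d{4}\b'
print(re.findall(pat, text)) #['987-654-3210', '9123456789']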

Regular expressions (and NLP in general) form a vast subject with their own elaborate rules and guidelines, rather like a language. We just need to know the logical approach; there are multiple resources for the proper syntactical structure, so there is no need to memorize each and every detail.

Source: https://www.dataquest.io/

Regex Functions

Let us see how regex functions help us in text processing. Till now we have seen how to create patterns to perform the mining. Now we will look at the following four functions, with a combined sketch after the list.

Regex Search Function
Regex Match Function
Regex Findall Function
Regex Substitute Function
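A minimal sketch of all four functions on the temperature example (patterns are illustrative):

import re
data = 'It is very bright and sunny today. The temperature outside is 38 C.'
#search: finds the first match anywhere in the string
print(re.search(r'\d+', data))
#match: matches only at the beginning of the string
print(re.match(r'It', data))
#findall: returns all non-overlapping matches as a list
print(re.findall(r'\b\w{5}\b', data)) #['sunny', 'today']
#sub: replaces every match with the given string
print(re.sub(r'\d+', 'XX', data))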

Zipf’s Law

In terms of linguistics, when words are ranked according to their frequencies in a large enough document (say, a hundred thousand words), the frequency of a word turns out to be roughly inversely proportional to its rank; plotting frequency against rank on logarithmic axes therefore gives an approximately straight line.

What it actually means is that, in a text corpus, the words with the highest frequency are generally the ones that give us the least information, for example 'the', 'in', etc.
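We can verify this with NLTK's FreqDist on any sizeable corpus (here the Gutenberg sample of Austen's Emma, an assumed choice):

#importing nltk and a sample corpus
import nltk
nltk.download('gutenberg')
from nltk import FreqDist
from nltk.corpus import gutenberg
words = [w.lower() for w in gutenberg.words('austen-emma.txt') if w.isalpha()]
#the top-ranked words are almost all stop words
print(FreqDist(words).most_common(10))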

So we can divide our text data into three kinds of words:
1. High-frequency words, also known as stop words, e.g. in, on, at, the.
2. Important words, which are significant for our analysis.
3. Seldom-occurring words, which again don't impart much information.

So before we start analyzing any kind of text data, we remove all the stop words present in it. In Python, NLTK provides us the necessary package to perform stop word removal.
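A minimal sketch of stop word removal with NLTK (the sample sentence is illustrative):

#importing nltk stop words and tokenizer
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = 'The temperature outside is very high for this time of the year.'
stop_words = set(stopwords.words('english'))
#keep only the tokens that are not stop words
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered) #roughly: ['temperature', 'outside', 'high', 'time', 'year', '.']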

Tokenization

It is the process by which a large quantity of text is divided into smaller parts called tokens. Depending on our use case, we divide a large chunk of text into smaller units such as words or sentences.

NLTK provides a different package for each kind of tokenization; a combined sketch follows the list.
1. Word Tokenization
2. Sentence Tokenization
3. Tweet Tokenization (Great for handling Twitter lingo)
4. Regexp Tokenization (to tokenize text using regex)

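A minimal sketch of all four tokenizers (the sample tweet-like text is assumed):

#importing the tokenizers
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer, regexp_tokenize
text = 'NLTK is great! Visit https://www.nltk.org for docs. #nlp @user :-)'
#1. word tokenization
print(word_tokenize(text))
#2. sentence tokenization
print(sent_tokenize(text))
#3. tweet tokenization keeps hashtags, mentions and emoticons intact
print(TweetTokenizer().tokenize(text))
#4. regexp tokenization splits with a custom regex (here, runs of word characters)
print(regexp_tokenize(text, r'\w+'))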

Canonicalization

It is the process of reducing any given word to its base word. Some of the most popular canonicalization methods are stemming, lemmatization and phonetic hashing.

Stemming

Stemming is a rule-based method that reduces a word to its root form by simply chopping off affixes. The base word here is called a 'stem'. So an aggressive stemmer may convert 'drive', 'driving', etc. to 'driv'. Stemming is fast but often produces stems that are not valid words.

The most popular stemmer for the English language is the Porter stemmer.

Porter Stemming
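A minimal sketch of NLTK's Porter stemmer (the outputs in the comment are approximate):

#importing the Porter stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in ['running', 'studies', 'happiness', 'driving']:
    print(word, '->', stemmer.stem(word))
#roughly: running -> run, studies -> studi, happiness -> happi, driving -> drive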

Lemmatization

Lemmatization is a more advanced approach to base word reduction. In this process, the input word is looked up in a dictionary to find its base form. The base word here is called a 'lemma'. Lemmatization gives us more accurate output, but it is much slower than stemming.

The WordNet lemmatizer is the most popular lemmatizer. Here we should pass the POS tag of the word as a parameter; without it, the word is assumed to be a noun.

WordNet Lemmatizer
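A minimal sketch of the WordNet lemmatizer (the example words are illustrative):

#importing the WordNet lemmatizer
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
#with no POS tag, the word is treated as a noun
print(lemmatizer.lemmatize('mice')) #mouse
#passing the POS tag ('v' for verb, 'a' for adjective) improves accuracy
print(lemmatizer.lemmatize('driving', pos='v')) #drive
print(lemmatizer.lemmatize('better', pos='a')) #good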

Phonetic Hashing

A word’s phonetic hash is based on its pronunciation rather than its spelling. Here we reduce all the variations of a word to a common code using the Soundex algorithm, which maps every word to a four-character code: the first letter of the word followed by three digits.

Soundex Coding Rules

Steps involved: let us create the four-character code for DELHI.
1. The first letter of the code is the first letter of the input word, retained as is. So the first letter is "D".
2. All vowels and the letters H, W and Y are ignored; the remaining consonants are mapped to digits. So "E" is dropped, the next letter "L" maps to the digit "4", and the trailing "H" and "I" are also dropped.
3. The next step is to force the code into a four-character code. We pad it with zeroes, since in our case it is less than four characters long (if it were longer than four characters, we would truncate it from the right). So our final output code is "D400".

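A minimal Python sketch of the rules above (simplified: it does not handle Soundex's special treatment of consonants separated by H or W):

#letter-to-digit mapping used by Soundex
SOUNDEX_MAP = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
               'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
               's': '2', 'x': '2', 'z': '2',
               'd': '3', 't': '3', 'l': '4',
               'm': '5', 'n': '5', 'r': '6'}

def soundex(word):
    word = word.lower()
    #step 1: retain the first letter
    code = word[0].upper()
    prev = SOUNDEX_MAP.get(word[0], '')
    #step 2: drop vowels and h, w, y; map the rest to digits,
    #skipping immediate repetitions of the same digit
    for ch in word[1:]:
        digit = SOUNDEX_MAP.get(ch, '')
        if digit and digit != prev:
            code += digit
        prev = digit
    #step 3: pad with zeroes or truncate to four characters
    return (code + '000')[:4]

print(soundex('Delhi'))  #D400
print(soundex('Robert')) #R163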

Please feel free to go through the notebook below for the code for everything explained above.

Thank you for your time!

References:

  1. https://regexone.com/
  2. https://www.nltk.org/index.html
  3. https://docs.python.org/3.4/library/re.html
  4. https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf
  5. https://www.researchgate.net/publication/47390503_A_Bangla_Phonetic_Encoding_for_Better_Spelling_Suggestion
