7 Natural Language Processing Techniques for Extracting Information

Chathurangi Jayawardana
Published in Analytics Vidhya
3 min read · Dec 5, 2020


Photo by Chris Ried on Unsplash

Hi all,

This article covers seven Natural Language Processing techniques:

  1. Tokenization
  2. Stemming and Lemmatization
  3. Bag of Words
  4. Named Entity Recognition (NER)
  5. Sentence Segmentation
  6. Natural Language Generation
  7. Sentiment Analysis

1. Tokenization

Tokenization is one of the most common tasks when working with text data. It means splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token. Tokenization is an integral part of any Information Retrieval (IR) system: it is a text pre-processing step, and the tokens it produces are used in the indexing and ranking process.
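As a rough illustration, here is a minimal tokenizer sketch in Python. It uses only the standard `re` module; the function name `tokenize` and the regular expression are assumptions chosen for this example, and real tokenizers (for instance those in NLTK or spaCy) handle many more cases.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens and standalone punctuation marks."""
    # \w+(?:'\w+)? keeps simple contractions such as "isn't" together;
    # [^\w\s] captures each punctuation character as its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower())

print(tokenize("Tokenization isn't hard, is it?"))
# ['tokenization', "isn't", 'hard', ',', 'is', 'it', '?']
```

Lowercasing here is a design choice that suits indexing; keep the original case if capitalization matters for your task.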

2. Stemming and Lemmatization

Stemming and lemmatization are text normalization techniques in Natural Language Processing that are used to prepare text, words, and documents for further processing. Stemming is the process of reducing inflected words to their word stem, base, or root form, which is generally a written form of the word. Lemmatization relies on a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
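The difference is easier to see side by side. The sketch below is a toy: the suffix list and the small lemma dictionary are made up for illustration, and this is not the Porter stemmer or any library's lemmatizer.

```python
# Toy suffix list and lemma dictionary: illustrative assumptions only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"studies": "study", "better": "good", "ran": "run", "mice": "mouse"}

def stem(word):
    """Crudely strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "jumped", "better"]:
    print(w, "| stem:", stem(w), "| lemma:", lemmatize(w))
```

Here `stem("studies")` gives `studi`, which is not a dictionary word, while `lemmatize("studies")` gives `study`. That is the practical trade-off: stemming is a fast rule-based cut, lemmatization needs vocabulary knowledge.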

3. Bag of Words

Bag of words is a text-modelling technique in Natural Language Processing. In technical terms, it is a method of extracting features from text data, and it is both simple and flexible. A bag of words is a representation of text that describes the occurrence of words within a document: we keep track of word counts and disregard grammatical details and word order. It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur.
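A minimal sketch of the idea, using only `collections.Counter` from the standard library. The function name `bag_of_words` and the whitespace tokenization are simplifying assumptions for this example.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vectors = bag_of_words(["the cat sat on the mat", "the dog sat"])
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each document becomes a vector of counts over the shared vocabulary, and the word order is gone, which is exactly the "bag" property described above.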

4. Named Entity Recognition (NER)

Named entity recognition (NER) helps you easily identify the key elements in a text, like names of people, places, brands, monetary values, and more. Extracting the main entities in a text helps sort unstructured data and detect important information, which is crucial if you have to deal with large datasets.
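To make this concrete, here is a very small rule-based sketch that combines a lookup list (a gazetteer) with a regular expression for dollar amounts. The entity lists, labels, and the function name `find_entities` are hypothetical; production NER systems are usually statistical models trained on annotated text rather than hand-written lists.

```python
import re

# Hypothetical gazetteer: a real system would learn entities from annotated data.
GAZETTEER = {
    "PERSON": ["Ada Lovelace"],
    "PLACE": ["London", "Sri Lanka"],
    "BRAND": ["Acme"],
}
# Dollar amounts such as $5, $1,200 or $3.50 (an assumed, simplified pattern).
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")

def find_entities(text):
    """Return (start offset, entity text, label) tuples in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((match.start(), match.group(), label))
    for match in MONEY.finditer(text):
        found.append((match.start(), match.group(), "MONEY"))
    return sorted(found)

for _, entity, label in find_entities("Ada Lovelace paid $1,200 to Acme in London."):
    print(label, "->", entity)
```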

5. Sentence Segmentation

Sentence segmentation is the problem of dividing a string of written language into its component sentences. When dealing with text, we often need to break it up into individual sentences, and sentence segmentation is the process of obtaining those sentences from a text corpus. It is also referred to as sentence boundary detection, sentence boundary disambiguation, or sentence boundary recognition. Many libraries, such as NLTK, spaCy, and Stanford CoreNLP, provide specific functions for this task.
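To show why this is harder than splitting on every period, here is a small standard-library sketch. The abbreviation list and the function name `split_sentences` are assumptions for illustration; the libraries mentioned above use much larger lists or trained models.

```python
import re

# Small, assumed abbreviation list; real tools use far larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split after ., ! or ? followed by whitespace, unless the word before is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Silva arrived late. Was the talk over? No!"))
# ['Dr. Silva arrived late.', 'Was the talk over?', 'No!']
```

Without the abbreviation check, "Dr." would be treated as a full sentence, which is the kind of ambiguity that "sentence boundary disambiguation" refers to.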

6. Natural Language Generation

Natural language generation (NLG) is a technique that converts raw structured data into plain English (or any other language). It is also called data storytelling. This technique is very helpful in organizations that work with large amounts of data, because it turns structured data into natural language, giving a better understanding of patterns and more detailed insights into the business. NLG can be viewed as the opposite of Natural Language Understanding (NLU). NLG makes data understandable to everyone by producing data-driven reports such as stock-market and financial reports, meeting memos, and reports on product requirements.
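The simplest form of NLG is template-based: fill a sentence pattern from a structured record. The sketch below assumes a hypothetical record with `name`, `open`, and `close` fields; modern NLG systems are typically far more sophisticated than a fixed template.

```python
def describe_stock(record):
    """Turn one structured record into an English sentence using a fixed template."""
    change = record["close"] - record["open"]
    percent = abs(change) / record["open"] * 100
    if change > 0:
        verb = "rose"
    elif change < 0:
        verb = "fell"
    else:
        return f"{record['name']} closed unchanged at {record['close']:.2f}."
    return f"{record['name']} {verb} {percent:.1f}% to close at {record['close']:.2f}."

print(describe_stock({"name": "ACME", "open": 100.0, "close": 104.5}))
# ACME rose 4.5% to close at 104.50.
```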

7. Sentiment Analysis

Sentiment analysis (also known as opinion mining) is a text analysis technique that detects polarity (e.g. a positive or negative opinion) within text, whether a whole document, paragraph, sentence, or clause. Understanding people’s emotions is essential for businesses since customers express their thoughts and feelings more openly than ever before. Automatically analyzing customer feedback, such as opinions in survey responses and social media conversations, allows brands to listen attentively to their customers, and tailor products and services to meet their needs.
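One simple way to detect polarity is a lexicon-based score: count positive and negative words and compare. The word lists, the one-word negation rule, and the function name `polarity` below are assumptions made for this sketch; real systems use much larger lexicons or trained classifiers.

```python
import re

# Tiny illustrative lexicon: these words and the negation rule are assumptions.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Return 'positive', 'negative' or 'neutral' from a signed word count."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, word in enumerate(words):
        value = (word in POSITIVE) - (word in NEGATIVE)
        if i > 0 and words[i - 1] in NEGATIONS:
            value = -value  # flip polarity right after a negation word
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this phone, the camera is great"))
print(polarity("The delivery was not good and support was terrible"))
```

The first call scores two positive words, and the second scores a negated "good" plus "terrible", so they come out as positive and negative respectively.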

Thank you!

Chathurangi Jayawardana

Software Engineer | Technical Writer | University of Moratuwa, Sri Lanka.