Natural Language Processing and Linguistic Tools
The term natural language refers to the way we humans communicate with each other. The field of study concerned with the interactions between computers and human (natural) languages is called natural language processing, or NLP for short. NLP is concerned with programming computers to fruitfully process large sets of natural language corpora.
Nowadays we are surrounded by natural language in the form of text, for example: social media posts, SMS, signs, emails, comments, tweets and web pages. We are also surrounded by natural language in the form of speech, and speech data is even more plentiful because it is easier to speak than to write. To make sense of these kinds of data we need methods to understand, process and reason about them. With the help of computer science, artificial intelligence and computational linguistics we now have such methods, and they are evolving day by day. This set of methods is called natural language processing.
In simpler terms, NLP is a smart way of analyzing, understanding and deriving meaning from human languages.
Domains of Natural Language Processing (NLP):
Due to the overall advancements in computer science, NLP has set its roots in most areas, including:
- Dialogue and Interactive Systems
- Cognitive Modeling and Psycholinguistics
- Information Extraction, Retrieval, Question Answering, Document Analysis and NLP Applications
- Social Media
- Biomedical
- Tagging, Chunking, Syntax and Parsing
- Vision, Robotics and Grounding
- Discourse and Pragmatics
- Generation and Summarization
- Machine Learning
- Machine Translation
- Multilingualism
- Phonology, Morphology and Word Segmentation
- Resources and Evaluation
- Semantics
- Sentiment Analysis and Opinion Mining
- Speech
At this stage of your learning you don't need to go deep into the topics mentioned above. We will discuss the current work and future research options for each of these areas in a separate topic.
Applications
Search engines (e.g. Google, Bing, Yahoo)
Speech engines (e.g. Google Assistant, Apple Siri)
Google Translate
Spam filtering
Spelling and grammar correction
Problem solvers (e.g. the math solver at https://www.cymath.com/)
Some of the other applications include:
Summarization,
Sentiment/opinion/review analysis,
Social media/network applications,
Biomedical applications.
Facebook, for example, uses NLP to track trending topics and popular hashtags.
The two views of NLP
- One view is based on linguistic analysis. (Also characterized as symbolic, because it consists of rules for the manipulation of symbols, such as grammar rules.)
- The second view is based on statistical analysis of language. (Also characterized as empirical, because it involves deriving language data from relatively large text corpora such as web pages and news feeds.)
The linguistic analysis of a text proceeds in a layered fashion: documents are broken down into paragraphs, paragraphs into sentences, and sentences into words, which is where most part-of-speech tagging is done (see figure below). We now discuss each of these layers step by step.
1. Sentence delimiters and tokenizers
In order to parse sentences from a document, we need to determine the scope of those sentences and identify their constituents.
Sentence Delimiters:
The punctuation marks that signal the end of a sentence are often ambiguous. For example, a period can denote a decimal point, the end of a sentence, or an abbreviation. As another example, a capital letter doesn't always mean the start of a new sentence; it can follow titles like Mr., Mrs. or Dr. To resolve such ambiguities, sentence delimiters mostly rely on regular expressions.
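As a rough illustration of such a regular-expression-based delimiter, here is a minimal sketch in Python. The abbreviation list and the `split_sentences` helper are assumptions invented for this example, not a standard API:

```python
import re

# A minimal sentence delimiter: split on '.', '!' or '?' followed by
# whitespace and a capital letter, but not right after a known abbreviation.
ABBREVIATIONS = ("Mr.", "Mrs.", "Dr.")  # example list, not exhaustive

def split_sentences(text):
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        # Skip boundaries that fall right after an abbreviation like "Dr."
        if candidate.rstrip().endswith(ABBREVIATIONS):
            continue
        sentences.append(candidate.strip())
        start = match.end()
    sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith earned 3.5 GPA. He teaches NLP. Students like him."))
# → ['Dr. Smith earned 3.5 GPA.', 'He teaches NLP.', 'Students like him.']
```

Note how the decimal point in "3.5" and the period in "Dr." are not treated as sentence boundaries, while the other periods are.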
Tokenizers:
Separating a piece of text into smaller units is called tokenization, and the units are called tokens. These tokens can be words, characters or subwords.
for example:
raw_text = """'When I'M a in class,'i prefer not to talk,(it's not a good sign to be silent),
...'I won't say no to your friendship, I'am OKEY.You seem to be very
... polite but I am very-very naughty, as I am also shot-Tempered,'..."""
For now, ignore what is written inside the code below and just focus on the output. With the help of regular expressions we can easily separate the delimiters [(, ), ., '], characters and words.
import re
re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw_text)
Output:
["'", 'When', "I'M", 'a', 'in', 'class', ',', "'", 'i', 'prefer', 'not', 'to', 'talk', ',', '(', "it's", 'not', 'a', 'good', 'sign', 'to', 'be', 'silent', ')', ',', '...', "'", 'I', "won't", 'say', 'no', 'to', 'your', 'friendship', ',', "I'am", 'OKEY', '.', 'You', 'seem', 'to', 'be', 'very', '...', 'polite', 'but', 'I', 'am', 'very-very', 'naughty', ',', 'as', 'I', 'am', 'also', 'shot-Tempered', ',', "'", '...']
2. Stemmers and taggers
Lexical analysis is a necessary step before parsing; without it we cannot parse at all. So it is necessary to first identify the root forms of words and determine their parts of speech.
Stemmers:
A stemmer reduces the different forms of a word to a common root. For example, the words 'go', 'goes', 'going', 'gone' and 'went' will all be associated with the root form 'go'.
To do so we have two forms of morphological processes:
Inflectional and Derivational
As the name tells us, 'inflection' means a change of form of a noun, adjective, verb, etc. Inflectional morphology expresses syntactic relations between words of the same part of speech, like inflate and inflates, city and cities, wish and wishes. Derivational morphology expresses the creation of new words from old words; for example, 'unhappy' is formed from 'happy' and has the opposite meaning. Derivation can also involve a change in the grammatical category of a word, e.g. inflate and inflation.
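To make the stemming idea concrete, here is a deliberately naive suffix-stripping stemmer, a toy sketch with an invented suffix list. Real stemmers such as the Porter stemmer apply many ordered rules, and irregular forms like 'went' → 'go' need a lemmatizer with a dictionary rather than suffix stripping:

```python
# A toy inflectional stemmer: strip a few common suffixes.
# The suffix list is illustrative only; order matters (longest first).
SUFFIXES = ("ings", "ing", "ies", "es", "ed", "s")

def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least two characters of stem would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["going", "goes", "wished", "wishes"]])
# → ['go', 'go', 'wish', 'wish']
```

A stemmer this crude will also mangle words ("cities" becomes "cit", not "city"), which is why practical systems use carefully designed rule sets or lemmatization.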
Part of speech taggers:
Also known as POS tags, word classes, or syntactic categories. A part-of-speech tagger chooses and labels each word as a noun, verb, adjective, etc. POS tags are useful because they tell us a lot about a word and its neighbors: once we know whether a word is a noun or a verb, we gain information about the syntactic structure around it, as in the image below.
You can try your own sentences at this link: https://parts-of-speech.info/
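As a minimal sketch of what a tagger does, here is a toy dictionary-lookup tagger. The lexicon and tag names are invented for this example; real taggers use context and statistics to resolve words that can belong to several classes:

```python
# A toy dictionary-based POS tagger. Real taggers (HMM or neural)
# use surrounding context to disambiguate; this lookup cannot.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "ball": "NOUN",
    "chased": "VERB", "caught": "VERB",
    "red": "ADJ",
}

def tag(tokens):
    # Unknown words default to NOUN, a common fallback guess.
    return [(t, LEXICON.get(t.lower(), "NOUN")) for t in tokens]

print(tag("The dog chased a red ball".split()))
# → [('The', 'DET'), ('dog', 'NOUN'), ('chased', 'VERB'),
#    ('a', 'DET'), ('red', 'ADJ'), ('ball', 'NOUN')]
```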
3. Noun phrase and name recognizers
Noun phrase extractors:
To identify word clusters we need to go beyond part-of-speech tagging. For this purpose we use noun phrase extractors: programs that identify base noun phrases, i.e. the main noun in the phrase and its left modifiers (the determiners and adjectives occurring just to the left of it).
Note: Noun phrase programs are typically partial parsers, sometimes also called shallow parsers, rather than deep parsers.
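The behaviour of a base noun phrase extractor can be sketched over already-tagged input. The tag names and the helper below are assumptions for illustration, not a standard chunking API:

```python
# A minimal base-NP chunker over (word, POS) pairs: collect an optional
# determiner, any adjectives, and the noun that closes the phrase.
def base_noun_phrases(tagged):
    phrases, current = [], []
    for word, pos in tagged:
        if pos in ("DET", "ADJ"):
            current.append(word)       # left modifiers accumulate
        elif pos == "NOUN":
            current.append(word)       # head noun closes the phrase
            phrases.append(" ".join(current))
            current = []
        else:
            current = []               # anything else breaks the chunk
    return phrases

tagged = [("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
          ("jumped", "VERB"), ("over", "ADP"),
          ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]
print(base_noun_phrases(tagged))
# → ['The quick fox', 'the lazy dog']
```

This is exactly the "shallow" behaviour described above: it finds flat chunks but builds no tree and assigns no grammatical roles.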
Name recognizers or name finders (mostly termed "named entity" recognizers, NER):
Name finders recognize the proper nouns in documents and also classify them as people, companies, places, organizations, countries and the like.
For example: "Taiwan is one of the leading countries in the electronics sector. Last Thursday Mrs. Tsai Ing-wen, the president of Taiwan, praised the leading business tycoons in Taiwan, and the CEO of the Taiwan semiconductor industry was awarded." You can also try online NER tools by entering example sentences.
In the sentence above:
Taiwan would be identified as a place,
last Thursday as a date,
Mrs. Tsai Ing-wen as a person,
Taiwan semiconductor industry as a company.
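A crude way to mimic a name finder is a gazetteer lookup. The entity list below is invented for this example; real NER systems combine such lists with statistical sequence models that also handle names they have never seen:

```python
# A toy gazetteer-based named-entity recognizer: look up known names.
GAZETTEER = {
    "Taiwan": "PLACE",
    "Tsai Ing-wen": "PERSON",
    "last Thursday": "DATE",
}

def find_entities(text):
    # Report every known name that literally occurs in the text.
    return [(name, label) for name, label in GAZETTEER.items() if name in text]

print(find_entities("Last year Tsai Ing-wen visited companies across Taiwan."))
# → [('Taiwan', 'PLACE'), ('Tsai Ing-wen', 'PERSON')]
```

The obvious weakness, which statistical NER fixes, is that only names already in the list can ever be found.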
4. Parsers and grammars
When we have to do deep parsing, we require another set of tools that tell us about the sentence structure. With named entities we are able to recognize a word or phrase as a proper noun, but with the help of parsers and grammars we can also identify the role that proper noun plays in a sentence, e.g. whether it is the subject or the object.
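As a toy illustration of the extra information a parser provides, the sketch below assigns grammatical roles purely by position in a simple subject-verb-object sentence. This is an invented helper, not a real parser; real parsers build a full tree from a grammar and handle far more sentence shapes:

```python
# Crude role assignment for a simple "subject verb object" sentence:
# the noun before the verb is the subject, the noun after it the object.
def naive_roles(tagged):
    roles = {}
    seen_verb = False
    for word, pos in tagged:
        if pos == "VERB":
            seen_verb = True
        elif pos == "NOUN":
            # setdefault keeps only the first noun found for each role.
            roles.setdefault("object" if seen_verb else "subject", word)
    return roles

print(naive_roles([("Taiwan", "NOUN"), ("leads", "VERB"), ("electronics", "NOUN")]))
# → {'subject': 'Taiwan', 'object': 'electronics'}
```

Even this tiny example shows the point of the section: NER tells us "Taiwan" is a place, but only parsing-like analysis tells us it is the subject of the sentence.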