
NLP lecture series, from basic to advanced level (Part-2)

(part-1), (part-2.1), (part-3), (part-4), (part-5), (part-6), (part-7), (part-8), (part-9), (part-10)

In the previous topic (part-1) we covered an introduction to NLP, its usage, applications, and linguistic tools. If you came directly to part-2, I recommend reading part-1 first.

Document Retrieval

You might have visited an old library where document retrieval was based on a catalog system, and seen how librarians used to search for a book for us. The advent of the World Wide Web has changed everything: these days everyone is a kind of document retriever, and everyone knows about document search and its limitations.

Before going deeper into NLP, we need to…



NLP lecture series, from basic to advanced level (part-3)

Models that assign probabilities to sequences of words are called language models.

(part-1), (part-2), (part-2.1), (part-4), (part-5), (part-6), (part-7), (part-8), (part-9), (part-10)

Predicting the future is difficult, but how about predicting the next few words immediately after the current word? For example:

Your program doesn’t _______? (work or allow or run … )
I like to eat Chinese _________? (food or cuisine or dumplings …)
She is my girl _______? (friend)
Please turn off your cell ________? (phone or membrane)

Hopefully, you can guess which next word is most likely. Language modelling is about formalizing this intuition in machines, and we do this by introducing models that assign a probability to each possible next word. These models are then used to assign a probability to an entire sentence. Based on such a model, we can predict that “work” in the first example above has the highest probability of appearing in a text like “your program doesn’t work”. …
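To make the idea concrete, here is a minimal sketch of a bigram language model in Python. The excerpt does not say which kind of model the lecture uses, so the bigram approximation, the toy corpus, and all function names (`train_bigrams`, `next_word_probs`, `sentence_prob`) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "your program does not work",
    "your program does not run",
    "your program does not work",
    "i like to eat chinese food",
]

def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of a sentence.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {word: c / total for word, c in counts[prev].items()}

def sentence_prob(counts, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        prob *= next_word_probs(counts, prev).get(curr, 0.0)
    return prob

counts = train_bigrams(corpus)
probs = next_word_probs(counts, "not")
best = max(probs, key=probs.get)
print(best, probs)
print(sentence_prob(counts, "your program does not work"))
```

On this toy corpus the most probable word after "not" is "work", because it follows "not" twice and "run" only once. A sentence containing an unseen word pair gets probability zero under this unsmoothed estimate, which is one reason practical models add smoothing.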



NLP lecture series, from basic to advanced level (part-2.1)

(Part-1), (Part-2), (Part-3), (Part-4), (Part-5), (Part-6), (Part-7), (Part-8), (Part-9), (Part-10)

This part is just additional content for this lecture series; you can skip it if you want. If you came directly here, you can read the other topics by clicking on the parts linked above.

Probabilistic Retrieval

Probabilistic retrieval estimates how relevant a document is to a given query by feeding the usual term and document frequencies as parameters to a Bayesian model. In other words, it is an attempt to formalize the idea behind ranked retrieval in terms of probability theory.

This model is actually based on some assumptions¹. …
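The excerpt does not say which probabilistic model or which assumptions are meant. As one possible illustration, the sketch below assumes the classic Binary Independence Model with no relevance information, where each query term found in a document contributes a Robertson/Sparck Jones weight computed from its document frequency. The toy documents and the function names (`doc_frequencies`, `rsj_weight`, `rank`) are hypothetical.

```python
import math

# Toy collection, invented for illustration only.
docs = {
    "d1": "probabilistic retrieval ranks documents by relevance",
    "d2": "language models assign probabilities to word sequences",
    "d3": "ranked retrieval orders documents for a query",
    "d4": "librarians search the catalog for a book",
    "d5": "word vectors capture meaning",
}

def doc_frequencies(docs):
    """Number of documents in which each term occurs at least once."""
    df = {}
    for text in docs.values():
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1
    return df

def rsj_weight(df_t, n_docs):
    """Robertson/Sparck Jones term weight without relevance feedback.

    Rare terms get a high weight; the 0.5 terms are the usual smoothing.
    The weight turns negative for terms in more than half of the documents.
    """
    return math.log((n_docs - df_t + 0.5) / (df_t + 0.5))

def rank(query, docs):
    """Score each document by the summed weights of the query terms it contains."""
    df = doc_frequencies(docs)
    n_docs = len(docs)
    query_terms = set(query.split())
    scored = []
    for doc_id, text in docs.items():
        doc_terms = set(text.split())
        score = sum(rsj_weight(df[t], n_docs) for t in query_terms if t in doc_terms)
        scored.append((doc_id, score))
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: -pair[1])

ranking = rank("probabilistic retrieval", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Here d1 ranks first because it contains both query terms, including the rarer "probabilistic"; d3 follows with only "retrieval"; the remaining documents score zero.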



NLP lecture series, from basic to advanced level (Part-1)

At the end of this lecture series you will be:

  1. able to use major tools for text analysis,
  2. competent enough to do text preprocessing,
  3. skilled at extracting features from texts,
  4. qualified to write your own NLP/IR programs,
  5. familiar with the fundamentals of NLP/IR,
  6. aware of most NLP/IR-related research topics.



word2vec treats each word in the corpus as an atomic entity and generates a vector for each word. In this sense word2vec is very much like GloVe: both treat words as the smallest unit to train on.

FastText (which is essentially an extension of the word2vec model) treats each word as composed of character n-grams, so the vector for a word is the sum of the vectors of its character n-grams.
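A minimal sketch of this idea in plain Python follows. It is not the FastText library itself: the function names, the two-dimensional toy vectors, and the dictionary lookup are assumptions made for illustration. The boundary markers `<` and `>` and the n-gram range of 3 to 6 follow the FastText paper; the real library also adds a vector for the whole word and hashes n-grams into a fixed number of buckets, both of which are omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of the word's n-grams; unknown n-grams contribute nothing."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total

# Hypothetical 2-dimensional n-gram vectors; a trained model would learn these.
ngram_vectors = {
    "<wh": [1.0, 0.0],
    "whe": [0.0, 1.0],
    "her": [0.5, 0.5],
}

trigrams = char_ngrams("where", 3, 3)
vector = word_vector("where", ngram_vectors, dim=2)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
print(vector)    # [1.5, 1.5]
```

Because the vector is built from subword pieces, a word never seen during training can still get a representation from the n-grams it shares with known words, which a purely word-level model cannot do.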

Facebook users generate an enormous amount of data every day, much of it text in the form of status updates, comments, etc. To serve its users well, Facebook needed a different way to compute word representations for the text generated by billions of users. To deal with this volume of data, Facebook released its own open-source library, FastText, for word representation and text classification. …

About

Tatheer Hussain Mir

NLP Researcher | Ph.D. | Research focus: “Social Networking and Human-Centered Computing”.
