My goals in this talk:
We deal with data every day, including natural language text.
There are a variety of tasks relevant to us, like extracting structured information from unstructured text (e.g., topic extraction from whitepapers), automation, and document search.
In order to perform these tasks, we (or the computers doing the work for us) need to be able to parse this text and extract information from it.
You can use a technique known as sentence diagramming to break sentences down into their key components. This is an example of diagramming using the Reed-Kellogg system.
Start with the following sentence:
The dog brought me his old ball in the morning.
Step 1: Diagram the subject noun ("dog") and the main predicate verb ("brought").
Step 2: Add the direct object ("ball").
Step 3: Add indirect object(s) ("me").
Step 4: Add prepositional phrases ("in the morning").
Step 5: Add modifiers and articles ("The," "his," "old").
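As a preview of where we're headed, spaCy's dependency parser recovers essentially the same structure automatically. Here is a minimal sketch, assuming you have installed spaCy and downloaded the en_core_web_sm model:

```python
import spacy

# Assumes the small English model is available:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The dog brought me his old ball in the morning.")
for token in doc:
    # dep_ holds the dependency label: nsubj (subject), dobj (direct
    # object), dative (indirect object), det (article), and so on.
    print(f"{token.text:10} {token.pos_:6} {token.dep_:8} head: {token.head.text}")
```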
Here are some of the key terms you will often see in natural language processing discussions or the literature.
Natural language processing is a field within artificial intelligence, with the goal of recognizing and analyzing text and speech.
We use a variety of techniques to do this, including computational linguistics, rule-based modeling of human languages, statistical modeling of languages, machine learning, and deep learning.
Modern natural language processing builds heavily on top of neural networks and deep learning techniques.
We will not get into neural network architectures or detailed descriptions of the topic in this talk. This is a beginner-level NLP talk!
Natural language processing enables us to perform a variety of tasks, including but not limited to:
Natural language processing brings with it a series of challenges, including but not limited to:
spaCy is a Python library for natural language processing.
It is designed to be fast, efficient, and easy to use.
It is not an API, a software-as-a-service product, or a chatbot engine.
Key benefits of spaCy include:
The Natural Language Toolkit (NLTK) is another open-source Python library for natural language processing.
NLTK is more research-friendly, emphasizing the ability to play with natural language over production speed.
Why do we focus on spaCy in this talk?
| NLTK | spaCy |
|---|---|
| Research-oriented | Production-oriented |
| Slower | Faster |
| More complex API | Simpler API |
| More features | Fewer features |
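To make the "simpler API" point concrete, here is a hedged sketch of tokenizing the same sentence with each library (assuming NLTK's punkt tokenizer data and spaCy's en_core_web_sm model are available):

```python
# NLTK: download tokenizer data, then call the tokenizer you choose.
import nltk
nltk.download("punkt", quiet=True)  # newer NLTK versions may want "punkt_tab"
from nltk.tokenize import word_tokenize

print(word_tokenize("The dog brought me his old ball."))

# spaCy: load a pipeline once; tokenization (plus tagging, parsing,
# and more) happens in a single call.
import spacy

nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("The dog brought me his old ball.")])
```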
NLTK is still a great choice for:
Some foundation in how natural languages work can be helpful when dealing with natural language processing. Not only does this help you understand why you perform specific tasks, but it can also point you toward structured ways of thinking about the problem.
This does not mean you need a PhD in linguistics, or even formal training in the field!
Most of the work we've done so far involves parts of speech and a syntactic understanding of language. This is the domain of traditional natural language processing and is still very useful today.
But it does lead to certain problems.
Traditional natural language processing relies on a syntactic understanding of language. This is great for many tasks, but it doesn't help us with the semantic meaning of a word.
For example, the word "bank" can refer to a financial institution, the side of a river, or a place to store something.
We also lack the ability to perform semantic comparisons: wolf is to dog as tiger is to ___?
We need a way to represent the meaning of a word in a way that is more flexible and can be used for a variety of tasks.
In 2013, Tomas Mikolov and his team at Google released a paper entitled "Efficient Estimation of Word Representations in Vector Space." In this paper, they introduced Word2Vec.
Word2Vec is a technique for representing words as vectors in a high-dimensional space. This allows us to perform semantic comparisons and find similar words.
The key intuition is that we can represent words as dense vectors in a continuous vector space. "Vector space" is a fancy term for representing something as a fixed-size list of numbers.
Vectors retain some amount of their semantic meaning in that vector space.
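To make "fixed-size list of numbers" concrete, a toy word vector might look like this (the values are invented for illustration; real embeddings are learned and run to hundreds of dimensions):

```python
import numpy as np

# A made-up five-dimensional "word vector," for illustration only.
dog = np.array([0.21, -0.43, 0.90, 0.05, -0.12])
print(dog.shape)  # (5,)
```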
Word2Vec introduced two techniques: Skip-Gram and Continuous Bag of Words.
Skip-Gram: Given a word, predict the surrounding words in a sentence.
___ score ___ ___ ___ ___ ->
"Four score and seven years ago"
Continuous Bag of Words: Given a set of surrounding words, predict the word in the middle.
Four ___ and seven years ago ->
"score"
We can perform mathematical operations on vectors, such as adding, subtracting, multiplying and dividing them.
The classic Word2Vec formulation is: "King - Man + Woman = ??"
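Here is a rough sketch of that arithmetic using spaCy's word vectors. This assumes the en_core_web_md model (which ships with static vectors); the scores are approximate and model-dependent:

```python
import numpy as np
import spacy

# Assumes a model that includes word vectors:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def vec(word):
    return nlp.vocab[word].vector

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The famous result: the vector nearest to King - Man + Woman is "queen."
target = vec("king") - vec("man") + vec("woman")
for candidate in ["queen", "princess", "king", "banana"]:
    print(candidate, round(cosine(target, vec(candidate)), 3))
```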
Word2Vec shocked the natural language processing community when the paper first came out. Since then, the biggest developments have been in how we build those vector representations.
Embedding techniques like GloVe, FastText, and BERT are able to capture even more semantic information.
Vectorization is the process of translating data into numerical vectors.
Embeddings are learned vector representations of words, phrases, or even entire documents.
We generate embeddings using neural network models trained on large datasets of text. The model learns to position similar ideas close together in vector space.
Embeddings also allow us to use techniques like cosine similarity to determine how similar two vectors are, which ties back to the semantic similarity between the underlying concepts.
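As a quick sketch (again assuming en_core_web_md, since similarity scores from the small model, which lacks static vectors, are not meaningful):

```python
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("The dog fetched his ball.")
doc2 = nlp("A puppy retrieved its toy.")
doc3 = nlp("Interest rates rose last quarter.")

# Doc.similarity computes cosine similarity over averaged word vectors.
print(doc1.similarity(doc2))  # relatively high: similar meaning
print(doc1.similarity(doc3))  # lower: unrelated topics
```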
Over the course of this talk, we gained a high-level understanding of what natural language processing is. We looked at one very popular NLP library in spaCy and saw how NLP has transitioned from a syntax-heavy approach to a semantics-heavy approach thanks to vectorization and embeddings.
To learn more, go here:
https://csmore.info/on/basicsofnlp
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/contact