Friday, March 2, 2017
Natural language processing (NLP) refers to the methods and technologies used to allow computers to process, understand, and perform tasks using human language. Common NLP tasks include sentiment analysis, part-of-speech tagging, named entity recognition, machine translation, document classification, clustering, and topic extraction. This course will introduce fundamental concepts in NLP, including word and document representation, text processing, document classification, document similarity, clustering, and dimensionality reduction. The course will be taught using Jupyter notebooks in Python. The NLP tools covered will be scikit-learn and NLTK.
Who: This course is targeted primarily at graduate students and researchers who have some experience with machine learning and Python, but are new to NLP.
Requirements: Participants must bring a laptop with a few specific software packages installed (see Pre-Workshop Instructions).
Prerequisites: A previous course in programming is strongly recommended. Experience with basic machine learning is recommended.
Contact: Please email [email protected] for more information.
Time | Session
---|---
8:30-9:00 | Sign-in (coffee & bagels)
9:00-10:30 | Text Processing and Document Classification
10:30-10:45 | Break
10:45-1:00 | Document Similarity and Clustering
- Introduction/Preparation
- Common NLP Tasks
- Word and Document Representation
- Text Processing
- Document Classification
- Text Processing
- TFIDF
- Evaluation
- Clustering
- Document Similarity
- Dimensionality Reduction
- Topic Modeling
- Visualization
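As a preview of the material in the outline above, here is a minimal sketch (not taken from the workshop notebook) of a scikit-learn pipeline covering TF-IDF representation, document classification, clustering, and topic modeling; the toy documents and labels are invented for illustration.

```python
# A minimal sketch (not from the workshop notebook) of the pipeline pieces
# named in the outline: TF-IDF representation, classification, clustering,
# and topic modeling with scikit-learn. Documents and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "text processing turns raw documents into feature vectors",
    "tfidf weights words by how distinctive they are in a document",
    "clustering groups similar documents without labels",
    "classification assigns a label to each document",
]
labels = [0, 0, 1, 1]  # made-up binary labels for illustration

# TF-IDF representation and a simple classifier
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# K-means clustering on the same TF-IDF vectors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Topic modeling (LDA) works on raw term counts rather than TF-IDF weights
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

print(clf.predict(X), kmeans.labels_, lda.components_.shape)
```

In the workshop, these steps are applied to the Reddit comment data described below.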
You will need the following programs to run the Jupyter notebook:
- Git
- Python
- Python modules (see requirements.txt)
Python modules can be installed using either Anaconda (recommended for beginners) or pip.
You will need to install Git. After installing Git, run the following command to clone the workshop repository:
git clone https://github.com/UCIDataScienceInitiative/NLP.git
You will need Python installed. If you do not already have Python installed, we recommend downloading Anaconda, which includes Python and the modules required for this workshop. The notebook will run with Python versions 3.5 and 2.7.
This is the recommended method of installation for users newer to Python, or those who have not used pip. Anaconda should include all required modules. After installing Anaconda, run
conda update --all
to update all modules. If the LatentDirichletAllocation class will not import, it may help to update scikit-learn by running
conda update scikit-learn
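To confirm the class imports after updating, a quick check (not part of the workshop materials) is:

```python
# Verify that scikit-learn is recent enough to provide LatentDirichletAllocation
import sklearn
from sklearn.decomposition import LatentDirichletAllocation

print("scikit-learn", sklearn.__version__, "- LatentDirichletAllocation OK")
```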
To run the notebook you will need the packages listed in requirements.txt. To install them, run
pip install -r requirements.txt
in the command line.
I was able to get the notebook running using Python 3.5 and 2.7 on both Mac and Windows machines. If you are having trouble installing any of the required software, please come to the workshop a few minutes early. Additionally, we will have scheduled setup time to address any problems.
The data are comments and metadata from two mental health subreddits, /r/SuicideWatch and /r/depression. The data were filtered from this dataset.
To unzip the data, run gunzip RC_2015-05.json.gz
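After unzipping, the file contains one JSON object per line. A rough loading sketch is below; it assumes the standard Reddit dump fields (subreddit, body), so adjust the field names if the filtered file you receive differs.

```python
import json

# Rough sketch: read the newline-delimited JSON dump and keep comments from
# the two workshop subreddits. Field names assume the standard Reddit schema.
texts = []
with open("RC_2015-05.json") as f:
    for line in f:
        comment = json.loads(line)
        if comment.get("subreddit") in ("SuicideWatch", "depression"):
            texts.append(comment.get("body", ""))

print("loaded", len(texts), "comments")
```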
If you would like to see the presentation in slide form rather than as a Jupyter notebook, you can run the following command:
jupyter nbconvert --to slides IntroNLP.ipynb --post serve