An Introduction to Modern NLP Techniques

Welcome to this workshop on Natural Language Processing (NLP) and Large Language Models (LLMs). The primary goal of this workshop is to provide a solid introduction to modern NLP techniques and demonstrate how these methods can be applied in research and business settings.

During the workshop, we will cover the following topics:

  1. Binary Classification of Tweets: We’ll start by building a simple binary classifier for Twitter data. This will serve as an introduction to basic NLP concepts and techniques.

  2. Few-Shot Classification and Data Curation: Next, we’ll move on to more challenging cases that were not easily solvable until autumn 2022. We’ll explore few-shot classification, which relies on only a few labeled examples per class. We’ll also look at labeling tools used for data curation.

  3. LLMs for Data Labeling: In this section, we’ll examine how LLMs can be used to label data in the zero-shot setting. This technique is particularly useful when no labeled examples are available.

  4. Topic Modeling with LLMs: We will then combine LLMs with modern topic modeling using BERTopic. In this approach, an LLM acts as a domain expert and interprets the topics we identify.

  5. Retrieval-Augmented Generation (RAG) with LangChain: Next, we’ll focus on RAG using LangChain. We will try out two approaches: one that retrieves data from a vector store based on semantic similarity and another more advanced approach that combines semantic retrieval with metadata filtering.

  6. LLMs for Knowledge Extraction and Data Structuring: Finally, we’ll look into the use of LLMs for knowledge extraction and for turning unstructured text into structured data. We will examine a use case involving long-form text, and we will see how to generate synthetic datasets and fine-tune relatively small language models using instruction-based techniques.
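
To make the two retrieval approaches from topic 5 concrete before the session, here is a minimal sketch in plain Python. It does not use LangChain or a real vector store: the documents, the two-dimensional "embeddings", and the helper names `cosine` and `retrieve` are all invented for illustration, and a real setup would obtain its vectors from an embedding model.

```python
import math

# Toy stand-in for a vector store. The vectors and metadata are made up;
# in practice the vectors would come from an embedding model.
DOCS = [
    {"text": "Revenue report 2021", "vec": [1.0, 0.0], "meta": {"year": 2021}},
    {"text": "Revenue report 2023", "vec": [0.9, 0.1], "meta": {"year": 2023}},
    {"text": "Hiring plan 2023", "vec": [0.0, 1.0], "meta": {"year": 2023}},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, docs, k=1, where=None):
    """Return the texts of the k documents most similar to query_vec.

    If `where` is given, only documents whose metadata matches every
    key/value pair in it are considered (metadata filtering).
    """
    candidates = [
        d for d in docs
        if where is None
        or all(d["meta"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["text"] for d in ranked[:k]]


query = [1.0, 0.0]  # hypothetical embedding of a question about revenue

# Approach 1: semantic similarity only.
semantic_only = retrieve(query, DOCS, k=1)

# Approach 2: semantic similarity restricted by a metadata filter.
filtered = retrieve(query, DOCS, k=1, where={"year": 2023})
```

With similarity alone, the closest match is the 2021 report. Adding the `year` filter narrows the candidates first, so the 2023 report is returned instead. The workshop session covers how this idea is expressed with LangChain and an actual vector store.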

By the end of this workshop, you should have gained valuable insights into modern NLP techniques and their practical applications in research and business settings.