
Commit

Add tokenization article and notebook used in Large Language Models (LLMs)
smortezah committed May 13, 2024
1 parent 0960dff commit 23961b0
Showing 5 changed files with 612 additions and 0 deletions.
94 changes: 94 additions & 0 deletions README copy.md
@@ -0,0 +1,94 @@
# :rocket: Portfolio

This repository includes my side projects on various applications of Data Science and Machine Learning.

Check the documentation [here](https://smortezah.github.io/portfolio/docs).

My articles on Medium can be found [here](https://medium.com/@morihosseini/).

**Note:** The following list is sorted alphabetically.

## :rotating_light: Anomaly detection

- [Credit card fraud detection](anomaly-detection/fraud-detection.ipynb): detecting fraudulent transactions in a dataset using neural networks

## :factory: Automation

- [Auto commit to GitHub](automation/auto-commit): automating the process of committing and pushing changes to GitHub

## :camera: Computer Vision

- [Ants vs bees image classification](computer-vision/ants-bees-classification/image-classification.ipynb): an app that classifies images using deep learning models

## 🧩 Data Structures

- [Hashing](data-structure/hashing.ipynb): an introduction to hashing, its applications, and Python implementation
- [Sorting](data-structure/sorting-popular.ipynb): a guide to popular sorting algorithms in Python

## :mag: EDA (Exploratory Data Analysis)

- [Data balancing](eda/data-balancing.ipynb): balancing imbalanced datasets using different methods
- [Handling missing data](eda/missing-data.ipynb): handling missing data in a dataset using various methods
- [Polars](eda/polars.ipynb): using [polars](https://www.pola.rs) library for data manipulation and analysis

## :hammer_and_wrench: ETL (Extract, Transform, Load)

- [ETL pipeline with Airflow and Docker](etl/airflow-docker): automating the extraction of data from various sources, transforming it, and loading the result into a database

## :gear: Hyperparameter tuning

- [KerasTuner](hypertune/kerasTuner.ipynb): hyperparameter tuning with the [KerasTuner](https://keras.io/keras_tuner/) library
- [Optuna](hypertune/optuna.ipynb): hyperparameter tuning with the [Optuna](https://optuna.org/) library

## :brain: LLM (Large Language Model)

- [Tokenization](llm/tokenization.ipynb): exploring tokenization of text data

## :robot: Machine Learning

- [Best threshold for logistic regression](machine-learning/threshold-logistic-regression.ipynb): different methods to find the optimal threshold for logistic regression

## :lock: Privacy

- [Anonymization](privacy/anonymization.ipynb): an introduction to data anonymization and its applications
- [Encryption](privacy/encryption.ipynb): a beginner's guide to Python encryption

## :snake: Python

- [Argument parsing](python/argparse.ipynb): a guide to argument parsing with the `argparse` module
- [Generators](python/generator.ipynb): a hands-on guide to generators
- [Lambda](python/lambda.ipynb): an introduction to lambda functions
- [Pattern matching](python/match-case.ipynb): a guide to pattern matching with the `match-case` statement

## :chart_with_upwards_trend: Statistical analysis

- [A/B testing](stats/ab-test.ipynb): evaluating the effectiveness of a new feature in a web application with an A/B test
- [Hypothesis testing: p-values around 0.05](stats/pvalue-around-0.05.ipynb): should we reject the null hypothesis if the p-value is around 0.05?

## :bulb: Synthetic data generation

- [Introduction](synthetic-data/intro.ipynb): generating synthetic data with Python, along with considerations for using it

## :desktop_computer: Terminal

- [jq](terminal/jq.ipynb): manipulating JSON with [jq](https://jqlang.github.io/jq/)
- [Rich](terminal/rich/rich.ipynb): formatting text in the terminal using the [Rich](https://github.com/Textualize/rich) library

## :hourglass_flowing_sand: Time-series

- [Forecasting with sktime](time-series/sktime.ipynb): time-series forecasting using the [sktime](https://github.com/sktime/sktime) library
- [Prevent overfitting](time-series/prevent-overfitting.ipynb): preventing overfitting in time series forecasting using different techniques

## :art: Visualization

- [lets-plot](visualization/lets-plot/codebook.ipynb): plotting with [lets-plot](https://lets-plot.org/index.html), a Python port of R's [ggplot2](https://ggplot2.tidyverse.org/) library
- [Pitfalls](visualization/pitfalls/pitfalls.ipynb): common pitfalls in data visualization and how to avoid them
- [QR code](visualization/qrcode.ipynb): generating QR codes

## :spider_web: Web scraping

- [jobinventory](scrape/jobinventory.com/tutorial.ipynb): scraping job listings from jobinventory.com using Python

## :memo: XAI (Explainable AI)

- [Introduction](xai/intro.ipynb): an introduction to explainable AI and its importance
239 changes: 239 additions & 0 deletions llm/tokenization.ipynb
@@ -0,0 +1,239 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click [here]() to access the associated Medium article."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[2K\u001b[2mResolved \u001b[1m22 packages\u001b[0m in 314ms\u001b[0m \u001b[0m\n",
"\u001b[2K\u001b[2mDownloaded \u001b[1m1 package\u001b[0m in 792ms\u001b[0m \u001b[0m\n",
"\u001b[2K\u001b[2mInstalled \u001b[1m3 packages\u001b[0m in 219ms\u001b[0m4.40.2 \u001b[0m\n",
" \u001b[32m+\u001b[39m \u001b[1mnumpy\u001b[0m\u001b[2m==1.26.4\u001b[0m\n",
" \u001b[32m+\u001b[39m \u001b[1msafetensors\u001b[0m\u001b[2m==0.4.3\u001b[0m\n",
" \u001b[32m+\u001b[39m \u001b[1mtransformers\u001b[0m\u001b[2m==4.40.2\u001b[0m\n"
]
}
],
"source": [
"!uv pip install nltk tiktoken tokenizers sentencepiece transformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic Tokenization Techniques\n",
"\n",
"## Sentence Tokenization"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenization is fascinating.\n",
"Sentence tokenization splits text into sentences.\n",
"It's crucial for NLP.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /Users/user/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"import nltk\n",
"from nltk.tokenize import sent_tokenize\n",
"\n",
"nltk.download(\"punkt\")\n",
"\n",
"text = \"Tokenization is fascinating. Sentence tokenization splits text into sentences. It's crucial for NLP.\"\n",
"sentences = sent_tokenize(text)\n",
"\n",
"for sentence in sentences:\n",
" print(sentence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word Tokenization"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Word', 'tokenization', 'is', 'essential', 'for', 'NLP', 'tasks', '.']\n"
]
}
],
"source": [
"import nltk\n",
"from nltk.tokenize import word_tokenize\n",
"\n",
"sentence = \"Word tokenization is essential for NLP tasks.\"\n",
"words = word_tokenize(sentence)\n",
"\n",
"print(words)"
]
},
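{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustrative sketch, `word_tokenize` also splits punctuation and contractions into separate tokens (assuming NLTK's default Treebank-style word tokenizer), e.g. *don't* becomes *do* + *n't*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: contractions and punctuation become separate tokens\n",
"# (assumes NLTK's default Treebank-style word tokenizer).\n",
"tricky = \"Don't panic: word tokenizers split contractions and punctuation!\"\n",
"print(word_tokenize(tricky))"
]
},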
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Subword Tokenization"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[3214, 1178, 4037, 2065, 449, 426, 1777, 374, 8147, 13]\n"
]
}
],
"source": [
"import tiktoken\n",
"\n",
"enc = tiktoken.get_encoding(\"cl100k_base\")\n",
"\n",
"text = \"Subword tokenization with BPE is powerful.\"\n",
"encoded = enc.encode(text)\n",
"decoded = enc.decode(encoded)\n",
"\n",
"assert decoded == text\n",
"\n",
"print(encoded)"
]
},
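{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of inspecting the subword pieces themselves: decoding each BPE token ID individually shows how the sentence was split (assumes `enc` is the `cl100k_base` encoding created above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: decode each BPE token ID on its own to see the subword pieces\n",
"# (assumes `enc` is the cl100k_base encoding created above).\n",
"pieces = [enc.decode([token_id]) for token_id in encoded]\n",
"print(pieces)"
]
},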
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Advanced Tokenization Methods\n",
"\n",
"## Byte-Level BPE (Byte-Pair Encoding)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['[CLS]', 'byte', '-', 'level', 'bp', '##e', 'is', 'fascinating', '.', '[SEP]']\n"
]
}
],
"source": [
"from tokenizers import Tokenizer\n",
"\n",
"tokenizer = Tokenizer.from_pretrained(\"bert-base-uncased\")\n",
"encoded = tokenizer.encode(\"Byte-Level BPE is fascinating.\")\n",
"decoded = tokenizer.decode(encoded.ids)\n",
"\n",
"print(encoded.tokens)"
]
},
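{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above loads `bert-base-uncased`, whose vocabulary is WordPiece. For a tokenizer that actually uses byte-level BPE, GPT-2 is a common example: text is first mapped to bytes, so even emoji and accented characters never fall out of vocabulary. A minimal sketch (assuming the `gpt2` checkpoint is downloadable from the Hugging Face Hub):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: byte-level BPE with GPT-2's tokenizer -- non-ASCII characters are\n",
"# represented via byte-level symbols, so there are no unknown tokens.\n",
"# (Assumes the \"gpt2\" checkpoint can be downloaded from the Hugging Face Hub.)\n",
"from transformers import GPT2Tokenizer\n",
"\n",
"gpt2_tok = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
"print(gpt2_tok.tokenize(\"Byte-level BPE handles café and emoji 🙂 too.\"))"
]
},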
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tokenization in Pretrained LLMs"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BERT tokens: ['token', '##ization', 'is', 'fascinating', '.']\n",
"GPT-2 tokens: ['Token', 'ization', 'Ġis', 'Ġfascinating', '.']\n"
]
}
],
"source": [
"from transformers import BertTokenizer, GPT2Tokenizer\n",
"\n",
"# Load BERT and GPT2 tokenizers\n",
"bert_tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n",
"gpt2_tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
"\n",
"# Tokenize a sentence\n",
"sentence = \"Tokenization is fascinating.\"\n",
"bert_tokens = bert_tokenizer.tokenize(sentence)\n",
"gpt2_tokens = gpt2_tokenizer.tokenize(sentence)\n",
"\n",
"print(\"BERT tokens:\", bert_tokens)\n",
"print(\"GPT-2 tokens:\", gpt2_tokens)"
]
}
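,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A follow-up sketch: the same pretrained tokenizers can take text all the way to input IDs and back, which is how text actually enters a model. BERT wraps the sequence in `[CLS]`/`[SEP]` special tokens, while GPT-2 adds none by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: full encode/decode round trip with the tokenizers loaded above.\n",
"# BERT inserts [CLS]/[SEP]; GPT-2 adds no special tokens by default.\n",
"bert_ids = bert_tokenizer.encode(sentence)\n",
"gpt2_ids = gpt2_tokenizer.encode(sentence)\n",
"\n",
"print(\"BERT IDs:\", bert_ids)\n",
"print(\"GPT-2 IDs:\", gpt2_ids)\n",
"print(\"BERT decoded:\", bert_tokenizer.decode(bert_ids))\n",
"print(\"GPT-2 decoded:\", gpt2_tokenizer.decode(gpt2_ids))"
]
}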
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Binary file added website/docs/llm/img/token-bpe.png
7 changes: 7 additions & 0 deletions website/docs/llm/index.md
@@ -0,0 +1,7 @@
# Large Language Model (LLM)

```mdx-code-block
import DocCardList from '@theme/DocCardList';
<DocCardList />
```