
Commit

Add tokenization article and notebook used in Large Language Models (LLMs)
smortezah committed May 13, 2024
1 parent 0960dff commit 23961b0
Showing 5 changed files with 612 additions and 0 deletions.
94 changes: 94 additions & 0 deletions README copy.md
@@ -0,0 +1,94 @@
# :rocket: Portfolio

This repository includes my side projects on various applications of Data Science and Machine Learning.

Check the documentation [here](https://smortezah.github.io/portfolio/docs).

My articles on Medium can be found [here](https://medium.com/@morihosseini/).

**Note:** The following list is sorted alphabetically.

## :rotating_light: Anomaly detection

- [Credit card fraud detection](anomaly-detection/fraud-detection.ipynb): detecting fraudulent transactions in a dataset using neural networks

## :factory: Automation

- [Auto commit to GitHub](automation/auto-commit): automating the process of committing and pushing changes to GitHub

## :camera: Computer Vision

- [Ants vs bees image classification](computer-vision/ants-bees-classification/image-classification.ipynb): an app that classifies images using deep learning models

## 🧩 Data Structures

- [Hashing](data-structure/hashing.ipynb): an introduction to hashing, its applications, and Python implementation
- [Sorting](data-structure/sorting-popular.ipynb): a guide to popular sorting algorithms in Python

## :mag: EDA (Exploratory Data Analysis)

- [Data balancing](eda/data-balancing.ipynb): balancing imbalanced datasets using different methods
- [Handling missing data](eda/missing-data.ipynb): handling missing data in a dataset using various methods
- [Polars](eda/polars.ipynb): using [polars](https://www.pola.rs) library for data manipulation and analysis

## :hammer_and_wrench: ETL (Extract, Transform, Load)

- [ETL pipeline with Airflow and Docker](etl/airflow-docker): automating the extraction of data from various sources, transforming it, and loading the result into a database

## :gear: Hyperparameter tuning

- [KerasTuner](hypertune/kerasTuner.ipynb): hyperparameter tuning with the [KerasTuner](https://keras.io/keras_tuner/) library
- [Optuna](hypertune/optuna.ipynb): hyperparameter tuning with the [Optuna](https://optuna.org/) library

## :brain: LLM (Large Language Model)

- [Tokenization](llm/tokenization.ipynb): exploring tokenization of text data

## :robot: Machine Learning

- [Best threshold for logistic regression](machine-learning/threshold-logistic-regression.ipynb): different methods to find the optimal threshold for logistic regression

## :lock: Privacy

- [Anonymization](privacy/anonymization.ipynb): an introduction to data anonymization and its applications
- [Encryption](privacy/encryption.ipynb): a beginner's guide to Python encryption

## :snake: Python

- [Argument parsing](python/argparse.ipynb): a guide to argument parsing with the `argparse` module
- [Generators](python/generator.ipynb): a hands-on guide to generators
- [Lambda](python/lambda.ipynb): an introduction to lambda functions
- [Pattern matching](python/match-case.ipynb): a guide to pattern matching with the `match-case` statement

## :chart_with_upwards_trend: Statistical analysis

- [A/B testing](stats/ab-test.ipynb): evaluating the effectiveness of a new feature in a web application with an A/B test
- [Hypothesis testing: p-values around 0.05](stats/pvalue-around-0.05.ipynb): should we reject the null hypothesis if the p-value is around 0.05?

## :bulb: Synthetic data generation

- [Introduction](synthetic-data/intro.ipynb): generating synthetic data with Python, along with considerations for using it

## :desktop_computer: Terminal

- [jq](terminal/jq.ipynb): manipulating JSON with [jq](https://jqlang.github.io/jq/)
- [Rich](terminal/rich/rich.ipynb): formatting text in the terminal using the [Rich](https://github.com/Textualize/rich) library

## :hourglass_flowing_sand: Time-series

- [Forecasting with sktime](time-series/sktime.ipynb): time-series forecasting using the [sktime](https://github.com/sktime/sktime) library
- [Prevent overfitting](time-series/prevent-overfitting.ipynb): preventing overfitting in time series forecasting using different techniques

## :art: Visualization

- [lets-plot](visualization/lets-plot/codebook.ipynb): plotting with [lets-plot](https://lets-plot.org/index.html), a Python port of R's [ggplot2](https://ggplot2.tidyverse.org/) library
- [Pitfalls](visualization/pitfalls/pitfalls.ipynb): common pitfalls in data visualization and how to avoid them
- [QR code](visualization/qrcode.ipynb): generating QR codes

## :spider_web: Web scraping

- [jobinventory](scrape/jobinventory.com/tutorial.ipynb): scraping job listings from jobinventory.com using Python

## :memo: XAI (Explainable AI)

- [Introduction](xai/intro.ipynb): an introduction to explainable AI and its importance
239 changes: 239 additions & 0 deletions llm/tokenization.ipynb
@@ -0,0 +1,239 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Click [here]() to access the associated Medium article."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[2K\u001b[2mResolved \u001b[1m22 packages\u001b[0m in 314ms\u001b[0m \u001b[0m\n",
"\u001b[2K\u001b[2mDownloaded \u001b[1m1 package\u001b[0m in 792ms\u001b[0m \u001b[0m\n",
"\u001b[2K\u001b[2mInstalled \u001b[1m3 packages\u001b[0m in 219ms\u001b[0m4.40.2 \u001b[0m\n",
" \u001b[32m+\u001b[39m \u001b[1mnumpy\u001b[0m\u001b[2m==1.26.4\u001b[0m\n",
" \u001b[32m+\u001b[39m \u001b[1msafetensors\u001b[0m\u001b[2m==0.4.3\u001b[0m\n",
" \u001b[32m+\u001b[39m \u001b[1mtransformers\u001b[0m\u001b[2m==4.40.2\u001b[0m\n"
]
}
],
"source": [
"!uv pip install nltk tiktoken tokenizers sentencepiece transformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Basic Tokenization Techniques\n",
"\n",
"## Sentence Tokenization"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Tokenization is fascinating.\n",
"Sentence tokenization splits text into sentences.\n",
"It's crucial for NLP.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /Users/user/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"import nltk\n",
"from nltk.tokenize import sent_tokenize\n",
"\n",
"nltk.download(\"punkt\")\n",
"\n",
"text = \"Tokenization is fascinating. Sentence tokenization splits text into sentences. It's crucial for NLP.\"\n",
"sentences = sent_tokenize(text)\n",
"\n",
"for sentence in sentences:\n",
" print(sentence)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word Tokenization"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Word', 'tokenization', 'is', 'essential', 'for', 'NLP', 'tasks', '.']\n"
]
}
],
"source": [
"import nltk\n",
"from nltk.tokenize import word_tokenize\n",
"\n",
"sentence = \"Word tokenization is essential for NLP tasks.\"\n",
"words = word_tokenize(sentence)\n",
"\n",
"print(words)"
]
},
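{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustrative sketch, `word_tokenize` also splits punctuation and contractions into separate tokens (assuming NLTK's default Treebank-style word tokenizer), e.g. *don't* becomes *do* + *n't*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: contractions and punctuation become separate tokens\n",
"# (assumes NLTK's default Treebank-style word tokenizer).\n",
"tricky = \"Don't panic: word tokenizers split contractions and punctuation!\"\n",
"print(word_tokenize(tricky))"
]
},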
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Subword Tokenization"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[3214, 1178, 4037, 2065, 449, 426, 1777, 374, 8147, 13]\n"
]
}
],
"source": [
"import tiktoken\n",
"\n",
"enc = tiktoken.get_encoding(\"cl100k_base\")\n",
"\n",
"text = \"Subword tokenization with BPE is powerful.\"\n",
"encoded = enc.encode(text)\n",
"decoded = enc.decode(encoded)\n",
"\n",
"assert decoded == text\n",
"\n",
"print(encoded)"
]
},
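{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of inspecting the subword pieces themselves: decoding each BPE token ID individually shows how the sentence was split (assumes `enc` is the `cl100k_base` encoding created above)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: decode each BPE token ID on its own to see the subword pieces\n",
"# (assumes `enc` is the cl100k_base encoding created above).\n",
"pieces = [enc.decode([token_id]) for token_id in encoded]\n",
"print(pieces)"
]
},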
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Advanced Tokenization Methods\n",
"\n",
"## Byte-Level BPE (Byte-Pair Encoding)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['[CLS]', 'byte', '-', 'level', 'bp', '##e', 'is', 'fascinating', '.', '[SEP]']\n"
]
}
],
"source": [
"from tokenizers import Tokenizer\n",
"\n",
"tokenizer = Tokenizer.from_pretrained(\"bert-base-uncased\")\n",
"encoded = tokenizer.encode(\"Byte-Level BPE is fascinating.\")\n",
"decoded = tokenizer.decode(encoded.ids)\n",
"\n",
"print(encoded.tokens)"
]
},
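{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above loads `bert-base-uncased`, whose vocabulary is WordPiece. For a tokenizer that actually uses byte-level BPE, GPT-2 is a common example: text is first mapped to bytes, so even emoji and accented characters never fall out of vocabulary. A minimal sketch (assuming the `gpt2` checkpoint is downloadable from the Hugging Face Hub):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: byte-level BPE with GPT-2's tokenizer -- non-ASCII characters are\n",
"# represented via byte-level symbols, so there are no unknown tokens.\n",
"# (Assumes the \"gpt2\" checkpoint can be downloaded from the Hugging Face Hub.)\n",
"from transformers import GPT2Tokenizer\n",
"\n",
"gpt2_tok = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
"print(gpt2_tok.tokenize(\"Byte-level BPE handles café and emoji 🙂 too.\"))"
]
},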
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tokenization in Pretrained LLMs"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BERT tokens: ['token', '##ization', 'is', 'fascinating', '.']\n",
"GPT-2 tokens: ['Token', 'ization', 'Ġis', 'Ġfascinating', '.']\n"
]
}
],
"source": [
"from transformers import BertTokenizer, GPT2Tokenizer\n",
"\n",
"# Load BERT and GPT2 tokenizers\n",
"bert_tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n",
"gpt2_tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
"\n",
"# Tokenize a sentence\n",
"sentence = \"Tokenization is fascinating.\"\n",
"bert_tokens = bert_tokenizer.tokenize(sentence)\n",
"gpt2_tokens = gpt2_tokenizer.tokenize(sentence)\n",
"\n",
"print(\"BERT tokens:\", bert_tokens)\n",
"print(\"GPT-2 tokens:\", gpt2_tokens)"
]
}
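,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A follow-up sketch: the same pretrained tokenizers can take text all the way to input IDs and back, which is how text actually enters a model. BERT wraps the sequence in `[CLS]`/`[SEP]` special tokens, while GPT-2 adds none by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: full encode/decode round trip with the tokenizers loaded above.\n",
"# BERT inserts [CLS]/[SEP]; GPT-2 adds no special tokens by default.\n",
"bert_ids = bert_tokenizer.encode(sentence)\n",
"gpt2_ids = gpt2_tokenizer.encode(sentence)\n",
"\n",
"print(\"BERT IDs:\", bert_ids)\n",
"print(\"GPT-2 IDs:\", gpt2_ids)\n",
"print(\"BERT decoded:\", bert_tokenizer.decode(bert_ids))\n",
"print(\"GPT-2 decoded:\", gpt2_tokenizer.decode(gpt2_ids))"
]
}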
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Binary file added website/docs/llm/img/token-bpe.png
7 changes: 7 additions & 0 deletions website/docs/llm/index.md
@@ -0,0 +1,7 @@
# Large Language Model (LLM)

```mdx-code-block
import DocCardList from '@theme/DocCardList';
<DocCardList />
```