Add tokenization article and notebook used in Large Language Models (LLMs)
Showing 5 changed files with 612 additions and 0 deletions.
@@ -0,0 +1,94 @@
# :rocket: Portfolio

This repository includes my side projects on various applications of Data Science and Machine Learning.

Check the documentation [here](https://smortezah.github.io/portfolio/docs).

Also, my articles on the Medium platform can be found [here](https://medium.com/@morihosseini/).

**Note:** The following list is sorted alphabetically.
## :rotating_light: Anomaly detection

- [Credit card fraud detection](anomaly-detection/fraud-detection.ipynb): detecting fraudulent transactions in a dataset using neural networks

## :factory: Automation

- [Auto commit to GitHub](automation/auto-commit): automating the process of committing and pushing changes to GitHub

## :camera: Computer Vision

- [Ants vs bees image classification](computer-vision/ants-bees-classification/image-classification.ipynb): an app for classifying images using deep learning models

## :jigsaw: Data Structures

- [Hashing](data-structure/hashing.ipynb): an introduction to hashing, its applications, and its Python implementation
- [Sorting](data-structure/sorting-popular.ipynb): a guide to popular sorting algorithms in Python
## :mag: EDA (Exploratory Data Analysis)

- [Data balancing](eda/data-balancing.ipynb): balancing imbalanced datasets using different methods
- [Handling missing data](eda/missing-data.ipynb): handling missing data in a dataset using various methods
- [Polars](eda/polars.ipynb): using the [polars](https://www.pola.rs) library for data manipulation and analysis

## :hammer_and_wrench: ETL (Extract, Transform, Load)

- [ETL pipeline with Airflow and Docker](etl/airflow-docker): automating the extraction of data from various sources, transforming it, and loading the transformed data into a database

## :gear: Hyperparameter tuning

- [KerasTuner](hypertune/kerasTuner.ipynb): hyperparameter tuning using the [KerasTuner](https://keras.io/keras_tuner/) library
- [Optuna](hypertune/optuna.ipynb): hyperparameter tuning with the [Optuna](https://optuna.org/) library
## :brain: LLM (Large Language Model)

- [Tokenization](llm/tokenization.ipynb): exploring tokenization of text data

## :robot: Machine Learning

- [Best threshold for logistic regression](machine-learning/threshold-logistic-regression.ipynb): different methods to find the optimal threshold for logistic regression

## :lock: Privacy

- [Anonymization](privacy/anonymization.ipynb): an introduction to data anonymization and its applications
- [Encryption](privacy/encryption.ipynb): a beginner's guide to Python encryption

## :snake: Python

- [Argument parsing](python/argparse.ipynb): a guide to argument parsing with the `argparse` module
- [Generators](python/generator.ipynb): a hands-on guide to generators
- [Lambda](python/lambda.ipynb): an introduction to lambda functions
- [Pattern matching](python/match-case.ipynb): a guide to pattern matching with the `match-case` statement

## :chart_with_upwards_trend: Statistical analysis

- [A/B testing](stats/ab-test.ipynb): testing the effectiveness of a new feature in a web application via A/B testing
- [Hypothesis testing: p-values around 0.05](stats/pvalue-around-0.05.ipynb): should we reject the null hypothesis if the p-value is around 0.05?
## :bulb: Synthetic data generation

- [Introduction](synthetic-data/intro.ipynb): generating synthetic data using Python, along with considerations for using it

## :desktop_computer: Terminal

- [jq](terminal/jq.ipynb): manipulating JSON with [jq](https://jqlang.github.io/jq/)
- [Rich](terminal/rich/rich.ipynb): formatting text in the terminal using the [Rich](https://github.com/Textualize/rich) library

## :hourglass_flowing_sand: Time-series

- [Forecasting with sktime](time-series/sktime.ipynb): time-series forecasting using the [sktime](https://github.com/sktime/sktime) library
- [Prevent overfitting](time-series/prevent-overfitting.ipynb): preventing overfitting in time-series forecasting using different techniques

## :art: Visualization

- [lets-plot](visualization/lets-plot/codebook.ipynb): plotting with [lets-plot](https://lets-plot.org/index.html), a Python port of R's [ggplot2](https://ggplot2.tidyverse.org/) library
- [Pitfalls](visualization/pitfalls/pitfalls.ipynb): common pitfalls in data visualization and how to avoid them
- [QR code](visualization/qrcode.ipynb): generating QR codes

## :spider_web: Web scraping

- [jobinventory](scrape/jobinventory.com/tutorial.ipynb): scraping job listings from jobinventory.com using Python

## :memo: XAI (Explainable AI)

- [Introduction](xai/intro.ipynb): an introduction to explainable AI and its importance
@@ -0,0 +1,239 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Click [here]() to access the associated Medium article."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resolved 22 packages in 314ms\n",
      "Downloaded 1 package in 792ms\n",
      "Installed 3 packages in 219ms\n",
      " + numpy==1.26.4\n",
      " + safetensors==0.4.3\n",
      " + transformers==4.40.2\n"
     ]
    }
   ],
   "source": [
    "!uv pip install nltk tiktoken tokenizers sentencepiece transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic Tokenization Techniques\n",
    "\n",
    "## Sentence Tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tokenization is fascinating.\n",
      "Sentence tokenization splits text into sentences.\n",
      "It's crucial for NLP.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /Users/user/nltk_data...\n",
      "[nltk_data] Package punkt is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "from nltk.tokenize import sent_tokenize\n",
    "\n",
    "nltk.download(\"punkt\")\n",
    "\n",
    "text = \"Tokenization is fascinating. Sentence tokenization splits text into sentences. It's crucial for NLP.\"\n",
    "sentences = sent_tokenize(text)\n",
    "\n",
    "for sentence in sentences:\n",
    "    print(sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word Tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Word', 'tokenization', 'is', 'essential', 'for', 'NLP', 'tasks', '.']\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "sentence = \"Word tokenization is essential for NLP tasks.\"\n",
    "words = word_tokenize(sentence)\n",
    "\n",
    "print(words)"
   ]
  },
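  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal follow-up sketch (the sentence below is made up for illustration): `word_tokenize` splits contractions and punctuation into separate tokens, which is worth keeping in mind when counting words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "# Contractions split into parts, and punctuation becomes its own token\n",
    "print(word_tokenize(\"Don't worry, it'll work!\"))"
   ]
  },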
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Subword Tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[3214, 1178, 4037, 2065, 449, 426, 1777, 374, 8147, 13]\n"
     ]
    }
   ],
   "source": [
    "import tiktoken\n",
    "\n",
    "enc = tiktoken.get_encoding(\"cl100k_base\")\n",
    "\n",
    "text = \"Subword tokenization with BPE is powerful.\"\n",
    "encoded = enc.encode(text)\n",
    "decoded = enc.decode(encoded)\n",
    "\n",
    "# Round-trip check: decoding the ids must reproduce the input text\n",
    "assert decoded == text\n",
    "\n",
    "print(encoded)"
   ]
  },
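  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see where the subword boundaries fall, a small sketch (assuming `decode_single_token_bytes` is available in the installed `tiktoken` version): map each token id back to the raw bytes it stands for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tiktoken\n",
    "\n",
    "enc = tiktoken.get_encoding(\"cl100k_base\")\n",
    "\n",
    "# Print each token id next to the byte string it decodes to\n",
    "for token_id in enc.encode(\"Subword tokenization with BPE is powerful.\"):\n",
    "    print(token_id, enc.decode_single_token_bytes(token_id))"
   ]
  },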
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advanced Tokenization Methods\n",
    "\n",
    "## Byte-Level BPE (Byte-Pair Encoding)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['[CLS]', 'byte', '-', 'level', 'bp', '##e', 'is', 'fascinating', '.', '[SEP]']\n"
     ]
    }
   ],
   "source": [
    "from tokenizers import Tokenizer\n",
    "\n",
    "# Note: bert-base-uncased actually ships a WordPiece model (hence the \"##\"\n",
    "# continuation prefix in the tokens); a byte-level BPE sketch follows below.\n",
    "tokenizer = Tokenizer.from_pretrained(\"bert-base-uncased\")\n",
    "encoded = tokenizer.encode(\"Byte-Level BPE is fascinating.\")\n",
    "decoded = tokenizer.decode(encoded.ids)\n",
    "\n",
    "print(encoded.tokens)"
   ]
  },
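  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `bert-base-uncased` uses WordPiece rather than byte-level BPE, here is a minimal byte-level BPE sketch: training a tiny `ByteLevelBPETokenizer` on a toy in-memory corpus. The corpus and `vocab_size` are illustrative assumptions, not tuned values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tokenizers import ByteLevelBPETokenizer\n",
    "\n",
    "# Toy corpus; byte-level BPE starts from 256 byte tokens and learns merges,\n",
    "# so it can never produce an out-of-vocabulary token\n",
    "corpus = [\n",
    "    \"Byte-level BPE operates on raw bytes.\",\n",
    "    \"It never produces out-of-vocabulary tokens.\",\n",
    "    \"Tokenization is fascinating.\",\n",
    "]\n",
    "\n",
    "tokenizer = ByteLevelBPETokenizer()\n",
    "tokenizer.train_from_iterator(corpus, vocab_size=300, min_frequency=1)\n",
    "\n",
    "encoded = tokenizer.encode(\"Byte-Level BPE is fascinating.\")\n",
    "print(encoded.tokens)"
   ]
  },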
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tokenization in Pretrained LLMs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BERT tokens: ['token', '##ization', 'is', 'fascinating', '.']\n",
      "GPT-2 tokens: ['Token', 'ization', 'Ġis', 'Ġfascinating', '.']\n"
     ]
    }
   ],
   "source": [
    "from transformers import BertTokenizer, GPT2Tokenizer\n",
    "\n",
    "# Load the BERT and GPT-2 tokenizers\n",
    "bert_tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n",
    "gpt2_tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
    "\n",
    "# Tokenize a sentence\n",
    "sentence = \"Tokenization is fascinating.\"\n",
    "bert_tokens = bert_tokenizer.tokenize(sentence)\n",
    "gpt2_tokens = gpt2_tokenizer.tokenize(sentence)\n",
    "\n",
    "print(\"BERT tokens:\", bert_tokens)\n",
    "print(\"GPT-2 tokens:\", gpt2_tokens)"
   ]
  },
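  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A short follow-up sketch using standard `transformers` calls: invoking a tokenizer directly returns input ids, and for BERT this also inserts the special `[CLS]` and `[SEP]` tokens, while GPT-2 adds no special tokens to plain text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import BertTokenizer, GPT2Tokenizer\n",
    "\n",
    "bert_tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n",
    "gpt2_tokenizer = GPT2Tokenizer.from_pretrained(\"gpt2\")\n",
    "\n",
    "sentence = \"Tokenization is fascinating.\"\n",
    "\n",
    "# Full encoding: BERT wraps the ids in [CLS] ... [SEP]; GPT-2 does not\n",
    "bert_ids = bert_tokenizer(sentence)[\"input_ids\"]\n",
    "gpt2_ids = gpt2_tokenizer(sentence)[\"input_ids\"]\n",
    "\n",
    "print(\"BERT:\", bert_tokenizer.convert_ids_to_tokens(bert_ids))\n",
    "print(\"GPT-2:\", gpt2_tokenizer.convert_ids_to_tokens(gpt2_ids))"
   ]
  }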
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@@ -0,0 +1,7 @@
# Large Language Model (LLM)

```mdx-code-block
import DocCardList from '@theme/DocCardList';
<DocCardList />
```