GitHub - ovaisq/LLM-RedditScraper: Reddit Data Scraper + GenAI responses to Posts and Comments

ROLLAMA-GPT

General Overview

%%{init: {'theme': 'base', "loglevel":1,'themeVariables': {'lineColor': 'Blue', "fontFamily": "Trebuchet MS"}}}%%
flowchart TD
    style Reddit fill:#fff
    style RR fill:#a7e0f2,stroke:#13821a,stroke-width:4px
    style LocalEnv fill:#a7e0f2
    style PSQL fill:#aaa
    style REDIS fill:#aaa

    classDef subgraph_padding fill:none,stroke:none
    Rr[Redditor]
    RR(("Reddit Scraper
    API Service"))


    Rr ==> Reddit ===> RR ===> ORD
    ORD ===> OLLAMA ===> PRD ====> PSQL
    ORD ==> PSQL
    PRD =="Analyzed Post/Comment IDs"==> REDIS

    subgraph Reddit["Reddit Website"]
        subgraph blank2[ ]
            RP(Posts)
            RC(Comments)
            RA(Subreddits)
        end
    end

    subgraph LocalEnv["`**Local Environment**`"]
        subgraph blank[ ]
            direction TB
            REDIS[(Redis)]
            PSQL[("`**PostgreSQL**`")]
            subgraph ORD["Original Reddit Content"]
                subgraph blank3[ ]
                    ORP(Posts)
                    ORC(Comments)
                    ORA(Subreddits)
                    ORB(Subscriptions)
                end
            end
            subgraph OLLAMA
                subgraph blank4[ ]
                    LLM1[llama3.2]
                    LLM2[llama3.1]
                    LLM3[gemma2]
                end
            end
            subgraph PRD["Processed Reddit Data"]
                subgraph blank5[ ]
                    subgraph JSON
                        subgraph blank6[ ]
                            JAC["GPT Response to Comments"]
                            JAP["GPT Response to Posts"]
                        end
                    end
                end
            end
        end
    end
    class blank subgraph_padding
    class blank2 subgraph_padding
    class blank3 subgraph_padding
    class blank4 subgraph_padding
    class blank5 subgraph_padding
    class blank6 subgraph_padding

Environment Configuration

%%{init: {'theme': 'base','themeVariables': {'lineColor': 'Blue', 'primaryColor':'#acbdda','tertiaryColor': '#436092','fontSize':'38==28px'}}}%%
graph TD;
  style R fill:#a7e0f2,stroke:#13821a,stroke-width:2px

R(("`**Reddit Scraper API
    Dockerized**`"))
  subgraph Load_Balancer["Load Balancer"]
    Nginx[haproxy]
  end

  subgraph Nodes["Nodes"]
    Node1[MacOS Sequoia 15.1 beta
    MacBook Pro M1 Max
    32GB RAM]
    Node2["Debian 12 ESXi VM
    2 x NVIDIA RTX 3060
    12GB VRAM ea."]
    Node3[Debian 12 ESXi VM
    2 x NVIDIA RTX 4060 Ti
    16GB VRAM ea.
    ]
  end

  Nginx -- RR --> Node1
  Nginx -- RR --> Node2
  Nginx -- RR --> Node3
  R --> Nginx

API Overview

%%{init: {'theme': 'base', 'themeVariables': {'lineColor': 'Blue'}}}%%
graph LR
    sub["/login"] --> sub3
    sub["/analyze_post"] --> sub1
    sub["/analyze_posts"] --> sub2
    sub["/analyze_comment"] --> sub9
    sub["/analyze_comments"] --> sub10
    sub["/get_sub_post"] --> sub4
    sub["/get_sub_posts"] --> sub5
    sub["/get_author_comments"] --> sub6
    sub["/get_authors_comments"] --> sub7
    sub["/get_and_analyze_post"] --> sub12
    sub["/get_and_analyze_comment"] --> sub14
    sub["/join_new_subs"] --> sub11
    sub["CLIENT"] --> sub11
    sub1["GET: Analyze a single Reddit post"]
    sub2["GET: Analyze all Reddit posts in the database"]
    sub3["POST: Generate JWT"]
    sub4["GET: Get submission post content for a given post id"]
    sub5["GET: Get submission posts for a given subreddit"]
    sub6["GET: Get all comments for a given redditor"]
    sub7["GET: Get all comments for each redditor from a list of authors in the database"]
    sub9["GET: Chat prompt a given comment_id that's stored in DB"]
    sub10["GET: Chat prompt all comments that are stored in DB"]
    sub11["GET: Join all new subs from the post table in the database"]
    sub12["GET: Fetch post from Reddit, then Chat prompt a given post_id"]
    sub14["GET: Fetch comment from Reddit, then Chat prompt a given comment_id"]

From Reddit:

The service collects:
- submissions
- comments for each submission
- the author of each submission
- the author of each comment on each submission
- all comments for each author

It subscribes to the subreddit that each submission was posted to.
The title and content of each post, and the content of each comment, are sent as prompts to the LLMs (llama3.1, gemma2, and internlm2). The responses, along with metadata, are stored in PostgreSQL.
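
As a rough illustration of the prompting step, the sketch below builds one request body per model for Ollama's `/api/generate` endpoint. The helper name `build_prompts`, the prompt wording, and the hard-coded model list are assumptions made for illustration; the service reads its model list from setup.config (`LLMS=`) and may phrase its prompts differently.

```python
import json

# Hypothetical model list, copied from the description above.
MODELS = ["llama3.1", "gemma2", "internlm2"]


def build_prompts(title: str, body: str, models=MODELS) -> list[dict]:
    """Return one Ollama /api/generate request body per model."""
    # Assumed prompt wording, not taken from the service's source.
    prompt = f"Respond to this Reddit post.\nTitle: {title}\n\n{body}"
    # stream=False asks Ollama for a single JSON reply instead of a stream.
    return [{"model": m, "prompt": prompt, "stream": False} for m in models]


if __name__ == "__main__":
    for payload in build_prompts("Example title", "Example post body"):
        print(json.dumps(payload))
```

Each dictionary would be sent as the JSON body of a POST to the URL configured as `OLLAMA_API_URL`.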

Requirements

PostgreSQL v15 or greater
Redis v20 or greater
Ollama 0.3.12 or greater

Build

> pip3 install -r requirements.txt --quiet
> ./build.sh
Creating directory: builds/0.1.65
Building rollama-0.1.65.tar
Compressing rollama-0.1.65.tar
rollama-0.1.65.tar.gz Done
Build Info
builds/0.1.65
├── BUILD_INFO.TXT
├── Dockerfile
├── build_docker.py
├── docker_install_srvc.sh
├── install_srvc.sh
├── rollama-0.1.65.tar.gz
├── setup.config.template
└── ver.txt

1 directory, 8 files
SERVICE=rollama
VERSION=0.1.65
PACKAGE=rollama-0.1.65.tar.gz
PKGSHA256=cd5fa64d87afe1238e1ce65dc9212ccfb070071e370d026b453b52967c0896b5
SRVC_DIR=/usr/local/rollama/

Copy the contents of the builds/x.x.x directory over to the target machine

scp -r builds/0.1.65 <remote host>:/var/tmp/
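
After copying, it can be worth confirming that the package still matches the PKGSHA256 value from the build output. The sketch below assumes that BUILD_INFO.TXT holds the KEY=VALUE lines shown above (PACKAGE, PKGSHA256, and so on); the function names are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_build_info(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines (assumed layout of BUILD_INFO.TXT)."""
    info = {}
    for line in path.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' sign
            info[key.strip()] = value.strip()
    return info


def package_is_intact(build_dir: Path) -> bool:
    """Compare the package's digest with the recorded PKGSHA256."""
    info = read_build_info(build_dir / "BUILD_INFO.TXT")
    return sha256_of(build_dir / info["PACKAGE"]) == info["PKGSHA256"]
```

On the target machine this would be called as `package_is_intact(Path("/var/tmp/0.1.65"))`.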

Update setup.config with secrets and service keys
Where applicable you can generate strong secrets using openssl

> openssl rand -base64 28
pjo2OaLXlTHXZj4jtOa+3b4JEUqcmKz7C8IJJg=

For example, in setup.config:

[service]
...
APP_SECRET_KEY=pjo2OaLXlTHXZj4jtOa+3b4JEUqcmKz7C8IJJg=
...
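
If openssl is not at hand, a comparable secret can be produced with Python's standard library. This is a minimal sketch, not something shipped with the project; `make_secret` is a hypothetical helper name.

```python
import base64
import secrets


def make_secret(num_bytes: int = 28) -> str:
    """Return num_bytes of OS-provided randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Prints a line that can be pasted into the [service] section.
    print(f"APP_SECRET_KEY={make_secret()}")
```

The default of 28 bytes mirrors the `openssl rand -base64 28` call above.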

Docker build (tested only on Debian 12 for now)

> ./build_docker.py
Building Docker image rollama:0.1.65 from /var/tmp/0.1.65...
sha256:275958fcd3a7a049cc465fbe556802ba40d8cf9fff58ffd4da0593b85d5dca1a
Docker image rollama:0.1.65 built successfully!

Run docker container

> docker run -d -p 5001:5001 rollama:0.1.65

OR

Deploy it on a Local Kubernetes cluster

This assumes that the local Docker registry serves over HTTP (not SSL) and that the Kubernetes cluster is deployed using containerd. The following steps are required:
- Set up a local Docker image registry.
- Because Kubernetes defaults to HTTPS when pulling images, manually add the registry URL to containerd on each node. Otherwise, the service will fail to deploy.

Set up local Docker Image Repository

> docker run -d -p 5000:5000 --restart=always --name registry registry:2

Manually add registry to containerd on each node

Assuming that the locally hosted registry runs on a host named "docker", add the following configuration under [plugins."io.containerd.grpc.v1.cri".registry.mirrors] in the /etc/containerd/config.toml file:

> sudo su -
> cd /etc/containerd/
> vi config.toml

Configuration block

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker:5000"]
          endpoint = ["http://docker:5000"]

Updated config should look as follows:

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker:5000"]
          endpoint = ["http://docker:5000"]

Apply configuration change

> systemctl daemon-reload

Restart Service

> systemctl restart containerd

Deploy Service to Cluster

> kubectl apply -f deployment.yaml
> kubectl apply -f service.yaml

OR

SYSTEMD install on a Debian 12 host

For now, you must be logged in as root. On the target machine, run:

> ./install_srvc.sh

Install Pkg
Group rollama already exists.
User rollama already exists.
Creating directory: /usr/local/rollama/
tar xfz ./rollama-0.1.47.tar.gz -C /usr/local/rollama/ 2> /dev/null
Setting up Service
Creating directory: /etc/rollama/
Created symlink /etc/systemd/system/multi-user.target.wants/rollama.service → /etc/systemd/system/rollama.service.

Remember to edit and update /etc/rollama/setup.config
May want to also create SSL cert/key - copy the cert/key in the /usr/local/rollama/ directory
Start Service

> systemctl status rollama
○ rollama.service - Rollama-GPT
     Loaded: loaded (/etc/systemd/system/rollama.service; enabled; preset: enabled)
     Active: inactive (dead)

> systemctl start rollama
> systemctl status rollama
● rollama.service - Rollama-GPT
     Loaded: loaded (/etc/systemd/system/rollama.service; enabled; preset: enabled)
     Active: active (running) since Thu 2024-05-02 11:22:48 PDT; 2s ago
   Main PID: 2644 (bash)
      Tasks: 4 (limit: 4649)
     Memory: 87.9M
        CPU: 534ms
     CGroup: /system.slice/rollama.service
             ├─2644 bash /usr/local/rollama/run_srvc.sh
             ├─2645 /usr/bin/python3 /usr/local/bin/gunicorn rollama:app --bind 0.0.0.0:5001 --timeout 2592000 --workers 2 --log-level info
             ├─2646 /usr/bin/python3 /usr/local/bin/gunicorn rollama:app --bind 0.0.0.0:5001 --timeout 2592000 --workers 2 --log-level info
             └─2647 /usr/bin/python3 /usr/local/bin/gunicorn rollama:app --bind 0.0.0.0:5001 --timeout 2592000 --workers 2 --log-level info

May 02 11:22:48 debian systemd[1]: Started rollama.service - Rollama-GPT.
May 02 11:22:48 debian run_srvc.sh[2644]: SSL CERT/KEY cert.pem and key.pem not found. Running unsecured HTTP
May 02 11:22:48 debian run_srvc.sh[2645]: [2024-05-02 11:22:48 -0700] [2645] [INFO] Starting gunicorn 22.0.0
May 02 11:22:48 debian run_srvc.sh[2645]: [2024-05-02 11:22:48 -0700] [2645] [INFO] Listening at: http://0.0.0.0:5001 (2645)
May 02 11:22:48 debian run_srvc.sh[2645]: [2024-05-02 11:22:48 -0700] [2645] [INFO] Using worker: sync
May 02 11:22:48 debian run_srvc.sh[2646]: [2024-05-02 11:22:48 -0700] [2646] [INFO] Booting worker with pid: 2646
May 02 11:22:48 debian run_srvc.sh[2647]: [2024-05-02 11:22:48 -0700] [2647] [INFO] Booting worker with pid: 2647

Deployed as WSGI

Uses Gunicorn WSGI

How-to Run this

Install Python Modules:

> pip3 install -r requirements.txt

Get Reddit API key: https://www.reddit.com/wiki/api/

Generate an SSL key/cert for a secure connection to the service:

> openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 3650

Generate a symmetric encryption key for encrypting any text:

> ./tools/generate_keys.py
Encrption Key File text_encryption.key created

Create Database and tables: See reddit.sql

Install Ollama-gpt

Linux

https://github.com/ollama/ollama/blob/main/docs/linux.md
Sample Debian Service config file: /etc/systemd/system/ollama.service

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
..
..
..

MacOS

I installed Ollama-gpt on my MacMini M1 - using brew

> brew install ollama

Start/Stop Service

> brew services start ollama
> brew services stop ollama

Bind Ollama server to local IPv4 address

Create a run shell script

> cd /opt/homebrew/opt/ollama/bin

Create a script named ollama.sh and add the following

#!/usr/bin/env bash
export OLLAMA_HOST=0.0.0.0
/opt/homebrew/bin/ollama $1

Make script "executable"

chmod +x ollama.sh

Edit .plist file for the ollama homebrew service

  > cd /opt/homebrew/Cellar/ollama
  > cd 0.1.24 #this may be different for your system
  > vi homebrew.mxcl.ollama.plist

Change original line

/opt/homebrew/opt/ollama/bin/ollama

to this:

/opt/homebrew/opt/ollama/bin/ollama.sh

Save the file, then stop and start the service

> brew services stop ollama && brew services start ollama

Add the following models to ollama-gpt: deepseek-llm, llama3, llama-pro, gemma

> for llm in deepseek-llm llama3 llama-pro gemma
  do
      ollama pull ${llm}
  done

Update setup.config with pertinent information (see setup.config.template)

# update with required information and save it as
#	setup.config file
[psqldb]
host=
port=5432
database=
user=
password=

[redis]
redis_host=
redis_port=
redis_password=

[reddit]
client_id=
client_secret=
username=
rpassword=
user_agent=

[service]
APP_SECRET_KEY=
CSRF_PROTECTION_KEY=
DJANGO_SECRET_KEY=
ENCRYPTION_KEY=
ENDPOINT_URL=
IDENTITY=
JWT_SECRET_KEY=
LLMS=
OLLAMA_API_URL=
PROC_WORKERS=
SRVC_SHARED_SECRET=
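
Before starting the service, it may help to check that setup.config has no blank entries. The sketch below uses `configparser` from the standard library; the `REQUIRED` table is an assumption about which keys matter and should be adjusted to the deployment, and `missing_settings` is a hypothetical helper.

```python
import configparser

# Assumed set of mandatory keys; not taken from the service's source.
REQUIRED = {
    "psqldb": ["host", "port", "database", "user", "password"],
    "redis": ["redis_host", "redis_port"],
    "reddit": ["client_id", "client_secret", "username", "rpassword", "user_agent"],
    "service": ["APP_SECRET_KEY", "JWT_SECRET_KEY", "OLLAMA_API_URL", "LLMS"],
}


def missing_settings(text: str) -> list[str]:
    """Return 'section.key' names that are absent or left empty."""
    # interpolation=None keeps '%' characters in secrets from being
    # treated as interpolation markers.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            # fallback="" also covers a missing section or key.
            if not parser.get(section, key, fallback="").strip():
                missing.append(f"{section}.{key}")
    return missing


if __name__ == "__main__":
    with open("setup.config", encoding="utf-8") as fh:
        for name in missing_settings(fh.read()):
            print(f"missing: {name}")
```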

Run Rollama-GPT Service: (see https://docs.gunicorn.org/en/stable/settings.html for config details)

    > gunicorn --certfile=cert.pem \
               --keyfile=key.pem \
               --bind 0.0.0.0:5000 \
               rollama:app \
               --timeout 2592000 \
               --threads 4 \
               --reload

Customize it to your heart's content!
LICENSE: The 3-Clause BSD License - license.txt
TODO:
- Add Swagger Docs
- Add long running task queue
  - Queue: task_id, task_status, end_point
- Revisit endpoint logic and add robust error handling
- Add scheduler app - to schedule some of these events
  - scheduler checks whether or not a similar task exists
- Add logic to handle list of lists with NUM_ELEMENTS_CHUNK elements
  - retry after 429
  - break down longer list of items into list of lists with small chunks
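
The last TODO item could be approached along these lines. This is only a sketch of the idea: the value of `NUM_ELEMENTS_CHUNK`, the `RateLimited` exception, and the function names are all assumptions, since the TODO does not specify them.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
NUM_ELEMENTS_CHUNK = 25  # assumed value; the TODO does not give one


def chunked(items: Sequence[T], size: int = NUM_ELEMENTS_CHUNK) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


class RateLimited(Exception):
    """Hypothetical signal that the upstream API answered HTTP 429."""


def process_with_retry(
    chunk: Sequence[T],
    handler: Callable[[Sequence[T]], object],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call handler(chunk), backing off exponentially after each 429."""
    for attempt in range(max_attempts):
        try:
            return handler(chunk)
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)
```

A caller would loop over `chunked(all_ids)` and pass each chunk to `process_with_retry` together with whatever function performs the Reddit request.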

Example

These examples assume that the environment variable API_KEY is set to a valid API key.

Get All comments for all Redditors in the database

> export API_KEY=<api_key>
>
> export AT=$(curl -sk -X POST -H "Content-Type: application/json" -d '{"api_key":"'${API_KEY}'"}' \
  https://127.0.0.1:5000/login | jq -r .access_token) && curl -sk -X GET -H \
  "Authorization: Bearer ${AT}" 'https://127.0.0.1:5000/get_authors_comments'

On Service Console:

    INFO:root:Getting comments for Redditor
    INFO:root:Redditor 916 new comments
    INFO:root:Processing Author Redditor
    INFO:root:Processing Author Redditor
    INFO:root:Processing Author Redditor
    INFO:root:Processing Author Redditor

Analyze a Post using Post ID that already exists in a post table in the database

> export AT=$(curl -sk -X POST -H "Content-Type: application/json" -d '{"api_key":"'${API_KEY}'"}' \
  https://127.0.0.1:5001/login | jq -r .access_token) && curl -sk -X GET -H \
  "Authorization: Bearer ${AT}" 'https://127.0.0.1:5001/analyze_post?post_id=<Reddit Post ID>'

Get and Analyze a Post using Post ID that has not yet been added to the post table in the database

> export AT=$(curl -sk -X POST -H "Content-Type: application/json" -d '{"api_key":"'${API_KEY}'"}' \
  https://127.0.0.1:5001/login | jq -r .access_token) && curl -sk -X GET -H \
  "Authorization: Bearer ${AT}" 'https://127.0.0.1:5001/get_and_analyze_post?post_id=<Reddit Post ID>'

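The same login-then-call pattern can be written in Python with only the standard library. This is a sketch rather than a client shipped with the project: the helper names are hypothetical, while the `/login` path, the `api_key` request field, the `access_token` response field, and the Bearer header come from the curl examples above.

```python
import json
import ssl
import urllib.parse
import urllib.request


def login_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the POST /login request that exchanges an API key for a JWT."""
    body = json.dumps({"api_key": api_key}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def authed_get(base_url: str, path: str, token: str, params=None) -> urllib.request.Request:
    """Build a GET request that carries the JWT as a Bearer token."""
    query = f"?{urllib.parse.urlencode(params)}" if params else ""
    return urllib.request.Request(
        f"{base_url}{path}{query}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def insecure_context() -> ssl.SSLContext:
    """Mirror curl's -k flag: skip certificate checks (self-signed certs only)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch_json(request: urllib.request.Request, context=None):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, context=context) as resp:
        return json.load(resp)
```

A typical sequence would be `token = fetch_json(login_request(base, key), insecure_context())["access_token"]`, followed by `fetch_json(authed_get(base, "/analyze_post", token, {"post_id": post_id}), insecure_context())`.
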
General Workflow

flowchart TD
    A[Start] --> B[Read Configuration]
    B --> C[Connect to PostgreSQL]
    C --> D[Get new post IDs]
    D --> E{Any new post IDs?}
    E -- Yes --> F[Analyze Posts]
    F --> G[Get Post Details]
    G --> H[Process Author Information]
    H --> I[Get Post Comments]
    I --> J[Get Comment Details]
    J --> K[Insert Comment Data into Database]
    I --> L{More Comments?}
    L -- Yes --> I
    L -- No --> M[Sleep to Avoid Rate Limit]
    E -- No --> N[Sleep to Avoid Rate Limit]
    N --> D
    M --> D
    N --> O[Get New Subreddits]
    O --> P{Any new subreddits?}
    P -- Yes --> Q[Join New Subreddits]
    Q --> O
    P -- No --> R[End]

pg_cron settings for scheduling function triggers

See rollama.sql for PGSQL functions and pg_cron installation

Schedule pg_cron jobs

rollama=> SELECT cron.schedule('*/5 * * * *', 'select schedule_update()');
SELECT cron.schedule('*/10 * * * *', $$delete from comments where comment_author = 'AutoModerator'$$);
SELECT cron.schedule('*/10 * * * *', $$delete from posts where post_author = 'AutoModerator'$$);

Confirm that pg_cron jobs are loaded

rollama=> select * from cron.job;
 jobid |   schedule   |                           command                           | nodename  | nodeport | database | username | active | jobname
-------+--------------+-------------------------------------------------------------+-----------+----------+----------+----------+--------+---------
    16 | */5 * * * *  | select schedule_update()                                    | localhost |     5432 | rollama  | rollama  | t      |
    24 | */10 * * * * | delete from posts where post_author = 'AutoModerator'       | localhost |     5432 | rollama  | rollama  | t      |
    25 | */10 * * * * | delete from comments where comment_author = 'AutoModerator' | localhost |     5432 | rollama  | rollama  | t      |

Unschedule pg_cron jobs

rollama=> SELECT cron.unschedule(<jobid from cron.job table>);

Confirm pg_cron job runs

rollama=> select * from cron.job_run_details order by runid desc limit 5;
 jobid | runid | job_pid | database | username |                                                        command                                                         |  status   |                return_message                |          start_time           |           end_time
-------+-------+---------+----------+----------+------------------------------------------------------------------------------------------------------------------------+-----------+----------------------------------------------+-------------------------------+-------------------------------
    25 |   948 |  286605 | rollama  | rollama  | delete from comments where comment_author = 'AutoModerator'                                                            | succeeded | DELETE 2                                     | 2024-07-31 13:00:00.004014-07 | 2024-07-31 13:00:00.191934-07
    16 |   947 |  286604 | rollama  | rollama  | select schedule_update()                                                                                               | succeeded | SELECT 1                                     | 2024-07-31 13:00:00.002889-07 | 2024-07-31 13:00:00.338836-07
    24 |   946 |  286603 | rollama  | rollama  | delete from posts where post_author = 'AutoModerator'                                                                  | succeeded | DELETE 0                                     | 2024-07-31 13:00:00.002287-07 | 2024-07-31 13:00:00.005317-07
