This project demonstrates a stock market data pipeline built with Apache Kafka, AWS S3, AWS Glue, and Amazon Athena. Stock market data is streamed from a CSV file to a Kafka cluster hosted on an EC2 instance; a Kafka consumer then pushes the data to an S3 bucket, where it is cataloged with AWS Glue and queried with Amazon Athena.
Before you begin, ensure you have the following tools and services:
- Amazon EC2 instance running (preferably Amazon Linux 2)
- Java 8 (Amazon Corretto) installed on your EC2 instance
- Kafka 3.9.0 installed on your EC2 instance
- Python 3.x with the libraries needed for the Kafka producer and consumer scripts (an example install command follows this list)
- AWS CLI configured with appropriate credentials to interact with S3, Glue, and Athena
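Based on the example code later in this guide, the producer relies on kafka-python and pandas; if the consumer uploads to S3 from Python, boto3 is also needed. The exact library set is an assumption, so adjust it to match your scripts:
pip install kafka-python pandas boto3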
- Download Kafka 3.9.0:
wget https://downloads.apache.org/kafka/3.9.0/kafka_2.12-3.9.0.tgz
- Extract the downloaded file and change into the Kafka directory (the remaining commands assume you run them from here):
tar -xvf kafka_2.12-3.9.0.tgz
cd kafka_2.12-3.9.0
Apache Kafka requires Java to run. Install Java 8 using the following command:
sudo yum install -y https://corretto.aws/downloads/latest/amazon-corretto-8-x64-linux-jdk.rpm
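You can confirm the installation before continuing:
java -version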
Kafka depends on Zookeeper to manage broker metadata. Start Zookeeper with the following command:
bin/zookeeper-server-start.sh config/zookeeper.properties
You need to configure the Kafka server to use your EC2 instance's public IP address.
- Open the server.properties configuration file:
sudo nano config/server.properties
- Set advertised.listeners to your EC2 instance's public IP so external clients can reach the broker, and bind listeners to all interfaces:
advertised.listeners=PLAINTEXT://<your_public_ip>:9092
listeners=PLAINTEXT://0.0.0.0:9092
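With the listener settings saved, start the Kafka broker itself (in a separate terminal from Zookeeper, from the Kafka directory):
bin/kafka-server-start.sh config/server.properties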
Create a Kafka topic for the stock market data. You can use the following command:
bin/kafka-topics.sh --create --topic test_topic --bootstrap-server <your_public_ip>:9092 --replication-factor 1 --partitions 1
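You can optionally verify that the topic was created:
bin/kafka-topics.sh --describe --topic test_topic --bootstrap-server <your_public_ip>:9092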
Start the Kafka producer, which will send stock market data to test_topic:
bin/kafka-console-producer.sh --topic test_topic --bootstrap-server <your_public_ip>:9092
Start the Kafka consumer, which reads from test_topic. In the full pipeline the consumer side forwards this data to an S3 bucket (a consumer script sketch appears after the producer example below); the console consumer here is useful for checking that messages arrive:
bin/kafka-console-consumer.sh --topic test_topic --bootstrap-server <your_public_ip>:9092
Once Kafka is up and running, a Python producer script sends the stock market data (from a CSV file) to the Kafka topic, and a consumer picks up the data and uploads it to an S3 bucket.
Then, set up an AWS Glue crawler to catalog the data stored in the S3 bucket.
- Create an S3 bucket to store the data (an example CLI command follows this list).
- Set up an AWS Glue crawler to scan the data in the S3 bucket and create a Glue table.
- Use Amazon Athena to query the cataloged data.
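The bucket from the first step can be created with the AWS CLI; the bucket name here is a placeholder, so use a globally unique name of your own:
aws s3 mb s3://your-stock-market-bucket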
You will need to implement a Python producer script to read the CSV file and send each row to the Kafka topic. Example Python code:
from kafka import KafkaProducer
import pandas as pd

# Read the CSV file
data = pd.read_csv('stock_market_data.csv')

# Set up the Kafka producer
producer = KafkaProducer(bootstrap_servers='<your_public_ip>:9092')

# Send each row of the CSV file to the Kafka topic as a JSON-encoded message
for index, row in data.iterrows():
    message = row.to_json().encode('utf-8')
    producer.send('test_topic', value=message)

# Block until all buffered messages have been delivered
producer.flush()
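On the other side of the pipeline, a consumer script reads the messages and writes them to S3. A minimal sketch, assuming kafka-python and boto3, with a placeholder bucket name and one S3 object per message (the actual project may batch records or use a different layout):
from kafka import KafkaConsumer
import boto3
import json

# Read JSON messages from the topic, starting from the earliest offset
consumer = KafkaConsumer(
    'test_topic',
    bootstrap_servers='<your_public_ip>:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

s3 = boto3.client('s3')

# Upload each message as its own JSON object in the bucket
for count, message in enumerate(consumer):
    s3.put_object(
        Bucket='your-stock-market-bucket',  # placeholder bucket name
        Key=f'stock_market_{count}.json',
        Body=json.dumps(message.value)
    )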
- Create a Glue crawler to scan the S3 bucket where the Kafka consumer is uploading data (a programmatic sketch follows the example query below).
- Create a Glue table to catalog the data.
- Use Amazon Athena to query the cataloged data from the Glue table.
SELECT * FROM stock_market_data WHERE stock_symbol = 'AAPL';
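The crawler can be set up from the AWS console, or programmatically. A sketch using boto3, where the crawler name, IAM role, database, and bucket path are all placeholders you would replace with your own values:
import boto3

glue = boto3.client('glue')

# Create a crawler that scans the S3 bucket and writes a table into a Glue database
glue.create_crawler(
    Name='stock-market-crawler',
    Role='arn:aws:iam::<account_id>:role/<glue_service_role>',
    DatabaseName='stock_market_db',
    Targets={'S3Targets': [{'Path': 's3://your-stock-market-bucket/'}]}
)

# Run the crawler once data has landed in the bucket
glue.start_crawler(Name='stock-market-crawler')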
- Ensure that your EC2 instance's security group allows inbound traffic on port 9092 for Kafka (an example command follows these notes).
- Replace <your_public_ip> with the actual public IP address of your EC2 instance.
- The Kafka producer and consumer use the topic test_topic; you can change this to suit your needs.
- Make sure AWS permissions for S3, Glue, and Athena are configured correctly for your user/role.
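The port-9092 rule can be added with the AWS CLI; the security group ID is a placeholder, and restricting the source CIDR to your own IP is safer than opening the port to the world:
aws ec2 authorize-security-group-ingress --group-id <your_security_group_id> --protocol tcp --port 9092 --cidr <your_ip>/32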