This project demonstrates a stock market data pipeline built with Apache Kafka, AWS S3, AWS Glue, and Amazon Athena. Stock market data is streamed from a CSV file to a Kafka cluster hosted on an EC2 instance; a Kafka consumer then pushes the data to an S3 bucket, where it is cataloged with AWS Glue and queried with Amazon Athena.
Before you begin, ensure you have the following tools and services:
- Amazon EC2 instance running (preferably Amazon Linux 2)
- Java 8 (Amazon Corretto) installed on your EC2 instance
- Kafka 3.9.0 installed on your EC2 instance
- Python 3.x with the libraries needed for the Kafka producer and consumer scripts (an example install command follows this list)
- AWS CLI configured with appropriate credentials to interact with S3, Glue, and Athena
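Based on the example code later in this guide, the producer relies on kafka-python and pandas; if the consumer uploads to S3 from Python, boto3 is also needed. The exact library set is an assumption, so adjust it to match your scripts:
pip install kafka-python pandas boto3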
- Download Kafka 3.9.0:
wget https://downloads.apache.org/kafka/3.9.0/kafka_2.12-3.9.0.tgz
- Extract the downloaded file and change into the Kafka directory (the remaining commands assume you run them from here):
tar -xvf kafka_2.12-3.9.0.tgz
cd kafka_2.12-3.9.0
Apache Kafka requires Java to run. Install Java 8 using the following command:
sudo yum install -y https://corretto.aws/downloads/latest/amazon-corretto-8-x64-linux-jdk.rpm
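You can confirm the installation before continuing:
java -version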
Kafka depends on Zookeeper to manage broker metadata. Start Zookeeper with the following command:
bin/zookeeper-server-start.sh config/zookeeper.properties
You need to configure the Kafka server to use your EC2 instance's public IP address.
- Open the server.properties configuration file:
sudo nano config/server.properties
- Set advertised.listeners to your EC2 instance's public IP so external clients can reach the broker, and bind listeners to all interfaces:
advertised.listeners=PLAINTEXT://<your_public_ip>:9092
listeners=PLAINTEXT://0.0.0.0:9092
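With the listener settings saved, start the Kafka broker itself (in a separate terminal from Zookeeper, from the Kafka directory):
bin/kafka-server-start.sh config/server.properties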
Create a Kafka topic for the stock market data. You can use the following command:
bin/kafka-topics.sh --create --topic test_topic --bootstrap-server <your_public_ip>:9092 --replication-factor 1 --partitions 1
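You can optionally verify that the topic was created:
bin/kafka-topics.sh --describe --topic test_topic --bootstrap-server <your_public_ip>:9092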
Start the Kafka producer, which will send stock market data to test_topic:
bin/kafka-console-producer.sh --topic test_topic --bootstrap-server <your_public_ip>:9092
Start the Kafka consumer, which reads from test_topic. In the full pipeline the consumer side forwards this data to an S3 bucket (a consumer script sketch appears after the producer example below); the console consumer here is useful for checking that messages arrive:
bin/kafka-console-consumer.sh --topic test_topic --bootstrap-server <your_public_ip>:9092
Once Kafka is up and running, a Python producer script sends the stock market data (from a CSV file) to the Kafka topic, and a consumer picks up the data and uploads it to an S3 bucket.
Then, set up an AWS Glue crawler to catalog the data stored in the S3 bucket.
- Create an S3 bucket to store the data (an example CLI command follows this list).
- Set up an AWS Glue crawler to scan the data in the S3 bucket and create a Glue table.
- Use Amazon Athena to query the cataloged data.
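The bucket from the first step can be created with the AWS CLI; the bucket name here is a placeholder, so use a globally unique name of your own:
aws s3 mb s3://your-stock-market-bucket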
You will need to implement a Python producer script to read the CSV file and send each row to the Kafka topic. Example Python code:
from kafka import KafkaProducer
import pandas as pd

# Read the CSV file
data = pd.read_csv('stock_market_data.csv')

# Set up the Kafka producer
producer = KafkaProducer(bootstrap_servers='<your_public_ip>:9092')

# Send each row of the CSV file to the Kafka topic as a JSON-encoded message
for index, row in data.iterrows():
    message = row.to_json().encode('utf-8')
    producer.send('test_topic', value=message)

# Block until all buffered messages have been delivered
producer.flush()
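On the other side of the pipeline, a consumer script reads the messages and writes them to S3. A minimal sketch, assuming kafka-python and boto3, with a placeholder bucket name and one S3 object per message (the actual project may batch records or use a different layout):
from kafka import KafkaConsumer
import boto3
import json

# Read JSON messages from the topic, starting from the earliest offset
consumer = KafkaConsumer(
    'test_topic',
    bootstrap_servers='<your_public_ip>:9092',
    auto_offset_reset='earliest',
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

s3 = boto3.client('s3')

# Upload each message as its own JSON object in the bucket
for count, message in enumerate(consumer):
    s3.put_object(
        Bucket='your-stock-market-bucket',  # placeholder bucket name
        Key=f'stock_market_{count}.json',
        Body=json.dumps(message.value)
    )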
- Create a Glue crawler to scan the S3 bucket where the Kafka consumer is uploading data (a programmatic sketch follows the example query below).
- Create a Glue table to catalog the data.
- Use Amazon Athena to query the cataloged data from the Glue table.
SELECT * FROM stock_market_data WHERE stock_symbol = 'AAPL';
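The crawler can be set up from the AWS console, or programmatically. A sketch using boto3, where the crawler name, IAM role, database, and bucket path are all placeholders you would replace with your own values:
import boto3

glue = boto3.client('glue')

# Create a crawler that scans the S3 bucket and writes a table into a Glue database
glue.create_crawler(
    Name='stock-market-crawler',
    Role='arn:aws:iam::<account_id>:role/<glue_service_role>',
    DatabaseName='stock_market_db',
    Targets={'S3Targets': [{'Path': 's3://your-stock-market-bucket/'}]}
)

# Run the crawler once data has landed in the bucket
glue.start_crawler(Name='stock-market-crawler')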
- Ensure that your EC2 instance's security group allows inbound traffic on port 9092 for Kafka (an example command follows these notes).
- Replace <your_public_ip> with the actual public IP address of your EC2 instance.
- The Kafka producer and consumer use the topic test_topic; you can change this to suit your needs.
- Make sure AWS permissions for S3, Glue, and Athena are configured correctly for your user/role.
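The port-9092 rule can be added with the AWS CLI; the security group ID is a placeholder, and restricting the source CIDR to your own IP is safer than opening the port to the world:
aws ec2 authorize-security-group-ingress --group-id <your_security_group_id> --protocol tcp --port 9092 --cidr <your_ip>/32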