Skip to content

lubanskakinga/banks-project-etl

Repository files navigation

Banks Project ETL

Top 10 largest banks in the world ranked by market capitalization in billion USD. Further, the data needs to be transformed and stored in GBP, EUR and INR as well, in accordance with the exchange rate information that has been made available to you as a CSV file. The processed information table is to be saved locally in a CSV format and as a database table.

Your job is to create an automated system to generate this information so that the same can be executed in every financial quarter to prepare the report.

Particulars of the code to be made have been shared below.

Parameter Value
Code name banks_project.py
Data URL https://web.archive.org/web/20230908091635 /https://en.wikipedia.org/wiki/List_of_largest_banks
Exchange rate CSV path https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/exchange_rate.csv
Table Attributes (upon Extraction only) Name, MC_USD_Billion
Table Attributes (final) Name, MC_USD_Billion, MC_GBP_Billion, MC_EUR_Billion, MC_INR_Billion
Output CSV Path ./Largest_banks_data.csv
Database name Banks.db
Table name Largest_banks
Log file code_log.txt

Task 1 : Logging function

Write the function to log the progress of the code, log_progress().

The function accepts the message to be logged and enters it to a text file code_log.txt.

The format to be used for logging must have the syntax:

<time_stamp> : <message>

You must associate the correct log entries with each of the executed function calls. Use the following table to note the logging message at the end of each function call that follows.

Task Log message on completion
Declaring known values Preliminaries complete. Initiating ETL process
Call extract() function Data extraction complete. Initiating Transformation process
Call transform() function Data transformation complete. Initiating Loading process
Call load_to_csv() Data saved to CSV file
Initiate SQLite3 connection SQL Connection initiated
Call load_to_db() Data loaded to Database as a table, Executing queries
Call run_query() Process Complete
Close SQLite3 connection Server Connection closed

Task 1 : Answer

def log_progress(message):
    ''' This function logs the mentioned message of a given stage of the
    code execution to a log file. Function returns nothing'''

    timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Month-Day-Hour-Minute-Second 
    now = datetime.now() # get current timestamp 
    timestamp = now.strftime(timestamp_format) 
    with open("code_log.txt","a") as f: 
        f.write(timestamp + ' : ' + message + '\n')

Task 2 : Extraction of data

Analyze the webpage on the given URL:

https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks

Identify the position of the required table under the heading By market capitalization. Write the function extract() to retrieve the information of the table to a Pandas data frame.

Note: Remember to remove the last character from the Market Cap column contents, like, '\n', and typecast the value to float format.

Write a function call for extract() and print the returning data frame.

Zrzut ekranu 2024-10-03 195101

Task 2 : Answer

def extract(url, table_attribs):
    ''' This function aims to extract the required
    information from the website and save it to a data frame. The
    function returns the data frame for further processing. '''

    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    
    df = pd.DataFrame(columns=table_attribs)
    
    tables = soup.find_all("tbody")
    rows = tables[0].find_all("tr")
    
    for row in rows:
        if row.find("td") is not None:
            col = row.find_all("td")
            bank_name = col[1].find_all('a')[1]['title']
            market_cap = col[2].contents[0][:-1]
            data_dict = {"Name": bank_name,
                         "MC_USD_Billion": float(market_cap)}
            df1 = pd.DataFrame(data_dict, index=[0])
            df = pd.concat([df, df1], ignore_index=True)

    return df

Task 3 : Transformation of data

The Transform function needs to perform the following tasks:

  1. Read the exchange rate CSV file and convert the contents to a dictionary so that the contents of the first columns are the keys to the dictionary and the contents of the second column are the corresponding values.

  2. Add 3 different columns to the dataframe, viz. MC_GBP_Billion, MC_EUR_Billion and MC_INR_Billion, each containing the content of MC_USD_Billion scaled by the corresponding exchange rate factor. Remember to round the resulting data to 2 decimal places.

Task 3 : Answer

def transform(df, csv_path):
    ''' This function accesses the CSV file for exchange rate
    information, and adds three columns to the data frame, each
    containing the transformed version of Market Cap column to
    respective currencies'''
    
    dataframe = pd.read_csv(csv_path)
    exchange_rate = dataframe.set_index('Currency').to_dict()['Rate']
    
    df['MC_GBP_Billion'] = [np.round(x * exchange_rate['GBP'], 2) for x in df['MC_USD_Billion']]
    df['MC_EUR_Billion'] = [np.round(x * exchange_rate['EUR'], 2) for x in df['MC_USD_Billion']]
    df['MC_INR_Billion'] = [np.round(x * exchange_rate['INR'], 2) for x in df['MC_USD_Billion']]
    
    return df

Task 4: Loading to CSV

Write the function to load the transformed data frame to a CSV file, like load_to_csv(), in the path mentioned in the project scenario.

Task 4 : Answer

def load_to_csv(df, output_path):
    ''' This function saves the final data frame as a CSV file in
    the provided path. Function returns nothing.'''

    df.to_csv(output_path)

Task 5: Loading to Database

Write the function to load the transformed data frame to an SQL database, like, load_to_db(). Use the database and table names as mentioned in the project scenario.

  • Before calling this function, initiate the connection to the SQLite3 database server with the name Banks.db. Pass this connection object, along with the required table name Largest_banks and the transformed data frame, to the load_to_db() function in the function call.

Upon successful function call, you will have loaded the contents of the table with the required data and the file Banks.db will be visible in the Explorer tab of the IDE under the project folder.

Task 5 : Answer

def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final data frame to a database
    table with the provided name. Function returns nothing.'''

    df.to_sql(table_name, sql_connection, if_exists = 'replace', index = False)

Task 6: Function to Run queries on Database

Write the function run_queries() that accepts the query statement, and the SQLite3 Connection object, and generates the output of the query. The query statement should be printed along with the query output.

Execute 3 function calls using the queries as mentioned below.

  1. Print the contents of the entire table

Query statement:

SELECT * FROM Largest_banks
  1. Print the average market capitalization of all the banks in Billion USD.

Query statement:

SELECT AVG(MC_GBP_Billion) FROM Largest_banks
  1. Print only the names of the top 5 banks

Query statement:

SELECT Name from Largest_banks LIMIT 5

Task 6 : Answer

def run_query(query_statement, sql_connection):

    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)

Task 7: Verify log entries

After updating all the log_progress() function calls, you have to run the code for a final execution.

  • However, you will first have to remove the code_log.txt file, that would have been created and updated throughout the multiple executions of the code in this lab. You may remove the file using the following command on a terminal.
rm code_log.txt

Once the existing file is removed, now run the final execution. Upon successful completion of execution, open the code_log.txt file by clicking on it in the Explorer tab of the toolbar on left side of the programming pane of the IDE, under the project folder. You should see all the relevant entries made in the text file in relation to the stages of code execution.

''' Here, you define the required entities and call the relevant
functions in the correct order to complete the project. Note that this
portion is not inside any function.'''

url = 'https://web.archive.org/web/20230908091635 /https://en.wikipedia.org/wiki/List_of_largest_banks'
table_attribs = ['Name', 'MC_USD_Billion']
db_name = 'Banks.db'
sql_connection = sqlite3.connect(db_name)
table_name = 'Largest_banks'
csv_path = 'exchange_rate.csv'
output_path = "./Largest_banks_data.csv"
log_file = "code_log.txt"
log_progress("Preliminaries complete. Initiating ETL process")

df = extract(url, table_attribs)
print(df)

log_progress("Data extraction complete. Initiating Transformation process")

df = transform(df, csv_path)
print(df)

log_progress("Data transformation complete. Initiating Loading process")

load_to_csv(df, output_path)

log_progress("Data saved to CSV file")

sql_connection = sqlite3.connect(db_name)

log_progress("SQL Connection initiated")

load_to_db(df, sql_connection, table_name)

log_progress("Data loaded to Database as a table, Executing queries")

query_statement = f"SELECT * from {table_name}"
run_query(query_statement, sql_connection)

query_statement = f"SELECT AVG(MC_GBP_Billion) FROM {table_name}"
run_query(query_statement, sql_connection)

query_statement = f"SELECT Name from {table_name} LIMIT 5"
run_query(query_statement, sql_connection)

log_progress("Process Complete")

sql_connection.close()

log_progress("Server Connection closed")
with open(log_file, "r") as log:
    LogContent = log.read()
    print(LogContent)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages