This repository contains a detailed data analysis project that performs data cleaning, analysis, and visualization on a dataset. The project leverages various Python libraries like Pandas, NumPy, Seaborn, Matplotlib, and Plotly to explore and analyze the dataset. The key analyses involve spatial and descriptive statistics, as well as visualizations to present key findings.
- Removed invalid entries and records with missing values.
- Identified and removed duplicate records to ensure data integrity.
- Conducted a basic descriptive analysis on the dataset, providing summary statistics and an overview of the dataset’s distribution.
- Extracted data in quantile form for further analysis and to identify patterns in the distribution of key variables.
- Performed spatial analysis to understand the geographic distribution of guests by country.
- Cleaned data further by removing bookings that were cancelled.
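The cleaning and country-level steps listed above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual code. The column names `is_canceled`, `adults`, `children`, and `babies`, and the rule that a booking with zero guests counts as invalid, are assumptions; only `country` and `No of guests` appear in the original text.

```python
import pandas as pd

# Tiny made-up sample; every column name except `country` is an assumption.
raw = pd.DataFrame({
    "country":     ["PRT", "PRT", "GBR", None, "FRA", "PRT", "GBR"],
    "adults":      [2, 2, 1, 2, 0, 2, 2],
    "children":    [0, 0, 0, 1, 0, 1, 0],
    "babies":      [0, 0, 0, 0, 0, 0, 0],
    "is_canceled": [0, 0, 0, 0, 0, 1, 0],
})


def clean_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values, zero-guest bookings, and exact duplicates."""
    df = df.dropna()
    # Assumed validity rule: a booking must have at least one guest.
    has_guests = (df["adults"] + df["children"] + df["babies"]) > 0
    return df[has_guests].drop_duplicates().reset_index(drop=True)


def guests_by_country(df: pd.DataFrame) -> pd.DataFrame:
    """Count non-cancelled bookings per country."""
    kept = df[df["is_canceled"] == 0]
    counts = kept["country"].value_counts().reset_index()
    # Rename explicitly; the default column names differ between pandas versions.
    counts.columns = ["country", "No of guests"]
    return counts


bookings = clean_bookings(raw)
summary = bookings.describe()                                # descriptive statistics
quartiles = bookings["adults"].quantile([0.25, 0.5, 0.75])   # quantile view of one column
country_wise_data = guests_by_country(bookings)
```

The resulting `country_wise_data` frame has the two columns the choropleth call expects, `country` and `No of guests`.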
Visualization:
- Converted the cleaned data into a choropleth world map using Plotly to visualize the home countries of guests. This map helps understand the concentration of guests across different regions.

      import plotly.express as px

      # Choropleth of guest counts per home country
      map_guest = px.choropleth(
          data_frame=country_wise_data,
          locations="country",
          color="No of guests",
          hover_name="country",
          title="Home country of Guests",
      )
      map_guest.show()

- Analyzed if there is any difference between the reserved and assigned room types.
- Created a pivot table between `reserved_room_type` and `assigned_room_type` to find intersections, normalized the data by the index (`reserved_room_type`), and made the resulting data more readable.
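One way to build such a table is sketched below. It uses `pd.crosstab` with `normalize="index"`, which may differ from how the project constructed its pivot table, and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up sample: four guests reserved type A, two reserved type D.
bookings = pd.DataFrame({
    "reserved_room_type": ["A", "A", "A", "A", "D", "D"],
    "assigned_room_type": ["A", "A", "A", "D", "D", "D"],
})

# Rows are reserved types, columns are assigned types.
# normalize="index" scales each row so that it sums to 1.
room_matrix = pd.crosstab(
    index=bookings["reserved_room_type"],
    columns=bookings["assigned_room_type"],
    normalize="index",
)

# Express the shares as percentages to make the table easier to read.
room_pct = (room_matrix * 100).round(1)

# Overall share of bookings where the assigned room matched the reservation.
same_room_share = (
    bookings["reserved_room_type"] == bookings["assigned_room_type"]
).mean()
```

Each row of `room_pct` then reads as "of the guests who reserved this type, what percentage was assigned each type", so the diagonal shows how often the reservation was honored.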
- 6.a. Market Segment Distribution: Analyzed the distribution of bookings by market segment and visualized the count of bookings in each market segment using a pie chart.
Visualization:
- Plotted a pie chart to show the market segments and the count of bookings per segment.
- 6.b. Average Price per Night (ADR) Analysis: Analyzed the Average Daily Rate (ADR) for various room types across market segments.
- Plotted a bar chart to compare the ADR across market segments, further broken down by room type (`reserved_room_type`).
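The two aggregations behind the pie and bar charts could look roughly like this. The column names `market_segment` and `adr` are assumptions (the text only names `reserved_room_type`), and the rows are made up.

```python
import pandas as pd

# Made-up sample; `market_segment` and `adr` are assumed column names.
bookings = pd.DataFrame({
    "market_segment":     ["Online TA", "Online TA", "Direct", "Direct", "Groups"],
    "reserved_room_type": ["A", "D", "A", "A", "A"],
    "adr":                [100.0, 140.0, 90.0, 110.0, 70.0],
})

# Number of bookings per market segment (input for a pie chart).
segment_counts = (
    bookings["market_segment"]
    .value_counts()
    .rename_axis("market_segment")
    .reset_index(name="bookings")
)

# Mean ADR per market segment and room type (input for a grouped bar chart).
adr_by_segment = bookings.groupby(
    ["market_segment", "reserved_room_type"], as_index=False
)["adr"].mean()
```

These frames could then be handed to Plotly Express, for example `px.pie(segment_counts, names="market_segment", values="bookings")` and `px.bar(adr_by_segment, x="market_segment", y="adr", color="reserved_room_type", barmode="group")`.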
- Investigated whether there is a pattern in guest arrivals over time.
- Created a dictionary `dict_month` to map month names to their respective numeric values.
- Added a new column `arrival_date_month_index` to convert month names into numeric values, then concatenated columns such as `arrival_date_year`, `arrival_date_month_index`, and `arrival_date_day_of_month` to create a combined date column.
- Analysis: Analyzed total guests arriving on each day and looked for any patterns in the data. The analysis revealed no significant pattern in guest arrivals over time.
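The date construction described above might be sketched as follows. The source column `arrival_date_month`, the per-booking `guests` column, and the new column name `arrival_date` are assumptions made for illustration, and the rows are made up.

```python
import pandas as pd

# Month name -> month number.
dict_month = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12,
}

# Made-up sample; `arrival_date_month` and `guests` are assumed column names.
bookings = pd.DataFrame({
    "arrival_date_year":         [2016, 2016, 2016, 2017],
    "arrival_date_month":        ["July", "July", "August", "January"],
    "arrival_date_day_of_month": [1, 1, 15, 3],
    "guests":                    [2, 3, 1, 2],
})

bookings["arrival_date_month_index"] = bookings["arrival_date_month"].map(dict_month)

# Concatenate year, month, and day into a zero-padded "YYYY-MM-DD" string,
# then parse it into a proper datetime column.
date_text = (
    bookings["arrival_date_year"].astype(str)
    + "-" + bookings["arrival_date_month_index"].astype(str).str.zfill(2)
    + "-" + bookings["arrival_date_day_of_month"].astype(str).str.zfill(2)
)
bookings["arrival_date"] = pd.to_datetime(date_text, format="%Y-%m-%d")

# Total guests arriving on each day.
guests_per_day = bookings.groupby("arrival_date")["guests"].sum()
```

Parsing the concatenated string into a datetime, rather than leaving it as text, keeps the daily totals in chronological order when they are plotted.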
- Mapped the distribution of guest arrivals using a combined histogram-line plot, where both the X and Y axes represented guest arrival data.
- Performed an analysis of the distribution to understand the spread and trends.
- Mean and Median Trend Analysis: Conducted a trend analysis on the mean and median values of guest arrivals and other key variables extracted from the previous charts.
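The text does not say how the mean and median trends were computed. As one possible reading, the sketch below compares the monthly mean and median of daily arrivals; the monthly grouping and the sample values are assumptions.

```python
import pandas as pd

# Made-up daily arrival totals, indexed by arrival date.
guests_per_day = pd.Series(
    [10, 20, 60, 5, 15],
    index=pd.to_datetime(
        ["2016-07-01", "2016-07-02", "2016-07-03", "2016-08-01", "2016-08-02"]
    ),
    name="guests",
)

# Group the daily totals by calendar month and compute both statistics.
monthly = guests_per_day.groupby(guests_per_day.index.to_period("M")).agg(
    ["mean", "median"]
)

# A mean well above the median suggests a few unusually busy days in that month.
monthly["mean_minus_median"] = monthly["mean"] - monthly["median"]
```

Comparing the two columns month by month gives a quick check on whether the arrival distribution is skewed by a handful of peak days.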
- Pandas: Used for data manipulation, cleaning, and preparation.
- NumPy: Used for numerical computations and array handling.
- Seaborn: Used for statistical data visualization and generating beautiful plots.
- Matplotlib: Used for static, animated, and interactive visualizations.
- Plotly: Used for interactive visualizations and mapping. We used Plotly Express for choropleth maps, pie charts, bar charts, and other visualizations.
- Chart Studio: A Plotly web service for hosting and sharing graphs.
This project demonstrates various essential steps in the data analysis workflow, including data cleaning, statistical analysis, and visualization. The use of libraries like Plotly helps create interactive and insightful visualizations, while the analyses provide actionable insights into guest booking behavior, market segment distribution, and guest arrival patterns.
Feel free to explore and extend the project with additional analyses or visualizations based on your needs!