ABOUT DATASET:

This project's dataset is the "Historical Hourly Weather Data", which was collected from Kaggle. Encompassing a rich compilation of meteorological information across various cities, the dataset consists of multiple CSV files, each representing a crucial facet of weather data. Here's a concise overview of the primary datasets:

  1. city_attributes.csv: Details about the cities featured in the dataset, such as country and geographical coordinates (latitude and longitude).
  2. humidity.csv: Information on recorded humidity levels over time for different cities.
  3. pressure.csv: Data related to atmospheric pressure measurements, offering insights into air density variations.
  4. temperature.csv: Temperature data recorded over time for various cities, providing insights into temperature fluctuations.
  5. weather_description.csv: Detailed descriptions of weather conditions, facilitating the categorization and understanding of the atmospheric state.
  6. wind_direction.csv: Information on prevailing wind directions recorded over time in different locations.
  7. wind_speed.csv: Data on the speed of the wind over time, aiding in the analysis of wind patterns and intensities.

This rich dataset enables in-depth investigation and analysis, allowing the project to discover trends, anomalies, and insights into the dynamic nature of meteorological conditions. Furthermore, the project uses data from the OpenWeatherMap API to supplement the Kaggle dataset with real-time weather information, enabling comprehensive exploration of numerous weather occurrences and the development of effective machine learning models for weather prediction and insights.

SOURCE:

  1. API: OpenWeatherMap API

    Endpoint: https://api.openweathermap.org/data/2.5/weather

    API Call Example: https://api.openweathermap.org/data/2.5/weather?q=Denver&appid={API key}

  2. Historical Hourly Weather Data (csv): Kaggle Data

1. OpenWeatherMap API:

API Call:

This step involved iterating through each city, constructing a unique API request URL from the city name and API key, and issuing a GET request to retrieve the weather data. After a successful request (status code 200), relevant information such as humidity, pressure, temperature, weather description, wind speed, wind direction, latitude, and longitude was extracted from the JSON response. The obtained data was then organised into a DataFrame representing the weather conditions in the specified cities.

apicall
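As a rough sketch, the request loop described above might look like the following. The city list and API key are placeholders, and the nested field paths follow OpenWeatherMap's documented current-weather JSON; the project's actual code may differ in detail.

```python
import requests
import pandas as pd

BASE_URL = "https://api.openweathermap.org/data/2.5/weather"

def build_url(city, api_key):
    """Construct the per-city request URL."""
    return f"{BASE_URL}?q={city}&appid={api_key}"

def fetch_weather(cities, api_key):
    """Query the API for each city and collect the fields used in this project."""
    records = []
    for city in cities:
        response = requests.get(build_url(city, api_key))
        if response.status_code == 200:  # keep only successful requests
            data = response.json()
            records.append({
                "city": city,
                "humidity": data["main"]["humidity"],
                "pressure": data["main"]["pressure"],
                "temperature": data["main"]["temp"],
                "weather_description": data["weather"][0]["description"],
                "wind_speed": data["wind"]["speed"],
                "wind_direction": data["wind"]["deg"],
                "latitude": data["coord"]["lat"],
                "longitude": data["coord"]["lon"],
            })
    return pd.DataFrame(records)
```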

BEFORE CLEANING (API Data):

apibc

AFTER CLEANING (API Data):

apiac

STEPS IN DATA PREPROCESSING:

  1. Parsing the JSON Data:
     Specific fields or elements within the nested JSON structure are accessed to extract the relevant data points, such as humidity, pressure, temperature, weather description, wind speed, wind direction, and geographic coordinates.

    After Parsing:
    aftercombining
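The parsing step can be illustrated on a trimmed sample response. The values below are invented for illustration; the nested paths match OpenWeatherMap's current-weather payload.

```python
sample = {  # trimmed example of the JSON returned by the endpoint (values invented)
    "coord": {"lon": -104.98, "lat": 39.74},
    "weather": [{"description": "clear sky"}],
    "main": {"temp": 293.15, "pressure": 1013, "humidity": 40},
    "wind": {"speed": 3.6, "deg": 220},
}

def parse_weather(payload):
    """Flatten the nested JSON into a single flat record."""
    return {
        "humidity": payload["main"]["humidity"],
        "pressure": payload["main"]["pressure"],
        "temperature": payload["main"]["temp"],
        "weather_description": payload["weather"][0]["description"],
        "wind_speed": payload["wind"]["speed"],
        "wind_direction": payload["wind"]["deg"],
        "latitude": payload["coord"]["lat"],
        "longitude": payload["coord"]["lon"],
    }

record = parse_weather(sample)
```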

  2. Converting datetime to the required format:
     This step converts the 'datetime' column in the DataFrame to the format 'YYYY-MM-DD HH:MM:SS', improving readability and compatibility with other data processing tasks.

    After converting datetime to required format:
    apidt
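A minimal sketch of this formatting step, assuming the raw column holds parseable date strings (the sample values are invented; the API's Unix `dt` field would instead need `pd.to_datetime(..., unit="s")`):

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2017-10-01 13:00", "2017-10-01 14:00"]})  # invented sample values
# Parse, then render in the target 'YYYY-MM-DD HH:MM:SS' format
df["datetime"] = pd.to_datetime(df["datetime"]).dt.strftime("%Y-%m-%d %H:%M:%S")
```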

  3. Converting Temperature from Kelvin to Fahrenheit:
     The temperature column, initially in Kelvin, was converted to Fahrenheit for better interpretability.

    After Temperature conversion:
    apiktof
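The conversion follows the standard formula F = (K - 273.15) * 9/5 + 32:

```python
def kelvin_to_fahrenheit(k):
    """F = (K - 273.15) * 9/5 + 32"""
    return (k - 273.15) * 9 / 5 + 32

# e.g. 273.15 K (the freezing point of water) converts to 32.0 F
```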

2. Historical Hourly Weather Data (Kaggle)

BEFORE CLEANING (Historical Hourly Weather Data):

Before Cleaning 1
Before Cleaning 2
Before Cleaning 3
Before Cleaning 4
Before Cleaning 5
Before Cleaning 6

AFTER CLEANING (Historical Hourly Weather Data):

aftercleaning

STEPS IN DATA PREPROCESSING:

  1. Merging the Kaggle Data:
     Multiple datasets comprising weather-related information, such as humidity, pressure, temperature, weather description, wind speed, and wind direction, were first melted and then combined into a single dataset. Melting converted the original wide-format data (one column per city) to a long format, making it easier to merge on shared columns. The records were then joined on the 'datetime' and 'City' columns, yielding a comprehensive dataset with consolidated meteorological information for numerous cities. Finally, a merge with the city attributes dataset on the 'City' column enriched the weather data with each city's country, latitude, and longitude. The combined dataset provides a comprehensive perspective of weather conditions across different cities and geographical areas.

    After Merging:
    aftercombining
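A small sketch of the melt-and-merge step on toy stand-ins for two of the wide-format files (the city names and values here are invented, and only two of the six measurement files are shown):

```python
import pandas as pd

# Toy stand-ins for the wide-format Kaggle files: datetime plus one column per city
humidity = pd.DataFrame({
    "datetime": ["2017-01-01 00:00:00", "2017-01-01 01:00:00"],
    "Denver": [50, 55], "Seattle": [80, 82],
})
temperature = pd.DataFrame({
    "datetime": ["2017-01-01 00:00:00", "2017-01-01 01:00:00"],
    "Denver": [270.0, 271.0], "Seattle": [280.0, 281.0],
})

def melt_file(wide, value_name):
    """Wide -> long: one row per (datetime, City) pair."""
    return wide.melt(id_vars="datetime", var_name="City", value_name=value_name)

merged = melt_file(humidity, "humidity").merge(
    melt_file(temperature, "temperature"), on=["datetime", "City"]
)

# Enrich with per-city attributes, as in the city_attributes.csv merge
cities = pd.DataFrame({
    "City": ["Denver", "Seattle"],
    "Country": ["United States", "United States"],
    "Latitude": [39.74, 47.61], "Longitude": [-104.98, -122.33],
})
merged = merged.merge(cities, on="City")
```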

  2. Sampling the Data and Dropping the Country column:
     The dataset initially had 1,629,108 rows and 11 columns. To make the vast dataset easier to analyse and manage, a sample restricted to the year 2017 and to cities in the United States was created, resulting in a significantly smaller set of 215,811 rows. The 'datetime' column was converted to datetime format, and the 'Country' column, which held a single constant value (United States), was dropped because it was no longer informative. This sampling and cleaning process enabled more efficient handling of the data for subsequent analysis, specifically targeting weather conditions in the United States during 2017.

    After Sampling and dropping Country column:
    aftersampling
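The filtering described above might be sketched as follows on a toy frame (the rows are invented to show one kept and two excluded cases):

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": ["2016-12-31 23:00:00", "2017-01-01 00:00:00", "2017-06-15 12:00:00"],
    "City": ["Denver", "Denver", "Toronto"],
    "Country": ["United States", "United States", "Canada"],
})
df["datetime"] = pd.to_datetime(df["datetime"])

# Keep only 2017 rows for US cities, then drop the now-constant Country column
sample = df[(df["datetime"].dt.year == 2017) & (df["Country"] == "United States")]
sample = sample.drop(columns="Country")
```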

  3. Making Column names consistent:
     To ensure consistency, the column names of the dataset were reviewed and converted to lowercase. This standardizes the naming conventions, promoting clarity and simplicity throughout the analysis process.

    After renaming the Columns:
    afterrenaming
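In pandas this is a one-liner (the mixed-case names below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(columns=["City", "Temperature", "Wind_Speed"])  # hypothetical mixed-case names
df.columns = df.columns.str.lower()  # standardize naming
```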

  4. Checking Duplicate Values:
     Duplicated rows were checked to ensure data integrity, and no duplicate records were found. The absence of duplicate rows suggests that each entry is unique, contributing to the data's reliability and reducing redundancy.

    afterduplicates
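A minimal check along these lines (toy frame; the two rows differ, so no duplicates are reported):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Denver", "Denver"], "humidity": [50, 55]})  # toy frame
n_duplicates = df.duplicated().sum()  # 0 means every row is unique
```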

  5. Cleaning Missing Values:
    1. Forward Fill (ffill): Initial imputation was performed using forward fill to propagate the last observed values within the same day and city, ensuring temporal consistency.
    2. Mean Imputation: For the remaining missing values, mean imputation was applied based on the values within the same day and city group. This method helped to maintain the integrity of the data while filling gaps.
    3. Dropping Rows: Rows with multiple missing values were eventually dropped from the dataset to ensure a more complete and consistent dataset for subsequent analysis.
    Shape of Data after cleaning Missing Values: (213435, 10)

    After Cleaning Missing Values :
    aftermissingvalues
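The three-stage imputation could be sketched like this (toy data; the grouping keys assume a day-level `date` column alongside `city`, which is an assumption about the project's exact grouping):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["2017-01-01"] * 4,
    "city": ["Denver"] * 4,
    "humidity": [50.0, np.nan, np.nan, 60.0],
    "pressure": [np.nan, 1010.0, np.nan, 1012.0],
})

group_cols = ["date", "city"]
value_cols = ["humidity", "pressure"]

# 1. Forward fill within each (day, city) group for temporal consistency
df[value_cols] = df.groupby(group_cols)[value_cols].ffill()
# 2. Mean-impute whatever forward fill could not reach (e.g. a leading NaN)
df[value_cols] = df.groupby(group_cols)[value_cols].transform(lambda s: s.fillna(s.mean()))
# 3. Drop any rows that still contain missing values
df = df.dropna()
```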

  6. Converting Temperature from Kelvin to Fahrenheit:
     The temperature column, initially in Kelvin, was converted to Fahrenheit for better interpretability.

    After temperature conversion:
    ktof

  7. Creating Label:
     The weather description column was analyzed, and a new categorical column named 'weather' was created based on the types of weather conditions. The function weather_labels was defined to categorize weather descriptions into labels such as 'clear', 'rainy', 'snowy', 'thunderstorm', 'foggy', 'cloudy', and 'other'. The original weather description column was converted to lowercase to ensure consistency, and the new 'weather' column was added to the dataset. The resulting dataset, now with the 'weather' column, provides a simpler and more informative representation of weather conditions for further analysis, with 213,435 rows and 11 columns.

    Label Function :
    labelfunction
    After Creating Label:
    afterlabel
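A plausible sketch of a `weather_labels` function along these lines. The keyword-to-label mapping below is an illustrative assumption, not the project's original code; more specific conditions are checked before more general ones (e.g. 'thunderstorm' before 'rain') so compound descriptions land in the right bucket.

```python
def weather_labels(description):
    """Map a free-text weather description to a coarse label.
    The keyword mapping here is illustrative, not the original."""
    description = description.lower()
    if "clear" in description:
        return "clear"
    if "thunderstorm" in description:
        return "thunderstorm"
    if "snow" in description or "sleet" in description:
        return "snowy"
    if "rain" in description or "drizzle" in description:
        return "rainy"
    if "fog" in description or "mist" in description or "haze" in description:
        return "foggy"
    if "cloud" in description:
        return "cloudy"
    return "other"

# Applied column-wise: df["weather"] = df["weather_description"].apply(weather_labels)
```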

  8. Checking Data Types:
     The data types of the columns were inspected, revealing datetime and numerical types. The qualitative, nominal 'weather' column was then converted to the 'category' type.

    After Checking Data Types:
    afterdtypes
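Converting the column in pandas (toy values):

```python
import pandas as pd

df = pd.DataFrame({"weather": ["clear", "rainy", "clear"]})  # toy column
df["weather"] = df["weather"].astype("category")  # nominal data as a categorical dtype
```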

EXPLORATORY DATA ANALYSIS:

  1. Checking Outliers using Box Plot:
    eda1

    The box plot provides a graphical representation of the distribution of values, highlighting potential outliers based on their deviation from the interquartile range. Outliers were identified in pressure, temperature, and wind_speed. However, after further inspection using the value_counts() method, it was determined that they are due to natural variation in the data, also known as true outliers. True outliers should be left as they are in the dataset.
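The box-plot whisker rule can be reproduced numerically with the 1.5 × IQR criterion (toy values; 30 plays the role of an extreme observation):

```python
import pandas as pd

def iqr_outliers(series):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] -- the same rule box-plot whiskers use."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]

wind_speed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])  # 30 acts as the extreme value
outliers = iqr_outliers(wind_speed)
```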

  2. Distribution of Weather:
    eda2

    From the above histogram, it is evident that on most days the weather was clear or cloudy across the cities in 2017.

  3. Which cities had the most days with a particular weather condition?
    eda3clear
    Clear Weather
    eda3cloudy
    Cloudy Weather
    eda3foggy
    Foggy Weather
    eda3rainy
    Rainy Weather
    eda3snowy
    Snowy Weather
    eda3ts
    Thunderstorm Weather
    eda3others
    Other Weather

    The horizontal bar plots above show the top 5 cities with the most days of each weather condition.

    1. Las Vegas had the most days with Clear weather
    2. Albuquerque had the most days with Cloudy weather
    3. San Diego had the most days with Foggy weather
    4. Seattle had the most days with Rainy weather
    5. Pittsburgh had the most days with Snowy weather
    6. Miami had the most days with Thunderstorm weather
    7. Los Angeles had the most days with Other weather

  4. How do temperature and humidity vary over different time intervals in Denver (2017) within the dataset, and are there noticeable trends or anomalies?
    eda4

    The above time series line plot shows the following:

    1. Temperature increases from January to July and then decreases gradually from July to November in Denver (2017).
    2. January is the coldest month and July is the hottest month.
    3. There is no significant pattern in humidity.
    4. January and May recorded the highest humidity, and March recorded the lowest.

  5. Box plots of Temperature across the cities:
    eda5

    The above Box Plot shows the distribution of Temperature across the cities.

    1. Las Vegas recorded the highest Temperature in 2017
    2. Kansas City recorded the Lowest Temperature in 2017

  6. Correlation Heatmap:
    eda6

    The above correlation heatmap depicts the correlation coefficients between the numerical variables: temperature, humidity, pressure, and wind_speed.

    1. There is a weak negative correlation between Temperature and humidity, which suggests that as temperatures increase, humidity tends to decrease.
    2. There are no significant correlations between other variables.
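The heatmap's underlying matrix is just a pairwise correlation table. A sketch with invented values that exhibit a negative temperature-humidity relationship:

```python
import pandas as pd

df = pd.DataFrame({  # invented values for illustration only
    "temperature": [60.0, 70.0, 80.0, 90.0],
    "humidity": [55.0, 50.0, 42.0, 30.0],
    "wind_speed": [3.0, 7.0, 2.0, 6.0],
})
corr = df.corr()  # pairwise Pearson correlation coefficients
# seaborn's sns.heatmap(corr, annot=True) would render this matrix as the heatmap
```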

  7. Average Temperature distribution across the Cities in 2017:
    eda7

    The above map in Tableau depicts the average temperature distribution across the cities in 2017. The city with the highest average temperature is shown in dark blue, whereas the city with the lowest average temperature is shown in light blue.

    1. Miami recorded the highest average temperature of 78.76 Fahrenheit in 2017.
    2. Minneapolis recorded the lowest average temperature of 49.68 Fahrenheit in 2017.

  8. Weather Distribution in Denver in 2017:
    eda8

    The above highlight table in Tableau depicts the weather distribution in Denver in 2017. Denver's weather was clear most of the time.

  9. Month-wise Weather distribution in Denver in 2017:
    eda9

    The above line plot in Tableau depicts the Month Wise Weather distribution in Denver in 2017.

  10. Cities clustered by Snowy Weather:
    eda10

    The above map in Tableau depicts clusters of the cities based on the frequency of snowy weather. Cities in the blue cluster experienced snowy weather most often, followed by cities in the orange and red clusters.