The sports retail sector, situated at the intersection of dynamic market shifts, changing consumer preferences, and an expanding product landscape, must thrive in an environment marked by intense competition and evolving trends. Our research responds to this need by uncovering patterns and predictive insights in the Sports Products Sales Analysis dataset [1] in order to understand the factors that drive success in sports retail. The project unfolds in stages, starting with data cleaning and exploratory data analysis (EDA), followed by association rule mining, time series analysis with ARIMA, and regression modeling. Each stage contributes to our overarching goal of deriving actionable insights from the dataset.
Recognizing the potential of machine learning and advanced analytics to reshape operational strategies, our research aims to give sports retailers a deeper understanding of their market, so that they can navigate challenges and capitalize on emerging trends. Our primary research focus is on understanding the interactions that influence sporting goods sales. Using Apriori association analysis, we identify co-occurrence patterns between products, which reveal the complementary relationships that guide consumer purchasing decisions. This knowledge is useful for inventory control and serves as a guide for product packaging and advertising campaigns. To address temporal dynamics in the sports retail industry, ARIMA models are used to identify trends in sales. Retailers can improve supply chain management and mitigate risk by using these temporal insights to predict and respond to market fluctuations.
To analyze the relationships among sales-related variables, we employ several regression models (linear, ridge, lasso, and decision tree regression) to estimate operating profit. Subsequent sections describe the methodologies applied, present key findings, and conclude with a comparative analysis of these predictive models. Among them, the Decision Tree model stands out as the most accurate, achieving the best score (R^2) of 0.997. This high R^2 indicates that the model captures most of the variance in operating profit, which makes it a promising approach for the task.
The Sports Products Sales Analysis dataset employed in this project is sourced from the FP20 Analytics Challenge [1]. This dataset serves as a rich resource for delving into the complexities of the sports retail sector. It provides comprehensive insights across multiple dimensions, including details about retailers, sales specifics, geographical variables, and product characteristics.
Our approach encompasses a multifaceted methodology tailored to extract comprehensive insights from the sports retail dataset. The following sections delineate the sequential stages of our methodology:
To enhance the predictive capabilities, the following feature engineering techniques were applied:
The Exploratory Data Analysis (EDA) process involved a comprehensive examination of the dataset to extract meaningful insights and patterns. Initial steps included renaming columns for consistency and checking data types to ensure accurate representation. Subsequent analyses investigated potential data errors and the distributions of categorical variables such as retailer, region, state, city, product type, and sales method. Descriptive statistics, such as value counts and summary statistics, were used to gain a deeper understanding of the dataset's characteristics. The analysis revealed a diverse distribution of retailers, regions, states, cities, product types, and sales methods. For instance, 'Foot Locker' emerged as the most frequent retailer, 'West' as the dominant region, and 'Men's Street Footwear' as the predominant product type. Temporal patterns were explored by sorting the dataset by the 'invoice_date' column, which gives a chronological view of the sales data. Examination of the 'operating_profit' column uncovered a wide range of values, so operating profit was normalized using MinMaxScaler. Overall, the EDA process clarified the dataset's structure and distributions and laid the groundwork for the subsequent modeling and analysis.
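The normalization of operating profit mentioned above can be illustrated with a minimal, standard-library sketch. It applies the min-max formula x' = (x - min) / (max - min), which is what MinMaxScaler computes with its default feature range of (0, 1). The function name and the sample values are hypothetical and are not taken from the dataset.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no spread; mapping it to 0.0 is a choice made here.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical operating-profit values, not taken from the dataset.
operating_profit = [1200.0, 300.0, 45000.0, 7800.0]
normalized = min_max_scale(operating_profit)
```

After scaling, every value lies in [0, 1], so columns with very different ranges become comparable.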
The Apriori algorithm is a classic method for mining frequent itemsets and generating association rules from transactional data. The Apriori methodology was applied as follows:
An Autoregressive Integrated Moving Average (ARIMA) model is used to forecast operating profit over time.
The bar plots below display the total sales and operating profit for each retailer. West Gear has the highest total sales and operating profit.
The time series plot below depicts operating profit over time. It gives a visual impression of the series before statistical tests are used to assess its stationarity.
The pie chart below displays the distribution of product types, which are almost equally distributed.
The table below lists the frequent itemsets found with a minimum support of 0.3.
The table below lists the association rules found with a minimum confidence of 0.7.
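To make the two thresholds concrete, the following is a minimal, standard-library sketch of an Apriori-style search, not the implementation used in this project. It grows itemsets level by level, keeps those with support of at least 0.3, and then derives rules with confidence of at least 0.7, together with their lift. The function names and the toy transactions (including the placeholder item "Other Product") are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: extend itemsets one item at a time, keeping only
    candidates whose support meets the threshold."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    found = {}
    while current:
        kept = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        found.update(kept)
        # Join surviving k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (downward closure).
        k = len(next(iter(kept))) + 1 if kept else 0
        joined = {a | b for a in kept for b in kept if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(sub) in kept for sub in combinations(c, k - 1))
        ]
    return found


def association_rules(found, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, s in found.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = s / found[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / found[consequent]
                    rules.append((antecedent, consequent, s, confidence, lift))
    return rules


# Hypothetical transactions, not taken from the dataset.
transactions = [
    frozenset({"Men's Street Footwear", "Men's Athletic Footwear"}),
    frozenset({"Men's Street Footwear", "Men's Athletic Footwear", "Other Product"}),
    frozenset({"Men's Street Footwear", "Other Product"}),
    frozenset({"Men's Street Footwear", "Men's Athletic Footwear"}),
    frozenset({"Other Product"}),
]
frequent = frequent_itemsets(transactions, min_support=0.3)
rules = association_rules(frequent, min_confidence=0.7)
```

A lift above 1 means the consequent appears more often alongside the antecedent than its overall frequency would suggest.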
The scatter plot below depicts the support and confidence of the association rules, colored by lift value.
Below are the ADF and KPSS test results before differencing. The ADF test concluded that the data is stationary, whereas the KPSS test concluded that it is non-stationary. Conflicting results of this kind typically indicate a difference-stationary series, which motivates differencing.
The normalized operating profit was differenced once (d = 1) to make it stationary. Below is the time series plot of the differenced normalized operating profit.
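First-order differencing replaces each observation with its change from the previous one. A minimal sketch of the operation follows; the function name and sample values are hypothetical and not taken from the dataset.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the result is `lag` items shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical normalized operating-profit values with an upward drift.
normalized_profit = [0.10, 0.15, 0.22, 0.27, 0.35]
differenced = difference(normalized_profit)
```

Because the differenced series describes changes rather than levels, a forecast made on it has to be cumulatively summed to return to the original scale.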
Below are the ADF and KPSS test results after differencing. Both tests concluded that the differenced data is stationary.
In the autocorrelation plot below, 6 lags fall outside the 95% confidence interval. Therefore, q can be tuned from 0 to 6.
In the partial autocorrelation plot below, 6 lags fall outside the 95% confidence interval. Therefore, p can be tuned from 0 to 6.
The output below reports the best order (p, d, q), selected as the one with the lowest AIC score.
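For reference, the Akaike Information Criterion used to rank candidate orders has the standard definition below. The symbols are introduced here and do not appear in the original text: k is the number of estimated model parameters and \hat{L} is the maximized likelihood of the fitted model. Lower values are better, so the criterion rewards goodness of fit while penalizing additional parameters.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```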
The output below displays the MSE and MAE of the best ARIMA model.
The time series plot below displays the ARIMA model predictions with a rolling window of 100.
The table below displays the best score (R^2), best parameters, best MSE, and best MAE for each regression model.
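The three metrics reported for the regression models have standard definitions, sketched here in plain Python. The function names and toy values are hypothetical and unrelated to the reported results.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
scores = (r2_score(y_true, y_pred), mse(y_true, y_pred), mae(y_true, y_pred))
```

An R^2 close to 1 means the residual sum of squares is small relative to the total variance of the target, while MSE and MAE express the error in (squared) units of the target itself.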
The bar plot below compares the best score (R^2) of the regression models. Decision Tree is the best model, with the highest R^2 value of 0.9973.
In this analysis, we studied the prediction of operating profit in sports retail using three complementary approaches: Association Rule Mining, Time Series Analysis using ARIMA, and several Regression techniques. The findings from each methodology contribute to a holistic understanding of the factors influencing operating profit. The Apriori algorithm revealed patterns and associations among product categories. We identified key associations, such as the one between Men's Street Footwear and Men's Athletic Footwear, which provide valuable insights for inventory management and marketing strategies. With ARIMA models, we captured the temporal dynamics of normalized operating profit. The optimal ARIMA configuration was (2, 1, 5), and the results shed light on temporal trends that businesses can leverage for strategic decision-making. The regression analysis, featuring Ridge Regression, Lasso Regression, Linear Regression, and Decision Tree Regression, underscored the value of comparing diverse modeling approaches. Each algorithm offered its own advantages, and the hyperparameters of each were tuned for predictive performance. The decision tree model performed best, achieving an R^2 score of 0.997306, closely followed by Ridge Regression and Linear Regression. These results indicate that the models predict operating profit with high accuracy on this dataset. The insights derived from this study carry implications for businesses aiming to optimize their operating profit. The identified associations and predictive models can inform strategic decisions related to inventory management, marketing campaigns, and overall financial planning.
As we conclude this project, it is essential to acknowledge that the field of predictive analytics is dynamic, and continuous refinement and adaptation are crucial. Future work could explore ensemble methods and deep learning architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks (LSTMs), which can potentially capture more intricate temporal dependencies within the operating profit data.