Business data modelling for predictive analysis

This project focuses on exploring and analyzing customer data from the CRM system of GoodTasteMarket, a U.S.-based supermarket chain with a strong presence in Northern Europe, primarily in the food retail sector. The goal of this analysis is to uncover behavioral patterns and actionable insights to support future marketing strategies, product positioning, and customer engagement initiatives.

The dataset used in this analysis, marketing_campaign.csv, contains anonymized customer-level data that includes demographic attributes, transaction behavior, product preferences, and interaction history. The output of this project includes a structured exploratory data analysis (EDA), key visualizations, and derived variables that will inform segmentation and targeting efforts.

Dataset description

The following variables are included in the CRM dataset:

ID — Unique customer identifier
Age — Customer’s age (numeric)
Education — Educational background of the customer
- 0: Basic
- 1: Secondary
- 2: Bachelor’s
- 3: Master’s
- 4: Doctorate (PhD)
Marital_Status — Marital status
- 0: Single
- 1: Domestic Partnership
- 2: Married
- 3: Divorced
- 4: Widowed
Income — Annual household income
Children — Number of children in the household
Seniority — Tenure (in years) as a customer of GoodTasteMarket
Recency — Days since the last purchase
Complain — Indicates if the customer has submitted any complaints in the past 2 years (0 = No, 1 = Yes)
MntWines — Amount spent on wine in the past 2 years
MntFruits — Amount spent on fruits
MntMeatProducts — Amount spent on meat products
MntFishProducts — Amount spent on fish products
MntSweetProducts — Amount spent on sweets
MntGoldProds — Amount spent on premium (gold) products
MntTotalSpent — Total amount spent on all products (derived field)
NumWebPurchases — Total number of purchases via the website
NumCatalogPurchases — Purchases made through printed or digital catalogs
NumStorePurchases — In-store purchases
NumTotalPurchases — Total number of purchases across all channels (derived field)
NumWebVisitsMonth — Visits to the company website in the past month
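
The two derived fields can be reproduced from the raw columns. The following is a minimal sketch, assuming that MntTotalSpent is the plain sum of the six Mnt* category columns and NumTotalPurchases the sum of the three channel counts. The source does not spell out either formula, and the sample record is made up for illustration.

```python
# Assumption: the derived fields are plain sums of the columns listed above.
SPEND_COLUMNS = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]
CHANNEL_COLUMNS = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]


def add_derived_fields(row: dict) -> dict:
    """Return a copy of the row with the two derived totals added."""
    enriched = dict(row)
    enriched["MntTotalSpent"] = sum(row[col] for col in SPEND_COLUMNS)
    enriched["NumTotalPurchases"] = sum(row[col] for col in CHANNEL_COLUMNS)
    return enriched


# Hypothetical customer record, for illustration only.
sample = {
    "MntWines": 300, "MntFruits": 20, "MntMeatProducts": 150,
    "MntFishProducts": 30, "MntSweetProducts": 10, "MntGoldProds": 40,
    "NumWebPurchases": 4, "NumCatalogPurchases": 2, "NumStorePurchases": 6,
}
result = add_derived_fields(sample)
print(result["MntTotalSpent"], result["NumTotalPurchases"])
```

Returning a copy rather than mutating the input keeps the raw record available for later checks.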

This foundational analysis serves as the first step toward more advanced modeling efforts, including customer segmentation and campaign optimization.

A preliminary inspection of the dataset reveals several key findings regarding data quality and distribution:

The Income variable contains missing values — 2,216 non-null entries compared to the 2,240 present in other variables — indicating the need for imputation or exclusion in further analyses.
Significant skewness is observed in Income, MntWines, and MntMeatProducts, where the large gap between the third quartile and the maximum suggests the presence of high-spending customers distorting the mean.
Potential outliers are identified in variables such as Income, MntMeatProducts, MntGoldProds, MntTotalSpent, NumWebPurchases, and NumWebVisitsMonth, due to extreme values well beyond the upper quartile range. Additionally, an age value of 130 years is implausible and is most likely a data-entry error.
From a demographic perspective, the Children variable shows that most customers have exactly one child, while the Age distribution is concentrated between 46 and 64 years, indicating a mature customer base.

The distribution of the Income variable is positively skewed. The median is €51,382 and the mean lies relatively close to it, but the extreme maximum value of €666,666 points to the existence of outliers.

The central 50% of observations lies between €35,303 (Q1) and €68,522 (Q3), an interquartile range (IQR) of €33,219. However, the maximum value lies far above Q3, which suggests a few high-income entries that pull the mean upward and inflate the standard deviation.
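
To put the distance between Q3 and the maximum in numbers, the quartiles above can be fed into the common 1.5 × IQR rule (Tukey fences). This is only a sketch: the source does not say which rule was used to flag outliers, so the multiplier of 1.5 and the helper name tukey_fences are assumptions.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and maximum as reported in the text.
Q1, Q3, MAX_INCOME = 35_303, 68_522, 666_666

lower, upper = tukey_fences(Q1, Q3)
print(f"IQR: {Q3 - Q1}")
print(f"fences: {lower} to {upper}")
print(f"maximum flagged as outlier: {MAX_INCOME > upper}")
```

Under this rule, any income above the upper fence would be treated as a potential outlier, and the reported maximum falls well beyond it.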

This skewness highlights the importance of applying robust statistics or transformations when using income as a predictor in future modeling stages.
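
One transformation often used for right-skewed monetary variables is the logarithm. The sketch below compares skewness before and after a log1p transform on made-up incomes. The choice of log1p, the skewness helper, and the sample values are illustrative assumptions, not the method used in this analysis.

```python
import math


def skewness(values: list[float]) -> float:
    """Population (Fisher-Pearson) skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up incomes with one extreme value, mimicking the reported maximum.
incomes = [20_000, 35_000, 51_000, 68_000, 90_000, 666_666]
log_incomes = [math.log1p(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(f"raw skewness: {raw_skew:.2f}")
print(f"log1p skewness: {log_skew:.2f}")
```

On this sample the transformed values remain right-skewed, but noticeably less so than the raw values.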

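The dataset consists entirely of numeric variables. Among them, 20 are stored as integers, while only the Income variable is of float type. This may reflect its missing values rather than genuine decimal values, since some tools (pandas, for example) store a numeric column with missing entries as floating point.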

The only variable with missing values is Income, which contains 24 null entries out of 2,240 observations.

To handle these missing values, a practical and robust approach would be to impute them using the median. This method is simple to implement, minimizes the impact of outliers, and preserves the central tendency of the data without introducing significant bias.
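
As a rough illustration of median imputation, the sketch below fills missing entries in a made-up income list using only the Python standard library. The values and variable names are hypothetical; None stands in for a missing entry.

```python
from statistics import mean, median

# Made-up incomes; None marks a missing entry.
incomes = [35_000, None, 51_000, 68_000, None, 666_666, 42_000]

observed = [v for v in incomes if v is not None]
fill_value = median(observed)

# Replace each missing entry with the median of the observed values.
imputed = [fill_value if v is None else v for v in incomes]

print(f"missing before: {incomes.count(None)}")
print(f"median used for imputation: {fill_value}")
print(f"mean of observed values: {mean(observed):.0f}")
```

The single extreme value drags the mean far above the median, which is why the median is the safer fill value here.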

The most common age is around 50. Leaving anomalies aside, the youngest customers are around 18 years old and the oldest around 85. Some records show ages above 120, which is not realistic and most likely reflects incorrect values.
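
A simple way to isolate such records is a plausibility cutoff. In the sketch below, the cutoff of 100 years and the sample ages are assumptions chosen for illustration; the source does not state which threshold, if any, was applied.

```python
from statistics import mode

# Assumption: ages above this cutoff are treated as data-entry errors.
MAX_PLAUSIBLE_AGE = 100

# Made-up ages, including one impossible value like the one described.
ages = [18, 34, 50, 50, 62, 85, 130]

valid_ages = [a for a in ages if a <= MAX_PLAUSIBLE_AGE]
anomalies = [a for a in ages if a > MAX_PLAUSIBLE_AGE]

print(f"anomalous ages: {anomalies}")
print(f"most common valid age: {mode(valid_ages)}")
```

Whether flagged rows are dropped or corrected is a separate decision that depends on how the values were recorded.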

The correlation matrix shows the linear relationship between variables. The values range from -1 to 1. The closer a value is to 1, the stronger the positive linear relationship, meaning that if one variable increases, the other tends to increase as well. If the value is close to -1, the relationship is negative, meaning that when one variable increases, the other tends to decrease.
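
For reference, each cell of such a matrix is a Pearson coefficient. The sketch below computes it by hand on made-up values for five customers. The pearson helper and all numbers are hypothetical and only meant to show a positive and a negative case.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five customers.
wine_spend = [40, 120, 300, 520, 800]
total_spent = [150, 310, 560, 900, 1300]
income = [20_000, 35_000, 50_000, 70_000, 90_000]
web_visits = [9, 7, 6, 4, 2]

r_positive = pearson(wine_spend, total_spent)
r_negative = pearson(income, web_visits)
print(f"wine vs total spend: {r_positive:.3f}")
print(f"income vs web visits: {r_negative:.3f}")
```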

There is a strong positive correlation between MntWines and MntTotalSpent, which indicates that wine purchases account for a substantial share of total spending.

There is also a high correlation between NumStorePurchases and NumTotalPurchases, but this relationship is expected, as in-store purchases are part of the total purchases.

MntMeatProducts is also highly correlated with MntTotalSpent, for the same reason as wine.

On the other hand, there is no strong negative relationship between any of the variables. The most notable negative correlation is between Income and NumWebVisitsMonth, which suggests that the higher the annual income, the fewer monthly visits to the website.

There are outliers in all variables except ID, Recency, NumStorePurchases, and NumTotalPurchases.

Outliers have several effects. First, they distort descriptive statistics: the mean is pulled toward them, and although the median and quartiles are more robust, they are not entirely unaffected. Second, they make graphs and histograms harder to read. Finally, they can affect statistical models by skewing certain coefficients.

Once we have examined how the data is distributed, identified any missing or anomalous values, and carried out the necessary cleaning and transformations, we will conduct a more in-depth analysis of the variables we believe provide relevant insights into our customers’ behavior and characteristics. This information may include: spending based on age, the preferred sales channel used, or the education level of those who purchase a specific type of product.

Customer typology

Most of our customers are between 36 and 65 years old, are neither married nor in a registered partnership (widowed, single, and divorced are combined into one category), have at least one child, and possess a high level of education.
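
The grouping behind this profile can be sketched as follows. The marital status codes come from the dataset description; the age bands other than 36 to 65, the helper names, and the sample customers are assumptions made for illustration.

```python
from collections import Counter

# Codes from the dataset description: 0 Single, 1 Domestic Partnership,
# 2 Married, 3 Divorced, 4 Widowed.
NO_PARTNER_CODES = {0, 3, 4}


def age_band(age: int) -> str:
    """Assumed bands; only the 36-65 range comes from the text."""
    if age < 36:
        return "under 36"
    if age <= 65:
        return "36-65"
    return "over 65"


def partner_group(marital_status: int) -> str:
    """Collapse single, divorced, and widowed into one category."""
    return "no partner" if marital_status in NO_PARTNER_CODES else "partner"


# Made-up (age, marital_status) pairs.
customers = [(29, 0), (41, 2), (52, 3), (58, 4), (63, 1), (70, 2)]

band_counts = Counter(age_band(age) for age, _ in customers)
group_counts = Counter(partner_group(status) for _, status in customers)
print(band_counts)
print(group_counts)
```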

Customer behavior

Wine and meat are the star products, with wine the clear favorite. Most customers shop more often or spend more in physical stores, making them the predominant channel. We have also found that most customers visit the website between 4 and 8 times per month. Finally, customers in a domestic partnership have the highest seniority.

Conclusions

We observe that a higher level of education correlates positively with higher spending.

Although customers without children tend to spend more on average, the median expenditure is higher among families with children, making this segment particularly worth considering.
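
A mean and a median can point in opposite directions when a few large values sit in one group. The made-up figures below reproduce that pattern; they are not taken from the dataset.

```python
from statistics import mean, median

# Made-up total spend per customer, keyed by whether they have children.
spend = {
    "no children": [100, 150, 200, 250, 3_000],
    "with children": [300, 350, 400, 450, 500],
}

summary = {
    group: {"mean": mean(values), "median": median(values)}
    for group, values in spend.items()
}

for group, stats in summary.items():
    print(f"{group}: mean={stats['mean']}, median={stats['median']}")
```

Here one very high spender lifts the mean of the group without children, while the typical customer with children still spends more.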

Income is not a determining factor for spending levels, as the data shows that higher income does not imply higher expenditure.

Customers spend significantly more in physical stores compared to online.

The seniority variable suggests that most customers were acquired recently and do not yet have a long-standing relationship with the brand. The most loyal customers are those in a registered partnership, followed by married customers. This is supported by our analysis of marital status and the recency variable, as these groups exhibit the lowest recency values, that is, the fewest days since their last purchase.

Customers who have submitted complaints tend to have lower incomes.