Customer Analytics & Forecasting in RStudio

This project showcases a complete customer analytics pipeline developed in R. It covers exploratory data analysis, logistic regression modeling, decision tree construction, clustering, and time-series forecasting. Below, each section includes a code block and conclusions based on the outputs.

1. Descriptive Analysis

The script begins by inspecting the structure of the dataset and validating its quality. We check the number of columns and rows, the data types, and any missing values.
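
A minimal sketch of these checks in base R, assuming the data are loaded into a data frame called customers (that name, and columns such as Age, are illustrative; BikePurchase and TotalAmount follow the fields named in the conclusions):

dim(customers)                 # number of rows and columns
str(customers)                 # data type of every column
summary(customers)             # ranges and quartiles, which surface outliers such as Age = 110
colSums(is.na(customers))      # missing values per column

# Count columns by broad type (numeric / character / date)
table(sapply(customers, function(x) class(x)[1]))

# Distribution checks behind the conclusions below
hist(customers$TotalAmount, main = "Total spend", xlab = "EUR")
prop.table(table(customers$BikePurchase))                       # share of bike buyers
cor(customers$TotalAmount, as.numeric(customers$BikePurchase))  # correlation with spend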

Key conclusions
  • 8 numeric, 9 text and 2 date columns – no missing values.

  • Spending is heavily right-skewed; 50 % of customers bought a bike.

  • Age outlier at 110; most customers have 0–3 children and 1–2 cars; 67 % are homeowners.

  • Strong correlation between BikePurchase & TotalAmount (0.72).

2. Logistic Regression

We modelled the probability of a bike purchase (BikePurchase) using demographic & behavioural fields.
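
A sketch of the model fit with glm, assuming BikePurchase is coded 0/1; predictor names beyond TotalAmount (MaritalStatus, Education, Age, Children, Cars, HomeOwner) are assumptions based on the fields referenced in the conclusions:

set.seed(123)
idx   <- sample(nrow(customers), size = floor(0.7 * nrow(customers)))
train <- customers[idx, ]
valid <- customers[-idx, ]

logit_fit <- glm(BikePurchase ~ TotalAmount + MaritalStatus + Education +
                   Age + Children + Cars + HomeOwner,
                 data = train, family = binomial)
summary(logit_fit)
exp(coef(logit_fit))     # odds ratios, e.g. the effect of one extra euro of TotalAmount

# Accuracy on the validation set with a 0.5 cut-off
pred <- as.integer(predict(logit_fit, newdata = valid, type = "response") > 0.5)
mean(pred == valid$BikePurchase)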

Key conclusions
  • Significant ↑ effect: TotalAmount (+1.8 % per extra €) & being Single (+79 %).

  • Significant ↓ effect: holding a Graduate Degree (-51 %).

  • Accuracy ≈ 99.7 % on the validation set.

3. Decision Tree + Clustering

We built a CART tree and a k-means segmentation.
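
A corresponding CART sketch with rpart; the 70/30 split mirrors the logistic step and rpart.plot draws the tree shown further down:

library(rpart)
library(rpart.plot)

set.seed(123)
idx   <- sample(nrow(customers), size = floor(0.7 * nrow(customers)))
train <- customers[idx, ]
valid <- customers[-idx, ]
train$BikePurchase <- factor(train$BikePurchase)
valid$BikePurchase <- factor(valid$BikePurchase)

tree_fit <- rpart(BikePurchase ~ ., data = train, method = "class")
rpart.plot(tree_fit)                      # tree diagram

tree_pred <- predict(tree_fit, newdata = valid, type = "class")
mean(tree_pred == valid$BikePurchase)     # validation accuracy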

Decision tree conclusions
  • The CART model retained a single split: TotalAmount < €526 → 0 % bike buyers; ≥ €526 → buyers.

  • Accuracy ≈ 99.8 % (train) and 99.76 % (validation) – virtually identical to the logistic model, confirming spend is the dominant determinant.

Decision tree.png

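A k-means sketch under the same naming assumptions: the numeric fields are scaled, the elbow plot guides the choice of k, and average spend per cluster can then be compared with the bands listed below:

set.seed(123)
num_vars <- scale(customers[, c("TotalAmount", "Age", "Children", "Cars")])

# Elbow method: within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(num_vars, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Within-cluster sum of squares")

km <- kmeans(num_vars, centers = 3, nstart = 25)
table(km$cluster)                                                               # cluster sizes
aggregate(customers$TotalAmount, by = list(cluster = km$cluster), FUN = mean)   # spend per cluster
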
K-means conclusions

The optimal k was judged to be 3. The resulting clusters align almost entirely with spend bands:

• Cluster 1 (High spend > €4 107, ~16 %)
• Cluster 2 (Low spend < €1 389, ~63 %)
• Cluster 3 (Mid spend, ~21 %)

A follow-up decision tree shows that TotalAmount alone classifies cluster membership with high purity, reinforcing that spend level is the primary segment differentiator.

elbowmethod.png
kmeans.png

4. Time-Series Forecast (ARIMA)

We forecast the next two months of sales using auto.arima.
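
A forecasting sketch with the forecast package, assuming the aggregated series is stored as a ts object called daily_sales (the name, the 60-day horizon and the hold-out split are illustrative):

library(forecast)

# Hold-out check: Theil's U near 1 means accuracy close to a naive forecast
n        <- length(daily_sales)
train_ts <- window(daily_sales, end = time(daily_sales)[n - 60])
test_ts  <- window(daily_sales, start = time(daily_sales)[n - 59])
acc <- accuracy(forecast(auto.arima(train_ts), h = 60), test_ts)
acc["Test set", "Theil's U"]

# Final model on the full series, projected roughly two months ahead
fit <- auto.arima(daily_sales)
fc  <- forecast(fit, h = 60)
plot(fc)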

ARIMA conclusions
  • The model successfully identifies the recurring seasonal patterns and the overall trend in sales.

  • It tends to place the major peaks and valleys a few days ahead of when they actually occur.

  • With Theil’s U close to 1, its accuracy is similar to a simple “use yesterday’s value” forecast—useful for general direction but not for exact daily figures.

  • The final ARIMA, trained on the full series, offers a two-month forecast. When using it, allow for this slight timing offset.

prediction.png
comparison.png
