Identify Patterns in Data
- Sarita Upadhya
- Jun 10
- 6 min read
Data available to us for analysis may have challenges related to quality, completeness, relevance, etc. Sometimes the required information has not been collected at all or has been only partially collected. So we need to start with whatever reliable information is available and see whether we can identify patterns in the data and make it useful for the business. Unsupervised learning techniques such as clustering are powerful tools for identifying these patterns. In this article, data from an e-commerce website is analysed to identify patterns and extract information relevant for business decisions.
About Data:
The dataset is taken from the UC Irvine Machine Learning Repository and is available here. The “Online shoppers purchasing intention” dataset has details of 12,330 sessions, and each session corresponds to a different user. The information was collected over a period of one year.
Sr # | Feature/Column | Description |
1 | Administrative | Represents the number of administrative pages visited by the visitor in that session |
2 | Administrative_Duration | Represents total time spent by the visitor in administrative pages |
3 | Informational | Represents the number of informational pages visited by the visitor in that session |
4 | Informational_Duration | Represents total time spent by the visitor in informational pages |
5 | ProductRelated | Represents the number of ProductRelated pages visited by the visitor in that session |
6 | ProductRelated_Duration | Represents total time spent by the visitor in ProductRelated pages |
7 | BounceRates | Pages that do not meet the engagement criteria during a session are considered bounces |
8 | ExitRates | The number of exits from the page divided by the total number of page views |
9 | PageValues | The average value for a web page that a user visited before completing an e-commerce transaction |
10 | SpecialDay | The closeness of the site visiting time to a specific special day in which the sessions are more likely to be finalized with transaction |
11 | Month | Month of the year |
12 | OperatingSystems | Operating System Used |
13 | Browser | Browser Used |
14 | Region | Region where the session was activated |
15 | TrafficType | Online traffic type when session was active |
16 | VisitorType | Returning or new visitor |
17 | Weekend | Session active during weekend or not |
18 | Revenue | Session leading to a Sale |
The table above is the data dictionary explaining the features or columns available in the dataset.
Observation:
The e-commerce site has pages that provide administrative information, information about the e-commerce website and product related information.
The time spent by users in each of these pages has been captured in the data.
Information regarding the bounce rate, exit rate and average page value for each session is also available.
The “Revenue” field indicates whether the user made a purchase or not.
From the above list we see that there are a total of 18 features out of which 10 are numerical and 8 are categorical data types.
Below is the statistical summary of the numeric and categorical information in the data.
Sr # | Feature/ Column | mean | Std | min | 25% | 50% | 75% | max |
1 | Administrative | 2.32 | 3.32 | 0.0 | 0.0 | 1.0 | 4.0 | 27.0 |
2 | Administrative_Duration | 80.82 | 176.78 | 0.0 | 0.0 | 7.5 | 93.26 | 3398.75 |
3 | Informational | 0.50 | 1.27 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 |
4 | Informational_Duration | 34.47 | 140.75 | 0.0 | 0.0 | 0.0 | 0.0 | 2549.38 |
5 | ProductRelated | 31.73 | 44.48 | 0.0 | 7.0 | 18.0 | 38.0 | 705.0 |
6 | ProductRelated_Duration | 1194.75 | 1913.67 | 0.0 | 184.14 | 598.94 | 1464.16 | 63973.52 |
7 | BounceRates | 0.02 | 0.05 | 0.0 | 0.0 | 0.003 | 0.016 | 0.2 |
8 | ExitRates | 0.04 | 0.05 | 0.0 | 0.014 | 0.025 | 0.05 | 0.2 |
9 | PageValues | 5.89 | 18.57 | 0.0 | 0.0 | 0.0 | 0.0 | 361.76 |
10 | SpecialDay | 0.06 | 0.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Observation:
From the above summary we see that users spend the most time on product-related pages.
Bounce and exit rates range from 0% to 20%.
“SpecialDay” ranges from 0 to 1, with higher values indicating that the site was visited closer to a special day.
Sr # | Index | Unique | Top | Frequency |
1 | Month | 10 | May | 3364 |
2 | OperatingSystems | 8 | 2 | 6601 |
3 | Browser | 13 | 2 | 7961 |
4 | Region | 9 | 1 | 4780 |
5 | TrafficType | 20 | 2 | 3913 |
6 | VisitorType | 3 | Returning_Visitor | 10551 |
7 | Weekend | 2 | false | 9462 |
8 | Revenue | 2 | false | 10422 |
Observation:
Data has been captured from 8 different operating systems across 13 different browsers and 20 traffic types.
86% (10551/12330) of the sessions captured in the data are from returning visitors to the website.
85% (10422/12330) of the sessions have the “Revenue” field as false, indicating that no purchases were made in these sessions.
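The two summary tables above look like the output of pandas `describe`. A minimal sketch of how such summaries could be produced is shown here, assuming the data is loaded into a pandas DataFrame. The tiny `sessions` frame is made up for illustration and only borrows column names from the data dictionary; it is not the real dataset.

```python
import pandas as pd

# Made-up stand-in for the real data (assumption: the UCI CSV would
# normally be loaded with pd.read_csv; that step is not shown here).
sessions = pd.DataFrame({
    "ProductRelated_Duration": [0.0, 120.5, 598.9, 1464.2],
    "BounceRates": [0.2, 0.0, 0.003, 0.016],
    "VisitorType": ["Returning_Visitor", "New_Visitor",
                    "Returning_Visitor", "Returning_Visitor"],
    "Revenue": [False, False, True, False],
})

# Numeric columns: count, mean, std, min, quartiles and max.
numeric_summary = sessions.describe().T

# Non-numeric columns: count, unique, top (most frequent value) and freq.
categorical_summary = sessions.describe(exclude="number").T

# Share of sessions that belong to the most frequent visitor type.
returning_share = categorical_summary.loc["VisitorType", "freq"] / len(sessions)

print(numeric_summary)
print(categorical_summary)
```

The same ratio of `freq` to the number of rows is how percentages such as 10551/12330 are obtained.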
Objective:
Analyse the 85% of sessions where the user did not make a purchase, and identify patterns in user behaviour that can be used for targeted marketing, thus helping to increase customer turnaround.
We therefore proceed with our analysis on the 10422 sessions where the user did not make a purchase.
Analysis Using Visualization:

There are 3 types of visitors to the e-commerce website.
Out of 10422 visitors, 9081 correspond to “Returning_Visitor”, 1272 are “New_Visitor” and 69 are “Other”.
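The filtering of non-purchase sessions and the visitor-type counts could be reproduced along the following lines. This is only a sketch on a small made-up frame; the variable names `sessions` and `no_purchase` are assumptions for illustration.

```python
import pandas as pd

# Small made-up sample; column names follow the data dictionary.
sessions = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor",
                    "Other", "Returning_Visitor"],
    "Revenue": [False, False, True, False, False],
})

# Keep only the sessions that did not end in a purchase.
no_purchase = sessions[~sessions["Revenue"]]

# Count sessions per visitor type, most frequent first.
visitor_counts = no_purchase["VisitorType"].value_counts()
print(visitor_counts)
```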

“BounceRates” and “ExitRates” show a positive relation, i.e. as “BounceRates” increase, “ExitRates” also increase.
“BounceRates” of most visitors from the “Other” category is 0, while it ranges from 0 to 5% for “New_Visitor”.

The heatmap demonstrates the relation between the numeric features in the data.
As observed earlier, “BounceRates” and “ExitRates” show a strong positive correlation of 0.91.
There is a positive relation observed between:
“ProductRelated_Duration” & “Administrative_Duration”
“ProductRelated_Duration” & “Informational_Duration”
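The numbers behind such a heatmap are pairwise correlations. A minimal sketch of computing them with pandas follows, using made-up values (not the real data) and assuming Pearson correlation, which is the pandas default; the article does not state which coefficient was used.

```python
import pandas as pd

# Made-up values chosen so that bounce and exit rates move together.
no_purchase = pd.DataFrame({
    "BounceRates": [0.00, 0.01, 0.05, 0.10, 0.20],
    "ExitRates": [0.01, 0.02, 0.07, 0.12, 0.20],
    "ProductRelated_Duration": [1500.0, 900.0, 300.0, 60.0, 0.0],
    "Administrative_Duration": [200.0, 80.0, 20.0, 0.0, 0.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = no_purchase.corr()

# Look up a single pair from the symmetric matrix.
bounce_exit = corr.loc["BounceRates", "ExitRates"]
print(corr.round(2))
```

The resulting matrix can be passed to a plotting library to draw the heatmap.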
Using the numeric features analysed in the heatmap, we split the sessions in the data into 3 clusters using KMeans clustering.
Note:
Choice of the features to be given for clustering is a decision taken by the business.
As we have chosen numeric features, the KMeans algorithm should work well for this purpose.
Check the “within-cluster sum of squares” value for different numbers of clusters to decide how many clusters can be derived from the given data.
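The within-cluster sum of squares check is often called the elbow method. A minimal sketch with scikit-learn is given here, where `inertia_` holds that value after fitting. The data is synthetic (three made-up blobs), and standardising the features first is an assumption; the article does not say whether scaling was applied.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric session features: three blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Assumption: put features on a comparable scale before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Within-cluster sum of squares for k = 1..6.
wcss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    wcss[k] = model.inertia_

for k, value in wcss.items():
    print(k, round(value, 2))
```

The number of clusters is usually chosen where the curve flattens, i.e. where adding another cluster no longer reduces the value much.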
For the above data and the choice of features made, the analysis suggests going ahead with 3 clusters. Below is the statistical summary of the 3 clusters:
Cluster | Administrative_Duration | Informational_Duration | ProductRelated_Duration | Bounce Rates | Exit Rates | Page Values | Freq |
0 | 1.21 | 0.09 | 49.53 | 0.1685 | 0.1822 | 0.00 | 1022 |
1 | 365.06 | 236.89 | 3833.74 | 0.0071 | 0.0209 | 15.34 | 914 |
2 | 51.10 | 11.61 | 895.21 | 0.0100 | 0.0340 | 0.77 | 8486 |
Observation:
From the above table we see that the algorithm has split 10422 users/sessions into 3 clusters.
Cluster 0 has 1022 users, cluster 1 has 914 users while cluster 2 has 8486 users.
The average time spent on administrative, informational and product-related pages is highest for users in cluster 1. The average bounce rate and exit rate are low for these users, and their average page value is higher than in the other 2 clusters.
Cluster 0 users have higher bounce and exit rates, with minimal time spent on the e-commerce website.
Cluster 2, with 8486 users, corresponds to the major crowd, where users have spent some time on the website and visited a few pages.
From the above pattern, the 914 users in cluster 1, out of the 10422 in total, seem to be prospective customers for targeted marketing. The probability of turnaround of these customers is high, considering that they have already spent a good amount of time browsing the website.
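A per-cluster summary like the one above could be produced roughly as follows. This sketch uses synthetic sessions drawn from three made-up behaviour profiles and only two features for brevity; feature scaling and the KMeans settings are assumptions, not the article's exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Made-up profiles: (mean duration, sd, mean exit rate, sd, sessions).
profiles = [
    (50.0, 20.0, 0.18, 0.010, 40),    # quick exits
    (3800.0, 100.0, 0.02, 0.005, 30), # long, engaged sessions
    (900.0, 200.0, 0.03, 0.005, 80),  # the majority
]
frames = [
    pd.DataFrame({
        "ProductRelated_Duration": rng.normal(dur, dur_sd, n).clip(min=0),
        "ExitRates": rng.normal(exit_, exit_sd, n).clip(min=0),
    })
    for dur, dur_sd, exit_, exit_sd, n in profiles
]
no_purchase = pd.concat(frames, ignore_index=True)
features = ["ProductRelated_Duration", "ExitRates"]

# Assumption: standardise, then split into 3 clusters.
X_scaled = StandardScaler().fit_transform(no_purchase[features])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
no_purchase["cluster"] = kmeans.fit_predict(X_scaled)

# Mean of each feature per cluster plus the number of sessions.
summary = no_purchase.groupby("cluster")[features].mean()
summary["Freq"] = no_purchase["cluster"].value_counts()

# The cluster with the longest average duration is the engaged group.
engaged = summary["ProductRelated_Duration"].idxmax()
print(summary.round(4))
```

Cluster numbers assigned by KMeans are arbitrary, so the engaged group is identified here by its average duration rather than by a fixed label.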
Below is the visualization to check which region the users from cluster 1 belong to:

The largest share of cluster 1 users, 358 of 914, are from Region 1.
Marketing team can approach these users for targeted marketing and increase the turnaround.
Further analysis of cluster 1 users can be performed using other features to help the marketing team with more details about these users.
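Breaking a cluster down by another feature, as done for Region above, amounts to a filter followed by a count. A minimal sketch on made-up rows follows; the frame `clustered` and its values are hypothetical.

```python
import pandas as pd

# Made-up sessions with a cluster label and a region code.
clustered = pd.DataFrame({
    "cluster": [1, 1, 1, 0, 2, 1, 2, 1],
    "Region":  [1, 3, 1, 1, 2, 1, 1, 4],
})

# Count cluster 1 sessions per region, most frequent first.
cluster1_regions = (
    clustered.loc[clustered["cluster"] == 1, "Region"].value_counts()
)

# Region with the most cluster 1 sessions.
top_region = cluster1_regions.idxmax()
print(cluster1_regions)
```

Replacing `"Region"` with another column such as `"TrafficType"` or `"Month"` gives the same breakdown for that feature.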
From the above analysis, we see that from a dataset of 12330 users/sessions we were able to identify 914 users who have a high probability of making a purchase. The business can first focus on these 914 users, or on the 358 users from Region 1, to try to increase sales. With clustering, an unsupervised learning technique, patterns have been identified in the data to support informed decisions.
Additionally, the clustered data can be used to build a supervised learning prediction model, so that any new user who falls into the cluster 1 category can be given real-time support or recommendations, thus increasing the purchase rate of the e-commerce website.
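One possible shape for such a model is sketched here: cluster labels from KMeans become the target of a classifier. The data is synthetic, and the choice of a decision tree is an assumption made for illustration; the article does not name a specific model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for session features: three well-separated groups.
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(40, 2)) for c in centers])

# Step 1: unsupervised labels from clustering.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: supervised model trained to reproduce those labels.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

# Step 3: assign a new, unseen session to a cluster.
new_session = np.array([[8.2, 7.9]])
predicted = clf.predict(new_session)[0]
print(predicted)
```

In practice the model would be evaluated on held-out sessions before being used for real-time recommendations.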