
The Data Platter

  • Writer: Sarita Upadhya
  • Dec 3, 2024
  • 6 min read

Updated: Apr 21

Data is the starting point and the most important aspect of our data science journey. It is available from multiple sources and in different forms. When we capture data, it is important that we collect it from a legitimate source and that it is of good quality. Traditional methods of collecting data, i.e. filling in forms or collecting survey responses manually, are not preferred because they are error-prone, time-consuming and expensive.


Data Capture & Types of Data

Banks, financial institutions and insurance companies advise us to fill in the forms available on their websites or mobile applications so that the data is complete, correct and quickly processed. Google Forms is also widely used by human resource departments and educational institutions to capture candidate or student information. QR codes are widely used in the supply chain and retail industries to capture stock and item information, thus easing business processes.

Today, Artificial Intelligence (AI) is also being used to capture data to reduce manual efforts and improve data quality.

• An IoT wearable device such as a smartwatch, connected to your mobile phone, can capture your personal data (sleep time, number of steps walked and more) on the go through an application installed on the phone.

• With IoT-enabled sensors on a farm and the use of drones, farmers can get real-time data on soil moisture levels, temperature changes, nutrient availability, pH levels, crop infections and more.

• Optical Character Recognition (OCR) scans and reads the number plates of cars, enabling toll collection on highways and law enforcement by traffic police. Banks also use it to extract the required information from bank statements for further analysis.

• While applying for a job, you may be given the option of having your resume read automatically so that the application form is filled in for you, ensuring a quick response time and good quality data. This is done using Natural Language Processing (NLP), a branch of AI used to process text information.

As we can see, the data being captured is large in volume and varied in content. It can be structured, i.e. organized in a table with rows and columns, or unstructured, i.e. in the form of text (e.g. emails, blogs, documents), images (e.g. photos, X-rays, MRI scans), video files or voice messages. Additionally, the velocity at which the data is received plays a key role. A supermarket may want to collect inventory data two or three times a week to manage its stock, while an e-commerce store wants to know what you just added to your cart so that it can recommend related products.

We also need to be mindful of the value the data brings to us in the current business scenario. Suppose a new movie will be released next week and we need to analyse past data to predict the footfall in the theatre, so that the management can plan the number of shows, the food inventory, the number of housekeeping staff to deploy and the ticket pricing. Data from 2019 or earlier may not give us the right insight: during Covid, people found several other sources of entertainment, and there are now many alternatives to choose from. What would be useful in our analysis is recent footfall data for movies of a similar category, the star cast of the movie, trailer reviews, the schedule of high-demand IPL or T20 cricket matches, new releases on OTT platforms, public holidays and more.

The complexity of dealing with data increases as we move from structured to unstructured formats, or from a fixed quantum of business data to a streaming data flow. As it increases, so does the complexity of the tools and techniques required to store, understand, relate and make use of the data, which raises the computational requirements and hence the cost. However, many pre-trained products are available in the market today that have already consumed and understood a large volume of data and can be repurposed by businesses for their specific needs.

Let us look deeper into structured data. In a structured data format, columns represent fields and rows contain the actual data. Each column will hold one of the following types of data:

• Numeric: Numeric data can be continuous, i.e. able to hold any value in the real number series, or discrete, where only whole numbers are used. For example, an “Amount” field can hold continuous values like 2500.75, while a “Number of Student” column will hold discrete values and can never be continuous.

• Categorical: For example, “Gender” with values “Male” and “Female”, or “Rating” with values “Low”, “Medium” and “High”. A categorical column may hold numeric values. For example, the “Rating” values “Low”, “Medium” and “High” may be represented as 0, 1 and 2. Here, however, 0, 1 and 2 are labels for the rating levels rather than actual quantities.

• Date: A date can be represented in date-only or date-time format.

It is important to understand the data in each column and the values it can hold, as the analysis we perform will differ accordingly. For example, to summarize numeric data we use measures like the sum, average, minimum and maximum, while for categorical data we work with the frequency of each category.
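The difference between summarizing a numeric column and a categorical column can be sketched in a few lines of Python, using only the standard library. The column values and function names below are made up for illustration and are not taken from any dataset in this article.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample columns (illustrative values only).
amount = [2500.75, 1200.00, 980.50, 4100.25]   # numeric, continuous
rating = ["Low", "High", "Medium", "High"]      # categorical


def summarize_numeric(values):
    """Return the usual summary measures for a numeric column."""
    return {
        "sum": sum(values),
        "average": mean(values),
        "minimum": min(values),
        "maximum": max(values),
    }


def summarize_categorical(values):
    """Return how often each category occurs in the column."""
    return dict(Counter(values))


numeric_summary = summarize_numeric(amount)
category_counts = summarize_categorical(rating)

print(numeric_summary)
print(category_counts)
```

Taking an average of the “Rating” column would not be meaningful even if its values were stored as 0, 1 and 2, which is why the categorical summary counts occurrences instead.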

Let us pick a small case study to get hands-on experience of working with data and drawing insights from it for decision making. We have some money to invest in equity, and we have the day closing stock prices of two companies for 15 days. We need to analyse the data to make our investment decision.

Sample Data of Day Closing Stock Price

Data Explanation: The table above lists the day closing stock prices of two companies, “C1” and “C2”, in the columns “C1_Day_Closing_Price” and “C2_Day_Closing_Price” respectively. “C1_Return (%)” is a calculated field: the percentage increase or decrease in price compared to the previous day. The same calculation has been done for company “C2” and is available in the column “C2_Return (%)”. Are you able to make an investment decision yet?
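For readers who want to reproduce the calculated return fields, here is a minimal Python sketch of a day-over-day percentage change. The function name and the closing prices are hypothetical and are not the values from the table.

```python
def daily_returns(prices):
    """Percentage change of each closing price versus the previous day.

    The first day has no previous price to compare with, so the result
    has one entry fewer than the input.
    """
    return [
        (today - yesterday) / yesterday * 100
        for yesterday, today in zip(prices, prices[1:])
    ]


# Hypothetical closing prices, for illustration only.
closing_prices = [100.0, 102.0, 99.96, 101.0]
returns = daily_returns(closing_prices)

# Round for display; the unrounded values are kept in `returns`.
print([round(r, 2) for r in returns])
```

A positive value means the price closed higher than on the previous day, and a negative value means it closed lower.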

Let us represent the same numbers in a different way.

Data Visualization and Summary:

Trend analysis of the Day Closing Stock Price for C1 and C2

Data Summary

Observations:

  1. There is no major difference in the stock price trends of the two companies.

  2. The average stock price of C1 is higher than that of C2.

  3. Looking at the minimum, maximum and standard deviation, the C1 stock price deviates more from its average than the C2 stock price does.

  4. Over the 15 days, both C1 and C2 gave positive returns on 8 occasions.

  5. The C1 return has gone as low as -2.72% and as high as 2.55%.
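The measures behind observations like these can be computed with a short Python sketch. The prices below are hypothetical and the helper names are illustrative. The sketch also assumes the sample standard deviation (`statistics.stdev`), since the article does not say which variant was used.

```python
from statistics import mean, stdev


def summarize(prices):
    """Average, minimum, maximum and sample standard deviation."""
    return {
        "average": mean(prices),
        "minimum": min(prices),
        "maximum": max(prices),
        "std_dev": stdev(prices),  # assumption: sample, not population
    }


def pct_returns(prices):
    """Day-over-day percentage change in price."""
    return [(b - a) / a * 100 for a, b in zip(prices, prices[1:])]


def count_positive(returns):
    """Number of days on which the return was above zero."""
    return sum(1 for r in returns if r > 0)


# Hypothetical closing prices for two companies (illustrative only).
c1_prices = [100.0, 104.0, 98.0, 103.0, 95.0]
c2_prices = [80.0, 81.0, 80.5, 81.5, 82.0]

c1_summary = summarize(c1_prices)
c2_summary = summarize(c2_prices)
c1_positive_days = count_positive(pct_returns(c1_prices))
c2_positive_days = count_positive(pct_returns(c2_prices))

# A larger standard deviation means prices stray further from the average.
more_volatile = "C1" if c1_summary["std_dev"] > c2_summary["std_dev"] else "C2"

print(c1_summary, c1_positive_days)
print(c2_summary, c2_positive_days)
print("More volatile:", more_volatile)
```

In this made-up sample, the first series has both the higher average and the larger standard deviation, which mirrors the pattern described in observations 2 and 3.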

Analysis:

From the above observations, we can see that investing in C1 is riskier than investing in C2, because its price deviates more from its average. From a long-term investment point of view, it may be fair to invest in C2. However, if we have the risk appetite and are looking for short-term investments, we can invest in C1, as it may give us higher returns than C2.

In this analysis, we picked just 15 days of data. If the stock prices have not been on a roller coaster ride, we can consider the past 3 months of data for our analysis. We also picked only a few measures. We can further look at the distribution of the data and apply forecasting techniques to get more insights before making a decision.

Microsoft Excel is a powerful tool for exploring data: you can apply mathematical formulas, design charts, use pivot tables and more to do your analysis. Free tutorials from Microsoft can be your starting point.

So, if you are a student seeking admission to a bachelor's degree, start comparing the shortlisted colleges across different cities or countries on parameters like fees, faculty ratings, infrastructure, placement percentage, starting salary and more for the past 3-5 years before deciding on the final one. If you are in business, look at the processes that deal with data and check whether the advanced AI tools now available can automate or ease them, decreasing your response time and thus improving business efficiency.

 
 
 
