Hi there,
It's been a while since I last updated this blog. But from now on, I'm happy to share with you several interesting things I have learned since the day I started my career in data.
Okay, fasten your seatbelt and follow me to pick up some interesting insights.
Introduction
This analysis was done back when I applied for my previous job. Because of my procrastination, I'm only now able to share it with you. But trust me, it's totally worth it.
Back on the day I received the entry test from the employer, I was quite surprised by the potential of this dataset. And thanks to the brief requirement ("Analyse this dataset"), I had the chance to discover and present interesting insights.
The following attachment is the analysis that I presented to the employer. If you just want the insights, take a look at it and come back to review this post later.
Business Performance Analysis by Tuan Anh Le
Okay, let’s take a look at our dataset.
Dataset
Provided by the employer, the dataset contains purchasing details of customers of a company in the US. I later found out that this is the Super Store dataset that is quite familiar to Tableau developers. Let's scratch the tip of this iceberg.
The provided dataset has:
- 9994 observations (rows)
- of over 5000 orders
- from 793 customers
- 17 fields (columns)
- covering 07/01/2015 – 31/01/2018
Next, I will clean this stuff.
Data Cleaning
Following this guidance from Tableau, I cleaned this dataset up by:
1. Remove duplicate/irrelevant observations
While working with data, you'll find that duplicates are quite common. Duplicated observations may come from the backend, but they happen most often during data collection. When you combine data from multiple sources or departments, there are plenty of opportunities to create duplicate data. The process of removing these duplicates is called de-duplication.
When analyzing this dataset, I used R to find duplicates and figured out that there is 1 duplicated row. A more versatile approach is to use SQL or Tableau to find these duplicates: determine the key fields, then group by them (SQL) or fix on them (Tableau LOD calculation).
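To illustrate, here is a minimal sketch in R of what that duplicate check could look like (the file name superstore.csv is my placeholder for illustration, not the actual file I received):

# Load the dataset (file name is a placeholder)
orders <- read.csv("superstore.csv", stringsAsFactors = FALSE)

# Count rows that are exact duplicates of an earlier row
sum(duplicated(orders))

# Inspect every row involved in a duplication
orders[duplicated(orders) | duplicated(orders, fromLast = TRUE), ]

# De-duplicate: keep only the first occurrence of each row
orders_clean <- orders[!duplicated(orders), ]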
2. Fix structural errors
Structural errors can mess up your fields' formats and naming conventions, or introduce typos. And if you combine multiple data sources, there is a chance that people used different kinds of annotations (e.g. you may find "N/A", "Not Applicable" and "" all appearing, but they should be treated as the same category).
In this case, there was a typo in one of the headers, so I renamed it.
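As a rough sketch in R, both kinds of fix could look like this (the misspelled header "Costumer.Name" and the messy Region values are made-up illustrations, not what the actual file contains):

# Rename a misspelled header (the typo shown here is hypothetical)
names(orders_clean)[names(orders_clean) == "Costumer.Name"] <- "Customer.Name"

# Collapse different "not available" annotations into a single NA value
orders_clean$Region[orders_clean$Region %in% c("N/A", "Not Applicable", "")] <- NA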
3. Filter unwanted outliers
Removing outliers is something to consider while cleaning your data, but only do it if you have a legitimate reason, such as improper data entry.
E.g: You're handling a dataset of customers' birthdays and there are a few people with DOB 01/01/1900. I'm pretty sure that none of your customers have this date of birth.
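In R, that birthday example could look like this minimal sketch (the customers table is entirely made up, since the Super Store data has no date-of-birth field):

# Hypothetical customer table with a placeholder date of birth
customers <- data.frame(
  name = c("Ann", "Bob", "Cal"),
  dob  = as.Date(c("1985-06-12", "1900-01-01", "1992-03-30"))
)

# Drop the row carrying the obviously invalid placeholder value
customers <- subset(customers, dob != as.Date("1900-01-01"))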
4. Handle missing data
You also should not ignore missing data, because many algorithms will not accept missing values. For example, say the Quantity column of your data has N/A values and you decide to multiply it by the Price column (which has no N/A values) to create a new Sales column. Now your Sales column has N/A values too. There are a couple of ways to deal with missing data.
- The first option is to drop observations that have missing values. Though doing this loses information, removing a couple of rows from a million-row dataset doesn't seem too bad.
- The second is to impute missing values based on other observations. This can be time-consuming, and there is a risk of losing the integrity of the data, because you are relying on assumptions rather than actual observations.
In my case, there were 11 out of 9994 observations that did not have a Postal Code, so I decided to remove these records.
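In R, that drop could look something like this sketch (assuming the column is imported as Postal.Code and the blanks are read in as NA):

# Count observations missing a Postal Code, then drop them
sum(is.na(orders_clean$Postal.Code))
orders_clean <- orders_clean[!is.na(orders_clean$Postal.Code), ]

# Sanity check: should now report 9982 rows
nrow(orders_clean)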
5. Validate and QA
As people say, garbage in, garbage out. After taking the garbage out, you should be able to answer these questions:
- Does the data make sense?
- Does the data follow the appropriate rules for its field?
- Does it prove or disprove your working theory, or bring any insight to light?
- Can you find trends in the data to help you form your next theory?
- If not, is that because of a data quality issue?
In this case, I only have a vague notion of what this dataset is about, so I'll leave these questions to the end. And after the cleaning process, let's sum up what I've got:
- 9982 observations (rows)
- of over 5000 orders
- from 793 customers
- 17 fields (columns)
- covering 07/01/2015 – 31/01/2018
In the upcoming post, I will demonstrate interesting insights that I discovered in this dataset.
Thank you for visiting this site.
Happy reading ༼つ ் ▽ ் ༽つ