Menu
Games Sales Analysis
Data Analysis with RStudio
This project analyzes the games sales of X-Box One and PlayStation 4 that will help to define interesting metrics in answering business questions as follow:
- List the Top 10 Games that make the most sales.
- What Genre that make the most sales?
- Which Game Publisher is leading in the market?
- What’s the average of game sold in each continent?
- Display the total sales by year.
Image Source: https://monkeylearn.com/blog/data-wrangling/.
The pre-processing steps involved in this project are as follow:
1. Collect 2 datasets from Kaggle using appropriate functions and library packages.
2. Merge/combine the datasets into one and select useful variables to work with.
3. Discover the data to provide description of each variable.
4. Scan the data to find any missing values, errors, or special values, then find a suitable method to deal with it.
5. Scan for any outliers using summary statistics, apply appropriate plots on the data, explain the methodology and transform it for better insight.
6. Perform Tidy and Manipulate technique such as mutate a new variable with calculation, reshape the data from untidy to tidy format for better analysis.
1. Collect 2 datasets from Kaggle using appropriate functions and library packages.
2. Merge/combine the datasets into one and select useful variables to work with.
3. Discover the data to provide description of each variable.
4. Scan the data to find any missing values, errors, or special values, then find a suitable method to deal with it.
5. Scan for any outliers using summary statistics, apply appropriate plots on the data, explain the methodology and transform it for better insight.
6. Perform Tidy and Manipulate technique such as mutate a new variable with calculation, reshape the data from untidy to tidy format for better analysis.
1. Discovering
The first dataset is X-box One Games Sales, collected from Kaggle Open Data which can be downloaded from the link: https://www.kaggle.com/sidtwr/videogames-sales-dataset?select=XboxOne_GameSales.csv.
And the second dataset is Playstation 4 Games Sales, collected also from Kaggle Open Data which can be downloaded from the link: https://www.kaggle.com/sidtwr/videogames-sales-dataset?select=PS4_GamesSales.csv.
Both datasets have 9 main variables and described as follow:
The two datasets are then merged into one using the full_join(), based on Game, Year, Publisher, and Genre.
After joined the datasets, we ended up with only 12 variables where:
4 variables are Game, Year, Genre, and Publisher,
4 variables are the X-box One Sales in each continent and,
4 variables are the Playstation 4 sales in each continent.
Below is the table of the combined dataset before data cleaning process:
And the second dataset is Playstation 4 Games Sales, collected also from Kaggle Open Data which can be downloaded from the link: https://www.kaggle.com/sidtwr/videogames-sales-dataset?select=PS4_GamesSales.csv.
Both datasets have 9 main variables and described as follow:
- Game: name/title of the game
- Year: year when the game was published
- Genre: genre of the game
- Publisher: name of the company that publish the game
- North America: sales in North America (in Million)
- Europe: sales in Europe (in Million)
- Japan: sales in Japan (in Million)
- Rest of World: sales in the Rest of the world (in Million)
- Global: sales made globally
The two datasets are then merged into one using the full_join(), based on Game, Year, Publisher, and Genre.
After joined the datasets, we ended up with only 12 variables where:
4 variables are Game, Year, Genre, and Publisher,
4 variables are the X-box One Sales in each continent and,
4 variables are the Playstation 4 sales in each continent.
Below is the table of the combined dataset before data cleaning process:
2. Structuring
We need to check the data structure of the joined table to ensure that the variables have appropriate data format and will be ready for analysis.
There are combination of numeric and characters datatypes. Some of them need to be converted as follow:
1. Year: convert from Char to Factor, relabelled and ordered as True.
2. Genre: convert from Char to Factor, the order is not important.
There are combination of numeric and characters datatypes. Some of them need to be converted as follow:
1. Year: convert from Char to Factor, relabelled and ordered as True.
2. Genre: convert from Char to Factor, the order is not important.
3. Enriching
Data Wrangling steps don't always have to be sequential as depicted on the picture above. As long as the Data Analyst understand the business goal and workflow of the analysis, it is fine to do it the other way around. We will perform the data enriching first then data cleaning after.
Enriched data will help create personalized relationships that provide a more holistic understanding of the business’ needs.
Noticed that the joined table ended up with 2 similar continent sales (e.g. North America.x for X-box One and North America.y for Playstation 4). Since we are interested in the sales from both consoles, we need to combine the 2 continent sales into one using the mutate() function and call the new variables as Continent_Total_Sales (eg. North_America_Total_Sales, Europe_Total_Sales, Japan_Total_Sales, and Rest_Of_World_Total_Sales).
For better analysis, we need to manipulate it by following the tidy data principle. The continent sales are currently in wide format, we can unite these and have one variable that represent as continent.
In case if we are interested in the total sales only regardless of where the games were sold, having four continents may be difficult to see. We can reshape the data to have only 2 variables; Continents and Sales.
Below are the comparison of Wide format and Long format tables. By reducing certain similar variables that can be grouped as one, the analysis will become much easier.
Enriched data will help create personalized relationships that provide a more holistic understanding of the business’ needs.
Noticed that the joined table ended up with 2 similar continent sales (e.g. North America.x for X-box One and North America.y for Playstation 4). Since we are interested in the sales from both consoles, we need to combine the 2 continent sales into one using the mutate() function and call the new variables as Continent_Total_Sales (eg. North_America_Total_Sales, Europe_Total_Sales, Japan_Total_Sales, and Rest_Of_World_Total_Sales).
For better analysis, we need to manipulate it by following the tidy data principle. The continent sales are currently in wide format, we can unite these and have one variable that represent as continent.
In case if we are interested in the total sales only regardless of where the games were sold, having four continents may be difficult to see. We can reshape the data to have only 2 variables; Continents and Sales.
Below are the comparison of Wide format and Long format tables. By reducing certain similar variables that can be grouped as one, the analysis will become much easier.
4. Cleaning
Scan Missing values, special values, or obvious errors.
There are 3 variables with missing values detected using colsums() function. Below are the detail and solution to handle it:
1. Year: 1068 missing values
Solution: Since Year is a Factor datatype, a new level name "Unknown" is addded to replace the missing value. Assuming that the publisher did not provide it since the beginning.
2. Publisher: 1068 missing values
Solution: Replacing with string: "Unknown". Assuming that the publisher did not provide it since the beginning.
3. Sales: 2828 missing values
Solution: The missing values occurence are due to the combined datasets from XboxOne and Playstation 4 where both continents are duplicated. In this report, we will omit the value.
There are no special values or obvious errors found.
Scan Outliers
We are interested to observe the total games sales. For any non-sale games or 0 sales, we will remove it from the data.
Methodology:
1. The statistic summary is performed and grouped by Continents.
There are 3 variables with missing values detected using colsums() function. Below are the detail and solution to handle it:
1. Year: 1068 missing values
Solution: Since Year is a Factor datatype, a new level name "Unknown" is addded to replace the missing value. Assuming that the publisher did not provide it since the beginning.
2. Publisher: 1068 missing values
Solution: Replacing with string: "Unknown". Assuming that the publisher did not provide it since the beginning.
3. Sales: 2828 missing values
Solution: The missing values occurence are due to the combined datasets from XboxOne and Playstation 4 where both continents are duplicated. In this report, we will omit the value.
There are no special values or obvious errors found.
Scan Outliers
We are interested to observe the total games sales. For any non-sale games or 0 sales, we will remove it from the data.
Methodology:
1. The statistic summary is performed and grouped by Continents.
As can be seen, the Standard Deviation for Japan and Rest of the World are much lower than North America and Europe. This mean the market in North America and Europe performed much better.
2. Upon checking the outliers using boxplot, there seems to be so many outliers. We must keep this record as it is part of the sales data.
2. Upon checking the outliers using boxplot, there seems to be so many outliers. We must keep this record as it is part of the sales data.
5. Validating
Data validation is the process of authenticating your data and confirming that it is standardized, consistent, and high quality. Verify that it is clean and regularly structured. This process often uses automated programs to check the accuracy and validity of cells or fields by cross-checking data, although it may also be done manually.
6. Publishing
All the data points have been standardized, making the data easier to analyze. Let’s compare the raw data to the ones that have already been cleaned:
See how the data is more readable after performing data wrangling. Now, we can start analysing the data and answer the business questions as we determined at the beginning.
Analyze
1. List the Top 10 Games that make the most sales.
The result of our analysis shows that "Grand Theft Auto V" is the game with the highest sales with a total of $28,110,000.
2. What Genre that make the most sales?
2. What Genre that make the most sales?
The result of our analysis shows that "Shooter" is the genre with the highest sales with a total of $182,480,000.
3. Which Game Publisher is leading in the market?
3. Which Game Publisher is leading in the market?
The result of our analysis shows that "Activision" is the top leading Publisher with a total sales of $105,720,000.
4. What’s the average game sold in each continent?
4. What’s the average game sold in each continent?
5. Display the total sales by year.
Other Visualization
Conclusion
European countries perform as the best sales among the other continents followed by North America, the rest of the world and Japan. We can assume that many gamers are based in Europe, or possibly because there are more countries in Europe than the rest of the world.
For marketing strategy, it is probably the best to hold more events in European and North American countries.
European countries perform as the best sales among the other continents followed by North America, the rest of the world and Japan. We can assume that many gamers are based in Europe, or possibly because there are more countries in Europe than the rest of the world.
For marketing strategy, it is probably the best to hold more events in European and North American countries.