The Google Play Store is a digital distribution service that allows users to download apps developed with the Android software development kit. It offers a wide variety of applications, such as games, books, and maps. Some applications require payment, while others are free. According to Business of Apps, the Play Store has 2.56 million apps available. Distributing an app through the store has the potential to be a great success. For example, according to Statista, in 2020 Candy Crush Saga generated 473 million U.S. dollars in revenue across the Play Store and the App Store.
However, certain applications catch the public's eye more than others, so someone who wants to develop applications for the Google Play Store would be curious about which qualities lead to success. Therefore, I will create regression models that predict the number of installs an app would receive and the rating it would get, based on a variety of its attributes.
The dataset has 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver.
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from random import randint
# Read in the dataset from the csv file
google_play = pd.read_csv('googleplaystore.csv')
google_play.head()
The above table shows a section of the dataset. By calling shape, we can see the total number of entries (rows) in the dataset, as well as the number of columns. The format is (rows, columns).
google_play.shape
This means that at the time the dataset was created, there were 10,841 apps on the Play Store.
Now that we have our data, we must tidy it. Tidying the data makes it easier to analyze.
Remember that my aim is to analyze what factors make an app successful. I measure success in two ways: (1) the app's rating and (2) the number of installs it has. This means that some columns are not required to gauge an app's success; for example, we do not need to know its current version.
To do this, I will begin by removing columns that are not necessary for my analysis. Dropping columns is straightforward in Python: I call the drop() function with a list of the column names I want removed.
After dropping the unneeded columns, I standardize the names of my columns. By standardize, I mean making sure each column name is lowercase and, if a column name has multiple words, connecting the words with an underscore ("_").
# Drop columns not needed for this analysis
google_play.drop(['Size', 'Last Updated', 'Current Ver', 'Android Ver'], axis=1, inplace=True)
# Standardize column names
google_play.columns = ['app', 'category', 'rating', 'reviews', 'installs', 'type', 'price', 'content_rating', 'genres']
google_play.head()
Now the dataset has 9 columns of information that will help determine the success of an app. We can also see the standardized column names.
Data cleaning is exactly what it sounds like. Now that we have narrowed our data down to the columns we want, we need to make sure that the data within each row is in a form we can work with. That includes making sure the data is of the correct type. By calling google_play.info(), we receive information about our dataframe: the number of entries, the columns, the number of non-null elements in each, and the data type of each column.
google_play.info()
The only column that contains numeric values is rating. Everything else is stored as an object, such as a string. Columns such as category, type, and content_rating are categorical variables, so we expect them to be strings. However, we want columns such as installs, reviews, and price to be numeric.
You may notice that the installs column uses characters such as "+" and ",". For example, one entry may be 5,000+ installs. Similarly, the price column uses the "$" character. If we attempted to convert the installs and price columns to numeric values right now, they would all become NaN. That means we must first remove those characters. An easy way of removing unwanted characters is to replace them with an empty string.
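As a small illustration (using made-up example strings, not the dataset itself), converting the raw values with errors='coerce' shows how every entry would turn into NaN:
# Illustration only: converting the raw strings directly turns every value into NaN
raw_examples = pd.Series(['5,000+', '10,000+', '$4.99'])
print(pd.to_numeric(raw_examples, errors='coerce'))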
# Remove row 10472, which contains malformed data
google_play.drop([10472], inplace=True)
# Cleaning the Installs column
def cleanInstall(install):
    install = install.replace('+', '')
    install = install.replace(',', '')
    return install

google_play['installs'] = google_play['installs'].map(cleanInstall)
# Cleaning the Price column
def cleanPrice(price):
    return price.replace('$', '')

google_play['price'] = google_play['price'].map(cleanPrice)
google_play.head()
Now that we have removed those characters, we can convert the Reviews, Installs, and Price columns to numeric.
# Convert to Numerics
google_play['reviews'] = pd.to_numeric(google_play['reviews'])
google_play['installs'] = pd.to_numeric(google_play['installs'])
google_play['price'] = pd.to_numeric(google_play['price'])
Now these columns are numeric.
Missing data is exactly what it sounds like: any data that we would like to know, but do not have. Missing data in my dataset is represented by "NaN", which stands for "not a number".
missing = google_play[google_play.isnull().any(axis=1)]
missing
These are the entries that contain missing data. The majority of them are missing the rating column, and there is 1 entry that is missing both the rating and the type columns. (The malformed entry that was also missing its content rating was already dropped above.) For the time being, I will remove all entries that have missing values.
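Before dropping, it is worth noting an alternative I could have used: imputing the missing ratings with the median rating. A minimal sketch on a copy of the dataframe (not used in the rest of this analysis):
# Alternative to dropping (not used here): fill missing ratings with the median rating
imputed = google_play.copy()
imputed['rating'] = imputed['rating'].fillna(imputed['rating'].median())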
# Drops values with NaN
google_play.dropna(inplace=True)
# Reset the index after dropping rows
google_play.reset_index(drop=True, inplace=True)
google_play
Exploratory Data Analysis involves analyzing our data to observe patterns and trends. One nice way of doing this is by visualizing our data using graphs. Recall that we want to analyze which factors impact the rating and number of downloads an app gets.
To begin, we want to see the overall distribution of ratings and installs across the entire Play Store. I choose to graph the distributions using histograms, since a histogram makes it easy to see whether the data is normally distributed or skewed.
plt.hist(google_play['rating'], bins=20, edgecolor='black', linewidth=1.1)
plt.title('Distribution of Ratings in Play Store')
plt.xlabel('Rating')
plt.ylabel('Number of Apps')
plt.show()
The distribution of ratings is left-skewed, with most ratings being above 4.0. It would appear that most apps do not receive low ratings.
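As a quick numeric check of this visual impression, the skewness of the rating column can be computed; a negative value indicates a left skew:
# Negative skewness confirms the left skew seen in the histogram
print(google_play['rating'].skew())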
plt.hist(google_play['installs'], bins=30, edgecolor='black', linewidth=1.1)
plt.title('Distribution of Installs in Play Store')
plt.xlabel('Installs')
plt.ylabel('Number of Apps')
plt.show()
The distribution of installs in the Play Store is highly right-skewed. There also appear to be some outliers, with a very small number of apps close to 1.0e9 installs.
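To make the long right tail easier to see, one option is to re-plot the installs on a log scale. A minimal sketch (adding 1 before taking the log avoids issues with apps that report 0 installs):
# Re-plot installs on a log10 scale so the tail is visible
plt.hist(np.log10(google_play['installs'] + 1), bins=30, edgecolor='black', linewidth=1.1)
plt.title('Distribution of log10(Installs + 1) in Play Store')
plt.xlabel('log10(Installs + 1)')
plt.ylabel('Number of Apps')
plt.show()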
Does rating have any impact on the number of installs? For example, could an app with a higher number of installs have a lower rating? To observe this, I shall create a scatter plot of installs vs. rating.
plt.scatter(google_play['installs'], google_play['rating'])
plt.title("Installs vs Rating for Play Store")
plt.xlabel('Installs')
plt.ylabel('Rating')
plt.show()
It appears that the only apps with ratings below 3.0 also have low numbers of installs.
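As a rough way to quantify this, the correlation between the two columns can be computed; a value near 0 would suggest there is little linear relationship between installs and rating:
# Correlation between installs and rating
print(google_play[['installs', 'rating']].corr())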
Another interesting thing to explore is the performance of apps based on category. For example, do apps that involve art have better ratings or more installs than apps that involve beauty?
To begin, I want to find out how many apps belong to each category on the store. A nice way of visualizing data like this is with a pie chart.
# Create dataframe that has total count for each category
value_counts_category = google_play['category'].value_counts()
df1 = pd.DataFrame(value_counts_category)
df1 = df1.reset_index()
df1.columns = ['category', 'cat_count']
# Create dataframe with percent for each category
value_counts_cat_pct = google_play['category'].value_counts(normalize=True) * 100
df_temp = pd.DataFrame(value_counts_cat_pct)
df_temp = df_temp.reset_index()
df_temp.columns = ['category', 'cat_pct']
# Merge dataframes
df_cat = df1.merge(df_temp, how='inner', on='category')
df_cat.head()
Now that I have my data, I can begin creating my pie chart. However, there is an issue: each slice of a pie chart gets a separate color to make it easy to identify, but there is a limit on the number of distinct default colors. A pie chart with 33 slices would therefore have many repeating colors, making it hard to read.
Luckily, the pie chart function has a colors argument that accepts a custom list of colors. First, I will create a list of custom colors, then I will pass that list in when constructing my pie chart.
# Used this Stack Overflow post for help generating random colors: https://stackoverflow.com/questions/56290650/python-how-to-get-a-list-of-n-different-colors-from-a-colormap
colors = []
for i in range(len(df_cat)):
    colors.append('#%06X' % randint(0, 0xFFFFFF))
# Turn dataframe into Pie Chart
fig, pie_cat = plt.subplots(figsize=(8, 10))
pie_cat.pie(df_cat['cat_count'], startangle=90, colors=colors)
plt.title('Categories of Google Play Store')
pie_cat.axis('equal')
# Create label
# Used this stackoverflow post for help with legend: https://stackoverflow.com/questions/44076203/getting-percentages-in-legend-from-pie-matplotlib-pie-chart
labels = ['%s, %1.1f%%' % (l,s) for l, s in zip(df_cat['category'], df_cat['cat_pct'])]
pie_cat.legend(labels, loc='upper left')
plt.subplots_adjust(right=1.75)
Based on this, the majority of apps belong to the Family category, followed by Game and Tools. Beauty and Events have the fewest apps on the store. Fewer apps could mean less competition for an aspiring creator, so it is not necessarily a bad thing that those categories are smaller.
Now, I want to see how successful the apps of each category are. I will begin by plotting the average rating and the average number of installs by category.
cat_rating_mean = google_play.groupby('category', as_index=False)['rating'].mean()
cat_rating_mean.columns = ['category', 'rating_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(3, 10))
bar_cat.barh(cat_rating_mean['category'], cat_rating_mean['rating_mean'], align='center', height=0.6)
plt.title('Average Rating based on Category')
plt.show()
Recall that the overall distribution of ratings was left-skewed, with most ratings being around 4. Here, we can see that most categories have an average rating of around 4. Therefore, it seems that the categories perform about the same rating-wise.
cat_install_mean = google_play.groupby('category', as_index=False)['installs'].mean()
cat_install_mean.columns = ['category', 'install_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(3, 10))
bar_cat.barh(cat_install_mean['category'], cat_install_mean['install_mean'], align='center', height=0.6)
plt.title('Average number of Installs based on Category')
plt.show()
This chart is rather different from the previous one. Despite the average rating being very similar across categories, the average number of installs varies greatly. For example, we saw before that communication apps make up only 3.5% of the store, yet they have by far the highest average number of installs. Apps that involve beauty, autos and vehicles, medical topics, and parenting seem to have very low numbers of installs.
Another aspect that could have an impact on installs and rating is price. Is it the case that apps with a price tag turn away users, leading to fewer installs? Do apps with a price tag have higher ratings?
To begin, we will get the proportion of apps that are free vs paid.
# Create dataframe with the total count for each type (Free vs Paid)
value_counts_type = google_play['type'].value_counts()
df1 = pd.DataFrame(value_counts_type)
df1 = df1.reset_index()
df1.columns = ['type', 'type_count']
# Create dataframe with the percentage for each type
value_counts_type_pct = google_play['type'].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_type_pct)
df2 = df2.reset_index()
df2.columns = ['type', 'type_pct']
# Merge dataframes
df_price = df1.merge(df2, how='inner', on='type')
df_price
The vast majority of apps, about 93%, are free. However, does that mean free apps perform better than paid apps? I will perform a similar analysis as I did with category. I decided against turning this into a pie chart because the table alone is enough to understand how the Play Store is split between free and paid apps.
price_rating_mean = google_play.groupby('type', as_index=False)['rating'].mean()
price_rating_mean.columns = ['type', 'rating_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(6, 5))
bar_cat.bar(price_rating_mean['type'], price_rating_mean['rating_mean'], align='center', width=0.5)
plt.title('Average Rating based on Free or Paid')
plt.show()
Interestingly, paid apps have a slightly higher mean rating, though not by much. Now, let's look at installs.
price_install_mean = google_play.groupby('type', as_index=False)['installs'].mean()
price_install_mean.columns = ['type', 'install_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(6, 5))
bar_cat.bar(price_install_mean['type'], price_install_mean['install_mean'], align='center', width=0.5)
plt.title('Average Number of Installs based on Free or Paid')
plt.show()
Free apps appear to have massively more average installs than paid apps. However, there are far more free apps on the Play Store than paid apps to begin with.
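Because averages are heavily influenced by a handful of extremely popular apps, comparing medians is a useful sanity check here:
# Medians are less sensitive to the few apps with enormous install counts
print(google_play.groupby('type')['installs'].median())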
So far, I have analyzed price without taking category into account. However, it is possible that paid apps perform better or worse depending on which category they belong to.
# Create dataframe with counts of category and type
value_counts = google_play[['category', 'type']].value_counts()
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['category', 'type', 'type_count']
# Create dataframe with percentage of category and type
value_counts_pct = google_play[['category','type']].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_pct)
df2 = df2.reset_index()
df2.columns = ['category', 'type', 'type_pct']
df_cat_price = df1.merge(df2, how='inner', on=['category', 'type'])
df_cat = df_cat.merge(df_cat_price, how='inner', on='category')
df_cat
# Get average rating based on category and type
cat_price_rating_mean = google_play.groupby(['category', 'type'], as_index=False)['rating'].mean()
cat_price_rating_mean.columns = ['category', 'type', 'rating_mean']
df_cat = df_cat.merge(cat_price_rating_mean, how='inner', on=['category', 'type'])
# Get average installs based on category and type
cat_price_install_mean = google_play.groupby(['category', 'type'], as_index=False)['installs'].mean()
cat_price_install_mean.columns = ['category', 'type', 'install_mean']
df_cat = df_cat.merge(cat_price_install_mean, how='inner', on=['category', 'type'])
df_cat
This table shows the proportion of free vs. paid apps in each category. Across every category, there appear to be many more free apps than paid apps.
df_free = df_cat.loc[df_cat['type'] == 'Free']
df_paid = df_cat.loc[df_cat['type'] == 'Paid']
# Plot the average rating for free and paid apps side by side, by category
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_free['category'], df_free['rating_mean'])
ax.barh(df_paid['category'], df_paid['rating_mean'], height=0.5)
plt.title('Average Rating based on Category and whether Free or Paid')
plt.show()
The average rating for free apps of a particular category is shown in blue, and paid apps in orange. The first thing we notice is that some categories (e.g. Beauty, Events) do not have an average rating for paid apps, which means those categories contain no paid apps. Aside from that, the average ratings appear very similar. The News and Magazines category seems to have a higher average rating for paid apps, while the Social and Parenting categories have higher average ratings for free apps.
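To confirm that the missing orange bars really correspond to categories with no paid apps, I can list the categories that appear in the free table but not in the paid one:
# Categories that have free apps but no paid apps
print(set(df_free['category']) - set(df_paid['category']))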
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_free['category'], df_free['install_mean'])
ax.barh(df_paid['category'], df_paid['install_mean'], height=0.5)
plt.title('Average Number of Installs based on Category and whether Free or Paid')
plt.show()
This graph seems a bit strange, as we cannot see the average number of installs for paid apps of any category. So, let's try plotting just paid apps.
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_paid['category'], df_paid['install_mean'], height=0.5, color='orange')
plt.title('Average Number of Installs for Paid Apps based on Category')
plt.show()
Now it makes sense. Notice the x-axis on this plot compared to the last one: it is much smaller! Because free apps are installed so many more times, the bars for paid apps could not be seen on the previous graph.
Finally, I want to analyze how content rating impacts the average number of installs, and the average rating.
I will begin by checking how the Play Store is distributed in terms of content rating. For example, people of all ages use the store; does that mean most apps are rated Everyone?
# Create dataframe that has total count for each content rating
value_counts_content = google_play['content_rating'].value_counts()
df1 = pd.DataFrame(value_counts_content)
df1 = df1.reset_index()
df1.columns = ['content_rating', 'content_count']
# Create dataframe with percent for each content rating
value_counts_content_pct = google_play['content_rating'].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_content_pct)
df2 = df2.reset_index()
df2.columns = ['content_rating', 'content_pct']
# Merge dataframes
df_content_rating = df1.merge(df2, how='inner', on='content_rating')
df_content_rating
# Create Pie Chart
fig, pie_content = plt.subplots(figsize=(8, 10))
pie_content.pie(df_content_rating['content_count'], startangle=90,
labels=df_content_rating['content_rating'], autopct='%1.1f%%', rotatelabels=True)
plt.title('Content Ratings of Google Play Store', pad=60)
pie_content.axis('equal')
plt.show()
The majority of apps are rated Everyone, at close to 80%, followed by Teen, Mature, Everyone 10+, Adults Only, and finally Unrated, which contains only 1 app. It appears that apps overall want the biggest possible audience. That said, one might then expect Everyone 10+ to have more entries than Teen and Mature, but it does not. Now, let's check how content rating impacts the average rating and the average number of installs.
rating_mean = google_play.groupby('content_rating', as_index=False)['rating'].mean()
rating_mean.columns = ['content_rating', 'rating_mean']
# Create bar plot
fig, ax = plt.subplots(figsize=(6, 5))
ax.bar(rating_mean['content_rating'], rating_mean['rating_mean'], align='center', width=0.5)
plt.title('Average Rating based on Content Rating')
plt.show()
Like category and price before it, content rating shows a similar average rating across its levels.
install_mean = google_play.groupby('content_rating', as_index=False)['installs'].mean()
install_mean.columns = ['content_rating', 'install_mean']
# Create bar plot
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(install_mean['content_rating'], install_mean['install_mean'], align='center', width=0.5)
plt.title('Average Number of Installs based on Content Rating')
plt.show()
Now, this is rather surprising! I expected Everyone to have the highest average installs, because the majority of apps on the Play Store are rated Everyone. However, it comes in 3rd, behind Teen and Everyone 10+. Recall that Unrated only has 1 app, so it makes sense that it has the lowest average number of installs.
At the beginning, I mentioned that I would perform a regression, and we have now reached that point. Regression is a modelling approach that lets us fit a predictive model to our data. A predictive model would allow us to estimate how an app will perform before it is put on the Play Store.
Before I can perform a regression, I need to make sure that all my categorical variables have numeric representation. The variables I'm focusing on are the Category and Content Rating columns.
In order to convert the categorical variables into numeric variables, there are multiple options. I used this [tutorial](https://pbpython.com/categorical-encoding.html) for help when exploring my options.
The first option is One-Hot Encoding, which turns each category value into its own column and assigns 0 or 1 to each row. For example, Art and Design would become a column. A row would have 1 assigned in exactly one of the newly created columns and 0 everywhere else, because each row belonged to only one category to begin with.
# Example showing One Hot Encoding
dummies_category = pd.get_dummies(google_play['category'])
dummies_category
One problem with One-Hot Encoding is the number of columns it adds to the table. Creating dummies for the category variable alone would create 33 new columns, and I would then need to do the same for content rating.
Instead, I will use Label Encoding, which simply assigns each category a unique integer in alphabetical order. For example, the numeric representation of Art and Design would be 0, while Beauty would be 2.
# Create Label Encoder
label_encoder = LabelEncoder()
# Assign numerical value to Category and Content Rating in new column
google_play['category_num'] = label_encoder.fit_transform(google_play['category'])
google_play['content_rating_num'] = label_encoder.fit_transform(google_play['content_rating'])
google_play.head()
We expect the category_num column to have 33 unique values and the content_rating_num column to have 6 unique values. I will check to make sure that the encoding is correct.
# Checking number of unique numerical values for "category"
tmp = google_play['category_num'].unique()
tmp.sort()
tmp
# Checking number of unique numerical values for "content rating"
tmp = google_play['content_rating_num'].unique()
tmp.sort()
tmp
It would appear that category and content rating were transformed into numeric variables correctly. Now, I can begin to perform regression.
To begin, we must perform feature selection, i.e. choosing which input variables to use when developing a predictive model. For my model, I selected reviews, installs, price, category, and content rating as my features. These features are stored in the X variable, while y holds the variable I want to predict; in this case, that is rating.
After selecting features, I must split my data into training and testing sets, using scikit-learn's train_test_split. I make my test set slightly smaller, at 33%, and set random_state to 42 so that the split is reproducible.
# Select features
# More information about feature selection: https://machinelearningmastery.com/feature-selection-for-regression-data/
features = ['reviews', 'installs', 'price', 'category_num', 'content_rating_num']
X = google_play[features]
y = google_play['rating']
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print('Training', X_train.shape, y_train.shape)
print('Testing', X_test.shape, y_test.shape)
Training on 6275 entries, and testing on 3091 entries.
I've decided to fit my data with both a Linear Regression and a Random Forest Regression to determine which one works better. To compare them, I shall calculate the mean squared error for each model.
# Select model and fit it
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)
# Predict on test data
y_predict_linear = model_linear.predict(X_test)
print(mean_squared_error(y_test, y_predict_linear))
model_forest = RandomForestRegressor()
model_forest.fit(X_train, y_train)
y_predict_forest = model_forest.predict(X_test)
print(mean_squared_error(y_test, y_predict_forest))
Interestingly, the mean squared error is very similar for both models, with the Random Forest model's error being slightly lower.
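To put these numbers in context, one can compare them against a naive baseline that always predicts the mean training rating; a useful model should have a lower error than this:
# Naive baseline: always predict the mean rating from the training set
baseline_prediction = np.full(len(y_test), y_train.mean())
print(mean_squared_error(y_test, baseline_prediction))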
Now, if we feed in the features of an app, both models should be able to predict what rating the app would receive.
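To illustrate, here is a prediction for a single hypothetical app; the feature values below are made up purely for this example:
# Hypothetical app: 50,000 reviews, 1,000,000 installs, free, with made-up encoded
# values for category and content rating
example_app = pd.DataFrame([[50000, 1000000, 0.0, 14, 1]], columns=features)
print(model_linear.predict(example_app))
print(model_forest.predict(example_app))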
I will now perform a very similar analysis for the number of installs.
# Select features
features = ['rating', 'reviews', 'price', 'category_num', 'content_rating_num']
X = google_play[features]
y = google_play['installs']
# Split into training and testing data
X_train_install, X_test_install, y_train_install, y_test_install = train_test_split(X, y, test_size=0.33, random_state=42)
model_lin_install = LinearRegression()
model_lin_install.fit(X_train_install, y_train_install)
y_predict_lin_install = model_lin_install.predict(X_test_install)
print(mean_squared_error(y_test_install, y_predict_lin_install))
model_for_install = RandomForestRegressor()
model_for_install.fit(X_train_install, y_train_install)
y_predict_for_install = model_for_install.predict(X_test_install)
print(mean_squared_error(y_test_install, y_predict_for_install))
Both predictive models for rating had very low mean squared error, but for installs the mean squared error is extremely high for both models. This suggests that both models may have trouble predicting the number of installs an app will receive.
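One way to make this error more interpretable (sketched below, not a full evaluation) is to predict log10 of the installs instead of the raw count, so that a miss by a factor of ten counts the same whether an app has thousands or millions of installs:
# Sketch: model log10(installs + 1) so the huge range of install counts
# does not dominate the squared error
y_log = np.log10(google_play['installs'] + 1)
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X, y_log, test_size=0.33, random_state=42)
model_log = RandomForestRegressor()
model_log.fit(X_train_log, y_train_log)
print(mean_squared_error(y_test_log, model_log.predict(X_test_log)))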
I set out to explore the overall landscape of the Play Store. For example, I discovered that the majority of apps belong to the Family category and that the vast majority of apps are free. Throughout this analysis, I attempted to determine what factors influence an app's success on the Play Store, looking primarily at ratings and install numbers.
There are numerous improvements that could be made, as well as further questions to be answered. As we saw above, the regression model for the number of installs could be fine-tuned to achieve a lower mean squared error.
There are also other questions that I did not explore here. I did not dive into the genres of apps; perhaps certain genres are more successful than others. We could also narrow our focus to the most popular apps and analyze what makes them popular. Finally, we could refine the charts I created and perhaps use them to explore more information. This is a rich dataset, and there is plenty left to explore.