Analysis of the Google Play Store

By: Tara Wade

Introduction

The Google Play Store is a digital distribution service that allows users to download apps that were developed using the Android software development kit. It offers a wide variety of applications, such as, games, books, and maps. There are applications that require payment, while others are free. According to Business of Apps, the Play Store has 2.56 million apps available. Distribution of an app through the store has the potential of being a great success. For example, according to Statista, in 2020, Candy Crush Saga has made a revenue of 473 million U.S dollars across the Play Store and the App Store.

However, it seems that certain applications catch the public’s eye more than others. Therefore, someone who wants to develop applications for the Google play store, would be curious about what qualities led to success. Therefore, I will create a regression model that allows for the prediction of the numbers of installs an app would receive, and a prediction for the rating an app would get based on a variety of categories.

Data Collection

To begin, we must collect data from the Google Play Store. However, a Google Play Store dataset already existed on Kaggle.com that I am able to use.

The dataset has 13 columns:

  1. App - The name of the application
  2. Category - The main category app belongs to (e.g. Game)
  3. Rating - The rating the userbase assigned to the app
  4. Reviews - The number of reviews the app has
  5. Size - Size of the app (MB)
  6. Installs - The number of users that downloaded the app
  7. Type - Whether the app is free or paid
  8. Content Rating - Age group the app is targeting
  9. Genres - Genres app belongs to beside main category
  10. Last Updated - Date app last updated
  11. Current Ver - Current Version of app available on Play Store
  12. Android Ver - Minimum required Android version
In [1]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from random import randint
In [2]:
# Read in the dataset from the csv file
google_play = pd.read_csv('googleplaystore.csv')
google_play.head()
Out[2]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up

The above table shows a section of the dataset. By calling shape, we are able to see the total number of entries, or rows, in the dataset, as well as the columns. Format: (row, column)

In [3]:
google_play.shape
Out[3]:
(10841, 13)

This means that at the time the dataset was created, there was 10,841 apps on the Play Store.

Data Processing

Now that we have our data, we must tidy it. Tidying data makes it so the data can be analyzed easier.

Remember that my aim is to analyze what factors makes an app successful. I measure success by 2 measures: (1) the app's rating and (2) the number of installs an app has. This means that some columns are not required to gauge the success of our app. For example, we do not need to know the Current Version of the app.

To do this, I will begin by removing columns that are not necessary for my analysis. Dropping columns is pretty easy in python. I can call the drop () function with a list of the column names that I want dropped.

After dropping the unneeded columns, I standardize the names of my columns. By standardize, I mean making sure each column name starts with a lowercase value. If the column name has multiple words, I connect the words with the upper score "_" key.

In [4]:
# May need to return to this
google_play.drop(['Size', 'Last Updated', 'Current Ver', 'Android Ver'], axis=1, inplace=True)

# Standardize column names
google_play.columns = ['app', 'category', 'rating', 'reviews', 'installs', 'type', 'price', 'content_rating', 'genres']

google_play.head()
Out[4]:
app category rating reviews installs type price content_rating genres
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 10,000+ Free 0 Everyone Art & Design
1 Coloring book moana ART_AND_DESIGN 3.9 967 500,000+ Free 0 Everyone Art & Design;Pretend Play
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 5,000,000+ Free 0 Everyone Art & Design
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 50,000,000+ Free 0 Teen Art & Design
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 100,000+ Free 0 Everyone Art & Design;Creativity

Now, the dataset has 9 columns of information that will help determine the success of an app. We can also see the difference in the column names.

Data Cleaning

Data cleaning is exactly what it sounds like. Now that we have narrowed our data down to the columns we want, and dealt with missing data, we want to make sure that our data within each row is in a form that we can work with. That includes making sure that the data is the correct type. By using google_play.info(), we can expectedly, receive information about our dataframe. This includes the number of entries, the columns, the number of non-null elements, and finally, the data type of each column.

In [5]:
google_play.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          9367 non-null   float64
 3   reviews         10841 non-null  object 
 4   installs        10841 non-null  object 
 5   type            10840 non-null  object 
 6   price           10841 non-null  object 
 7   content_rating  10840 non-null  object 
 8   genres          10841 non-null  object 
dtypes: float64(1), object(8)
memory usage: 762.4+ KB

The only column that contains numeric values is rating. Aside from that, everything else appears to be objects, such as a string. Columns, such as category, type, and content_rating, are categorical variables, so we would expect them to be strings. However, we want columns, such as, installs, reviews, and price, to be numeric.

You may notice that the Installs column uses values such as "+" and ",". For example, one entry may be 5,000+ Installs. Similarly, the price column uses the $ character. If we attempted to convert the install, and price columns to numeric values right now, they would all become NaN. That means that first, we must remove those characters. An easy way of removing unwanted characters, is by replacing them with an empty string.

In [6]:
# Remove position 10472 due to data malformation
google_play.drop([10472], inplace=True)

# Cleaning the Installs column
def cleanInstall(install):
    install = install.replace('+', '')
    install = install.replace(',', '')
    return install

google_play['installs'] = google_play['installs'].map(lambda x: cleanInstall(x))

# Cleaning the Price column 
def cleanPrice(price):
    return price.replace('$', '')

google_play['price'] = google_play['price'].map(lambda x: cleanPrice(x))

google_play.head()
Out[6]:
app category rating reviews installs type price content_rating genres
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 10000 Free 0 Everyone Art & Design
1 Coloring book moana ART_AND_DESIGN 3.9 967 500000 Free 0 Everyone Art & Design;Pretend Play
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 5000000 Free 0 Everyone Art & Design
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 50000000 Free 0 Teen Art & Design
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 100000 Free 0 Everyone Art & Design;Creativity

Now that we have removed those characters, we can convert the Reviews, Installs, and Price columns to numeric.

In [7]:
# Convert to Numerics
google_play['reviews'] = pd.to_numeric(google_play['reviews'])
google_play['installs'] = pd.to_numeric(google_play['installs'])
google_play['price'] = pd.to_numeric(google_play['price'])

Now these columns are numeric.

Missing Data

Missing data is exactly what it sounds like: any data that we would like to know, but do not have. Missing data is my dataset is represented by "NaN", which means not a number.

In [8]:
missing = google_play[google_play.isnull().any (axis=1)]
missing
Out[8]:
app category rating reviews installs type price content_rating genres
23 Mcqueen Coloring pages ART_AND_DESIGN NaN 61 100000 Free 0.0 Everyone Art & Design;Action & Adventure
113 Wrinkles and rejuvenation BEAUTY NaN 182 100000 Free 0.0 Everyone 10+ Beauty
123 Manicure - nail design BEAUTY NaN 119 50000 Free 0.0 Everyone Beauty
126 Skin Care and Natural Beauty BEAUTY NaN 654 100000 Free 0.0 Teen Beauty
129 Secrets of beauty, youth and health BEAUTY NaN 77 10000 Free 0.0 Mature 17+ Beauty
... ... ... ... ... ... ... ... ... ...
10824 Cardio-FR MEDICAL NaN 67 10000 Free 0.0 Everyone Medical
10825 Naruto & Boruto FR SOCIAL NaN 7 100 Free 0.0 Teen Social
10831 payermonstationnement.fr MAPS_AND_NAVIGATION NaN 38 5000 Free 0.0 Everyone Maps & Navigation
10835 FR Forms BUSINESS NaN 0 10 Free 0.0 Everyone Business
10838 Parkinson Exercices FR MEDICAL NaN 3 1000 Free 0.0 Everyone Medical

1474 rows × 9 columns

This is the total amount of entries that contain missing data. The majority of them is missing the Rating column. There is 1 entry that is missing the Content Rating column (but has other data malformities). Finally, there is 1 entry that is missing both the Rating Column, and the Type Column. For the mean time, I will remove all entries that have missing values.

In [9]:
# Drops values with NaN
google_play.dropna(inplace=True)
# Fixes the index column.
google_play.reset_index(drop=True, inplace=True)
google_play
Out[9]:
app category rating reviews installs type price content_rating genres
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 10000 Free 0.0 Everyone Art & Design
1 Coloring book moana ART_AND_DESIGN 3.9 967 500000 Free 0.0 Everyone Art & Design;Pretend Play
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 5000000 Free 0.0 Everyone Art & Design
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 50000000 Free 0.0 Teen Art & Design
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 100000 Free 0.0 Everyone Art & Design;Creativity
... ... ... ... ... ... ... ... ... ...
9361 FR Calculator FAMILY 4.0 7 500 Free 0.0 Everyone Education
9362 Sya9a Maroc - FR FAMILY 4.5 38 5000 Free 0.0 Everyone Education
9363 Fr. Mike Schmitz Audio Teachings FAMILY 5.0 4 100 Free 0.0 Everyone Education
9364 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 4.5 114 1000 Free 0.0 Mature 17+ Books & Reference
9365 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE 4.5 398307 10000000 Free 0.0 Everyone Lifestyle

9366 rows × 9 columns

Exploratory Data Analysis and Data Visualization

Exploratory Data Analysis involves analyzing our data to observe patterns and trends. One nice way of doing this is by visualizing our data using graphs. Recall that we want to analyze which factors impact the rating and number of downloads an app gets.

To begin, we want to see the overall distribution of ratings and installs across the entire Play Store. I choose to graph the distributions using a histogram. With a histogram, we can easily see if the data is normally distributed or is skewed.

In [10]:
plt.hist(google_play['rating'], bins=20, edgecolor='black', linewidth=1.1)
plt.title('Distribution of Ratings in Play Store')
plt.xlabel('Rating')
plt.ylabel('Number of Apps')
plt.show()

The distribution of rating is left-skewed, with most ratings being over 4.0. It would appear that most apps do not get lower ratings.

In [11]:
plt.hist(google_play['installs'], bins=30, edgecolor='black', linewidth=1.1)
plt.title('Distribution of Installs in Play Store')
plt.xlabel('Installs')
plt.ylabel('Number of Apps')
plt.show()

The distribution of installs in the Play Store is highly right skewed. However, there appears to be some outliers, with a very small numbers of apps closer to 1.0e9 installs.

I wonder if rating has any impact on the number of installs? For example, could an app with a higher number of installs have a lower rating? To observe this, I shall create a scatter plot of installs vs rating.

In [12]:
plt.scatter(google_play['installs'], google_play['rating'])
plt.title("Installs vs Rating for Play Store")
plt.xlabel('Installs')
plt.ylabel('Rating')
plt.show()

It appears that the only apps with rating below 3.0 have lower number of installs.

Category's influence Number of Installs and Rating

One interesting thing to explore the performance of games based on category. For example, do apps that involve art have better ratings or have more installs than apps that involve beauty?

To begin, I want to find out how many apps belong to each category on the store. A nice way of visualizing data like this is with a pie chart.

In [13]:
# Create dataframe that has total count for each category
value_counts_category = google_play['category'].value_counts()
df1 = pd.DataFrame(value_counts_category)
df1 = df1.reset_index()
df1.columns = ['category', 'cat_count']

# Create dataframe with percent for each category
value_counts_cat_pct = google_play['category'].value_counts(normalize=True) * 100
df_temp = pd.DataFrame(value_counts_cat_pct)
df_temp = df_temp.reset_index()
df_temp.columns = ['category', 'cat_pct']

# Merge dataframes
df_cat = df1.merge(df_temp, how='inner', on='category')
df_cat.head()
Out[13]:
category cat_count cat_pct
0 FAMILY 1747 18.652573
1 GAME 1097 11.712577
2 TOOLS 734 7.836857
3 PRODUCTIVITY 351 3.747598
4 MEDICAL 350 3.736921

Now that I have my data, I can begin creating my pie chart. However, there is an issue. In a pie chart, each slice has a separate color to make it easy to identify. However, there is a limit on distinct colors. Therefore, my pie chart with 33 slices would have many colors repeating, making it hard to understand.

Luckily, pie charts have a colors argument, where I can pass my own colors into. First, I will create a list of custom colors, then I will pass that list into my pie chart construction.

In [14]:
# Used this Stack Overflow post for help generating random colors: https://stackoverflow.com/questions/56290650/python-how-to-get-a-list-of-n-different-colors-from-a-colormap
colors = []
for i in range(34):
    colors.append('#%06X' % randint(0, 0xFFFFFF))
In [15]:
# Turn dataframe into Pie Chart
fig, pie_cat = plt.subplots(figsize=(8, 10))
pie_cat.pie(df_cat['cat_count'], startangle=90, colors=colors)
plt.title('Categories of Google Play Store')
pie_cat.axis('equal')

# Create label
# Used this stackoverflow post for help with legend: https://stackoverflow.com/questions/44076203/getting-percentages-in-legend-from-pie-matplotlib-pie-chart
labels = ['%s, %1.1f%%' % (l,s) for l, s in zip(df_cat['category'], df_cat['cat_pct'])]
pie_cat.legend(labels, loc='upper left')
plt.subplots_adjust(right=1.75)

Based on this, the majority of apps belong to the Family category, followed by game and tools. Beauty and Events have the least amount of apps on the store. Less apps could mean less competition to an aspiring creator, therefore, it's not necessary a bad thing that those categories are smaller.

Now, I want to see how successful apps of each category is. I will begin by plotting the average rating and the average number of installs based on category.

In [16]:
cat_rating_mean = google_play.groupby('category', as_index=False)['rating'].mean()
cat_rating_mean.columns = ['category', 'rating_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(3, 10))
bar_cat.barh(cat_rating_mean['category'], cat_rating_mean['rating_mean'], align='center', height=0.6)
plt.title('Average Rating based on Category')
plt.show()

Recall, that the overall distribution of ratings was right skewed, with most rating being around 4. Here, we can see that most categories have an average rating of around 4. Therefore, it seems that the categories are performing around the same rating wise.

In [17]:
cat_install_mean = google_play.groupby('category', as_index=False)['installs'].mean()
cat_install_mean.columns = ['category', 'install_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(3, 10))
bar_cat.barh(cat_install_mean['category'], cat_install_mean['install_mean'], align='center', height=0.6)
plt.title('Average number of Installs based on Category')
plt.show()

This chart is rather different than the previous. Despite the average rating across category being really similar, the average number of installs for categories varies greatly. For example, we saw before that communication apps only make up 3.5%, but they have by far the highest average number of installs. Apps that involve beauty, auto and vehicles, medicine, and parenting seem to have very low number of installs.

Price's influence on Number of Installs and Rating

Another aspect that could have an impact on installs and rating is price. Is it the case that apps with a price tag turn away users, leading to less installs? Do apps with a price tag have higher ratings?

To begin, we will get the proportion of apps that are free vs paid.

In [18]:
# Create dataframe that has total count for each category
value_counts_category = google_play['type'].value_counts()
df1 = pd.DataFrame(value_counts_category)
df1 = df1.reset_index()
df1.columns = ['category', 'type']

# Create dataframe with percent for each category
value_counts_cat_pct = google_play['type'].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_cat_pct)
df2 = df2.reset_index()
df2.columns = ['category', 'type_pct']

# Merge dataframes
df_price = df1.merge(df2, how='inner', on='category')
df_price
Out[18]:
category type type_pct
0 Free 8719 93.092035
1 Paid 647 6.907965

A vast majority, 93%, of the apps are free. However, does that mean that free apps perform better than paid apps? I will perform a similar analysis as I did with category. I decided against turning this into a pie chart because looking at the table alone is enough to understand how the Play Store is distributed payment wise.

In [19]:
price_rating_mean = google_play.groupby('type', as_index=False)['rating'].mean()
price_rating_mean.columns = ['type', 'rating_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(6, 5))
bar_cat.bar(price_rating_mean['type'], price_rating_mean['rating_mean'], align='center', width=0.5)
plt.title('Average Rating based on Free or Paid')
plt.show()

Interestingly, paid apps have a slightly higher mean rating. However, it is not by much. Now, let's look at installs.

In [20]:
price_install_mean = google_play.groupby('type', as_index=False)['installs'].mean()
price_install_mean.columns = ['type', 'install_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(6, 5))
bar_cat.bar(price_install_mean['type'], price_install_mean['install_mean'], align='center', width=0.5)
plt.title('Average Number of Installs based on Free or Paid')
plt.show()

Free apps appear to have massively more average installs than paid apps. However, there are much more free apps on the Play Store than Paid games to begin with.

So far, I have analyzed price without taking category into account. However, it is possible that paid games perform better or worse depending on what category they belong to.

In [21]:
# Create dataframe with counts of category and type
value_counts = google_play[['category', 'type']].value_counts()
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['category', 'type', 'type_count']

# Create dataframe with percentage of category and type
value_counts_pct = google_play[['category','type']].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_pct)
df2 = df2.reset_index()
df2.columns = ['category', 'type', 'type_pct']

df_cat_price = df1.merge(df2, how='inner', on=['category', 'type'])
df_cat = df_cat.merge(df_cat_price, how='inner', on='category')
df_cat
Out[21]:
category cat_count cat_pct type type_count type_pct
0 FAMILY 1747 18.652573 Free 1585 16.922913
1 FAMILY 1747 18.652573 Paid 162 1.729660
2 GAME 1097 11.712577 Free 1020 10.890455
3 GAME 1097 11.712577 Paid 77 0.822123
4 TOOLS 734 7.836857 Free 671 7.164211
... ... ... ... ... ... ...
56 COMICS 58 0.619261 Free 58 0.619261
57 PARENTING 50 0.533846 Free 48 0.512492
58 PARENTING 50 0.533846 Paid 2 0.021354
59 EVENTS 45 0.480461 Free 45 0.480461
60 BEAUTY 42 0.448430 Free 42 0.448430

61 rows × 6 columns

In [22]:
# Get average rating based on category and type
cat_price_rating_mean = google_play.groupby(['category', 'type'], as_index=False)['rating'].mean()
cat_price_rating_mean.columns = ['category', 'type', 'rating_mean']
df_cat = df_cat.merge(cat_price_rating_mean, how='inner', on=['category', 'type'])

# Get average installs based on category and type
cat_price_install_mean = google_play.groupby(['category', 'type'], as_index=False)['installs'].mean()
cat_price_install_mean.columns = ['category', 'type', 'install_mean']
df_cat = df_cat.merge(cat_price_install_mean, how='inner', on=['category', 'type'])

df_cat
Out[22]:
category cat_count cat_pct type type_count type_pct rating_mean install_mean
0 FAMILY 1747 18.652573 Free 1585 16.922913 4.181767 6.452008e+06
1 FAMILY 1747 18.652573 Paid 162 1.729660 4.295062 1.930175e+05
2 GAME 1097 11.712577 Free 1020 10.890455 4.279804 3.437722e+07
3 GAME 1097 11.712577 Paid 77 0.822123 4.372727 2.740164e+05
4 TOOLS 734 7.836857 Free 671 7.164211 4.035917 1.706259e+07
... ... ... ... ... ... ... ... ...
56 COMICS 58 0.619261 Free 58 0.619261 4.155172 9.661397e+05
57 PARENTING 50 0.533846 Free 48 0.512492 4.339583 6.472085e+05
58 PARENTING 50 0.533846 Paid 2 0.021354 3.350000 2.505000e+04
59 EVENTS 45 0.480461 Free 45 0.480461 4.435556 3.544313e+05
60 BEAUTY 42 0.448430 Free 42 0.448430 4.278571 6.408619e+05

61 rows × 8 columns

This table shows the proportion of free vs paid apps based on category. Across each category, there appears to be much more free apps than paid apps.

In [23]:
df_free = df_cat.loc[df_cat['type'] == 'Free']
df_paid = df_cat.loc[df_cat['type'] == 'Paid']
# Want to 
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_free['category'], df_free['rating_mean'])
ax.barh(df_paid['category'], df_paid['rating_mean'], height=0.5)
plt.title('Average Rating based on Category and whether Free or Paid')
plt.show()

The average rating for free apps of a particular category is represented by the color blue. Orange represents paid apps. The first thing we notice is that some categories (e.g. Beauty, Events) do not have an average rating for paid apps. This means that those categories do not contain paid apps. Aside from that, the average ratings appear to be very similar. The news and magazine category seems to have a higher average rating for paid apps. While the social, and parenting categories, have higher average rating for free apps.

In [24]:
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_free['category'], df_free['install_mean'])
ax.barh(df_paid['category'], df_paid['install_mean'], height=0.5)
plt.title('Average Number of Installs based on Category and whether Free or Paid')
plt.show()

This graph seems a bit strange, as we cannot see the average number of installs for paid apps of any category. So, let's try plotting just paid apps.

In [25]:
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_paid['category'], df_paid['install_mean'], height=0.5, color='orange')
plt.title('Average Number of Installs for Paid Apps based on Category')
plt.show()

Now it makes sense. Notice the x axis on this plot compared to the last plot. It's much smaller! Because there are so much more free apps being installed, the paid apps could not be seen on the graph.

Content Rating's influence on Number of Installs and Rating

Finally, I want to analyze how content rating impacts the average number of installs, and the average rating.

I will begin by checking how the Play Store is distributed content rating wise. For example, people of all ages use the store. Does that mean that most apps are for Everyone?

In [26]:
# Create dataframe that has total count for each content rating
value_counts_content = google_play['content_rating'].value_counts()
df1 = pd.DataFrame(value_counts_content)
df1 = df1.reset_index()
df1.columns = ['content_rating', 'content_count']

# Create dataframe with percent for each content rating
value_counts_content_pct = google_play['content_rating'].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_content_pct)
df2 = df2.reset_index()
df2.columns = ['content_rating', 'content_pct']

# Merge dataframes
df_content_rating = df1.merge(df2, how='inner', on='content_rating')
df_content_rating
Out[26]:
content_rating content_count content_pct
0 Everyone 7420 79.222720
1 Teen 1084 11.573777
2 Mature 17+ 461 4.922059
3 Everyone 10+ 397 4.238736
4 Adults only 18+ 3 0.032031
5 Unrated 1 0.010677
In [27]:
# Create Pie Chart
fig, pie_content = plt.subplots(figsize=(8, 10))
pie_content.pie(df_content_rating['content_count'], startangle=90, 
                labels=df_content_rating['content_rating'], autopct='%1.1f%%', rotatelabels=True)
plt.title('Content Ratings of Google Play Store', pad=60)
pie_cat.axis('equal')
plt.show()

The majority of apps are rated Everyone, with close to 80%. Then it's Teen, Mature, Everyone 10+, Adults Only, and finally Unrated. Though there is only 1 app that is Unrated. It appears that apps overall want the biggest audience possible. Though, that would imply that Everyone 10+ should have more entries than Teen and Mature, but it doesn't. Now, let's check how content rating impacts average rating and the average number of installs.

In [28]:
rating_mean = google_play.groupby('content_rating', as_index=False)['rating'].mean()
rating_mean.columns = ['content_rating', 'rating_mean']
# Create bar plot
fig, ax = plt.subplots(figsize=(6, 5))
ax.bar(rating_mean['content_rating'], rating_mean['rating_mean'], align='center', width=0.5)
plt.title('Average Rating based on Content Rating')
plt.show()

Like category and price before it, content rating has a similar average rating across varieties.

In [29]:
install_mean = google_play.groupby('content_rating', as_index=False)['installs'].mean()
install_mean.columns = ['content_rating', 'install_mean']
# Create bar plot
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(install_mean['content_rating'], install_mean['install_mean'], align='center', width=0.5)
plt.title('Average Number of Installs based on Content Rating')
plt.show()

Now, this is rather surprising! I expected Everyone to have the highest average installs, because the majority of apps on the Play Store is rated Everyone. However, it is 3rd, behind Teen and Everyone 10+. Recall that Unrated only had 1 app belonging to it, so it makes sense that it has the lowest average number of installs.

Analysis, Hypothesis Testing, and Machine Learning

At the beginning I mentioned that I would perform a regression. Now we have reached that point. Regressions are modelling approaches that allow us to get a predictive model of our data. A predictive model would allow us to determine how an app would perform before it gets put on the Play Store.

Before I can perform a regression, I need to make sure that all my categorical variables have numeric representation. The variables I'm focusing on are the Category and Content Rating columns.

In order to convert the categorical variables into numeric variables, there are multiple options. I used this [tutorial] (https://pbpython.com/categorical-encoding.html) for help when exploring my options.

The first is is One Hot Encoding. It would turn each categorical data into it's own column. Then 0 or 1 is assigned to each row. For example, art and design, would become a column. A row would have 1 assigned to one of the newly created columns, and 0 assigned everywhere else. This is because a row could have only belonged to one category to begin with.

In [30]:
# Example showing One Hot Encoding
dummies_category = pd.get_dummies(google_play['category'])
dummies_category
Out[30]:
ART_AND_DESIGN AUTO_AND_VEHICLES BEAUTY BOOKS_AND_REFERENCE BUSINESS COMICS COMMUNICATION DATING EDUCATION ENTERTAINMENT ... PERSONALIZATION PHOTOGRAPHY PRODUCTIVITY SHOPPING SOCIAL SPORTS TOOLS TRAVEL_AND_LOCAL VIDEO_PLAYERS WEATHER
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9361 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9362 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9363 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9364 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9365 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

9366 rows × 33 columns

One problem with One-Hot Encoding is the number of columns that would be added to the table. Creating dummies of the category variable would create 33 new columns alone. Then, I would need to do the same for Content Rating.

Instead, I will use Label Encoding. This simply assigns a unique integer in alphabetical order. For example, the numeric representation of Art and Design would be 0, while Beauty would be 3.

In [31]:
# Create Label Encoder
label_encoder = LabelEncoder()

# Assign numerical value to Category and Content Rating in new column
google_play['category_num'] = label_encoder.fit_transform(google_play['category'])
google_play['content_rating_num'] = label_encoder.fit_transform(google_play['content_rating'])

google_play.head()
Out[31]:
app category rating reviews installs type price content_rating genres category_num content_rating_num
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 10000 Free 0.0 Everyone Art & Design 0 1
1 Coloring book moana ART_AND_DESIGN 3.9 967 500000 Free 0.0 Everyone Art & Design;Pretend Play 0 1
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 5000000 Free 0.0 Everyone Art & Design 0 1
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 50000000 Free 0.0 Teen Art & Design 0 4
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 100000 Free 0.0 Everyone Art & Design;Creativity 0 1

We expect category_num column to have 33 unique values, and for the content_num column to have 6 unique values. I will check to make sure that the encoding is correct.

In [32]:
# Checking number of unique numerical values for "category"
tmp = google_play['category_num'].unique()
tmp.sort()
tmp
Out[32]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32])
In [33]:
# Checking number of unique numerical values for "content rating"
tmp = google_play['content_rating_num'].unique()
tmp.sort()
tmp
Out[33]:
array([0, 1, 2, 3, 4, 5])

It would appear that category and content rating were transformed into numeric variables correctly. Now, I can begin to perform regression.

Predictive Model for Rating

To begin, we must perform feature selection. Feature Selection is reducing the number of input variables when developing a predictive model. For my model, I decided to select reviews, installs, price, category, and content rating as my features. These features are represented by the X variable. The y variable represents the variable I want to predict. In this case, it is rating.

After selecting features, I must split my data into training and testing sets. I use python's train test split to do so. I decide to make my test set slightly smaller, at 33%, and choose the random state to be 42, which is standard.

In [34]:
# Select features 
# More information about feature selection: https://machinelearningmastery.com/feature-selection-for-regression-data/
features = ['reviews', 'installs', 'price', 'category_num', 'content_rating_num']
X = google_play[features]
y = google_play['rating']

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print('Trainig', X_train.shape, y_train.shape)
print('Testing', X_test.shape, y_test.shape)
Trainig (6275, 5) (6275,)
Testing (3091, 5) (3091,)

Training on 6275 entries, and testing on 3091 entries.

I've decided to fit my data to a Linear Regression, and a Random Forest Regression to determine which one works better. To do so, I shall calculate the mean squared error for this model.

In [35]:
# Select model and fit it
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)
# Predict on test data
y_predict_linear = model_linear.predict(X_test)
print(mean_squared_error(y_test, y_predict_linear))
0.24779744340095827
In [36]:
model_forest = RandomForestRegressor()
model_forest.fit(X_train, y_train)
y_predict_forest = model_forest.predict(X_test)
print(mean_squared_error(y_test, y_predict_forest))
0.24378341367135342

Interestingly, the mean squared error is very similar for both models. However, I suppose the Random Forest model is slightly lower.

Now if we put in features of an app, both models should be able to predict what rating the app would receive.

Predictive Model for Number of Installs

I will now perform a very similar analysis for the number of installs.

In [37]:
# Select features 
features = ['rating', 'reviews', 'price', 'category_num', 'content_rating_num']
X = google_play[features]
y = google_play['installs']

# Split into training and testing data
X_train_install, X_test_install, y_train_install, y_test_install = train_test_split(X, y, test_size=0.33, random_state=42)
In [38]:
model_lin_install = LinearRegression()
model_lin_install.fit(X_train_install, y_train_install)
y_predict_lin_install = model_lin_install.predict(X_test_install)
print(mean_squared_error(y_test, y_predict_lin_install))
2958139489796131.0
In [39]:
model_for_install = RandomForestRegressor()
model_for_install.fit(X_train_install, y_train_install)
y_predict_for_install = model_for_install.predict(X_test_install)
print(mean_squared_error(y_test, y_predict_for_install))
7515466591465622.0

Both predictive models for Ratings had very low mean squared error. However, for installs the mean squared error is extremely high for both models. This suggests that both models may have issues predicting the number of installs an app will receive.

Insight and Policy Decision

I decided to explore the overall landscape of the Play Store. For example, I discovered that the majority of apps belong to the Family category, and that the vast majority of apps are free. Throughout my paper, I attempted to analyze what factors influence an app's success on the Play Store, primarily looking at rating, and install numbers.

There are numerous improvements that could be made, as well as further questions to be answered. As we can see above, my regression model for number of installs can be fine tuned to have a lower mean squared error. There are also other questions that I did not explore here. I did not dive into the genres of apps. Perhaps there are certain genres that are more successful than others. Also, we could narrow our search into the most popular apps, and analyze what factors make them popular. Finally, we could refine the charts I created, and perhaps attempt to explore more information with them. This is a massive dataset, and thus there is a variety of things left to explore.