The Google Play Store is a digital distribution service that allows users to download apps developed with the Android software development kit. It offers a wide variety of applications, such as games, books, and maps. Some applications require payment, while others are free. According to Business of Apps, the Play Store has 2.56 million apps available. Distributing an app through the store has the potential to be a great success. For example, according to Statista, in 2020 Candy Crush Saga generated 473 million U.S. dollars in revenue across the Play Store and the App Store.
However, certain applications catch the public's eye more than others, so someone who wants to develop applications for the Google Play Store would be curious about which qualities lead to success. Therefore, I will create regression models that predict the number of installs an app would receive and the rating it would get, based on a variety of its attributes.
The dataset has 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver.
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from random import randint
# Read in the dataset from the csv file
google_play = pd.read_csv('googleplaystore.csv')
google_play.head()
The above table shows a section of the dataset. By calling shape, we can see the total number of entries (rows) in the dataset, as well as the number of columns. The format is (rows, columns).
google_play.shape
This means that at the time the dataset was created, there were 10,841 apps on the Play Store.
Now that we have our data, we must tidy it. Tidying the data makes it easier to analyze.
Remember that my aim is to analyze what factors make an app successful. I measure success in two ways: (1) the app's rating and (2) the number of installs it has. This means that some columns are not required to gauge an app's success; for example, we do not need to know its current version.
To do this, I will begin by removing columns that are not necessary for my analysis. Dropping columns is straightforward in Python: I call the drop() function with a list of the column names I want removed.
After dropping the unneeded columns, I standardize the names of my columns. By standardize, I mean making sure each column name is lowercase and, if a column name has multiple words, connecting the words with an underscore ("_").
# Drop columns not needed for this analysis
google_play.drop(['Size', 'Last Updated', 'Current Ver', 'Android Ver'], axis=1, inplace=True)
# Standardize column names
google_play.columns = ['app', 'category', 'rating', 'reviews', 'installs', 'type', 'price', 'content_rating', 'genres']
google_play.head()
Now the dataset has 9 columns of information that will help determine the success of an app. We can also see the standardized column names.
Data cleaning is exactly what it sounds like. Now that we have narrowed our data down to the columns we want, we need to make sure that the data within each row is in a form we can work with. That includes making sure the data is of the correct type. By calling google_play.info(), we receive information about our dataframe: the number of entries, the columns, the number of non-null elements in each, and the data type of each column.
google_play.info()
The only column that contains numeric values is rating. Everything else is stored as an object, such as a string. Columns such as category, type, and content_rating are categorical variables, so we expect them to be strings. However, we want columns such as installs, reviews, and price to be numeric.
You may notice that the installs column uses characters such as "+" and ",". For example, one entry may be 5,000+ installs. Similarly, the price column uses the "$" character. If we attempted to convert the installs and price columns to numeric values right now, they would all become NaN. That means we must first remove those characters. An easy way of removing unwanted characters is to replace them with an empty string.
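As a small illustration (using made-up example strings, not the dataset itself), converting the raw values with errors='coerce' shows how every entry would turn into NaN:
# Illustration only: converting the raw strings directly turns every value into NaN
raw_examples = pd.Series(['5,000+', '10,000+', '$4.99'])
print(pd.to_numeric(raw_examples, errors='coerce'))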
# Remove row 10472, which contains malformed data
google_play.drop([10472], inplace=True)
# Cleaning the Installs column
def cleanInstall(install):
    install = install.replace('+', '')
    install = install.replace(',', '')
    return install

google_play['installs'] = google_play['installs'].map(cleanInstall)
# Cleaning the Price column
def cleanPrice(price):
    return price.replace('$', '')

google_play['price'] = google_play['price'].map(cleanPrice)
google_play.head()
Now that we have removed those characters, we can convert the Reviews, Installs, and Price columns to numeric.
# Convert to Numerics
google_play['reviews'] = pd.to_numeric(google_play['reviews'])
google_play['installs'] = pd.to_numeric(google_play['installs'])
google_play['price'] = pd.to_numeric(google_play['price'])
Now these columns are numeric.
Missing data is exactly what it sounds like: any data that we would like to know, but do not have. Missing data in my dataset is represented by "NaN", which stands for "not a number".
missing = google_play[google_play.isnull().any(axis=1)]
missing
These are the entries that contain missing data. The majority of them are missing the rating column, and there is 1 entry that is missing both the rating and the type columns. (The malformed entry that was also missing its content rating was already dropped above.) For the time being, I will remove all entries that have missing values.
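Before dropping, it is worth noting an alternative I could have used: imputing the missing ratings with the median rating. A minimal sketch on a copy of the dataframe (not used in the rest of this analysis):
# Alternative to dropping (not used here): fill missing ratings with the median rating
imputed = google_play.copy()
imputed['rating'] = imputed['rating'].fillna(imputed['rating'].median())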
# Drops values with NaN
google_play.dropna(inplace=True)
# Reset the index after dropping rows
google_play.reset_index(drop=True, inplace=True)
google_play
Exploratory Data Analysis involves analyzing our data to observe patterns and trends. One nice way of doing this is by visualizing our data using graphs. Recall that we want to analyze which factors impact the rating and number of downloads an app gets.
To begin, we want to see the overall distribution of ratings and installs across the entire Play Store. I choose to graph the distributions using histograms, since a histogram makes it easy to see whether the data is normally distributed or skewed.
plt.hist(google_play['rating'], bins=20, edgecolor='black', linewidth=1.1)
plt.title('Distribution of Ratings in Play Store')
plt.xlabel('Rating')
plt.ylabel('Number of Apps')
plt.show()
The distribution of ratings is left-skewed, with most ratings being above 4.0. It would appear that most apps do not receive low ratings.
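As a quick numeric check of this visual impression, the skewness of the rating column can be computed; a negative value indicates a left skew:
# Negative skewness confirms the left skew seen in the histogram
print(google_play['rating'].skew())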
plt.hist(google_play['installs'], bins=30, edgecolor='black', linewidth=1.1)
plt.title('Distribution of Installs in Play Store')
plt.xlabel('Installs')
plt.ylabel('Number of Apps')
plt.show()
The distribution of installs in the Play Store is highly right-skewed. There also appear to be some outliers, with a very small number of apps close to 1.0e9 installs.
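To make the long right tail easier to see, one option is to re-plot the installs on a log scale. A minimal sketch (adding 1 before taking the log avoids issues with apps that report 0 installs):
# Re-plot installs on a log10 scale so the tail is visible
plt.hist(np.log10(google_play['installs'] + 1), bins=30, edgecolor='black', linewidth=1.1)
plt.title('Distribution of log10(Installs + 1) in Play Store')
plt.xlabel('log10(Installs + 1)')
plt.ylabel('Number of Apps')
plt.show()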
Does rating have any impact on the number of installs? For example, could an app with a higher number of installs have a lower rating? To observe this, I shall create a scatter plot of installs vs. rating.
plt.scatter(google_play['installs'], google_play['rating'])
plt.title("Installs vs Rating for Play Store")
plt.xlabel('Installs')
plt.ylabel('Rating')
plt.show()
It appears that the only apps with ratings below 3.0 also have low numbers of installs.
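As a rough way to quantify this, the correlation between the two columns can be computed; a value near 0 would suggest there is little linear relationship between installs and rating:
# Correlation between installs and rating
print(google_play[['installs', 'rating']].corr())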
Another interesting thing to explore is the performance of apps based on category. For example, do apps that involve art have better ratings or more installs than apps that involve beauty?
To begin, I want to find out how many apps belong to each category on the store. A nice way of visualizing data like this is with a pie chart.
# Create dataframe that has total count for each category
value_counts_category = google_play['category'].value_counts()
df1 = pd.DataFrame(value_counts_category)
df1 = df1.reset_index()
df1.columns = ['category', 'cat_count']
# Create dataframe with percent for each category
value_counts_cat_pct = google_play['category'].value_counts(normalize=True) * 100
df_temp = pd.DataFrame(value_counts_cat_pct)
df_temp = df_temp.reset_index()
df_temp.columns = ['category', 'cat_pct']
# Merge dataframes
df_cat = df1.merge(df_temp, how='inner', on='category')
df_cat.head()
Now that I have my data, I can begin creating my pie chart. However, there is an issue: each slice of a pie chart gets a separate color to make it easy to identify, but there is a limit on the number of distinct default colors. A pie chart with 33 slices would therefore have many repeating colors, making it hard to read.
Luckily, the pie chart function has a colors argument that accepts a custom list of colors. First, I will create a list of custom colors, then I will pass that list in when constructing my pie chart.
# Used this Stack Overflow post for help generating random colors: https://stackoverflow.com/questions/56290650/python-how-to-get-a-list-of-n-different-colors-from-a-colormap
colors = []
for i in range(len(df_cat)):
    colors.append('#%06X' % randint(0, 0xFFFFFF))
# Turn dataframe into Pie Chart
fig, pie_cat = plt.subplots(figsize=(8, 10))
pie_cat.pie(df_cat['cat_count'], startangle=90, colors=colors)
plt.title('Categories of Google Play Store')
pie_cat.axis('equal')
# Create label
# Used this stackoverflow post for help with legend: https://stackoverflow.com/questions/44076203/getting-percentages-in-legend-from-pie-matplotlib-pie-chart
labels = ['%s, %1.1f%%' % (l,s) for l, s in zip(df_cat['category'], df_cat['cat_pct'])]
pie_cat.legend(labels, loc='upper left')
plt.subplots_adjust(right=1.75)
Based on this, the majority of apps belong to the Family category, followed by Game and Tools. Beauty and Events have the fewest apps on the store. Fewer apps could mean less competition for an aspiring creator, so it is not necessarily a bad thing that those categories are smaller.
Now, I want to see how successful the apps of each category are. I will begin by plotting the average rating and the average number of installs by category.
cat_rating_mean = google_play.groupby('category', as_index=False)['rating'].mean()
cat_rating_mean.columns = ['category', 'rating_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(3, 10))
bar_cat.barh(cat_rating_mean['category'], cat_rating_mean['rating_mean'], align='center', height=0.6)
plt.title('Average Rating based on Category')
plt.show()
Recall that the overall distribution of ratings was left-skewed, with most ratings being around 4. Here, we can see that most categories have an average rating of around 4. Therefore, it seems that the categories perform about the same rating-wise.
cat_install_mean = google_play.groupby('category', as_index=False)['installs'].mean()
cat_install_mean.columns = ['category', 'install_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(3, 10))
bar_cat.barh(cat_install_mean['category'], cat_install_mean['install_mean'], align='center', height=0.6)
plt.title('Average number of Installs based on Category')
plt.show()
This chart is rather different from the previous one. Despite the average rating being very similar across categories, the average number of installs varies greatly. For example, we saw before that communication apps make up only 3.5% of the store, yet they have by far the highest average number of installs. Apps that involve beauty, autos and vehicles, medical topics, and parenting seem to have very low numbers of installs.
Another aspect that could have an impact on installs and rating is price. Is it the case that apps with a price tag turn away users, leading to fewer installs? Do apps with a price tag have higher ratings?
To begin, we will get the proportion of apps that are free vs paid.
# Create dataframe with the total count for each type (Free vs Paid)
value_counts_type = google_play['type'].value_counts()
df1 = pd.DataFrame(value_counts_type)
df1 = df1.reset_index()
df1.columns = ['type', 'type_count']
# Create dataframe with the percentage for each type
value_counts_type_pct = google_play['type'].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_type_pct)
df2 = df2.reset_index()
df2.columns = ['type', 'type_pct']
# Merge dataframes
df_price = df1.merge(df2, how='inner', on='type')
df_price
The vast majority of apps, about 93%, are free. However, does that mean free apps perform better than paid apps? I will perform a similar analysis as I did with category. I decided against turning this into a pie chart because the table alone is enough to understand how the Play Store is split between free and paid apps.
price_rating_mean = google_play.groupby('type', as_index=False)['rating'].mean()
price_rating_mean.columns = ['type', 'rating_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(6, 5))
bar_cat.bar(price_rating_mean['type'], price_rating_mean['rating_mean'], align='center', width=0.5)
plt.title('Average Rating based on Free or Paid')
plt.show()
Interestingly, paid apps have a slightly higher mean rating, though not by much. Now, let's look at installs.
price_install_mean = google_play.groupby('type', as_index=False)['installs'].mean()
price_install_mean.columns = ['type', 'install_mean']
# Create bar plot
fig, bar_cat = plt.subplots(figsize=(6, 5))
bar_cat.bar(price_install_mean['type'], price_install_mean['install_mean'], align='center', width=0.5)
plt.title('Average Number of Installs based on Free or Paid')
plt.show()
Free apps appear to have massively more average installs than paid apps. However, there are far more free apps on the Play Store than paid apps to begin with.
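Because averages are heavily influenced by a handful of extremely popular apps, comparing medians is a useful sanity check here:
# Medians are less sensitive to the few apps with enormous install counts
print(google_play.groupby('type')['installs'].median())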
So far, I have analyzed price without taking category into account. However, it is possible that paid apps perform better or worse depending on which category they belong to.
# Create dataframe with counts of category and type
value_counts = google_play[['category', 'type']].value_counts()
df1 = pd.DataFrame(value_counts)
df1 = df1.reset_index()
df1.columns = ['category', 'type', 'type_count']
# Create dataframe with percentage of category and type
value_counts_pct = google_play[['category','type']].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_pct)
df2 = df2.reset_index()
df2.columns = ['category', 'type', 'type_pct']
df_cat_price = df1.merge(df2, how='inner', on=['category', 'type'])
df_cat = df_cat.merge(df_cat_price, how='inner', on='category')
df_cat
# Get average rating based on category and type
cat_price_rating_mean = google_play.groupby(['category', 'type'], as_index=False)['rating'].mean()
cat_price_rating_mean.columns = ['category', 'type', 'rating_mean']
df_cat = df_cat.merge(cat_price_rating_mean, how='inner', on=['category', 'type'])
# Get average installs based on category and type
cat_price_install_mean = google_play.groupby(['category', 'type'], as_index=False)['installs'].mean()
cat_price_install_mean.columns = ['category', 'type', 'install_mean']
df_cat = df_cat.merge(cat_price_install_mean, how='inner', on=['category', 'type'])
df_cat
This table shows the proportion of free vs. paid apps in each category. Across every category, there appear to be many more free apps than paid apps.
df_free = df_cat.loc[df_cat['type'] == 'Free']
df_paid = df_cat.loc[df_cat['type'] == 'Paid']
# Plot the average rating for free and paid apps side by side, by category
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_free['category'], df_free['rating_mean'])
ax.barh(df_paid['category'], df_paid['rating_mean'], height=0.5)
plt.title('Average Rating based on Category and whether Free or Paid')
plt.show()
The average rating for free apps of a particular category is shown in blue, and paid apps in orange. The first thing we notice is that some categories (e.g. Beauty, Events) do not have an average rating for paid apps, which means those categories contain no paid apps. Aside from that, the average ratings appear very similar. The News and Magazines category seems to have a higher average rating for paid apps, while the Social and Parenting categories have higher average ratings for free apps.
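To confirm that the missing orange bars really correspond to categories with no paid apps, I can list the categories that appear in the free table but not in the paid one:
# Categories that have free apps but no paid apps
print(set(df_free['category']) - set(df_paid['category']))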
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_free['category'], df_free['install_mean'])
ax.barh(df_paid['category'], df_paid['install_mean'], height=0.5)
plt.title('Average Number of Installs based on Category and whether Free or Paid')
plt.show()
This graph seems a bit strange, as we cannot see the average number of installs for paid apps of any category. So, let's try plotting just paid apps.
fig, ax = plt.subplots(1, 1, figsize=(3, 10))
ax.barh(df_paid['category'], df_paid['install_mean'], height=0.5, color='orange')
plt.title('Average Number of Installs for Paid Apps based on Category')
plt.show()
Now it makes sense. Notice the x-axis on this plot compared to the last one: it is much smaller! Because free apps are installed so many more times, the bars for paid apps could not be seen on the previous graph.
Finally, I want to analyze how content rating impacts the average number of installs, and the average rating.
I will begin by checking how the Play Store is distributed in terms of content rating. For example, people of all ages use the store; does that mean most apps are rated Everyone?
# Create dataframe that has total count for each content rating
value_counts_content = google_play['content_rating'].value_counts()
df1 = pd.DataFrame(value_counts_content)
df1 = df1.reset_index()
df1.columns = ['content_rating', 'content_count']
# Create dataframe with percent for each content rating
value_counts_content_pct = google_play['content_rating'].value_counts(normalize=True) * 100
df2 = pd.DataFrame(value_counts_content_pct)
df2 = df2.reset_index()
df2.columns = ['content_rating', 'content_pct']
# Merge dataframes
df_content_rating = df1.merge(df2, how='inner', on='content_rating')
df_content_rating
# Create Pie Chart
fig, pie_content = plt.subplots(figsize=(8, 10))
pie_content.pie(df_content_rating['content_count'], startangle=90,
labels=df_content_rating['content_rating'], autopct='%1.1f%%', rotatelabels=True)
plt.title('Content Ratings of Google Play Store', pad=60)
pie_content.axis('equal')
plt.show()
The majority of apps are rated Everyone, at close to 80%, followed by Teen, Mature, Everyone 10+, Adults Only, and finally Unrated, which contains only 1 app. It appears that apps overall want the biggest possible audience. That said, one might then expect Everyone 10+ to have more entries than Teen and Mature, but it does not. Now, let's check how content rating impacts the average rating and the average number of installs.
rating_mean = google_play.groupby('content_rating', as_index=False)['rating'].mean()
rating_mean.columns = ['content_rating', 'rating_mean']
# Create bar plot
fig, ax = plt.subplots(figsize=(6, 5))
ax.bar(rating_mean['content_rating'], rating_mean['rating_mean'], align='center', width=0.5)
plt.title('Average Rating based on Content Rating')
plt.show()
Like category and price before it, content rating shows a similar average rating across its levels.
install_mean = google_play.groupby('content_rating', as_index=False)['installs'].mean()
install_mean.columns = ['content_rating', 'install_mean']
# Create bar plot
fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(install_mean['content_rating'], install_mean['install_mean'], align='center', width=0.5)
plt.title('Average Number of Installs based on Content Rating')
plt.show()
Now, this is rather surprising! I expected Everyone to have the highest average installs, because the majority of apps on the Play Store are rated Everyone. However, it comes in 3rd, behind Teen and Everyone 10+. Recall that Unrated only has 1 app, so it makes sense that it has the lowest average number of installs.
At the beginning, I mentioned that I would perform a regression, and we have now reached that point. Regression is a modelling approach that lets us fit a predictive model to our data. A predictive model would allow us to estimate how an app will perform before it is put on the Play Store.
Before I can perform a regression, I need to make sure that all my categorical variables have numeric representation. The variables I'm focusing on are the Category and Content Rating columns.
In order to convert the categorical variables into numeric variables, there are multiple options. I used this [tutorial](https://pbpython.com/categorical-encoding.html) for help when exploring my options.
The first option is One-Hot Encoding, which turns each category value into its own column and assigns 0 or 1 to each row. For example, Art and Design would become a column. A row would have 1 assigned in exactly one of the newly created columns and 0 everywhere else, because each row belonged to only one category to begin with.
# Example showing One Hot Encoding
dummies_category = pd.get_dummies(google_play['category'])
dummies_category
One problem with One-Hot Encoding is the number of columns it adds to the table. Creating dummies for the category variable alone would create 33 new columns, and I would then need to do the same for content rating.
Instead, I will use Label Encoding, which simply assigns each category a unique integer in alphabetical order. For example, the numeric representation of Art and Design would be 0, while Beauty would be 2.
# Create Label Encoder
label_encoder = LabelEncoder()
# Assign numerical value to Category and Content Rating in new column
google_play['category_num'] = label_encoder.fit_transform(google_play['category'])
google_play['content_rating_num'] = label_encoder.fit_transform(google_play['content_rating'])
google_play.head()
We expect the category_num column to have 33 unique values and the content_rating_num column to have 6 unique values. I will check to make sure that the encoding is correct.
# Checking number of unique numerical values for "category"
tmp = google_play['category_num'].unique()
tmp.sort()
tmp
# Checking number of unique numerical values for "content rating"
tmp = google_play['content_rating_num'].unique()
tmp.sort()
tmp
It would appear that category and content rating were transformed into numeric variables correctly. Now, I can begin to perform regression.
To begin, we must perform feature selection, i.e. choosing which input variables to use when developing a predictive model. For my model, I selected reviews, installs, price, category, and content rating as my features. These features are stored in the X variable, while y holds the variable I want to predict; in this case, that is rating.
After selecting features, I must split my data into training and testing sets, using scikit-learn's train_test_split. I make my test set slightly smaller, at 33%, and set random_state to 42 so that the split is reproducible.
# Select features
# More information about feature selection: https://machinelearningmastery.com/feature-selection-for-regression-data/
features = ['reviews', 'installs', 'price', 'category_num', 'content_rating_num']
X = google_play[features]
y = google_play['rating']
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print('Training', X_train.shape, y_train.shape)
print('Testing', X_test.shape, y_test.shape)
Training on 6275 entries, and testing on 3091 entries.
I've decided to fit my data with both a Linear Regression and a Random Forest Regression to determine which one works better. To compare them, I shall calculate the mean squared error for each model.
# Select model and fit it
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)
# Predict on test data
y_predict_linear = model_linear.predict(X_test)
print(mean_squared_error(y_test, y_predict_linear))
model_forest = RandomForestRegressor()
model_forest.fit(X_train, y_train)
y_predict_forest = model_forest.predict(X_test)
print(mean_squared_error(y_test, y_predict_forest))
Interestingly, the mean squared error is very similar for both models, with the Random Forest model's error being slightly lower.
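To put these numbers in context, one can compare them against a naive baseline that always predicts the mean training rating; a useful model should have a lower error than this:
# Naive baseline: always predict the mean rating from the training set
baseline_prediction = np.full(len(y_test), y_train.mean())
print(mean_squared_error(y_test, baseline_prediction))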
Now, if we feed in the features of an app, both models should be able to predict what rating the app would receive.
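To illustrate, here is a prediction for a single hypothetical app; the feature values below are made up purely for this example:
# Hypothetical app: 50,000 reviews, 1,000,000 installs, free, with made-up encoded
# values for category and content rating
example_app = pd.DataFrame([[50000, 1000000, 0.0, 14, 1]], columns=features)
print(model_linear.predict(example_app))
print(model_forest.predict(example_app))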
I will now perform a very similar analysis for the number of installs.
# Select features
features = ['rating', 'reviews', 'price', 'category_num', 'content_rating_num']
X = google_play[features]
y = google_play['installs']
# Split into training and testing data
X_train_install, X_test_install, y_train_install, y_test_install = train_test_split(X, y, test_size=0.33, random_state=42)
model_lin_install = LinearRegression()
model_lin_install.fit(X_train_install, y_train_install)
y_predict_lin_install = model_lin_install.predict(X_test_install)
print(mean_squared_error(y_test_install, y_predict_lin_install))
model_for_install = RandomForestRegressor()
model_for_install.fit(X_train_install, y_train_install)
y_predict_for_install = model_for_install.predict(X_test_install)
print(mean_squared_error(y_test_install, y_predict_for_install))
Both predictive models for rating had very low mean squared error, but for installs the mean squared error is extremely high for both models. This suggests that both models may have trouble predicting the number of installs an app will receive.
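One way to make this error more interpretable (sketched below, not a full evaluation) is to predict log10 of the installs instead of the raw count, so that a miss by a factor of ten counts the same whether an app has thousands or millions of installs:
# Sketch: model log10(installs + 1) so the huge range of install counts
# does not dominate the squared error
y_log = np.log10(google_play['installs'] + 1)
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X, y_log, test_size=0.33, random_state=42)
model_log = RandomForestRegressor()
model_log.fit(X_train_log, y_train_log)
print(mean_squared_error(y_test_log, model_log.predict(X_test_log)))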
I set out to explore the overall landscape of the Play Store. For example, I discovered that the majority of apps belong to the Family category and that the vast majority of apps are free. Throughout this analysis, I attempted to determine what factors influence an app's success on the Play Store, looking primarily at ratings and install numbers.
There are numerous improvements that could be made, as well as further questions to be answered. As we saw above, the regression model for the number of installs could be fine-tuned to achieve a lower mean squared error.
There are also other questions that I did not explore here. I did not dive into the genres of apps; perhaps certain genres are more successful than others. We could also narrow our focus to the most popular apps and analyze what makes them popular. Finally, we could refine the charts I created and perhaps use them to explore more information. This is a rich dataset, and there is plenty left to explore.