
YouTube Analysis Project

  • cagandemir3
  • Sep 3, 2024
  • 5 min read



Software/Programming Languages/Libraries: Python, Sklearn, Msno, Pandas, Ycimpute, Numpy, Seaborn, Matplotlib, ChatGPT


The purpose of this project is to analyze YouTube statistics and draw significant insights using Python. I utilized the dataset titled "Global YouTube Statistics.csv" from Kaggle to examine global popularity trends on YouTube and determine which categories are most successful for content creators. The project involves data cleaning, filling missing values, and generating reports based on the analysis.


The Jupyter notebook and dataset are available here:


Core Functions

The main functions of the project include:

Data Cleaning: Cleaning the dataset to make it suitable for analysis by addressing missing and erroneous information.

Filling Missing Values: Using ChatGPT to fill in missing country and category information.

Generating Reports: Creating detailed reports and visualizations based on the analysis results.


Libraries:

sklearn: For machine learning algorithms

msno (missingno): For visualizing and analyzing missing data

pandas: For data processing and analysis

numpy: For numerical computations

seaborn: For advanced data visualization

matplotlib: For basic data visualization

ycimpute: For filling missing data

scipy: For scientific and technical calculations

Platform: ChatGPT was used for collecting and completing categorical data.


Process

1. EDA

First, I checked the shape of the data, which has 995 rows and 28 columns (variables).

df.shape
>>>(995,28)

Then I checked the data types using the info method in pandas.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-null    float64
 11  country_rank                             879 non-null    float64
 12  channel_type_rank                        962 non-null    float64
 13  video_views_for_the_last_30_days         939 non-null    float64
 14  lowest_monthly_earnings                  995 non-null    float64
 15  highest_monthly_earnings                 995 non-null    float64
 16  lowest_yearly_earnings                   995 non-null    float64
 17  highest_yearly_earnings                  995 non-null    float64
 18  subscribers_for_last_30_days             658 non-null    float64
 19  created_year                             990 non-null    float64
 20  created_month                            990 non-null    object 
 21  created_date                             990 non-null    float64
 22  Gross tertiary education enrollment (%)  872 non-null    float64
 23  Population                               872 non-null    float64
 24  Unemployment rate                        872 non-null    float64
 25  Urban_population                         872 non-null    float64
 26  Latitude                                 872 non-null    float64
 27  Longitude                                872 non-null    float64
dtypes: float64(18), int64(3), object(7)
memory usage: 217.8+ KB
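
The non-null counts above already hint at which columns have gaps. As a minimal sketch, the same information can be turned into a per-column missing-value summary. The `toy` frame and its values are made up for illustration and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Country": ["India", None, "Japan", None],
    "Population": [1.4e9, np.nan, 1.26e8, np.nan],
    "subscribers": [100, 90, 80, 70],
})

# Count and share of missing values per column, largest first.
missing_summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```

On the real data, replacing `toy` with `df` should list the columns in order of how much cleaning they need.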

Defining Types of Variables

To choose visualization methods and plan the analysis, I defined the variable types by writing a function.

def grab_col_names(dataframe, cat_th=18, car_th=49):
    # Columns whose dtype is already categorical-like
    cat_cols = [col for col in dataframe.columns if str(dataframe[col].dtype) in ["category", "object", "bool"]]
    # Numeric columns with fewer than cat_th unique values are treated as categorical
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and str(dataframe[col].dtype) in ["int64", "float64"]]
    # Categorical columns with more than car_th unique values are treated as cardinal
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and str(dataframe[col].dtype) in ["category", "object"]]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # Remaining numeric columns
    num_cols = [col for col in dataframe.columns if str(dataframe[col].dtype) in ["float64", "int64"]]
    num_cols = [col for col in num_cols if col not in cat_cols]
    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")
    return cat_cols, num_cols, cat_but_car

# I set the thresholds based on the number of unique values.
# From what I observed, the thresholds for categorical variables lie between 19 and 58.
cat_cols, num_cols, cat_but_car = grab_col_names(df, 19, 58)

I also edited the resulting lists at the end of this process.


2. Handling Missing Values

I also checked the missing values using the missingno library.

I needed to check whether the missing values are related to each other, so I used the matrix plot from the msno library.

According to the missing-value matrix, Country, Abbreviation, Gross tertiary education enrollment (%), Population, Unemployment rate, Urban_population, Latitude, and Longitude show a similar missing-value pattern.


I then checked the relationship between the missing values with Python code.

index_population = list(df[df["Population"].isnull()].index)
index_unemployment_rate = list(df[df["Unemployment rate"].isnull()].index)
index_country = list(df[df["Country"].isnull()].index)

# Check that every index in index_country is also in index_unemployment_rate.
for i in index_country:
    if i  not in index_unemployment_rate:
        print("Error")

At the end of this process, I confirmed that Country, Abbreviation, Gross tertiary education enrollment (%), Population, Unemployment rate, Urban_population, Latitude, and Longitude share the same missing-value pattern.
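
The loop above only checks one direction (rows missing Country are also missing Unemployment rate). As a rough sketch, the same comparison can be done with boolean masks, which also reveals rows where a country-level value is missing even though Country is known. The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Country": ["India", None, "Japan", None, "Brazil"],
    "Population": [1.4e9, np.nan, 1.26e8, np.nan, np.nan],
    "Unemployment rate": [5.4, np.nan, 2.3, np.nan, np.nan],
})

null_mask = toy.isna()

# True when two columns are missing on exactly the same rows.
same_pattern = bool((null_mask["Population"] == null_mask["Unemployment rate"]).all())

# True when every row missing Country is also missing Population
# (the one-directional check done by the loop).
country_subset = bool((~null_mask["Country"] | null_mask["Population"]).all())

# Rows where Population is missing although Country is known.
extra_rows = toy.index[null_mask["Population"] & ~null_mask["Country"]].tolist()

print(same_pattern, country_subset, extra_rows)
```

Rows listed in `extra_rows` are the ones that a country-based lookup could later fill.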


Filling Unrelated Missing Values with ChatGPT


To complete some categorical variables, such as the YouTuber's country and its abbreviation, I would have had to research each one and fill it in by hand. I preferred an easier way: filling in this information with ChatGPT. To do so, you can either use the API or ask ChatGPT to build a dictionary. I chose to ask ChatGPT.


I filled Country, Category, and Channel type by creating dictionaries and assigning values to the missing cells in these columns by the YouTuber's name.

for tuber in df["Youtuber"]:  # Fill missing data from the ChatGPT dictionary
    if tuber in GPT_country:
        # Entries are expected to look like "Country (XX)"; "null" means no answer
        if GPT_country[tuber].split()[0] != "null":
            index_of_tuber = df[df["Youtuber"] == tuber].index[0]
            df.loc[index_of_tuber, "Country"] = GPT_country[tuber].split()[0]
            df.loc[index_of_tuber, "Abbreviation"] = GPT_country[tuber].replace("(", "").replace(")", "").split()[1]

At the end of this process, most of the missing values in these columns were filled.
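
One caveat with the loop: `split()[0]` keeps only the first word, so a multi-word country name would be truncated. Below is a minimal vectorized sketch of the same idea that splits on the last space instead. The dictionary `gpt_country`, its "Country (XX)" format, and the channel names are assumptions for illustration, not the dictionary I actually used.

```python
import pandas as pd

# Hypothetical lookup in an assumed "Country (XX)" format; "null" means unknown.
gpt_country = {
    "Channel B": "United States (US)",
    "Channel C": "null",
    "Channel D": "India (IN)",
}

toy = pd.DataFrame({
    "Youtuber": ["Channel A", "Channel B", "Channel C", "Channel D"],
    "Country": ["Japan", None, None, None],
    "Abbreviation": ["JP", None, None, None],
})

# Map each channel name to its lookup string; names not in the dictionary become NaN.
looked_up = toy["Youtuber"].map(gpt_country)
looked_up = looked_up[looked_up.notna() & (looked_up != "null")]

# Split on the last space so multi-word country names stay intact.
parts = looked_up.str.rsplit(" ", n=1, expand=True)
country = parts[0]
abbreviation = parts[1].str.strip("()")

# fillna aligns on the index, so only cells that are still missing are filled.
toy["Country"] = toy["Country"].fillna(country)
toy["Abbreviation"] = toy["Abbreviation"].fillna(abbreviation)

print(toy)
```

Existing values such as "Japan" are left untouched, and channels answered with "null" stay missing.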

Filling Missing Values Using Related Data

Most country-related columns (variables) can be filled from rows of the same country where the values are present. That is why I created a pivot table grouped by country.

# Per-country means: from the country I can look up population, unemployment rate, etc.
grouped_by_country = df[(df["Country"].notna()) & (df["Population"].notna())].groupby(["Country"])[['Gross tertiary education enrollment (%)', 'Population',
       'Unemployment rate', 'Urban_population', 'Latitude', 'Longitude']].mean()

At the end of this process, I had filled most of the missing values in these columns.
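
The step that writes the per-country values back into the frame is not shown above. A minimal sketch of how it could look, assuming a per-country mean table like `grouped_by_country`; the `toy` frame and its numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

country_cols = ["Population", "Unemployment rate"]

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Country": ["India", "India", "Japan", "Japan", None],
    "Population": [1.4e9, np.nan, 1.26e8, 1.26e8, np.nan],
    "Unemployment rate": [5.4, np.nan, 2.3, np.nan, np.nan],
})

# Per-country means, indexed by country name.
country_means = toy.groupby("Country")[country_cols].mean()

# Look up each row's country in the mean table and fill only the missing cells.
for col in country_cols:
    toy[col] = toy[col].fillna(toy["Country"].map(country_means[col]))

print(toy)
```

Rows whose Country is still unknown stay missing, which is why filling Country first makes this step more effective.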


Using KNN to fill Numerical Missing Values

First, I determined the number of neighbors by running tests.

Then I filled the missing values using a KNN imputer.

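from ycimpute.imputer import knnimput

df2 = df.drop(columns=total_cat)  # Drop all categorical columns (total_cat is defined earlier in the notebook)
dff_num = np.array(df2[num_cols])  # Keep only the numerical data as an array
dff = knnimput.KNN(k=2).complete(dff_num)  # Complete the missing values with the KNN imputer
dff = pd.DataFrame(dff, columns=num_cols)  # Build a new DataFrame from the imputed array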

At the end of this process, the missing numerical values were filled.
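
The tests for choosing the number of neighbors are not shown here. One common way to run such a test is to hide some known values, impute them with different k, and compare the error. The sketch below does that on synthetic data and uses scikit-learn's `KNNImputer` rather than ycimpute, so it is an illustration of the idea and not the exact procedure I ran.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic, fully observed numeric data with correlated columns.
base = rng.normal(size=(200, 1))
X_true = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# Hide about 10% of the cells at random so the imputed values can be scored.
mask = rng.random(X_true.shape) < 0.10
mask[mask.all(axis=1), 0] = False  # keep at least one observed value per row
X_missing = X_true.copy()
X_missing[mask] = np.nan

scores = {}
for k in (1, 2, 3, 5, 10):
    X_filled = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    # Root-mean-squared error on the hidden cells only.
    scores[k] = float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))

best_k = min(scores, key=scores.get)
print(scores, best_k)
```

KNN distances depend on scale, so on real data with columns as different as Population and Latitude it is usually worth scaling the features before imputing.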

3. Last Edits and Dashboard

Lastly, I dropped the unnecessary leftover index columns.

df3= df3.drop(columns=["Unnamed: 0.1","Unnamed: 0"])

Then I created a DataFrame of the top 10 YouTubers.

Top_10=df3.head(10) #Select Top 10 Youtubers
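
The dashboard code in this post plots `category_list` and `category_amount`, which are not defined in the snippets shown. A minimal sketch of how they might be derived from the top 10 table, using a made-up `top_10_toy` frame in place of `Top_10`:

```python
import pandas as pd

# Toy stand-in for Top_10 (categories are made up for illustration).
top_10_toy = pd.DataFrame({
    "Youtuber": [f"Channel {i}" for i in range(10)],
    "category": ["Music", "Entertainment", "Music", "Education", "Music",
                 "Entertainment", "Sports", "Music", "Education", "Shows"],
})

# Number of channels per category, most frequent first.
category_counts = top_10_toy["category"].value_counts()
category_list = category_counts.index.tolist()
category_amount = category_counts.values.tolist()

print(category_list, category_amount)
```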

Finally, I built the dashboard using the Matplotlib and Seaborn libraries.

# Making the dashboard using Matplotlib and Seaborn
fig, axes = plt.subplots(3, 2, figsize=(20, 20))  # Define the dashboard size
plt.subplots_adjust(wspace=0.4, hspace=0.8)  # Define the space between charts

# First subplot
sns.barplot(ax=axes[0, 0], x=category_list, y=category_amount)
axes[0, 0].set_title("Top 10 Youtubers' Category")
axes[0, 0].set_xlabel("Top 10 Youtube Category")
axes[0, 0].set_ylabel("Count of Youtube Channels")
axes[0, 0].tick_params(axis='x', rotation=90)

# Second subplot
sns.barplot(ax=axes[0, 1], x=df3["category"].value_counts().index, y=df3["category"].value_counts().values)
axes[0, 1].set_title("Most Popular Youtube Categories in the World")
axes[0, 1].set_xlabel("Youtube Category in The World")
axes[0, 1].set_ylabel("Count of Youtube Channels")
axes[0, 1].tick_params(axis='x', rotation=90)

# Third subplot (Pie chart)
Countries = list(Top_10["Country"].value_counts().index)
axes[1, 0].pie(x=Top_10["Country"].value_counts().values, labels=Countries, autopct='%.0f%%')
axes[1, 0].set_title("Top 10 Country")

# Fourth subplot
sns.countplot(ax=axes[1, 1], data=df3, x="Country")
axes[1, 1].set_title("Countries of Popular Youtubers in the World")
axes[1, 1].set_xlabel("Countries")
axes[1, 1].set_ylabel("Amount of Youtubers")
axes[1, 1].tick_params(axis='x', rotation=90)

# Fifth subplot
sns.countplot(ax=axes[2, 0], data=Top_10, x="created_year")
axes[2, 0].set_title("Top 10 Youtubers' Channel Creation Years")
axes[2, 0].set_xlabel("Channels Created Year")
axes[2, 0].set_ylabel("Count of Top 10 Channels")

# Barplot of Education and Unemployment Rate
df_long = pd.melt(Top_10, id_vars=['Country'], value_vars=['Gross tertiary education enrollment (%)', 'Unemployment rate'], var_name='Metric', value_name='Value')
sns.barplot(ax=axes[2, 1], data=df_long, x="Country", y="Value", hue="Metric", palette="dark", alpha=.6)
axes[2, 1].set_title("Top 10 Youtuber Countries")
axes[2, 1].set_xlabel("Country")
axes[2, 1].set_ylabel("Value")
axes[2, 1].legend(title="Metric")

plt.tight_layout()
plt.show()

Unfortunately, Matplotlib and Seaborn are not dashboard tools; however, I created some charts to gain meaningful insights.


4. Analysis and Insights

Popular Categories

During the data analysis, various visualization and machine learning techniques were employed to identify the most popular YouTube categories. The results indicate that Music, Entertainment, and People & Blogs are the most popular categories. These categories attract a broad audience and achieve high view counts.


Top YouTubers

My analysis revealed that the majority of successful YouTubers come from countries such as the US, India, Russia, and Japan. Additionally, most YouTubers established their channels in 2006. This suggests that being active on YouTube in its early stages has contributed to long-term success.


Insights About Countries

The data indicates that even in countries with poor economic conditions and educational levels, YouTube success can be achieved through consistency and choosing the right categories. Categories like Entertainment, Film & Animation, and Music enable content creators to reach a wide audience. These success stories highlight that, despite economic and educational challenges, success on YouTube is possible with the right strategies and passion.


 
 
 
