YouTube Analysis Project
- cagandemir3
- Sep 3, 2024
- 5 min read
Software/Programming Languages/Libraries: Python, Sklearn, Msno, Pandas, Ycimpute, Numpy,
Seaborn, Matplotlib, ChatGPT
The purpose of this project is to analyze YouTube statistics and draw significant insights using Python. I utilized the dataset titled "Global YouTube Statistics.csv" from Kaggle to examine global popularity trends on YouTube and determine which categories are most successful for content creators. The project involves data cleaning, filling missing values, and generating reports based on the analysis.
You can find the Jupyter notebook and dataset here:
Core Functions
The main functions of the project include:
Data Cleaning: Cleaning the dataset to make it suitable for analysis by addressing missing and erroneous information.
Filling Missing Values: Using ChatGPT to fill in missing country and category information.
Generating Reports: Creating detailed reports and visualizations based on the analysis results.
Libraries:
sklearn: For machine learning algorithms
msno (missingno): For visualizing and analyzing missing data
pandas: For data processing and analysis
numpy: For numerical computations
seaborn: For advanced data visualization
matplotlib: For basic data visualization
ycimpute: For filling missing data
scipy: For scientific and technical calculations
Platform: ChatGPT was used for collecting and completing categorical data.
Process
1. EDA
First, I checked the shape of the data, which has 995 rows and 28 columns (variables).
df.shape
>>>(995,28)
Then I checked the data types using the info method in pandas.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 rank 995 non-null int64
1 Youtuber 995 non-null object
2 subscribers 995 non-null int64
3 video views 995 non-null float64
4 category 949 non-null object
5 Title 995 non-null object
6 uploads 995 non-null int64
7 Country 873 non-null object
8 Abbreviation 873 non-null object
9 channel_type 965 non-null object
10 video_views_rank 994 non-null float64
11 country_rank 879 non-null float64
12 channel_type_rank 962 non-null float64
13 video_views_for_the_last_30_days 939 non-null float64
14 lowest_monthly_earnings 995 non-null float64
15 highest_monthly_earnings 995 non-null float64
16 lowest_yearly_earnings 995 non-null float64
17 highest_yearly_earnings 995 non-null float64
18 subscribers_for_last_30_days 658 non-null float64
19 created_year 990 non-null float64
20 created_month 990 non-null object
21 created_date 990 non-null float64
22 Gross tertiary education enrollment (%) 872 non-null float64
23 Population 872 non-null float64
24 Unemployment rate 872 non-null float64
25 Urban_population 872 non-null float64
26 Latitude 872 non-null float64
27 Longitude 872 non-null float64
dtypes: float64(18), int64(3), object(7)
memory usage: 217.8+ KB
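The non-null counts above can be turned into a per-column summary of missing values. This is a minimal sketch on a small stand-in DataFrame with invented values (the real df comes from the Kaggle CSV); the helper name missing_summary is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(dataframe):
    """Return the count and percentage of missing values per column, worst first."""
    n_missing = dataframe.isnull().sum()
    pct_missing = (n_missing / len(dataframe) * 100).round(1)
    summary = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    # Keep only the columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Country": ["India", None, "Japan", None],
    "subscribers_for_last_30_days": [np.nan, np.nan, 100.0, np.nan],
})
print(missing_summary(toy))
```

On the real data, a table like this would make it easy to see that subscribers_for_last_30_days has the most gaps.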
Defining Types of Variables
To choose visualization methods and plan the analysis, I defined the variable types by writing a function.
def grab_col_names(dataframe, cat_th=18, car_th=49):
    # Columns that are categorical by dtype
    cat_cols = [col for col in dataframe.columns if str(dataframe[col].dtype) in ["category", "object", "bool"]]
    # Numeric columns with fewer than cat_th unique values behave like categories
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and str(dataframe[col].dtype) in ["int64", "float64"]]
    # Categorical columns with more than car_th unique values are high-cardinality
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and str(dataframe[col].dtype) in ["category", "object"]]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]
    num_cols = [col for col in dataframe.columns if str(dataframe[col].dtype) in ["float64", "int64"]]
    num_cols = [col for col in num_cols if col not in cat_cols]
    print(f"Observations: {dataframe.shape[0]}")
    print(f"Variables: {dataframe.shape[1]}")
    print(f"cat_cols: {len(cat_cols)}")
    print(f"num_cols: {len(num_cols)}")
    print(f"cat_but_car: {len(cat_but_car)}")
    print(f"num_but_cat: {len(num_but_cat)}")
    return cat_cols, num_cols, cat_but_car

# I chose the thresholds from the number of unique values per column.
# As I observed, the cut-off for categorical variables lies between 19 and 58.
cat_cols, num_cols, cat_but_car = grab_col_names(df, 19, 58)
I also adjusted these lists by hand at the end of this process.
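The thresholds come from looking at how many distinct values each column holds. Here is a sketch of that inspection on a toy frame; the column values are invented, and the cut-off of 4 is chosen only for this toy data, not taken from the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D", "E", "F"],  # one value per row: high cardinality
    "category": ["Music", "Music", "Games", "Games", "Music", "Games"],
    "created_month": ["Jan", "Feb", "Jan", "Mar", "Feb", "Jan"],
    "subscribers": [10, 20, 30, 40, 50, 60],
})

# Distinct values per column, sorted so natural cut-off points are easy to spot
unique_counts = toy.nunique().sort_values()
print(unique_counts)

car_th = 4  # hypothetical cut-off for this toy data
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Non-numeric columns at or below the cut-off behave like categories;
# those above it behave like identifiers (categorical but cardinal)
categorical = [c for c in non_numeric if toy[c].nunique() <= car_th]
high_cardinality = [c for c in non_numeric if toy[c].nunique() > car_th]
print(categorical, high_cardinality)
```

A gap in the sorted unique counts, like the one between 3 and 6 here, is a reasonable place to put the threshold.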
2. Handling Missing Values
I also checked the missing values using the missingno library.
I needed to see whether the missing values were connected to each other, so I used the matrix plot from missingno (msno).
According to the missing-value matrix, Country, Abbreviation, Gross tertiary education enrollment (%), Population, Unemployment rate, Urban_population, Latitude, and Longitude show a similar pattern of missing values.
I then checked the relationship between the missing values with Python code.
index_population = list(df[df["Population"].isnull()].index)
index_unemployment_rate = list(df[df["Unemployment rate"].isnull()].index)
index_country = list(df[df["Country"].isnull()].index)
# Check that every row with a missing Country is also missing Unemployment rate.
for i in index_country:
    if i not in index_unemployment_rate:
        print("Error")
At the end of this process, I had confirmed that Country, Abbreviation, Gross tertiary education enrollment (%), Population, Unemployment rate, Urban_population, Latitude, and Longitude have essentially the same missing-value pattern (their non-null counts in df.info() are 873 or 872).
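The loop above tests one direction only (rows missing Country are also missing Unemployment rate). A vectorized alternative can compare the null masks of several columns in both directions at once. This is a sketch on invented toy data; the helper name same_missing_pattern is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India", None, "Japan", None],
    "Population": [1.3e9, np.nan, 1.2e8, np.nan],
    "Unemployment rate": [5.4, np.nan, 2.3, np.nan],
    "category": ["Music", "Music", None, "Games"],
})

def same_missing_pattern(dataframe, columns):
    """Return True if every listed column is missing in exactly the same rows."""
    masks = dataframe[columns].isnull()
    reference = masks.iloc[:, 0]
    # Compare each column's null mask with the first column's mask, row by row
    return bool(masks.eq(reference, axis=0).all().all())

print(same_missing_pattern(toy, ["Country", "Population", "Unemployment rate"]))  # True
print(same_missing_pattern(toy, ["Country", "category"]))  # False
```

If this returns False on the real data, comparing the masks column by column shows which rows break the pattern.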
Filling Missing Unrelated Values with Chat GPT
To complete categorical variables such as a YouTuber's country and its abbreviation, I would normally have to research each channel and fill in the values by hand. I preferred an easier route: having ChatGPT supply this information. You can either call the API or ask ChatGPT to build a dictionary; I asked ChatGPT directly.
I filled Country, category, and channel_type by creating dictionaries and assigning their values to the missing cells in these columns, keyed by the YouTuber's name.
for tuber in df["Youtuber"]:  # Fill missing values from the ChatGPT dictionary
    if tuber in GPT_country:
        # Entries are assumed to look like "Country (AB)"; split()[0] keeps only the first word
        if GPT_country[tuber].split()[0] != "null":
            index_of_tuber = df[df["Youtuber"] == tuber].index[0]
            df.loc[index_of_tuber, "Country"] = GPT_country[tuber].split()[0]
            df.loc[index_of_tuber, "Abbreviation"] = GPT_country[tuber].replace("(", "").replace(")", "").split()[1]
At the end of this process, most of the missing values in these columns were filled.
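The same dictionary-based fill can be written without a loop, so that only missing cells are touched. This is a sketch under assumptions: the toy data and the gpt_country dictionary are invented, and "null" is assumed to mark an answer ChatGPT could not give.

```python
import pandas as pd

toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C"],
    "Country": ["India", None, None],
})

# Hypothetical lookup in the spirit of GPT_country; "null" marks an unknown answer
gpt_country = {"B": "Japan", "C": "null"}

# Map each channel name to its looked-up country; names not in the dictionary become NaN
looked_up = toy["Youtuber"].map(gpt_country)
# Treat the "null" placeholder as still missing
looked_up = looked_up.where(looked_up != "null")

# fillna only writes into rows where Country is missing, so existing values are kept
toy["Country"] = toy["Country"].fillna(looked_up)
print(toy)
```

Compared with the loop, this version also handles repeated channel names, because it works on every row rather than on the first matching index.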
Filling Missing Values Using Related Datas
Most country-related columns (variables) can be filled from rows of the same country where the values are present. That is why I created a pivot table grouped by country.
grouped_by_country = df[(df["Country"].notna()) & (df["Population"].notna())].groupby(["Country"])[['Gross tertiary education enrollment (%)', 'Population',
    'Unemployment rate', 'Urban_population', 'Latitude', 'Longitude']].mean()  # Per-country values: population, unemployment rate, etc.
At the end of this process, I had filled most of the missing country-related values.
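The snippet above builds the per-country table but does not show how it is written back into the data. One possible way to do that, sketched on invented toy data with only two of the country-level columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["India", "India", "Japan", "Japan", None],
    "Population": [1.4e9, np.nan, 1.25e8, 1.25e8, np.nan],
    "Latitude": [20.6, np.nan, np.nan, 36.2, np.nan],
})
country_cols = ["Population", "Latitude"]

# Per-country mean of each indicator; mean() ignores NaN values
grouped_by_country = toy.dropna(subset=["Country"]).groupby("Country")[country_cols].mean()

for col in country_cols:
    # Look up the country's value for every row, then use it only where the cell is empty
    toy[col] = toy[col].fillna(toy["Country"].map(grouped_by_country[col]))

print(toy)
```

Rows whose Country is itself missing stay empty, which is why the country column has to be completed first.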
Using KNN to fill Numerical Missing Values
First, I chose the number of neighbours by running tests.
Then I filled the missing values using a KNN imputer.
from ycimpute.imputer import knnimput

df2 = df.drop(columns=total_cat)  # Drop all categorical columns
dff_num = np.array(df2[num_cols])  # Keep only the numerical data as an array
dff = knnimput.KNN(k=2).complete(dff_num)  # Impute the missing values with the KNN imputer
dff = pd.DataFrame(dff, columns=num_cols)  # Wrap the result in a new DataFrame
At the end of this process, most of the remaining missing numerical values were filled.
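The tests used to pick the number of neighbours are not shown. One common approach is to hide some known values, impute them, and compare the result with the truth. The sketch below does this with scikit-learn's KNNImputer instead of ycimpute, on synthetic data; the candidate values of k and the 10% masking rate are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic numeric data: three correlated columns standing in for num_cols
base = rng.normal(size=(200, 1))
X_full = np.hstack([
    base,
    2 * base + rng.normal(scale=0.1, size=(200, 1)),
    -base + rng.normal(scale=0.1, size=(200, 1)),
])

# Hide about 10% of the known cells so imputed values can be compared with the truth
mask = rng.random(X_full.shape) < 0.10
X_missing = X_full.copy()
X_missing[mask] = np.nan

def imputation_rmse(k):
    """RMSE between true and imputed values on the artificially hidden cells."""
    imputed = KNNImputer(n_neighbors=k).fit_transform(X_missing)
    return float(np.sqrt(np.mean((imputed[mask] - X_full[mask]) ** 2)))

# Try several neighbour counts and keep the one with the lowest error
scores = {k: imputation_rmse(k) for k in (1, 2, 3, 5, 10)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

On real data, the numeric columns should usually be scaled first, since KNN distances are dominated by columns with large values such as Population.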
3. Last Edits and Dashboard
Lastly, I dropped the unnecessary leftover "Unnamed" index columns.
df3= df3.drop(columns=["Unnamed: 0.1","Unnamed: 0"])
Next, I created a DataFrame of the top 10 YouTubers.
Top_10=df3.head(10) #Select Top 10 Youtubers
I then built a dashboard-style figure using the Matplotlib and Seaborn libraries.
# Making Dashboard using Matplotlib and Seaborn
fig, axes = plt.subplots(3, 2, figsize=(20, 20))  # Define the dashboard size
plt.subplots_adjust(wspace=0.4, hspace=0.8)  # Define the spacing between plots
# First subplot
sns.barplot(ax=axes[0, 0], x=category_list, y=category_amount)
axes[0, 0].set_title("Top 10 Youtubers' Category")
axes[0, 0].set_xlabel("Top 10 Youtube Category")
axes[0, 0].set_ylabel("Count of Youtube Channels")
axes[0, 0].tick_params(axis='x', rotation=90)
# Second subplot
sns.barplot(ax=axes[0, 1], x=df3["category"].value_counts().index, y=df3["category"].value_counts().values)
axes[0, 1].set_title("Most Popular Youtube Categories in the World")
axes[0, 1].set_xlabel("Youtube Category in The World")
axes[0, 1].set_ylabel("Count of Youtube Channels")
axes[0, 1].tick_params(axis='x', rotation=90)
# Third subplot (Pie chart)
Countries = list(Top_10["Country"].value_counts().index)
axes[1, 0].pie(x=Top_10["Country"].value_counts().values, labels=Countries, autopct='%.0f%%')
axes[1, 0].set_title("Countries of the Top 10 Youtubers")
# Fourth subplot
sns.countplot(ax=axes[1, 1], data=df3, x="Country")
axes[1, 1].set_title("Countries of Popular Youtubers in the World")
axes[1, 1].set_xlabel("Countries")
axes[1, 1].set_ylabel("Amount of Youtubers")
axes[1, 1].tick_params(axis='x', rotation=90)
# Fifth subplot
sns.countplot(ax=axes[2, 0], data=Top_10, x="created_year")
axes[2, 0].set_title("Creation Year of the Top 10 Channels")
axes[2, 0].set_xlabel("Channel Creation Year")
axes[2, 0].set_ylabel("Count of Top 10 Channels")
# Barplot of Education and Unemployment Rate
df_long = pd.melt(Top_10, id_vars=['Country'], value_vars=['Gross tertiary education enrollment (%)', 'Unemployment rate'], var_name='Metric', value_name='Value')
sns.barplot(ax=axes[2, 1], data=df_long, x="Country", y="Value", hue="Metric", palette="dark", alpha=.6)
axes[2, 1].set_title("Top 10 Youtuber Countries")
axes[2, 1].set_xlabel("Country")
axes[2, 1].set_ylabel("Value")
axes[2, 1].legend(title="Metric")
plt.tight_layout()
plt.show()
Matplotlib and Seaborn are not really dashboard tools; however, I created these plots to gain meaningful insights.
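The first subplot uses category_list and category_amount, which are not defined in the code above. One plausible way to derive them from the top 10 table is with value_counts. This is a sketch on invented toy data, not necessarily how the notebook builds them.

```python
import pandas as pd

# Toy stand-in for Top_10 (values invented for illustration)
top_10 = pd.DataFrame({
    "category": ["Music", "Music", "Entertainment", "Education", "Music"],
})

# value_counts sorts the categories by frequency, most common first
category_counts = top_10["category"].value_counts()
category_list = list(category_counts.index)     # bar labels
category_amount = list(category_counts.values)  # bar heights
print(category_list, category_amount)
```

These two lists can then be passed as x and y to sns.barplot, as in the first subplot.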
4. Analysis and Insights
Popular Categories
During the data analysis, various visualization and machine learning techniques were employed to identify the most popular YouTube categories. The results indicate that Music, Entertainment, and People & Blogs are the most popular categories. These categories attract a broad audience and achieve high view counts.
Top YouTubers
My analysis revealed that the majority of successful YouTubers come from countries such as the US, India, Russia, and Japan. Additionally, most YouTubers established their channels in 2006. This suggests that being active on YouTube in its early stages has contributed to long-term success.
Insights About Countries
The data indicates that even in countries with poor economic conditions and educational levels, YouTube success can be achieved through consistency and choosing the right categories. Categories like Entertainment, Film & Animation, and Music enable content creators to reach a wide audience. These success stories highlight that, despite economic and educational challenges, success on YouTube is possible with the right strategies and passion.