doubt.datasets.facebook_comments
Facebook comments data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
1"""Facebook comments data set. 2 3This data set is from the UCI data set archive, with the description being the original 4description verbatim. Some feature names may have been altered, based on the 5description. 6""" 7 8import io 9import zipfile 10 11import pandas as pd 12 13from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset 14 15 16class FacebookComments(BaseDataset): 17 __doc__ = f""" 18 Instances in this dataset contain features extracted from Facebook posts. The task 19 associated with the data is to predict how many comments the post will receive. 20 21 {BASE_DATASET_DESCRIPTION} 22 23 Features: 24 page_popularity (int): 25 Defines the popularity of support for the source of the document 26 page_checkins (int): 27 Describes how many individuals so far visited this place. This feature is 28 only associated with places; e.g., some institution, place, theater, etc. 29 page_talking_about (int): 30 Defines the daily interest of individuals towards source of the 31 document/post. The people who actually come back to the page, after liking 32 the page. This include activities such as comments, likes to a post, shares 33 etc., by visitors to the page 34 page_category (int): 35 Defines the category of the source of the document; e.g., place, 36 institution, branch etc. 
37 agg[n] for n=0..24 (float): 38 These features are aggreagted by page, by calculating min, max, average, 39 median and standard deviation of essential features 40 cc1 (int): 41 The total number of comments before selected base date/time 42 cc2 (int): 43 The number of comments in the last 24 hours, relative to base date/time 44 cc3 (int): 45 The number of comments in the last 48 to last 24 hours relative to base 46 date/time 47 cc4 (int): 48 The number of comments in the first 24 hours after the publication of post 49 but before base date/time 50 cc5 (int): 51 The difference between cc2 and cc3 52 base_time (int): 53 Selected time in order to simulate the scenario, ranges from 0 to 71 54 post_length (int): 55 Character count in the post 56 post_share_count (int): 57 This feature counts the number of shares of the post, how many people had 58 shared this post onto their timeline 59 post_promotion_status (int): 60 Binary feature. To reach more people with posts in News Feed, individuals 61 can promote their post and this feature indicates whether the post is 62 promoted or not 63 h_local (int): 64 This describes the hours for which we have received the target 65 variable/comments. Ranges from 0 to 23 66 day_published[n] for n=0..6 (int): 67 Binary feature. This represents the day (Sunday-Saturday) on which the post 68 was published 69 day[n] for n=0..6 (int): 70 Binary feature. 
This represents the day (Sunday-Saturday) on selected base 71 date/time 72 73 Targets: 74 ncomments (int): 75 The number of comments in the next `h_local` hours 76 77 Source: 78 https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset 79 80 Examples: 81 Load in the data set:: 82 83 >>> dataset = FacebookComments() 84 >>> dataset.shape 85 (199030, 54) 86 87 Split the data set into features and targets, as NumPy arrays:: 88 89 >>> X, y = dataset.split() 90 >>> X.shape, y.shape 91 ((199030, 54), (199030,)) 92 93 Perform a train/test split, also outputting NumPy arrays:: 94 95 >>> train_test_split = dataset.split(test_size=0.2, random_seed=42) 96 >>> X_train, X_test, y_train, y_test = train_test_split 97 >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape 98 ((159211, 54), (159211,), (39819, 54), (39819,)) 99 100 Output the underlying Pandas DataFrame:: 101 102 >>> df = dataset.to_pandas() 103 >>> type(df) 104 <class 'pandas.core.frame.DataFrame'> 105 """ 106 107 _url = ( 108 "https://archive.ics.uci.edu/ml/machine-learning-databases/" "00363/Dataset.zip" 109 ) 110 111 _features = range(54) 112 _targets = [53] 113 114 def _prep_data(self, data: bytes) -> pd.DataFrame: 115 """Prepare the data set. 
116 117 Args: 118 data (bytes): The raw data 119 120 Returns: 121 Pandas dataframe: The prepared data 122 """ 123 124 # Convert the bytes into a file-like object 125 buffer = io.BytesIO(data) 126 127 # Unzip the file and pull out the csv file 128 with zipfile.ZipFile(buffer, "r") as zip_file: 129 csv = zip_file.read("Dataset/Training/Features_Variant_5.csv") 130 131 # Convert the string into a file-like object 132 csv_file = io.BytesIO(csv) 133 134 # Name the columns 135 cols = ( 136 ["page_popularity", "page_checkins", "page_talking_about", "page_category"] 137 + [f"agg{n}" for n in range(25)] 138 + [ 139 "cc1", 140 "cc2", 141 "cc3", 142 "cc4", 143 "cc5", 144 "base_time", 145 "post_length", 146 "post_share_count", 147 "post_promotion_status", 148 "h_local", 149 ] 150 + [f"day_published{n}" for n in range(7)] 151 + [f"day{n}" for n in range(7)] 152 + ["ncomments"] 153 ) 154 155 # Read the file-like object into a dataframe 156 df = pd.read_csv(csv_file, header=None, names=cols) 157 return df
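As a quick sanity check on the column scheme, the name list built in `_prep_data` can be reproduced standalone: it should contain exactly 54 names, with the target `ncomments` at index 53, matching `_features = range(54)` and `_targets = [53]` above.

```python
# Reproduce the column-name list from _prep_data: 4 page features, 25 page
# aggregates, 10 post-level features, 7 + 7 weekday indicators and the target.
cols = (
    ["page_popularity", "page_checkins", "page_talking_about", "page_category"]
    + [f"agg{n}" for n in range(25)]
    + [
        "cc1", "cc2", "cc3", "cc4", "cc5",
        "base_time", "post_length", "post_share_count",
        "post_promotion_status", "h_local",
    ]
    + [f"day_published{n}" for n in range(7)]
    + [f"day{n}" for n in range(7)]
    + ["ncomments"]
)

print(len(cols))  # 54
print(cols[53])   # ncomments
```

The arithmetic works out as 4 + 25 + 10 + 7 + 7 + 1 = 54, which is why the dataset reports a shape of (199030, 54).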
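The `io.BytesIO`/`zipfile` pattern used by `_prep_data` can be exercised without downloading the real archive. The sketch below builds a tiny zip in memory with a made-up three-column CSV (the member path mirrors the real one; the data is purely illustrative), then reads it back exactly the way `_prep_data` does.

```python
import io
import zipfile

import pandas as pd

# Build a small zip archive in memory, mimicking the downloaded Dataset.zip.
payload = "1,2,3\n4,5,6\n"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Dataset/Training/Features_Variant_5.csv", payload)

# Read it back the same way _prep_data does: wrap the raw bytes in BytesIO,
# pull the member out of the archive, then hand the bytes to pandas.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue()), "r") as zf:
    csv = zf.read("Dataset/Training/Features_Variant_5.csv")
df = pd.read_csv(io.BytesIO(csv), header=None, names=["a", "b", "c"])

print(df.shape)  # (2, 3)
```

Keeping everything in memory avoids writing a temporary file to disk, which is why `_prep_data` wraps both the raw download and the extracted CSV in `io.BytesIO` objects.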
The `BASE_DATASET_DESCRIPTION` referenced in the docstring expands to the following arguments and attributes, shared by all data sets:

Arguments:
    cache (str or None, optional):
        The name of the cache. It will be saved to `cache` in the current
        working directory. If None, then no cache will be saved. Defaults to
        '.dataset_cache'.

Attributes:
    cache (str or None): The name of the cache.
    shape (tuple of integers): Dimensions of the data set
    columns (list of strings): List of column names in the data set