doubt.datasets.blog
Blog post data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
1"""Blog post data set. 2 3This data set is from the UCI data set archive, with the description being the original 4description verbatim. Some feature names may have been altered, based on the 5description. 6""" 7 8import io 9import zipfile 10 11import pandas as pd 12 13from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset 14 15 16class Blog(BaseDataset): 17 __doc__ = f""" 18 This data originates from blog posts. The raw HTML-documents of the blog posts were 19 crawled and processed. The prediction task associated with the data is the 20 prediction of the number of comments in the upcoming 24 hours. In order to simulate 21 this situation, we choose a basetime (in the past) and select the blog posts that 22 were published at most 72 hours before the selected base date/time. Then, we 23 calculate all the features of the selected blog posts from the information that was 24 available at the basetime, therefore each instance corresponds to a blog post. The 25 target is the number of comments that the blog post received in the next 24 hours 26 relative to the basetime. 27 28 In the train data, the basetimes were in the years 2010 and 2011. In the test data 29 the basetimes were in February and March 2012. This simulates the real-world 30 situtation in which training data from the past is available to predict events in 31 the future. 32 33 The train data was generated from different basetimes that may temporally overlap. 34 Therefore, if you simply split the train into disjoint partitions, the underlying 35 time intervals may overlap. Therefore, the you should use the provided, temporally 36 disjoint train and test splits in order to ensure that the evaluation is fair. 37 38 {BASE_DATASET_DESCRIPTION} 39 40 Features: 41 Features 0-49 (float): 42 50 features containing the average, standard deviation, minimum, maximum 43 and median of feature 50-59 for the source of the current blog post, by 44 which we mean the blog on which the post appeared. 
For example, 45 myblog.blog.org would be the source of the post 46 myblog.blog.org/post_2010_09_10 47 Feature 50 (int): 48 Total number of comments before basetime 49 Feature 51 (int): 50 Number of comments in the last 24 hours before the basetime 51 Feature 52 (int): 52 If T1 is the datetime 48 hours before basetime and T2 is the datetime 24 53 hours before basetime, then this is the number of comments in the time 54 period between T1 and T2 55 Feature 53 (int): 56 Number of comments in the first 24 hours after the publication of the blog 57 post, but before basetime 58 Feature 54 (int): 59 The difference between Feature 51 and Feature 52 60 Features 55-59 (int): 61 The same thing as Features 50-51, but for links (trackbacks) instead of 62 comments 63 Feature 60 (float): 64 The length of time between the publication of the blog post and basetime 65 Feature 61 (int): 66 The length of the blog post 67 Features 62-261 (int): 68 The 200 bag of words features for 200 frequent words of the text of the 69 blog post 70 Features 262-268 (int): 71 Binary indicators for the weekday (Monday-Sunday) of the basetime 72 Features 269-275 (int): 73 Binary indicators for the weekday (Monday-Sunday) of the date of 74 publication of the blog post 75 Feature 276 (int): 76 Number of parent pages: we consider a blog post P as a parent of blog post 77 B if B is a reply (trackback) to P 78 Features 277-279 (float): 79 Minimum, maximum and average of the number of comments the parents received 80 81 Targets: 82 int: 83 The number of comments in the next 24 hours (relative to baseline) 84 85 Source: 86 https://archive.ics.uci.edu/ml/datasets/BlogFeedback 87 88 Examples: 89 Load in the data set:: 90 91 >>> dataset = Blog() 92 >>> dataset.shape 93 (52397, 281) 94 95 Split the data set into features and targets, as NumPy arrays:: 96 97 >>> X, y = dataset.split() 98 >>> X.shape, y.shape 99 ((52397, 279), (52397,)) 100 101 Perform a train/test split, also outputting NumPy arrays:: 102 103 >>> 
train_test_split = dataset.split(test_size=0.2, random_seed=42) 104 >>> X_train, X_test, y_train, y_test = train_test_split 105 >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape 106 ((41949, 279), (41949,), (10448, 279), (10448,)) 107 108 Output the underlying Pandas DataFrame:: 109 110 >>> df = dataset.to_pandas() 111 >>> type(df) 112 <class 'pandas.core.frame.DataFrame'> 113 """ 114 115 _url = ( 116 "https://archive.ics.uci.edu/ml/machine-learning-databases/" 117 "00304/BlogFeedback.zip" 118 ) 119 120 _features = range(279) 121 _targets = [279] 122 123 def _prep_data(self, data: bytes) -> pd.DataFrame: 124 """Prepare the data set. 125 126 Args: 127 data (bytes): The raw data 128 129 Returns: 130 Pandas dataframe: The prepared data 131 """ 132 # Convert the bytes into a file-like object 133 buffer = io.BytesIO(data) 134 135 # Unzip the file and pull out blogData_train.csv as a string 136 with zipfile.ZipFile(buffer, "r") as zip_file: 137 csv = zip_file.read("blogData_train.csv").decode("utf-8") 138 139 # Convert the string into a file-like object 140 csv_file = io.StringIO(csv) 141 142 # Read the file-like object into a dataframe 143 df = pd.read_csv(csv_file, header=None) 144 return df
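The parsing logic in _prep_data can be exercised without downloading anything by building a small in-memory zip with the same member name. This is a self-contained sketch, not part of the library itself:

```python
import io
import zipfile

import pandas as pd

# Build a small zip in memory containing a headerless CSV named like the
# member that _prep_data extracts from BlogFeedback.zip
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_file:
    zip_file.writestr("blogData_train.csv", "1,2,3\n4,5,6\n")
buffer.seek(0)

# Mirror _prep_data: read the member, decode it, and parse it with pandas
with zipfile.ZipFile(buffer, "r") as zip_file:
    csv = zip_file.read("blogData_train.csv").decode("utf-8")
df = pd.read_csv(io.StringIO(csv), header=None)
```

Because the real CSV has no header row, `header=None` is essential; otherwise the first data row would be consumed as column names.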
This data originates from blog posts. The raw HTML documents of the blog posts were crawled and processed. The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours. In order to simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then, we calculate all the features of the selected blog posts from the information that was available at the basetime, so each instance corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the basetime.
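As a hypothetical illustration of this construction (the timestamps and column names below are invented; the released data set ships only the derived features), the 72-hour eligibility rule amounts to:

```python
import pandas as pd

# Hypothetical illustration of the instance-construction rule: given a
# basetime, a post is selected only if it was published at most 72 hours
# earlier. Timestamps and column names are invented for this sketch.
basetime = pd.Timestamp("2011-06-10 12:00")
posts = pd.DataFrame(
    {
        "published": pd.to_datetime(
            [
                "2011-06-09 12:00",  # 24 hours before basetime: selected
                "2011-06-05 12:00",  # 120 hours before basetime: too old
            ]
        )
    }
)

# Keep only posts published at most 72 hours before the basetime
eligible = posts[basetime - posts["published"] <= pd.Timedelta(hours=72)]
```

The target for each eligible post would then be the count of its comments falling in the 24 hours after the basetime.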
In the train data, the basetimes were in the years 2010 and 2011. In the test data the basetimes were in February and March 2012. This simulates the real-world situation in which training data from the past is available to predict events in the future.
The train data was generated from different basetimes that may temporally overlap. Therefore, if you simply split the train data into disjoint partitions, the underlying time intervals may overlap. You should thus use the provided, temporally disjoint train and test splits to ensure that the evaluation is fair.
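A minimal sketch of such a temporally disjoint split, using an invented `basetime` column (the released data encodes temporal information as derived features rather than raw timestamps):

```python
import pandas as pd

# Synthetic stand-in: each row is one blog-post instance with its basetime
df = pd.DataFrame(
    {
        "basetime": pd.to_datetime(
            [
                "2010-05-01", "2010-11-15", "2011-03-20",
                "2011-08-09", "2012-02-03", "2012-03-28",
            ]
        ),
        "n_comments": [4, 0, 12, 7, 3, 9],
    }
)

# Split on a date cutoff rather than at random, mirroring the official
# train (2010-2011) / test (February-March 2012) partition
cutoff = pd.Timestamp("2012-01-01")
train = df[df["basetime"] < cutoff]
test = df[df["basetime"] >= cutoff]
```

A random split of the same rows could place two instances whose underlying time intervals overlap on opposite sides of the split, leaking future information into training; the cutoff guarantees every training basetime precedes every test basetime.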
Arguments:
- cache (str or None, optional): The name of the cache. It will be saved to `cache` in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
- cache (str or None): The name of the cache.
- shape (tuple of integers): Dimensions of the data set
- columns (list of strings): List of column names in the data set
Features:
- Features 0-49 (float): 50 features containing the average, standard deviation, minimum, maximum and median of features 50-59 for the source of the current blog post, by which we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10
- Feature 50 (int): Total number of comments before basetime
- Feature 51 (int): Number of comments in the last 24 hours before the basetime
- Feature 52 (int): If T1 is the datetime 48 hours before basetime and T2 is the datetime 24 hours before basetime, then this is the number of comments in the time period between T1 and T2
- Feature 53 (int): Number of comments in the first 24 hours after the publication of the blog post, but before basetime
- Feature 54 (int): The difference between Feature 51 and Feature 52
- Features 55-59 (int): The same as Features 50-54, but for links (trackbacks) instead of comments
- Feature 60 (float): The length of time between the publication of the blog post and basetime
- Feature 61 (int): The length of the blog post
- Features 62-261 (int): The 200 bag-of-words features for 200 frequent words of the text of the blog post
- Features 262-268 (int): Binary indicators for the weekday (Monday-Sunday) of the basetime
- Features 269-275 (int): Binary indicators for the weekday (Monday-Sunday) of the date of publication of the blog post
- Feature 276 (int): Number of parent pages: we consider a blog post P as a parent of blog post B if B is a reply (trackback) to P
- Features 277-279 (float): Minimum, maximum and average of the number of comments the parents received
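To make the first block of features concrete, here is a sketch of how such per-source statistics could be computed; the two-column frame is invented for illustration, since the real data already ships the aggregated values:

```python
import pandas as pd

# Invented post-level data: the source is the blog a post appeared on, and
# feature_50 stands in for one of the per-post counts (features 50-59)
posts = pd.DataFrame(
    {
        "source": ["myblog.blog.org", "myblog.blog.org", "otherblog.org"],
        "feature_50": [10, 30, 5],
    }
)

# Features 0-49 are these five statistics of features 50-59, per source
stats = posts.groupby("source")["feature_50"].agg(
    ["mean", "std", "min", "max", "median"]
)
```

Every post from the same source then shares the same 50 aggregate values, which is why they describe the blog rather than the individual post.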
Targets:
int: The number of comments in the next 24 hours (relative to basetime)
Source:
https://archive.ics.uci.edu/ml/datasets/BlogFeedback
Examples:
Load in the data set::
>>> dataset = Blog()
>>> dataset.shape
(52397, 281)
Split the data set into features and targets, as NumPy arrays::
>>> X, y = dataset.split()
>>> X.shape, y.shape
((52397, 279), (52397,))
Perform a train/test split, also outputting NumPy arrays::
>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((41949, 279), (41949,), (10448, 279), (10448,))
Output the underlying Pandas DataFrame::
>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>