doubt.datasets.blog

Blog post data set.

This data set is from the UCI data set archive, and the description below is the original description, reproduced verbatim. Some feature names may have been altered, based on the description.

  1"""Blog post data set.
  2
  3This data set is from the UCI data set archive, with the description being the original
  4description verbatim. Some feature names may have been altered, based on the
  5description.
  6"""
  7
  8import io
  9import zipfile
 10
 11import pandas as pd
 12
 13from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset
 14
 15
 16class Blog(BaseDataset):
 17    __doc__ = f"""
 18    This data originates from blog posts. The raw HTML-documents of the blog posts were
 19    crawled and processed. The prediction task associated with the data is the
 20    prediction of the number of comments in the upcoming 24 hours. In order to simulate
 21    this situation, we choose a basetime (in the past) and select the blog posts that
 22    were published at most 72 hours before the selected base date/time. Then, we
 23    calculate all the features of the selected blog posts from the information that was
 24    available at the basetime, therefore each instance corresponds to a blog post. The
 25    target is the number of comments that the blog post received in the next 24 hours
 26    relative to the basetime.
 27
 28    In the train data, the basetimes were in the years 2010 and 2011. In the test data
 29    the basetimes were in February and March 2012. This simulates the real-world
 30    situtation in which training data from the past is available to predict events in
 31    the future.
 32
 33    The train data was generated from different basetimes that may temporally overlap.
 34    Therefore, if you simply split the train into disjoint partitions, the underlying
 35    time intervals may overlap. Therefore, the you should use the provided, temporally
 36    disjoint train and test splits in order to ensure that the evaluation is fair.
 37
 38    {BASE_DATASET_DESCRIPTION}
 39
 40    Features:
 41        Features 0-49 (float):
 42            50 features containing the average, standard deviation, minimum, maximum
 43            and median of feature 50-59 for the source of the current blog post, by
 44            which we mean the blog on which the post appeared. For example,
 45            myblog.blog.org would be the source of the post
 46            myblog.blog.org/post_2010_09_10
 47        Feature 50 (int):
 48            Total number of comments before basetime
 49        Feature 51 (int):
 50            Number of comments in the last 24 hours before the basetime
 51        Feature 52 (int):
 52            If T1 is the datetime 48 hours before basetime and T2 is the datetime 24
 53            hours before basetime, then this is the number of comments in the time
 54            period between T1 and T2
 55        Feature 53 (int):
 56            Number of comments in the first 24 hours after the publication of the blog
 57            post, but before basetime
 58        Feature 54 (int):
 59            The difference between Feature 51 and Feature 52
 60        Features 55-59 (int):
 61            The same thing as Features 50-51, but for links (trackbacks) instead of
 62            comments
 63        Feature 60 (float):
 64            The length of time between the publication of the blog post and basetime
 65        Feature 61 (int):
 66            The length of the blog post
 67        Features 62-261 (int):
 68            The 200 bag of words features for 200 frequent words of the text of the
 69            blog post
 70        Features 262-268 (int):
 71            Binary indicators for the weekday (Monday-Sunday) of the basetime
 72        Features 269-275 (int):
 73            Binary indicators for the weekday (Monday-Sunday) of the date of
 74            publication of the blog post
 75        Feature 276 (int):
 76            Number of parent pages: we consider a blog post P as a parent of blog post
 77            B if B is a reply (trackback) to P
 78        Features 277-279 (float):
 79            Minimum, maximum and average of the number of comments the parents received
 80
 81    Targets:
 82        int:
 83            The number of comments in the next 24 hours (relative to baseline)
 84
 85    Source:
 86        https://archive.ics.uci.edu/ml/datasets/BlogFeedback
 87
 88    Examples:
 89        Load in the data set::
 90
 91            >>> dataset = Blog()
 92            >>> dataset.shape
 93            (52397, 281)
 94
 95        Split the data set into features and targets, as NumPy arrays::
 96
 97            >>> X, y = dataset.split()
 98            >>> X.shape, y.shape
 99            ((52397, 279), (52397,))
100
101        Perform a train/test split, also outputting NumPy arrays::
102
103            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
104            >>> X_train, X_test, y_train, y_test = train_test_split
105            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
106            ((41949, 279), (41949,), (10448, 279), (10448,))
107
108        Output the underlying Pandas DataFrame::
109
110            >>> df = dataset.to_pandas()
111            >>> type(df)
112            <class 'pandas.core.frame.DataFrame'>
113    """
114
115    _url = (
116        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
117        "00304/BlogFeedback.zip"
118    )
119
120    _features = range(279)
121    _targets = [279]
122
123    def _prep_data(self, data: bytes) -> pd.DataFrame:
124        """Prepare the data set.
125
126        Args:
127            data (bytes): The raw data
128
129        Returns:
130            Pandas dataframe: The prepared data
131        """
132        # Convert the bytes into a file-like object
133        buffer = io.BytesIO(data)
134
135        # Unzip the file and pull out blogData_train.csv as a string
136        with zipfile.ZipFile(buffer, "r") as zip_file:
137            csv = zip_file.read("blogData_train.csv").decode("utf-8")
138
139        # Convert the string into a file-like object
140        csv_file = io.StringIO(csv)
141
142        # Read the file-like object into a dataframe
143        df = pd.read_csv(csv_file, header=None)
144        return df
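Note that _prep_data only receives raw bytes; downloading those bytes from _url is presumably handled by BaseDataset. A standalone sketch of the same pipeline, using urllib for the download step (the URL and file name are taken from the source above)::

    import io
    import urllib.request
    import zipfile

    import pandas as pd

    url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "00304/BlogFeedback.zip"
    )

    # Fetch the raw zip bytes, mirroring what BaseDataset presumably does
    # before calling _prep_data
    data = urllib.request.urlopen(url).read()

    # Pull blogData_train.csv out of the archive and parse it, exactly as
    # _prep_data does above
    with zipfile.ZipFile(io.BytesIO(data), "r") as zip_file:
        csv = zip_file.read("blogData_train.csv").decode("utf-8")

    df = pd.read_csv(io.StringIO(csv), header=None)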
class Blog(doubt.datasets.dataset.BaseDataset):

This data originates from blog posts. The raw HTML documents of the blog posts were crawled and processed. The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours. In order to simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then, we calculate all the features of the selected blog posts from the information that was available at the basetime; each instance therefore corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the basetime.

In the train data, the basetimes were in the years 2010 and 2011. In the test data the basetimes were in February and March 2012. This simulates the real-world situation in which training data from the past is available to predict events in the future.

The train data was generated from different basetimes that may temporally overlap. If you simply split the train data into disjoint partitions, the underlying time intervals may therefore overlap. You should instead use the provided, temporally disjoint train and test splits in order to ensure that the evaluation is fair.
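This class reads blogData_train.csv from the downloaded archive. A minimal sketch of how one might assemble the official, temporally disjoint test split from the same archive — the blogData_test-*.csv naming is taken from the UCI distribution and should be verified against the actual zip contents::

    import io
    import zipfile

    import pandas as pd

    # Assumes BlogFeedback.zip has already been downloaded from the source URL
    with zipfile.ZipFile("BlogFeedback.zip", "r") as zip_file:
        # One test file per basetime, named blogData_test-*.csv in the
        # UCI distribution of the archive
        test_names = sorted(
            name for name in zip_file.namelist() if name.startswith("blogData_test")
        )
        test_df = pd.concat(
            [
                pd.read_csv(
                    io.StringIO(zip_file.read(name).decode("utf-8")), header=None
                )
                for name in test_names
            ],
            ignore_index=True,
        )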

Arguments:
  • cache (str or None, optional): The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.

Attributes:
  • cache (str or None): The name of the cache.
  • shape (tuple of integers): Dimensions of the data set.
  • columns (list of strings): List of column names in the data set.
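Following the cache argument above, caching can for instance be disabled entirely, at the cost of re-downloading the data on every instantiation::

    from doubt.datasets.blog import Blog

    # No cache file is written to the working directory
    dataset = Blog(cache=None)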
Features:
  • Features 0-49 (float): 50 features containing the average, standard deviation, minimum, maximum and median of features 50-59 for the source of the current blog post, by which we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10.
  • Feature 50 (int): Total number of comments before basetime.
  • Feature 51 (int): Number of comments in the last 24 hours before the basetime.
  • Feature 52 (int): If T1 is the datetime 48 hours before basetime and T2 is the datetime 24 hours before basetime, then this is the number of comments in the time period between T1 and T2.
  • Feature 53 (int): Number of comments in the first 24 hours after the publication of the blog post, but before basetime.
  • Feature 54 (int): The difference between Feature 51 and Feature 52.
  • Features 55-59 (int): The same thing as Features 50-54, but for links (trackbacks) instead of comments.
  • Feature 60 (float): The length of time between the publication of the blog post and basetime.
  • Feature 61 (int): The length of the blog post.
  • Features 62-261 (int): The 200 bag-of-words features for 200 frequent words of the text of the blog post.
  • Features 262-268 (int): Binary indicators for the weekday (Monday-Sunday) of the basetime.
  • Features 269-275 (int): Binary indicators for the weekday (Monday-Sunday) of the date of publication of the blog post.
  • Feature 276 (int): Number of parent pages: we consider a blog post P as a parent of blog post B if B is a reply (trackback) to P.
  • Features 277-279 (float): Minimum, maximum and average of the number of comments the parents received.
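The zero-indexed feature numbers above line up with the column positions of the feature matrix returned by split(), so feature groups can be sliced out directly. A sketch; the grouping comments restate the list above::

    from doubt.datasets.blog import Blog

    X, y = Blog().split()

    source_stats = X[:, 0:50]         # aggregate statistics of the source blog
    comment_counts = X[:, 50:55]      # comment counts around the basetime
    bag_of_words = X[:, 62:262]       # 200 bag-of-words features
    basetime_weekday = X[:, 262:269]  # weekday indicators for the basetime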

Targets:
  • int: The number of comments in the next 24 hours (relative to basetime).

Source:

https://archive.ics.uci.edu/ml/datasets/BlogFeedback

Examples:

Load in the data set::

    >>> dataset = Blog()
    >>> dataset.shape
    (52397, 281)

Split the data set into features and targets, as NumPy arrays::

    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((52397, 279), (52397,))

Perform a train/test split, also outputting NumPy arrays::

    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((41949, 279), (41949,), (10448, 279), (10448,))

Output the underlying Pandas DataFrame::

    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>
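As a final illustration of downstream use, a regressor can be fitted on the split arrays. A minimal sketch; scikit-learn is assumed here and is not part of this package, and the temporal-overlap caveat above applies to random splits::

    from sklearn.linear_model import LinearRegression

    from doubt.datasets.blog import Blog

    # Random 80/20 split, as in the example above; a temporally disjoint
    # split is preferable for a fair evaluation
    X_train, X_test, y_train, y_test = Blog().split(test_size=0.2, random_seed=42)

    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))  # R^2 on the held-out portion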