doubt.datasets.facebook_comments

Facebook comments data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

"""Facebook comments data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io
import zipfile

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class FacebookComments(BaseDataset):
    __doc__ = f"""
    Instances in this dataset contain features extracted from Facebook posts. The task
    associated with the data is to predict how many comments the post will receive.

    {BASE_DATASET_DESCRIPTION}

    Features:
        page_popularity (int):
            Defines the popularity of support for the source of the document
        page_checkins (int):
            Describes how many individuals so far visited this place. This feature is
            only associated with places; e.g., some institution, place, theater, etc.
        page_talking_about (int):
            Defines the daily interest of individuals towards source of the
            document/post. The people who actually come back to the page, after liking
            the page. This includes activities such as comments, likes to a post,
            shares etc., by visitors to the page
        page_category (int):
            Defines the category of the source of the document; e.g., place,
            institution, branch etc.
        agg[n] for n=0..24 (float):
            These features are aggregated by page, by calculating min, max, average,
            median and standard deviation of essential features
        cc1 (int):
            The total number of comments before selected base date/time
        cc2 (int):
            The number of comments in the last 24 hours, relative to base date/time
        cc3 (int):
            The number of comments in the last 48 to last 24 hours relative to base
            date/time
        cc4 (int):
            The number of comments in the first 24 hours after the publication of post
            but before base date/time
        cc5 (int):
            The difference between cc2 and cc3
        base_time (int):
            Selected time in order to simulate the scenario, ranges from 0 to 71
        post_length (int):
            Character count in the post
        post_share_count (int):
            This feature counts the number of shares of the post, how many people had
            shared this post onto their timeline
        post_promotion_status (int):
            Binary feature. To reach more people with posts in News Feed, individuals
            can promote their post and this feature indicates whether the post is
            promoted or not
        h_local (int):
            This describes the hours for which we have received the target
            variable/comments. Ranges from 0 to 23
        day_published[n] for n=0..6 (int):
            Binary feature. This represents the day (Sunday-Saturday) on which the post
            was published
        day[n] for n=0..6 (int):
            Binary feature. This represents the day (Sunday-Saturday) on selected base
            date/time

    Targets:
        ncomments (int):
            The number of comments in the next `h_local` hours

    Source:
        https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset

    Examples:
        Load in the data set::

            >>> dataset = FacebookComments()
            >>> dataset.shape
            (199030, 54)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((199030, 54), (199030,))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((159211, 54), (159211,), (39819, 54), (39819,))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/" "00363/Dataset.zip"
    )

    _features = range(54)
    _targets = [53]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """

        # Convert the bytes into a file-like object
        buffer = io.BytesIO(data)

        # Unzip the file and pull out the csv file
        with zipfile.ZipFile(buffer, "r") as zip_file:
            csv = zip_file.read("Dataset/Training/Features_Variant_5.csv")

        # Convert the csv bytes into a file-like object
        csv_file = io.BytesIO(csv)

        # Name the columns
        cols = (
            ["page_popularity", "page_checkins", "page_talking_about", "page_category"]
            + [f"agg{n}" for n in range(25)]
            + [
                "cc1",
                "cc2",
                "cc3",
                "cc4",
                "cc5",
                "base_time",
                "post_length",
                "post_share_count",
                "post_promotion_status",
                "h_local",
            ]
            + [f"day_published{n}" for n in range(7)]
            + [f"day{n}" for n in range(7)]
            + ["ncomments"]
        )

        # Read the file-like object into a dataframe
        df = pd.read_csv(csv_file, header=None, names=cols)
        return df
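
The unzip-then-parse flow in `_prep_data` can be tried without downloading anything. The following is a minimal sketch, not the library's implementation: it builds a tiny zip archive in memory and parses its headerless CSV member with the standard-library `csv` module instead of pandas, so that it has no third-party dependencies. The member name `Dataset/Training/toy.csv`, the three-column layout and the helper names `build_toy_zip` and `read_rows` are assumptions made up for illustration.

```python
import csv
import io
import zipfile

# Hypothetical member name and column subset, standing in for the real archive.
MEMBER = "Dataset/Training/toy.csv"
COLUMNS = ["cc1", "cc2", "ncomments"]


def build_toy_zip() -> bytes:
    """Create a small zip archive in memory with one headerless CSV member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zip_file:
        zip_file.writestr(MEMBER, "10,3,1\n25,7,4\n")
    return buffer.getvalue()


def read_rows(data: bytes) -> list:
    """Unzip raw bytes and parse the CSV member into named integer records."""
    # Wrap the raw bytes in a file-like object so zipfile can open them
    with zipfile.ZipFile(io.BytesIO(data), "r") as zip_file:
        raw_csv = zip_file.read(MEMBER)

    # The member has no header row, so the column names are supplied here
    reader = csv.reader(io.StringIO(raw_csv.decode("utf-8")))
    return [dict(zip(COLUMNS, map(int, row))) for row in reader]


rows = read_rows(build_toy_zip())
print(rows[0])
```

Supplying the names separately mirrors the `header=None, names=cols` arguments passed to `pd.read_csv` in the listing above, since the CSV file is read there without a header row.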
class FacebookComments(doubt.datasets.dataset.BaseDataset):

Instances in this dataset contain features extracted from Facebook posts. The task associated with the data is to predict how many comments the post will receive.

Arguments:
  • cache (str or None, optional): The name of the cache, which is saved under that name in the current working directory. If None, no cache is saved. Defaults to '.dataset_cache'.
Attributes:
  • cache (str or None): The name of the cache.
  • shape (tuple of integers): Dimensions of the data set
  • columns (list of strings): List of column names in the data set
Features:

  • page_popularity (int): Defines the popularity of support for the source of the document.
  • page_checkins (int): Describes how many individuals so far visited this place. This feature is only associated with places; e.g., some institution, place, theater, etc.
  • page_talking_about (int): Defines the daily interest of individuals towards source of the document/post. The people who actually come back to the page, after liking the page. This includes activities such as comments, likes to a post, shares etc., by visitors to the page.
  • page_category (int): Defines the category of the source of the document; e.g., place, institution, branch etc.
  • agg[n] for n=0..24 (float): These features are aggregated by page, by calculating min, max, average, median and standard deviation of essential features.
  • cc1 (int): The total number of comments before selected base date/time.
  • cc2 (int): The number of comments in the last 24 hours, relative to base date/time.
  • cc3 (int): The number of comments in the last 48 to last 24 hours relative to base date/time.
  • cc4 (int): The number of comments in the first 24 hours after the publication of post but before base date/time.
  • cc5 (int): The difference between cc2 and cc3.
  • base_time (int): Selected time in order to simulate the scenario, ranges from 0 to 71.
  • post_length (int): Character count in the post.
  • post_share_count (int): This feature counts the number of shares of the post, how many people had shared this post onto their timeline.
  • post_promotion_status (int): Binary feature. To reach more people with posts in News Feed, individuals can promote their post and this feature indicates whether the post is promoted or not.
  • h_local (int): This describes the hours for which we have received the target variable/comments. Ranges from 0 to 23.
  • day_published[n] for n=0..6 (int): Binary feature. This represents the day (Sunday-Saturday) on which the post was published.
  • day[n] for n=0..6 (int): Binary feature. This represents the day (Sunday-Saturday) on selected base date/time.
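
To see how the feature groups listed above map onto column positions, the column names can be rebuilt in the same order as in the module source and indexed. This is only an illustrative sketch; the `index_of` lookup is a hypothetical helper and not part of the library.

```python
# Rebuild the column names in the order used by the module source.
columns = (
    ["page_popularity", "page_checkins", "page_talking_about", "page_category"]
    + [f"agg{n}" for n in range(25)]
    + [
        "cc1",
        "cc2",
        "cc3",
        "cc4",
        "cc5",
        "base_time",
        "post_length",
        "post_share_count",
        "post_promotion_status",
        "h_local",
    ]
    + [f"day_published{n}" for n in range(7)]
    + [f"day{n}" for n in range(7)]
    + ["ncomments"]
)

# Hypothetical helper: map each column name to its positional index.
index_of = {name: position for position, name in enumerate(columns)}

print(len(columns))                 # 54
print(index_of["cc1"])              # 29
print(index_of["day_published0"])   # 39
print(index_of["ncomments"])        # 53
```

The count of 54 agrees with the second dimension of `dataset.shape`, and the target `ncomments` sits at index 53, matching `_targets = [53]` in the class.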

Targets:

  • ncomments (int): The number of comments in the next h_local hours

Source:

https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset

Examples:

Load in the data set:

>>> dataset = FacebookComments()
>>> dataset.shape
(199030, 54)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((199030, 54), (199030,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((159211, 54), (159211,), (39819, 54), (39819,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
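
Once a row is available as a mapping from column name to value, the binary day indicators can be turned back into a weekday and the stated relation between cc2, cc3 and cc5 can be checked. The sketch below uses a made-up record rather than real data. It assumes that index 0 corresponds to Sunday, following the "Sunday-Saturday" ordering in the feature description, and that exactly one flag per group is set; `decode_day` and `WEEKDAYS` are hypothetical names.

```python
# Assumption: indicator n=0 is Sunday and n=6 is Saturday.
WEEKDAYS = [
    "Sunday",
    "Monday",
    "Tuesday",
    "Wednesday",
    "Thursday",
    "Friday",
    "Saturday",
]


def decode_day(record: dict, prefix: str) -> str:
    """Return the weekday whose binary indicator `prefix{n}` equals 1."""
    flags = [record[f"{prefix}{n}"] for n in range(7)]
    if sum(flags) != 1:
        raise ValueError(f"expected exactly one active {prefix} flag, got {flags}")
    return WEEKDAYS[flags.index(1)]


# Made-up record for illustration only, not taken from the data set.
record = {"cc2": 12, "cc3": 5, "cc5": 7}
record.update({f"day_published{n}": int(n == 2) for n in range(7)})
record.update({f"day{n}": int(n == 4) for n in range(7)})

published = decode_day(record, "day_published")
base = decode_day(record, "day")

# cc5 is described as the difference between cc2 and cc3
cc5_matches = record["cc5"] == record["cc2"] - record["cc3"]

print(published, base, cc5_matches)
```

A row taken from `dataset.to_pandas()` and converted with `to_dict()` should be usable as `record` in the same way, although whether every real row satisfies these checks is not confirmed here.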