doubt.datasets.power_plant

Power plant data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

"""Power plant data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io
import zipfile

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class PowerPlant(BaseDataset):
    __doc__ = f"""
    The dataset contains 9568 data points collected from a Combined Cycle Power Plant
    over 6 years (2006-2011), when the power plant was set to work with full load.
    Features consist of hourly average ambient variables Temperature (AT), Ambient
    Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net
    hourly electrical energy output (PE) of the plant.

    A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam
    turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is
    generated by gas and steam turbines, which are combined in one cycle, and is
    transferred from one turbine to another. While the Vacuum is collected from and
    has an effect on the Steam Turbine, the other three ambient variables affect the
    GT performance.

    For comparability with our baseline studies, and to allow 5x2 fold statistical
    tests to be carried out, we provide the data shuffled five times. For each
    shuffling, 2-fold CV is carried out and the resulting 10 measurements are used
    for statistical testing.

    {BASE_DATASET_DESCRIPTION}

    Features:
        AT (float):
            Hourly average temperature in Celsius, ranges from 1.81 to 37.11
        V (float):
            Hourly average exhaust vacuum in cm Hg, ranges from 25.36 to 81.56
        AP (float):
            Hourly average ambient pressure in millibar, ranges from 992.89
            to 1033.30
        RH (float):
            Hourly average relative humidity in percent, ranges from 25.56 to 100.16

    Targets:
        PE (float):
            Net hourly electrical energy output in MW, ranges from 420.26 to 495.76

    Source:
        https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

    Examples:
        Load in the data set::

            >>> dataset = PowerPlant()
            >>> dataset.shape
            (9568, 5)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((9568, 4), (9568,))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((7633, 4), (7633,), (1935, 4), (1935,))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00294/CCPP.zip"

    # Column positions: the first four columns are features, the fifth is the target
    _features = range(4)
    _targets = [4]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            pd.DataFrame: The prepared data
        """

        # Convert the bytes into a file-like object
        buffer = io.BytesIO(data)

        # Unzip the file and pull out the xlsx file
        with zipfile.ZipFile(buffer, "r") as zip_file:
            xlsx = zip_file.read("CCPP/Folds5x2_pp.xlsx")

        # Convert the xlsx bytes into a file-like object
        xlsx_file = io.BytesIO(xlsx)

        # Read the file-like object into a dataframe (pandas reads the first sheet
        # of the workbook by default)
        df = pd.read_excel(xlsx_file)
        return df

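
The byte handling in _prep_data can be tried without downloading anything. The following is a minimal sketch of the same pattern (raw bytes, then a file-like object, then one archive member) using only the standard library. The archive contents and the member name "example/data.txt" are made up for illustration and are not part of the real CCPP.zip file.

```python
import io
import zipfile

# Build a small zip archive in memory. The member name and its contents
# are hypothetical stand-ins for the downloaded CCPP.zip bytes.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zip_file:
    zip_file.writestr("example/data.txt", "AT,V,AP,RH,PE\n")
raw_bytes = archive.getvalue()

# Same pattern as in _prep_data: wrap the bytes in a file-like object,
# open it as a zip archive and read one member as bytes.
with zipfile.ZipFile(io.BytesIO(raw_bytes), "r") as zip_file:
    member = zip_file.read("example/data.txt")

# The member bytes can again be wrapped for any reader expecting a file
member_file = io.BytesIO(member)
print(member_file.read().decode("utf-8"))
```

Wrapping the extracted bytes in io.BytesIO a second time is what allows a reader such as pd.read_excel to consume them without writing a temporary file to disk.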
class PowerPlant(doubt.datasets.dataset.BaseDataset):

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (PE) of the plant.

A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the Steam Turbine, the other three ambient variables affect the GT performance.

For comparability with our baseline studies, and to allow 5x2 fold statistical tests to be carried out, we provide the data shuffled five times. For each shuffling, 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.
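
As an illustration of the 5x2 scheme described above, here is a minimal sketch that generates the ten train/test index pairs: five shuffles, each split into two halves that swap roles. The function name five_by_two_splits and its seed argument are hypothetical and not part of doubt, and the shuffles produced here are not the ones stored in the original data file.

```python
import random


def five_by_two_splits(n_samples, seed=0):
    """Yield (train_indices, test_indices) pairs for 5x2 cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    half = n_samples // 2
    for _ in range(5):
        # One of the five shufflings
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        # 2-fold CV: each half is used once for training and once for testing
        yield first, second
        yield second, first


splits = list(five_by_two_splits(9568))
print(len(splits))  # 10
```

Each of the ten pairs gives one performance measurement, and those ten measurements are what a 5x2 statistical test compares between models.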

Arguments:
  • cache (str or None, optional): The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
  • cache (str or None): The name of the cache.
  • shape (tuple of integers): Dimensions of the data set
  • columns (list of strings): List of column names in the data set
Features:
  • AT (float): Hourly average temperature in Celsius, ranges from 1.81 to 37.11
  • V (float): Hourly average exhaust vacuum in cm Hg, ranges from 25.36 to 81.56
  • AP (float): Hourly average ambient pressure in millibar, ranges from 992.89 to 1033.30
  • RH (float): Hourly average relative humidity in percent, ranges from 25.56 to 100.16
Targets:
  • PE (float): Net hourly electrical energy output in MW, ranges from 420.26 to 495.76

Source:

https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Examples:

Load in the data set::

    >>> dataset = PowerPlant()
    >>> dataset.shape
    (9568, 5)

Split the data set into features and targets, as NumPy arrays::

    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((9568, 4), (9568,))

Perform a train/test split, also outputting NumPy arrays::

    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((7633, 4), (7633,), (1935, 4), (1935,))

Output the underlying Pandas DataFrame::

    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>