doubt.datasets.power_plant
Power plant data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
"""Power plant data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io
import zipfile

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class PowerPlant(BaseDataset):
    __doc__ = f"""
    The dataset contains 9568 data points collected from a Combined Cycle Power Plant
    over 6 years (2006-2011), when the power plant was set to work with full load.
    Features consist of hourly average ambient variables Temperature (AT), Ambient
    Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net
    hourly electrical energy output (PE) of the plant.

    A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam
    turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is
    generated by gas and steam turbines, which are combined in one cycle, and is
    transferred from one turbine to another. While the Vacuum is collected from and
    has an effect on the Steam Turbine, the other three ambient variables affect the
    GT performance.

    For comparability with our baseline studies, and to allow 5x2 fold statistical
    tests to be carried out, we provide the data shuffled five times. For each
    shuffling, 2-fold CV is carried out and the resulting 10 measurements are used
    for statistical testing.

    {BASE_DATASET_DESCRIPTION}

    Features:
        AT (float):
            Hourly average temperature in Celsius, ranges from 1.81 to 37.11
        V (float):
            Hourly average exhaust vacuum in cm Hg, ranges from 25.36 to 81.56
        AP (float):
            Hourly average ambient pressure in millibar, ranges from 992.89
            to 1033.30
        RH (float):
            Hourly average relative humidity in percent, ranges from 25.56 to 100.16

    Targets:
        PE (float):
            Net hourly electrical energy output in MW, ranges from 420.26 to 495.76

    Source:
        https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

    Examples:
        Load in the data set::

            >>> dataset = PowerPlant()
            >>> dataset.shape
            (9568, 5)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((9568, 4), (9568,))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((7633, 4), (7633,), (1935, 4), (1935,))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00294/CCPP.zip"

    _features = range(4)
    _targets = [4]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """

        # Convert the bytes into a file-like object
        buffer = io.BytesIO(data)

        # Unzip the file and pull out the xlsx file
        with zipfile.ZipFile(buffer, "r") as zip_file:
            xlsx = zip_file.read("CCPP/Folds5x2_pp.xlsx")

        # Convert the xlsx bytes into a file-like object
        xlsx_file = io.BytesIO(xlsx)

        # Read the file-like object into a dataframe
        df = pd.read_excel(xlsx_file)
        return df
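The `_prep_data` method above never writes to disk: the downloaded bytes are wrapped in a file-like object, one archive member is read out, and that member is wrapped again for parsing. The following is a minimal sketch of the same pattern using only the standard library. The archive, the member name `demo/values.csv` and the two rows are made up for illustration, and a CSV member stands in for the xlsx file so that no Excel engine is needed.

```python
import csv
import io
import zipfile

# Build a tiny zip archive in memory so the example needs no network access.
# The member name and the two rows are made up for illustration.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as zip_file:
    zip_file.writestr("demo/values.csv", "AT,PE\n1.5,450.0\n2.5,460.0\n")
data = raw.getvalue()  # plays the role of the downloaded bytes

# Same pattern as in _prep_data: bytes -> file-like object -> archive member
buffer = io.BytesIO(data)
with zipfile.ZipFile(buffer, "r") as zip_file:
    member = zip_file.read("demo/values.csv")

# Decode the member and parse it; the real method instead wraps the xlsx
# member in io.BytesIO and hands it to pd.read_excel
rows = list(csv.DictReader(io.StringIO(member.decode("utf-8"))))
print(rows)
```

`ZipFile.read` returns the member as `bytes`, which is why a second in-memory wrapper is needed before a parser that expects a file-like object can consume it.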
The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. Features consist of hourly average ambient variables Temperature (AT), Ambient Pressure (AP), Relative Humidity (RH) and Exhaust Vacuum (V) to predict the net hourly electrical energy output (PE) of the plant.
A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the Vacuum is collected from and has an effect on the Steam Turbine, the other three ambient variables affect the GT performance.
For comparability with our baseline studies, and to allow 5x2 fold statistical tests to be carried out, we provide the data shuffled five times. For each shuffling, 2-fold CV is carried out and the resulting 10 measurements are used for statistical testing.
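The 5x2 procedure described above (five shufflings, 2-fold CV on each, ten measurements in total) can be sketched as follows. This is an illustration only: the data is synthetic rather than the power plant measurements, and the model (ordinary least squares), the metric (RMSE) and the helper name `fit_and_score` are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the power plant measurements): 200 rows,
# 4 features, and a noisy linear target.
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares with an intercept and return the test RMSE."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    residuals = y_test - A_test @ coef
    return float(np.sqrt(np.mean(residuals ** 2)))


scores = []
for _ in range(5):  # five independent shufflings
    order = rng.permutation(len(X))
    half = len(X) // 2
    fold_a, fold_b = order[:half], order[half:]
    # 2-fold CV: each half serves once as training set and once as test set
    scores.append(fit_and_score(X[fold_a], y[fold_a], X[fold_b], y[fold_b]))
    scores.append(fit_and_score(X[fold_b], y[fold_b], X[fold_a], y[fold_a]))

print(len(scores))
```

The ten values in `scores` are the measurements that a paired 5x2 statistical test would then compare between two models.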
Arguments:
- cache (str or None, optional): The name of the cache. It will be saved to `cache` in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
- cache (str or None): The name of the cache.
- shape (tuple of integers): Dimensions of the data set
- columns (list of strings): List of column names in the data set
Features:
AT (float): Hourly average temperature in Celsius, ranges from 1.81 to 37.11
V (float): Hourly average exhaust vacuum in cm Hg, ranges from 25.36 to 81.56
AP (float): Hourly average ambient pressure in millibar, ranges from 992.89 to 1033.30
RH (float): Hourly average relative humidity in percent, ranges from 25.56 to 100.16
Targets:
PE (float): Net hourly electrical energy output in MW, ranges from 420.26 to 495.76
Source:
https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
Examples:
Load in the data set::
    >>> from doubt.datasets.power_plant import PowerPlant
    >>> dataset = PowerPlant()
    >>> dataset.shape
    (9568, 5)
Split the data set into features and targets, as NumPy arrays::
    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((9568, 4), (9568,))
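The class stores the feature columns as `_features = range(4)` and the target column as `_targets = [4]`. A plausible reading is that `split()` selects columns by these positions; the sketch below illustrates that idea with NumPy. The rows are made-up values, and the flattening of a single target column is an assumption inferred from the `(9568,)` shape in the example above, not a statement about how `BaseDataset.split` is actually written.

```python
import numpy as np

# Made-up rows in the column order AT, V, AP, RH, PE (illustrative values only)
values = np.array([
    [10.0, 40.0, 1010.0, 70.0, 480.0],
    [20.0, 50.0, 1015.0, 60.0, 450.0],
    [30.0, 60.0, 1005.0, 50.0, 430.0],
])

features = list(range(4))  # mirrors _features = range(4)
targets = [4]              # mirrors _targets = [4]

X = values[:, features]
y = values[:, targets]

# With a single target column, flatten (n, 1) to (n,) so that y has the
# same kind of 1-D shape as in the example above
if y.shape[1] == 1:
    y = y.ravel()

print(X.shape, y.shape)
```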
Perform a train/test split, also outputting NumPy arrays::
    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((7633, 4), (7633,), (1935, 4), (1935,))
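Note that the shapes above are not an exact 80/20 cut: 1935 of 9568 rows is about 20.2%. That is what one would see if each row were assigned to the test set independently with probability `test_size`. Whether `BaseDataset.split` works this way is an assumption; the sketch below only shows that technique, with a hypothetical helper `mask_split` and placeholder arrays of zeros, and it will not necessarily reproduce the exact counts shown above.

```python
import numpy as np


def mask_split(X, y, test_size=0.2, random_seed=None):
    """Assign each row to the test set independently with probability test_size."""
    rng = np.random.default_rng(random_seed)
    test_mask = rng.random(len(X)) < test_size
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]


# Placeholder arrays with the same shapes as the data set; the values are zeros
X = np.zeros((9568, 4))
y = np.zeros(9568)

X_train, X_test, y_train, y_test = mask_split(X, y, test_size=0.2, random_seed=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

With this approach the test fraction only approximates `test_size`, whereas a permutation-based split would yield exact sizes.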
Output the underlying Pandas DataFrame::
    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>