doubt.datasets.concrete
Concrete data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
"""Concrete data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class Concrete(BaseDataset):
    __doc__ = f"""
    Concrete is the most important material in civil engineering. The concrete
    compressive strength is a highly nonlinear function of age and ingredients.

    {BASE_DATASET_DESCRIPTION}

    Features:
        Cement (float):
            Kg of cement in an m3 mixture
        Blast Furnace Slag (float):
            Kg of blast furnace slag in an m3 mixture
        Fly Ash (float):
            Kg of fly ash in an m3 mixture
        Water (float):
            Kg of water in an m3 mixture
        Superplasticiser (float):
            Kg of superplasticiser in an m3 mixture
        Coarse Aggregate (float):
            Kg of coarse aggregate in an m3 mixture
        Fine Aggregate (float):
            Kg of fine aggregate in an m3 mixture
        Age (int):
            Age in days, between 1 and 365 inclusive

    Targets:
        Concrete Compressive Strength (float):
            Concrete compressive strength in megapascals

    Source:
        https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

    Examples:
        Load in the data set::

            >>> dataset = Concrete()
            >>> dataset.shape
            (1030, 9)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((1030, 8), (1030,))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((807, 8), (807,), (223, 8), (223,))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "concrete/compressive/Concrete_Data.xls"
    )

    _features = range(8)
    _targets = [8]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """

        # Convert the bytes into a file-like object
        xls_file = io.BytesIO(data)

        # Load the file-like object into a data frame
        df = pd.read_excel(xls_file)
        return df
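The `_prep_data` method above wraps the downloaded bytes in `io.BytesIO` so that a parser expecting a file can read them straight from memory. The following is a minimal sketch of that same pattern, not the library's code: it swaps `pd.read_excel` for the standard-library `csv` module so it runs without an Excel engine, and the `raw` payload and its values are hypothetical, made up for illustration.

```python
import csv
import io

# Hypothetical payload standing in for downloaded file contents.
# The numbers are illustrative only, not taken from the data set.
raw = b"Cement,Water,Age\n540.0,162.0,28\n332.5,228.0,270\n"

# Wrap the bytes in an in-memory binary stream, as `_prep_data` does.
binary_file = io.BytesIO(raw)

# The csv module reads text, so decode the binary stream on the fly.
text_file = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")

# Parse the stream exactly as if it were a file opened from disk.
rows = list(csv.DictReader(text_file))
print(len(rows), rows[0]["Cement"])
```

Nothing is written to disk here: the parser only needs an object with file-like read methods, which is what `io.BytesIO` provides.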
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.
Arguments:
- cache (str or None, optional): The name of the cache, which is saved under that name in the current working directory. If None, no cache is saved. Defaults to '.dataset_cache'.
Attributes:
- cache (str or None): The name of the cache.
- shape (tuple of integers): Dimensions of the data set
- columns (list of strings): List of column names in the data set
Features:
Cement (float): Kg of cement in an m3 mixture
Blast Furnace Slag (float): Kg of blast furnace slag in an m3 mixture
Fly Ash (float): Kg of fly ash in an m3 mixture
Water (float): Kg of water in an m3 mixture
Superplasticiser (float): Kg of superplasticiser in an m3 mixture
Coarse Aggregate (float): Kg of coarse aggregate in an m3 mixture
Fine Aggregate (float): Kg of fine aggregate in an m3 mixture
Age (int): Age in days, between 1 and 365 inclusive
Targets:
Concrete Compressive Strength (float): Concrete compressive strength in megapascals
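In the class source, the eight features and the single target are declared as `_features = range(8)` and `_targets = [8]`. How the base class consumes these attributes is not shown here, so the sketch below assumes they are positional column indices, which is at least consistent with the documented shapes `(1030, 8)` and `(1030,)`. The two rows are invented purely for illustration.

```python
# Illustrative rows only; the values are made up, not taken from the data set.
# Column order follows the Features list, with the target in the last column.
rows = [
    [300.0, 0.0, 0.0, 180.0, 0.0, 1000.0, 700.0, 28, 35.0],
    [250.0, 100.0, 50.0, 170.0, 5.0, 950.0, 750.0, 90, 45.0],
]

features = range(8)  # column positions 0..7, mirroring `_features`
targets = [8]        # column position 8, mirroring `_targets`

# Select the feature columns for every row.
X = [[row[i] for i in features] for row in rows]

# With a single target column, keep one value per row (a 1-D target).
y = [row[targets[0]] for row in rows]

print(len(X), len(X[0]), len(y))  # prints: 2 8 2
```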
Source:
https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
Examples:
Load in the data set::
    >>> dataset = Concrete()
    >>> dataset.shape
    (1030, 9)
Split the data set into features and targets, as NumPy arrays::
    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((1030, 8), (1030,))
Perform a train/test split, also outputting NumPy arrays::
    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((807, 8), (807,), (223, 8), (223,))
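The sizes in the output above, 807 and 223, sum to 1030 but are not an exact 80/20 split. One explanation, offered only as an assumption about the implementation, is that each row is assigned to the test set independently with probability `test_size`. The sketch below illustrates that idea with the standard-library `random` module. The helper `bernoulli_split` is hypothetical and not part of the library, and because it uses a different random number generator it is not expected to reproduce the 807/223 counts.

```python
import random


def bernoulli_split(n_rows, test_size, random_seed):
    """Assign each row index to the test set with probability `test_size`.

    Hypothetical helper for illustration; not the library's `split` method.
    """
    rng = random.Random(random_seed)  # seeded, so the split is repeatable
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < test_size:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx


train_idx, test_idx = bernoulli_split(1030, test_size=0.2, random_seed=42)

# Every row lands in exactly one of the two sets.
print(len(train_idx) + len(test_idx))  # prints: 1030
```

Under this scheme the test fraction is only approximately `test_size`, while a fixed seed still makes the split reproducible from run to run.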
Output the underlying Pandas DataFrame::
    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>