doubt.datasets.concrete

Concrete data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

"""Concrete data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class Concrete(BaseDataset):
    __doc__ = f"""
    Concrete is the most important material in civil engineering. The concrete
    compressive strength is a highly nonlinear function of age and ingredients.

    {BASE_DATASET_DESCRIPTION}

    Features:
        Cement (float):
            Kg of cement in an m3 mixture
        Blast Furnace Slag (float):
            Kg of blast furnace slag in an m3 mixture
        Fly Ash (float):
            Kg of fly ash in an m3 mixture
        Water (float):
            Kg of water in an m3 mixture
        Superplasticiser (float):
            Kg of superplasticiser in an m3 mixture
        Coarse Aggregate (float):
            Kg of coarse aggregate in an m3 mixture
        Fine Aggregate (float):
            Kg of fine aggregate in an m3 mixture
        Age (int):
            Age in days, between 1 and 365 inclusive

    Targets:
        Concrete Compressive Strength (float):
            Concrete compressive strength in megapascals

    Source:
        https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

    Examples:
        Load in the data set::

            >>> dataset = Concrete()
            >>> dataset.shape
            (1030, 9)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((1030, 8), (1030,))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((807, 8), (807,), (223, 8), (223,))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "concrete/compressive/Concrete_Data.xls"
    )

    _features = range(8)
    _targets = [8]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """

        # Convert the bytes into a file-like object
        xls_file = io.BytesIO(data)

        # Load the file-like object into a data frame
        df = pd.read_excel(xls_file)
        return df
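
The `_prep_data` method above wraps the downloaded bytes in a file-like object before handing them to pandas. A minimal sketch of the same pattern follows. It swaps in CSV bytes and `pd.read_csv` for the Excel file so that it runs without an Excel engine installed; the sample values and the `prep_csv_bytes` helper are made up for illustration and are not part of the library.

```python
import io

import pandas as pd


def prep_csv_bytes(data: bytes) -> pd.DataFrame:
    """Hypothetical helper: parse raw CSV bytes into a data frame."""
    # Wrap the bytes in an in-memory, file-like buffer
    buffer = io.BytesIO(data)
    # pandas readers accept file-like objects as well as file paths
    return pd.read_csv(buffer)


# Made-up payload standing in for a downloaded file
raw = b"Cement,Water,Age\n540.0,162.0,28\n332.5,228.0,270\n"
df = prep_csv_bytes(raw)
print(df.shape)  # (2, 3)
```

Keeping the payload in memory avoids writing a temporary file just to parse it.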
class Concrete(doubt.datasets.dataset.BaseDataset):

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

Arguments:
  • cache (str or None, optional): The name of the cache, which is saved under that name in the current working directory. If None, no cache is saved. Defaults to '.dataset_cache'.
Attributes:
  • cache (str or None): The name of the cache.
  • shape (tuple of integers): Dimensions of the data set
  • columns (list of strings): List of column names in the data set
Features:

  • Cement (float): Kg of cement in an m3 mixture
  • Blast Furnace Slag (float): Kg of blast furnace slag in an m3 mixture
  • Fly Ash (float): Kg of fly ash in an m3 mixture
  • Water (float): Kg of water in an m3 mixture
  • Superplasticiser (float): Kg of superplasticiser in an m3 mixture
  • Coarse Aggregate (float): Kg of coarse aggregate in an m3 mixture
  • Fine Aggregate (float): Kg of fine aggregate in an m3 mixture
  • Age (int): Age in days, between 1 and 365 inclusive

Targets:

  • Concrete Compressive Strength (float): Concrete compressive strength in megapascals

Source:

https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

Examples:

Load in the data set:

>>> dataset = Concrete()
>>> dataset.shape
(1030, 9)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((1030, 8), (1030,))
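
The shapes above match the class attributes `_features = range(8)` and `_targets = [8]`: the first eight columns are features and the ninth is the target. The sketch below shows one way such index lists could select columns from a 9-column array. The array contents are synthetic, and flattening the single target column to one dimension is an assumption inferred from the `(1030,)` shape, not a description of the library's internals.

```python
import numpy as np

# Synthetic stand-in for the data: 5 rows, 9 columns (values are made up)
data = np.arange(45, dtype=float).reshape(5, 9)

feature_idxs = list(range(8))  # mirrors `_features = range(8)`
target_idxs = [8]              # mirrors `_targets = [8]`

# Select the feature columns as a 2-D array
X = data[:, feature_idxs]
# Select the single target column and flatten it to 1-D
y = data[:, target_idxs].squeeze(axis=1)

print(X.shape, y.shape)  # (5, 8) (5,)
```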

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((807, 8), (807,), (223, 8), (223,))
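
The 807/223 split above is not an exact 80/20 partition of 1030 rows (that would be 824/206), which is consistent with each row being assigned to the test set independently at random. The implementation of `split` is not shown on this page, so the following is only a sketch of that idea under that assumption, using placeholder arrays; it is not expected to reproduce the exact counts above.

```python
import numpy as np

n_rows, test_size, random_seed = 1030, 0.2, 42

rng = np.random.default_rng(random_seed)
# Each row lands in the test set independently with probability `test_size`
test_mask = rng.random(n_rows) < test_size

# Placeholder features and targets with the documented shapes
X = np.zeros((n_rows, 8))
y = np.zeros(n_rows)

X_train, X_test = X[~test_mask], X[test_mask]
y_train, y_test = y[~test_mask], y[test_mask]

# The two parts always cover every row exactly once
print(len(X_train) + len(X_test))  # 1030
```

With this scheme the test fraction is only approximately `test_size`, which is the trade-off for not needing to shuffle or index the rows explicitly.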

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>