doubt.datasets.fish_toxicity

Fish toxicity data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

"""Fish toxicity data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class FishToxicity(BaseDataset):
    __doc__ = f"""
    This dataset was used to develop quantitative regression QSAR models to predict
    acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a
    set of 908 chemicals. LC50 data, which is the concentration that causes death in
    50% of test fish over a test duration of 96 hours, was used as the model response.

    {BASE_DATASET_DESCRIPTION}

    Features:
        CIC0 (float):
            Information indices
        SM1_Dz(Z) (float):
            2D matrix-based descriptors
        GATS1i (float):
            2D autocorrelations
        NdsCH (int):
            Atom-type counts
        NdssC (int):
            Atom-type counts
        MLOGP (float):
            Molecular properties

    Targets:
        LC50 (float):
            A concentration that causes death in 50% of test fish over a test duration
            of 96 hours. In -log(mol/L) units.

    Source:
        https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity

    Examples:
        Load in the data set::

            >>> dataset = FishToxicity()
            >>> dataset.shape
            (908, 7)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((908, 6), (908,))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((708, 6), (708,), (200, 6), (200,))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "00504/qsar_fish_toxicity.csv"
    )

    # Column positions of the features and the target in the prepared dataframe
    _features = range(6)
    _targets = [6]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """
        # Convert the bytes into a file-like object
        csv_file = io.BytesIO(data)

        # Read the file-like object into a dataframe; the file is
        # semicolon-separated and has no header row
        cols = ["CIC0", "SM1_Dz(Z)", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50"]
        df = pd.read_csv(csv_file, sep=";", header=None, names=cols)

        return df
class FishToxicity(doubt.datasets.dataset.BaseDataset):

This dataset was used to develop quantitative regression QSAR models to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. LC50 data, which is the concentration that causes death in 50% of test fish over a test duration of 96 hours, was used as the model response.

Arguments:
  • cache (str or None, optional): The name of the cache, which is saved under that name in the current working directory. If None, no cache is saved. Defaults to '.dataset_cache'.
Attributes:
  • cache (str or None): The name of the cache.
  • shape (tuple of integers): Dimensions of the data set
  • columns (list of strings): List of column names in the data set
Features:

  • CIC0 (float): Information indices
  • SM1_Dz(Z) (float): 2D matrix-based descriptors
  • GATS1i (float): 2D autocorrelations
  • NdsCH (int): Atom-type counts
  • NdssC (int): Atom-type counts
  • MLOGP (float): Molecular properties

Targets:

LC50 (float): A concentration that causes death in 50% of test fish over a test duration of 96 hours. In -log(mol/L) units.
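
Because the target is stored on a negative logarithmic scale, a larger LC50 value means a lower lethal concentration, that is, a more toxic chemical. The sketch below converts between the two scales. It assumes the logarithm is base 10, which is the usual QSAR convention but is not stated in the description above, and the helper names are hypothetical rather than part of doubt.

```python
import numpy as np


def neg_log_to_molar(lc50_neg_log):
    """Convert LC50 from -log10(mol/L) to mol/L (assumes base-10 logs)."""
    return np.power(10.0, -np.asarray(lc50_neg_log, dtype=float))


def molar_to_neg_log(lc50_molar):
    """Convert LC50 from mol/L back to -log10(mol/L)."""
    return -np.log10(np.asarray(lc50_molar, dtype=float))


# Illustrative values only, not taken from the data set
values = np.array([3.0, 4.0, 5.0])
molar = neg_log_to_molar(values)
```

Under that assumption, a one-unit increase in the target corresponds to a tenfold decrease in the lethal concentration.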

Source:

https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity

Examples:

Load in the data set:

>>> dataset = FishToxicity()
>>> dataset.shape
(908, 7)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((908, 6), (908,))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((708, 6), (708,), (200, 6), (200,))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
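
The parsing done in `_prep_data` can be tried without downloading anything. The following is a minimal sketch that applies the same `pd.read_csv` call to a couple of made-up rows in the same layout (semicolon-separated, no header line). The row values are invented for illustration, and the positional split at the end only mirrors `_features = range(6)` and `_targets = [6]`; how `split()` uses those attributes internally is an assumption here.

```python
import io

import pandas as pd

# Column names as listed in `_prep_data`
COLS = ["CIC0", "SM1_Dz(Z)", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50"]

# Made-up rows in the same layout: semicolon-separated, no header line
raw = b"3.1;0.8;1.2;0;1;2.5;3.9\n2.4;0.6;1.0;1;0;1.8;4.2\n"

# Same parsing steps as `_prep_data`: wrap the bytes, then read the CSV
df = pd.read_csv(io.BytesIO(raw), sep=";", header=None, names=COLS)

# Positional split mirroring `_features = range(6)` and `_targets = [6]`
# (an assumption about how the indices are meant to be used)
X = df.iloc[:, list(range(6))].to_numpy()
y = df.iloc[:, [6]].to_numpy().squeeze()
```

Passing `header=None` together with `names=COLS` matters here: without it, pandas would treat the first data row as the header and silently drop one sample.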