doubt.datasets.fish_toxicity
Fish toxicity data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
"""Fish toxicity data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class FishToxicity(BaseDataset):
    __doc__ = f"""
    This dataset was used to develop quantitative regression QSAR models to predict
    acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a
    set of 908 chemicals. LC50 data, which is the concentration that causes death in
    50% of test fish over a test duration of 96 hours, was used as model response.

    {BASE_DATASET_DESCRIPTION}

    Features:
        CIC0 (float):
            Information indices
        SM1_Dz(Z) (float):
            2D matrix-based descriptors
        GATS1i (float):
            2D autocorrelations
        NdsCH (int):
            Atom-type counts
        NdssC (int):
            Atom-type counts
        MLOGP (float):
            Molecular properties

    Targets:
        LC50 (float):
            A concentration that causes death in 50% of test fish over a test duration
            of 96 hours. In -log(mol/L) units.

    Source:
        https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity

    Examples:
        Load in the data set::

            >>> dataset = FishToxicity()
            >>> dataset.shape
            (908, 7)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((908, 6), (908,))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((708, 6), (708,), (200, 6), (200,))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "00504/qsar_fish_toxicity.csv"
    )

    _features = range(6)
    _targets = [6]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """
        # Convert the bytes into a file-like object
        csv_file = io.BytesIO(data)

        # Read the file-like object into a dataframe
        cols = ["CIC0", "SM1_Dz(Z)", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50"]
        df = pd.read_csv(csv_file, sep=";", header=None, names=cols)

        return df
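The parsing step in `_prep_data` can be tried without downloading anything. The following is a minimal sketch, assuming the raw file is semicolon-separated with no header row, as the `sep=";"` and `header=None` arguments in the source suggest. The two data rows are invented for illustration and are not taken from the real data set.

```python
import io

import pandas as pd

# Two invented rows in the assumed layout: semicolon-separated, no header line.
raw = b"1.0;0.5;1.2;0;1;2.0;3.5\n2.0;0.75;0.9;1;0;1.5;4.25\n"

# Column names as listed in the module source.
cols = ["CIC0", "SM1_Dz(Z)", "GATS1i", "NdsCH", "NdssC", "MLOGP", "LC50"]

# Wrap the bytes in a file-like object and parse them, mirroring _prep_data.
df = pd.read_csv(io.BytesIO(raw), sep=";", header=None, names=cols)

# Positions 0-5 are the features and position 6 is the target,
# matching _features = range(6) and _targets = [6].
features = df.iloc[:, list(range(6))]
target = df.iloc[:, 6]

print(df.dtypes)
print(features.shape, target.shape)
```

Because the names are supplied explicitly, the first line of the file is treated as data rather than as a header.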
This dataset was used to develop quantitative regression QSAR models to predict acute aquatic toxicity towards the fish Pimephales promelas (fathead minnow) on a set of 908 chemicals. LC50 data, which is the concentration that causes death in 50% of test fish over a test duration of 96 hours, was used as model response.
Arguments:
- cache (str or None, optional): The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
- cache (str or None): The name of the cache.
- shape (tuple of integers): Dimensions of the data set
- columns (list of strings): List of column names in the data set
Features:
- CIC0 (float): Information indices
- SM1_Dz(Z) (float): 2D matrix-based descriptors
- GATS1i (float): 2D autocorrelations
- NdsCH (int): Atom-type counts
- NdssC (int): Atom-type counts
- MLOGP (float): Molecular properties
Targets:
LC50 (float): A concentration that causes death in 50% of test fish over a test duration of 96 hours. In -log(mol/L) units.
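Since the target is stored as a negative logarithm, larger values mean a lower lethal concentration, that is, a more toxic chemical. The sketch below converts between the stored value and a molar concentration. It assumes the logarithm is base 10, which is the usual convention for this kind of endpoint but is not stated on this page; the helper names are hypothetical and not part of doubt.

```python
import math


def lc50_to_molar(neg_log_lc50: float) -> float:
    """Convert a value in -log10(mol/L) to a concentration in mol/L.

    Assumes a base-10 logarithm (not confirmed by the data set description).
    """
    return 10.0 ** (-neg_log_lc50)


def molar_to_lc50(concentration: float) -> float:
    """Convert a concentration in mol/L back to -log10(mol/L)."""
    return -math.log10(concentration)


# A stored value of 3.0 corresponds to roughly one millimole per litre,
# and a stored value of 4.0 to a concentration ten times lower.
print(lc50_to_molar(3.0), lc50_to_molar(4.0))
```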
Source:
https://archive.ics.uci.edu/ml/datasets/QSAR+fish+toxicity
Examples:
Load in the data set::
    >>> dataset = FishToxicity()
    >>> dataset.shape
    (908, 7)
Split the data set into features and targets, as NumPy arrays::
    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((908, 6), (908,))
Perform a train/test split, also outputting NumPy arrays::
    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((708, 6), (708,), (200, 6), (200,))
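Note that the test set above has 200 rows, whereas exactly 20% of 908 would be about 182. One explanation, and it is only an assumption about how `split` works internally, is that each row is assigned to the test set independently with probability `test_size`, so the test fraction is approximate. The sketch below illustrates that idea using only the standard library; `bernoulli_split` is a hypothetical helper, and it is not expected to reproduce the exact 708/200 split shown above.

```python
import random


def bernoulli_split(n_rows, test_size, random_seed=None):
    """Assign each row index to train or test independently.

    Hypothetical illustration: each row lands in the test set with
    probability test_size, so the split sizes vary around the target.
    """
    rng = random.Random(random_seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < test_size:
            test_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, test_idx


train_idx, test_idx = bernoulli_split(908, test_size=0.2, random_seed=42)

# The two parts always cover all rows, but the test share is only near 20%.
print(len(train_idx), len(test_idx))
```

If an exact test fraction matters, shuffling the row indices and slicing off a fixed number of them gives a deterministic split size.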
Output the underlying Pandas DataFrame::
    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>