doubt.datasets.nanotube

Nanotube data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

"""Nanotube data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class Nanotube(BaseDataset):
    __doc__ = f"""
    CASTEP can simulate a wide range of properties of materials proprieties using
    density functional theory (DFT). DFT is the most successful method calculates
    atomic coordinates faster than other mathematical approaches, and it also reaches
    more accurate results. The dataset is generated with CASTEP using CNT geometry
    optimization. Many CNTs are simulated in CASTEP, then geometry optimizations are
    calculated. Initial coordinates of all carbon atoms are generated randomly.
    Different chiral vectors are used for each CNT simulation.

    The atom type is selected as carbon, bond length is used as 1.42 A° (default
    value). CNT calculation parameters are used as default parameters. To finalize the
    computation, CASTEP uses a parameter named as elec_energy_tol (electrical energy
    tolerance) (default 1x10-5 eV) which represents that the change in the total energy
    from one iteration to the next remains below some tolerance value per atom for a
    few self-consistent field steps. Initial atomic coordinates (u, v, w), chiral
    vector (n, m) and calculated atomic coordinates (u, v, w) are obtained from the
    output files.

    {BASE_DATASET_DESCRIPTION}

    Features:
        Chiral indice n (int):
            n parameter of the selected chiral vector
        Chiral indice m (int):
            m parameter of the selected chiral vector
        Initial atomic coordinate u (float):
            Randomly generated u parameter of the initial atomic coordinates
            of all carbon atoms.
        Initial atomic coordinate v (float):
            Randomly generated v parameter of the initial atomic coordinates
            of all carbon atoms.
        Initial atomic coordinate w (float):
            Randomly generated w parameter of the initial atomic coordinates
            of all carbon atoms.

    Targets:
        Calculated atomic coordinates u (float):
            Calculated u parameter of the atomic coordinates of all
            carbon atoms
        Calculated atomic coordinates v (float):
            Calculated v parameter of the atomic coordinates of all
            carbon atoms
        Calculated atomic coordinates w (float):
            Calculated w parameter of the atomic coordinates of all
            carbon atoms

    Sources:
        https://archive.ics.uci.edu/ml/datasets/Carbon+Nanotubes
        https://doi.org/10.1007/s00339-016-0153-1
        https://doi.org/10.17341/gazimmfd.337642

    Examples:
        Load in the data set::

            >>> dataset = Nanotube()
            >>> dataset.shape
            (10721, 8)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((10721, 5), (10721, 3))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((8541, 5), (8541, 3), (2180, 5), (2180, 3))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "00448/carbon_nanotubes.csv"
    )

    _features = range(5)
    _targets = [5, 6, 7]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """
        # Convert the bytes into a file-like object
        csv_file = io.BytesIO(data)

        # Read the file-like object into a dataframe
        df = pd.read_csv(csv_file, sep=";", decimal=",")
        return df

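
The parsing step in `_prep_data` passes `sep=";"` and `decimal=","` to `pd.read_csv`, which suggests the raw file uses semicolons as field separators and commas as decimal marks. The sketch below shows the effect of those two arguments on a small in-memory sample. The column names and values are hypothetical stand-ins, not rows from the actual file.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed raw layout: semicolon-separated
# fields with comma decimal marks. Names and values are made up.
raw = (
    b"n;m;u_init;v_init;w_init;u_calc;v_calc;w_calc\n"
    b"2;1;0,25;0,5;0,75;0,125;0,375;0,625\n"
    b"3;1;0,5;0,25;0,125;0,75;0,625;0,375\n"
)

# Same call as in _prep_data: wrap the bytes in a file-like object first
df = pd.read_csv(io.BytesIO(raw), sep=";", decimal=",")

# Without decimal=",", the coordinate columns are not parsed as floats
naive = pd.read_csv(io.BytesIO(raw), sep=";")

print(df.dtypes)
```

With both arguments set, the two chiral index columns come out as integers and the six coordinate columns as floats, which matches the types listed in the class documentation.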
class Nanotube(doubt.datasets.dataset.BaseDataset):

CASTEP can simulate a wide range of properties of materials proprieties using density functional theory (DFT). DFT is the most successful method calculates atomic coordinates faster than other mathematical approaches, and it also reaches more accurate results. The dataset is generated with CASTEP using CNT geometry optimization. Many CNTs are simulated in CASTEP, then geometry optimizations are calculated. Initial coordinates of all carbon atoms are generated randomly. Different chiral vectors are used for each CNT simulation.

The atom type is selected as carbon, bond length is used as 1.42 A° (default value). CNT calculation parameters are used as default parameters. To finalize the computation, CASTEP uses a parameter named as elec_energy_tol (electrical energy tolerance) (default 1x10-5 eV) which represents that the change in the total energy from one iteration to the next remains below some tolerance value per atom for a few self-consistent field steps. Initial atomic coordinates (u, v, w), chiral vector (n, m) and calculated atomic coordinates (u, v, w) are obtained from the output files.

Arguments:
  • cache (str or None, optional): The name of the cache. The cache will be saved under this name in the current working directory. If None, no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
  • cache (str or None): The name of the cache.
  • shape (tuple of integers): Dimensions of the data set
  • columns (list of strings): List of column names in the data set
Features:

  • Chiral indice n (int): n parameter of the selected chiral vector
  • Chiral indice m (int): m parameter of the selected chiral vector
  • Initial atomic coordinate u (float): Randomly generated u parameter of the initial atomic coordinates of all carbon atoms.
  • Initial atomic coordinate v (float): Randomly generated v parameter of the initial atomic coordinates of all carbon atoms.
  • Initial atomic coordinate w (float): Randomly generated w parameter of the initial atomic coordinates of all carbon atoms.

Targets:

  • Calculated atomic coordinates u (float): Calculated u parameter of the atomic coordinates of all carbon atoms
  • Calculated atomic coordinates v (float): Calculated v parameter of the atomic coordinates of all carbon atoms
  • Calculated atomic coordinates w (float): Calculated w parameter of the atomic coordinates of all carbon atoms
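
The class source sets `_features = range(5)` and `_targets = [5, 6, 7]`, so the five features and three targets listed above appear to be selected by column position. A minimal sketch of that positional selection follows. The frame contents and column names are hypothetical, and the code illustrates the idea only, not the library's own implementation of `split()`.

```python
import pandas as pd

# Positional indices as given in the class source
feature_idx = list(range(5))
target_idx = [5, 6, 7]

# Hypothetical two-row frame with the same 8-column layout
df = pd.DataFrame(
    [
        [2, 1, 0.25, 0.50, 0.75, 0.125, 0.375, 0.625],
        [3, 1, 0.50, 0.25, 0.125, 0.75, 0.625, 0.375],
    ],
    columns=["n", "m", "u_init", "v_init", "w_init", "u_calc", "v_calc", "w_calc"],
)

# Select columns by position and convert to NumPy arrays
X = df.iloc[:, feature_idx].to_numpy()
y = df.iloc[:, target_idx].to_numpy()

print(X.shape, y.shape)  # (2, 5) (2, 3)
```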

Sources:

  • https://archive.ics.uci.edu/ml/datasets/Carbon+Nanotubes
  • https://doi.org/10.1007/s00339-016-0153-1
  • https://doi.org/10.17341/gazimmfd.337642

Examples:

Load in the data set:

>>> dataset = Nanotube()
>>> dataset.shape
(10721, 8)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((10721, 5), (10721, 3))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((8541, 5), (8541, 3), (2180, 5), (2180, 3))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
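
In the train/test example above, the test set has 2180 of 10721 rows, which is about 20.3% rather than exactly 20%. One mechanism that produces approximate shares like this is assigning each row to the test set independently with probability `test_size`. Whether `split()` works this way is an assumption, since this page does not say. The helper `bernoulli_split` below is hypothetical and uses synthetic arrays with the documented dimensions.

```python
import numpy as np


def bernoulli_split(X, y, test_size=0.2, random_seed=42):
    """Assign each row to the test set independently with probability test_size.

    Hypothetical helper for illustration; not part of the documented API.
    """
    rng = np.random.default_rng(random_seed)
    test_mask = rng.random(X.shape[0]) < test_size
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]


# Synthetic stand-in with the documented dimensions (not the real data)
n_rows = 10721
X = np.zeros((n_rows, 5))
y = np.zeros((n_rows, 3))

X_train, X_test, y_train, y_test = bernoulli_split(X, y)

# The two parts always add up to n_rows; the test share only approximates 0.2
print(X_train.shape[0] + X_test.shape[0])  # 10721
print(X_test.shape[0] / n_rows)
```

Under this scheme the test share varies with the seed, so the exact counts printed here need not match the documented ones.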