doubt.datasets.nanotube
Nanotube data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
"""Nanotube data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class Nanotube(BaseDataset):
    __doc__ = f"""
    CASTEP can simulate a wide range of properties of materials proprieties using
    density functional theory (DFT). DFT is the most successful method calculates
    atomic coordinates faster than other mathematical approaches, and it also reaches
    more accurate results. The dataset is generated with CASTEP using CNT geometry
    optimization. Many CNTs are simulated in CASTEP, then geometry optimizations are
    calculated. Initial coordinates of all carbon atoms are generated randomly.
    Different chiral vectors are used for each CNT simulation.

    The atom type is selected as carbon, bond length is used as 1.42 A° (default
    value). CNT calculation parameters are used as default parameters. To finalize the
    computation, CASTEP uses a parameter named as elec_energy_tol (electrical energy
    tolerance) (default 1x10-5 eV) which represents that the change in the total energy
    from one iteration to the next remains below some tolerance value per atom for a
    few self-consistent field steps. Initial atomic coordinates (u, v, w), chiral
    vector (n, m) and calculated atomic coordinates (u, v, w) are obtained from the
    output files.

    {BASE_DATASET_DESCRIPTION}

    Features:
        Chiral indice n (int):
            n parameter of the selected chiral vector
        Chiral indice m (int):
            m parameter of the selected chiral vector
        Initial atomic coordinate u (float):
            Randomly generated u parameter of the initial atomic coordinates
            of all carbon atoms.
        Initial atomic coordinate v (float):
            Randomly generated v parameter of the initial atomic coordinates
            of all carbon atoms.
        Initial atomic coordinate w (float):
            Randomly generated w parameter of the initial atomic coordinates
            of all carbon atoms.

    Targets:
        Calculated atomic coordinates u (float):
            Calculated u parameter of the atomic coordinates of all
            carbon atoms
        Calculated atomic coordinates v (float):
            Calculated v parameter of the atomic coordinates of all
            carbon atoms
        Calculated atomic coordinates w (float):
            Calculated w parameter of the atomic coordinates of all
            carbon atoms

    Sources:
        https://archive.ics.uci.edu/ml/datasets/Carbon+Nanotubes
        https://doi.org/10.1007/s00339-016-0153-1
        https://doi.org/10.17341/gazimmfd.337642

    Examples:
        Load in the data set::

            >>> dataset = Nanotube()
            >>> dataset.shape
            (10721, 8)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((10721, 5), (10721, 3))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((8541, 5), (8541, 3), (2180, 5), (2180, 3))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "00448/carbon_nanotubes.csv"
    )

    _features = range(5)
    _targets = [5, 6, 7]

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """
        # Convert the bytes into a file-like object
        csv_file = io.BytesIO(data)

        # Read the file-like object into a dataframe
        df = pd.read_csv(csv_file, sep=";", decimal=",")
        return df
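The parsing step in _prep_data relies on the raw file using semicolons as field separators and commas as decimal marks. The following is a minimal sketch of that call on a small in-memory sample; the rows are made up for illustration and are not taken from the real file, and only three of the eight columns are included.

```python
import io

import pandas as pd

# Made-up sample in the layout the loader expects: semicolon-separated
# fields with decimal commas. The values are illustrative, not real rows.
raw = (
    b"Chiral indice n;Chiral indice m;Initial atomic coordinate u\n"
    b"2;1;0,679005\n"
    b"2;1;0,717298\n"
)

# Same call as in _prep_data: wrap the bytes in a file-like object, then parse
df = pd.read_csv(io.BytesIO(raw), sep=";", decimal=",")

# The chiral indices come out as integers and the coordinate as a float
print(df.dtypes)
print(df["Initial atomic coordinate u"].tolist())
```

Without decimal=",", pandas would most likely read the coordinate column as strings, which is why the argument matters for this file.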
CASTEP can simulate a wide range of properties of materials proprieties using density functional theory (DFT). DFT is the most successful method calculates atomic coordinates faster than other mathematical approaches, and it also reaches more accurate results. The dataset is generated with CASTEP using CNT geometry optimization. Many CNTs are simulated in CASTEP, then geometry optimizations are calculated. Initial coordinates of all carbon atoms are generated randomly. Different chiral vectors are used for each CNT simulation.
The atom type is selected as carbon, bond length is used as 1.42 A° (default value). CNT calculation parameters are used as default parameters. To finalize the computation, CASTEP uses a parameter named as elec_energy_tol (electrical energy tolerance) (default 1x10-5 eV) which represents that the change in the total energy from one iteration to the next remains below some tolerance value per atom for a few self-consistent field steps. Initial atomic coordinates (u, v, w), chiral vector (n, m) and calculated atomic coordinates (u, v, w) are obtained from the output files.
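The stopping rule described above can be pictured with a small sketch. This is not CASTEP's implementation: the function name is_converged, the window of three steps (standing in for "a few" self-consistent field steps) and the energy values are all assumptions chosen for illustration. Only the 1e-5 eV per-atom tolerance comes from the description.

```python
from typing import Sequence


def is_converged(
    total_energies: Sequence[float],
    n_atoms: int,
    tol_per_atom: float = 1e-5,
    window: int = 3,
) -> bool:
    """Return True if the last `window` energy changes are all below tolerance.

    Illustrative only: `window` is an assumed value for the "few" SCF steps
    mentioned in the description.
    """
    # Need window + 1 energies to form `window` consecutive differences
    if len(total_energies) < window + 1:
        return False
    recent = total_energies[-(window + 1):]
    # Per-atom change in total energy between consecutive iterations
    changes = [abs(b - a) / n_atoms for a, b in zip(recent, recent[1:])]
    return all(change < tol_per_atom for change in changes)


# Made-up total energies (eV) for a hypothetical 2-atom system
energies = [-310.0, -310.4, -310.41, -310.41001, -310.410012, -310.4100121]
print(is_converged(energies, n_atoms=2))
print(is_converged(energies[:4], n_atoms=2))
```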
Arguments:
- cache (str or None, optional): The name of the cache. It will be saved to `cache` in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
- cache (str or None): The name of the cache.
- shape (tuple of integers): Dimensions of the data set
- columns (list of strings): List of column names in the data set
Features:
- Chiral indice n (int): n parameter of the selected chiral vector
- Chiral indice m (int): m parameter of the selected chiral vector
- Initial atomic coordinate u (float): Randomly generated u parameter of the initial atomic coordinates of all carbon atoms.
- Initial atomic coordinate v (float): Randomly generated v parameter of the initial atomic coordinates of all carbon atoms.
- Initial atomic coordinate w (float): Randomly generated w parameter of the initial atomic coordinates of all carbon atoms.
Targets:
- Calculated atomic coordinates u (float): Calculated u parameter of the atomic coordinates of all carbon atoms
- Calculated atomic coordinates v (float): Calculated v parameter of the atomic coordinates of all carbon atoms
- Calculated atomic coordinates w (float): Calculated w parameter of the atomic coordinates of all carbon atoms
Sources:
- https://archive.ics.uci.edu/ml/datasets/Carbon+Nanotubes
- https://doi.org/10.1007/s00339-016-0153-1
- https://doi.org/10.17341/gazimmfd.337642
Examples:
Load in the data set::
    >>> dataset = Nanotube()
    >>> dataset.shape
    (10721, 8)
Split the data set into features and targets, as NumPy arrays::
    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((10721, 5), (10721, 3))
Perform a train/test split, also outputting NumPy arrays::
    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((8541, 5), (8541, 3), (2180, 5), (2180, 3))
Output the underlying Pandas DataFrame::
    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>
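The (n, 5) and (n, 3) shapes in the examples above are consistent with the class attributes _features = range(5) and _targets = [5, 6, 7] being positional column indices. How BaseDataset.split actually uses them is not shown here, so the following is only a sketch under that assumption. It uses a small placeholder frame with random values instead of downloading the real data.

```python
import numpy as np
import pandas as pd

columns = [
    "Chiral indice n",
    "Chiral indice m",
    "Initial atomic coordinate u",
    "Initial atomic coordinate v",
    "Initial atomic coordinate w",
    "Calculated atomic coordinates u",
    "Calculated atomic coordinates v",
    "Calculated atomic coordinates w",
]

# Placeholder values; only the column layout matters for this sketch
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((4, len(columns))), columns=columns)

feature_idx = list(range(5))  # mirrors _features = range(5)
target_idx = [5, 6, 7]  # mirrors _targets = [5, 6, 7]

# Select columns by position and convert to NumPy arrays
X = df.iloc[:, feature_idx].to_numpy()
y = df.iloc[:, target_idx].to_numpy()

print(X.shape, y.shape)
```

With the real data set the same selection would yield the five chiral and initial-coordinate columns as X and the three calculated coordinates as y.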