doubt.datasets.solar_flare
Solar flare data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
    """Solar flare data set.

    This data set is from the UCI data set archive, with the description being the original
    description verbatim. Some feature names may have been altered, based on the
    description.
    """

    import io

    import pandas as pd

    from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


    class SolarFlare(BaseDataset):
        __doc__ = f"""
        Each class attribute counts the number of solar flares of a certain class that
        occur in a 24 hour period.

        The database contains 3 potential classes, one for the number of times a certain
        type of solar flare occurred in a 24 hour period.

        Each instance represents captured features for 1 active region on the sun.

        The data are divided into two sections. The second section (flare.data2) has had
        much more error correction applied to it, and has consequently been treated as
        more reliable.

        {BASE_DATASET_DESCRIPTION}

        Features:
            class (int):
                Code for class (modified Zurich class). Ranges from 0 to 6 inclusive
            spot_size (int):
                Code for largest spot size. Ranges from 0 to 5 inclusive
            spot_distr (int):
                Code for spot distribution. Ranges from 0 to 3 inclusive
            activity (int):
                Binary feature indicating 1 = reduced and 2 = unchanged
            evolution (int):
                0 = decay, 1 = no growth and 2 = growth
            flare_activity (int):
                Previous 24 hour flare activity code, where 0 = nothing as big as an M1, 1
                = one M1 and 2 = more activity than one M1
            is_complex (int):
                Binary feature indicating historically complex
            became_complex (int):
                Binary feature indicating whether the region became historically complex on
                this pass across the sun's disk
            large (int):
                Binary feature, indicating whether area is large
            large_spot (int):
                Binary feature, indicating whether the area of the largest spot is greater
                than 5

        Targets:
            C-class (int):
                C-class flares production by this region in the following 24 hours (common
                flares)
            M-class (int):
                M-class flares production by this region in the following 24 hours
                (moderate flares)
            X-class (int):
                X-class flares production by this region in the following 24 hours (severe
                flares)

        Source:
            https://archive.ics.uci.edu/ml/datasets/Solar+Flare

        Examples:
            Load in the data set::

                >>> dataset = SolarFlare()
                >>> dataset.shape
                (1066, 13)

            Split the data set into features and targets, as NumPy arrays::

                >>> X, y = dataset.split()
                >>> X.shape, y.shape
                ((1066, 10), (1066, 3))

            Perform a train/test split, also outputting NumPy arrays::

                >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
                >>> X_train, X_test, y_train, y_test = train_test_split
                >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
                ((837, 10), (837, 3), (229, 10), (229, 3))

            Output the underlying Pandas DataFrame::

                >>> df = dataset.to_pandas()
                >>> type(df)
                <class 'pandas.core.frame.DataFrame'>
        """

        _url = (
            "https://archive.ics.uci.edu/ml/machine-learning-databases/"
            "solar-flare/flare.data2"
        )

        _features = range(10)
        _targets = range(10, 13)

        def _prep_data(self, data: bytes) -> pd.DataFrame:
            """Prepare the data set.

            Args:
                data (bytes): The raw data

            Returns:
                Pandas dataframe: The prepared data
            """
            # Convert the bytes into a file-like object
            csv_file = io.BytesIO(data)

            # Load in dataframe
            cols = [
                "class",
                "spot_size",
                "spot_distr",
                "activity",
                "evolution",
                "flare_activity",
                "is_complex",
                "became_complex",
                "large",
                "large_spot",
                "C-class",
                "M-class",
                "X-class",
            ]
            df = pd.read_csv(csv_file, sep=" ", skiprows=[0], names=cols)

            # Encode class
            encodings = ["A", "B", "C", "D", "E", "F", "H"]
            df["class"] = df["class"].map(lambda x: encodings.index(x))

            # Encode spot size
            encodings = ["X", "R", "S", "A", "H", "K"]
            df["spot_size"] = df.spot_size.map(lambda x: encodings.index(x))

            # Encode spot distribution
            encodings = ["X", "O", "I", "C"]
            df["spot_distr"] = df.spot_distr.map(lambda x: encodings.index(x))

            return df
Each class attribute counts the number of solar flares of a certain class that occur in a 24-hour period.
The database contains three potential classes, each counting the number of times a certain type of solar flare occurred in a 24-hour period.
Each instance represents captured features for one active region on the sun.
The data are divided into two sections. The second section (flare.data2) has had much more error correction applied to it, and has consequently been treated as more reliable.
Arguments:
- cache (str or None, optional): The name of the cache. It will be saved under this name in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
- cache (str or None): The name of the cache.
- shape (tuple of integers): Dimensions of the data set
- columns (list of strings): List of column names in the data set
Features:
- class (int): Code for class (modified Zurich class). Ranges from 0 to 6 inclusive.
- spot_size (int): Code for largest spot size. Ranges from 0 to 5 inclusive.
- spot_distr (int): Code for spot distribution. Ranges from 0 to 3 inclusive.
- activity (int): Binary feature indicating 1 = reduced and 2 = unchanged.
- evolution (int): 0 = decay, 1 = no growth and 2 = growth.
- flare_activity (int): Previous 24 hour flare activity code, where 0 = nothing as big as an M1, 1 = one M1 and 2 = more activity than one M1.
- is_complex (int): Binary feature indicating historically complex.
- became_complex (int): Binary feature indicating whether the region became historically complex on this pass across the sun's disk.
- large (int): Binary feature, indicating whether area is large.
- large_spot (int): Binary feature, indicating whether the area of the largest spot is greater than 5.
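The integer ranges for class, spot_size and spot_distr follow from the letter lists in `_prep_data`: each letter is replaced by its position in a fixed list. The sketch below reproduces that mapping with plain Python instead of pandas. The helper name `encode_row` and the sample record are hypothetical, made up for illustration and not quoted from flare.data2.

```python
# Letter lists copied from `_prep_data`; a letter's code is its list position.
ZURICH_CLASS = ["A", "B", "C", "D", "E", "F", "H"]  # codes 0..6
SPOT_SIZE = ["X", "R", "S", "A", "H", "K"]          # codes 0..5
SPOT_DISTR = ["X", "O", "I", "C"]                   # codes 0..3


def encode_row(line: str) -> list[int]:
    """Encode one space-separated record into 13 integers (hypothetical helper)."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 fields, got {len(fields)}")
    encoded = [
        ZURICH_CLASS.index(fields[0]),
        SPOT_SIZE.index(fields[1]),
        SPOT_DISTR.index(fields[2]),
    ]
    # The remaining ten columns are already numeric and are kept as-is.
    encoded.extend(int(value) for value in fields[3:])
    return encoded


# Made-up record in the same 13-column layout (an assumption, not real data).
sample = "D S O 1 2 1 1 2 1 2 1 0 0"
row = encode_row(sample)
print(row)  # [3, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 0, 0]
```

As with `list.index` in the original code, a letter outside the list raises `ValueError` rather than producing a silent missing value.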
Targets:
- C-class (int): C-class flares production by this region in the following 24 hours (common flares).
- M-class (int): M-class flares production by this region in the following 24 hours (moderate flares).
- X-class (int): X-class flares production by this region in the following 24 hours (severe flares).
Source:
https://archive.ics.uci.edu/ml/datasets/Solar+Flare
Examples:
Load in the data set::
    >>> dataset = SolarFlare()
    >>> dataset.shape
    (1066, 13)
Split the data set into features and targets, as NumPy arrays::
    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((1066, 10), (1066, 3))
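The shapes (1066, 10) and (1066, 3) match the class attributes `_features = range(10)` and `_targets = range(10, 13)`: the first ten columns are features and the last three are targets. How `split()` uses these attributes internally is not shown here, so the following is only a sketch of positional column selection on plain lists. `split_row` and the two rows are hypothetical.

```python
# Column positions mirroring `_features` and `_targets` on SolarFlare.
FEATURE_IDX = range(10)
TARGET_IDX = range(10, 13)


def split_row(row):
    """Split one 13-column record into (features, targets) by position."""
    features = [row[i] for i in FEATURE_IDX]
    targets = [row[i] for i in TARGET_IDX]
    return features, targets


# Two made-up, already-encoded records (assumptions for illustration only).
rows = [
    [3, 2, 1, 1, 2, 1, 1, 2, 1, 2, 1, 0, 0],
    [6, 3, 0, 1, 2, 1, 1, 2, 1, 1, 0, 0, 0],
]

X = [split_row(r)[0] for r in rows]
y = [split_row(r)[1] for r in rows]

# Shapes as (rows, columns), analogous to the NumPy shapes above.
x_shape = (len(X), len(X[0]))
y_shape = (len(y), len(y[0]))
print(x_shape, y_shape)  # (2, 10) (2, 3)
```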
Perform a train/test split, also outputting NumPy arrays::
    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((837, 10), (837, 3), (229, 10), (229, 3))
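Note that the documented split puts 229 of 1066 rows in the test set, about 21.5% rather than exactly 20%. One mechanism that would give an inexact share like this is assigning each row to the test set independently with probability `test_size`. Whether `split()` works this way is not stated here, so the sketch below is an assumption, and `random_mask_split` is a hypothetical name. It uses the standard library's `random` module, so its counts will not reproduce 837 and 229.

```python
import random

N_ROWS = 1066    # from the documented `dataset.shape`
TEST_SIZE = 0.2  # from the documented `split` call


def random_mask_split(n_rows, test_size, seed):
    """Assign each row index to the test set with probability `test_size`."""
    rng = random.Random(seed)
    test_idx = [i for i in range(n_rows) if rng.random() < test_size]
    test_set = set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in test_set]
    return train_idx, test_idx


train_idx, test_idx = random_mask_split(N_ROWS, TEST_SIZE, seed=42)

# The two parts always cover every row exactly once, but the test share
# only approximates TEST_SIZE.
print(len(train_idx) + len(test_idx))  # 1066
print(round(229 / 1066, 3))            # 0.215, the share in the documented output
```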
Output the underlying Pandas DataFrame::
    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>