doubt.datasets.solar_flare

Solar flare data set.

This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.

"""Solar flare data set.

This data set is from the UCI data set archive, with the description being the original
description verbatim. Some feature names may have been altered, based on the
description.
"""

import io

import pandas as pd

from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset


class SolarFlare(BaseDataset):
    __doc__ = f"""
    Each class attribute counts the number of solar flares of a certain class that
    occur in a 24 hour period.

    The database contains 3 potential classes, one for the number of times a certain
    type of solar flare occurred in a 24 hour period.

    Each instance represents captured features for 1 active region on the sun.

    The data are divided into two sections. The second section (flare.data2) has had
    much more error correction applied to it, and has consequently been treated as
    more reliable.

    {BASE_DATASET_DESCRIPTION}

    Features:
        class (int):
            Code for class (modified Zurich class). Ranges from 0 to 6 inclusive
        spot_size (int):
            Code for largest spot size. Ranges from 0 to 5 inclusive
        spot_distr (int):
            Code for spot distribution. Ranges from 0 to 3 inclusive
        activity (int):
            Binary feature indicating 1 = reduced and 2 = unchanged
        evolution (int):
            0 = decay, 1 = no growth and 2 = growth
        flare_activity (int):
            Previous 24 hour flare activity code, where 0 = nothing as big as an M1, 1
            = one M1 and 2 = more activity than one M1
        is_complex (int):
            Binary feature indicating historically complex
        became_complex (int):
            Binary feature indicating whether the region became historically complex on
            this pass across the sun's disk
        large (int):
            Binary feature, indicating whether area is large
        large_spot (int):
            Binary feature, indicating whether the area of the largest spot is greater
            than 5

    Targets:
        C-class (int):
            C-class flares production by this region in the following 24 hours (common
            flares)
        M-class (int):
            M-class flares production by this region in the following 24 hours
            (moderate flares)
        X-class (int):
            X-class flares production by this region in the following 24 hours (severe
            flares)

    Source:
        https://archive.ics.uci.edu/ml/datasets/Solar+Flare

    Examples:
        Load in the data set::

            >>> dataset = SolarFlare()
            >>> dataset.shape
            (1066, 13)

        Split the data set into features and targets, as NumPy arrays::

            >>> X, y = dataset.split()
            >>> X.shape, y.shape
            ((1066, 10), (1066, 3))

        Perform a train/test split, also outputting NumPy arrays::

            >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
            >>> X_train, X_test, y_train, y_test = train_test_split
            >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
            ((837, 10), (837, 3), (229, 10), (229, 3))

        Output the underlying Pandas DataFrame::

            >>> df = dataset.to_pandas()
            >>> type(df)
            <class 'pandas.core.frame.DataFrame'>
    """

    _url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases/"
        "solar-flare/flare.data2"
    )

    _features = range(10)
    _targets = range(10, 13)

    def _prep_data(self, data: bytes) -> pd.DataFrame:
        """Prepare the data set.

        Args:
            data (bytes): The raw data

        Returns:
            Pandas dataframe: The prepared data
        """
        # Convert the bytes into a file-like object
        csv_file = io.BytesIO(data)

        # Load in dataframe
        cols = [
            "class",
            "spot_size",
            "spot_distr",
            "activity",
            "evolution",
            "flare_activity",
            "is_complex",
            "became_complex",
            "large",
            "large_spot",
            "C-class",
            "M-class",
            "X-class",
        ]
        df = pd.read_csv(csv_file, sep=" ", skiprows=[0], names=cols)

        # Encode class
        encodings = ["A", "B", "C", "D", "E", "F", "H"]
        df["class"] = df["class"].map(lambda x: encodings.index(x))

        # Encode spot size
        encodings = ["X", "R", "S", "A", "H", "K"]
        df["spot_size"] = df.spot_size.map(lambda x: encodings.index(x))

        # Encode spot distribution
        encodings = ["X", "O", "I", "C"]
        df["spot_distr"] = df.spot_distr.map(lambda x: encodings.index(x))

        return df
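
The preparation step in `_prep_data` can be tried offline on a few rows. The sketch below is a standalone approximation, not the library's own code: the `prep` function, the `RAW` sample, and its header line are hypothetical and made up for illustration, and only the column names and letter orderings are taken from the listing above. It swaps the `list.index` lookup for a dictionary lookup, which is an assumption about what might be preferable, not what the class does.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout the loader expects: one header
# line (skipped), then space-separated rows. These rows are invented.
RAW = b"""header line that is skipped
C S O 1 2 1 1 2 1 2 0 0 0
H A X 1 3 1 1 2 1 1 1 0 0
F K C 2 3 2 2 1 2 2 3 1 0
"""

COLS = [
    "class", "spot_size", "spot_distr", "activity", "evolution",
    "flare_activity", "is_complex", "became_complex", "large",
    "large_spot", "C-class", "M-class", "X-class",
]

# Letter codes in the order used by _prep_data; the integer code of a
# letter is its position in the list.
CODES = {
    "class": ["A", "B", "C", "D", "E", "F", "H"],
    "spot_size": ["X", "R", "S", "A", "H", "K"],
    "spot_distr": ["X", "O", "I", "C"],
}


def prep(data: bytes) -> pd.DataFrame:
    """Parse raw bytes and integer-encode the three letter-coded columns."""
    df = pd.read_csv(io.BytesIO(data), sep=" ", skiprows=[0], names=COLS)
    for col, letters in CODES.items():
        lookup = {letter: idx for idx, letter in enumerate(letters)}
        df[col] = df[col].map(lookup)
    return df


df = prep(RAW)
print(df[["class", "spot_size", "spot_distr"]])
```

With a dictionary passed to `Series.map`, a letter that is not in the list becomes `NaN` instead of raising `ValueError` as `list.index` would, so which variant fits depends on whether unknown codes should fail loudly.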
class SolarFlare(doubt.datasets.dataset.BaseDataset):

Each class attribute counts the number of solar flares of a certain class that occur in a 24 hour period.

The database contains 3 potential classes, one for the number of times a certain type of solar flare occurred in a 24 hour period.

Each instance represents captured features for 1 active region on the sun.

The data are divided into two sections. The second section (flare.data2) has had much more error correction applied to it, and has consequently been treated as more reliable.

Arguments:
  • cache (str or None, optional): The name of the cache. It will be saved to cache in the current working directory. If None then no cache will be saved. Defaults to '.dataset_cache'.
Attributes:
  • cache (str or None): The name of the cache.
  • shape (tuple of integers): Dimensions of the data set
  • columns (list of strings): List of column names in the data set
Features:

  • class (int): Code for class (modified Zurich class). Ranges from 0 to 6 inclusive
  • spot_size (int): Code for largest spot size. Ranges from 0 to 5 inclusive
  • spot_distr (int): Code for spot distribution. Ranges from 0 to 3 inclusive
  • activity (int): Binary feature indicating 1 = reduced and 2 = unchanged
  • evolution (int): 0 = decay, 1 = no growth and 2 = growth
  • flare_activity (int): Previous 24 hour flare activity code, where 0 = nothing as big as an M1, 1 = one M1 and 2 = more activity than one M1
  • is_complex (int): Binary feature indicating historically complex
  • became_complex (int): Binary feature indicating whether the region became historically complex on this pass across the sun's disk
  • large (int): Binary feature, indicating whether area is large
  • large_spot (int): Binary feature, indicating whether the area of the largest spot is greater than 5
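
The first three features are integer codes for letters in the raw file, so it can help to map them back when inspecting a row. The sketch below assumes the letter orderings used in the source listing of `_prep_data`; the helper `decode_region` is hypothetical and not part of the library, and joining the three letters into one string is only a display convenience.

```python
# Letter codes in the order used by _prep_data; the integer stored in
# the data set is the index of the letter in its list.
CLASS_CODES = ["A", "B", "C", "D", "E", "F", "H"]
SPOT_SIZE_CODES = ["X", "R", "S", "A", "H", "K"]
SPOT_DISTR_CODES = ["X", "O", "I", "C"]


def decode_region(cls: int, spot_size: int, spot_distr: int) -> str:
    """Return the three letters behind the integer codes of one region."""
    return (
        CLASS_CODES[cls]
        + SPOT_SIZE_CODES[spot_size]
        + SPOT_DISTR_CODES[spot_distr]
    )


# Decode one region given as (class, spot_size, spot_distr)
print(decode_region(3, 4, 1))  # prints "DHO"
```

An out-of-range code raises `IndexError`, which matches the documented ranges of 0 to 6, 0 to 5 and 0 to 3.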

Targets:

  • C-class (int): C-class flares production by this region in the following 24 hours (common flares)
  • M-class (int): M-class flares production by this region in the following 24 hours (moderate flares)
  • X-class (int): X-class flares production by this region in the following 24 hours (severe flares)

Source:

https://archive.ics.uci.edu/ml/datasets/Solar+Flare

Examples:

Load in the data set:

>>> dataset = SolarFlare()
>>> dataset.shape
(1066, 13)

Split the data set into features and targets, as NumPy arrays:

>>> X, y = dataset.split()
>>> X.shape, y.shape
((1066, 10), (1066, 3))

Perform a train/test split, also outputting NumPy arrays:

>>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
>>> X_train, X_test, y_train, y_test = train_test_split
>>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
((837, 10), (837, 3), (229, 10), (229, 3))

Output the underlying Pandas DataFrame:

>>> df = dataset.to_pandas()
>>> type(df)
<class 'pandas.core.frame.DataFrame'>