doubt.datasets.stocks
Stocks data set.
This data set is from the UCI data set archive, with the description being the original description verbatim. Some feature names may have been altered, based on the description.
1"""Stocks data set. 2 3This data set is from the UCI data set archive, with the description being the original 4description verbatim. Some feature names may have been altered, based on the 5description. 6""" 7 8import io 9 10import pandas as pd 11 12from .dataset import BASE_DATASET_DESCRIPTION, BaseDataset 13 14 15class Stocks(BaseDataset): 16 __doc__ = f""" 17 There are three disadvantages of weighted scoring stock selection models. First, 18 they cannot identify the relations between weights of stock-picking concepts and 19 performances of portfolios. Second, they cannot systematically discover the optimal 20 combination for weights of concepts to optimize the performances. Third, they are 21 unable to meet various investors' preferences. 22 23 This study aims to more efficiently construct weighted scoring stock selection 24 models to overcome these disadvantages. Since the weights of stock-picking concepts 25 in a weighted scoring stock selection model can be regarded as components in a 26 mixture, we used the simplex centroid mixture design to obtain the experimental 27 sets of weights. These sets of weights are simulated with US stock market 28 historical data to obtain their performances. Performance prediction models were 29 built with the simulated performance data set and artificial neural networks. 30 31 Furthermore, the optimization models to reflect investors' preferences were built 32 up, and the performance prediction models were employed as the kernel of the 33 optimization models so that the optimal solutions can now be solved with 34 optimization techniques. The empirical values of the performances of the optimal 35 weighting combinations generated by the optimization models showed that they can 36 meet various investors' preferences and outperform those of S&P's 500 not only 37 during the training period but also during the testing period. 
38 39 {BASE_DATASET_DESCRIPTION} 40 41 Features: 42 bp (float): 43 Large B/P 44 roe (float): 45 Large ROE 46 sp (float): 47 Large S/P 48 return_rate (float): 49 Large return rate in the last quarter 50 market_value (float): 51 Large market value 52 small_risk (float): 53 Small systematic risk 54 orig_annual_return (float): 55 Annual return 56 orig_excess_return (float): 57 Excess return 58 orig_risk (float): 59 Systematic risk 60 orig_total_risk (float): 61 Total risk 62 orig_abs_win_rate (float): 63 Absolute win rate 64 orig_rel_win_rate (float): 65 Relative win rate 66 67 Targets: 68 annual_return (float): 69 Annual return 70 excess_return (float): 71 Excess return 72 risk (float): 73 Systematic risk 74 total_risk (float): 75 Total risk 76 abs_win_rate (float): 77 Absolute win rate 78 rel_win_rate (float): 79 Relative win rate 80 81 Source: 82 https://archive.ics.uci.edu/ml/datasets/Stock+portfolio+performance 83 84 Examples: 85 Load in the data set:: 86 87 >>> dataset = Stocks() 88 >>> dataset.shape 89 (252, 19) 90 91 Split the data set into features and targets, as NumPy arrays:: 92 93 >>> X, y = dataset.split() 94 >>> X.shape, y.shape 95 ((252, 12), (252, 6)) 96 97 Perform a train/test split, also outputting NumPy arrays:: 98 99 >>> train_test_split = dataset.split(test_size=0.2, random_seed=42) 100 >>> X_train, X_test, y_train, y_test = train_test_split 101 >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape 102 ((197, 12), (197, 6), (55, 12), (55, 6)) 103 104 Output the underlying Pandas DataFrame:: 105 106 >>> df = dataset.to_pandas() 107 >>> type(df) 108 <class 'pandas.core.frame.DataFrame'> 109 """ 110 111 _url = ( 112 "https://archive.ics.uci.edu/ml/machine-learning-databases/" 113 "00390/stock%20portfolio%20performance%20data%20set.xlsx" 114 ) 115 116 _features = range(12) 117 _targets = range(12, 18) 118 119 def _prep_data(self, data: bytes) -> pd.DataFrame: 120 """Prepare the data set. 
121 122 Args: 123 data (bytes): The raw data 124 125 Returns: 126 Pandas dataframe: The prepared data 127 """ 128 # Convert the bytes into a file-like object 129 xlsx_file = io.BytesIO(data) 130 131 # Load in the dataframes 132 cols = [ 133 "id", 134 "bp", 135 "roe", 136 "sp", 137 "return_rate", 138 "market_value", 139 "small_risk", 140 "orig_annual_return", 141 "orig_excess_return", 142 "orig_risk", 143 "orig_total_risk", 144 "orig_abs_win_rate", 145 "orig_rel_win_rate", 146 "annual_return", 147 "excess_return", 148 "risk", 149 "total_risk", 150 "abs_win_rate", 151 "rel_win_rate", 152 ] 153 sheets = ["1st period", "2nd period", "3rd period", "4th period"] 154 dfs = pd.read_excel( 155 xlsx_file, sheet_name=sheets, names=cols, skiprows=[0, 1], header=None 156 ) 157 158 # Concatenate the dataframes 159 df = pd.concat([dfs[sheet] for sheet in sheets], ignore_index=True) 160 161 return df
There are three disadvantages of weighted scoring stock selection models. First, they cannot identify the relations between weights of stock-picking concepts and performances of portfolios. Second, they cannot systematically discover the optimal combination for weights of concepts to optimize the performances. Third, they are unable to meet various investors' preferences.
This study aims to more efficiently construct weighted scoring stock selection models to overcome these disadvantages. Since the weights of stock-picking concepts in a weighted scoring stock selection model can be regarded as components in a mixture, we used the simplex centroid mixture design to obtain the experimental sets of weights. These sets of weights are simulated with US stock market historical data to obtain their performances. Performance prediction models were built with the simulated performance data set and artificial neural networks.
Furthermore, the optimization models to reflect investors' preferences were built up, and the performance prediction models were employed as the kernel of the optimization models so that the optimal solutions can now be solved with optimization techniques. The empirical values of the performances of the optimal weighting combinations generated by the optimization models showed that they can meet various investors' preferences and outperform those of S&P's 500 not only during the training period but also during the testing period.
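The simplex-centroid design mentioned above enumerates, for every non-empty subset of the stock-picking concepts, the point where those concepts share the weight equally and all others are zero. With the six concepts in this data set that gives 2^6 - 1 = 63 weight combinations, which, over the four periods, matches the 252 rows of the data set. A minimal sketch of such a design generator (the function is illustrative, not part of the package):

```python
from itertools import combinations

def simplex_centroid(n_components):
    """Generate simplex-centroid design points: one point per non-empty
    subset of components, with the subset's components weighted 1/k each."""
    points = []
    for k in range(1, n_components + 1):
        for subset in combinations(range(n_components), k):
            point = [0.0] * n_components
            for i in subset:
                point[i] = 1.0 / k
            points.append(point)
    return points

weights = simplex_centroid(6)  # six stock-picking concepts
print(len(weights))  # 63
print(weights[0])    # [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Each point's weights sum to one, as required for components of a mixture.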
Arguments:
- cache (str or None, optional): The name of the cache. It will be saved to
  cache in the current working directory. If None then no cache will be
  saved. Defaults to '.dataset_cache'.
Attributes:
- cache (str or None): The name of the cache.
- shape (tuple of integers): Dimensions of the data set
- columns (list of strings): List of column names in the data set
Features:
- bp (float): Large B/P
- roe (float): Large ROE
- sp (float): Large S/P
- return_rate (float): Large return rate in the last quarter
- market_value (float): Large market value
- small_risk (float): Small systematic risk
- orig_annual_return (float): Annual return
- orig_excess_return (float): Excess return
- orig_risk (float): Systematic risk
- orig_total_risk (float): Total risk
- orig_abs_win_rate (float): Absolute win rate
- orig_rel_win_rate (float): Relative win rate
Targets:
- annual_return (float): Annual return
- excess_return (float): Excess return
- risk (float): Systematic risk
- total_risk (float): Total risk
- abs_win_rate (float): Absolute win rate
- rel_win_rate (float): Relative win rate
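The class stores the split as positional column ranges: `_features = range(12)` and `_targets = range(12, 18)`. Since the full frame has 19 columns but the split yields 12 + 6, these ranges are presumably applied after the `id` column is dropped; the `BaseDataset` internals are not shown here, so that is an assumption. A toy sketch of the slicing:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the 19-column layout: id + 12 features + 6 targets
cols = (
    ["id"]
    + ["bp", "roe", "sp", "return_rate", "market_value", "small_risk",
       "orig_annual_return", "orig_excess_return", "orig_risk",
       "orig_total_risk", "orig_abs_win_rate", "orig_rel_win_rate"]
    + ["annual_return", "excess_return", "risk", "total_risk",
       "abs_win_rate", "rel_win_rate"]
)
df = pd.DataFrame(np.zeros((4, 19)), columns=cols)

# Assumption: the id column is dropped before range(12) and range(12, 18)
# are used as positional column indices
data = df.drop(columns="id")
X = data.iloc[:, list(range(12))].to_numpy()
y = data.iloc[:, list(range(12, 18))].to_numpy()
print(X.shape, y.shape)  # (4, 12) (4, 6)
```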
Source:
https://archive.ics.uci.edu/ml/datasets/Stock+portfolio+performance
Examples:
Load in the data set::

    >>> dataset = Stocks()
    >>> dataset.shape
    (252, 19)

Split the data set into features and targets, as NumPy arrays::

    >>> X, y = dataset.split()
    >>> X.shape, y.shape
    ((252, 12), (252, 6))

Perform a train/test split, also outputting NumPy arrays::

    >>> train_test_split = dataset.split(test_size=0.2, random_seed=42)
    >>> X_train, X_test, y_train, y_test = train_test_split
    >>> X_train.shape, y_train.shape, X_test.shape, y_test.shape
    ((197, 12), (197, 6), (55, 12), (55, 6))

Output the underlying Pandas DataFrame::

    >>> df = dataset.to_pandas()
    >>> type(df)
    <class 'pandas.core.frame.DataFrame'>
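The `_prep_data` method in the source above relies on `pd.read_excel` returning a dict of DataFrames, one per sheet, when `sheet_name` is a list; `pd.concat` with `ignore_index=True` then stacks them under a fresh 0-based index. A small stand-in with synthetic frames in place of the downloaded workbook:

```python
import pandas as pd

# Stand-ins for the dict of per-sheet frames that read_excel would return
sheets = ["1st period", "2nd period", "3rd period", "4th period"]
dfs = {name: pd.DataFrame({"annual_return": [0.1, 0.2]}) for name in sheets}

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each
# sheet's own 0..1 index
df = pd.concat([dfs[name] for name in sheets], ignore_index=True)
print(df.shape)           # (8, 1)
print(df.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```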