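While studying data analysis recently, I came across sklearn (scikit-learn), a Python library that bundles a large number of machine learning models and looks very powerful (see the official site).
Below, a simple prediction example serves as a hands-on introduction to sklearn.
The example is adapted from one on pythonprogramming. I used the Anaconda distribution for this experiment, so the required packages did not have to be downloaded separately, but sklearn did need to be updated with the conda command to the version used in the official tutorial (version problems cost me a lot of time before I worked this out). Without Anaconda, supporting libraries such as scipy and numpy have to be installed first, which is easy to do with pip install.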
Import the required libraries

    import pandas as pd
    import numpy as np
    import datetime
    # pandas.io.data was moved out of pandas into the separate pandas-datareader package
    import pandas_datareader.data as web
    import math
    import matplotlib.pyplot as plt
    from matplotlib import style
    from sklearn.model_selection import cross_val_score
    # sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
    from sklearn import preprocessing, svm
    from sklearn import model_selection as cross_validation
    from sklearn.linear_model import LinearRegression
First, fetch the Yahoo stock data we need from the internet:

    start = datetime.datetime(2015, 1, 1)
    end = datetime.datetime(2016, 11, 20)
    # Fetch the data from the internet
    df = web.DataReader("XOM", "yahoo", start, end)
    # print(df.head())
Build our dataset

    df = df[['Open', 'High', 'Low', 'Close', 'Volume']]
    df['HL_PCT'] = (df['High'] - df['Low']) / df['Close'] * 100.0
    df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100.0
    df = df[['Close', 'HL_PCT', 'PCT_change', 'Volume']]
    # print(df.head())
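To make the two derived columns concrete, here is a minimal sketch on a tiny hand-made table. The prices are hypothetical values I made up for illustration, not real XOM data, and the name `toy` is not part of the original code.

```python
import pandas as pd

# Hypothetical prices for two trading days (not real XOM data).
toy = pd.DataFrame({
    "Open":  [100.0, 50.0],
    "High":  [110.0, 55.0],
    "Low":   [90.0, 45.0],
    "Close": [100.0, 52.0],
})

# Daily high-low range as a percentage of the close.
toy["HL_PCT"] = (toy["High"] - toy["Low"]) / toy["Close"] * 100.0
# Open-to-close move as a percentage of the open.
toy["PCT_change"] = (toy["Close"] - toy["Open"]) / toy["Open"] * 100.0

print(toy[["HL_PCT", "PCT_change"]])
```

On the first day the range is 20 against a close of 100, so HL_PCT is 20%, and because the close equals the open, PCT_change is 0%.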
Handle missing data and set up the forecast of the Close price; forecast_out is the number of days ahead to predict.

    forecast_col = 'Close'
    df.fillna(value=-99999, inplace=True)
    forecast_out = int(math.ceil(0.01 * len(df)))
    # Predict the value forecast_out days ahead
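The forecast horizon is simply 1% of the number of rows, rounded up. A small sketch of that arithmetic follows; the helper name `forecast_horizon` and the row counts are my own assumptions for illustration, since the actual number of rows depends on the data that was downloaded.

```python
import math


def forecast_horizon(n_rows, fraction=0.01):
    """Rows to forecast ahead: `fraction` of the data, rounded up (hypothetical helper)."""
    return int(math.ceil(fraction * n_rows))


# Assumed row counts, purely for illustration.
print(forecast_horizon(476))  # 5
print(forecast_horizon(250))  # 3
```

Rounding up guarantees a horizon of at least one day for any non-empty dataset.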
Rebuild X and y: X consists of ['Close', 'HL_PCT', 'PCT_change', 'Volume'], and y is ['label'], the stock price forecast_out days later. preprocessing.scale is used to scale the dataset.
X_lately is the data for the last forecast_out days, that is, the rows whose y value is NaN.

    # The label of each row is the Close price forecast_out rows later
    df['label'] = df[forecast_col].shift(-forecast_out)

    print(df.shape)
    print(df.tail())
    X = np.array(df.drop(['label'], axis=1))

    X = preprocessing.scale(X)

    # The last forecast_out rows have no label yet; keep them for prediction
    X_lately = X[-forecast_out:]
    X = X[:-forecast_out]
    df.dropna(inplace=True)
    print(X)
    print(X_lately)
    y = np.array(df['label'])
    # print(y)
    print(X.shape)
    print(y.shape)
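The shift-and-slice step is easier to follow on a toy series. The sketch below uses six made-up closing prices and an assumed horizon of 2, and it leaves out the scaling step so the numbers stay readable.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for six days.
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
forecast_out = 2  # assumed horizon for this sketch

# Each row's label is the close `forecast_out` rows later.
toy["label"] = toy["Close"].shift(-forecast_out)

X = np.array(toy.drop(["label"], axis=1))
X_lately = X[-forecast_out:]  # rows with no known label yet
X = X[:-forecast_out]         # rows that do have a label

toy.dropna(inplace=True)
y = np.array(toy["label"])

print(X.ravel())         # [10. 11. 12. 13.]
print(y)                 # [12. 13. 14. 15.]
print(X_lately.ravel())  # [14. 15.]
```

After the split, X and y have the same length, and X_lately holds exactly the rows that the model will later be asked to predict.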
Use 80% of the dataset as the training set and 20% as the test set.
The model is built with the LinearRegression class provided by sklearn and then evaluated on the test set. Note that for LinearRegression, score returns the coefficient of determination R², not a classification accuracy, even though the variable is named accuracy.

    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)

    clf = LinearRegression()
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)

    print(accuracy)
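To show what the score from a linear regression actually measures, here is a minimal sketch on synthetic data. The data, the coefficients, and the fixed random seeds are assumptions made only for this illustration; it checks that `score` agrees with `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a linear signal plus a little noise (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(200)

# 80% training data, 20% test data, as in the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# score() on a regressor is the coefficient of determination R^2.
score = model.score(X_test, y_test)
manual = r2_score(y_test, model.predict(X_test))
print(score, manual)
```

A value close to 1 means the model explains most of the variance in the test labels; because the split in the post is random and unseeded, its printed score will vary slightly from run to run.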
forecast_set holds the predictions that the model trained on the training set makes for our most recent data.
We add the predictions to the earlier DataFrame and finally plot the stock price trend.

    forecast_set = clf.predict(X_lately)
    print(forecast_set, accuracy, forecast_out)

    style.use('ggplot')

    df['Forecast'] = np.nan

    last_date = df.iloc[-1].name
    last_unix = last_date.timestamp()
    print(last_date, last_unix)
    one_day = 86400  # seconds in one day
    next_unix = last_unix + one_day

    # Append one new row per predicted value, dated one day after the previous row
    for i in forecast_set:
        next_date = datetime.datetime.fromtimestamp(next_unix)
        next_unix += one_day
        df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]

    print(df.tail())

    df['Close'].plot()
    df['Forecast'].plot()
    plt.show()
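The row-appending loop can also be sketched with pandas date arithmetic instead of Unix timestamps, which should sidestep any time-zone dependence of converting through `fromtimestamp`. This is an alternative I am suggesting rather than the code from the original example; the prices, dates, and predictions below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical two-day price history indexed by date.
toy = pd.DataFrame(
    {"Close": [10.0, 11.0]},
    index=pd.to_datetime(["2016-11-17", "2016-11-18"]),
)
toy["Forecast"] = np.nan
forecast_set = [11.5, 12.0]  # hypothetical predictions

# Append one row per prediction, each dated one calendar day later.
next_date = toy.index[-1] + pd.Timedelta(days=1)
for value in forecast_set:
    # Every column is NaN except the last one, which holds the forecast.
    toy.loc[next_date] = [np.nan] * (len(toy.columns) - 1) + [value]
    next_date += pd.Timedelta(days=1)

print(toy)
```

Like the loop above, this steps through calendar days, so forecasts can land on weekends when the market is closed.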
The red line shows the original data, and the blue line shows the predicted stock trend.
For the full code, see Titanssword's GitHub.