scikit-learn Preprocessing

[1]:
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd
from datetime import datetime
[2]:
hvac = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv')

Checking data quality

[3]:
hvac.dtypes
[3]:
Date           object
Time           object
TargetTemp    float64
ActualTemp      int64
System          int64
SystemAge     float64
BuildingID      int64
10            float64
dtype: object
[4]:
hvac.shape
[4]:
(8000, 8)
[5]:
hvac.head()
[5]:
     Date     Time  TargetTemp  ActualTemp  System  SystemAge  BuildingID   10
0  6/1/13  0:00:01        66.0          58      13       20.0           4  NaN
1  6/2/13  1:00:01         NaN          68       3       20.0          17  NaN
2  6/3/13  2:00:01        70.0          73      17       20.0          18  NaN
3  6/4/13  3:00:01        67.0          63       2        NaN          15  NaN
4  6/5/13  4:00:01        68.0          74      16        9.0           3  NaN
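
The first rows already show missing values in TargetTemp, SystemAge and the column named 10. One way to quantify this per column is sketched below. The `sample` frame is a hypothetical stand-in that mirrors the five rows above, not the full CSV, so the counts apply only to this sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the first five rows shown above (not the full CSV).
sample = pd.DataFrame({
    'TargetTemp': [66.0, np.nan, 70.0, 67.0, 68.0],
    'ActualTemp': [58, 68, 73, 63, 74],
    'SystemAge': [20.0, 20.0, 20.0, np.nan, 9.0],
    '10': [np.nan] * 5,
})

# Number of missing values per column.
null_counts = sample.isnull().sum()

# Share of missing values per column, between 0 and 1.
null_share = sample.isnull().mean()

print(null_counts)
print(null_share)
```

Running the same two calls on `hvac` would show how many values each column is missing in the full data set.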

Imputing missing values with the mean

[6]:
imp = SimpleImputer(missing_values=np.nan,
                    strategy='mean')
[7]:
hvac_numeric = hvac[['TargetTemp', 'SystemAge']]
[8]:
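# Fits on rows 0 to 10 only; fit_transform in the next cell refits on all rows, so this fit is superseded.
imp = imp.fit(hvac_numeric.loc[:10])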
[9]:
transformed = imp.fit_transform(hvac_numeric)
[10]:
transformed
[10]:
array([[66.        , 20.        ],
       [67.50773481, 20.        ],
       [70.        , 20.        ],
       ...,
       [67.50773481,  4.        ],
       [65.        , 23.        ],
       [66.        , 21.        ]])
[11]:
hvac['TargetTemp'], hvac['SystemAge'] = transformed[:,0], transformed[:,1]
[12]:
hvac.head()
[12]:
     Date     Time  TargetTemp  ActualTemp  System  SystemAge  BuildingID   10
0  6/1/13  0:00:01   66.000000          58      13  20.000000           4  NaN
1  6/2/13  1:00:01   67.507735          68       3  20.000000          17  NaN
2  6/3/13  2:00:01   70.000000          73      17  20.000000          18  NaN
3  6/4/13  3:00:01   67.000000          63       2  15.386643          15  NaN
4  6/5/13  4:00:01   68.000000          74      16   9.000000           3  NaN
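
The cells above call both `fit` and `fit_transform`, so it may help to see how the two differ. The following is a minimal sketch on a hypothetical toy array `X` (not the HVAC data): `fit` learns one mean per column, `transform` reuses those means, and `fit_transform` learns new ones from whatever it is given.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two columns, one missing value in each.
X = np.array([[66.0, 20.0],
              [np.nan, 20.0],
              [70.0, np.nan],
              [68.0, 8.0]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() learns one mean per column, ignoring NaN entries.
imp.fit(X)
learned_means = imp.statistics_.copy()

# transform() fills NaN with the learned means without refitting.
filled = imp.transform(X)

# fit_transform() refits first, so the means now come from the first two rows only.
refit_filled = imp.fit_transform(X[:2])

print(learned_means)
print(filled)
print(refit_filled)
```

To impute with statistics learned from a subset, one would call `fit` on the subset and then `transform`, not `fit_transform`, on the remaining data.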

Scaling temperature values

[13]:
hvac['ScaledTemp'] = preprocessing.scale(hvac['ActualTemp'])
[14]:
hvac['ScaledTemp'].head()
[14]:
0   -1.293272
1    0.048732
2    0.719733
3   -0.622270
4    0.853934
Name: ScaledTemp, dtype: float64
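
`preprocessing.scale` standardizes the values to zero mean and unit variance. The sketch below reproduces that by hand on a hypothetical `temps` array (illustrative values, not read from the HVAC file) to make the formula explicit.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical temperatures, used only for illustration.
temps = np.array([58.0, 68.0, 73.0, 63.0, 74.0])

# preprocessing.scale centers to mean 0 and scales to unit variance.
scaled = preprocessing.scale(temps)

# The same z-score by hand; np.std defaults to the population
# standard deviation (ddof=0), which matches what scale() uses.
manual = (temps - temps.mean()) / temps.std()

print(scaled)
print(manual)
```

If the same scaling has to be applied to data that arrives later, `preprocessing.StandardScaler` stores the mean and standard deviation so they can be reused.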

Scaling with the MinMaxScaler

[15]:
min_max_scaler = preprocessing.MinMaxScaler()
[16]:
temp_minmax = min_max_scaler.fit_transform(hvac[['ActualTemp']])
[17]:
temp_minmax
[17]:
array([[0.12],
       [0.52],
       [0.72],
       ...,
       [0.56],
       [0.32],
       [0.44]])
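
With its default settings, `MinMaxScaler` maps each column to the range 0 to 1 via (x - min) / (max - min). The output above is consistent with a minimum of 55 and a maximum of 80 in ActualTemp, since 58 maps to 0.12 and 73 to 0.72. A minimal sketch on a hypothetical `temps` column (not the HVAC data) shows the formula and how to undo the scaling:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical single-column data; the scaler expects a 2D array.
temps = np.array([[58.0], [68.0], [73.0], [63.0], [74.0]])

scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(temps)

# The same result by hand: (x - min) / (max - min).
manual = (temps - temps.min()) / (temps.max() - temps.min())

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)

print(scaled)
print(restored)
```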