Verwalten von Nullen mit Pandas

In diesem Notebook werden einige Möglichkeiten zum Verwalten von Nullen mithilfe von Pandas DataFrames vorgestellt. Weitere Informationen dazu findet ihr in der Dokumentation von Pandas: Working with missing data.

Siehe auch:

[1]:
import pandas as pd
from numpy import random
[2]:
df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv')

Prüfen der Daten

[3]:
df.head()
[3]:
timestamp username temperature heartrate build latest note
0 2017-01-01T12:00:23 michaelsmith 12.0 67 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval
1 2017-01-01T12:01:09 kharrison 6.0 78 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake
2 2017-01-01T12:01:34 smithadam 5.0 89 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN
3 2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
4 2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN
[4]:
df.dtypes
[4]:
timestamp       object
username        object
temperature    float64
heartrate        int64
build           object
latest         float64
note            object
dtype: object
[5]:
df.note.value_counts()
[5]:
wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: note, dtype: int64

Entfernen aller Nullwerte (einschließlich des Hinweises n/a)

df = pd.read_csv(‘https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv‘, na_values=[‚n/a‘])

Testen, ob wir dropna verwenden können

[6]:
df.shape
[6]:
(146397, 7)
[7]:
df.dropna().shape
[7]:
(46116, 7)
[8]:
df.dropna(how='all', axis=1).shape
[8]:
(146397, 7)

Finden aller Spalten, in denen alle Daten vorhanden sind

[9]:
my_columns = list(df.columns)
[10]:
my_columns
[10]:
['timestamp',
 'username',
 'temperature',
 'heartrate',
 'build',
 'latest',
 'note']
[11]:
list(df.dropna(thresh=int(df.shape[0] * .9), axis=1).columns)
[11]:
['timestamp', 'username', 'heartrate']

Finden aller Spalten, in denen Daten fehlen

[12]:
missing_info = list(df.columns[df.isnull().any()])
[13]:
missing_info
[13]:
['temperature', 'build', 'latest', 'note']
[14]:
for col in missing_info:
    num_missing = df[df[col].isnull() == True].shape[0]
    print('number missing for column {}: {}'.format(col,
                                                    num_missing))
number missing for column temperature: 32357
number missing for column build: 32350
number missing for column latest: 32298
number missing for column note: 48704
[15]:
for col in missing_info:
    percent_missing = df[df[col].isnull() == True].shape[0] / df.shape[0]
    print('percent missing for column {}: {}'.format(
        col, percent_missing))
percent missing for column temperature: 0.22102228870810195
percent missing for column build: 0.22097447352063226
percent missing for column latest: 0.22061927498514314
percent missing for column note: 0.332684412931959

Fehlende Daten durch Mehrheitswerte ersetzen

[16]:
df.note.value_counts()
[16]:
wake        16496
user        16416
interval    16274
sleep       16226
update      16213
test        16068
Name: note, dtype: int64
[17]:
df.build.value_counts().head()
[17]:
b1d3b3a7-6639-9b0b-9b4c-22a976563f74    1
43b11996-707a-0522-23d5-19d17b1f41e6    1
ee8339c4-cbab-8164-a17e-2efb4f80dc18    1
012ba321-84f3-83e6-7d63-b344674bd40c    1
aacd60a6-100c-ac70-8322-13b5909604d9    1
Name: build, dtype: int64
[18]:
df.latest.value_counts()
[18]:
0.0    75735
1.0    38364
Name: latest, dtype: int64
[19]:
df.latest = df.latest.fillna(0)

Beispiel für die fehlenden Temperaturwerte

[20]:
df.username.value_counts().head()
[20]:
esmith    45
zsmith    43
vsmith    41
ysmith    40
jsmith    37
Name: username, dtype: int64
[21]:
df = df.set_index('timestamp')
[22]:
df.head()
[22]:
username temperature heartrate build latest note
timestamp
2017-01-01T12:00:23 michaelsmith 12.0 67 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval
2017-01-01T12:01:09 kharrison 6.0 78 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake
2017-01-01T12:01:34 smithadam 5.0 89 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN
2017-01-01T12:02:09 eddierodriguez 28.0 76 NaN 0.0 update
2017-01-01T12:02:36 kenneth94 29.0 62 122f1c6a-403c-2221-6ed1-b5caa08f11e0 0.0 NaN
[23]:
df.temperature = df.groupby('username').temperature.fillna(
    method='backfill', limit=3)