TDDA: Test-Driven Data Analysis

In this notebook we take a closer look at TDDA, a Python library that takes data inputs (such as NumPy arrays or Pandas DataFrames) and builds a set of constraints around them. You can then save your constraints (as JSON output) and test new data against the observed constraints.

1. Imports

[1]:
import pandas as pd
import numpy as np
from tdda.constraints.pdconstraints import discover_constraints, verify_df
[2]:
# Load the IoT example data set directly from GitHub
df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv')

2. Check the data

[3]:
df.sample(10)
[3]:
        timestamp            username         temperature  heartrate  build                                 latest  note
39897   2017-01-17T10:34:58  starknicholas    9            63         8aa3d627-c9b4-b57e-bb32-012b7ad30033  1       sleep
56184   2017-01-23T22:48:32  hendersonsteven  11           70         1feb8055-fbe9-88dc-ddea-01d35aadc6f3  0       user
79080   2017-02-02T03:07:30  wardtimothy      7            79         787bcdfb-4a56-d377-a6ee-c534ad477814  1       NaN
59493   2017-01-25T06:39:51  xevans           27           74         6aa43c1d-2e74-3247-fa8d-699e38a4a0bd  0       interval
143746  2017-02-27T22:40:25  aaron53          18           83         c61a9b30-404c-d76a-19a4-22e27b6727c7  0       user
39847   2017-01-17T10:05:42  amynichols       20           64         ac97a2bb-f7db-976c-4063-836c3a931345  0       user
20457   2017-01-09T16:02:10  jonessarah       22           84         9a5e10ab-477c-793d-312e-957ff63031e5  0       NaN
123208  2017-02-19T17:19:59  jperkins         15           70         2eb83fa8-b99c-9e70-a89e-a0003aef7c57  0       sleep
80161   2017-02-02T13:27:04  allenjones       6            72         ec921130-a5e3-9b0a-e9aa-88cee0e74b7c  1       user
145244  2017-02-28T13:00:35  davidreese       19           72         a4f85ee7-7a79-a400-71f9-6f2316b55ebb  0       wake
[4]:
df.dtypes
[4]:
timestamp      object
username       object
temperature     int64
heartrate       int64
build          object
latest          int64
note           object
dtype: object

3. Create a constraints object with discover_constraints

[5]:
# Derive one set of constraints per column from the observed data
constraints = discover_constraints(df)
[6]:
constraints
[6]:
<tdda.constraints.base.DatasetConstraints at 0x1156b82e8>
[7]:
constraints.fields
[7]:
Fields([('timestamp', <tdda.constraints.base.FieldConstraints at 0x1156b8518>),
        ('username', <tdda.constraints.base.FieldConstraints at 0x1156b8710>),
        ('temperature',
         <tdda.constraints.base.FieldConstraints at 0x1156b88d0>),
        ('heartrate', <tdda.constraints.base.FieldConstraints at 0x1156b8668>),
        ('build', <tdda.constraints.base.FieldConstraints at 0x1156b8b38>),
        ('latest', <tdda.constraints.base.FieldConstraints at 0x1156b8da0>),
        ('note', <tdda.constraints.base.FieldConstraints at 0x1156b8ef0>)])
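The FieldConstraints objects above are opaque in this output, so it can help to picture what kind of information is collected per column. The following is a minimal sketch in plain Python, not TDDA's implementation: the function name discover_int_constraints is hypothetical, the sample values are taken from the df.sample(10) output above, and the rule of recording max_nulls only when no nulls were seen is an assumption made for illustration.

```python
# Illustrative sketch only: this is NOT how TDDA is implemented internally.
# It mimics the kind of constraints discovered for an integer column.

def discover_int_constraints(values):
    """Derive simple constraints from a non-empty list of ints.

    None marks a missing value; at least one value must be present.
    """
    present = [v for v in values if v is not None]
    n_nulls = len(values) - len(present)
    lo, hi = min(present), max(present)

    # Classify the sign from the observed range.
    if lo > 0:
        sign = "positive"
    elif lo >= 0:
        sign = "non-negative"
    elif hi < 0:
        sign = "negative"
    elif hi <= 0:
        sign = "non-positive"
    else:
        sign = None  # mixed signs: no sign constraint

    constraints = {"type": "int", "min": lo, "max": hi}
    if sign is not None:
        constraints["sign"] = sign
    # Assumption for this sketch: only record a null limit if no nulls were seen.
    if n_nulls == 0:
        constraints["max_nulls"] = 0
    return constraints


# A few values from the sampled rows above.
temperature_constraints = discover_int_constraints([9, 11, 7, 27, 18])
latest_constraints = discover_int_constraints([1, 0, 1, 0, 0])
print(temperature_constraints)
print(latest_constraints)
```

Because the constraints are derived purely from the observed values, a small sample yields a narrower range than the full data set does.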

4. Write the constraints to a file

[8]:
# Serialise the discovered constraints as JSON
with open('../../data/ignore-iot_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())
[9]:
cat ../../data/ignore-iot_constraints.tdda
{
    "creation_metadata": {
        "local_time": "2020-07-06 14:14:58",
        "utc_time": "2020-07-06 12:12:58",
        "creator": "TDDA 1.0.31",
        "host": "eve.local",
        "user": "veit",
        "n_records": 146397,
        "n_selected": 146397
    },
    "fields": {
        "timestamp": {
            "type": "string",
            "min_length": 19,
            "max_length": 19,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "username": {
            "type": "string",
            "min_length": 3,
            "max_length": 21,
            "max_nulls": 0
        },
        "temperature": {
            "type": "int",
            "min": 5,
            "max": 29,
            "sign": "positive",
            "max_nulls": 0
        },
        "heartrate": {
            "type": "int",
            "min": 60,
            "max": 89,
            "sign": "positive",
            "max_nulls": 0
        },
        "build": {
            "type": "string",
            "min_length": 36,
            "max_length": 36,
            "max_nulls": 0,
            "no_duplicates": true
        },
        "latest": {
            "type": "int",
            "min": 0,
            "max": 1,
            "sign": "non-negative",
            "max_nulls": 0
        },
        "note": {
            "type": "string",
            "min_length": 4,
            "max_length": 8,
            "allowed_values": [
                "interval",
                "sleep",
                "test",
                "update",
                "user",
                "wake"
            ]
        }
    }
}
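Since the constraints file is plain JSON, it can be inspected with the standard library alone. The sketch below embeds an excerpt of the output above (two of the seven fields) as a string so that it runs without the file; the variable names are chosen for illustration only. It lists which kinds of constraints were recorded per field and which fields carry no max_nulls entry.

```python
import json

# Excerpt of the constraints file shown above (two of the seven fields).
constraints_json = """
{
    "fields": {
        "temperature": {
            "type": "int", "min": 5, "max": 29,
            "sign": "positive", "max_nulls": 0
        },
        "note": {
            "type": "string", "min_length": 4, "max_length": 8,
            "allowed_values": ["interval", "sleep", "test",
                               "update", "user", "wake"]
        }
    }
}
"""

fields = json.loads(constraints_json)["fields"]

# Which kinds of constraints were recorded for each field?
summary = {name: sorted(spec) for name, spec in fields.items()}
for name, kinds in summary.items():
    print(f"{name}: {', '.join(kinds)}")

# Fields without a max_nulls entry have no recorded limit on missing values.
no_null_limit = [name for name, spec in fields.items() if "max_nulls" not in spec]
print("no null limit:", no_null_limit)
```

In the full file, note is the only field without a max_nulls entry, which matches the NaN values visible in the sampled rows.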

5. Verify DataFrames with verify_df

[10]:
new_df = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv')
[11]:
# Check the new data against the saved constraints; v holds the verification result
v = verify_df(new_df, '../../data/ignore-iot_constraints.tdda')
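Conceptually, verification checks each recorded constraint against the new data and reports which ones pass and which fail. The following is a minimal sketch of that idea in plain Python, not TDDA's implementation: verify_field is a hypothetical helper, the constraint values are copied from the temperature entry in the constraints file, and the new values are made up for illustration.

```python
# Illustrative sketch only: this is NOT how TDDA performs verification.

def verify_field(values, spec):
    """Return {constraint_name: passed} for one column; None marks a null."""
    present = [v for v in values if v is not None]
    n_nulls = len(values) - len(present)
    results = {}
    if "min" in spec:
        results["min"] = all(v >= spec["min"] for v in present)
    if "max" in spec:
        results["max"] = all(v <= spec["max"] for v in present)
    if "max_nulls" in spec:
        results["max_nulls"] = n_nulls <= spec["max_nulls"]
    if "allowed_values" in spec:
        results["allowed_values"] = all(v in spec["allowed_values"] for v in present)
    return results


# Constraint values copied from the "temperature" entry of the constraints file.
temperature_spec = {"min": 5, "max": 29, "max_nulls": 0}

# Hypothetical new data: one missing value and one reading above the maximum.
new_temperatures = [12, None, 31]

results = verify_field(new_temperatures, temperature_spec)
passes = sum(results.values())
failures = len(results) - passes
print(results)
print(f"passes: {passes}, failures: {failures}")
```

Here the min constraint passes, while max and max_nulls fail, because the new data contains a value outside the observed range and a null that the original data did not have.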