Datenvalidierung mit Voluptuous (Schemadefinitionen)#

In diesem Notebook verwenden wir Voluptuous, um Schemata für unsere Daten zu definieren. Wir können dann die Schemaprüfung an verschiedenen Stellen unserer Bereinigung verwenden, um sicherzustellen, dass wir die Kriterien erfüllen. Schließllich können wir Ausnahmen für die Schemaüberprüfung verwenden, um unreine oder ungültige Daten zu markieren, beiseite zu legen oder zu entfernen.

Siehe auch

1. Importe#

[1]:
import logging

from datetime import datetime

import pandas as pd

from voluptuous import ALLOW_EXTRA, All, Range, Required, Schema
from voluptuous.error import Invalid, MultipleInvalid
  • Required markiert den Knoten eines Schemas als erforderlich und gibt optional einen Standardwert an, siehe auch voluptuous.schema_builder.Required.

  • Range begrenzt den Wert auf einen Bereich, wobei entweder min oder max weggelassen werden kann; siehe auch voluptuous.validators.Range.

  • ALL wird für feldübergreifende Validierungen verwendet: prüft die Grundstruktur der Daten in einem ersten Durchgang und erst im zweiten Durchgang wird die feldübergreifende Validierung angewendet; siehe auch voluptuous.validators.All.

  • ALLOW_EXTRA erlaubt zusätzliche Wörterbuchschlüssel

  • MultipleInvalid basiert auf Invalid, siehe auch voluptuous.error.MultipleInvalid.

  • Invalid kennzeichnet Daten als ungültig, siehe auch voluptuous.error.Invalid.

2. Logger#

[2]:
logger = logging.getLogger(0)

logger.setLevel(logging.WARNING)

3. Beispieldaten lesen#

[3]:
sales = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/sales_data.csv')

4. Daten untersuchen#

[4]:
sales.head()
[4]:
Unnamed: 0 timestamp city store_id sale_number sale_amount associate
0 0 2018-09-10 05:00:45 Williamburgh 6 1530 1167.0 Gary Lee
1 1 2018-09-12 10:01:27 Ibarraberg 1 2744 258.0 Daniel Davis
2 2 2018-09-13 12:01:48 Sarachester 2 1908 266.0 Michael Roth
3 3 2018-09-14 20:02:19 Caldwellbury 14 771 -108.0 Michaela Stewart
4 4 2018-09-16 01:03:21 Erikaland 11 1571 -372.0 Mark Taylor
[5]:
sales.dtypes
[5]:
Unnamed: 0       int64
timestamp       object
city            object
store_id         int64
sale_number      int64
sale_amount    float64
associate       object
dtype: object

5. Schema definieren#

In der Spalte sale_amount sollen alle Werte zwischen 2,5 and 1450,99 sein.

[6]:
schema = Schema(
    {
        Required("sale_amount"): All(float, Range(min=2.50, max=1450.99)),
    },
    extra=ALLOW_EXTRA,
)
[7]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
    try:
        schema(sale)
    except MultipleInvalid as e:
        logging.warning(
            "issue with sale: %s (%s) - %s", s_id, sale["sale_amount"], e
        )
        error_count += 1
WARNING:root:issue with sale: 3 (-108.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 4 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 5 (-399.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 6 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 7 (-295.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 10 (-89.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 13 (-303.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 15 (-432.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 19 (-177.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 20 (-154.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 22 (-130.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 23 (1487.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 25 (-145.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 28 (1471.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 31 (-259.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 38 (-241.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 40 (-4.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 41 (1581.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 45 (1529.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 46 (-238.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 48 (-284.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 51 (-164.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 55 (-184.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 56 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 59 (1579.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 60 (-455.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 63 (1551.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 65 (-397.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 69 (-400.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 70 (1482.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 71 (-321.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 74 (-47.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 76 (-68.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 86 (1454.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 101 (-213.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 103 (-144.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 104 (-265.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 107 (-349.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 111 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 112 (-310.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 116 (1570.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 120 (1490.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 123 (-179.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 124 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 129 (1504.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 130 (-91.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 132 (-372.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 141 (1512.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 142 (-449.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 149 (1494.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 152 (-405.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 155 (1599.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 156 (1527.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 157 (-462.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 162 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 164 (-78.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 167 (-358.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 171 (-391.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 178 (-304.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 180 (-9.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 187 (1475.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 194 (-433.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 195 (-329.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 196 (-147.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 203 (-319.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 206 (-132.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 207 (-20.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 209 (1539.0) - value must be at most 1450.99 for dictionary value @ data['sale_amount']
WARNING:root:issue with sale: 211 (-167.0) - value must be at least 2.5 for dictionary value @ data['sale_amount']

Um die Elemente einer Spalte als Schlüssel und die Elemente einer anderen Spalte als Werte verwenden zu können, machen wir einfach die gewünschte Spalte zum Index des DataFrame und transponieren sie mit der Funktion .T(); siehe auch pandas.DataFrame.transpose.

[8]:
error_count
[8]:
69

Aktuell wissen wir jedoch noch nicht, ob

  • wir ein falsch definiertes Schema haben

  • möglicherweise negative Werte zurückgegeben oder falsch markiert werden

  • höhere Werte kombinierte Einkäufe oder Sonderverkäufe sind

6. Hinzufügen einer benutzerdefinierten Validierung#

[9]:
def ValidDate(fmt="%Y-%m-%d %H:%M:%S"):
    return lambda v: datetime.strptime(v, fmt)
[10]:
schema = Schema(
    {
        Required("timestamp"): All(ValidDate()),
    },
    extra=ALLOW_EXTRA,
)
[11]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
    try:
        schema(sale)
    except MultipleInvalid as e:
        logging.warning(
            "issue with sale: %s (%s) - %s", s_id, sale["timestamp"], e
        )
        error_count += 1
[12]:
error_count
[12]:
0

7. Gültige Datumsstrukturen sind noch keine gültigen Daten#

[13]:
def ValidDate(fmt="%Y-%m-%d %H:%M:%S"):
    def validation_func(v):
        try:
            assert datetime.strptime(v, fmt) <= datetime.now()
        except AssertionError:
            raise Invalid("date is in the future! %s" % v)

    return validation_func
[14]:
schema = Schema(
    {
        Required("timestamp"): All(ValidDate()),
    },
    extra=ALLOW_EXTRA,
)
[15]:
error_count = 0
for s_id, sale in sales.T.to_dict().items():
    try:
        schema(sale)
    except MultipleInvalid as e:
        logging.warning(
            "issue with sale: %s (%s) - %s", s_id, sale["timestamp"], e
        )
        error_count += 1
[16]:
error_count
[16]:
0