Deduplicating data

In this notebook we deduplicate data with the Dedupe library, which uses a shallow neural network to learn from a small amount of training data.

The same developers also created parserator, which you can use to extract text features and to train your own text extraction.

1. Imports

[1]:
import pandas as pd
import dedupe
import os
[2]:
customers = pd.read_csv('https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv',
                        encoding='utf-8')

2. Checking data quality

[3]:
customers.head()
[3]:
  | name | job | company | street_address | city | state | email | user_name
0 | Patricia Schaefer | Programmer, systems | Estrada-Best | 398 Paul Drive | Christianview | Delaware | lambdavid@gmail.com | ndavidson
1 | Olivie Dubois | Ingénieur recherche et développement en agroal... | Moreno | rue Lucas Benard | Saint Anastasie-les-Bains | AR | berthelotjacqueline@mahe.fr | manonallain
2 | Mary Davies-Kirk | Public affairs consultant | Baker Ltd | Flat 3\nPugh mews | Stanleyfurt | ZA | middletonconor@hotmail.com | colemanmichael
3 | Miroslawa Eckbauer | Dispensing optician | Ladeck GmbH | Mijo-Lübs-Straße 12 | Neubrandenburg | Berlin | sophia01@yahoo.de | romanjunitz
4 | Richard Bauer | Accountant, chartered certified | Hoffman-Rocha | 6541 Rodriguez Wall | Carlosmouth | Texas | tross@jensen-ware.org | adam78
[4]:
customers.dtypes
[4]:
name              object
job               object
company           object
street_address    object
city              object
state             object
email             object
user_name         object
dtype: object
[5]:
for col in customers.columns:
    print(col, customers[col].isnull().sum())
name 0
job 0
company 0
street_address 0
city 0
state 0
email 0
user_name 0

3. Configuring Dedupe

Now we define the fields that Dedupe should look at during deduplication and create a new deduper object:

[6]:
# One entry per column that Dedupe should compare between records.
variables = [
    {'field': 'name', 'type': 'String'},
    {'field': 'job', 'type': 'String'},
    {'field': 'company', 'type': 'String'},
    {'field': 'street_address', 'type': 'String'},
    {'field': 'city', 'type': 'String'},
    # Dict-style definitions spell this key with a space ('has missing');
    # 'has_missing' would not be recognized.
    {'field': 'state', 'type': 'String', 'has missing': True},
    {'field': 'email', 'type': 'String', 'has missing': True},
    {'field': 'user_name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)
[7]:
deduper
[7]:
<dedupe.api.Dedupe at 0x12736ed30>
[8]:
customers.shape
[8]:
(2080, 8)

4. Creating training data

[9]:
deduper.prepare_training(customers.T.to_dict())
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, email), SimplePredicate: (wholeFieldPredicate, company))
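
The call above passes `customers.T.to_dict()`, which is a dictionary keyed by row index whose values are `{field: value}` dictionaries. The following is a minimal sketch of that shape built without pandas. The sample rows and the helper `to_records` are hypothetical, and the blank-to-`None` step rests on the assumption that Dedupe wants missing values as `None` rather than as empty strings.

```python
# Hypothetical sample rows standing in for the customers DataFrame.
rows = [
    {"name": "Monique Marty", "company": "Arnaud", "state": "EC", "email": "  "},
    {"name": "Monique Marty", "company": "Arnfud", "state": "EC",
     "email": "frederiquerichard@cohen.com"},
]


def to_records(rows):
    """Map a record id to a {field: value} dict, turning blank strings into None."""
    records = {}
    for record_id, row in enumerate(rows):
        cleaned = {}
        for field, value in row.items():
            # Strip surrounding whitespace from strings; leave other types alone.
            text = value.strip() if isinstance(value, str) else value
            # Assumption: missing values should be None, not "".
            cleaned[field] = None if text == "" else text
        records[record_id] = cleaned
    return records


records = to_records(rows)
print(records[0])
```

Note that `isnull()` in the data quality check does not count empty or whitespace-only strings, so a normalization step like this can be worth running first if your own data contains them.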

5. Active learning

When Dedupe finds a pair of records, it asks you whether the pair is a duplicate. Use the keys y, n, and u to label each pair, and press f when you are done.
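
To get a feel for what makes a pair look like a duplicate, you can compare the two records field by field. The sketch below uses `difflib.SequenceMatcher` from the standard library on a few values taken from the labeling session in this notebook. This is only an illustration: `field_similarity` is a hypothetical helper, and it is not the comparison function Dedupe uses internally.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Values copied from the first pair shown in the labeling session.
record_a = {
    "name": "Frédérique Lejeune-Daniel",
    "job": "Technicien chimiste",
    "company": "Schmitt",
    "email": "jchretien@costa.com",
}
record_b = {
    "name": "Frédérique Lejeune-Daniel",
    "job": "Tecce cse",
    "company": "Sctmitt",
    "email": "jchretien@costacom",
}

# Identical fields score 1.0; corrupted fields score somewhere below that.
for field in record_a:
    score = field_similarity(record_a[field], record_b[field])
    print(f"{field:8s} {score:.2f}")
```

A pair in which several fields are identical and the rest are close misspellings is a good candidate for y; a pair that only shares one common value is usually an n.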

[10]:
dedupe.console_label(deduper)
name : Frédérique Lejeune-Daniel
job : Technicien chimiste
company : Schmitt
street_address : chemin Denise Ferrand
city : Saint CharlotteVille
state : IE
email : jchretien@costa.com
user_name : joseph60

name : Frédérique Lejeune-Daniel
job : Tecce cse
company : Sctmitt
street_address : chemin Denise Ferrand
city : Saint ChalotteVille
state : IE
email : jchretien@costacom
user_name : joseph60

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
y
name : Monique Marty
job : Maoqiie
company : Arnfud
street_address : 70, rue de Carre
city : CheallierBour
state : EC
email : frederiquerichard@cohen.com
user_name : marquesseastie

name : Monique Marty
job : Maroquinier
company : Arnaud
street_address : 70, rue de Carre
city : ChevallierBourg
state : EC
email : frederiquerichard@cohen.com
user_name : marquessebastien

1/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
y
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, email), SimplePredicate: (wholeFieldPredicate, company))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, user_name), SimplePredicate: (wholeFieldPredicate, email))
name : Ing. Marian Heidrich MBA.
job : Civil engineer, consulting
company : Johann Heuser AG
street_address : Lilija-Ortmann-Straße 54
city : Husum
state : Hamburg
email : truebconcetta@googlemail.com
user_name : marie78

name : Ing. Marian Heidrich MBA.
job : Cii ngin, consuting
company : Johann Heuser AG
street_address : Lilija-Ortmann-Straße 54
city : Husu
state : Hamburg
email : truebcncetta@gglemail.cm
user_name : arie

2/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished / (p)revious
f
Finished labeling
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (twoGramFingerprint, email), SimplePredicate: (wholeFieldPredicate, street_address))
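
The `Final predicate set` lines in the log name blocking predicates such as `wholeFieldPredicate` on `company` or `sameSevenCharStartPredicate` on `user_name`. The idea behind blocking is to compare only records that share a key instead of comparing every record with every other one. The following is a simplified sketch of that idea, not Dedupe's implementation: the functions `whole_field`, `same_seven_char_start`, and `candidate_pairs` are hypothetical stand-ins, and the sketch applies one predicate at a time, whereas each logged line appears to combine two.

```python
from collections import defaultdict
from itertools import combinations


def whole_field(value):
    """Block key: the complete field value."""
    return value


def same_seven_char_start(value):
    """Block key: the first seven characters of the field value."""
    return value[:7]


def candidate_pairs(records, field, predicate):
    """Return pairs of record ids whose block key for `field` is identical."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[predicate(record[field])].append(record_id)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical mini data set using values from the labeling session.
records = {
    0: {"company": "Arnaud", "user_name": "marquessebastien"},
    1: {"company": "Arnfud", "user_name": "marquesseastie"},
    2: {"company": "Arnaud", "user_name": "joseph60"},
}

print(candidate_pairs(records, "company", whole_field))              # {(0, 2)}
print(candidate_pairs(records, "user_name", same_seven_char_start))  # {(0, 1)}
```

The misspelled company `Arnfud` shows why a single exact-match predicate is not enough: records 0 and 1 only end up in the same block when the key is taken from `user_name`.
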
[11]:
training_file = 'csv_example_training.json'

# Reuse previously saved labels if the file exists; otherwise start without any.
if os.path.exists(training_file):
    print('reading labeled examples from ', training_file)
    with open(training_file, 'rb') as f:
        deduper.prepare_training(customers.T.to_dict(), f)
else:
    deduper.prepare_training(customers.T.to_dict())
reading labeled examples from  csv_example_training.json
INFO:dedupe.api:reading training from file
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, email), SimplePredicate: (wholeFieldPredicate, company))
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (alphaNumericPredicate, email), SimplePredicate: (firstIntegerPredicate, street_address))
INFO:dedupe.training:Final predicate set:
INFO:dedupe.training:(SimplePredicate: (commonThreeTokens, street_address), SimplePredicate: (sameSevenCharStartPredicate, user_name))

When you are finished, save your training data:

[12]:
with open(training_file, 'w') as tf:
    deduper.write_training(tf)
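
Before reusing a saved training file, it can help to check how many labeled pairs it contains. The sketch below assumes the file is JSON with the top-level keys `match` and `distinct`, each holding a list of record pairs; verify this against your own file. The helper `count_labels` and the stand-in file are hypothetical, so that the example runs without a real labeling session.

```python
import json
import os
import tempfile


def count_labels(path):
    """Count labeled pairs per category in a JSON training file."""
    with open(path, encoding="utf-8") as f:
        labels = json.load(f)
    # Assumption: pairs are stored under the keys 'match' and 'distinct'.
    return {kind: len(labels.get(kind, [])) for kind in ("match", "distinct")}


# Stand-in training data with one pair labeled as a duplicate.
sample = {
    "match": [[{"name": "Monique Marty"}, {"name": "Monique Marty"}]],
    "distinct": [],
}

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_training.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    counts = count_labels(sample_path)

print(counts)  # {'match': 1, 'distinct': 0}
```

With your own data you would call `count_labels(training_file)` after the cell above has written the file.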

Also save your weights and predicates. If settings_file already exists, training and active learning are skipped on the next run:

[13]:
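settings_file = 'csv_example_learned_settings'
if os.path.exists(settings_file):
    # Load the learned weights and predicates instead of training again.
    print('reading from', settings_file)
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
else:
    # Fit the model on the labeled pairs first; the settings written
    # below are the result of this training step.
    deduper.train()
    with open(settings_file, 'wb') as sf:
        deduper.write_settings(sf)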