🐸

The Contact Formatter

The Data Tinkererpython-data-tinkerer-41-the-contact-formatter
Reward: 180 XP
|

The Contact Formatter

You have already cleaned a signup sheet, straightened out a config, and prepared a log summary. Now there is one last file in this pile that the next collaborator will probably use most directly: the contact list. The raw sheet still has extra spaces, duplicate emails, inconsistent tag names, and messy city information. If you hand it off like this, the next person will get stuck while trying to message people, group them, and follow up.

So this time you are not just returning another practice result. You are handing back a contact list that someone else can keep working from immediately. You will clean the fields, keep first-seen contacts by email, normalize tags and city groups with rules, and return both the cleaned list and the summary for the next organizer.

First, look at one smaller move: split and clean one contact row

Do not jump into the whole sheet yet. Start with one toy example: one contact row where the name, email, and tags all carry extra whitespace. That lets you rehearse the smaller cleanup moves first.

toy_row = "  Ava   Stone  , AVA@example.com , vip ; mentor ; vip "

name_text, email_text, tags_text = toy_row.split(",")
clean_name = " ".join(name_text.strip().split())
clean_email = email_text.strip().lower()
clean_tags = []

for raw_tag in tags_text.split(";"):
  tag = raw_tag.strip().lower()
  if tag != "" and tag not in clean_tags:
      clean_tags.append(tag)

print(clean_name)
print(clean_email)
print(clean_tags)

That example only demonstrates the moves: split, clean, and keep unique tags. It does not read today’s CSV, it does not use the JSON rules yet, and it does not build the final summary.

Today’s deliverable: read messy_contacts.csv and contact_rules.json, then hand back a contact list people can keep using

The script already reads messy_contacts.csv and contact_rules.json. Your job is to process the contact sheet row by row, clean the fields, keep first-seen contacts by email, normalize tags and city groups with the JSON rules, and place the result into formatted_contacts and contact_summary.

1
Turn the CSV text into real data rows first

Use raw_contacts_text.splitlines() to build contact_lines, then skip the header row and keep only the real contact rows in rows.

2
Clean name, email, tags, and city for each row

Use row.split(",") to unpack the 4 fields. The name can use " ".join(name_text.strip().split()) to collapse extra spaces. The email should use strip().lower(), and the city should be cleaned too.

3
Deduplicate by first-seen email and normalize with the JSON rules

If the cleaned email has already appeared, increase duplicate_emails_skipped and skip that row. Otherwise, keep going: split the tags with split(";"), clean them with strip().lower(), and map them through contact_rules["tag_aliases"]. For the city group, read from contact_rules["city_groups"].

4
Build the final structure and summary

Every kept contact should go into formatted_contacts with at least name, email, city, group, and tags. Then build contact_summary with at least the total row count, kept contact count, skipped duplicate-email count, group counts, and tag counts.

Hand the list to the next organizer

Once the names, emails, tags, and groups are stable, the next collaborator can keep contacting, filtering, and following up without asking you what the original sheet was supposed to mean.

Suggested Solution
Expand
Solution:
import json

CONTACTS_CSV = "messy_contacts.csv"
RULES_JSON = "contact_rules.json"

with open(CONTACTS_CSV, "r", encoding="utf-8") as file:
  raw_contacts_text = file.read().strip()

with open(RULES_JSON, "r", encoding="utf-8") as file:
  contact_rules = json.load(file)

print("Raw contacts text:")
print(raw_contacts_text)
print("Contact rules:", contact_rules)

contact_lines = raw_contacts_text.splitlines()
rows = contact_lines[1:]
formatted_contacts = []
seen_emails = set()
group_counts = {}
tag_counts = {}
duplicate_emails_skipped = 0

tag_aliases = contact_rules["tag_aliases"]
city_groups = contact_rules["city_groups"]

for row in rows:
  name_text, email_text, tags_text, city_text = row.split(",")

  clean_name = " ".join(name_text.strip().split())
  clean_email = email_text.strip().lower()
  clean_city = " ".join(city_text.strip().split())

  if clean_email in seen_emails:
      duplicate_emails_skipped += 1
      continue

  seen_emails.add(clean_email)
  clean_tags = []

  for raw_tag in tags_text.split(";"):
      clean_tag = raw_tag.strip().lower()
      if clean_tag == "":
          continue

      mapped_tag = tag_aliases.get(clean_tag, clean_tag)
      if mapped_tag not in clean_tags:
          clean_tags.append(mapped_tag)
          if mapped_tag not in tag_counts:
              tag_counts[mapped_tag] = 0
          tag_counts[mapped_tag] += 1

  group = city_groups.get(clean_city.lower(), "general")
  if group not in group_counts:
      group_counts[group] = 0
  group_counts[group] += 1

  formatted_contacts.append(
      {
          "name": clean_name,
          "email": clean_email,
          "city": clean_city,
          "group": group,
          "tags": clean_tags,
      }
  )

contact_summary = {
  "total_rows": len(rows),
  "formatted_count": len(formatted_contacts),
  "duplicate_emails_skipped": duplicate_emails_skipped,
  "group_counts": group_counts,
  "tag_counts": tag_counts,
}

print("Formatted contacts:", formatted_contacts)
print("Contact summary:", contact_summary)
Advanced Tips
Want more? Click to expand

The most important result in this lesson is that you can now connect reading records, cleaning fields, deduplicating, applying rules, and building structured output into one workflow another person can truly pick up.

When you can turn a messy contact sheet into a deliverable list, this whole set of data and text handling skills has become something you can actually carry into real work.

Loading...
Terminal
Terminal
Ready to run...