The Duplicate Scroll

The Data Tinkerer
Reward: 180 XP

Hoppy reaches the second gate of the archivist gauntlet and finds an annoying bundle of copied records: the same batch of scrolls has been logged more than once, as if each entry left a shadow behind it. The archivist quietly warns him that the real danger is not that the data looks messy — it is that repeated rows quietly distort the counts that come after them.

So this lesson is not about learning one more new container. It is about making steadier decisions: what should remember “have I seen this already,” what should keep the first-seen order, and what should hold the final counts. At this point in Chapter 6, the work should feel more like a real small task than a one-button drill.

Ask first: what do you actually want to preserve?

A common mistake in duplicate-heavy tasks is not the loop itself; it is forcing everything into the same shape. In this kind of job, different structures keep different things: a set is great for answering “have I seen this name already?”, a list is good for keeping the first-seen order, and a dict is good for gathering counts like “wing -> total”.

sample_titles = ["Moon Thread", "Ember Ink", "Moon Thread", "Fern Seal"]

seen_titles = set()
ordered_unique = []

for title in sample_titles:
    if title not in seen_titles:
        seen_titles.add(title)
        ordered_unique.append(title)

print(ordered_unique)

This tiny example only shows one idea: the set is not there to display the order. It is there to remember “already seen.” The thing that actually keeps the first appearance order is the ordered_unique list. In the starter, you will add a dict count on top of that, so all three structures work together.
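
To preview the dict’s role, here is the same tiny example with a count layered on top. The `sample_titles` data and the `title_counts` name are just an illustration, not the starter’s real data:

```python
sample_titles = ["Moon Thread", "Ember Ink", "Moon Thread", "Fern Seal"]

seen_titles = set()      # remembers "already seen"
ordered_unique = []      # keeps first-seen order
title_counts = {}        # gathers "title -> how many times it appeared"

for title in sample_titles:
    title_counts[title] = title_counts.get(title, 0) + 1
    if title not in seen_titles:
        seen_titles.add(title)
        ordered_unique.append(title)

print(ordered_unique)  # ['Moon Thread', 'Ember Ink', 'Fern Seal']
print(title_counts)    # {'Moon Thread': 2, 'Ember Ink': 1, 'Fern Seal': 1}
```

Notice that each structure is updated in the same loop, but none of them steps on another’s job.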

Today’s task: turn the duplicate scroll into one clean archive result

The starter already reads duplicate_scroll.txt and shelf_map.json for you. First complete clean_line(raw_line) and build_record(cleaned_line), then organize the steps: build cleaned_lines and all_records, use seen_scrolls and unique_records to remove the shadows, and finally gather counts with wing_counts and archive_summary.

1
Clean each scroll row back into a readable shape first

Every row carries the same noise: the "## " prefix, names squeezed together with "~", and a trailing "??". Put that cleanup sequence inside clean_line(raw_line) first.
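
As a sketch, assuming a raw row shaped like the sample below (the real rows in duplicate_scroll.txt may differ), the cleanup chain could look like this:

```python
def clean_line(raw_line):
    # Peel off each kind of noise in turn: the "## " prefix,
    # the "~" squeezed between words, and the trailing "??".
    return raw_line.strip().replace("## ", "").replace("~", " ").replace("??", "")

# Hypothetical raw row, for illustration only.
raw = "## name=Moon~Thread | shelf=A1 | status=ready??"
print(clean_line(raw))  # name=Moon Thread | shelf=A1 | status=ready
```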

2
Turn each cleaned text row into a full record

Inside build_record(cleaned_line), split out scroll_name, shelf_code, and status, then use shelf_map[shelf_code] to add wing_name and keeper_name.
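
A minimal sketch of that enrichment step, with a made-up shelf map standing in for the real shelf_map.json (the shelf codes and keeper names here are assumptions):

```python
# Hypothetical shelf map; the real shelf_map.json may use other codes.
shelf_map = {"A1": {"wing_name": "North Wing", "keeper_name": "Fernwick"}}

def build_record(cleaned_line):
    parts = cleaned_line.split(" | ")
    scroll_name = parts[0].split("=")[1]
    shelf_code = parts[1].split("=")[1]
    status = parts[2].split("=")[1]
    shelf_record = shelf_map[shelf_code]  # enrich via the lookup table
    return {
        "scroll_name": scroll_name,
        "shelf_code": shelf_code,
        "status": status,
        "wing_name": shelf_record["wing_name"],
        "keeper_name": shelf_record["keeper_name"],
    }

record = build_record("name=Moon Thread | shelf=A1 | status=ready")
print(record["wing_name"], "/", record["keeper_name"])
```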

3
Let each structure do the job it fits best

Use a set to remember which scroll names have already appeared, a list to keep the full first-seen records in order, and a dict to count how many unique scrolls remain in each archive wing. The point here is not fancy code. The point is clear structure roles.
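
The three roles side by side, using toy records standing in for all_records (the real ones come from build_record):

```python
# Toy records for illustration only.
all_records = [
    {"scroll_name": "Moon Thread", "wing_name": "North Wing"},
    {"scroll_name": "Ember Ink", "wing_name": "East Wing"},
    {"scroll_name": "Moon Thread", "wing_name": "North Wing"},
]

seen_scrolls = set()   # role: fast "already seen?" membership test
unique_records = []    # role: keep full records in first-seen order
for record in all_records:
    if record["scroll_name"] not in seen_scrolls:
        seen_scrolls.add(record["scroll_name"])
        unique_records.append(record)

wing_counts = {}       # role: per-wing totals over unique scrolls only
for record in unique_records:
    wing_counts[record["wing_name"]] = wing_counts.get(record["wing_name"], 0) + 1

print(len(unique_records), wing_counts)  # 2 {'North Wing': 1, 'East Wing': 1}
```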

4
Gather the result into one small but complete summary

Finish by building archive_summary so it tells you at least: the raw row count, the unique scroll count, how many duplicate rows were removed, how many unique scrolls are still ready, and the wing_counts. That is a very typical medium-sized cleanup task ending.
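
A sketch of that final assembly, with toy lists in place of the real ones from the earlier steps:

```python
# Toy inputs for illustration; in the starter these come from the real lists.
all_records = [{"status": "ready"}, {"status": "ready"}, {"status": "lost"}]
unique_records = [{"status": "ready"}, {"status": "lost"}]
wing_counts = {"North Wing": 1, "East Wing": 1}

ready_unique_count = 0
for record in unique_records:
    if record["status"] == "ready":
        ready_unique_count += 1

archive_summary = {
    "raw_row_count": len(all_records),
    "unique_scroll_count": len(unique_records),
    "duplicate_row_count": len(all_records) - len(unique_records),
    "ready_unique_count": ready_unique_count,
    "wing_counts": wing_counts,
}
print(archive_summary)
```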

This is not an algorithms lesson

We are not chasing the flashiest deduplication trick, and we are not turning this into a hard optimization puzzle. What we really want to practice is whether you can place set, list, and dict in their most useful roles inside one slightly fuller task.

Suggested Solution
import json

# Read the duplicated scroll log and the shelf lookup table.
with open("duplicate_scroll.txt", "r", encoding="utf-8") as file:
    scroll_text = file.read().strip()

with open("shelf_map.json", "r", encoding="utf-8") as file:
    shelf_map = json.load(file)

print("Duplicate scroll text:")
print(scroll_text)
print("Shelf map:", shelf_map)

scroll_lines = scroll_text.splitlines()
print("Scroll lines:", scroll_lines)


def clean_line(raw_line):
    # Strip the "## " prefix, restore the spaces hidden behind "~",
    # and drop the trailing "??".
    return raw_line.strip().replace("## ", "").replace("~", " ").replace("??", "")


def build_record(cleaned_line):
    parts = cleaned_line.split(" | ")
    scroll_name = parts[0].split("=")[1]
    shelf_code = parts[1].split("=")[1]
    status = parts[2].split("=")[1]
    shelf_record = shelf_map[shelf_code]

    return {
        "scroll_name": scroll_name,
        "shelf_code": shelf_code,
        "status": status,
        "wing_name": shelf_record["wing_name"],
        "keeper_name": shelf_record["keeper_name"],
    }


cleaned_lines = []
for raw_line in scroll_lines:
    cleaned_lines.append(clean_line(raw_line))

all_records = []
for cleaned_line in cleaned_lines:
    all_records.append(build_record(cleaned_line))

# The set remembers "already seen"; the list keeps first-seen order.
seen_scrolls = set()
unique_records = []
for record in all_records:
    scroll_name = record["scroll_name"]
    if scroll_name not in seen_scrolls:
        seen_scrolls.add(scroll_name)
        unique_records.append(record)

# The dict gathers per-wing counts over unique records only.
wing_counts = {}
for record in unique_records:
    wing_name = record["wing_name"]
    if wing_name not in wing_counts:
        wing_counts[wing_name] = 0
    wing_counts[wing_name] += 1

ready_unique_count = 0
for record in unique_records:
    if record["status"] == "ready":
        ready_unique_count += 1

archive_summary = {
    "raw_row_count": len(all_records),
    "unique_scroll_count": len(unique_records),
    "duplicate_row_count": len(all_records) - len(unique_records),
    "ready_unique_count": ready_unique_count,
    "wing_counts": wing_counts,
}

print("Cleaned lines:", cleaned_lines)
print("All records:", all_records)
print("Seen scrolls:", seen_scrolls)
print("Unique records:", unique_records)
print("Wing counts:", wing_counts)
print("Archive summary:", archive_summary)
Advanced Tips

The most useful thing to carry forward from this lesson is not the word “deduplication” by itself. It is the split of responsibilities: set checks whether you have seen something, list preserves the order you want to keep, and dict holds the counts. Once the structure choice is right, the steps become much steadier.
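
As a side note (not required by the starter), the standard library offers shortcuts once those roles are clear: dicts preserve insertion order, so dict.fromkeys gives an ordered dedupe in one line, and collections.Counter handles the counting:

```python
from collections import Counter

names = ["Moon Thread", "Ember Ink", "Moon Thread", "Fern Seal"]

# dicts keep insertion order, so this is a one-line first-seen dedupe.
ordered_unique = list(dict.fromkeys(names))
counts = Counter(names)

print(ordered_unique)         # ['Moon Thread', 'Ember Ink', 'Fern Seal']
print(counts["Moon Thread"])  # 2
```

The explicit loop version is still worth writing first; these shortcuts only become safe once you know which structure owns which job.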

The next lesson will check these abilities in a different way: instead of building the flow from scratch, you will repair a broken script. That is where good structure choice starts to matter even more, because it helps you read and fix someone else’s small data workflow.
