🐸

The Archivist's Trial

Reward: 180 XP

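Hoppy has finally reached the deepest part of the archive. The scrolls in the tower, the records behind the cabinet doors, and the scattered noisy text all come together here for one last real trial. The archivist pushes a stack of relic records across the desk: the rows carry stray marks, contain duplicate entries, and lack the hall information that has to be filled in from the vault index. Only when you bring the whole processing pipeline to a steady finish will the archive doors open for you.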

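So this lesson does not add another new trick. It connects the core moves from the whole Series into one flow: clean the text, split out the fields, read JSON, choose a suitable structure, keep the first occurrence of each record, compute the counts, and then give one clear final conclusion. By the end of the lesson you should be able to say with confidence: I really can use all of this.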

First, a smaller move: split a row apart, then find the clue that really matters

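In the final trial you will keep making the same kind of judgment: which field is this piece of text? After cleaning, which piece should the next check use? The small toy example below lets you practice that first.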

row = "relic=moon key | seal=amber spark"
parts = row.split(" | ")
seal_mark = parts[1].split("=")[1]

if seal_mark.find("amber") != -1:
  print("amber found")

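This does not solve today's full task. It only demonstrates one key move: split a row apart first, then read more information out of one of its fields. The real starter also has to clean the noisy text, fill in hall information from the JSON data, remove duplicates, compute the counts, and make the final access decision.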
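The same move can be stretched across every field in a row. The sketch below is only an illustration on the toy row: it splits each `key=value` part once and collects the results in a dict. The name `fields` is a hypothetical helper variable, not something the starter defines.

```python
row = "relic=moon key | seal=amber spark"

# Split the row into "key=value" parts, then split each part once on "=".
fields = {}
for part in row.split(" | "):
    key, value = part.split("=", 1)
    fields[key] = value

print(fields)  # {'relic': 'moon key', 'seal': 'amber spark'}

# The `in` operator is an equivalent way to write the find(...) != -1 check.
if "amber" in fields["seal"]:
    print("amber found")
```

Passing `1` as the second argument to `split` keeps a value intact even if it happens to contain another "=".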

Today's task: complete the whole archive trial and deliver the final access decision

The starter has already read archivist_trial.txt and vault_index.json for you. Complete the script so that it runs one full data pipeline:

1
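First, clean the noisy trial rows

Complete clean_line(raw_line). This is the cleaning routine you already know: remove the "## " prefix, turn each "~" back into a space, and strip the trailing "??". Once this step is solid, the fields that follow become readable.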
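As a rough illustration of this cleaning step, the sketch below applies the three replacements to one made-up noisy row. The sample row and its field keys (`relic`, `vault`, `status`, `seal`) are assumptions for illustration; the real rows in archivist_trial.txt may look different.

```python
def clean_line(raw_line):
    # Trim surrounding whitespace, drop the "## " marker,
    # turn "~" back into spaces, and remove the "??" tail.
    return raw_line.strip().replace("## ", "").replace("~", " ").replace("??", "")


# Hypothetical noisy row, invented for this example.
sample = "## relic=moon~key | vault=V1 | status=ready | seal=amber~spark??\n"
cleaned = clean_line(sample)
print(cleaned)
# relic=moon key | vault=V1 | status=ready | seal=amber spark
```

Note that replace removes the markers wherever they occur in the line, which is fine as long as "## " and "??" never appear inside a field value.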

2
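Turn each row into a structured record and fill in the details from the vault index

In build_record(cleaned_line), split the row into relic_name, vault_code, status, and seal_mark, then use vault_index[vault_code] to fill in hall_name and keeper_name.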
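To make the lookup concrete, here is a minimal sketch of this step on a single cleaned row. The vault entry, the hall and keeper names, and the field keys in the row are all invented for illustration; the real values come from vault_index.json and archivist_trial.txt.

```python
# Hypothetical vault index with one entry (the real one is loaded from JSON).
vault_index = {
    "V1": {"hall_name": "Moon Hall", "keeper_name": "Ilya"},
}


def build_record(cleaned_line):
    # Each part looks like "key=value"; keep only the value.
    parts = cleaned_line.split(" | ")
    values = [part.split("=")[1] for part in parts]
    relic_name, vault_code, status, seal_mark = values

    # Look up the vault entry to fill in the hall and keeper.
    vault_record = vault_index[vault_code]
    return {
        "relic_name": relic_name,
        "vault_code": vault_code,
        "status": status,
        "seal_mark": seal_mark,
        "hall_name": vault_record["hall_name"],
        "keeper_name": vault_record["keeper_name"],
    }


record = build_record("relic=moon key | vault=V1 | status=ready | seal=amber spark")
print(record["hall_name"], "/", record["keeper_name"])  # Moon Hall / Ilya
```

A vault_code that is missing from vault_index would raise a KeyError here, so this sketch assumes every row refers to a known vault.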

3
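Keep only the truly unique relics and compile the statistics

Use the seen_relics set to remember whether a relic_name has already appeared, and append each first occurrence, in order, to unique_records. Then use hall_counts to count the unique records in each hall, and build final trial clues such as amber_ready_relics.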
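The sketch below walks through this step on four made-up records, one of which is a duplicate. The record values are assumptions for illustration, and `dict.get` is used here as one possible way to count; an explicit "if the key is missing, start at 0" check works just as well.

```python
# Hypothetical records, already built; only the fields needed here.
all_records = [
    {"relic_name": "moon key", "hall_name": "Moon Hall", "status": "ready", "seal_mark": "amber spark"},
    {"relic_name": "sun dial", "hall_name": "Sun Hall", "status": "sealed", "seal_mark": "amber ring"},
    {"relic_name": "moon key", "hall_name": "Moon Hall", "status": "ready", "seal_mark": "amber spark"},
    {"relic_name": "tide map", "hall_name": "Moon Hall", "status": "ready", "seal_mark": "jade leaf"},
]

# Keep the first occurrence of each relic_name, preserving order.
seen_relics = set()
unique_records = []
for record in all_records:
    if record["relic_name"] not in seen_relics:
        seen_relics.add(record["relic_name"])
        unique_records.append(record)

# Count unique records per hall.
hall_counts = {}
for record in unique_records:
    hall_name = record["hall_name"]
    hall_counts[hall_name] = hall_counts.get(hall_name, 0) + 1

# Collect relics that are ready and carry an amber seal.
amber_ready_relics = [
    record["relic_name"]
    for record in unique_records
    if record["status"] == "ready" and "amber" in record["seal_mark"]
]

print([r["relic_name"] for r in unique_records])  # ['moon key', 'sun dial', 'tide map']
print(hall_counts)  # {'Moon Hall': 2, 'Sun Hall': 1}
print(amber_ready_relics)  # ['moon key']
```

The set answers "have I seen this name?" quickly, while the list keeps the order of first appearance, which a set alone does not guarantee.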

4
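Bring everything together into the final summary and access decision

The most important part of wrapping up this lesson is not printing many intermediate results. It is condensing them into two clear results: trial_summary and access_decision. The first describes what happened in the trial; the second gives a clear access verdict.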
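As a small illustration of this wrap-up, the sketch below folds a few made-up intermediate results into the two final dicts. The input values and the pass thresholds are assumptions chosen for this toy data only; the thresholds for the real trial are the ones in the lesson's own script.

```python
# Hypothetical intermediate results, standing in for the real pipeline output.
raw_row_count = 4
unique_relics = ["moon key", "sun dial", "tide map"]
amber_ready_relics = ["moon key"]

trial_summary = {
    "raw_row_count": raw_row_count,
    "unique_relic_count": len(unique_relics),
    "duplicate_row_count": raw_row_count - len(unique_relics),
    "amber_ready_relics": amber_ready_relics,
}

# Assumed thresholds for this toy data only.
trial_passed = (
    trial_summary["unique_relic_count"] == 3
    and len(trial_summary["amber_ready_relics"]) >= 1
)

access_decision = {
    "verdict": "pass" if trial_passed else "retry",
    "final_message": "The archive opens." if trial_passed else "The archive asks for another pass.",
}

print(trial_summary["duplicate_row_count"])  # 1
print(access_decision["verdict"])  # pass
```

Keeping the boolean `trial_passed` separate from the dict means the verdict and the message are always derived from the same single check.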

This is a capstone mastery lesson

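No new concepts are introduced here, and the goal is not to push you into an open-ended project. Your job is to connect the moves you have already learned so that this archive script completes one trustworthy, clear, deliverable final trial.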

Reference answer:
import json

with open("archivist_trial.txt", "r", encoding="utf-8") as file:
  trial_text = file.read().strip()

with open("vault_index.json", "r", encoding="utf-8") as file:
  vault_index = json.load(file)

print("Trial text:")
print(trial_text)
print("Vault index:", vault_index)

trial_lines = trial_text.splitlines()
print("Trial lines:", trial_lines)


def clean_line(raw_line):
  return raw_line.strip().replace("## ", "").replace("~", " ").replace("??", "")


def build_record(cleaned_line):
  parts = cleaned_line.split(" | ")
  relic_name = parts[0].split("=")[1]
  vault_code = parts[1].split("=")[1]
  status = parts[2].split("=")[1]
  seal_mark = parts[3].split("=")[1]
  vault_record = vault_index[vault_code]

  return {
    "relic_name": relic_name,
    "vault_code": vault_code,
    "status": status,
    "seal_mark": seal_mark,
    "hall_name": vault_record["hall_name"],
    "keeper_name": vault_record["keeper_name"],
  }


cleaned_lines = []
for raw_line in trial_lines:
  cleaned_lines.append(clean_line(raw_line))

all_records = []
for cleaned_line in cleaned_lines:
  all_records.append(build_record(cleaned_line))

seen_relics = set()
unique_records = []
for record in all_records:
  relic_name = record["relic_name"]
  if relic_name not in seen_relics:
    seen_relics.add(relic_name)
    unique_records.append(record)

hall_counts = {}
for record in unique_records:
  hall_name = record["hall_name"]
  if hall_name not in hall_counts:
    hall_counts[hall_name] = 0
  hall_counts[hall_name] += 1

amber_ready_relics = []
for record in unique_records:
  if record["status"] == "ready" and record["seal_mark"].find("amber") != -1:
    amber_ready_relics.append(record["relic_name"])

keeper_names = []
for record in unique_records:
  keeper_name = record["keeper_name"]
  if keeper_name not in keeper_names:
    keeper_names.append(keeper_name)

trial_summary = {
  "raw_row_count": len(all_records),
  "unique_relic_count": len(unique_records),
  "duplicate_row_count": len(all_records) - len(unique_records),
  "ready_unique_count": len([record for record in unique_records if record["status"] == "ready"]),
  "amber_ready_relics": amber_ready_relics,
  "hall_counts": hall_counts,
}

trial_passed = (
  trial_summary["unique_relic_count"] == 5
  and trial_summary["ready_unique_count"] >= 4
  and len(trial_summary["amber_ready_relics"]) >= 3
  and len(trial_summary["hall_counts"]) == len(vault_index)
)

access_decision = {
  "verdict": "pass" if trial_passed else "retry",
  "keeper_roll_call": ", ".join(keeper_names),
  "final_message": "The archive opens." if trial_passed else "The archive asks for another pass.",
}

print("Cleaned lines:", cleaned_lines)
print("All records:", all_records)
print("Seen relics:", seen_relics)
print("Unique records:", unique_records)
print("Hall counts:", hall_counts)
print("Amber ready relics:", amber_ready_relics)
print("Keeper names:", keeper_names)
print("Trial summary:", trial_summary)
print("Access decision:", access_decision)

Advanced tips

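If you can work through this lesson steadily, you take away more than a few methods. You take away a data-processing path that actually gets work done: start from dirty text, then clean, split, enrich, deduplicate, count, decide, and hand over a clear result.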

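Chapter 6 is now complete. The next chapter takes this skill out of Hoppy's world and into real-world tasks; but before you step out, you have already finished the last main-line trial in the archive.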
