GATE C

CATALOGED

contents inventoried · dedup done
0 items here now
aggregate size

What this gate means

An item is at CATALOGED once we know what's in it. For cloud exports this means metadata extraction (photos, emails, docs). For RAID + backups this means dedup — the redundant generations have collapsed and we know the unique set. For old hard drives this means a forensic catalog: file-type histogram, suspicious filenames flagged, hidden directories surfaced. Nothing moves to canonical home until this gate clears.

Sub-routines · operational playbook

Dedup · jdupes / rmlint
collapse redundant copies before sorting
jdupes is the modern fdupes: fast, recursive, and able to hardlink duplicates instead of deleting them. rmlint goes further: it can rank which copy in each duplicate set is kept as the original, supports BTRFS reflink dedup, and generates a shell script you can review before executing. For the RAID + backups stack of duplicates, rmlint is the right tool.
jdupes -r /Volumes/RAID/intake/ > dupes.txt
rmlint --types=duplicates /Volumes/RAID/intake/ -o sh:rmlint.sh
# REVIEW rmlint.sh before running
when: RAID + backups · old HDs
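Before acting on dupes.txt it helps to know how much redundancy it describes. The sketch below assumes jdupes' default output layout (one path per line, duplicate sets separated by blank lines); dupe_summary is a hypothetical helper name, not part of jdupes.

```shell
# dupe_summary [FILE]: summarize jdupes-style output (blank-line-separated sets).
# Reads FILE if given, otherwise stdin. Assumes one path per line.
dupe_summary() {
  awk '
    # A blank line closes the current set; count it only if it held 2+ paths.
    $0 == "" { if (n > 1) { sets++; extra += n - 1 }; n = 0; next }
    { n++ }
    END {
      # Close a final set that was not followed by a blank line.
      if (n > 1) { sets++; extra += n - 1 }
      printf "%d duplicate sets, %d redundant copies\n", sets, extra
    }
  ' "$@"
}

# Usage: dupe_summary dupes.txt
```

"Redundant copies" counts every file beyond the first in each set, which is the number of files a dedup pass could remove or hardlink.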
Photo metadata · exiftool
exiftool -r → write CSV sidecar
Extract EXIF/IPTC/XMP into a sidecar CSV so the photos can be sorted by date, camera, GPS, etc. without re-reading every file later. Run with -r for recursive and -csv for tabular output.
exiftool -r -csv -DateTimeOriginal -GPSLatitude -GPSLongitude -Model \
  /Volumes//photos/ > photo-metadata.csv
when: Photo-heavy intakes (iCloud, Google Photos)
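Once the sidecar exists, a quick sanity check is to count photos per capture year straight from the CSV. This is a minimal sketch: photos_per_year is a hypothetical helper, and it assumes DateTimeOriginal appears in exiftool's default "YYYY:MM:DD HH:MM:SS" form. It searches each row for that pattern instead of splitting on commas, so quoted fields containing commas do not shift the columns.

```shell
# photos_per_year [FILE]: count rows per capture year in an exiftool -csv sidecar.
# Rows with no recognizable timestamp are counted as "undated".
photos_per_year() {
  awk '
    NR == 1 { next }   # skip the CSV header row
    {
      # First "YYYY:MM:DD HH:" match on the line is taken as the capture time.
      if (match($0, /[0-9][0-9][0-9][0-9]:[0-9][0-9]:[0-9][0-9] [0-9][0-9]:/))
        year = substr($0, RSTART, 4)
      else
        year = "undated"
      count[year]++
    }
    END { for (y in count) printf "%s\t%d\n", y, count[y] }
  ' "$@" | sort
}

# Usage: photos_per_year photo-metadata.csv
```

A large "undated" bucket usually means screenshots or re-saved images that will need filename or filesystem dates at the sorting stage.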
Audio dedup · fpcalc chromaprint
cluster near-duplicates by acoustic fingerprint
The same song from different rips will have different MD5s but matching or near-matching chromaprints. fpcalc (chromaprint) generates an acoustic fingerprint per file; cluster identical fingerprints first, then compare the remaining ones for similarity to catch near-duplicates that filename + size dedup misses.
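# --tag prefixes each output line with its file path; -print0 / -0 keep odd filenames intact
find . -type f \( -iname '*.mp3' -o -iname '*.flac' -o -iname '*.m4a' \) -print0 | \
  parallel -0 --tag fpcalc {} > audio-fingerprints.txt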
when: Music libraries — old iMac iTunes folders, Audrey audio archives
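A first clustering pass can be sketched as grouping files whose fingerprints are byte-for-byte identical. Both helper names below are hypothetical, and the sketch relies only on the FINGERPRINT= line that fpcalc prints. Exact matching will not catch rips whose fingerprints merely resemble each other; that needs a similarity comparison, which is out of scope here.

```shell
# fingerprint_table: read file paths on stdin, print "fingerprint<TAB>path".
# Files that fpcalc cannot decode are skipped silently.
fingerprint_table() {
  while IFS= read -r path; do
    fp=$(fpcalc "$path" 2>/dev/null </dev/null | sed -n 's/^FINGERPRINT=//p')
    if [ -n "$fp" ]; then
      printf '%s\t%s\n' "$fp" "$path"
    fi
  done
}

# exact_clusters: read that table on stdin, print "cluster N<TAB>path" for
# every fingerprint shared by two or more files.
exact_clusters() {
  awk -F '\t' '
    { n[$1]++; files[$1, n[$1]] = $2 }
    END {
      for (fp in n) if (n[fp] > 1) {
        g++
        for (i = 1; i <= n[fp]; i++) printf "cluster %d\t%s\n", g, files[fp, i]
      }
    }
  '
}

# Usage: find . -type f -name '*.mp3' | fingerprint_table | exact_clusters
```

Cluster numbers are arbitrary labels; only the grouping is meaningful.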
Forensic catalog · file-type histogram
for unknown-content drives, know what's there before you sort
Old drives may hold business docs, photos, source code, OS install media, or garbage. Build a histogram of file types (as reported by file) and sizes per directory, and flag oversized binaries and hidden directories. fd + file + a sort | uniq -c pipeline does it; pass -H and -I so fd does not skip hidden or gitignored files.
fd . /Volumes// -t f -H -I | parallel file -b {} | sort | uniq -c | sort -rn
when: Every unlabeled HD before sort
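The extension-and-size side of the histogram can be sketched with find and awk alone. Both function names are hypothetical, and the sketch assumes filenames contain no newlines or tabs. Extensions are lowercased so .JPG and .jpg land in the same bucket.

```shell
# ext_histogram DIR: print "count<TAB>total_bytes<TAB>extension", biggest first.
ext_histogram() {
  find "$1" -type f -print | while IFS= read -r path; do
    name=${path##*/}
    case "$name" in
      ?*.*) ext=${name##*.} ;;   # a dot that is not the first character
      *)    ext="(none)" ;;      # no extension, or a dotfile like .DS_Store
    esac
    size=$(wc -c < "$path")
    printf '%s\t%s\n' "$ext" "$((size))"
  done | awk -F '\t' '
    { e = tolower($1); count[e]++; bytes[e] += $2 }
    END { for (e in count) printf "%d\t%d\t%s\n", count[e], bytes[e], e }
  ' | sort -t "$(printf '\t')" -k2,2nr -k3,3
}

# hidden_dirs DIR: list directories below DIR whose name starts with a dot.
hidden_dirs() {
  find "$1" -mindepth 1 -type d -name '.*' -print
}

# Usage: ext_histogram /path/to/mounted/drive; hidden_dirs /path/to/mounted/drive
```

Sorting by total bytes puts disk images and oversized binaries at the top, which is where the flagging usually starts.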
Filename hygiene · detox / convmv
fix pre-UTF8 encoding, weird chars, case-collisions
Drives from the old iMac era have filenames in MacRoman or cp1252. macOS can read them but rsync onto APFS will mangle them. Convert encodings up-front with convmv and detox stray :/? characters that break tar archives later.
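convmv -r -f cp1252 -t utf8 /Volumes//            # dry run: lists the planned renames
convmv -r -f cp1252 -t utf8 --notest /Volumes//   # apply once the dry run looks right
detox -r -v /Volumes//                            # run last, so detox sees valid UTF-8 names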
when: Drives pre-2010 · imports from Windows systems
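Neither command above reports case collisions, so a separate check is useful before copying onto a case-insensitive volume. The sketch below uses a hypothetical helper, case_collisions, that reads paths on stdin and prints every path whose lowercased form is shared with another path. It assumes awk's tolower, which may only fold ASCII letters depending on the awk build and locale.

```shell
# case_collisions: read paths on stdin, print "lowercased_key<TAB>path" for
# every path that differs from another path only by letter case.
case_collisions() {
  awk '
    { key = tolower($0); n[key]++; paths[key, n[key]] = $0 }
    END {
      for (k in n) if (n[k] > 1)
        for (i = 1; i <= n[k]; i++) printf "%s\t%s\n", k, paths[k, i]
    }
  ' | sort
}

# Usage: find /path/to/mounted/drive -print | case_collisions
```

Sorting on the lowercased key keeps each colliding group on adjacent lines, so the output can be reviewed by eye before renaming anything.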

Items at this gate

No items currently at this gate.

Gate-exit checklist

Verify before moving items into SORTED:

Gate C · CATALOGED · baked 2026-05-29 from migrations.yml + GATE_DETAIL