GATE C
CATALOGED
contents inventoried · dedup done
0
items here now
—
aggregate size
What this gate means
An item is at CATALOGED once we know what's in it. For cloud exports this means metadata extraction (photos, emails, docs). For RAID + backups this means dedup — the redundant generations have collapsed and we know the unique set. For old hard drives this means a forensic catalog: file-type histogram, suspicious filenames flagged, hidden directories surfaced. Nothing moves to canonical home until this gate clears.
Sub-routines · operational playbook
Dedup · jdupes / rmlint
collapse redundant copies before sorting
jdupes is the modern fdupes: fast, recursive, and able to hardlink duplicates instead of deleting them. rmlint goes further: it handles partial-match scoring, supports BTRFS reflink dedup, and generates a shell script you can review before executing. For the RAID + backups stack of duplicates, rmlint is the right tool.
jdupes -r /Volumes/RAID/intake/ > dupes.txt
rmlint --types=duplicates /Volumes/RAID/intake/ -o sh:rmlint.sh
# REVIEW rmlint.sh before running
when: RAID + backups · old HDs
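To gauge how much the dedup pass will reclaim before reviewing rmlint.sh, the jdupes report can be summarized. This is a minimal sketch, assuming the report uses jdupes' default layout (one path per line, duplicate sets separated by a blank line). The function name summarize_dupes is hypothetical.

```shell
# Summarize a jdupes report: number of duplicate sets and redundant copies.
# Assumption: sets are separated by blank lines, one path per line.
summarize_dupes() {
    awk 'BEGIN { RS = ""; FS = "\n" }        # paragraph mode: one record per set
         { sets++; extra += NF - 1 }         # every path beyond the first is redundant
         END { printf "%d duplicate sets, %d redundant files\n", sets, extra }' "$1"
}
```

Run it as summarize_dupes dupes.txt. The "redundant files" figure counts every copy beyond the first in each set, which is the number of files a delete or hardlink pass would touch.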
Photo metadata · exiftool
exiftool -r -csv → write CSV sidecar
Extract EXIF/IPTC/XMP into a sidecar CSV so the photos can be sorted by date, camera, GPS, etc. without re-reading every file later. Run with -r for recursive and -csv for tabular output.
exiftool -r -csv -DateTimeOriginal -GPSLatitude -GPSLongitude -Model \
/Volumes//photos/ > photo-metadata.csv
when: Photo-heavy intakes (iCloud, Google Photos)
Audio dedup · fpcalc chromaprint
cluster near-duplicates by acoustic fingerprint
The same song from different rips will have different MD5s but closely matching chromaprints. fpcalc (chromaprint) generates an acoustic fingerprint per file; cluster matching fingerprints to identify near-duplicates that filename + size dedup misses.
# --tag prefixes each output line with its file path, so fingerprints stay attributable
find . \( -name '*.mp3' -o -name '*.flac' -o -name '*.m4a' \) | \
parallel --tag fpcalc {} > audio-fingerprints.txt
when: Music libraries — old iMac iTunes folders, Audrey audio archives
Forensic catalog · file-type histogram
for unknown-content drives, know what's there before you sort
Old drives may hold business docs, photos, source code, OS install media, or garbage. Build a histogram of file extensions + sizes per directory, and flag oversized binaries and hidden directories. fd + file + a quick sort/uniq pipeline gives the file-type histogram.
# -H includes hidden files, -I ignores .gitignore rules; fd skips both by default
fd . /Volumes// -H -I -t f | parallel file -b {} | sort | uniq -c | sort -rn
when: Every unlabeled HD before sort
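The extension histogram and hidden-directory flags can also be sketched with only find, awk, and sort, for machines where fd is not installed. The function names ext_histogram and hidden_dirs are hypothetical, and file names containing newlines are not handled.

```shell
# Count files per lowercased extension under a root, most common first.
ext_histogram() {
    find "$1" -type f | awk -F/ '
        {
            name = $NF
            # A name like ".bashrc" or "README" counts as having no extension.
            if (match(name, /[^.]\.[^.]+$/)) ext = tolower(substr(name, RSTART + 2))
            else ext = "(none)"
            n[ext]++
        }
        END { for (e in n) printf "%d\t%s\n", n[e], e }' | sort -k1,1nr -k2,2
}

# List directories whose name starts with a dot (the root itself is excluded).
hidden_dirs() {
    find "$1" -mindepth 1 -type d -name '.*'
}
```

This counts by file name only. The file-based pipeline inspects content instead, so the two views complement each other when extensions are missing or misleading.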
Filename hygiene · detox / convmv
fix pre-UTF8 encoding, weird chars, case-collisions
Drives from the old iMac era have filenames in MacRoman or cp1252. macOS can read them, but rsync onto APFS will mangle them. Convert encodings up-front with convmv, then use detox to clean stray :/? characters that break tar archives later.
# convmv only previews renames unless --notest is given; review the dry run first
convmv -r -f cp1252 -t utf8 /Volumes//
convmv -r -f cp1252 -t utf8 --notest /Volumes//
detox -r -v /Volumes//
when: Drives pre-2010 · imports from Windows systems
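Case-collisions matter when the source volume is case-sensitive and the destination is not, as with a default APFS volume. A minimal sketch for spotting them, assuming a newline-separated path list on stdin (for example from find); the function name case_collisions is hypothetical, and awk's tolower may only fold ASCII letters depending on the locale.

```shell
# Read paths on stdin and print those that differ only by letter case.
case_collisions() {
    awk '
        {
            key = tolower($0)
            if (key in seen) {
                # Print the first path of a colliding pair only once.
                if (!(key in reported)) {
                    print seen[key]
                    reported[key] = 1
                }
                print $0
            } else {
                seen[key] = $0
            }
        }'
}
```

Piping a file listing of the drive through case_collisions gives the paths to rename by hand before the copy. An empty result means no two paths would merge on a case-insensitive volume.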
Items at this gate
No items currently at this gate.
Gate-exit checklist
Verify before moving items into SORTED:
- Dedup pass run · before/after byte counts logged
- Per-source metadata sidecar persisted (CSVs, fingerprint files)
- Filename hygiene clean (no broken-encoding paths)
- Catalog committed alongside the data (not just in memory)
Gate C · CATALOGED · baked 2026-05-29 from
migrations.yml + GATE_DETAIL