CATALOGED

contents inventoried · dedup done

items here now

—

aggregate size

What this gate means

An item is at CATALOGED once we know what's in it. For cloud exports this means metadata extraction (photos, emails, docs). For RAID + backups this means dedup: the redundant generations have been collapsed and we know the unique set. For old hard drives this means a forensic catalog: file-type histogram, suspicious filenames flagged, hidden directories surfaced. Nothing moves to its canonical home until this gate clears.

Sub-routines · operational playbook

Dedup · jdupes / rmlint

collapse redundant copies before sorting

jdupes is the modern fdupes — fast, recursive, supports hardlinking instead of deleting. rmlint goes further: handles partial-match scoring, supports BTRFS reflink dedup, generates a shell script you can review before executing. For the RAID+backups stack of duplicates, rmlint is the right tool.

jdupes -r /Volumes/RAID/intake/ > dupes.txt
rmlint --types=duplicates /Volumes/RAID/intake/ -o sh:rmlint.sh
# REVIEW rmlint.sh before running
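
Before acting on dupes.txt, it can help to know how much is in it. The following is a minimal sketch, assuming jdupes's default listing of one path per line with a blank line between duplicate sets. `summarize_dupes` is a hypothetical helper name, not part of jdupes.

```shell
# summarize_dupes FILE
# Counts duplicate sets and redundant copies in a jdupes-style listing.
# Assumption: one path per line, sets separated by a blank line.
summarize_dupes() {
  awk 'BEGIN { RS = ""; FS = "\n" }              # paragraph mode: one record per set
       { sets++; redundant += NF - 1 }           # every set keeps one copy
       END { printf "%d sets, %d redundant files\n", sets, redundant }' "$1"
}

# Usage:
#   summarize_dupes dupes.txt
```

The redundant-file count is an upper bound on what a delete or hardlink pass would touch; it says nothing about which copy to keep.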

when: RAID + backups · old HDs

Photo metadata · exiftool

exiftool -r → write CSV sidecar

Extract EXIF/IPTC/XMP into a sidecar CSV so the photos can be sorted by date, camera, GPS, etc. without re-reading every file later. Run with -r for recursive and -csv for tabular output.

exiftool -r -csv -DateTimeOriginal -GPSLatitude -GPSLongitude -Model \
  /Volumes//photos/ > photo-metadata.csv
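
Once the sidecar exists, date-based sorting can work from the CSV alone. The sketch below buckets rows by capture year. It assumes DateTimeOriginal is the only date tag in the CSV and that it uses exiftool's default `YYYY:MM:DD HH:MM:SS` form, so it matches that pattern on each row rather than splitting on commas, which would break on quoted fields. `photos_per_year` is a hypothetical helper name.

```shell
# photos_per_year CSV
# Buckets rows of an exiftool -csv sidecar by the year of DateTimeOriginal.
# Assumption: the first "YYYY:MM:DD " pattern on a row is the capture date.
# Rows without one are counted as "undated". The header row is skipped.
photos_per_year() {
  awk 'NR > 1 {
         if (match($0, /[0-9][0-9][0-9][0-9]:[0-9][0-9]:[0-9][0-9] /))
           year = substr($0, RSTART, 4)
         else
           year = "undated"
         count[year]++
       }
       END { for (y in count) printf "%s\t%d\n", y, count[y] }' "$1" | LC_ALL=C sort
}

# Usage:
#   photos_per_year photo-metadata.csv
```

A large "undated" bucket is a hint that those files need another date source, such as file modification time, before they can be sorted.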

when: Photo-heavy intakes (iCloud, Google Photos)

Audio dedup · fpcalc chromaprint

cluster near-duplicates by acoustic fingerprint

The same song from different rips will have different MD5s but the same, or a nearly identical, chromaprint. fpcalc (chromaprint) generates an acoustic fingerprint per file; cluster matching fingerprints to identify near-duplicates that filename + size dedup misses.

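# --tag prefixes each output line with its source path, so fingerprints stay attributable
find . -type f \( -name '*.mp3' -o -name '*.flac' -o -name '*.m4a' \) | \
  parallel --tag fpcalc {} > audio-fingerprints.txt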
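
The crudest form of clustering is grouping files whose fingerprints are exactly equal, which already catches the same audio stream saved with different tags. A minimal sketch, assuming a hypothetical tab-separated file with one `path<TAB>fingerprint` pair per line; producing that file from fpcalc's output is left out here. True near-duplicate matching needs a bitwise similarity comparison on `fpcalc -raw` output, which this does not attempt.

```shell
# group_by_fingerprint TSV
# Prints every group of paths that share an identical fingerprint string.
# Assumption: TSV has two tab-separated columns, path then fingerprint.
group_by_fingerprint() {
  awk -F'\t' '
    { count[$2]++; paths[$2] = paths[$2] "\n  " $1 }
    END {
      for (fp in count)
        if (count[fp] > 1)
          printf "%d copies:%s\n", count[fp], paths[fp]
    }' "$1"
}

# Usage (fingerprints.tsv is a hypothetical file name):
#   group_by_fingerprint fingerprints.tsv
```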

when: Music libraries — old iMac iTunes folders, Audrey audio archives

Forensic catalog · file-type histogram

for unknown-content drives, know what's there before you sort

Old drives may hold business docs, photos, source code, OS install media, or garbage. Build a histogram of file types and sizes per directory, and flag oversized binaries and hidden directories. fd + file plus a quick sort/uniq pipeline covers the type histogram.

# -H includes hidden files, which fd skips by default
fd . /Volumes// -t f -H | parallel file -b {} | sort | uniq -c | sort -rn
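
The pipeline above counts content types as reported by file. The per-directory extension and size histogram can be sketched with find and awk alone. This is a rough illustration: `ext_histogram` is a hypothetical helper name, and it assumes paths contain no tabs or newlines.

```shell
# ext_histogram ROOT
# Prints "directory<TAB>extension<TAB>files<TAB>bytes", largest byte totals first.
# Assumption: no tabs or newlines in paths.
ext_histogram() {
  find "$1" -type f -print | while IFS= read -r path; do
    dir=${path%/*}
    name=${path##*/}
    case "$name" in
      ?*.?*) ext=${name##*.} ;;   # non-leading dot with text after it
      *)     ext="(none)" ;;      # no extension, or a dotfile like .DS_Store
    esac
    size=$(wc -c < "$path")       # size in bytes; portable across GNU and BSD
    printf '%s\t%s\t%s\n' "$dir" "$ext" "$size"
  done | awk -F'\t' '
    { key = $1 "\t" tolower($2); files[key]++; bytes[key] += $3 }
    END { for (k in files) printf "%s\t%d\t%d\n", k, files[k], bytes[k] }
  ' | sort -t "$(printf '\t')" -k4,4nr
}

# Usage:
#   ext_histogram /path/to/mounted/drive > ext-histogram.tsv
```

Reading each file's size with wc is slow on large drives but avoids the differences between GNU and BSD stat; the top rows of the output are the natural place to look for oversized binaries.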

when: Every unlabeled HD before sort

Filename hygiene · detox / convmv

fix pre-UTF8 encoding, weird chars, case-collisions

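Drives from the old iMac era often have filenames in MacRoman or cp1252. macOS can read them, but rsync onto APFS will mangle them. Convert encodings first with convmv, then run detox to clean up stray :/? characters that break tar archives later. convmv only prints what it would rename unless --notest is given, so do a dry run first.

convmv -r -f cp1252 -t utf8 /Volumes//            # dry run: review the planned renames
convmv -r -f cp1252 -t utf8 --notest /Volumes//   # apply the renames
detox -r -v /Volumes//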
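
Case collisions are not handled by either tool. A minimal sketch for spotting them reads a path listing on stdin and prints any path that occurs more than once when case is ignored. `case_collisions` is a hypothetical helper name, and the lowercasing is only reliable for ASCII names.

```shell
# case_collisions
# Reads paths on stdin and prints, lowercased, every path that appears
# more than once when compared case-insensitively.
case_collisions() {
  tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Usage: list the source drive, then check the listing
#   find /path/to/mounted/drive -print | case_collisions
```

Any path printed here would lose one of its variants when copied onto a case-insensitive volume, so rename those files before the copy.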

when: Drives pre-2010 · imports from Windows systems

Items at this gate

No items currently at this gate.

Gate-exit checklist

Verify before moving items into SORTED:

Dedup pass run · before/after byte counts logged
Per-source metadata sidecar persisted (CSVs, fingerprint files)
Filename hygiene clean (no broken-encoding paths)
Catalog committed alongside the data (not just in memory)
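
The first checklist item can be sketched as a small logging helper run once before and once after the dedup pass. `log_bytes` and the log file name are hypothetical. It records disk usage in KiB from `du -sk` rather than exact byte counts, and du counts hardlinked copies once, so a hardlinking pass shows up as a drop.

```shell
# log_bytes LABEL DIR LOGFILE
# Appends "UTC timestamp<TAB>label<TAB>dir<TAB>KiB used" to LOGFILE.
log_bytes() {
  kib=$(du -sk "$2" | awk '{ print $1 }')
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$kib" >> "$3"
}

# Usage:
#   log_bytes before /Volumes/RAID/intake dedup-log.tsv
#   ... run the reviewed rmlint.sh ...
#   log_bytes after  /Volumes/RAID/intake dedup-log.tsv
```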

Gate C · CATALOGED · baked 2026-05-29 from migrations.yml + GATE_DETAIL