Checking for images that are no longer used
We sometimes remove images from documentation and there's a chance we'll forget to remove them from Git. So I'm looking into what it would take to audit images and find the ones we aren't using.
Finding all the images we are using
To get a list of all the MDX files under a directory:
find product_docs/docs/eprs/ -name '*.mdx'
Then we can parse the MDX files using Pandoc. I found an example of how to extract the code from Markdown and adjusted it to extract images. Here's extract_images.lua:
-- Print the source path of every image element Pandoc finds.
function Image(el)
  print(el.src)
end
And the command to run it:
pandoc --lua-filter extract_images.lua -o /dev/null [list_of_mdx_files]
Putting everything together:
find product_docs/docs/eprs/ -name '*.mdx' \
| xargs pandoc --lua-filter extract_images.lua -o /dev/null
But that lists an image once for every place it is referenced. So sort the output and drop the duplicates:
find product_docs/docs/eprs/ -name '*.mdx' \
| xargs pandoc --lua-filter extract_images.lua -o /dev/null \
| sort -u
The output is also made up of relative paths, each relative to the MDX file that references it, which is awkward:
$ find product_docs/docs/eprs/ -name '*.mdx' | xargs pandoc --lua-filter extract_images.lua -o /dev/null | sort -u | head
../../images/image100.png
../../images/image101.png
../../images/image102.png
../../images/image103.png
../../images/image104.png
../../images/image105.png
../../images/image106.png
../../images/image107.png
../../images/image108.png
../../images/image109.png
One approach would be to extract just the filename:
find product_docs/docs/eprs/ -name '*.mdx' \
| xargs pandoc --lua-filter extract_images.lua -o /dev/null \
| xargs -n 1 basename \
| sort -u
But that only works if no two images in different directories share a filename. It's probably better to adjust the Lua filter to output absolute paths.
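One way to get absolute paths without changing the Lua filter might be to run Pandoc on one MDX file at a time and resolve each reference against that file's directory. This is only a minimal sketch: the function name resolve_image_refs is hypothetical, it assumes GNU realpath (for the -m option, which accepts paths that don't exist), and it assumes remote http(s) images should simply be skipped.

```shell
# Hypothetical helper: read image references on stdin (one per line) and
# print each as a normalized absolute path, resolved against the directory
# of the MDX file given as $1.
resolve_image_refs() {
  doc_dir=$(dirname -- "$1")
  while IFS= read -r ref; do
    case "$ref" in
      http://*|https://*) continue ;;  # remote image, nothing to resolve
    esac
    # -m (GNU realpath) collapses ".." even if the target file is missing
    realpath -m -- "$doc_dir/$ref"
  done
}

# Usage (requires pandoc and extract_images.lua):
#   find product_docs/docs/eprs/ -name '*.mdx' | while IFS= read -r doc; do
#     pandoc --lua-filter extract_images.lua -o /dev/null "$doc" \
#       | resolve_image_refs "$doc"
#   done | sort -u
```

Running Pandoc once per file is slower than one xargs call, but it is the only point at which we know which file each reference came from.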
Finding all the image files in a directory
We can use find to list every file and file to keep only the ones that are actually images:
find product_docs/docs/eprs/ -type f \
| xargs file | grep -o -P '^.+: +\w+ image' | cut -d: -f1 \
| sort
This gives a list relative to the current working directory, but we can fix that by using the absolute path in the find command:
find "$(pwd)"/product_docs/docs/eprs/ -type f \
| xargs file | grep -o -P '^.+: +\w+ image' | cut -d: -f1 \
| sort
Or use basename to get just the filename, which matches what we did when listing the images we are using:
find product_docs/docs/eprs/ -type f \
| xargs file | grep -o -P '^.+: +\w+ image' | cut -d: -f1 \
| xargs -n 1 basename \
| sort
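If the images can be recognized by extension alone, a find-only variant might be enough. It is a different technique from the file-based detection above: it misses images with unusual extensions, but it doesn't break on filenames that contain spaces or colons. This is a sketch under that assumption; the function name list_images_by_extension and the extension list are hypothetical choices, not something taken from the repository.

```shell
# Hypothetical helper: list files under directory $1 whose extension looks
# like an image format (case-insensitive), sorted in byte order.
list_images_by_extension() {
  find "$1" -type f \( -iname '*.png' -o -iname '*.jpg' -o -iname '*.jpeg' \
    -o -iname '*.gif' -o -iname '*.svg' \) | LC_ALL=C sort
}

# Usage:
#   list_images_by_extension product_docs/docs/eprs/
```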
Find the differences
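If you redirect the output of the used-images command into one file (say, "used_images.txt") and the list of all images into another ("all_images.txt", perhaps), you can diff the two lists. Both lists need to be in the same form (for example, both basenames). Lines marked < appear only in all_images.txt, so they are images that aren't used. Lines marked > appear only in used_images.txt, so they are (possibly) images that are referenced but aren't in Git: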
diff all_images.txt used_images.txt | head
1,12d0
< aws_instance.png
< aws_instance.png
< edb_logo.png
< edb_logo.png
< efm_slot_old.png
< efm_slot_old.png
< efm_slot.png
< efm_slot.png
< google_security_settings.png
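It might make sense to do this one product version at a time. For instance, the output above is the start of the list of images that aren't used in the EPRS 6.2 documentation on the branch I happen to be using. Each name shows up twice, presumably because all_images.txt was sorted without -u and a second copy of each image exists in another directory.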
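As an alternative to reading diff output, comm can print each side of the comparison as a plain list. The sketch below is one possible way to wrap it, assuming both files hold one name per line in the same form. The function names unused_images and missing_images are hypothetical. Both inputs are re-sorted with sort -u in the C locale because comm expects sorted input.

```shell
# Hypothetical helpers. $1 = list of all image files, $2 = list of used images.

# Run comm with the column flags in $1 on sorted, de-duplicated copies
# of the two lists in $2 and $3.
compare_image_lists() {
  all_sorted=$(mktemp)
  used_sorted=$(mktemp)
  LC_ALL=C sort -u "$2" > "$all_sorted"
  LC_ALL=C sort -u "$3" > "$used_sorted"
  LC_ALL=C comm "$1" "$all_sorted" "$used_sorted"
  rm -f "$all_sorted" "$used_sorted"
}

# Images that exist on disk but are never referenced.
unused_images() { compare_image_lists -23 "$1" "$2"; }

# Images that are referenced but have no file on disk.
missing_images() { compare_image_lists -13 "$1" "$2"; }

# Usage:
#   unused_images all_images.txt used_images.txt
#   missing_images all_images.txt used_images.txt
```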
Removing unused images
I removed the images with:
diff all_images.txt used_images.txt | grep '<' | sed -e 's|< |product_docs/docs/eprs/6.2/images/|' | xargs git rm
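Before deleting anything, it may be worth previewing what would be removed. The following is a hedged sketch rather than the command I actually ran: to_repo_paths is a hypothetical helper that de-duplicates the names and prefixes them with a directory, and it assumes the directory name contains no | character (the sed delimiter). git rm --dry-run only reports what it would remove.

```shell
# Hypothetical helper: turn image names on stdin into de-duplicated
# repository paths under the directory given as $1.
to_repo_paths() {
  dir=${1%/}                         # drop a trailing slash, if any
  LC_ALL=C sort -u | sed "s|^|$dir/|"
}

# Usage: preview first, then rerun without --dry-run to delete.
#   diff all_images.txt used_images.txt | grep '^<' | sed 's|^< ||' \
#     | to_repo_paths product_docs/docs/eprs/6.2/images \
#     | xargs git rm --dry-run
```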
There are also copies of the images in another (unused) directory. So I removed them:
git rm -r product_docs/docs/eprs/6.2/images/media/
Possibly add this as a build step...
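A build step could be as small as a check that fails when the list of unused images is not empty. This is only a sketch of the idea: check_no_unused_images is a hypothetical name, and how the list is produced and where the check would run in the build are left open.

```shell
# Hypothetical build check: read unused image names on stdin. If there are
# any, report them on stderr and return non-zero so the build fails.
check_no_unused_images() {
  unused=$(cat)
  if [ -n "$unused" ]; then
    printf 'Unused images found:\n%s\n' "$unused" >&2
    return 1
  fi
  return 0
}

# Usage (assumes both lists are sorted and de-duplicated):
#   comm -23 all_images.txt used_images.txt | check_no_unused_images
```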