Testing, testing… and decoding

Digital Curation week one, and our mission was to break a lot of files and decode some clues. Could we do it? Tessa, Matt, Musa and Lucy had a go…

But not like this…!

Testing file resilience

We were given a set of clean files in various formats, and the idea was to use a wee program called ShotGun to try and damage or corrupt the files and see which ones stood up best to that kind of treatment.

Which is exactly what Tessa did. Here are her notes and results, nicely recorded and tabulated.

All corrupt file and shoot file operations were undertaken at medium level (sliders halfway). In the future the experiment should ideally be conducted at various levels (low and high), but time was a limiting factor.

The table below displays the results:

Format | Shoot File | Corrupt File
------ | ---------- | ------------
GIF | All black, some lines at top | All black, some lines at top
JPG | Opens but not legible (in blocks) | Does not open
PNG | Opens but all black | Opens but all black
TIF | Does not open | Does not open
HTML | Opens but blank (does not load) | Opens but blank
DOC | Does not open | Does not open
PDF | Does not open | Does not open
RTF | Does not open (not enough space) | Opens but only displays one page
DOCX | Does not open | Does not open
WAV | Damaged – plays but hardly any sound | Does not open
MID | Does not open | Does not open

And her conclusions were that, in general:

  1. The shoot file operation is less damaging than the corrupt file operation.
  2. The music files and the document files are more easily damaged.
  3. The image files are more durable, in that they still open, even if the file cannot be easily viewed (pixels missing). Nevertheless, GIF is more durable than TIF (this may be due to the smaller file size of the GIF).

Not testing file resilience

It was great that Tessa did this because I did not!

First of all I wanted to know what the shoot and corrupt sliders were doing. But that wasn’t totally clear because apparently there’s no manual (yes, I like to read the manual before using a new toy).

Then I wanted to set up a range of criteria for assessing and rating the degrees of change. Was a visual or audio assessment sufficient, or would the files’ properties reveal additional information?

So I started ‘shooting’ and ‘corrupting’ the Word doc with the sliders at 0, both at about halfway, and both at 100% for each operation. I made a note of the indications of data loss presented by the ShotGun program.

Then I looked at the files. All were un-openable except the one shot at 0%. So I did a visual comparison with the clean version and found that the only change seemed to be that one image had corners and an edge missing.

But was there hidden damage? Had metadata been changed or lost, and would the file properties show any differences? A comparison of the properties seemed to indicate not: both were 379KB and had 11 pages. Even one of the un-openable files (shot at 50%) still seemed to have a size of 379KB, though some of the others were down to 349KB and 286KB.
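One way to make that comparison a bit more rigorous would be to fingerprint the files: a checksum picks up any byte-level change even when the visible properties (size, page count) look identical. A minimal sketch in Python, with clean.doc and shot.doc as hypothetical stand-ins for the clean and damaged copies:

    import hashlib
    import os

    def fingerprint(path):
        """Return a file's size in bytes and an MD5 checksum of its contents."""
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        return os.path.getsize(path), digest

    clean_size, clean_hash = fingerprint("clean.doc")   # hypothetical file names
    shot_size, shot_hash = fingerprint("shot.doc")

    print("Same size?    ", clean_size == shot_size)
    print("Same contents?", clean_hash == shot_hash)    # any changed byte changes the checksum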

I then tried the same process with the jpeg, creating a shot version, a 50% shot version and a corrupt version to compare with the clean one. These processes definitely had an impact on how the image looked, with it becoming more and more degraded until the image could no longer be seen, yet the files continued to be openable. There were changes in some of the information, such as the number of unique colours and the time it took to open.

Yet other information seemed to remain the same, such as the number of pixels and size. So the pixels were still all there but just in a different order. A bit like the poor pig-lizard of Galaxy Quest…
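Out of curiosity, the same comparison could be done in code rather than by eye. A rough sketch using the Pillow imaging library (my assumption, it wasn’t part of the exercise) to pull out the properties I was checking manually:

    from PIL import Image   # Pillow library, assumed to be installed

    def image_stats(path):
        """Report an image's dimensions and how many unique colours it contains."""
        img = Image.open(path).convert("RGB")
        return {
            "dimensions": img.size,                     # stayed the same in my test
            "unique_colours": len(set(img.getdata())),  # changed once the data was scrambled
        }

    # Hypothetical file names for the clean, shot and corrupted copies
    for name in ("clean.jpg", "shot50.jpg", "corrupt.jpg"):
        print(name, image_stats(name))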

The point of this was that I was trying to figure out if there might be ways to run automated assessment of files to identify ones that are corrupted. If you have half a million Word files maybe you could run an automated process to extract all the ones that are un-openable so at least you know how many are damaged and which they are. But if jpeg images are still openable even when corrupted, would you have to manually view half a million digital images to spot the mangled ones…? Or is it possible to set some parameters related to the file metadata to try and identify and pull out potential duds for a closer look?
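I don’t have the answer, but a crude version of that automated sweep might look like the sketch below: run a cheap integrity check on every file and flag the ones that fail. It assumes a hypothetical folder name, and relies on the facts that a .docx is really a zip container and that Pillow’s verify() checks an image’s structure without fully decoding it:

    import zipfile
    from pathlib import Path
    from PIL import Image

    def looks_damaged(path):
        """Very rough triage: True if the file fails a cheap integrity check."""
        try:
            if path.suffix.lower() == ".docx":
                return not zipfile.is_zipfile(path)     # broken zip structure = likely dud
            if path.suffix.lower() in (".jpg", ".jpeg", ".png", ".gif", ".tif"):
                with Image.open(path) as img:
                    img.verify()                        # structural check, no full decode
                return False
            return False                                # formats this sketch doesn't triage
        except Exception:
            return True

    folder = Path("files_to_check")                     # hypothetical folder name
    duds = [p for p in folder.iterdir() if p.is_file() and looks_damaged(p)]
    print(len(duds), "possible duds:", duds)

A check like that would catch the files that refuse to open at all, but as the jpeg test showed, a scrambled image can still pass a structural check, so something content-based (like the unique-colour count above) would still be needed to spot those.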

Obviously I didn’t have the answer, but I’m hoping we’ll find it later in the course!

Being code detectives

The second part of the task was not just an excuse to look out a Cumberbatch GIF but a way to help us understand that computer code is a way of representing something.

We discovered the clues to some questions by converting the code into text from binary and hexadecimal, and into a jpeg from Base 64, using online code translators.

The answers were:

  1. 1968
  2. Toy Story
  3. Latin Alphabet upper case and lower case letters, Arabic numerals 0 to 9, and the symbols + and /

You can guess the questions from them!

So we got to the answers, but I’m still not entirely sure I understand what binary, hexadecimal, Base 64 and ASCII really are and how they differ from each other. Likewise, I know ‘there are only 10 types of people in the world: those who understand binary, and those who don’t’ is a joke, but I’m not entirely sure why! Hoping by the end of this course I’ll be better able to decode it…
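For my own future reference: binary, hexadecimal and Base 64 are just different ways of writing out the same underlying bytes, using alphabets of 2, 16 and 64 symbols respectively, while ASCII is the rule that maps those bytes to letters and digits in the first place. A little sketch, using one of the answers above purely as example text:

    import base64

    data = "1968".encode("ascii")                   # ASCII turns the characters into bytes

    print(" ".join(f"{b:08b}" for b in data))       # binary: 2 symbols (0 and 1)
    print(data.hex())                               # hexadecimal: 16 symbols (0-9 and a-f)
    print(base64.b64encode(data).decode("ascii"))   # Base 64: A-Z, a-z, 0-9, + and /

    # And the joke: "10" read as binary is the number two,
    # so "10 types of people" really means "two types of people".
    print(int("10", 2))                             # -> 2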
