Data compression is used everywhere. Mp3, mp4, rar, zip, jpg and png files (along with many others) all use compressed data. Without data compression a 3 minute song would be over 100Mb and a 10 minute video would easily be over 1Gb. Data compression condenses large files into much smaller ones. It does this by getting rid of data that isn’t needed while retaining the information in the file.
Does that mean information is different to data? Yes. Lets take an example: I ask Bob who won the Arsenal game. He then launches into a 30 minute monologue about the match, detailing every pass, throw-in, tackle etc. Right at the end he tells me Arsenal won. I only wanted to know who won, so all the data Bob gave me about the game was useless. He could have compressed the data he sent into two words, ‘Arsenal won’, because that’s all the information I needed. Data compression works on the same principle.
There are two types of compression: Lossless and Lossy. As the names suggest, lossy compression loses data in the compression process while lossless compression keeps all the data. Both are interesting and easy to understand, and I’ll explain them here. First, though, you need to understand a bit about encoding and binary.
Encoding and Binary
When you want to send a message you need to translate it into language a computer will understand. A computer uses only 0’s and 1’s. This is called binary. The process of putting a message like ‘hello’ into 0’s and 1’s is called encoding.
Binary is a way of representing any number using only 0’s and 1’s. It is called a ‘base 2’ number system. In order to understand it, we need to take a step back and think about how we usually represent numbers.
We use a ‘base 10’ number system in everyday life. It’s called ‘base 10’ because we use 10 numbers: 0,1,2,3,4,5,6,7,8,9. When you want to represent a number higher than 9 you need to add an extra digit. The extra digit represents how many 10’s there are. So the number 27 means you have 2 lots of ten (represented by the first digit), and 7 lots of one (represented by the second digit). If I want to represent a number higher than 99 I need to add another digit. The third digit represents the amount of 100’s there are in the number. A fourth digit represents how many 1000’s there are and so on. You can break a number down and see what each digit represents, so for instance the number 586732 can be broken down like so:
Of course we don’t usually do this because it’s common sense, but bear with me. Each extra digit we use in ‘base 10’ is always 10 times the previous digit. This is not a coincidence. If you choose to use 10 numbers (0…9), then each new digit must always be 10 times larger than the previous digit.
A ‘base 7′ number system, for example, uses only 7 numbers 0,1,2,3,4,5,6. The number 8 is meaningless in base 7. This means that if I want to represent a number higher than 6 I need to use a new digit. This time the new digit represents how many 7’s I am using. So for instance the number 8 would be written ’11’ in base 7. The first digit says I am using 1 seven, and the second digit says i’m using 1 one. The third digit in base 7 will be 7 times larger than the previous digit, so it would represent 49 (7×7). A fourth digit would represent 7×49 = 343. To figure out what a number is in base 7 it is now convenient to break it down, So for the number 5423:
There are five 343’s, four 49’s, two 7’s and three 1’s. This means that the number 5423 in base 7 is equal to (5×343)+(4×49)+(2×7)+(3×1)=1928.
For engineering reasons computers talk in ‘base 2’, this means that there are only 2 numbers being used, 0 and 1. Continuing the logic of the previous examples, each new digit is always 2 times larger than the previous one. The first digit represents 1, the second digit represents 2, the third digit represents 4, the fourth 8, the fifth 16 and so on. As before, we can make a table to translate a binary number into a recognisable number. For instance the binary number 100110 can be written like this:
Figuring out a number in binary is super easy, you just add up the values of the digits that have a ‘1’ beneath them. So 100110 is 32+4+2=38. Remember: you can represent any number in binary. Try it for yourself, think of a number between 0 and 63 and figure out the binary representation of it using the table above.
A couple of things to note:
- When we’re talking about computers and binary we call a digit a bit. So a binary number that is 10 digits long is 10 bits long. Data compression is about using as few bits as possible
- If there are a large amount of numbers to represent, I need to include all the bits (or digits) to accommodate for all the numbers, even if i’m not using them.
Let me explain the second point. If I want to send you a number between 0-63, I need to send you 6 bits of data, even if the number i’m going to send you is 4, which is 000100. I’m not using the 3 leftmost bits, but I still need to include them. You can think of it in normal numbers. If I have 9999 different possible numbers I could send you, I can’t send you ‘9’, I have to send ‘0009’. The reason for this is because it makes it easier for computers. But keep it in mind, it’ll come back later.
Back to encoding. As we’ve seen computers talk in binary, but binary is just a way to represent numbers. Lets say I want to encode the alphabet into binary so a computer can understand it. The alphabet is 26 letters long, so I need enough bits to count to 26. With 4 bits the highest number is ‘1111’ which is 8+4+2+1=15, so we can’t use 4 bits. With 5 bits (16 representing the far left bit) the highest number is ‘11111’ which is 16+8+4+2+1= 31. As 26 is less than 31 it means we can represent the alphabet with 5 bits. We can then encode the alphabet into binary in a common sense way: a=1, b=2, c=3 etc. In binary this would translate to a = 00001, b = 00010, c = 00011, d = 00100, e = 00101 etc.
Great, now we’ve got encoding and binary covered we can get down to the really interesting part: how to compress data. With lossless compression we don’t actually get rid of any data (hence the name), instead it is based on clever ways to encode the data. With lossy compression we do get rid of data, which is why we need to differentiate data from information.
What’s the purpose of compression? How can we measure if it is successful? Well the purpose is to make a file, message, or any chunk of data smaller. If we had a 100Mb file and could condense it down to 50Mb we have compressed it. Furthermore we can say it has a compression ratio of 2, because it is half the size of the original file. If we compressed the 100Mb file to 10Mb it would have a compression ratio of 10 because the new file is a 10th the size of the original. A higher compression ratio means better compression.
As I said at the end of the last section, lossless compression involves finding the smartest way to encode data. The best way to explain this is through an example:
Imagine that instead of binary, computers could only talk in 1’s (rather than 0′ and 1’s). So 1=1, 2=11, 3=111, 4=1111, 5=11111 etc. Also imagine that our alphabet only had 8 letters, from ‘a’ to ‘h’.
So I want to send a message to you (consisting of letters from a-h) and I encode it (into 1’s). It makes sense to encode it as: a=1, b=11, c=111, d=1111….h=11111111 (the equivalent of a=1, b=2 etc.). There’s no reason to encode it any different way, so lets do it like that.
Once I’ve written my message down, I count the amount of times I use each letter and come up with a chart like this:
This is called a frequency distribution; it shows how frequently we used each letter. I used ‘a’ once, and ‘a’ only uses one bit of data as it’s encoded as ‘1’, this means all the ‘a’s use 1 bit of data. I used ‘b’ fifteen times, and ‘b’ is two bits of data ’11’, so all the ‘b’s use 15×2=30 bits of data. Calculate this for the rest of the letters and we come up with 1+30+12+12+60+60+49+64=288 bits of data.
Can you see a way how we can cut down the amount of data we transmit? We need to change how we encode the alphabet. As ‘b’ is used most in this message, why not encode ‘b’ with the lowest amount of data, ‘1’? ‘e’ was the next most common letter, so let’s encode that as ’11’. ‘f’ was next so that’s ‘111’ and ‘a’ will be the last letter because it was used least, encoded as ‘11111111’. If we calculate how much data we transmit now, the calculation will be:
That’s a reduction of 99 bits, or over a third of the data! We’re still sending the same message but we’ve just encoded the data in an efficient way. That’s how lossless compression works. It finds the best way to encode data given the amount each letter, or ‘symbol’, is used. Pretty simple right? It also uses shortcuts, for instance if there is a string of 300 1’s, instead of sending each bit of data it will just say ‘300 1’s’ in an encoded way. That’s all there really is to lossless compression. Efficient encoding.
Imagine we happened to use each letter exactly the same amount of times, would there be an efficient way to encode the data? The graph would look like this:
No letter is used more than another one, so we can encode ‘a’ as ‘1’ or ‘11111111’ and it wouldn’t make a difference to the amount of data we transmitted. There is no way to encode this in an efficient way. The shape of this graph, a straight line, is therefore the most efficient shape of the frequency distribution (in terms of information). When the frequency distribution is anything other than this shape (which is almost all the time) it is said to have a redundancy of data. Whenever there is a redundancy in data, lossless compression can compress the data. Therefore lossless compression exploits data redundancy.
In lossy compression we get rid of data to make our message more efficient. As I said previously, lossy compression tries to get rid of as much data as possible while still retaining as much information as it can. How do we decide what is information and what is fine to discard? Again it is much easier to understand with an example. This time we’ll look at how to compress a song.
Audio is made up of frequencies. These are measured in Hertz. If something has a high frequency we hear it as a high pitch, if something has a low frequency we hear it as a low pitch. If you cut a song at a moment in time you can make a graph of the frequencies and how loud they are. So let’s say I look at the frequency graph while only a bassline is playing, the graph would look like this:
See how there’s stuff going on at the lower end? That’s the bass. If I looked at the frequency graph while there was only a high voice it would look like this:
These are two extreme examples. Usually if I were to pause a song and get the frequency graph at that moment in time it would look something like this.
Note that a second later in the song the graph would look different, because frequencies change throughout the song (the bass may come in, the hit hat might stop). So where is the information that I need to keep here? Well think about what we want from a song, or any piece of audio, we want to hear it! We can therefore say that the information that we want to keep is the frequencies that we can hear during the song. So for instance on the graph above, we can draw a line where the volume is too low for humans to hear:
Anything below the green line is too quiet to hear, so we can regard that as data to discard during compression. We get any frequency that is below the green line and make it equal to zero, as it’s not important:
This will sound exactly the same as the original graph because we’re only getting rid of frequencies we can’t hear. However there is now less data for us to encode. This means that we have compressed the data!
Of course a second later one of the frequencies that we made zero may well become audible, in which case we will not be able to discard it. So when we compress a song the process actually goes like this:
1. decide the threshold volume below which the frequency is inaudible
2. look at the frequency graph throughout the whole song
3. get all the frequencies that never reach above the threshold volume and set them to zero
Unless you have a crazy piece of music, there will always be frequencies that you are able to discard. Being able to discard frequencies has two benefits in terms of compression:
1. you don’t have to send data about these frequencies. Instead of saying ‘at 1s, 50Hz = 0.0001, at 1.5s 50Hz = 0.0002′ you can just say ’50Hz = 0’. If there is less data to send, you are compressing the file!
2. you are reducing the amount of frequencies you are using, and therefore you have to use less bits to represent your data. This is important. If I had 5,000 frequencies that I was representing before compression (human hearing is approx 0-3.4KHz) and after compressing I only need to represent 2,000 frequencies, I can reduce the amount of bits I need to represent my data. Think back to when I was explaining binary. To encode 5,000 frequencies in binary I need 13 bits. If I wanted to represent 1Hz here it would be represented as ‘0000000000001’. Now to represent 2,000 frequencies I only need 8 bits of data. If I wanted to represent 1Hz here it would be ‘00000001’. As each number needs 8 bits rather than 13, we have less data to send, hence we can send over a third less data by choosing to represent less numbers! Pretty smart right?
Lossy vs Lossless
When choosing between lossy and lossless compression there is, like in all engineering, a trade off. Lossless compression usually has a smaller compression ratio, but the benefit is that you are not losing any data when you send your message. Lossy compression can achieve much higher compression ratios, but data is lost.
So if I wanted to send you a complicated computer program I have written I would choose lossless compression. The reason would be that all the data in the file (every line of code) is needed. There is no point in sending the file if I have to scrap data because it simply won’t work. Therefore even though I would have to send a bigger file, it’s worth using lossless compression because of the type of file i’m sending.
However if I had created a VoIP or Skype-like software, in which live video calls were taking place, I would use lossy compression to send the video data. The reason is that if the video is not in ‘real time’, or takes a long time to load, it is annoying and people won’t use it. I don’t mind if the picture is slightly pixelated as long as I am receiving it on time. If there is less data to send it will take less time to arrive at the other person’s computer. Hence lossy compression is perfect in this scenario. I am willing to get rid of data (quality of the picture) in order to heavily reduce the amount of data I am sending.