Having been in the social sciences for a couple of weeks, it seems that a large amount of quantitative analysis relies on Principal Component Analysis (PCA). It's usually mentioned in tandem with eigenvalues, eigenvectors and lots of numbers. So what's going on? Is this just mathematical jargon to get the non-maths scholars to stop asking questions? Maybe, but it's also a useful tool to have when you look at data. This post gives a very broad overview of PCA, describing eigenvectors and eigenvalues (which you need to know about to understand it) and showing how you can reduce the dimensions of data using PCA. As I said, it's a neat tool, and even though the maths behind it is a bit complicated, you only need a broad idea of what's going on to use it effectively.
There's quite a bit of stuff to process in this post, but I've got rid of as much maths as possible and put in lots of pictures.
What is Principal Component Analysis?
First of all Principal Component Analysis is a good name. It does what it says on the tin. PCA finds the principal components of data.
It is often useful to measure data in terms of its principal components rather than on a normal x-y axis. So what are principal components then? They're the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This is easiest to explain by way of example. Here are some triangles in the shape of an oval:
Imagine that the triangles are points of data. To find the direction where there is most variance, find the straight line where the data is most spread out when projected onto it. A vertical straight line with the points projected on to it will look like this:
The data isn’t very spread out here, therefore it doesn’t have a large variance. It is probably not the principal component.
A horizontal line, with the points projected onto it, will look like this:
On this line the data is way more spread out, it has a large variance. In fact there isn’t a straight line you can draw that has a larger variance than a horizontal one. A horizontal line is therefore the principal component in this example.
Luckily we can use maths to find the principal component rather than drawing lines and unevenly shaped triangles. This is where eigenvectors and eigenvalues come in.
Eigenvectors and Eigenvalues
When we get a set of data points, like the triangles above, we can deconstruct the set into eigenvectors and eigenvalues. Eigenvectors and values exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction; in the example above the eigenvector was the direction of the line (vertical, horizontal, 45 degrees etc.). An eigenvalue is a number telling you how much variance there is in the data in that direction; in the example above the eigenvalue is a number telling us how spread out the data is on the line. The eigenvector with the highest eigenvalue is therefore the principal component.
Okay, so even though in the last example I could point my line in any direction, it turns out there are not many eigenvectors/values in a data set. In fact, the number of eigenvector/value pairs equals the number of dimensions the data set has. Say I'm measuring age and hours on the internet: there are 2 variables, it's a 2-dimensional data set, therefore there are 2 eigenvectors/values. If I'm measuring age, hours on the internet and hours on a mobile phone, there are 3 variables, a 3-D data set, so 3 eigenvectors/values. The reason for this is that eigenvectors put the data into a new set of dimensions, and the number of new dimensions always equals the number of original dimensions. This sounds complicated, but again an example should make it clear.
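To make this concrete, here's a short Python sketch of finding the eigenvectors/values of a 2-D data set. The numbers (ages and hours online) are invented purely for illustration; numpy does the eigen-decomposition of the covariance matrix:

```python
import numpy as np

# A hypothetical 2-D data set: age and hours on the internet.
# The numbers are made up purely for illustration.
rng = np.random.default_rng(0)
age = rng.normal(40, 12, 500)
hours = 0.5 * age + rng.normal(0, 4, 500)   # correlated with age
data = np.column_stack([age, hours])

# Centre the data, then take the covariance matrix and its eigen-decomposition.
centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# Two variables, so two eigenvector/eigenvalue pairs.
# The eigenvector with the largest eigenvalue is the principal component.
principal = eigenvectors[:, np.argmax(eigenvalues)]
print(eigenvalues)   # one value is much larger: the spread along the oval's long axis
print(principal)     # the direction of that spread
```

Two variables in, two eigenvalue/eigenvector pairs out, and the larger eigenvalue points along the long axis of the oval.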
Here’s a graph with the oval:
At the moment the oval is on an x-y axis. x could be age and y hours on the internet. These are the two dimensions that my data set is currently being measured in. Now remember that the principal component of the oval was a line splitting it longways:
It turns out the other eigenvector (remember there are only two of them as it's a 2-D problem) is perpendicular to the principal component. As we said, the eigenvectors have to be able to span the whole x-y area; to do this (most effectively), the two directions need to be orthogonal (i.e. at 90 degrees) to one another. This is why the x and y axes are orthogonal to each other in the first place. It would be really awkward if the y axis were at 45 degrees to the x axis. So the second eigenvector would look like this:
The eigenvectors have given us a much more useful axis to frame the data in. We can now re-frame the data in these new dimensions. It would look like this:
Note that nothing has been done to the data itself. We’re just looking at it from a different angle. So getting the eigenvectors gets you from one set of axes to another. These axes are much more intuitive to the shape of the data now. These directions are where there is most variation, and that is where there is more information (think about this the reverse way round. If there was no variation in the data [e.g. everything was equal to 1] there would be no information, it’s a very boring statistic – in this scenario the eigenvalue for that dimension would equal zero, because there is no variation).
But what do these eigenvectors represent in real life? The old axes were well defined (age and hours on internet, or any 2 things that you’ve explicitly measured), whereas the new ones are not. This is where you need to think. There is often a good reason why these axes represent the data better, but maths won’t tell you why, that’s for you to work out.
How do PCA and eigenvectors help in the actual analysis of data? Well there are quite a few uses, but a main one is dimension reduction.
PCA can be used to reduce the dimensions of a data set. Dimension reduction is analogous to being philosophically reductionist: it reduces the data down to its basic components, stripping away any unnecessary parts.
Let's say you are measuring three things: age, hours on the internet and hours on a mobile. There are 3 variables so it is a 3D data set. 3 dimensions make an x, y and z graph, measuring width, depth and height (like the dimensions in the real world). Now imagine that the data forms into an oval like the ones above, but that this oval sits on a plane, i.e. all the data points lie on a piece of paper within this 3D graph (having width and depth, but no height). Like this:
When we find the 3 eigenvectors/values of the data set (remember: 3D problem = 3 eigenvectors), 2 of the eigenvectors will have large eigenvalues, and one will have an eigenvalue of zero. The first two eigenvectors will show the width and depth of the data, but because there is no height to the data (it is on a piece of paper) the third eigenvalue will be zero. On the picture below ev1 is the first eigenvector (the one with the biggest eigenvalue, the principal component), ev2 is the second eigenvector (which has a non-zero eigenvalue) and ev3 is the third eigenvector, which has an eigenvalue of zero.
We can now rearrange our axes to be along the eigenvectors, rather than age, hours on internet and hours on mobile. However we know that the ev3, the third eigenvector, is pretty useless. Therefore instead of representing the data in 3 dimensions, we can get rid of the useless direction and only represent it in 2 dimensions, like before:
This is dimension reduction. We have reduced the problem from a 3D to a 2D problem, getting rid of a dimension. Reducing dimensions helps to simplify the data and makes it easier to visualise.
Note that we can reduce dimensions even if there isn't a zero eigenvalue. Imagine we did the example again, except instead of the oval lying flat on a 2D plane, it had a tiny amount of height to it. There would still be 3 eigenvectors, but this time none of the eigenvalues would be zero. The values would be something like 10, 8 and 0.1. The eigenvectors corresponding to 10 and 8 are the dimensions where there is a lot of information; the eigenvector corresponding to 0.1 carries hardly any information at all, so we can discard it to make the data set simpler.
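Here's a rough Python sketch of that: three invented variables constructed so the points almost lie on a plane. One eigenvalue comes out tiny, and dropping its eigenvector takes us from 3-D to 2-D:

```python
import numpy as np

rng = np.random.default_rng(1)
# 3 made-up variables: age, hours on the internet, hours on a mobile.
# The third is almost a combination of the first two, so the points
# lie very close to a flat plane in 3-D (a tiny amount of "height").
x = rng.normal(0, 3, 400)
y = rng.normal(0, 2, 400)
z = 0.5 * x + 0.3 * y + rng.normal(0, 0.05, 400)
data = np.column_stack([x, y, z])

centred = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigenvalues)[::-1]                  # biggest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print(eigenvalues)          # two sizeable values and one near zero

# Keep the two eigenvectors with large eigenvalues: 3-D reduced to 2-D.
reduced = centred @ eigenvectors[:, :2]
print(reduced.shape)        # (400, 2)
```

Nothing about the points themselves changes; we just stop recording the direction that carries almost no variance.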
Example: the OxIS 2013 report
The OxIS 2013 report asked around 2000 people a set of questions about their internet use. It then identified 4 principal components in the data. This is an example of dimension reduction. Let's say they asked each person 50 questions. There are therefore 50 variables, making it a 50-dimension data set. There will then be 50 eigenvectors/values that come out of that data set. Let's say the eigenvalues of that data set were (in descending order): 50, 29, 17, 10, 2, 1, 1, 0.4, 0.2….. There are lots of eigenvalues, but only 4 have big values, indicating that along those four directions there is a lot of information. These are then identified as the four principal components of the data set (which in the report were labelled as enjoyable escape, instrumental efficiency, social facilitator and problem generator), and the data set can then be reduced from 50 dimensions to only 4 by ignoring all the eigenvectors that have insignificant eigenvalues. 4 dimensions is much easier to work with than 50! So dimension reduction using PCA helped simplify this data set by finding the dominant dimensions within it.
As part of a job application last month I wrote a draft blog post about DNA legislation. I thought I would put it here for posterity’s sake:
Last month the United States Supreme Court made a ruling that was in direct opposition to the European Court of Human Rights. The ruling, brought by the case ‘Maryland vs King’, was in regard to the collection of DNA of those in custody. It held that ‘taking and analyzing a cheek swab of the arrestee’s DNA is, like fingerprinting and photographing, a legitimate police booking procedure that is reasonable under the Fourth Amendment’. In contrast, the case ‘S and Marper v United Kingdom’, brought to the European Court of Human Rights in 2008, ruled that DNA collection of those in custody was in direct breach of Article 8 of the European Convention on Human Rights, which guarantees ‘the right to respect for his private and family life, his home and his correspondence’. In the ruling the
European Court said Article 8 ‘would be unacceptably weakened if the use of modern scientific techniques in the criminal justice system were allowed at any cost and without carefully balancing the potential benefits of the extensive use of such techniques against important private-life interests’. This disagreement between the courts highlights the ethical ambiguities that have arisen from the widespread adoption of DNA databases in the last 15 years. Where should we draw the line between the state’s duty to maintain law and order and the individual’s right to privacy?
DNA can be immensely helpful when combatting crime. Due to the unique genetic fingerprint of each individual, DNA testing is used for identification. If an individual leaves their DNA at the scene of a crime (for instance from a strand of hair), the police can sequence this information and compare it to an existing database. If the individual’s DNA is already recorded on the database, the match provides a very high level of proof that they were at the scene of the crime. This means that if the police had a sample of every citizen’s DNA in a database they would be much more effective at solving crimes. All they would have to do is arrive at the crime scene, take DNA samples and question everyone who was a match. It is therefore hardly surprising that the home secretary Jacqui Smith was ‘disappointed by the European court of human rights‘ decision’ because ‘DNA and fingerprinting is vital to the fight against crime’. However what Smith failed to acknowledge is that there is a trade-off between fighting crime and privacy. Having a collection of every citizen’s DNA would help police investigations, but it would be a massive invasion of privacy. Even the most lax interpretation of Article 8 would not be able to justify the state taking a mandatory DNA sample of every citizen, so there is a legal boundary to draw.
Most countries that have a DNA database agree that a convict is obliged to provide a sample of his DNA to the state. Countries differ on which convicts are required to provide a sample (for instance in Sweden convicts with a jail term longer than 4 years are required to provide a sample, whereas in Russia people who are convicted of ‘serious crimes’ are added to the database), but there is a consensus that once a person is convicted of a crime, their right to keep their DNA private is waived. The reasoning is that convicted criminals are more likely to reoffend, so having a database of convicts’ DNA means that if they were to reoffend in the future, the police would find it much easier to match their DNA to the crime scene. Therefore after the criminal has served their time, their DNA will still be on the database. Having convicts’ details on databases is not new: an obvious example is the sex offenders register. There is a strong argument that it is in the public good that these databases exist, especially for crimes that have a high rate of reoffending (for instance rape and voluntary manslaughter).
We have seen how the state can justify taking convicts’ DNA but not that of innocent civilians. What about people who are charged with a crime but not convicted? This is where the European Court and the Supreme Court disagree. In the context of those charged, it becomes a matter of whether the court regards DNA as identity or property.
DNA’s status as an identifier is immediately obvious. As it is highly discriminatory it can be used to distinguish people with unparalleled precision. When an individual is charged with a crime, the police have a right to take their identity and store it. Currently this is in the form of photographs and fingerprints. One of the reasons for doing this is to help the prosecutor build a case. They can use photographs to get witnesses and if the fingerprints match those of the crime scene, there is additional evidence against the defendant. DNA can be viewed in the same way. If one takes the DNA of the defendant and matches it with that of the crime scene, the prosecution has a stronger case. South Africa, Australia and the United States view DNA in this way. As the Supreme Court said, it is ‘a legitimate police booking procedure that is reasonable under the Fourth Amendment’.
There is however another way of viewing DNA. While it is a powerful identifier, storing a mass amount of it means that the state is capable of using an individual’s DNA for purposes other than aiding the investigation at hand. As most countries store DNA samples from unsolved crimes, it is possible to cross-reference an individual’s DNA with all previous crimes to see if there is any match. Therefore when the police take a defendant’s DNA, they can not only use it to further their current case: they can also easily discover any other charges which may be brought against the accused. With modern database technology this is both straightforward and effectively instantaneous – a very significant change from manually checking fingerprints against paper records. While this has potential benefits, it is open to abuse (for instance by individual police officers or corrupt states).
Treating DNA as property is a way of protecting the rights of the individual against such abuses. If the police suspect someone of storing a large amount of Class A drugs in their house, they can apply for a warrant from a judge to search the property. If during the search a dead body is discovered, the police have every right to file charges against that individual for murder. Although Article 8 and the Fourth Amendment protect the individual from unnecessary police surveillance, the police must have the ability to pursue evidence of crimes they come across. DNA cross-referencing can be viewed in the same way. If the police have reasonable suspicion that an individual has committed a crime in the past they are allowed to run that person’s DNA through their database. If they find DNA evidence linking the accused to a different crime then they may open a new investigation, but a judicial warrant was required to search the database in the first place. This is the view that the European Court took, emphasising the need of ‘carefully balancing the potential benefits of the extensive use of such techniques against important private-life interests’.
Technological advancements in the last 15 years have made it possible for the state to use new, effective techniques for fighting crime, but they have also raised serious questions about privacy and the extent of our rights. Unless these issues are discussed and challenged by the public, there is a danger that the state will adopt these techniques without the consideration they deserve.
Data compression is used everywhere. Mp3, mp4, rar, zip, jpg and png files (along with many others) all use compressed data. Without data compression a 3 minute song would be over 100Mb and a 10 minute video would easily be over 1Gb. Data compression condenses large files into much smaller ones. It does this by getting rid of data that isn’t needed while retaining the information in the file.
Does that mean information is different to data? Yes. Let's take an example: I ask Bob who won the Arsenal game. He then launches into a 30 minute monologue about the match, detailing every pass, throw-in, tackle etc. Right at the end he tells me Arsenal won. I only wanted to know who won, so all the data Bob gave me about the game was useless. He could have compressed the data he sent into two words, ‘Arsenal won’, because that’s all the information I needed. Data compression works on this principle.
There are two types of compression: Lossless and Lossy. As the names suggest, lossy compression loses data in the compression process while lossless compression keeps all the data. Both are interesting and easy to understand, and I’ll explain them here. First, though, you need to understand a bit about encoding and binary.
Encoding and Binary
When you want to send a message you need to translate it into language a computer will understand. A computer uses only 0s and 1s. This is called binary. The process of putting a message like 'hello' into 0s and 1s is called encoding.
Binary is a way of representing any number using only 0s and 1s. It is called a 'base 2' number system. In order to understand it, we need to take a step back and think about how we usually represent numbers.
We use a 'base 10' number system in everyday life. It's called 'base 10' because we use 10 digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. When you want to represent a number higher than 9 you need to add an extra digit. The extra digit represents how many 10s there are. So the number 27 means you have 2 lots of ten (represented by the first digit) and 7 lots of one (represented by the second digit). If I want to represent a number higher than 99 I need to add another digit. The third digit represents the number of 100s in the number. A fourth digit represents how many 1000s there are, and so on. You can break a number down and see what each digit represents; for instance the number 586732 can be broken down like so:
Of course we don't usually do this because it's common sense, but bear with me. Each extra digit we use in 'base 10' is always worth 10 times the previous digit. This is not a coincidence: if you choose to use 10 digits (0…9), then each new digit must always be worth 10 times the previous one.
A 'base 7' number system, for example, uses only 7 digits: 0, 1, 2, 3, 4, 5, 6. The digit 8 is meaningless in base 7. This means that if I want to represent a number higher than 6 I need to use a new digit. This time the new digit represents how many 7s I am using. So for instance the number 8 would be written '11' in base 7: the first digit says I am using 1 seven, and the second digit says I'm using 1 one. The third digit in base 7 is worth 7 times the previous digit, so it represents 49 (7×7). A fourth digit would represent 7×49 = 343. To figure out what a number is in base 7 it is convenient to break it down, so for the number 5423:
There are five 343s, four 49s, two 7s and three 1s. This means that the number 5423 in base 7 is equal to (5×343)+(4×49)+(2×7)+(3×1) = 1928.
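You can check that arithmetic in a couple of lines of Python, whose `int()` happens to accept a base when parsing a number:

```python
# Converting 5423 (base 7) to base 10, digit by digit,
# exactly as in the breakdown above.
digits = [5, 4, 2, 3]
value = 0
for d in digits:
    value = value * 7 + d   # shift left one base-7 place, add the next digit
print(value)                # 1928

# Python's int() can do the same conversion directly:
print(int("5423", 7))       # 1928
```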
For engineering reasons computers talk in 'base 2', meaning there are only 2 digits in use, 0 and 1. Continuing the logic of the previous examples, each new digit is always worth 2 times the previous one. The first digit represents 1, the second digit represents 2, the third digit represents 4, the fourth 8, the fifth 16 and so on. As before, we can make a table to translate a binary number into a recognisable number. For instance the binary number 100110 can be written like this:
Figuring out a number in binary is super easy: you just add up the values of the digits that have a '1' beneath them. So 100110 is 32+4+2 = 38. Remember: you can represent any number in binary. Try it for yourself: think of a number between 0 and 63 and figure out its binary representation using the table above.
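A quick check in Python, in both directions:

```python
# 100110 in binary: add up the place values that sit under each 1.
print(int("100110", 2))   # 38
print(32 + 4 + 2)         # 38, the same sum done by hand

# And the other way round, turning a number into binary:
print(bin(38))            # 0b100110
```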
A couple of things to note:
- When we're talking about computers and binary we call a digit a bit. So a binary number that is 10 digits long is 10 bits long. Data compression is about using as few bits as possible.
- If there is a large range of numbers to represent, I need to include enough bits to accommodate all of them, even if I'm not using them.
Let me explain the second point. If I want to send you a number between 0-63, I need to send you 6 bits of data, even if the number I'm going to send is 4, which is 000100. I'm not using the 3 leftmost bits, but I still need to include them. You can think of it in normal numbers: if there are 9999 different possible numbers I could send you, I can't just send '9', I have to send '0009'. The reason is that it makes things easier for computers. Keep it in mind, it'll come back later.
Back to encoding. As we've seen, computers talk in binary, but binary is just a way to represent numbers. Let's say I want to encode the alphabet into binary so a computer can understand it. The alphabet is 26 letters long, so I need enough bits to count to 26. With 4 bits the highest number is '1111', which is 8+4+2+1 = 15, so we can't use 4 bits. With 5 bits (the far left bit representing 16) the highest number is '11111', which is 16+8+4+2+1 = 31. As 26 is less than 31 we can represent the alphabet with 5 bits. We can then encode the alphabet into binary in a common-sense way: a=1, b=2, c=3 etc. In binary this translates to a = 00001, b = 00010, c = 00011, d = 00100, e = 00101 etc.
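Here's a tiny Python sketch of that encoding scheme (a=1 … z=26, each letter padded out to the full 5 bits):

```python
# Encode letters as 5-bit binary numbers: a=1, b=2, ... z=26.
# format(n, "05b") pads the binary form of n to 5 digits with leading zeros.
def encode(word):
    return " ".join(format(ord(c) - ord("a") + 1, "05b") for c in word)

print(encode("abc"))     # 00001 00010 00011
print(encode("hello"))
```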
Great, now we’ve got encoding and binary covered we can get down to the really interesting part: how to compress data. With lossless compression we don’t actually get rid of any data (hence the name), instead it is based on clever ways to encode the data. With lossy compression we do get rid of data, which is why we need to differentiate data from information.
What’s the purpose of compression? How can we measure if it is successful? Well the purpose is to make a file, message, or any chunk of data smaller. If we had a 100Mb file and could condense it down to 50Mb we have compressed it. Furthermore we can say it has a compression ratio of 2, because it is half the size of the original file. If we compressed the 100Mb file to 10Mb it would have a compression ratio of 10 because the new file is a 10th the size of the original. A higher compression ratio means better compression.
As I said at the end of the last section, lossless compression involves finding the smartest way to encode data. The best way to explain this is through an example:
Imagine that instead of binary, computers could only talk in 1s (rather than 0s and 1s). So 1=1, 2=11, 3=111, 4=1111, 5=11111 etc. Also imagine that our alphabet only had 8 letters, from 'a' to 'h'.
So I want to send a message to you (consisting of letters from a-h) and I encode it (into 1s). It makes sense to encode it as a=1, b=11, c=111, d=1111 … h=11111111 (the equivalent of a=1, b=2 etc.). There's no reason to encode it any other way, so let's do it like that.
Once I’ve written my message down, I count the amount of times I use each letter and come up with a chart like this:
This is called a frequency distribution; it shows how frequently we used each letter. I used 'a' once, and 'a' only uses one bit of data as it's encoded as '1', so all the 'a's together use 1 bit of data. I used 'b' fifteen times, and 'b' is two bits of data, '11', so all the 'b's use 15×2 = 30 bits of data. Calculate this for the rest of the letters and we come up with 1+30+12+12+60+60+49+64 = 288 bits of data.
Can you see how we can cut down the amount of data we transmit? We need to change how we encode the alphabet. As 'b' is used most in this message, why not encode 'b' with the least data, '1'? 'e' was the next most common letter, so let's encode that as '11'. 'f' was next, so that's '111', and 'a' will be the last letter because it was used least, encoded as '11111111'. If we calculate how much data we transmit now, the calculation will be:
That's a reduction of 99 bits, or over a third of the data! We're still sending the same message, we've just encoded the data in an efficient way. That's how lossless compression works: it finds the best way to encode data given how often each letter, or 'symbol', is used. Pretty simple, right? It also uses shortcuts; for instance, if there is a string of 300 1s, instead of sending each bit of data it will just say '300 1s' in an encoded way. That's all there really is to lossless compression. Efficient encoding.
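The saving can be checked in a few lines of Python, using the frequency counts from the chart above:

```python
# Frequency counts from the chart: how often each letter appears in the message.
counts = {"a": 1, "b": 15, "c": 4, "d": 3, "e": 12, "f": 10, "g": 7, "h": 8}

def total_bits(code_lengths):
    # Total transmitted bits = (times used) x (code length), summed over letters.
    return sum(counts[letter] * length for letter, length in code_lengths.items())

# Naive encoding: a = 1 bit, b = 2 bits, ... h = 8 bits.
naive = {letter: i + 1 for i, letter in enumerate("abcdefgh")}
print(total_bits(naive))   # 288

# Smarter encoding: the most frequent letter gets the shortest code.
by_frequency = sorted(counts, key=counts.get, reverse=True)
smart = {letter: i + 1 for i, letter in enumerate(by_frequency)}
print(total_bits(smart))   # 189, a saving of 99 bits
```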
Imagine we happened to use each letter exactly the same amount of times, would there be an efficient way to encode the data? The graph would look like this:
No letter is used more than another one, so we could encode 'a' as '1' or '11111111' and it wouldn't make a difference to the amount of data we transmitted. There is no way to encode this more efficiently. The shape of this graph, a straight line, is therefore the most efficient shape of the frequency distribution (in terms of information). When the frequency distribution is anything other than this shape (which is almost all the time) it is said to have a redundancy of data. Whenever there is a redundancy in data, lossless compression can compress the data. Therefore lossless compression exploits data redundancy.
In lossy compression we get rid of data to make our message more efficient. As I said previously, lossy compression tries to get rid of as much data as possible while still retaining as much information as it can. How do we decide what is information and what is fine to discard? Again it is much easier to understand with an example. This time we’ll look at how to compress a song.
Audio is made up of frequencies. These are measured in Hertz. If something has a high frequency we hear it as a high pitch, if something has a low frequency we hear it as a low pitch. If you cut a song at a moment in time you can make a graph of the frequencies and how loud they are. So let’s say I look at the frequency graph while only a bassline is playing, the graph would look like this:
See how there’s stuff going on at the lower end? That’s the bass. If I looked at the frequency graph while there was only a high voice it would look like this:
These are two extreme examples. Usually if I were to pause a song and get the frequency graph at that moment in time it would look something like this.
Note that a second later in the song the graph would look different, because frequencies change throughout the song (the bass may come in, the hi-hat might stop). So where is the information that I need to keep here? Well, think about what we want from a song, or any piece of audio: we want to hear it! We can therefore say that the information we want to keep is the frequencies we can hear during the song. So for instance on the graph above, we can draw a line where the volume is too low for humans to hear:
Anything below the green line is too quiet to hear, so we can regard that as data to discard during compression. We get any frequency that is below the green line and make it equal to zero, as it’s not important:
This will sound exactly the same as the original graph because we’re only getting rid of frequencies we can’t hear. However there is now less data for us to encode. This means that we have compressed the data!
Of course a second later one of the frequencies that we made zero may well become audible, in which case we will not be able to discard it. So when we compress a song the process actually goes like this:
1. decide the threshold volume below which the frequency is inaudible
2. look at the frequency graph throughout the whole song
3. get all the frequencies that never reach above the threshold volume and set them to zero
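The three steps above can be sketched with numpy's FFT. The "song", sample rate and audibility threshold here are all invented for illustration: a loud low tone plus a very quiet high one, with the quiet frequency discarded:

```python
import numpy as np

# A made-up one-second "song": a loud 100Hz tone plus a very quiet 3000Hz tone.
rate = 8000                           # samples per second (assumed)
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 100 * t) + 0.001 * np.sin(2 * np.pi * 3000 * t)

# Step 2: look at the frequency graph (the spectrum).
spectrum = np.fft.rfft(signal)

# Step 1: decide a threshold volume (invented: 1% of the loudest frequency).
threshold = 0.01 * np.abs(spectrum).max()

# Step 3: set every frequency that never reaches the threshold to zero.
kept = np.abs(spectrum) >= threshold
spectrum[~kept] = 0

print(kept.sum(), "of", kept.size, "frequency bins kept")
compressed = np.fft.irfft(spectrum)   # sounds the same, but far less data to encode
```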
Unless you have a crazy piece of music, there will always be frequencies that you are able to discard. Being able to discard frequencies has two benefits in terms of compression:
1. you don't have to send data about these frequencies. Instead of saying 'at 1s, 50Hz = 0.0001, at 1.5s 50Hz = 0.0002' you can just say '50Hz = 0'. If there is less data to send, you are compressing the file!
2. you are reducing the number of frequencies you are using, and therefore you need fewer bits to represent your data. This is important. If I had 5,000 frequencies to represent before compression (telephone speech, for instance, only uses frequencies up to about 3.4kHz) and after compressing I only need to represent 2,000 frequencies, I can reduce the number of bits I need to represent my data. Think back to when I was explaining binary. To encode 5,000 frequencies in binary I need 13 bits (2^13 = 8192). If I wanted to represent 1Hz here it would be represented as '0000000000001'. Now to represent 2,000 frequencies I only need 11 bits (2^11 = 2048), so 1Hz becomes '00000000001'. As each number needs 11 bits rather than 13, we have less data to send: about 15% less, just by choosing to represent fewer numbers! Pretty smart right?
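That bit-counting is easy to verify: the number of bits needed for n distinct values is the smallest b with 2^b ≥ n (note that 2,000 values need 11 bits, since 2^10 = 1024 is too small and 2^11 = 2048 is enough). A quick check in Python:

```python
import math

# Bits needed to represent n distinct values: the smallest b with 2**b >= n.
def bits_needed(n):
    return math.ceil(math.log2(n))

print(bits_needed(5000))   # 13
print(bits_needed(2000))   # 11
print(bits_needed(64))     # 6, the 0-63 example from earlier
```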
Lossy vs Lossless
When choosing between lossy and lossless compression there is, like in all engineering, a trade off. Lossless compression usually has a smaller compression ratio, but the benefit is that you are not losing any data when you send your message. Lossy compression can achieve much higher compression ratios, but data is lost.
So if I wanted to send you a complicated computer program I have written I would choose lossless compression. The reason is that all the data in the file (every line of code) is needed. There is no point in sending the file if I have to scrap data, because it simply won't work. Therefore even though I would have to send a bigger file, it's worth using lossless compression because of the type of file I'm sending.
However if I had created a VoIP or Skype-like software, in which live video calls were taking place, I would use lossy compression to send the video data. The reason is that if the video is not in ‘real time’, or takes a long time to load, it is annoying and people won’t use it. I don’t mind if the picture is slightly pixelated as long as I am receiving it on time. If there is less data to send it will take less time to arrive at the other person’s computer. Hence lossy compression is perfect in this scenario. I am willing to get rid of data (quality of the picture) in order to heavily reduce the amount of data I am sending.
Ever wondered how an Anti Aircraft Missile works? A plane can move at different speeds and altitudes, so how do you know when to fire your missile? Well you need to know two things: where the aircraft is now and where it will be a short time in the future. The reason you need to know where it will be in the future is because your missile takes time to reach the plane’s altitude, so you need to aim at where the plane will be, not where it is now. Engineers use a nifty thing called the Kalman Filter in order track objects and predict where they will be. The Kalman Filter isn’t just used for missiles, it also plays an integral role in GPS, self driving cars, auto pilot, AI and robotics. What is this Kalman Filter then? It’s a recursive way to use Bayes’ Theorem. What’s Bayes’ Theorem? It’s a useful tool in probability. What’s probability? Read on, you’ll find out.
This post therefore describes some basic probability, what Bayes’ Theorem is, what the Kalman Filter is and finally how it is used in an Anti Aircraft Missile. Nothing I describe here requires maths knowledge beyond A-level (or even GCSE) as I’m giving a very broad overview of the whole process. But hopefully after you read it you will have learnt something new.
Here’s a box:
It has an area of 1. Imagine the area represents all your certainty about an event. What does that mean? Let’s say I’m about to flip a coin. I’m 50% certain it will be heads and 50% certain it will be tails, so I give half the area to heads (as it has half of my certainty) and half the area to tails. If I’m 90% sure it will rain tomorrow, I will give 0.9 of the area to rain happening tomorrow and 0.1 to it not happening tomorrow, like this:
So I can distribute my certainty, or area in the box, however I want. If there are three outcomes I can distribute my certainty 0.3, 0.6, 0.1; or 0.9, 0.1, 0; or, if I were completely certain that the first outcome would happen, 1, 0, 0. The one rule about the box is that the area must always equal 1.
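If you prefer code to boxes, here's a tiny Python sketch of that one rule (the function name is just something I've made up for illustration):

```python
# The 'box of certainty' for discrete outcomes: each outcome gets a slice
# of the total area, and the slices must always add up to exactly 1.
def is_valid_box(certainties):
    return abs(sum(certainties.values()) - 1.0) < 1e-9

print(is_valid_box({"rain": 0.9, "no rain": 0.1}))   # True
print(is_valid_box({"a": 0.3, "b": 0.6, "c": 0.1}))  # True
print(is_valid_box({"heads": 0.5, "tails": 0.6}))    # False: area is 1.1
```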
When you are estimating a coin toss or rainfall you are distributing your certainty into discrete outcomes: you separate your box into smaller boxes of different sizes, one per outcome (rain/no rain, heads/tails). As well as discrete outcomes to events, there are also continuous ones. When there is a continuous outcome you don’t split your box of certainty into smaller boxes, you mould your box into the shape of your certainty. Sound weird? It’s not. The next example will make the difference clear.
I have two sensors that measure distance, one is a good sensor and one is bad. Instead of giving an answer like ’10 metres’ each gives the upper and lower limit of what it thinks the distance is, like ’9.90-10.13 metres’. I measure an object that is two metres away with the bad sensor and it gives me the answer ’1.5-2.5 metres’. I can then mould my box of certainty between 1.5 and 2.5 like this:
Remember the box area always has to be equal to 1. The width is 2.5-1.5=1 so the height must be 1. Now I take a measurement with my good sensor and get the answer ’1.95-2.05′. I can now mould my box like this:
See how the area is still 1? The range we’re given by the better sensor is narrower, so the box must be higher. When the outcomes were discrete, our certainty was measured by what area of the box we gave to certain outcomes. In continuous outcomes (like the graph above) we still use the area of the box to measure certainty, but we now need to measure the area in a range.
Let’s say we want to measure how certain each sensor was that the object is between 2-2.05 metres away. On the good sensor the height is 10 and the width is (2.05-2=) 0.05 metres. So the area of the box is 10×0.05= 0.5, which means it is 50% certain the object is within that range. On the bad sensor the height is 1 and the width is still 0.05, so the area of the box is 1×0.05=0.05. It is 5% certain the object is within that range. In the real world a sensor that measures distances doesn’t give you a range, but an exact value. However some sensors are better than others, so we still need a way of quantifying how good a sensor is. So what shape accounts for these two facts and still gives us a good way to shape our probability? A bell curve!
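The arithmetic above is simple enough to sketch in a few lines of Python (the function is my own invention, not from any library):

```python
# Each sensor's certainty is a flat box over its reported range. The height
# is whatever makes the total area 1, so the certainty that the object lies
# in any sub-range is just height x width of the overlapping slice.
def box_probability(low, high, query_low, query_high):
    height = 1.0 / (high - low)
    overlap = max(0.0, min(high, query_high) - max(low, query_low))
    return height * overlap

print(box_probability(1.5, 2.5, 2.0, 2.05))    # bad sensor: ~0.05, i.e. 5%
print(box_probability(1.95, 2.05, 2.0, 2.05))  # good sensor: ~0.5, i.e. 50%
```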
You’ve probably seen this shape before. It occurs everywhere in maths and nature. The posh word for it is a Gaussian but we’ll stick with bell curve. What’s so useful about the bell curve is that you only need two numbers to define it: its mean (where the highest point is) and its variance (how wide it is). So what do variance and mean represent? Well the mean is obvious, it is what the sensor thinks the value is. In the graph above the mean is at 2 (because that’s the highest point in the bell curve) so the sensor will tell me that the object is 2 metres away. Variance isn’t as obvious. The variance is a measurement of how sure the sensor is about its reading. On the graph below the blue bell curve has a variance of 1 and the red bell curve has a variance of 4, both have the same mean of 2:
As you can see, if the variance of a bell curve is increased then it gets wider. The reason that the wider curve (the red line) is shorter than the thinner curve is because the area underneath both curves still has to equal 1. So if I make a curve fatter, by increasing the variance, then it has to get shorter in order to keep the same area. It’s exactly like the sensors that gave a range: if you have a larger range the rectangle has to be shorter to keep the same area. A tall thin bell curve is more certain about its value than a short wide bell curve. This is because all of the certainty (which remember is the area of the shape) is concentrated in a smaller range when it is thinner. The variance is a measure of how wide a bell curve is, so let’s put this into a rule:
The larger the variance, the worse the measurement
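To see the bell curve in numbers rather than pictures, here's the standard Gaussian formula in a few lines of Python (nothing beyond the maths library):

```python
import math

# Height of the bell curve at x, given its mean and variance.
def bell_curve(x, mean, variance):
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Same mean (2), different variances -- like the blue and red curves above.
narrow_peak = bell_curve(2, mean=2, variance=1)  # ~0.399
wide_peak = bell_curve(2, mean=2, variance=4)    # ~0.199: fatter, so shorter
print(narrow_peak, wide_peak)
```

The bigger variance gives a shorter peak, precisely because the area under each curve has to stay at 1.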
Okay enough about bell curves and certainties. Let’s move onto Bayes’ Theorem and see why it’s useful.
So we have seen how to represent probability and the most useful way to represent it: with a bell curve. The reason a bell curve is useful is that all you need to define it is a mean and a variance. Bayes’ Theorem is a way of taking a load of different bell curves and creating one bell curve that sums them all up. What does this mean practically? Well if I have 20 different sensors, of different quality, trying to measure the same thing, Bayes’ Theorem says ‘I’ve looked at all the measurements and it’s probably this value’. Let’s look at a couple of examples. I have two sensors measuring distance again and they are exactly the same quality. One tells me the object is 2 metres away and the other says it is 3 metres. They are the same quality, so this means that they have the same variance. The two sensors’ certainty will look like this:
So if we want to infer what the distance is from these two measurements, it would be 2.5 metres right? We trust each sensor exactly the same, and 2.5 metres is in the middle of the two readings given. How about if we trust one sensor more than the other? Let’s say the sensor that gave us a reading of 3 metres has a large variance (i.e. it isn’t very good) and the sensor that gave us 2 metres has a small variance (it is good). Our certainties would then look like this:
How do we infer the distance now? The blue curve is more certain, so surely our result should be closer to 2 metres than to 3 metres. Bayes’ Theorem helps us calculate what the ‘actual’ distance is from lots of measurements from sensors of different qualities. It does this by giving us a new bell curve. This is really useful because it’s condensing information for us. We are given a load of bell curves and get one bell curve as a result. Remember a bell curve needs two bits of information to be defined, mean and variance. The mean of the new bell curve moves around based on where the means of the old bell curves are. Think of it like a vote. Each sensor casts a vote on what it thinks the measurement is, but the better sensors have more influence in their vote. Bayes’ Theorem counts up all the votes and takes the best guess at where the mean for the new bell curve is.
What happens to the variance? Interestingly it ALWAYS decreases, i.e. the new bell curve you get will have a smaller variance than the old bell curves you used. This means that the new bell curve will always be more certain of its value than the previous bell curves. This translates to: the more information you have about something the better. You can never be given more information about something and become less certain about its value. It seems obvious, but it has interesting corollaries (perhaps for another post).
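For two bell curves there's a neat closed form for this combination (this is the standard product-of-Gaussians result, sketched in Python; the sensor numbers are made up):

```python
# Combine two bell curves into one: the new mean is a variance-weighted
# 'vote' (the smaller-variance curve pulls harder), and the new variance
# is always smaller than either input.
def fuse(mean1, var1, mean2, var2):
    new_mean = (mean1 * var2 + mean2 * var1) / (var1 + var2)
    new_var = (var1 * var2) / (var1 + var2)
    return new_mean, new_var

# Good sensor says 2m (variance 0.1), bad sensor says 3m (variance 1.0):
mean, var = fuse(2.0, 0.1, 3.0, 1.0)
print(mean, var)  # mean ends up close to 2, variance below 0.1
```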
The Kalman Filter
Let’s take a step back and see what we know. Sensors give us measurements of objects and also how certain they are of their measurements. This is represented in a bell curve of certainty, the peak of the curve being the value of the measurement. The variance is a measure of how certain the sensor is of its prediction. Bayes’ Theorem allows us to get multiple readings from different sensors and produce one bell curve which has a mean that is based on the means of all the other sensors, and a variance that will always be less (better) than any of the other sensors. What’s this Kalman Filter then? Well first of all it’s not really a filter. A better name would be a Kalman Tracker, because its main job is to track moving objects. Let’s say we have a ball that is approaching a radar. The radar tells us how far away the ball is in metres and takes a new measurement every t seconds:
As the ball is moving towards us there will be a new value every time we take a measurement. The measurement at x-1 will be further away than the measurement at x because the ball is moving. So this problem is different to the one that Bayes’ Theorem solves. Bayes’ Theorem lets us take lots of measurements of one thing and guesses the value. So if there were 20 radars that all guessed the distance x-1, then we could easily use Bayes to get a good guess on what the actual value of x-1 is. This problem is different because we only have 1 thing doing the measuring (the radar), and the thing we’re measuring is going to move around all the time. However Bayes can still be used in a sneaky way here. The trick is that we use the past measurements of the ball to predict the future position of it. So we use the position of the ball at x-1 and x in order to predict where it will be at x+1. Here’s how:
The ball is 100m away at x-1. The radar takes a new measurement every 0.5 seconds. When the radar takes another reading (where x is) it is 95m away. Therefore in 0.5 seconds the ball has traveled 5 meters. Speed = distance/time = 5/0.5 = 10m/s, so the average speed of the ball is 10m/s. Now assuming that the ball is travelling at a constant speed of 10m/s, we can predict that in the 0.5 seconds it takes for the radar to take its next reading, the ball will have traveled: Distance = time x speed = 0.5 x 10 = 5 meters from its previous position, therefore we predict it will be 95 – 5= 90 meters away at x+1. We can now shift our bell curve in order to predict what the next response will be. When we measured x it was 95m away, so the bell curve will look like this:
Then we predict it will be at 90m at x+1, so the bell curve will look like this:
In case you can’t read the scale, the mean has shifted from 95m to 90m. Also notice how the bell curve has got wider (the variance has increased) when we shift it. That’s because we are less sure about our prediction of x+1 than we are about our measurement of x. This is due to the fact we are assuming that the ball will be travelling at a constant speed, which increases our uncertainty, and hence the variance. So we have used the measurements of x-1 and x in order to predict where the ball will be at x+1. We then take the radar measurement of x+1 and see how similar they are:
The blue bell curve is our prediction of where the ball will be (90m) and the red bell curve is where the new measurement from the radar thinks the ball is (91m). Now it’s time to use Bayes’ Theorem. We combine our prediction and the measurement using Bayes in order to produce one bell curve that is our best guess of where the ball will be at x+1. We then use the measurements at x+1 and x in order to predict where the ball will be at x+2: make a prediction, take a measurement, and combine the two to get the best guess of x+2. The process is recursive, looking like this:
This is better than only using the measurement from the radar. Using information about past measurements lets us take an extra guess at where the ball is. Remember: the more information you have about something, the better. The Kalman filter is a really effective way of using the past to inform the present and predict the future.
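Putting the predict-then-combine loop into code, here's a toy one-dimensional tracker. This is a simplified sketch of the idea, not a full Kalman Filter implementation, and the variances are numbers I've picked for illustration:

```python
def kalman_track(measurements, dt, meas_var, process_var):
    """Toy 1-D tracker: predict from the last two estimates, inflate the
    variance (we're less sure about predictions than measurements), then
    fuse with the new radar reading using the Gaussian form of Bayes."""
    estimates = [measurements[0], measurements[1]]  # bootstrap with raw readings
    var = meas_var
    for z in measurements[2:]:
        speed = (estimates[-2] - estimates[-1]) / dt   # e.g. (100-95)/0.5 = 10 m/s
        predicted = estimates[-1] - speed * dt         # where we think the ball is
        var += process_var                             # prediction is less certain
        # Bayes update: variance-weighted average of prediction and measurement
        fused = (predicted * meas_var + z * var) / (var + meas_var)
        var = (var * meas_var) / (var + meas_var)      # certainty always improves
        estimates.append(fused)
    return estimates

track = kalman_track([100.0, 95.0, 91.0, 85.2, 80.1], dt=0.5,
                     meas_var=1.0, process_var=0.5)
print(track)
```

Each raw radar reading gets nudged towards the prediction, and the shrinking variance from the Bayes step is what stops the tracker trusting any single noisy reading too much.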
Anti Aircraft Missiles
Let’s wrap this up by giving an overview of how an Anti Aircraft Missile works. It’s almost the same as the example of the ball moving towards the radar, but more tailored towards missiles. Your missile can shoot vertically up into the air. If a plane passes over your missile station, you want to fire the missile so that it will come into contact with the plane. You have a few radars that are tracking the plane and they take a new measurement every 0.1 seconds. You also know how long it will take for your missile to reach a certain altitude.
Say the plane approaching your station is flying at 30,000ft and you know that it will take your missile 10 seconds to reach that altitude. You use the Kalman Filter on the radars in order to track where the plane is and its speed. But the Kalman filter can also make predictions about the future position of the plane. So you not only plan 0.1 seconds in advance (the amount of time it takes for the radar to take a new measurement), but also 10 seconds in advance (the amount of time it takes for the missile to reach the appropriate altitude). When the prediction 10 seconds in advance says that the plane will be vertically above the missile, you shoot.
As the plane is already being tracked, figuring out the altitude of the plane is as simple as looking at where the Kalman Filter thinks the plane is. So if the plane is at 20,000ft and the missile takes 7 seconds to reach that altitude, the Kalman Filter will predict 7 seconds in advance. It’s easy to write a program that will do this automatically without any human input. What if the plane is changing altitude? The problem becomes slightly harder, but not too hard. It’s just a matter of figuring out how far to plan in advance.
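The firing decision itself is only a few lines. Here's a hedged sketch (the 50m 'overhead' tolerance, the function name and every number are my own inventions, not real missile engineering):

```python
# Track the plane's horizontal distance from the launcher, predict where it
# will be after the missile's flight time, and fire when that predicted
# position is directly overhead (within some tolerance).
def should_fire(positions, dt, flight_time, tolerance=50.0):
    speed = (positions[-1] - positions[-2]) / dt       # m/s, negative = closing
    predicted = positions[-1] + speed * flight_time    # position at intercept time
    return abs(predicted) < tolerance

# Plane closing at 300 m/s, radar reading every 0.1 s, missile needs 10 s:
print(should_fire([6030.0, 6000.0], dt=0.1, flight_time=10.0))  # still 3 km out at intercept: hold
print(should_fire([3030.0, 3000.0], dt=0.1, flight_time=10.0))  # will be overhead: fire
```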
In this post I’ve described what a Kalman Filter is and hopefully you’ve understood its potential. To get there I’ve gone through how to represent probability and Bayes’ Theorem, because I believe you get a deeper understanding of the whole process if you have a grasp of these.
In the last few years ‘Big Data’ and ‘Data Mining’ have become the buzzwords of the tech industry. It’s how Facebook knows what adverts to show you, it’s how iPhones correct your typing and, apparently, how the NSA decides whether you are a terrorist. But what do these buzzwords actually mean? What are computers doing when they’re ‘learning’ or ‘mining’? When asked, experts say in a serious tone ‘it’s a very complicated field that isn’t easy to understand’ but they’re lying. The principles are easy to grasp and you don’t need to be an expert to appreciate the potential of this subject or to think of applications yourself!
This post starts simple and gradually gets more in depth, but I’ve tried to make the whole thing accessible to someone who has very little maths knowledge. If you just want a very broad overview you only need to read the first section.
It’s all Classification
First things first. While all these terms have subtle differences they’re all talking about the same process: Classification. So what is classification? Well take a look at this picture:
The top/left side of this picture has red circles and the bottom/right side has blue squares. Now say that someone pointed to a place in that picture and asked you to say whether a circle or square belonged there, eg:
There should be a circle where the yellow cross is and a square where the black cross is, right? You’ve just classified data! Think of classification in this way: deciding which parts of the picture belong to the circle and which parts belong to the square.
If you were only given one square and one circle it would be pretty hard to decide which areas of the picture belong to the circle and which belong to the square, but if you were given a million circles and a million squares it would be really easy. The circles and squares have already been classified as circles and squares, therefore they are classified data. Let’s make all this into a rule:
The more data you have already classified, the easier it is to decide how new data will be classified
So how does deciding between circles and squares have any practical use? Well let’s say I’m a teacher and I think that you can estimate whether someone will be earning over £40,000 a year from their IQ and test scores. I get the IQ, test scores and earnings of all my previous students. I then classify their earnings into two categories: above £40,000 a year and below £40,000 a year. I can then plot a graph like this (above £40,000 = red circle, below £40,000 = blue square):
Now if I want to estimate if young Jimmy will earn over £40,000 a year in the future, I can put an ‘X’ on this graph at where his IQ and test scores are and estimate whether he is a red circle or a blue square; whether he will earn over £40,000 a year or not (for the record the graph isn’t real, I have no idea if IQ or test score correlate with income).
There are two important (and related) things to realize about this example
- The only thing helping me classify new bits of data is already classified bits of data. There are no fancy formulas or complex algorithms, what’s helping me decide if something is a circle or a square is the circles and squares that are already on the graph. I.e. data about past students
- I have to do very little work. All I say is ‘I think x and y affect the classification of z’ and plot it. If there are clear areas where squares are and clear areas where circles are then x and y probably affect z; if the circles and squares are everywhere and there is no clear pattern, then x and y probably don’t affect z.
Let me clarify the second point a bit. Look at the second picture again and think about why you could easily classify the crosses. The reason was because there was a pattern in the data: there was a clear circle area and a clear square area. Patterns only happen if variables are related to one another. So there is a nice pattern between income, IQ and test scores because they’re related to each other. But if I plotted income, eye color and height there would not be a pattern because they’re not related to each other. It would be much harder to decide if a point was in a circle area or a square area. It would look like this:
Classification is therefore a way of spotting patterns in data. As humans are pattern spotting creatures, hopefully you can start to see the potential this has…
How computers classify
When you classified the yellow cross and the black cross it was easy, you could instinctively tell which was a square and which was a circle. But computers need a set of rules to decide how to classify rather than instinct. There are two main ways to do this: the nearest neighbor classifier and the SVM classifier.
The nearest neighbor classifier is pretty self explanatory. When the computer has to decide if a certain point is a circle or a square it looks at its nearest neighbors and sees what they are, and then classifies itself according to a vote. So in the graph below the computer looks at its 3 nearest neighbors:
The yellow cross sees that its neighbors are all circles so says ‘well I’m probably a circle’ and classifies itself. The same with the black cross.
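Here's roughly what that vote looks like in Python. A bare-bones sketch, not a real library; the training points are made up to match the circles-top-left, squares-bottom-right picture:

```python
# A minimal nearest neighbor classifier, like the 3-neighbor vote above.
# Each training item is ((x, y), label) -- i.e. already-classified data.
def knn_classify(training, point, k=3):
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(training, key=lambda item: dist(item[0], point))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)  # majority vote

training = [((1, 5), "circle"), ((2, 6), "circle"), ((2, 4), "circle"),
            ((6, 1), "square"), ((7, 2), "square"), ((8, 1), "square")]
print(knn_classify(training, (2, 5)))  # lands among the circles
print(knn_classify(training, (7, 1)))  # lands among the squares
```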
SVM (support vector machine) is a fancy name for a straight line. The SVM draws a line between the circles and the squares like so:
and then says ‘anything above the green line is a circle, anything below it is a square’. Therefore when it is given the yellow cross it says ‘this is above the green line so it’s a circle’.
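And the straight-line version. The real SVM maths chooses the best line for you; in this sketch I've just picked the line y = x by hand to match the circles-above, squares-below picture:

```python
# The 'green line' classifier: anything above the line y = m*x + c is a
# circle, anything below it is a square. (A real SVM would learn m and c
# from the training data; here they're hand-picked for illustration.)
def line_classify(x, y, m=1.0, c=0.0):
    return "circle" if y > m * x + c else "square"

print(line_classify(2, 5))  # above the line -> circle
print(line_classify(7, 1))  # below the line -> square
```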
Both these techniques are valid ways of making the computer decide how to classify data. Deciding which one to use depends on the nature of the problem, but that’s a bit too complex for this post.
Why we need computers to Classify for us
In the example with the teacher it would have been easy for a human to look at the graph and decide Jimmy’s economic outlook, so why bother getting computers to do this for us? The answer has to do with dimensions.
Again lets go back to the example of the teacher classifying whether his students will earn over £40,000. If he wanted to see the relationship between IQ and future income it would be a one dimensional problem because there is only one thing he is measuring: IQ. For a 1-D problem you can plot it on a straight line like so (remember red circle = above 40k, blue square = below 40k):
If he wanted to look at the relationship between IQ, test scores and future income then his problem would be two dimensional. This is because he now is measuring 2 things and seeing what the classification is. A 2-D problem must be plotted in two dimensions, which is a plane (it’s easy to think of it like a piece of paper. Height and width but no thickness). So plotting in 2-D is what the original graph was:
Now the teacher wants to include the height of the student to see if there are any patterns there. Now he needs to plot IQ, Test Score and Height. There are 3 things to measure so it’s a 3-D problem. The graph must therefore be plotted in 3-D. 3-D is the world we live in (height, width and depth), so it would look something like this:
What if the teacher wanted to add eye color? You guessed it, it would be a 4-D problem. So what does 4-D look like? It is literally impossible for us to visualize. For 1-D and 2-D problems it is pretty easy for humans to classify what are circles and what are squares, but once you want to start looking at patterns between more than 3 variables we cannot do it, and this is where computers come in. Computers can handle more variables with ease, and so a 1000 dimensional problem isn’t too hard for it to handle (that may sound like an exaggeration, but it’s not. Problems above 1000 dimensions are pretty common!). So computers can take in huge amounts of variables (eye color, height, IQ, birth weight, blood sugar levels, you name it) and find patterns in the data (i.e. classifying circles or squares). These are patterns that we would never be able to visualize. Remember the only thing helping the computer find the pattern is the data that we’ve already put into it and classified. That’s all it needs, after you give it lots of data it does the rest of the work in classifying new bits of data.
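This is easier to believe once you notice that 'distance between points' is defined the same way however many variables there are, so a nearest neighbor vote needs no picture at all. A made-up five-variable example (all the numbers are invented):

```python
# Euclidean distance works identically whether a point has 2 features
# or 1000 -- this is why computers don't care about dimensions.
def dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

# (IQ, test score, height cm, birth weight kg, blood sugar) -- invented data
student_a = (110, 75, 170, 3.2, 5.1)
student_b = (108, 72, 168, 3.4, 5.0)
student_c = (90, 40, 180, 2.9, 6.5)

# student_a is far closer to b than to c, so a nearest neighbor
# classifier would give it b's label -- no 5-D graph required.
print(dist(student_a, student_b) < dist(student_a, student_c))  # True
```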
How is all this classification actually used in the real world?
It’s all been a bit abstract so far, but hopefully you understand enough to appreciate some of the practical applications of classification. I’ll quickly talk about two. One is IBM’s method of predicting crime and the other is face recognition.
One of my favorite applications of classifying has been IBM’s crime predictor. Their classifier takes in huge amounts of variables about New York city every day (weather, major sports events, time of day, time of year etc.) and then looks at where crime is happening in the city. So the thing that’s being classified is whether a crime is happening in a certain area or not (crime in lower Manhattan = yes/no = circle/square). After collecting information for a few years they started using the classifier to predict where crime will happen that day. So at the beginning of the day they tell the classifier ‘today it is sunny, a Monday, there’s a Yankees game on etc. etc.’ and the classifier will look at the data on similar days, and then classify what will be ‘crime hotspots’ for the day. Ambulances and police cars then patrol those areas more and wait for the crime to happen. Apparently this has genuinely helped reduce the emergency response time in New York. Awesome!
A common use for classifying is face recognition. The classifier gets thousands of images of people and is told where the face is in each image (this is done by drawing a box round each face). What the classifier then does is take the pixel locations, the color of each pixel and their relation to each other as its variables, and classify whether there is a face in the image or not. The classifier has been told thousands of times what a face looks like in an image, and because of this it can determine if there is a face in a new image and where it is (face or no face, circle or square). This technology is really effective and is computationally lightweight enough to use on a standard digital camera.
Big data, Machine Learning and Data Mining all refer to the process of classifying data. ‘Big Data’ refers to the fact that the more data you can use, the more effective the classifier will be (look at the rule we established early on), ‘Data Mining’ refers to the fact that classifiers can spot patterns in data that are in too high a dimension for us to comprehend and ‘Machine Learning’ refers to the fact that a classifier seems to ‘learn’ the way to classify because they are using previous results to inform future decisions.
It is easy to make sweeping predictions about what the future holds for Big Data, but my guess is as good as anyone’s. It may be useful to look to the past in order to put this phenomenon in context. The Victorians loved to classify. They went around the world collecting all sorts of bugs, animals, bits of ancient civilizations and so on. It was driven by the faith that if you could categorize and classify everything humans would be more knowledgeable about the world and closer to an objective truth. Big Data is based on the same belief that we can find patterns in nature by classifying and measuring as much as we can. I would argue that Big Data is the ultimate Victorian machine, the only reason it has been realised 100 years later is that we have the technology to put these ideals into practice.
Is a lemonade stand actually a good way to make money?
As a rookie entrepreneur I have recently been trying to think of viable business ideas, but that is easier said than done. Yesterday it was beautiful so I decided to go to Clapham Common in London to clear my head and think of new ideas. After 15 minutes or so I became thirsty, and after realising that the nearest place to buy a cold drink was a 10 minute walk, I thought to myself: Is making a homemade lemonade stand on Clapham Common a feasible business idea? It is the most stereotypical thing for a kid who wants to be an entrepreneur to do, but in terms of starting a business I have a comparable amount of knowledge! So I thought it would be fun to do a couple of calculations and put some thought into it. Here are the results:
The first hit on Google when you search ‘Homemade Lemonade’ is a nice simple recipe:
- 1 cup sugar (can reduce to 3/4 cup)
- 1 cup water (for the simple syrup)
- 1 cup lemon juice
- 3 to 4 cups cold water (to dilute)
So sugar and lemon juice are going to be the main cost when producing the lemonade. After searching for prices, I found you can buy 6 litres of lemon juice for about £15, and 5kg of sugar for £4. Assuming the cost of water is negligible I worked out the price to produce 1 litre of lemonade like so:
for 30 kg of sugar it will cost £4 x 6 (amount of bags needed) = £24
for 30 litres of lemon juice it will be £15 x 5 = £75
Total = £24 + £75 = £99
For 30 litres of lemon juice we add 120-150 litres of water (syrup plus dilution), therefore (assuming that sugar takes up an insignificant amount of volume) for £99 it is possible to produce somewhere between 150 and 180 litres of homemade lemonade. To keep the sums conservative and easy, let’s call it 135 litres.
Let’s say that each portion of lemonade is 250ml, and each glass gets a quarter of a lemon. 135 litres makes 540 glasses, which means we need 135 lemons. Tesco sells lemons 4 for £1 so that would add an extra £34.
This comes to 135 litres costing about £133 not including ice, which we can nicely round to:
£1 = 1 Litre of Lemonade
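If you want to check my sums, here's the whole cost model in a few lines of Python (every figure is the rough estimate from above, not a real supplier quote):

```python
# Rough lemonade cost model using the prices quoted in the text.
sugar_cost = 4 * 6              # six 5 kg bags for 30 kg of sugar
juice_cost = 15 * 5             # five 6 L bottles for 30 L of lemon juice
litres = 135                    # conservative total produced
glasses = litres / 0.25         # 250 ml per glass
lemon_cost = (glasses / 4) / 4  # quarter lemon per glass, 4 lemons per pound

total = sugar_cost + juice_cost + lemon_cost
print(round(total / litres, 2))  # cost per litre -- close enough to GBP 1
```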
How much to sell for?
Clapham is a neighborhood of London that has a lot of middle class families, and the common is a lovely green area of 220 acres that people come to, especially on sunny days. Due to its size, and where the shops are located, there are many parts of it that are at least 10 minutes from anywhere to get a drink. That is unless you are willing to pay extortionate prices at the ice cream van. On sunny days like this:
you would have a huge demand and could charge a high price. I would estimate that you could sell a 250ml plastic cup of lemonade for £1.50. This makes £5 profit on every litre, or to put it another way, for every pound you invest you get £6 back. Not bad! But it’s not as simple as starting a lemonade stall on every available piece of London greenery and waiting for the millions to roll in, as we shall see…
It would be great to set up a little stand on the common; this would allow you to mix lemonade on demand and give the business a “homely” feel. But I guess I was naive in thinking Clapham Common was actually common land where you could do this without anyone’s permission. It’s covered by the Commons Act 2006, which pretty much makes it a park. After talking to the ice cream lady on the common, I found that she pays £25,000 per year for a licence to park there (!) and we would have to pay a comparable amount. So that throws the idea of a stall out the window, but there is a way round this with a peddler’s licence. This costs only £17 a year and would allow me to get to those parched sunbathers, the only catch is that I have to “generally keep on the move, pausing to make sales”.
This is feasible, but the logistics have to change. First the lemonade would have to be mobile. Carrying a significant amount of liquid is not feasible (remember 1 litre = 1kg), but it is possible to roll it around. It would make sense to store it in something like this (used by campers to get water for their caravan):
If I could attach a pump to this then I could pour drinks directly from the container through a little nozzle into plastic cups, and the whole process could be done on the move.
Making homemade lemonade can have a high profit margin if it is in high demand and it is logistically possible to sell it cheaply under a peddler’s licence. However I would say that it isn’t worth my time for two reasons:
- The British Weather
The success of a lemonade business is highly dependent on the weather and unfortunately the British weather is one of the least reliable things I know. From June-September the average temperature is above 16C, but the average rainfall during those months is the same as any other month. If I were to give a high estimate, there would be maybe 50 days in the year that I could go out and work. For the other 315 I would be sitting in my flat waiting for the sun to come out. It would be an erratic business and I would not be able to plan very far in the future.
- The Cost of Expansion
The other problem is that to make serious money from it you would need to grow the business beyond just me selling lemonade, but I wouldn’t be able to expand it without significant cost. If I wanted to sell twice as much lemonade in an afternoon I would need to hire someone else, which would cost around £35 more a day (if he worked for 5 hours at £7 per hour), and buy a new liquid container and pump for each person. This works out as each person having to sell about 30 cups a day just to break even. Due to the nature of the business the amount I would need to invest would be proportional to the amount I want to sell, and I am not too inclined to invest much money into this!
If you want to make a few quid on a hot day then selling lemonade is a great idea. But if you want to start a business there are better options.