Having spent a couple of weeks in the social sciences, I’ve noticed that a large amount of quantitative analysis relies on Principal Component Analysis (PCA). It is usually mentioned in tandem with eigenvalues, eigenvectors and lots of numbers. So what’s going on? Is this just mathematical jargon to get the non-maths scholars to stop asking questions? Maybe, but it’s also a useful tool when you have to look at data. This post gives a very broad overview of PCA, describing eigenvectors and eigenvalues (which you need to know about to understand it) and showing how you can reduce the dimensions of data using PCA. It’s a neat tool to use in information theory, and even though the maths is a bit complicated, you only need a broad idea of what’s going on to be able to use it effectively.

There’s quite a bit of stuff to process in this post, but I’ve got rid of as much maths as possible and put in lots of pictures.

## What is Principal Component Analysis?

First of all Principal Component Analysis is a good name. It does what it says on the tin. PCA finds the principal components of data.

It is often useful to measure data in terms of its principal components rather than on a normal x-y axis. So what are principal components then? They’re the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This is easiest to explain by way of example. Here are some triangles in the shape of an oval:

Imagine that the triangles are points of data. To find the direction where there is most variance, find the straight line where the data is most spread out when projected onto it. A vertical straight line with the points projected on to it will look like this:

The data isn’t very spread out here, therefore it doesn’t have a large variance. It is probably not the principal component.

A horizontal line with the points projected onto it will look like this:

On this line the data is way more spread out, so it has a large variance. In fact there isn’t a straight line you can draw that has a larger variance than a horizontal one. A horizontal line is therefore the principal component in this example.
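To make the line-drawing idea concrete, here is a minimal sketch in Python with NumPy. The data is made up (a random cloud that is wide in x and narrow in y, standing in for the oval of triangles), and `projected_variance` is a helper name invented for this illustration. It projects the points onto lines at different angles and compares how spread out they are.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-in for the oval of triangles: wide in x, narrow in y.
points = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])

def projected_variance(data, angle_degrees):
    """Variance of the data after projecting it onto a line at the given angle."""
    theta = np.radians(angle_degrees)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector along the line
    return np.var(data @ direction)

var_horizontal = projected_variance(points, 0)
var_vertical = projected_variance(points, 90)
print(f"horizontal: {var_horizontal:.2f}, vertical: {var_vertical:.2f}")

# Try every whole-degree direction and keep the one with the biggest spread.
angles = np.arange(0, 180)
variances = [projected_variance(points, a) for a in angles]
best_angle = angles[int(np.argmax(variances))]
print("most spread-out direction (degrees):", best_angle)
```

For this cloud the winning direction should come out close to horizontal (an angle near 0 or near 180 degrees), which is the "drawing lines" search done by brute force.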

Luckily we can use maths to find the principal component rather than drawing lines and unevenly shaped triangles. This is where eigenvectors and eigenvalues come in.

## Eigenvectors and Eigenvalues

When we get a set of data points, like the triangles above, we can deconstruct the set into eigenvectors and eigenvalues (strictly speaking, it is the covariance matrix of the data that gets decomposed). Eigenvectors and eigenvalues exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction; in the example above, the eigenvector was the direction of the line (vertical, horizontal, 45 degrees, etc.). An eigenvalue is a number telling you how much variance there is in the data in that direction; in the example above, the eigenvalue is a number telling us how spread out the data is on the line. The eigenvector with the highest eigenvalue is therefore the principal component.

Okay, so even though in the last example I could point my line in any direction, it turns out there are not many eigenvectors/values in a data set. In fact the number of eigenvectors/values that exist equals the number of dimensions the data set has. Say I’m measuring age and hours on the internet. There are 2 variables, so it’s a 2-dimensional data set, and therefore there are 2 eigenvectors/values. If I’m measuring age, hours on internet and hours on mobile phone, there are 3 variables, so it’s a 3-D data set with 3 eigenvectors/values. The reason for this is that eigenvectors put the data into a new set of dimensions, and the number of these new dimensions has to equal the original number of dimensions. This sounds complicated, but again an example should make it clear.
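Here is a rough sketch of how the eigenvector/eigenvalue pairs can be computed in practice, assuming the standard approach of decomposing the covariance matrix of the data (the post itself doesn’t spell out the calculation). The data is random and only stands in for the three measured variables.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 200 people, 3 variables (age, hours on internet, hours on mobile).
data = rng.normal(size=(200, 3))

covariance = np.cov(data, rowvar=False)                  # 3 x 3 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # eigh is meant for symmetric matrices

# eigh returns eigenvalues in ascending order; reorder so the biggest comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]                    # column i pairs with eigenvalues[i]

principal_component = eigenvectors[:, 0]                 # direction with the most variance
print(eigenvalues.shape, eigenvectors.shape)
```

Three variables go in, and three eigenvalues with three matching eigenvectors come out, one pair per dimension.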

Here’s a graph with the oval:

At the moment the oval is on an x-y axis. x could be age and y hours on the internet. These are the two dimensions that my data set is currently being measured in. Now remember that the principal component of the oval was a line splitting it longways:

It turns out the other eigenvector (remember there are only two of them as it’s a 2-D problem) is perpendicular to the principal component. As we said, the eigenvectors have to be able to span the whole x-y area, and in order to do this (most effectively) the two directions need to be orthogonal (i.e. at 90 degrees) to one another. This is why the x and y axes are orthogonal to each other in the first place. It would be really awkward if the y axis was at 45 degrees to the x axis. So the second eigenvector would look like this:

The eigenvectors have given us a much more useful set of axes to frame the data in. We can now re-frame the data in these new dimensions. It would look like this:

Note that nothing has been done to the data itself. We’re just looking at it from a different angle. So getting the eigenvectors gets you from one set of axes to another. These axes now fit the shape of the data much more intuitively. These directions are where there is most variation, and that is where there is more information. (Think about this the reverse way round: if there was no variation in the data, e.g. everything was equal to 1, there would be no information and it would be a very boring statistic. In this scenario the eigenvalue for that dimension would equal zero, because there is no variation.)
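The re-framing can be sketched in a few lines of NumPy. This assumes the usual recipe (centre the data, take the eigenvectors of its covariance matrix, then express each point in terms of those eigenvectors); the tilted oval is made-up data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up oval: stretched along x, then rotated by 30 degrees so it sits diagonally.
theta = np.radians(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
data = (rng.normal(size=(500, 2)) * np.array([3.0, 1.0])) @ rotation.T

centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]  # biggest first

# The two eigenvectors are perpendicular, so their dot product is (numerically) zero.
dot = eigenvectors[:, 0] @ eigenvectors[:, 1]

# Re-frame the data: each point's coordinates along the new axes.
reframed = centered @ eigenvectors

# Nothing has been done to the data itself: every point is still the same
# distance from the centre, we are only looking from a different angle.
same_lengths = np.allclose(np.linalg.norm(centered, axis=1),
                           np.linalg.norm(reframed, axis=1))
print(dot, same_lengths)
```

After re-framing, the variance along each new axis equals the corresponding eigenvalue, which is the "eigenvalue tells you the spread in that direction" statement in numerical form.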

But what do these eigenvectors represent in real life? The old axes were well defined (age and hours on internet, or any 2 things that you’ve explicitly measured), whereas the new ones are not. This is where you need to think. There is often a good reason why these axes represent the data better, but maths won’t tell you why, that’s for you to work out.

How do PCA and eigenvectors help in the actual analysis of data? Well, there are quite a few uses, but a main one is dimension reduction.

## Dimension Reduction

PCA can be used to reduce the dimensions of a data set. Dimension reduction is analogous to being philosophically reductionist: it reduces the data down into its basic components, stripping away any unnecessary parts.

Let’s say you are measuring three things: age, hours on internet and hours on mobile. There are 3 variables so it is a 3D data set. 3 dimensions is an x, y and z graph; it measures width, depth and height (like the dimensions in the real world). Now imagine that the data forms an oval like the ones above, but that this oval is on a plane, i.e. all the data points lie on a piece of paper within this 3D graph (having width and depth, but no height). Like this:

When we find the 3 eigenvectors/values of the data set (remember 3D problem = 3 eigenvectors), 2 of the eigenvectors will have large eigenvalues, and one of the eigenvectors will have an eigenvalue of zero. The first two eigenvectors will show the width and depth of the data, but because there is no height in the data (it is on a piece of paper) the third eigenvalue will be zero. In the picture below ev1 is the first eigenvector (the one with the biggest eigenvalue, the principal component), ev2 is the second eigenvector (which has a non-zero eigenvalue) and ev3 is the third eigenvector, which has an eigenvalue of zero.

We can now rearrange our axes to be along the eigenvectors, rather than age, hours on internet and hours on mobile. However, we know that ev3, the third eigenvector, is pretty useless. Therefore instead of representing the data in 3 dimensions, we can get rid of the useless direction and only represent it in 2 dimensions, like before:

This is dimension reduction. We have reduced the problem from a 3D to a 2D problem, getting rid of a dimension. Reducing dimensions helps to simplify the data and makes it easier to visualise.
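As a sketch of the piece-of-paper example, the snippet below builds made-up 3D data that lies exactly on a tilted plane, finds the three eigenvalues, and drops the direction whose eigenvalue is zero. The plane directions and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up flat oval: two in-plane coordinates, placed on a tilted plane in 3D.
flat = rng.normal(size=(300, 2)) * np.array([3.0, 1.5])
plane_axes = np.array([[1.0, 0.0, 1.0],      # two directions that span the plane
                       [0.0, 1.0, 1.0]])
data = flat @ plane_axes                     # shape (300, 3); every point lies on the plane

centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]  # biggest first
print(eigenvalues)                           # the third value is zero up to rounding error

# Keep ev1 and ev2, drop ev3.
reduced = centered @ eigenvectors[:, :2]     # shape (300, 2)

# Mapping the 2D coordinates back into 3D recovers the centred data,
# so dropping ev3 has thrown nothing away in this flat case.
restored = reduced @ eigenvectors[:, :2].T
```

The 3-column data set becomes a 2-column one, and because the discarded eigenvalue was zero the reduction is lossless here.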

Note that we can reduce dimensions even if there isn’t a zero eigenvalue. Imagine we did the example again, except instead of the oval being on a 2D plane, it had a tiny amount of height to it. There would still be 3 eigenvectors, however this time none of the eigenvalues would be zero. The values would be something like 10, 8 and 0.1. The eigenvectors corresponding to 10 and 8 are the dimensions where there is a lot of information, while the eigenvector corresponding to 0.1 will not have much information at all, so we can discard the third eigenvector again in order to make the data set simpler.
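One common way to put a number on "not much information" is to look at each eigenvalue as a share of the total. This is a standard convention rather than something the example above depends on. A minimal sketch using the 10, 8 and 0.1 values:

```python
import numpy as np

eigenvalues = np.array([10.0, 8.0, 0.1])    # the example values from the text
share = eigenvalues / eigenvalues.sum()     # fraction of the total variance per direction
kept = share[:2].sum()                      # fraction kept after dropping the third direction
print(share.round(3), round(float(kept), 4))
```

Here the first two directions hold 18 out of 18.1 of the total variance, about 99.4%, so very little is lost by discarding the third.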

## Example: the OxIS 2013 report

The OxIS 2013 report asked around 2000 people a set of questions about their internet use. It then identified 4 principal components in the data. This is an example of dimension reduction. Let’s say they asked each person 50 questions. There are therefore 50 variables, making it a 50-dimension data set. There will then be 50 eigenvectors/values that will come out of that data set. Let’s say the eigenvalues of that data set were (in descending order): 50, 29, 17, 10, 2, 1, 1, 0.4, 0.2….. There are lots of eigenvalues, but only 4 of them have big values, indicating that along those four directions there is a lot of information. These are then identified as the four principal components of the data set (which in the report were labelled as enjoyable escape, instrumental efficiency, social facilitator and problem generator). The data set can then be reduced from 50 dimensions to only 4 by ignoring all the eigenvectors that have insignificant eigenvalues. 4 dimensions is much easier to work with than 50! So dimension reduction using PCA helped simplify this data set by finding the dominant dimensions within it.

If you like this blog and think that I would work well in your company a copy of my CV can be found by following this link. I am available for employment from August 2014.

Feel free to email me at george.m.dallas@gmail.com

well done!

This is fantastic. Thank you.

Great post – thank you. I would like to add that PCA is done frequently with scientific data. For example if you have a spectrum representing some chemical and are comparing with a different spectrum from another chemical the data will be similar on a lot of dimensions. So comparing PC scores (eigenvalues) is a good way to differentiate the data.

That was nicely explained! Now I just need to figure out how this applies to image processing – apparently PCA is a hot technique there for classifying different motifs in images.

Images have a lot of information (dimensionality) and not all of it will discriminate between two images. Say you have two different faces in focus and an out of focus background. The pics are generally very similar if you just did a correlation on some metric like image intensity, so PCA would pick out the differences and place those onto orthogonal axes allowing you to easily differentiate the images with a linear discriminant analysis (for example).

I used eigenvector analysis in image processing 20-odd years ago to develop classification algorithms which detected faulty solder joints based on X-ray images. I’m thrilled our paper is available online.

http://www.bmva.org/bmvc/1990/bmvc-90-029.html

We used the average pixel brightness from 16 regions inside and surrounding the area of the solder joint as a 16-dimensional vector, and then used eigenvector analysis to find the axes that gave the best discrimination of good from faulty joints.

When you say, “it’s a neat tool to use in information theory,” I think you mean “exploratory data analysis” and not “information theory”. The latter is the field of mathematics (often practiced by electrical engineers) describing compression and error-correction codes (i.e., removing all the redundancy from a stream of bits [if you'll allow me to be concrete] and carefully adding redundancy back to obtain immunity to partial corruption, respectively), and doesn’t have much to do with PCA.

Ahmed, PCA does indeed have applications in information theory, since PCA is one way (though not always the optimal one) of reducing noise / enhancing signal. There are many studies about the role of PCA in information theory, with much discussion about when it is appropriate. I’ll give a couple of links discussing these issues:

https://en.wikipedia.org/wiki/Principal_component_analysis#PCA_and_information_theory

http://www.spsc.tugraz.at/sites/default/files/RelevantLossFinal.pdf

(y) good 1

Great :), thank you

Nice explanation of dimensional reduction. It’s a hard concept to wrap your mind around at first. In case you’re interested, I attempt to explain a similar rank reduction here: http://bit.ly/1cc8xHQ (in the “Mathy Bits” section).

Wow!!! – you have an amazing way of making something so complicated seem so simple. Great job!!

Can character data be converted in such a way to also use it with PCA? If so, what would that transformation look like?

Great explanation

Very well explained. finally, I got it!!!!

Nice post.

PCA is certainly of value to information theory and image processing.

Although PCA can be used for data reduction, I think its primary role is data exploration – in particular search algorithms and interpolation.

What most people don’t understand about PCA is it can be used on large images by using samples rather than the whole image; similarly one can use fast algorithms to identify the dominant eigenvectors.

I think these fast algorithms are still underutilised.

Great explanation! But how does one label the principal components? I think one must have some prior idea about it.

Thank you so much! I am a biologist myself and for a long time I couldn’t find any non-math-related explanation for PCA! This helps me a lot! My PhD defense is saved :)

great post. just to verify my understanding am i right in saying that eigen vectors are same as principal components (in vector form). so the PCA is what we call the characteristic equation?

thank you for this very informative lecture. :)

Thanks :) Good one :) Really helped

Reblogged this on Have a Sanook Day! and commented:

An amazing way to make Principal Component Analysis seem so simple

Thank you for giving us the “principal components” of PCA. I understand it much more now.

brilliant, even if I read the theory behind it, the authors there were afraid of sharing the practical and illustrative of the whole thing. I could picture it here without any pics.

please more of you people who can explain stuff simply.

Cheers

Well explained. Thanks

Nice Explanation! However, in the last example, it should be 49 eigenvectors. Since, the eigen decomposition finds hyper-lines (eigenvectors), it needs two variables at least, so that the number of eigenvectors is the number of variables – 1.

Great lecture! You’ve opened my eyes in PCA. God bless your understanding much more.

good job !!

Thank you! Excellent qualitative description!

Awesome and simple! :)

Suppose the first two PCs explain 60% of the variance. How can I create a single variable that has this property? We may want to call it the aggregate of the first two PCs. If someone knows how to do this, please reply as soon as you possibly can.

Well done. This was very useful

very very nice post, it makes my way to PCA easier.

thank you

I’ve been using PCA for months and have read up on a lot of stuff to get a better understanding of what it really does, and this is the first time I really understood! Thank you! :)))))

Reblogged this on Molecular Dynamics Jobs and Discussion Blog and commented:

Learn more about Principal Component Analysis. Thanks to http://georgemdallas.wordpress.com for this nice post.

Well done. You have explained PCA in the simplest of terms. The applications of PCA are wide ranging. I have been using it in Statistical Arbitrage. I have applied PCA to Indian stock markets in an attempt to figure out any hidden trends. Below is the link if you want to have a look:

http://quantcity.blogspot.in/2013/12/pca-on-nifty-stocks.html

very well explained! I wish the method of calculation of the eigen values would have been explained too! :)

Well done. How does PCA cope if small variations exist in values that sit in the power position, like c in (a+b)^c? Even small changes in c (0.1, 0.2, etc.) are more important than a or b being 10, 11, 12, etc.

To simplify I will increase values by x10 :

(100+100)^2 = 200 x 200 =40000

(100+120)^2 = 220 x 220 = 48400 (an increase of 8400)

(100+100)^3 = 200 x 200 x 200 = 40000 x 200 (an increase by a factor of 200 after increasing the power c by only 1, from 2 to 3)

If PCA will ignore c as the smallest value, you will lose a lot!

Thanks.

Andrew.

Excellent presentation! I just wish I owned a company so that I could hire you!! Thanks again,

silvana

great :D

Thanks a lot

Reblogged this on sarahmoawad and commented:

very nice explanation :)

This was great – quick read and understandable – I have read other documents about PCA, but nothing this well explained. Very much appreciated – thank you.

you’re awesome

Great explanation – I wish tools like Matlab’s implementation of PCA showed us how to get those eigenvalues!

Thank you! I wish you good luck in your job search.

thanks for the info! you may want to submit your CV at http://jobs.irri.org/ should you wish to be involved in rice science. all the best!

Thanks for the great job you have done. For the 1st time I sensed what axis rotation is.