Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction

(I recently wrote a new post that you may also find interesting called Principal Component Analysis 4 Philosophers)

Having been in the social sciences for a couple of weeks, it seems like a large amount of quantitative analysis relies on Principal Component Analysis (PCA). This is usually referred to in tandem with eigenvalues, eigenvectors and lots of numbers. So what’s going on? Is this just mathematical jargon to get the non-maths scholars to stop asking questions? Maybe, but it’s also a useful tool when you have to look at data. This post will give a very broad overview of PCA, describing eigenvectors and eigenvalues (which you need to know about to understand it) and showing how you can reduce the dimensions of data using PCA. As I said, it’s a neat tool for analysing data, and even though the maths is a bit complicated, you only need a broad idea of what’s going on to be able to use it effectively.

There’s quite a bit of stuff to process in this post, but I’ve got rid of as much maths as possible and put in lots of pictures.

What is Principal Component Analysis?

First of all Principal Component Analysis is a good name. It does what it says on the tin. PCA finds the principal components of data.

It is often useful to measure data in terms of its principal components rather than on a normal x-y axis. So what are principal components then? They’re the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out. This is easiest to explain by way of example. Here are some triangles in the shape of an oval:

Imagine that the triangles are points of data. To find the direction where there is most variance, find the straight line where the data is most spread out when projected onto it. A vertical straight line with the points projected on to it will look like this:

The data isn’t very spread out here, therefore it doesn’t have a large variance. It is probably not the principal component.

A horizontal line with the points projected onto it will look like this:

On this line the data is way more spread out; it has a large variance. In fact, there isn’t a straight line you can draw that has a larger variance than a horizontal one. A horizontal line is therefore the principal component in this example.
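The line-drawing experiment can also be tried numerically. The sketch below is only an illustration: the oval is a made-up cloud of random points (wide in x, narrow in y), and `projected_variance` is a hypothetical helper name, not something from a library. It projects the points onto a line at a given angle and measures how spread out they are.

```python
import numpy as np

# Hypothetical stand-in for the triangles: an oval cloud, wide in x and narrow in y.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
points = points - points.mean(axis=0)  # centre the cloud on the origin


def projected_variance(data, angle_degrees):
    """Variance of the data after projecting it onto a line at the given angle."""
    angle = np.radians(angle_degrees)
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the line
    return np.var(data @ direction)


horizontal = projected_variance(points, 0)
vertical = projected_variance(points, 90)
print(f"variance on the horizontal line: {horizontal:.2f}")
print(f"variance on the vertical line:   {vertical:.2f}")

# Brute-force search: try one line per degree and keep the most spread-out one.
angles = np.arange(0, 180)
best_angle = angles[np.argmax([projected_variance(points, a) for a in angles])]
print(f"most spread-out line is at about {best_angle} degrees")
```

For this made-up oval the winning angle should come out at (or within a few degrees of) horizontal, which is the same conclusion as the pictures.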

Luckily we can use maths to find the principal component rather than drawing lines and unevenly shaped triangles. This is where eigenvectors and eigenvalues come in.

Eigenvectors and Eigenvalues

When we get a set of data points, like the triangles above, we can deconstruct the set into eigenvectors and eigenvalues. (Strictly speaking, these are the eigenvectors and eigenvalues of the data’s covariance matrix, which records how much each variable varies together with every other variable.) Eigenvectors and eigenvalues exist in pairs: every eigenvector has a corresponding eigenvalue. An eigenvector is a direction; in the example above, an eigenvector would be the direction of a line (vertical, horizontal, 45 degrees etc.). An eigenvalue is a number telling you how much variance there is in the data in that direction; in the example above, the eigenvalue is a number telling us how spread out the data is on the line. The eigenvector with the highest eigenvalue is therefore the principal component.
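For anyone who wants to see the pairs, here is a minimal sketch using NumPy. It assumes the eigenvectors and eigenvalues are taken from the covariance matrix of the data (the usual way PCA is computed), and the data itself is a made-up oval rather than anything from a real study.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])  # hypothetical oval-shaped cloud

# Covariance matrix of the two variables (rows are observations, columns are variables).
covariance = np.cov(data, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order,
# with the matching eigenvectors stored as the COLUMNS of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Sort the pairs so the biggest eigenvalue comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

principal_component = eigenvectors[:, 0]
print("eigenvalues:", np.round(eigenvalues, 2))
print("principal component direction:", np.round(principal_component, 2))

# The eigenvalue really is the variance of the data along its eigenvector.
centred = data - data.mean(axis=0)
variance_along_pc = np.var(centred @ principal_component, ddof=1)
print("variance along the principal component:", round(variance_along_pc, 2))
```

The last number printed matches the largest eigenvalue, which is the "eigenvalue = spread along that direction" idea in code form. The sign of an eigenvector is arbitrary, so the direction may print as pointing left rather than right.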

Okay, so even though in the last example I could point my line in any direction, it turns out there are not many eigenvectors/values in a data set. In fact, the number of eigenvectors/values that exist equals the number of dimensions the data set has. Say I’m measuring age and hours on the internet. There are 2 variables, so it’s a 2-dimensional data set, and therefore there are 2 eigenvectors/values. If I’m measuring age, hours on internet and hours on mobile phone, there are 3 variables, so it’s a 3-D data set with 3 eigenvectors/values. The reason for this is that eigenvectors put the data into a new set of dimensions, and the number of new dimensions has to equal the original number of dimensions. This sounds complicated, but again an example should make it clear.
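The counting rule is quick to check before getting to the picture. In this sketch the data is random noise standing in for the survey variables, and `count_eigenpairs` is a hypothetical helper written just for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)


def count_eigenpairs(data):
    """Return (number of eigenvalues, number of eigenvectors) for a data set."""
    covariance = np.cov(data, rowvar=False)  # one row/column per variable
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    return len(eigenvalues), eigenvectors.shape[1]


two_variables = rng.normal(size=(100, 2))    # e.g. age, hours on internet
three_variables = rng.normal(size=(100, 3))  # e.g. age, hours on internet, hours on mobile

print(count_eigenpairs(two_variables))    # (2, 2)
print(count_eigenpairs(three_variables))  # (3, 3)
```

The number of people measured (100 here) makes no difference; only the number of variables does.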

Here’s a graph with the oval:

At the moment the oval is on an x-y axis. x could be age and y hours on the internet. These are the two dimensions that my data set is currently being measured in. Now remember that the principal component of the oval was a line splitting it longways:

It turns out the other eigenvector (remember there are only two of them as it’s a 2-D problem) is perpendicular to the principal component. As we said, the eigenvectors have to be able to span the whole x-y area, and in order to do this (most effectively) the two directions need to be orthogonal (i.e. at 90 degrees) to one another. This is why the x and y axes are orthogonal to each other in the first place. It would be really awkward if the y axis was at 45 degrees to the x axis. So the second eigenvector would look like this:

The eigenvectors have given us a much more useful axis to frame the data in. We can now re-frame the data in these new dimensions. It would look like this:

Note that nothing has been done to the data itself. We’re just looking at it from a different angle. So getting the eigenvectors gets you from one set of axes to another. These axes are much more intuitive to the shape of the data now. These directions are where there is most variation, and that is where there is more information (think about this the reverse way round. If there was no variation in the data [e.g. everything was equal to 1] there would be no information, it’s a very boring statistic – in this scenario the eigenvalue for that dimension would equal zero, because there is no variation).
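Here is a rough sketch of that change of axes. The tilted oval is invented for the example (30 degrees is an arbitrary choice), and the eigenvectors are assumed to come from the covariance matrix. It checks three things from the text: the two eigenvectors are at 90 degrees, the data itself is not stretched or squashed, and in the new axes each axis carries its own eigenvalue’s worth of variance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical oval, tilted by 30 degrees so it does not line up with x or y.
angle = np.radians(30)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
oval = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
data = oval @ rotation.T
centred = data - data.mean(axis=0)

# Eigenvectors of the covariance matrix, sorted so the principal component is first.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 1. The two eigenvectors are orthogonal: their dot product is (numerically) zero.
dot_product = eigenvectors[:, 0] @ eigenvectors[:, 1]

# 2. Re-frame the data: each point's coordinates along the new axes.
reframed = centred @ eigenvectors

# Nothing has been done to the data itself: every point is still the same
# distance from the centre, we are only looking from a different angle.
same_distances = np.allclose(np.linalg.norm(centred, axis=1),
                             np.linalg.norm(reframed, axis=1))

# 3. In the new axes the covariance matrix is diagonal, with the eigenvalues on it.
new_covariance = np.cov(reframed, rowvar=False)

print("dot product of eigenvectors:", round(float(dot_product), 10))
print("distances unchanged:", same_distances)
print("covariance in new axes:\n", np.round(new_covariance, 3))
```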

But what do these eigenvectors represent in real life? The old axes were well defined (age and hours on internet, or any 2 things that you’ve explicitly measured), whereas the new ones are not. This is where you need to think. There is often a good reason why these axes represent the data better, but maths won’t tell you why, that’s for you to work out.

How do PCA and eigenvectors help in the actual analysis of data? Well, there are quite a few uses, but a main one is dimension reduction.

Dimension Reduction

PCA can be used to reduce the dimensions of a data set. Dimension reduction is analogous to being philosophically reductionist: it reduces the data down into its basic components, stripping away any unnecessary parts.

Let’s say you are measuring three things: age, hours on internet and hours on mobile. There are 3 variables, so it is a 3D data set. 3 dimensions means an x, y and z graph; it measures width, depth and height (like the dimensions in the real world). Now imagine that the data forms into an oval like the ones above, but that this oval is on a plane, i.e. all the data points lie on a piece of paper within this 3D graph (having width and depth, but no height). Like this:

When we find the 3 eigenvectors/values of the data set (remember 3D problem = 3 eigenvectors), 2 of the eigenvectors will have large eigenvalues, and one of the eigenvectors will have an eigenvalue of zero. The first two eigenvectors will show the width and depth of the data, but because there is no height to the data (it is on a piece of paper) the third eigenvalue will be zero. On the picture below ev1 is the first eigenvector (the one with the biggest eigenvalue, the principal component), ev2 is the second eigenvector (which has a non-zero eigenvalue) and ev3 is the third eigenvector, which has an eigenvalue of zero.

We can now rearrange our axes to be along the eigenvectors, rather than age, hours on internet and hours on mobile. However we know that ev3, the third eigenvector, is pretty useless. Therefore instead of representing the data in 3 dimensions, we can get rid of the useless direction and only represent it in 2 dimensions, like before:

This is dimension reduction. We have reduced the problem from a 3D to a 2D problem, getting rid of a dimension. Reducing dimensions helps to simplify the data and makes it easier to visualise.
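The piece-of-paper example can be sketched in a few lines. Everything here is invented for illustration: the points are random, and `plane_axes` is just one arbitrary choice of a tilted sheet inside the 3D graph. The eigenvectors are again assumed to come from the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(4)

# An oval drawn on a flat sheet: each point only needs two "on the paper" coordinates.
sheet_coordinates = rng.normal(size=(500, 2)) * np.array([3.0, 1.5])

# Two hypothetical directions that place the sheet, tilted, inside the 3D graph.
plane_axes = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])
data = sheet_coordinates @ plane_axes  # shape (500, 3), yet every point lies on the sheet
centred = data - data.mean(axis=0)

# Three variables, so three eigenvector/eigenvalue pairs, biggest first.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centred, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print("eigenvalues:", np.round(eigenvalues, 6))  # the third is zero up to rounding error

# Dimension reduction: keep only the coordinates along ev1 and ev2.
reduced = centred @ eigenvectors[:, :2]
print("reduced shape:", reduced.shape)

# Because ev3 carried no variance, the 2D version can be turned back into
# the original 3D points without losing anything.
restored = reduced @ eigenvectors[:, :2].T
print("nothing lost:", np.allclose(restored, centred))
```

On a computer the third eigenvalue usually comes out as a tiny number such as 1e-16 rather than an exact zero, which is just floating-point rounding.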

Note that we can reduce dimensions even if there isn’t a zero eigenvalue. Imagine we did the example again, except instead of the oval being on a 2D plane, it had a tiny amount of height to it. There would still be 3 eigenvectors, however this time none of the eigenvalues would be zero. The values would be something like 10, 8 and 0.1. The eigenvectors corresponding to 10 and 8 are the dimensions where there is a lot of information; the eigenvector corresponding to 0.1 will not have much information at all, so we can discard the third eigenvector again in order to make the data set simpler.
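One common way to put a number on "not much information" is to look at each eigenvalue as a share of the total. The sketch below uses the 10, 8 and 0.1 from the paragraph above; the 1% cut-off is purely an assumption for the example, not a rule.

```python
import numpy as np

eigenvalues = np.array([10.0, 8.0, 0.1])  # the illustrative values from the text

# Each eigenvalue as a fraction of the total variance in the data set.
share = eigenvalues / eigenvalues.sum()
print("share of variance per eigenvector:", np.round(share, 3))

# Hypothetical rule of thumb: keep directions that carry at least 1% of the variance.
threshold = 0.01
keep = share >= threshold
print("keep eigenvector?", keep)

kept_share = share[keep].sum()
print(f"variance kept after dropping the rest: {kept_share:.1%}")
```

With these numbers the first two eigenvectors together hold over 99% of the variance, so dropping the third changes very little.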

Example: the OxIS 2013 report

The OxIS 2013 report asked around 2000 people a set of questions about their internet use. It then identified 4 principal components in the data. This is an example of dimension reduction. Let’s say they asked each person 50 questions. There are therefore 50 variables, making it a 50-dimension data set. There will then be 50 eigenvectors/values that come out of that data set. Let’s say the eigenvalues of that data set were (in descending order): 50, 29, 17, 10, 2, 1, 1, 0.4, 0.2….. There are lots of eigenvalues, but only 4 of them have big values, indicating that along those four directions there is a lot of information. These are then identified as the four principal components of the data set (which in the report were labelled as enjoyable escape, instrumental efficiency, social facilitator and problem generator). The data set can then be reduced from 50 dimensions to only 4 by ignoring all the eigenvectors that have insignificant eigenvalues. 4 dimensions is much easier to work with than 50! So dimension reduction using PCA helped simplify this data set by finding the dominant dimensions within it.
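Spotting "the big ones" by eye can be mimicked with a simple heuristic: find the largest drop between one eigenvalue and the next. This is only a sketch of one possible rule, not the method used in the OxIS report, and it uses just the nine made-up eigenvalues listed above (the rest of the 50 are left out because the text does not give them).

```python
import numpy as np

# Only the eigenvalues quoted in the text, in descending order.
# These are the post's "let's say" numbers, not real OxIS results.
listed_eigenvalues = np.array([50, 29, 17, 10, 2, 1, 1, 0.4, 0.2])

# How many times bigger is each eigenvalue than the one after it?
drop_ratios = listed_eigenvalues[:-1] / listed_eigenvalues[1:]
print("drop ratios:", np.round(drop_ratios, 2))

# Assumed heuristic: cut at the largest drop and keep everything before it.
n_components = int(np.argmax(drop_ratios)) + 1
print("components to keep:", n_components)
```

Here the biggest drop is from 10 down to 2, so the rule keeps 4 components, which agrees with the reading in the text. Real data rarely has such a clean gap, so in practice this kind of rule is a starting point rather than an answer.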

414 thoughts on “Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction”

Pramod says:

A job well done !! Easy is beautiful !!

October 19, 2017 at 1:26 pm
Ali Sher says:

I am an organic chemist and recently I am studying about PCA, HCA, PLS & PLS-DA, PCR, and like materials but being stressful and thought that I’d never be able to understand the concept of PCA but I thank you so much you helped via this, and these are really amazing induceable stuffs with examples.
Dear author if possible kindly upload something like as mentioned in the recent comment.

My well wishes and thanksgivings

October 25, 2017 at 4:03 pm
Surendra says:

Simply i can say WOW!!!!!!

I don’t have much English words to explain how much simplified explanation to PCA. Googled for 2 days and finally I crossed my fingers after reading this article saying that now I can explain my grandma as well

THANK YOU SO MUCH

October 30, 2017 at 1:37 pm
sampaul says:

Thank you very much for this article.

October 31, 2017 at 1:22 pm
Marcia says:

I am so grateful you have put this on the internet. I finally get what many other web postings could not make clear to me. Thank you!

November 4, 2017 at 1:14 pm
Luce says:

Thank you for the explanation! It’s really helpful for me!

November 9, 2017 at 6:47 am
DesignPond (@DesignPond) says:

Really nice explanation! Thank you 🙂

November 17, 2017 at 4:28 pm
Anon129176495 says:

This article seems to have been used for a youtube video almost word for word. https://www.youtube.com/watch?v=w3fQ_F4ATmU

November 21, 2017 at 5:08 pm
- checkdetector says:
  
  Thanks, I’ll message them now
  
  November 29, 2017 at 9:30 pm
Anna says:

You have given the best answer I’ve seen to date! Thank You!

November 23, 2017 at 4:56 am
Mohammad Abo Almash says:

Very Good explanation. Thanks soo much

December 10, 2017 at 6:15 pm
thecuriousmaverick says:

Preparing for an exam on Business Analytics, and this really helped! Thank you so much!

December 15, 2017 at 11:31 pm
al says:

In the first example, is it only to me the variance with the vertical line seems bigger than the variance with the horizontal line? If the variance is the sum of power of two of red lines, than it obviously bigger with the vertical line.

December 19, 2017 at 10:03 am
- al says:
  
  Ok, now got it. So the variance is the distance from the arrows(projections on the line) to the imaginal center (centroid) of the line. Not the distance (red lines) as I originally though.
  
  December 19, 2017 at 10:32 am
- abbyrjones72 says:
  
  Variance is the spread in the data, expressed by the largest distance between two data points.
  
  January 31, 2018 at 3:41 am
abbyrjones72 says:

Without a doubt the best explanation of Principal Component Analysis I have ever seen. Recommending it for sure.

January 31, 2018 at 3:29 am
M Rosu says:

Excellent! A very enlightening introduction of the concept!

February 8, 2018 at 2:55 pm
Christopher LaFave says:

Oh my gosh thank you so much for this. It is very kind of you. Eigenvalues and eigenvectors are no longer intimidating to me. Neither is PCA, for that matter. I had read countless descriptions of them, all serving to needlessly confuse the issue.

February 27, 2018 at 3:14 am
Surya says:

Thanks for the great explanation. I’m a newbie to data analysis and this article made me understand PCA. However, I have some questions:

1. I don’t understand how PCA is needed for dimension reduction.
The eigenvalue is nothing but variation in direction of Eigne vector. So, why don’t we calculate variance in a column(dimension) and drop it from dataset if it has less variance?
Is it because Eigenvalues find variation not only in x,y,z axes but also other directions?

2. In the 3-D example, we transformed the data around 2 new axes by removing third eigenvector(with 0 eigenvalue). So, what would be these 2 dimensions? How can we know which dimension the 0 valued eigenvector represents?

Could you please clarify my queries?

March 3, 2018 at 3:35 am
JSZ says:

Thank you so much. I am from a non math background and struggled to understand the eigenvector concept till now. Thank you so much

March 16, 2018 at 1:36 am
benjamin altman says:

What an amazing breakdown. Thank you!!

April 10, 2018 at 10:09 pm
Vinay Bagare says:

Excellent write-up! Thanks for sharing!

May 12, 2018 at 9:44 pm
Anupama says:

Thank you for the explanation. It’s really intuitive.

May 13, 2018 at 5:19 pm
rgreen says:

Very clear explanation of what was a confusing topic for me. Best I’ve found. Thanks!

May 18, 2018 at 11:22 pm
rahulrajias says:

Awesome article…
Please provide a similar article on t-SNE and LDA…

Thanks for making the life easier… :p

May 27, 2018 at 9:32 am
kahomomo@l0real.net says:

Wow, really nice explanation!

However, there is still one point missing here I guess:
Especially when you explained the dimension reduction it seems as if each PC would correspond to one variable. But in truth each PC is just the best regression through the cloud and may or may not correlate closely with a variable, right?

Maybe you could add that somewhere so that it does not trick people into thinking that PCs and variables match?

P.S. I didn’t go through all the comments and hope that I don’t just repeat what others already remarked…

June 20, 2018 at 10:33 am
Carolina says:

Great explanation, Thanks!!

July 16, 2018 at 3:19 pm
Krishna says:

Excellent article on Eigen vectors & PCA. Far best than what I have found on internet.

October 13, 2018 at 10:29 am
Manasee Godsay says:

Thank you, this is the most intuitive explanation I have found regarding PCA till date!

December 12, 2018 at 3:33 am
S. Imran says:

Thanks, well explained

December 12, 2018 at 9:46 am
Claudio says:

Thank you very much man, outstanding explanation.

December 28, 2018 at 2:59 pm
Braininjury-Recuperation says:

I was struggling with eigenvalues and eigenvectors for such a long time, You made it very clear.
Thanks a ton!

January 2, 2019 at 3:52 am
Rahul Sengupta says:

Thank you so much!! You’re a godsent!

January 14, 2019 at 12:51 am
RKS says:

Thank. The explanations were so easy to follow and now I am able to see the big picture.

February 17, 2019 at 7:51 pm
Kate says:

Very helpful indeed, so simply explained!

March 8, 2019 at 4:26 pm
Lauren says:

This was one of the best explanations I’ve come across. Thank you!

March 23, 2019 at 10:35 pm
Jigar says:

Simply an Amazingly written article!! Have you written any book .. you should think about writing one.. Would love to buy it…

April 8, 2019 at 8:57 pm
Michael Elzinga says:

Great description and overview – it puts things into perspective for me – I’m studying a Masters degree in Data Science and the course material simply goes directly into the detail with many Greek letters and much confusion
Thanks so much

April 11, 2019 at 8:26 am
Tom Borek says:

Thanks. I have been reading explanations of PCA for a week and this one was worth all the others put together.

April 23, 2019 at 5:23 pm
Jabiru Aliyu says:

Thank you so much !! I’ve seen my focus

May 1, 2019 at 12:24 pm
Obli Narasimharajan, says:

Very informative and useful for a beginner learning PCA . Thanks for taking time to do this.

May 24, 2019 at 2:40 am
Martin says:

Great, since two or three years i’m applying PCA, using the results without really understanding them, though I’ve red a lot of PCA bibles. But it’s the first time I’ve understand the principle at a glance. Thanks a lot George !

June 5, 2019 at 5:52 pm
Bharath Raj says:

Thank you. It is very easy to understand.

June 15, 2019 at 6:39 am
Shreeratan Joshi says:

Too good.

June 17, 2019 at 5:38 am
ASRAM says:

Great explanation on a overly confused topic! Keep writing.

July 1, 2019 at 5:22 am


George Dallas

Data Scientist based in Victoria, BC
