In the last few years ‘Big Data’ and ‘Data Mining’ have become the buzzwords of the tech industry. It’s how Facebook knows what adverts to show you, it’s how iPhones correct your typing and, apparently, how the NSA decides whether you are a terrorist. But what do these buzzwords actually mean? What are computers doing when they’re ‘learning’ or ‘mining’? When asked, experts say in a serious tone ‘it’s a very complicated field that isn’t easy to understand’ but they’re lying. The principles are easy to grasp and you don’t need to be an expert to appreciate the potential of this subject or to think of applications yourself!

This post starts simple and gradually gets more in depth, but I’ve tried to make the whole thing accessible to someone who has very little maths knowledge. If you just want a very broad overview you only need to read the first section.

## It’s all Classification

First things first. While all these terms have subtle differences they’re all talking about the same process: Classification. So what is classification? Well take a look at this picture:

The top/left side of this picture has red circles and the bottom/right side has blue squares. Now say that someone pointed to a place in that picture and asked you to say whether a circle or square belonged there, e.g.:

There should be a circle where the yellow cross is and a square where the black cross is, right? You’ve just classified data! Think of classification in this way: deciding which parts of the picture belong to the circle and which parts belong to the square.

If you were only given one square and one circle it would be pretty hard to decide which areas of the picture belong to the circle and which belong to the square, but if you were given a million circles and a million squares it would be really easy. The circles and squares have already been classified as circles and squares, therefore they are classified data. Let’s make all this into a rule:

The more data you have already classified, the easier it is to decide how new data will be classified

So how does deciding between circles and squares have any practical use? Well, let’s say I’m a teacher and I think that you can estimate whether someone will be earning over £40,000 a year from their IQ and test scores. I get the IQ, test scores and earnings of all my previous students. I then classify their earnings into two categories: above £40,000 a year and below £40,000 a year. I can then plot a graph like this (above £40,000 = red circle, below £40,000 = blue square):

Now if I want to estimate if young Jimmy will earn over £40,000 a year in the future, I can put an ‘X’ on this graph at where his IQ and test scores are and estimate whether he is a red circle or a blue square; whether he will earn over £40,000 a year or not (for the record the graph isn’t real, I have no idea if IQ or test score correlate with income).
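To make the teacher example a bit more concrete, here is a minimal sketch in Python of how past students could be turned into circles and squares. The student records, the name `past_students` and the helper `label` are all invented for illustration; only the £40,000 cut-off comes from the example above.

```python
# Hypothetical records: (IQ, test score, yearly earnings in pounds).
# These numbers are invented for illustration only.
past_students = [
    (125, 88, 52000),
    (118, 91, 47000),
    (96, 64, 28000),
    (102, 58, 31000),
]

THRESHOLD = 40000  # the cut-off used in the teacher example


def label(earnings):
    """Turn a salary into one of the two classes from the graph."""
    return "circle" if earnings > THRESHOLD else "square"


# Each point on the graph is (IQ, test score) plus its class.
classified = [((iq, score), label(pay)) for iq, score, pay in past_students]

for point, shape in classified:
    print(point, shape)
```

The list `classified` is the 'already classified data' that everything else relies on: two numbers saying where the point sits on the graph, and a label saying which shape it is.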

There are two important (and related) things to realize about this example:

- The only thing helping me classify new bits of data is already classified bits of data. There are no fancy formulas or complex algorithms; what’s helping me decide if something is a circle or a square is the circles and squares that are already on the graph, i.e. data about past students.
- I have to do very little work. All I say is ‘I think x and y affect the classification of z’ and plot it. If there are clear areas where squares are and clear areas where circles are, then x and y probably affect z. If the circles and squares are everywhere and there is no clear pattern, then x and y probably don’t affect z.

Let me clarify the second point a bit. Look at the second picture again and think about why you could easily classify the crosses. The reason is that there was a pattern in the data: a clear circle area and a clear square area. Patterns only happen if variables are related to one another. So there is a nice pattern between income, IQ and test scores because they’re related to each other. But if I plotted income, eye color and height there would not be a pattern because they’re not related to each other. It would be much harder to decide if a point was in a circle area or a square area. It would look like this:

Classification is therefore a way of spotting patterns in data. As humans are pattern spotting creatures, hopefully you can start to see the potential this has…

## How computers classify

When you classified the yellow cross and the black cross it was easy: you could instinctively tell which was a square and which was a circle. But computers don’t have instinct, so they need a set of rules to decide how to classify. Two common ways of making them decide are the nearest neighbor classifier and the SVM classifier.

### Nearest Neighbor

The nearest neighbor classifier is pretty self-explanatory. When the computer has to decide if a certain point is a circle or a square it looks at the point’s nearest neighbors, sees what they are, and then classifies the point according to a majority vote. So in the graph below the computer looks at the 3 nearest neighbors:

The yellow cross sees that its neighbors are all circles so says ‘well, I’m probably a circle’ and classifies itself. The same goes for the black cross, whose neighbors are all squares.
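The vote described above can be sketched in a few lines of Python. This is only an illustration: the coordinates are made up to mimic the picture (circles top/left, squares bottom/right), and the function name `nearest_neighbor_vote` is my own. It uses straight-line distance to decide who the neighbors are.

```python
import math
from collections import Counter


def nearest_neighbor_vote(new_point, classified, k=3):
    """Classify new_point by a majority vote of its k closest points."""
    # Sort the already classified points by distance to the new point.
    by_distance = sorted(classified, key=lambda item: math.dist(new_point, item[0]))
    # Count the shapes of the k closest ones and pick the most common.
    votes = Counter(shape for _, shape in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up coordinates: circles top/left, squares bottom/right.
classified = [
    ((1, 8), "circle"), ((2, 9), "circle"), ((3, 7), "circle"),
    ((7, 2), "square"), ((8, 3), "square"), ((9, 1), "square"),
]

yellow_cross = (2, 8)
black_cross = (8, 2)
print(nearest_neighbor_vote(yellow_cross, classified))  # circle
print(nearest_neighbor_vote(black_cross, classified))   # square
```

Using an odd number of neighbors (here 3) means a two-way vote can never end in a tie.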

### SVM

In its simplest form, SVM (support vector machine) is a fancy name for a straight line, chosen so that it leaves the widest possible gap between the two groups. The SVM draws a line between the circles and the squares like so:

and then says ‘anything above the green line is a circle, anything below it is a square’. Therefore when it is given the yellow cross it says ‘this is above the green line so it’s a circle’.
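Once the line exists, using it is very little work. The sketch below, in Python, shows only that last step: checking which side of a line a point falls on. The line itself (`w` and `b`) is a hypothetical one I picked by hand; working out the best line from the classified data is the SVM’s real job and is not shown here.

```python
def side_of_line(point, w, b):
    """Return 'circle' if the point is above the line, else 'square'.

    The line is every (x, y) where w[0]*x + w[1]*y + b == 0.
    """
    x, y = point
    score = w[0] * x + w[1] * y + b
    return "circle" if score > 0 else "square"


# Hypothetical line y = x, written as -1*x + 1*y + 0 = 0.
# A real SVM would work these numbers out from the classified data.
w, b = (-1.0, 1.0), 0.0

print(side_of_line((2, 8), w, b))  # above the line
print(side_of_line((8, 2), w, b))  # below the line
```

Notice that, unlike the nearest neighbor approach, the computer no longer needs to keep the old circles and squares around once it has the line.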

Both these techniques are valid ways of making the computer decide how to classify data. Deciding which one to use depends on the nature of the problem, but that’s a bit too complex for this post.

### Why we need computers to Classify for us

In the example with the teacher it would have been easy for a human to look at the graph and decide Jimmy’s economic outlook, so why bother getting computers to do this for us? The answer has to do with dimensions.

Again, let’s go back to the example of the teacher classifying whether his students will earn over £40,000. If he wanted to see the relationship between IQ and future income it would be a one dimensional problem because there is only one thing he is measuring: IQ. For a 1-D problem you can plot it on a straight line like so (remember red circle = above 40k, blue square = below 40k):

If he wanted to look at the relationship between IQ, test scores and future income then his problem would be two dimensional. This is because he now is measuring 2 things and seeing what the classification is. A 2-D problem must be plotted in two dimensions, which is a plane (it’s easy to think of it like a piece of paper. Height and width but no thickness). So plotting in 2-D is what the original graph was:

Now the teacher wants to include the height of the student to see if there are any patterns there. Now he needs to plot IQ, Test Score and Height. There are 3 things to measure so it’s a 3-D problem. The graph must therefore be plotted in 3-D. 3-D is the world we live in (height, width and depth), so it would look something like this:

What if the teacher wanted to add eye color? You guessed it, it would be a 4-D problem. So what does 4-D look like? It is literally impossible for us to visualize. For 1-D and 2-D problems it is pretty easy for humans to classify what are circles and what are squares, but once you want to start looking at patterns between more than 3 variables we cannot do it, and this is where computers come in. Computers can handle more variables with ease, so a 1000 dimensional problem isn’t too hard for them to handle (that may sound like an exaggeration, but it’s not. Problems above 1000 dimensions are pretty common!). So computers can take in huge numbers of variables (eye color, height, IQ, birth weight, blood sugar levels, you name it) and find patterns in the data (i.e. classifying circles or squares). These are patterns that we would never be able to visualize. Remember, the only thing helping the computer find the pattern is the data that we’ve already put into it and classified. That’s all it needs: after you give it lots of data it does the rest of the work in classifying new bits of data.
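To see why extra dimensions don’t bother a computer, here is a small Python sketch. Measuring the distance between two points is the same calculation whether each point has 2 numbers or 1000, so 'find the closest classified point' works unchanged. The students and their measurements are invented for illustration.

```python
import math

# Hypothetical students described by five measurements each:
# (IQ, test score, height in cm, birth weight in kg, blood sugar in mmol/L)
known = [
    ((125, 88, 180, 3.4, 5.1), "circle"),
    ((96, 64, 172, 3.1, 5.6), "square"),
]
jimmy = (120, 85, 175, 3.3, 5.2)

# Pick the classified student who is closest to Jimmy in all five dimensions.
closest_point, closest_shape = min(known, key=lambda item: math.dist(jimmy, item[0]))
print(closest_shape)

# The same call copes with 1000 variables per point.
a = [0.0] * 1000
b = [1.0] * 1000
print(math.dist(a, b))  # the square root of 1000
```

One caveat the sketch glosses over: when variables are on very different scales (IQ points versus kilograms), the big numbers dominate the distance, so in practice the variables are usually rescaled first.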

### How is all this classification actually used in the real world?

It’s all been a bit abstract so far, but hopefully you understand enough to get some of the practical applications of classification. I’ll quickly talk about two. One is IBM’s method of predicting crime and the other is face recognition.

#### Predicting Crime

One of my favorite applications of classifying has been IBM’s crime predictor. Their classifier takes in huge numbers of variables about New York City every day (weather, major sports events, time of day, time of year etc.) and then looks at where crime is happening in the city. So the thing that’s being classified is whether a crime is happening in a certain area or not (crime in lower Manhattan = yes/no = circle/square). After collecting information for a few years they started using the classifier to predict where crime will happen that day. So at the beginning of the day they tell the classifier ‘today it is sunny, a Monday, there’s a Yankees game on etc. etc.’ and the classifier will look at the data on similar days, and then classify what will be ‘crime hotspots’ for the day. Ambulances and police cars then patrol those areas more and wait for the crime to happen. Apparently this has genuinely helped reduce the emergency response time in New York. Awesome!

#### Face Recognition

A common use for classifying is face recognition. The classifier is given thousands of images of people and is told where the face is in each image (this is done by drawing a box round each face in each image). The classifier then takes the pixel locations, the color of each pixel and their relation to each other as its variables, and classifies whether there is a face in the image or not. Because it has been shown thousands of times what a face looks like in an image, it can determine if there is a face in a new image and where it is (face or no face, circle or square). This technology is really effective and is computationally lightweight enough to use on a standard digital camera.

### Conclusion

Big Data, Machine Learning and Data Mining all refer to the process of classifying data. ‘Big Data’ refers to the fact that the more data you can use, the more effective the classifier will be (look at the rule we established early on). ‘Data Mining’ refers to the fact that classifiers can spot patterns in data that are in too high a dimension for us to comprehend. ‘Machine Learning’ refers to the fact that a classifier seems to ‘learn’ the way to classify because it uses previous results to inform future decisions.

It is easy to make sweeping predictions about what the future holds for Big Data, but my guess is as good as anyone’s. It may be useful to look to the past in order to put this phenomenon in context. The Victorians loved to classify. They went around the world collecting all sorts of bugs, animals, bits of ancient civilizations and so on. It was driven by the faith that if you could categorize and classify everything, humans would be more knowledgeable about the world and closer to an objective truth. Big Data is based on the same belief that we can find patterns in nature by classifying and measuring as much as we can. I would argue that Big Data is the ultimate Victorian machine; the only reason it has been realised 100 years later is that we now have the technology to put these ideals into practice.

Insightful article on big data…love the example of crime prediction in NYC

SVM should be support vector machines, rather than standard vector machines


I love your simplification of the science, very effective. Could you go one step further and explain how some companies can back test? Is that when they run their machines for a couple of years and apply the machine learnt rules to historical data?