ML Lecture #3: Data Representation & Statistics

The data samples forming the point of departure for any machine learning problem come about through some kind of natural or computational process, and that process leaves recognizable patterns in the data.

For example, pick out two observable numeric features of the data points in the sample and represent them in a scatter plot. If those features came about through a process that is additive in nature, i.e. each value is the sum of many small independent contributions, then the central limit theorem produces the well-known visual pattern of a normal distribution: data points are densely clustered around some central region and become sparser as one moves away from the centre. If, on the other hand, the process is multiplicative in nature, each value being the product of many small relative changes, then the logarithm of each value is again a sum of many contributions, and the pattern shows the visible distortions of a log-normal distribution: the cluster is skewed, with a long tail stretching away from the dense region.
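As a minimal sketch of these two mechanics (not part of the lecture itself; the number of steps and the shock sizes are chosen arbitrarily), the following Python snippet builds one sample by summing many small shocks and another by multiplying them, then compares the skewness of the results:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=0)

n_points, n_steps = 10_000, 50
shocks = rng.uniform(-0.1, 0.1, size=(n_points, n_steps))

# Additive process: each data point is a sum of many small independent
# shocks; by the central limit theorem the sums are approximately normal.
additive = shocks.sum(axis=1)

# Multiplicative process: each data point is a product of many small
# relative changes; the logarithm of the product is a sum, so the
# products are approximately log-normal.
multiplicative = np.prod(1.0 + shocks, axis=1)

# Skewness makes the difference visible without a plot: close to zero
# for the additive sample, clearly positive for the multiplicative one.
print(f"additive:       skewness = {skew(additive):+.3f}")
print(f"multiplicative: skewness = {skew(multiplicative):+.3f}")
```

The same contrast would show up in a histogram or scatter plot: the additive sample is symmetric around its centre, while the multiplicative one piles up on the left and trails off to the right.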

Looking for patterns like these and recognizing such effects should be one of the first orders of business for any data scientist faced with a machine learning problem: machine learning methods often assume that they are dealing with one type of data or the other, and they show greatly diminished performance when used on the wrong type.
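A rough illustration of the standard remedy (again only a sketch, with invented parameters): if a feature looks like the output of a multiplicative process, a log transform often brings it back into the additive, approximately normal regime that many methods expect.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=1)

# A log-normally distributed feature, as a multiplicative process might
# produce (the parameters are invented purely for illustration).
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Strong positive skewness on the raw scale, but near-zero skewness after
# a log transform, suggests the data is log-normal rather than normal,
# and that downstream methods should work with log(x) instead of x.
print(f"raw scale: skewness = {skew(x):+.2f}")
print(f"log scale: skewness = {skew(np.log(x)):+.2f}")
```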

To further illustrate the mathematical mechanics of stochastic processes that produce normally distributed data, and to introduce the basic intuitions behind them, we also consider examples from insurance and risk management.
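The insurance intuition can be sketched in a few lines as well. The portfolio below is entirely hypothetical (policy count, claim probability, and claim sizes are all invented for illustration), but it shows why a total built from many small independent claims tends toward normality:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical portfolio: 1,000 independent policies, each with a 5%
# chance of a claim whose size is exponentially distributed with mean 10.
n_sims, n_policies = 20_000, 1_000
claim_size = rng.exponential(scale=10.0, size=(n_sims, n_policies))
claim_occurs = rng.random((n_sims, n_policies)) < 0.05
total_loss = (claim_size * claim_occurs).sum(axis=1)

# The total is a sum of many small independent contributions, so it is
# approximately normal; mean and standard deviation alone already give
# a serviceable risk summary.
mu, sigma = total_loss.mean(), total_loss.std()
print(f"mean total loss ~ {mu:.1f}, std ~ {sigma:.1f}")
print(f"99th percentile ~ {np.quantile(total_loss, 0.99):.1f}")
```

Even though each individual claim is highly skewed, the aggregate loss across the whole portfolio is close to symmetric, which is exactly the additive pattern described above.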