> Tech Talks: Data Representation & Similarity Measures

Notions of similarity and dissimilarity, as well as closeness and distance, are at the heart of the kinds of mathematical models that enable machine learning.

Distance, in a geometric sense, would seem to be a rather rigid concept. But to a mathematician there is, in fact, a surprising degree of freedom in the choice of a distance measure, and different choices give the resulting geometric spaces different mathematical properties.

This video lecture starts by introducing some alternatives to the usual Euclidean distance, including Manhattan distance, and, more generally, the mathematical family of distance measures called Minkowski distance.
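As a rough illustration of how these measures relate, the sketch below computes the Minkowski distance of order p, which reduces to Manhattan distance for p = 1 and to Euclidean distance for p = 2. The function name `minkowski_distance` and the sample points are chosen here for illustration and are not taken from the lecture.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        # For p < 1 the triangle inequality fails, so it is not a metric.
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative points (an assumption, not from the lecture).
a, b = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski_distance(a, b, 1)  # sum of absolute differences
euclidean = minkowski_distance(a, b, 2)  # straight-line distance
```

For these two points the Manhattan distance is 7 and the Euclidean distance is 5, which shows how the choice of p changes what "close" means.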

It then introduces the Mahalanobis distance, which takes into account the covariance structure of a statistical sample, and measures the distance between two points within a reference system that is appropriately decorrelated.
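To make the role of the covariance structure concrete, here is a minimal sketch restricted to two-dimensional data, so that the covariance matrix can be inverted by hand without a linear algebra library. The helper names `covariance_2d` and `mahalanobis_2d` are hypothetical; a general implementation would invert a full covariance matrix instead.

```python
from math import sqrt


def covariance_2d(points):
    """Sample covariance of 2-D points, returned as (sxx, sxy, syy)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance between u and v under a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = u[0] - v[0], u[1] - v[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(max(q, 0.0))  # guard against tiny negative rounding errors
```

With an identity covariance, `(1.0, 0.0, 1.0)`, the result coincides with Euclidean distance. With a larger variance along one axis, differences along that axis count for less.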

We then move on to the similarity and dissimilarity measures that have found widespread application in information retrieval and natural language processing, such as cosine similarity and the set overlap measures of Dice and Jaccard, as well as string edit distances including Hamming distance, Levenshtein distance, and Jaro-Winkler similarity.
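A few of these measures are short enough to sketch directly. The functions below use the standard textbook definitions. Treating two empty sets as fully similar (score 1.0) is an assumption made here, not something the lecture specifies.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_x * norm_y)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def dice_coefficient(a, b):
    """Twice the intersection size divided by the sum of set sizes."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s, t))
```

For the sets {a, b, c} and {b, c, d}, the Jaccard similarity is 2/4 and the Dice coefficient is 4/6, so Dice is never lower than Jaccard on the same pair.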

String edit distances are useful for dealing with misspellings in natural language processing applications such as PANOPTICOM’s media monitoring infrastructure. For example, if the PANOPTICOM machine learner has picked up “regulatory” as a keyword signalling relevance, then, based on the low string edit distance between this keyword and the token “regularoty”, it might be able to recognize the token as a misspelling of the keyword.
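The misspelling example can be sketched with a plain Levenshtein distance. This is only an illustration of the idea, not PANOPTICOM's actual implementation, which the text does not describe. The helper `match_keyword`, the second keyword, and the threshold of 2 edits are all assumptions.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def match_keyword(token, keywords, max_distance=2):
    """Return the closest keyword within max_distance edits, else None.

    Hypothetical helper: the threshold is an illustrative assumption.
    """
    if not keywords:
        return None
    best = min(keywords, key=lambda k: levenshtein(token, k))
    return best if levenshtein(token, best) <= max_distance else None


# "compliance" is a made-up second keyword for the illustration.
keywords = ["regulatory", "compliance"]
matched = match_keyword("regularoty", keywords)
```

Plain Levenshtein distance counts the swapped "t" and "r" in “regularoty” as two substitutions, so the token falls within the assumed threshold and is matched to “regulatory”. A variant that treats a transposition as a single edit would score such typos even lower.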


(Reproduced here, courtesy of PANOPTICOM).