## ML Theory: Conditional Probability 101

Venn diagrams are useful tools for visualizing different probability spaces. Consider two events $C$ and $D$ within some sample space $S$. Suppose $C$ is the event of owning a cat and $D$ is the event of owning a dog. Our Venn diagram might look something like the following for a sample of 450 pet owners.

In other words, out of a sample of 450 pet owners,

• 400 owned a cat exclusively,
• 36 owned a dog exclusively,
• 9 owned both a cat and a dog, and
• 5 owned neither a cat nor a dog.

Obviously, the probability of owning a cat,

$$P(C) = \frac{400 + 9}{450} = \frac{409}{450} \approx 0.91,$$

is much higher than the probability of owning a dog,

$$P(D) = \frac{36 + 9}{450} = \frac{45}{450} = 0.10,$$

as cats are the superior animal. We can also estimate the probability of being one of those people with neither a cat nor a dog as

$$P(\text{neither}) = \frac{5}{450} \approx 0.01.$$

Now, suppose we select a cat owner at random. What is the probability that they are also a dog owner? Well, since we know the individual is a cat owner, we only need to focus on the left hand bubble of our Venn diagram.

The number of individuals of interest to us is therefore $409$, which we will denote $n(C)$. Of those 409, only $9$ are within the intersection (denoted $C \cap D$). We will refer to these 9 as $n(C \cap D)$. The relevant probability can therefore be estimated as

$$P(D \mid C) = \frac{n(C \cap D)}{n(C)} = \frac{9}{409} \approx 0.02.$$

$P(D \mid C)$ represents the probability an individual owns a dog given they have a cat. This is what we refer to as a conditional probability. When calculating a conditional probability, we use additional information to reduce the size of our probability space. In this example, this meant that we could focus solely on the cat owner portion of the pictured Venn diagram.

In general, for two events $A$ and $B$, the law of conditional probability states that the probability of observing $B$ given event $A$ can be represented as

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$

This is exactly the formula we were using for the cats and dogs example,

$$P(D \mid C) = \frac{P(C \cap D)}{P(C)} = \frac{9/450}{409/450} = \frac{9}{409} \approx 0.02.$$

We often use conditional probability in our everyday lives without thinking about it. For example, if you tested positive for an extremely rare disease, you should be thinking that the result is likely a false positive rather than that you have the disease: when a condition is rare enough, even an accurate test produces more false positives than true positives… or maybe not, because you might be freaking out right now. Your doctor, however, should be thinking it! As a result, your doctor would send you for further testing.
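The arithmetic above is easy to sanity-check in a few lines of R; this is just a sketch using the counts from the Venn diagram:

```r
# Counts from the Venn diagram (450 pet owners in total).
n_cat_only <- 400
n_dog_only <- 36
n_both     <- 9
n_neither  <- 5
n_total    <- n_cat_only + n_dog_only + n_both + n_neither  # 450

p_cat  <- (n_cat_only + n_both) / n_total  # P(C) = 409/450
p_dog  <- (n_dog_only + n_both) / n_total  # P(D) = 45/450
p_both <- n_both / n_total                 # P(C and D) = 9/450

# Conditional probability of owning a dog given owning a cat,
# via the law of conditional probability:
p_dog_given_cat <- p_both / p_cat          # 9/409, about 0.022
```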

## Motivation

For observed pairs $(x_i, Y_i)$, $i = 1, \ldots, N$, the relationship between $x$ and $Y$ can be defined generally as

$$Y_i = m(x_i) + \epsilon_i$$

where $m(x)$ is the regression function and the $\epsilon_i$ are random errors with $E[\epsilon_i] = 0$. If we are unsure about the form of $m(x)$, our objective may be to estimate $m(x)$ without making too many assumptions about its shape. In other words, we aim to “let the data speak for itself”.

Non-parametric approaches require only that $m(x)$ be smooth and continuous. These assumptions are far less restrictive than those of alternative parametric approaches, thereby increasing the number of potential fits and providing additional flexibility. This makes non-parametric models particularly appealing when prior knowledge about $m(x)$'s functional form is limited.

## Estimating the Regression Function

If multiple values of $Y$ were observed at each $x_i$, $m(x)$ could be estimated by averaging the responses at each $x_i$. However, since $x$ is often continuous, it can take on a wide range of values, making repeated observations at the same $x_i$ quite rare. Instead, a neighbourhood of $x$ is considered.

Define the neighbourhood around $x$ as $\{x_i : |x_i - x| \le h\}$ for some bandwidth $h > 0$. Then, a simple non-parametric estimate of $m(x)$ can be constructed as the average of the $Y_i$'s corresponding to the $x_i$ within this neighbourhood. That is,

$$\hat{m}(x) = \frac{\sum_{i=1}^{N} K\left(\frac{x_i - x}{h}\right) Y_i}{\sum_{i=1}^{N} K\left(\frac{x_i - x}{h}\right)} \qquad (1)$$

where

$$K(u) = \frac{1}{2}\,\mathbb{1}(|u| \le 1)$$

is the uniform kernel. This estimator, referred to as the Nadaraya-Watson estimator, can be generalized to any kernel function (see my previous blog post). It is, however, convention to use kernel functions of order $\nu = 2$ (e.g. the Gaussian and Epanechnikov kernels).
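A bare-bones version of (1) takes only a few lines of R. The helper below, nw_uniform(), is my own illustrative function (not from any package), applied to simulated data:

```r
# Nadaraya-Watson estimate at a point x0 with the uniform kernel.
nw_uniform <- function(x0, x, y, h) {
  u <- (x - x0) / h
  w <- 0.5 * (abs(u) <= 1)      # uniform kernel weights
  if (sum(w) == 0) return(NA)   # no observations in the neighbourhood
  sum(w * y) / sum(w)           # average of the y's within the window
}

set.seed(1)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.3)

# Estimate m(pi/2); the true value is sin(pi/2) = 1.
nw_uniform(pi / 2, x, y, h = 0.5)
```

With the uniform kernel the weights are constant within the window, so the estimator reduces to a plain local average, exactly as described above.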

## Kernel and Bandwidth Selection

The implementation of a kernel estimator requires two choices:

• the kernel, $K$, and
• the smoothing parameter, or bandwidth, $h$.

Kernels are often selected based on their smoothness and compactness. We prefer a compact kernel as it ensures that only data local to the point of interest is considered. The optimal choice, under some standard assumptions, is the Epanechnikov kernel. This kernel has the advantages of some smoothness, compactness, and rapid computation.

The choice of bandwidth is critical to the performance of the estimator and far more important than the choice of kernel. If the smoothing parameter is too small, the estimator will be too rough; but if it is too large, we risk smoothing out important features of the function. In other words, choosing $h$ involves a significant bias-variance trade-off.

• large $h$: smooth curve, low variance, high bias
• small $h$: rough curve, high variance, low bias

The simplest way of selecting $h$ is to plot $\hat{m}(x)$ for a range of different bandwidths and pick the one that looks best. The eye can always visualize additional smoothing, but it is not so easy to imagine what a less smooth fit might look like. For this reason, it is recommended that you choose the least smooth fit that does not show any implausible fluctuations.

## Cross-Validation Methods

Selecting the amount of smoothing using subjective methods requires time and effort. Automatic selection of $h$ can be done via cross-validation. The cross-validation criterion is

$$CV(h) = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \hat{m}_{-i}(x_i)\right)^2$$

where $-i$ indicates that point $i$ is left out of the fit. The basic idea is to leave out observation $i$ and estimate $\hat{m}_{-i}(x_i)$ based on the other $N - 1$ observations. $h$ is chosen to minimize this criterion.

True cross-validation is computationally expensive, so an approximation known as generalized cross-validation (GCV) is often used. GCV approximates CV and involves only one non-parametric fit for each value of $h$ (compared to CV, which requires $N$ fits for each value of $h$).

In order to approximate CV, it is important to note that kernel smooths are linear. That is,

$$\hat{\mathbf{m}} = S\mathbf{Y}$$

where $S$ is an $N \times N$ smoothing matrix. $S$ is analogous to the hat matrix in parametric linear models.

It can be shown that

$$CV(h) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i - \hat{m}(x_i)}{1 - S_{ii}}\right)^2$$

where $S_{ii}$ is the $i$th diagonal element of $S$ (hence $S_{ii}$ is analogous to $h_{ii}$, the leverage of the $i$th observation). Using the smoothing matrix,

$$GCV(h) = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{Y_i - \hat{m}(x_i)}{1 - \text{tr}(S)/N}\right)^2$$

where $\text{tr}(S)$ is the trace of $S$. In this sense, GCV substitutes the average leverage, $\text{tr}(S)/N$, for each individual leverage $S_{ii}$.

Automatic methods such as CV often work well but sometimes produce estimates that are clearly at odds with the amount of smoothing that contextual knowledge would suggest. It is therefore extremely important to exercise caution when using them and it is recommended that they be used as a starting point.
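As a concrete and entirely illustrative sketch, true leave-one-out CV for a Gaussian-kernel Nadaraya-Watson fit can be brute-forced on simulated data; cv_score() and nw() below are hypothetical helpers, not library functions:

```r
set.seed(42)
x <- runif(100, 0, 2 * pi)
y <- sin(x) + rnorm(100, sd = 0.3)

# Nadaraya-Watson estimate with a Gaussian kernel.
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)
  sum(w * y) / sum(w)
}

# Leave-one-out CV criterion: drop observation i, predict at x_i.
cv_score <- function(h, x, y) {
  mean(sapply(seq_along(x), function(i) {
    (y[i] - nw(x[i], x[-i], y[-i], h))^2
  }))
}

h_grid <- seq(0.05, 1, by = 0.05)
scores <- sapply(h_grid, cv_score, x = x, y = y)
h_grid[which.min(scores)]  # the CV-selected bandwidth
```

Note the cost: each of the 20 candidate bandwidths requires $N = 100$ separate fits, which is exactly the expense that GCV's single-fit-per-$h$ shortcut avoids.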

## Motivation

It is important to have an understanding of some of the more traditional approaches to function estimation and classification before delving into the trendier topics of neural networks and decision trees. Many of these methods build on one another, and thus to truly be a MACHINE LEARNING MASTER, we’ve got to pay our dues. We will therefore start with the slightly less sexy topic of kernel density estimation.

Let $X$ be a random variable with a continuous distribution function (CDF) $F(x)$ and probability density function (PDF)

$$f(x) = \frac{d}{dx}F(x).$$

Our goal is to estimate $f(x)$ from a random sample $\{X_1, \ldots, X_N\}$. Estimation of $f(x)$ has a number of applications, including construction of the popular Naive Bayes classifier.

## Derivation

The CDF is naturally estimated by the empirical distribution function (EDF)

$$\hat{F}(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(X_i \le x)$$

where

$$\mathbb{1}(X_i \le x) = \begin{cases} 1 & \text{if } X_i \le x \\ 0 & \text{otherwise} \end{cases}$$

is the indicator function.

I’m not saying “naturally” to be a jerk! I know the feeling of reading proof-heavy journal articles that end sections with “extension to the d-dimensional case is trivial”; it’s not fun when it’s not trivial to you. $\hat{F}(x)$ essentially estimates $F(x) = P(X \le x)$, the probability of $X$ being less than some threshold $x$, as the proportion of observations in our sample less than or equal to $x$.

It might seem natural to estimate the density as the derivative of $\hat{F}(x)$, but $\hat{F}(x)$ is just a collection of mass points and is not continuous. Instead, let’s consider a discrete derivative. For some small $h > 0$,

$$\hat{f}(x) = \frac{\hat{F}(x+h) - \hat{F}(x-h)}{2h}.$$

This can be re-expressed as

$$\hat{f}(x) = \frac{1}{2Nh}\sum_{i=1}^{N} \mathbb{1}(x - h \le X_i \le x + h) = \frac{1}{2Nh}\sum_{i=1}^{N} \mathbb{1}\left(\left|\frac{X_i - x}{h}\right| \le 1\right)$$

since $x - h \le X_i \le x + h$ if and only if $|X_i - x| \le h$ (draw a picture if you need to convince yourself)! Simplifying further,

$$\hat{f}(x) = \frac{1}{Nh}\sum_{i=1}^{N} k\left(\frac{X_i - x}{h}\right)$$

where

$$k(u) = \frac{1}{2}\,\mathbb{1}(|u| \le 1)$$

is the uniform density function on $[-1, 1]$. Mathemagic!

From our derivation, we see that $\hat{f}(x)$ essentially counts the number of observations within a small distance, $h$, of $x$. The bandwidth $h$ dictates the size of the window over which observations are considered. That is, the bandwidth controls the degree of smoothing. The greater the number of observations within this window, the greater $\hat{f}(x)$ is.

Our derived estimate is a special case of what is referred to as a kernel estimator. The general case is

$$\hat{f}(x) = \frac{1}{Nh}\sum_{i=1}^{N} K\left(\frac{X_i - x}{h}\right) \qquad (1)$$

where $K(u)$ is a kernel function.
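Estimator (1) is short enough to hand-roll in R and compare against the built-in density() function (same Gaussian kernel and bandwidth); kde_at() below is my own sketch of a helper, not a library function:

```r
# Hand-rolled kernel density estimate at a single point x0:
# (1/Nh) * sum K((X_i - x0)/h), with a Gaussian kernel K.
kde_at <- function(x0, X, h) {
  mean(dnorm((X - x0) / h)) / h
}

set.seed(7)
X <- rnorm(500)
h <- 0.3

manual <- kde_at(0, X, h)

# density() evaluates on a grid; interpolate to x = 0 for comparison.
d <- density(X, bw = h, kernel = "gaussian")
builtin <- approx(d$x, d$y, xout = 0)$y

c(manual, builtin)  # the two estimates should agree closely
```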

## Kernel Functions

A kernel function $K(u)$ is any function which satisfies

$$\int_{-\infty}^{\infty} K(u)\,du = 1.$$

The kernel function acts as our weighting function, assigning less mass to observations farther from $x$. This helps to ensure that our fitted curve is smooth.

Non-negative kernels satisfy $K(u) \ge 0$ for all $u$ and are therefore probability density functions. Symmetric kernels satisfy $K(u) = K(-u)$ for all $u$. The Gaussian, or Normal, density is a popular symmetric, non-negative kernel.

The moments of a kernel are defined as

$$\kappa_j(K) = \int_{-\infty}^{\infty} u^j K(u)\,du.$$

The order of a kernel, $\nu$, is defined as the order of the first non-zero moment. For example, if $\kappa_1(K) = 0$ and $\kappa_2(K) > 0$, then $K$ is a second-order kernel and $\nu = 2$. Symmetric non-negative kernels are second-order, and hence second-order kernels are the most common in practice.

Other popular kernels include the Epanechnikov, uniform, bi-weight, and tri-weight kernels. The Epanechnikov kernel is considered to be the optimal kernel as it minimizes the asymptotic mean integrated squared error. Choice of the bandwidth, however, is often more influential on estimation quality than choice of kernel.

## Bandwidth Considerations

As noted above, the bandwidth $h$ determines the size of the envelope around $x$ and thus the number of observations used for estimation. In the case of a Gaussian kernel, $h$ would translate to the standard deviation. For k-nearest neighbours, $h$ would translate to the size of the neighbourhood expressed as the span ($k$ points within the $N$ observation training set).

The infamous bias-variance trade-off must be considered when selecting $h$. If we choose a small value of $h$, we consider a smaller number of observations. This results in higher variance due to the smaller sample size, but less bias, as each $X_i$ will be closer to $x$. As we increase $h$, our window size increases and we consider a larger number of observations. This reduces our variance, but our bias will now be higher, as we are using $X_i$ that are further from $x$ and thus information that might not be particularly relevant.

In other words, if $h$ is too large we will smooth out important information, but if it is too small, our estimate will be too rough and contain unnecessary noise. Choosing $h$ is no easy task, and several methods for bandwidth selection have been proposed, including cross-validation methods, rules of thumb, and visual inspection.

Personally, I prefer to use cross-validation as a starting point since I try to minimize the effect of my own biases on estimation. However, these methods aren’t perfect and if feasible, I will follow this up with visual inspection to ensure that the CV bandwidth makes sense in the context of my data and problem. I will generally select a slightly rougher fit over a smoother fit as it is easier for the human eye to imagine a smoother fit than a rougher fit!
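Base R already ships several automatic bandwidth selectors for density estimation, so comparing a rule of thumb against cross-validation-style selectors takes one line each; a quick sketch on simulated data:

```r
set.seed(123)
X <- rnorm(500)

bw.nrd0(X)  # Silverman's rule of thumb (the density() default)
bw.ucv(X)   # unbiased (leave-one-out) cross-validation
bw.SJ(X)    # Sheather-Jones plug-in selector
```

Passing any of these to density(X, bw = ...) then lets you visually compare the resulting fits, in line with the CV-then-inspect workflow described above.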

## Properties of the Kernel Density Estimator

The kernel density estimator

$$\hat{f}(x) = \frac{1}{Nh}\sum_{i=1}^{N} K\left(\frac{X_i - x}{h}\right)$$

has several convenient numerical properties:

1. If the kernel function $K(u)$ is non-negative, the density estimator is also non-negative.
2. $\hat{f}(x)$ integrates to one, making it a valid density function when $K(u)$ is non-negative.

To prove this, let

$$u = \frac{X_i - x}{h}, \qquad dx = -h\,du.$$

Then, via a change-of-variables (the sign flip reverses the limits of integration),

$$\int \hat{f}(x)\,dx = \frac{1}{Nh}\sum_{i=1}^{N}\int K\left(\frac{X_i - x}{h}\right)dx = \frac{1}{N}\sum_{i=1}^{N}\int K(u)\,du = 1.$$

3. The mean of the estimated density is $\bar{X}$, the sample mean.

Using the following transformation,

$$x = X_i - uh, \qquad dx = -h\,du,$$

and thus

$$\int x\hat{f}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N}\int (X_i - uh)K(u)\,du = \frac{1}{N}\sum_{i=1}^{N}\left[X_i \int K(u)\,du - h\int u K(u)\,du\right].$$

Recall that the kernel function must integrate to one and that for second-order kernels, $\kappa_1(K) = \int u K(u)\,du = 0$.

Therefore,

$$\int x\hat{f}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N} X_i = \bar{X}.$$

4. The variance of the distribution associated with the estimated density is $\hat{\sigma}^2 + h^2\kappa_2(K)$, where $\hat{\sigma}^2$ is the sample variance. That is, the sample variance is inflated by $h^2\kappa_2(K)$ when the density is estimated.

The variance of the estimated density is given by

$$\int x^2 \hat{f}(x)\,dx - \left(\int x \hat{f}(x)\,dx\right)^2.$$

The second moment of the estimated density is

$$\int x^2 \hat{f}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N}\int (X_i - uh)^2 K(u)\,du = \frac{1}{N}\sum_{i=1}^{N}\left[X_i^2 - 2X_i h\,\kappa_1(K) + h^2\kappa_2(K)\right] = \frac{1}{N}\sum_{i=1}^{N}X_i^2 + h^2\kappa_2(K).$$

Thus,

$$\int x^2 \hat{f}(x)\,dx - \bar{X}^2 = \frac{1}{N}\sum_{i=1}^{N}X_i^2 - \bar{X}^2 + h^2\kappa_2(K) = \hat{\sigma}^2 + h^2\kappa_2(K).$$

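These properties can be checked numerically. The sketch below evaluates a Gaussian-kernel KDE (for which $\kappa_2(K) = 1$) on a fine grid and integrates with a simple Riemann sum; the grid, tolerances, and helper code are mine, purely for illustration:

```r
set.seed(99)
X <- rnorm(200, mean = 5, sd = 2)
h <- 0.5

# Evaluate the Gaussian-kernel KDE on a fine grid.
grid  <- seq(min(X) - 5, max(X) + 5, length.out = 10000)
f_hat <- sapply(grid, function(x0) mean(dnorm((X - x0) / h)) / h)
dx    <- grid[2] - grid[1]

total <- sum(f_hat) * dx         # property 2: should be ~ 1
mu    <- sum(grid * f_hat) * dx  # property 3: should be ~ mean(X)
m2    <- sum(grid^2 * f_hat) * dx
v     <- m2 - mu^2               # property 4: ~ sigma_hat^2 + h^2 * kappa_2

sigma2_hat <- mean((X - mean(X))^2)  # biased sample variance, as in the proof
c(total, mu - mean(X), v - (sigma2_hat + h^2))
```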
## Summary

• The empirical distribution function (EDF) assigns a mass of $1/N$ to each observation $X_i$, resulting in a discrete or “jumpy” estimate.
• Kernel density estimators (KDE) estimate $f(x)$ by constructing a neighbourhood around the point of interest $x$. Observations within this neighbourhood are then assigned a mass based on their distance from $x$ via a kernel function, resulting in a smooth estimate.
• Popular kernel choices are the Gaussian and Epanechnikov kernels. These kernels are second-order kernels, meaning they are both proper, symmetric density functions.
• The size of the neighbourhood is dictated by the bandwidth $h$. Special care must be taken when selecting $h$ in order to ensure that the bias-variance trade-off is balanced for your problem. Different methods such as CV are available to assist you with optimal bandwidth selection.
• Choice of bandwidth has a larger impact on estimation quality than your choice of kernel.

I will be extending the kernel density estimator to kernel regression in my future blog posts and conducting a case study in R that uses these methods, stay tuned!

## Advent of Code 2017 in R: Day 2

Day 2 of the Advent of Code provides us with a tab-delimited input consisting of numbers 2-4 digits long and asks us to calculate its “checksum”. The checksum is defined as the sum of the differences between each row’s largest and smallest values. Awesome! This is a problem that is well-suited for base R.

I started by reading the file in using read.delim, specifying header = F in order to ensure that numbers within the first row of the data are not treated as variable names.

When working with short problems like this where I know I won’t be rerunning my code or reloading my data often, I will use file.choose() in my read.whatever functions for speed. file.choose() opens Windows Explorer, allowing you to navigate to your file path.

```r
input <- read.delim(file.choose(), header = F)

# Check the dimensions of input to ensure the data read in correctly.
dim(input)
```

After checking the dimensions of our input, everything looks good. As suspected, this is a perfect opportunity to use some vectorization via the apply function.

```r
row_diff <- apply(input, 1, function(x) max(x) - min(x))
checksum <- sum(row_diff)
checksum
```

Et voilà, the answer is 45,972!

As was the case with Day 1, we are then prompted with a part two. In order to help out a worrisome computer, we now have to find the two evenly divisible numbers within each row, divide them, and add each row’s result.

This is a tad bit trickier but it’s clear we need to work with the modulo operator. We need to identify the two numbers a and b within each row such that a %% b == 0. If a < b, a %% b will just return a so my first thought is that we should sort the rows in ascending order.

```r
# Sort rows of matrix
input <- t(apply(input, 1, sort))
```

You can avoid transposing the matrix if you use some helpful packages but per my previous post, I'm trying to stick to base R *sobs quietly*. I used loops to solve this because we need to iterate through each row, comparing each element to every other element. I did try using vectorization here via sapply,

```r
# Compare all elements in first row of input matrix.
sapply(input[1,], function(x) x %% input[1,] == 0)
```

but this produces a 16 x 16 matrix for each row with a diagonal that needs to be ignored, on top of which we need to find the one TRUE element and map it back to the two matrix indices. I think I would pursue this method further if there was more than one pair of numbers we were searching for but since we areeeeen't....

```r
# Initialize vector to store each row value
row_val <- c(rep(NA, nrow(input)))

# For each row..
for(row in 1:nrow(input)){

  # Compare each element to its succeeding elements..
  for(col in 1:(ncol(input) - 1)){
    for(i in 1:(ncol(input) - col)){

      # If the modulo is equal to 0,
      # set the vector element equal to the division result.
      if(input[row, col + i] %% input[row, col] == 0){
        row_val[row] <- input[row, col + i] / input[row, col]
      }
    }
  }
}

sum(row_val)
```

Our sum is 326, the correct answer. I'd love to see some alternative solutions to part 2; I feel like there is definitely a lot of optimization that could occur here!

## Advent of Code 2017 in R: Day 1

My boyfriend recently introduced me to Advent of Code while I was in one of my “learn ALL of the things!” phases. Every year starting December 1st, new programming challenges are posted daily leading up to Christmas. They’re meant to be quick 5-10 minute challenges, so, wanting to test my programming skills, I figured why not try to do all of them in base R!

I went with base R because I know I can dplyr and stringr my way to victory with some of these challenges. I really want to force myself to really go back to basics and confirm that I have the knowledge to do these things on my own without Hadley Wickham‘s (very much appreciated in any other situation) assistance.

Since I’ve started, I’ve also seen a couple of other bloggers attempt to do these challenges in R so I’m really curious how my solutions will compare to theirs.

The first day of the challenge provides you with a string of numbers and asks you to sum all of the digits that match the next digit in a circular list, i.e. the digit after the last digit is the first digit.

My string was…

1 823175367468399787817925919556533257949337848326497818414334128437968278851855917882222512662542831811539663268114187195289429189836478189892929261479288488324935672874199322488916792823226132512344756982993295126829295392876675577976183799381252852748448729811773986918941559946174694499265175276815861199671546787138152767521948118521735763244574891272648766988187612919293299528277784849656125983978118871923395161918838853269851929814211285377694254521185913418523176895288846247164285158836844576148922578691977898384811383377376823696992393983875599798953764822221799638175754296484433728542865437549935999779267925688137896785237684881279576111813928815279992117687425637761595275826884413957962275496546188486264742349191891362884874875659546319158555538584933574222485547376941121237644659165484616818927895985768133672422143484694612491527119643314433548278743268384859448764847753249895257251511886447562182811827491129839674821313642635776999131466164261278684713548596988923719382271811126956174156347911683236448572471624217628864237184956966459419467476331968773572351761496257559211128617755343565195285387877543123432791959559565864153476545548956193454847429125438722975147288342341319684516275271692519986659188331363884647432116156989251857434622675136631531114577744878186222212692344931183856468588269588939753141393766667323345121696841428813598439424968488655481276119128948545794586652422841519154916855795763338699193118677384386999928446877386622197687399816881894439966146396365878482179627298715527819535557938676815671881362455926457483613441972518788151466583444135964495576865866327876536378966472173653351777429247819214393431839941818829875335181538856135952853377899629627936639438645554444692265397672511388984274918236125358243331935119386278843311385278259616114899223355814469291379171485951665342191784129574916346975147983549271339286151999379196792777311471388845898279651497771798759816548696778698999199814248863116869796381615637421622438619394156635854
3266646516247854435356941566492841213424915682394928959116411457967897614457497279472661229548612777155998358618945222326558176486944695689777438164612198225816646583996426313832539918

My first thought was that I would need to separate this string such that each character was the element of an object, either a vector or a list. I kept things simple and started by just copy-pasting the string into R. I could import it as a .txt file or otherwise but I figured that was unnecessary for such a quick problem. I stored the string as a variable named input.

```r
# Split string after each character.
input_split <- strsplit(input, "")

# As a result, input_split is a list with 1 element:
# a vector containing each character of input as an
# element. Annoying. Let's unlist() it to extract
# *just* the vector.
char_vector <- unlist(input_split)

# The problem now is that if we are going to sum
# the elements of our string, we need them to be
# numeric and not characters. Easy enough...
num_vector <- as.numeric(char_vector)

# Now let's just initialize our sum...
num_sum <- 0

# And use a loop...
for(i in 1:length(num_vector)){

  # If we have the last element of the input string,
  # set the next number equal to the first element
  # of the string, else select element i + 1.
  next_num <- ifelse(i == length(num_vector), num_vector[1], num_vector[i + 1])

  # If our current element is equal to the next element,
  # update the sum.
  if(num_vector[i] == next_num){
    num_sum <- num_sum + num_vector[i]
  }
}

num_sum
```

Our sum is 1390 which is correct, huzzah. Once you complete part 1, it surprises us with part 2 which asks us to now consider the digit “halfway around the circular list”. So, if our list is 10 digits long, we would add element 1 and element 6, element 8 and element 3, etc. We can make a really simple modification to our established loop to solve this…

```r
num_sum <- 0
skip <- length(num_vector) / 2

for(i in 1:length(num_vector)){

  # For the first half of the list we need to
  # add half the length of the vector to the
  # index to get the next number & subtract
  # for the second half.
  next_num <- ifelse(i <= skip, num_vector[i + skip], num_vector[i - skip])

  if(num_vector[i] == next_num){
    num_sum <- num_sum + num_vector[i]
  }
}

num_sum
```

Our answer is 1,232 which is correct. Woo.

I know that loops in R are often discouraged due to their speed, or lack thereof. I think I might come back and play with this one again to see how I can implement vectorization here. That being said, I also think it’s important to know which tool is right for the job. With a problem of this size, I think a loop is just fine.

## Hello world!

Hello world!

I started blogging back when I was a Master’s of Applied Statistics student making my way through some very heavy journal articles. It always took me a while to work my way through journal articles as I often found myself wanting to semi-prove all the results in a paper for myself. Having “wasted” an obscene amount of paper and time working these things out, I decided that I would attempt to translate these scribbles into complete notes… and so, Statisticelle was born!

Another big part of my learning process is applying what I’m reading or deriving to an actual data set. Having based my Master’s degree on clinical trial research and now working in industry, I don’t think that I can actually use/discuss the data relevant to my past research or career without breaking a few laws. However, I think this blog presents itself as a great opportunity to delve into the massive collections of open data now available. I’m going to make an effort to provide references to these data sets and any R code I used – even if no one uses it, I’m hoping that putting it out in the open will encourage me to be a more diligent statistician!

It’s been a while since I’ve written any blog posts but I’ve recently found myself staring at my statistics bookshelf, thinking that I really miss delving into new topics and working my way through all the messy bits. I’m not sure that anyone will really read this but on the off-chance that they do, I hope these blog posts are helpful and not a collection of misinformation! I am trying to understand these topics in a way that makes sense to me and I can only hope that I am doing so correctly. If you do find any errors, do not hesitate to let me know.

Anyways, that’s my mission statement in so many words. Part 2 – let’s see how this turns out!