Saturday, 19 December 2015

Notes on "Big Data: New Tricks for Econometrics" (Hal R. Varian)

Big Data: New Tricks for Econometrics
Hal R. Varian
Journal of Economic Perspectives, 2014, Vol. 28(2), pp. 3-28 [Peer Reviewed Journal]

the sheer size of the data involved may require more powerful data manipulation tools
we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection
large datasets may allow for more flexible relationships than simple linear models

Einav and Levin (2013): new, more detailed data is becoming available

Sullivan (2012): Google uses many of these tools

Out-of-sample predictions:
since simpler models tend to work better for out-of-sample forecasts, machine learning experts have come up with various ways to penalise models for excessive complexity, a practice known as regularisation
it is conventional to divide the data into separate sets for the purpose of training, testing, and validation.
the standard way to choose a good value for such a tuning parameter is to use k-fold cross-validation
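
A minimal sketch of this, assuming Python with scikit-learn (my choice of library; the paper specifies no implementation, and the synthetic data and 5-fold choice are illustrative):

    # Illustrative only: scikit-learn's LassoCV picks the LASSO penalty
    # (a tuning parameter) by k-fold cross-validation.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV

    # Synthetic data: 200 observations, 50 candidate predictors, 5 informative.
    X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                           noise=10.0, random_state=0)

    # Each candidate penalty is scored by 5-fold cross-validated error;
    # the penalty with the lowest error is kept.
    model = LassoCV(cv=5).fit(X, y)

    print("chosen penalty:", model.alpha_)
    print("nonzero coefficients:", np.sum(model.coef_ != 0))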

ways to improve classifier performance:
Bootstrap involves choosing a sample of size n, with replacement, from a dataset of size n to estimate the sampling distribution of some statistic. A variation is the “m out of n bootstrap” which draws a sample of size m from a dataset of size n > m.
Bagging involves averaging across models estimated with several different bootstrap samples in order to improve the performance of an estimator (see the sketch after this list).
Boosting involves repeated estimation where misclassified observations are given increasing weight in each repetition. The final estimate is then a vote or an average across the repeated estimates.
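
A hand-rolled sketch of bagging, assuming Python with scikit-learn trees as the base learner; the 100 rounds and the synthetic data are illustrative choices, not from the paper:

    # Illustrative only: each round draws a bootstrap sample (size n, with
    # replacement, from n training observations), fits a tree, and the
    # final prediction is a majority vote across the rounds.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    n = len(ytr)
    preds = []
    for _ in range(100):
        idx = rng.integers(0, n, size=n)   # bootstrap sample of the training set
        tree = DecisionTreeClassifier().fit(Xtr[idx], ytr[idx])
        preds.append(tree.predict(Xte))

    # Bagged prediction: majority vote across the 100 bootstrap trees.
    vote = (np.mean(preds, axis=0) > 0.5).astype(int)
    print("out-of-sample accuracy:", np.mean(vote == yte))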

Random forests is a technique that uses multiple trees. A typical procedure uses the following steps.
1. Choose a bootstrap sample of the observations and start to grow a tree.
2. At each node of the tree, choose a random sample of the predictors to make the next decision. Do not prune the trees.
3. Repeat this process many times to grow a forest of trees.
4. In order to determine the classification of a new observation, have each tree make a classification and use a majority vote for the final prediction.
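
For illustration, these four steps map directly onto scikit-learn's RandomForestClassifier (my choice of library; the parameter values are arbitrary):

    # Illustrative only: the four steps above as scikit-learn options.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=200,     # step 3: grow many trees
        max_features="sqrt",  # step 2: random subset of predictors at each node
        bootstrap=True,       # step 1: each tree sees a bootstrap sample
        random_state=0,
    ).fit(Xtr, ytr)           # trees are grown unpruned by default

    # Step 4: predict() takes a majority vote across the trees.
    print("out-of-sample accuracy:", forest.score(Xte, yte))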

spike-and-slab regression, a Bayesian variable-selection technique
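
A minimal sketch of the spike-and-slab prior, assuming the common formulation in which each coefficient is exactly zero with probability 1 - p and otherwise drawn from a diffuse normal; the numbers are illustrative, not from the paper:

    # Illustrative only: one draw from a spike-and-slab prior.
    import numpy as np

    rng = np.random.default_rng(0)
    k = 20          # candidate predictors
    p = 0.2         # prior probability that a predictor is included
    slab_sd = 2.0   # standard deviation of the slab component

    include = rng.random(k) < p   # spike: most coefficients are exactly zero
    beta = np.where(include, rng.normal(0.0, slab_sd, size=k), 0.0)

    print("included predictors:", np.flatnonzero(include))
    print("coefficient draw:", np.round(beta, 2))

In a full Bayesian fit, the posterior over the inclusion indicators gives each predictor's probability of belonging in the regression.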

