        Lecture notes - Columbia Applied Data Science
         Contents
10.3.4 Multiprocessing Pools
10.3.5 Multiprocessing example: Stream processing text files
10.3.6 Numba
10.3.7 Cython

What is data science?

With the major technological advances of the last two decades, coupled in part with the internet explosion, a new breed of analyst has emerged. The exact role, background, and skill set of a data scientist are still in the process of being defined, and it is likely that by the time you read this some of what we say will seem archaic.

In very general terms, we view a data scientist as an individual who uses current computational techniques to analyze data. Now you might make the observation that there is nothing particularly novel in this, and subsequently ask what has forced the definition. After all, statisticians, physicists, biologists, finance quants, etc. have been looking at data since their respective fields emerged. One short answer comes from the fact that the data sphere has changed and, hence, a new set of skills is required to navigate it effectively. The exponential increase in computational power has provided new means to investigate the ever growing amount of data being collected every second of the day. What this implies is that any modern data analyst will have to make the time investment to learn the computational techniques necessary to deal with the volumes and complexity
This formulation is useful because it allows one to explore questions of model error. For example, suppose $e$ depends on $x$. This could arise if there is error in some of the labels, and this error depends on $x$ (suppose it is more difficult to determine recovery in older patients). More generally, $e$ represents the deviation of the latent variable $z$ from our modeled value $x \cdot w$. Re-phrasing our conversation at the beginning of chapter 5, there is no need to assume that the world is fundamentally random. There exists some set of variables $(x, v)$ that tell us with certainty whether or not a person will recover (for example, the position and state of every particle in the universe). It just happens that in our data set we do not have access to all of them (we only have $x$), or to the correct functional form of their relationship to the observed variable $y = 1_{z > 0}$. If the true form is $z = f(x, v)$, we can write

$$z = x \cdot w + e, \qquad \text{where } e := f(x, v) - x \cdot w. \tag{6.3}$$

In other words, $e$ represents the uncertainty in our model.

Exercise 6.3.1. Referring to the paragraph above (6.3), restate the issue of mislabeled data as a problem of modeling error.

6.2 Determining the regression coefficient w

Before we proceed, we should define a couple of things. A matrix $A$ is said to be positive definite if it is symmetric and all of its eigenvalues are positive. Rather than computing the eigenvalues, one can check that, for every nonzero vector $v$, $v^T$
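As an aside, this test is easy to run numerically. Below is a minimal sketch (our own illustration, not from the text) assuming numpy; the function name is ours.

    import numpy as np

    def is_positive_definite(A, tol=1e-10):
        # Positive definite: symmetric, and all eigenvalues positive,
        # which is equivalent to v^T A v > 0 for every nonzero v.
        if not np.allclose(A, A.T):
            return False
        return np.linalg.eigvalsh(A).min() > tol

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    print(is_positive_definite(A))   # True: the eigenvalues are 1 and 3

In practice, attempting a Cholesky factorization (np.linalg.cholesky) and catching the failure is a cheaper test of the same property.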
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: data mining, inference, and prediction, third ed., Springer, 2009.

[KP84] B. W. Kernighan and R. Pike, The Unix programming environment, Prentice Hall, 1984.

[Ray04] Eric S. Raymond, The art of Unix programming, Addison-Wesley, 2004.

[Tal05] Nassim Nicholas Taleb, Fooled by randomness: the hidden role of chance in life and in the markets, 2nd updated ed., Random House Trade Paperbacks, 2005.

[Tal10] Nassim Nicholas Taleb, The black swan: the impact of the highly improbable, with a new section "On robustness and fragility", 2nd ed., Random House Trade Paperbacks, 2010.

[Wik] Wikipedia, Bell labs, http://en.wikipedia.org/wiki/Bell_labs.

[Wila] G. Wilson, Software carpentry, http://software-carpentry.org.

[Wilb] G. Wilson et al., Software carpentry NYC bootcamp 2013, http://software-carpentry.org/bootcamps/2013-01-columbia.html.

[WPF] Wikipedia, Finite state machine, http://en.wikipedia.org/wiki/Finite_state_machine.

[WPR] Wikipedia, Regular expression, http://en.wikipedia.org/wiki/Regular_expression.
One difficulty that beginners have with Git is understanding that as you are working on a file, there are at least four different versions of it:

1. The working copy that is saved in your computer's file system.

2. The saved version in Git's index (a.k.a. staging area). This is where Git keeps the files that will be committed.

3. The commit current at your HEAD. Once you commit a file, it (and the other files committed with it) are saved forever. The commit can be identified by a SHA-1 hash. The last commit in the currently checked-out branch is called HEAD.

4. The commit in your remote repository. Once you have pulled down changes from the remote repository, and pushed your changes to it, your repository is identical to the remote.

2.6 Common Git Workflows

Here we describe common workflows and the steps needed to execute them. You can test out the steps here (except the remote steps) by creating a temporary local repository:

    cd /tmp
    mkdir repo
    cd repo
    git init

After that you will probably want to quickly create a file and add it to the repo. Do this with

    echo 'line1' > file
    git add file

Practice this, and subsequent subsections, on your own. Remember to type git lola, git status, and git log frequently to see what is happening.

You can then add other lines with e.g. echo 'line2' >> file. When you are done, you can clean up with rm -rf repo.

To set up a remote r
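For the linear move from working copy to remote, a typical cycle looks like the following sketch; it assumes a remote named origin and a branch named master (the defaults at the time), which are ours to name here.

    echo 'line2' >> file          # 1. modify the working copy
    git add file                  # 2. stage the change in the index
    git commit -m "Add line2"     # 3. record it; HEAD now points at this commit
    git push origin master        # 4. sync the commit to the remote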
We decide haphazardly on $N$ experimental inputs, which together form the data matrix $X$,

$$X = \begin{pmatrix} X_{11} & \cdots & X_{1K} \\ \vdots & & \vdots \\ X_{N1} & \cdots & X_{NK} \end{pmatrix}.$$

The first row is the first experimental input. Call this $X_1$. Later on we will see how our choice of $X$ affects our chances of not ending up as a furball. We perform the first experiment and measure the output $Y_1$. We know that if $w_{true} = w$, the output will be

$$Y_1 = X_1 \cdot w + E_1.$$

Given this $w$, $Y_1 \sim N(X_1 \cdot w, \sigma^2)$. Using the fact that the likelihood is $p(Y_1 \mid w) = p_E(Y_1 - X_1 \cdot w)$, the likelihood after one experiment is

$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(X_1 \cdot w - Y_1)^2}{2\sigma^2} \right\}.$$

There are a number of $w$ that make $(X_1 \cdot w - Y_1)^2 = 0$ (the problem is underdetermined since we have $K$ unknowns and only one equation). One such $w$ is $w_{ML} = (Y_1 / |X_1|^2)\, X_1$. Combining this with $Y_1 = X_1 \cdot w_{true} + E_1$, we get

$$w_{ML} = \frac{X_1 \cdot w_{true} + E_1}{|X_1|^2}\, X_1.$$

This is less than satisfactory. Suppose $X_1$ pointed in a direction almost orthogonal to $w_{true}$; then our output would be entirely dominated by the noise $E_1$. Our prior knowledge about $w$ can then help us. We multiply the prior and likelihood and form the posterior

$$p(w \mid Y_1) \propto \exp\left\{ -\frac{(X_1 \cdot w - Y_1)^2}{2\sigma^2} \right\} \exp\left\{ -\frac{\|w\|^2}{2\sigma_w^2} \right\},$$

where $\|w\|^2 := \sum_k w_k^2$ defines the $\ell^2$ vector norm. Our MAP estimate is

$$w_{MAP} := \arg\min_w \left\{ \frac{(X_1 \cdot w - Y_1)^2}{2\sigma^2} + \frac{\|w\|^2}{2\sigma_w^2} \right\}.$$

Since the term involving $\|w\|^2$ is a sum of $K$ components, and the term

[Figure 5.1: (a) the prior and likelihood; (b) the posterior.]
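Once all $N$ rows are in, this Gaussian MAP estimate has the closed form $w_{MAP} = (X^T X + \delta^2 I)^{-1} X^T Y$ with $\delta = \sigma/\sigma_w$ (cf. (5.21) later in the chapter). A quick numerical sketch of this (ours, with made-up data):

    import numpy as np

    np.random.seed(0)
    N, K = 50, 3
    sigma, sigma_w = 0.5, 2.0        # noise and prior standard deviations
    X = np.random.randn(N, K)
    w_true = sigma_w * np.random.randn(K)
    Y = X.dot(w_true) + sigma * np.random.randn(N)

    # Closed-form MAP (ridge) solution
    delta2 = (sigma / sigma_w) ** 2
    w_map = np.linalg.solve(X.T.dot(X) + delta2 * np.eye(K), X.T.dot(Y))
    print(w_map, w_true)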
$p(w)/p(Y)$ (5.3). This posterior will be used to quantify our uncertainty about the coefficients after measuring our training data. Moreover, given a new input $x$, we can characterize our uncertainty in the response $y$ by

$$p(y \mid x, Y) = \int p(y \mid w, x)\, p(w \mid Y)\, dw. \tag{5.4}$$

Above we used the fact that $p(y \mid w, x, Y) = p(y \mid w, x)$, since, once we know $w$ and $x$, $y$ is determined as $y = x \cdot w + E$.

For this text though we will usually content ourselves with the more tractable prediction $y = x \cdot w_{MAP}$, where $w_{MAP}$ is the maximum a posteriori estimate

$$w_{MAP} := \arg\max_w p(w \mid Y).$$

¹For simplicity we always assume our distributions are absolutely continuous with respect to Lebesgue measure, giving rise to density functions.

In other words, we can estimate a "best guess" $w$, then feed this into (5.2), ignoring the error term.

The other characters in (5.3) are the likelihood $p(Y \mid w)$, the prior $p(w)$, and the term $p(Y)$. The term $p(Y)$ does not depend on $w$, so it shall be treated as a constant and is generally ignored. The likelihood is tractable since, once $w$ is known, we have $Y = Xw + E$, which implies

$$p(Y \mid w) = p_E(Y - Xw),$$

where $p_E$ is the pdf of $E$. In any case, the likelihood can be maximized to produce the maximum likelihood (ML) solution

$$w_{ML} := \arg\max_w p(Y \mid w).$$

The prior is chosen to reflect our uncertainty about $w$ before measuring $Y$. Given a choice of prior and error model, the prior does indeed become the
Applied Data Science

Ian Langmore and Daniel Krasner

Contents

I Programming Prerequisites

1 Unix
  1.1 History and Culture
  1.2 The Shell
  1.3 Streams
    1.3.1 Standard streams
    1.3.2 Pipes
  1.4 Text
  1.5 Philosophy
    1.5.1 In a nutshell
    1.5.2 More nuts and bolts
  1.6 End Notes

2 Version Control with Git
  2.1 Background
  2.2 What is Git
  2.3 Setting Up
  2.4 Online Materials
  2.5 Basic Git Concepts
  2.6 Common Git Workflows
    2.6.1 Linear Move from Working to Remote
    2.6.2 Discarding changes in your working copy
    2.6.3 Erasing changes
    2.6.4 Remotes
    2.6.5 Merge conflicts

3 Building a Data Cleaning Pipeline with Python
  3.1 Simple Shell Scripts
  3.2 Template for a Python CLI Utility

II The Classic Regression Models

4 Notation
  4.1 Notation for Structured Data

5 Linear Regression
  5.1 Introduction
  5.2 Coefficient Estimation: Bayesian Formulation
    5.2.1 Generic setup
    5.2.2 Ideal Gaussian World
  5.3 Coefficient Estimation: Optimization Formulation
$\langle Y, u_1 \rangle, \ldots, \langle Y, u_K \rangle$. Since we are not dividing by $\lambda_k$, it appears the problem is robust to noise.

3. Use the relation $X w_{ML} - Y = 0$ to show that

$$w_{ML} = \sum_{k=1}^{K} \frac{\langle Y, u_k \rangle}{\lambda_k} v_k$$

as always. So if we have small $\lambda_k$, the problem is not robust to

[Figure 5.4: Left: fitting height vs. age with a linear function. Right: fitting with a segmented function. (Fake data.)]

The above exercise shows that the robustness of your coefficients to noise is highly dependent on the variable representation you choose. Some linear combinations of coefficients will be robust, and others won't. At the end of the day, however, your predictive model $Xw$ is unchanged by a change of basis in the ML formulation. The Bayesian MAP solution is a different story. The best advice to give is to choose your prior wisely.

5.4.3 Nonlinear transformations and segmentation

Suppose we try to model height as a function of age. From birth to 17 years of age we would see a fairly steady increase in mean height. After that, mean height would level out, until during old age it would decrease. If we try to use age by itself, then the model will have a hard time fitting this nonlinear curve. See figure 5.4, left. A better alternative would be to put in a nonlinear transformation of age (a sketch of the idea follows below). Perhaps take
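A minimal sketch of the segmentation idea (ours, with fabricated data): fit height on age once with a single linear term, and once with piecewise-linear transformations of age.

    import numpy as np

    np.random.seed(1)
    age = np.random.uniform(0, 80, 200)
    # Fake data: steady growth to 17, flat until 60, slight decline after
    height = (50 + 7 * np.minimum(age, 17)
              - 0.3 * np.maximum(age - 60, 0)
              + 3 * np.random.randn(age.size))

    # Plain linear fit: height ~ w0 + w1 * age
    A_lin = np.column_stack([np.ones_like(age), age])
    w_lin = np.linalg.lstsq(A_lin, height, rcond=None)[0]

    # Segmented fit: one slope per age segment
    A_seg = np.column_stack([np.ones_like(age),
                             np.minimum(age, 17),         # growth years
                             np.clip(age - 17, 0, 43),    # plateau years
                             np.maximum(age - 60, 0)])    # old age
    w_seg = np.linalg.lstsq(A_seg, height, rcond=None)[0]

    for A, w in [(A_lin, w_lin), (A_seg, w_seg)]:
        print(np.sum((A.dot(w) - height) ** 2))   # segmented fit is far better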
L1 regularization results in insignificant coefficients being set exactly to zero. This is nice because it effectively removes them from the model, which means that the effective model can be significantly simpler than in L2 regularization. More precisely:

Theorem 6.8. With $L(w)$ the negative log likelihood given by (6.4), suppose that the columns of the data matrix $X$ are linearly independent and that $\alpha_k > 0$ for all $k$. Then the solution $w^*$ to (6.7) exists and is unique. Moreover, for every $k$, exactly one of the following holds:

(i) $\left| \frac{\partial L}{\partial w_k}(w^*) \right| = \alpha_k$ and $w_k^* \neq 0$, or

(ii) $\left| \frac{\partial L}{\partial w_k}(w^*) \right| < \alpha_k$ and $w_k^* = 0$.

Remark 6.9. This means that $\alpha_k$ sets a level at which the $k$th variable must affect the log likelihood in order for its coefficient to be nonzero. This is contrasted with L2 regularization, which tends to result in lots of small coefficients. This can be expected since, for small $w_k$, the penalties satisfy $|w_k| > w_k^2$. In fact, one could use terms such as $|w_k|^q$ for $0 < q < 1$ and achieve even more sparsity. In practice this would lead to difficulty since the problem would no longer be convex.

Proof. Since the likelihood is bounded (it is always less than one), $L(w)$ is always greater than zero. Let $M := \max\{L(w) : |w| \leq 1\}$ and $\alpha_m := \min\{\alpha_1, \ldots, \alpha_K\}$. Then, for $|w| > M/\alpha_m$ we will have

$$L(w) + \sum_k \alpha_k |w_k| > \sum_k \alpha_k |w_k| \geq \alpha_m \sum_k |w_k| \geq \alpha_m |w| > M.$$

In other words, a local minimum must occur in th
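To see this sparsity in practice, here is a small sketch (ours, not from the text) using scikit-learn's L1-penalized logistic regression; with a strong enough penalty, most coefficients come out exactly zero.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    np.random.seed(0)
    X = np.random.randn(200, 10)
    # Only the first two features actually matter
    y = (2 * X[:, 0] - 3 * X[:, 1] + np.random.randn(200) > 0).astype(int)

    clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
    clf.fit(X, y)
    print(clf.coef_)   # most entries are exactly 0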
This is somewhat wasteful however, since we never actually need the entire list at once. All we need is some way to step through the numbers that would have been in the list. This same end can be achieved by replacing range with xrange. The speed difference between using range and xrange is small, but the memory savings can be significant, and it is generally good practice in Python 2.7 to use xrange.²

In general, an iterator is an object with a directed one-dimensional structure and a next() method that allows us to iterate through that structure. A list is an iterable since it can be converted to an iterator. A generator is an iterator that is tied to a function (so that some function is called to provide the next element).

List comprehensions can be replaced with expressions that produce iterators that have a next() method:

    >>> mylist = ['a', '0', '1', 'b']
    >>> mygen = (item for item in mylist if item.isalpha())
    >>> mygen.next()
    'a'
    >>> mygen.next()
    'b'
    >>> mygen.next()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration

²Note that in Python 3.0, range will automatically be converted to xrange inside for loops, and xrange will no longer be used. So if you want your code to be Python 3.0 compliant, use range.

A common use of iterators that we have made extensive use of is the file object.
are given by

$$\hat{w}_k = \begin{cases} \dfrac{\langle u_k, Y \rangle}{\lambda_k}, & 1 \leq k \leq r, \\ \text{anything}, & r + 1 \leq k \leq K. \end{cases}$$

- Setting $\hat{w}_k = 0$ for $k > r$ gives us the so-called minimal norm solution:

$$\arg\min \left\{ \|w\| : w \text{ is a solution to } (5.7) \right\}.$$

- The (Moore-Penrose) pseudoinverse $X^\dagger$ is defined by

$$X^\dagger Y := \sum_{k=1}^{r} \frac{\langle u_k, Y \rangle}{\lambda_k} v_k.$$

In other words, $X^\dagger Y$ is the minimal norm solution to the least squares problem. One could just as easily truncate this sum at $m < r$, giving rise to $X_m^\dagger$.

There exist a variety of numerical solvers for this problem. So long as the number of covariates $K$ is small, most will converge on (approximately) the solution given above. The $1/\lambda_k$ factor however can cause a huge problem, as we will now explain. Recall that our original model was

$$Y = X w_{true} + E = \sum_{k=1}^{K} \lambda_k \langle v_k, w_{true} \rangle u_k + \sum_{n=1}^{N} \langle E, u_n \rangle u_n.$$

Inserting this into the expression $\hat{w}_j = \langle u_j, Y \rangle / \lambda_j$, we have

$$\hat{w}_j = \langle v_j, w_{true} \rangle + \frac{\langle E, u_j \rangle}{\lambda_j}. \tag{5.12}$$

The term $\langle X w_{true}, u_j \rangle$ is our modeled output in direction $u_j$. Similarly, $\langle E, u_j \rangle$ is a measure of the unmodeled output in direction $u_j$. If our unmodeled output in this direction is significant, our modeling coefficient $\hat{w}_j$ will differ from the "true" value. Moreover, as $\|X v_j\| = \|\lambda_j u_j\| = \lambda_j$, the singular value $\lambda_j$ can be seen as the magnitude of our covariates pointing in direction $v_j$. If our covariates do not have significant power in this direction, the error in $\hat{w}_j$ will be amplified. Thus, the goal for statistical in
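A numerical sketch (ours) of the minimal norm solution via the SVD, checked against numpy's built-in pseudoinverse:

    import numpy as np

    X = np.array([[1.0, 1.0],
                  [1.0, 1.01],
                  [2.0, 2.0]])
    Y = np.array([1.0, 1.0, 2.0])

    U, lam, Vt = np.linalg.svd(X, full_matrices=False)
    # Minimal norm least squares solution: sum_k <u_k, Y> / lambda_k * v_k
    w = sum((U[:, k].dot(Y) / lam[k]) * Vt[k, :] for k in range(len(lam)))
    print(w)
    print(np.linalg.pinv(X).dot(Y))   # the same thing, via numpy's pinv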
decisions at the tree nodes, $m \ll M$.

3. Choose a training set of size $n < N$, with replacement, and use its complement as a test set. This is known as a bootstrap sample.

4. Choose the $m$ features which will be used for decisions in the given tree, and calculate the best split on those features and training set.

5. Let each tree be fully grown, i.e. no pruning.

The result is a collection of trees $T_1, \ldots, T_k$ where, given an input $x$, a class can be assigned by taking the mode vote over all the $T_i$'s. Note: this was Breiman's original approach; for regression trees, or if you are interested in knowing the strength of a given classification, an average or some other weighted sum can be taken over the outputs of the $T_i$'s.

Although $m$ can be determined experimentally, many packages use $\sqrt{M}$ or $\log_2 M$ as default values.

9.4.3 Out-of-bag classification

Given a collection of bootstrap samples $T_1, \ldots, T_k$ from a training set $T$, as well as a pair $(x, y) \in T$, you can use the classifiers trained on samples $T_i$ where $(x, y) \notin T_i$ for measuring the prediction accuracy. For example, suppose you have samples $T_1, T_2, T_3$ and respective trained classifiers $h_1$, $h_2$, and $h_3$, as well as a pair $(x, y) \in T_1$, but $(x, y) \notin T_2$ and $(x, y) \notin T_3$. You can predict the accuracy of your ensemble classifier from evaluating $h_2(x)$ and $h_3(x)$ against the true value $y$.

This is called an out-of-bag estimate, or classification, and empirical evidence has shown that
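As a concrete sketch (ours, using scikit-learn rather than Breiman's original code), both the $\sqrt{M}$ default and the out-of-bag estimate are available off the shelf:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    np.random.seed(0)
    X = np.random.randn(500, 10)
    y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

    # max_features='sqrt' mirrors the sqrt(M) default discussed above;
    # oob_score=True computes the out-of-bag accuracy during fitting.
    forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                    bootstrap=True, oob_score=True)
    forest.fit(X, y)
    print(forest.oob_score_)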
(default length 100). If you want to generate text based on your own input, which is say of type str, you first should coerce the input into an nltk.text.Text object, i.e.

    import nltk
    myNltkText = nltk.text.Text(myStringText)
    myNltkText.generate()

will do the trick.

(some more stuff)

Part IV

Classification

Chapter 9

Classification

9.1 A Quick Introduction

Classification is one of the fundamental problems of machine learning and data science in general. Whether you are trying to create a "spam" filter, trying to figure out which patients are most likely to be hospitalized in the coming year, or trying to tell whether a particular image appears in a photograph, it is all too likely that you will spend a significant percentage of your time working on such problems. There are two basic types of classification: the binary, or two-class, problem and the multi-class problem. In this chapter we will explore some of the basic solution constructs and evaluations thereof.

9.2 Naive Bayes

The Naive Bayes classifier is probably the most fundamental and widely used method in industry today. It is simple, fast, easily updated and, despite its many theoretical and even technical pitfalls, quite effective in practice. Many a time will the real-life limitations of an industry solution lead you to drop a more sophisticated approach where the accuracy gains do not justify the time and resourc
    >>> f = open('myfile.txt', 'r')
    >>> f.next()
    'this is the first line\n'
    >>> f.next()
    'this is the second line\n'

Note that you could read the entire file into memory at once using f.read() or f.readlines(). This however is often a very wasteful use of memory.

Exercise 10.1.1. Consider the following code fragment that loops through using an explicit index:

    mylist = [1, 2, 3]   # Create a list
    mysum = 0
    for i in xrange(len(mylist)):
        mysum += mylist[i]

In contrast, consider this method of looping, which does not use an index:

    mylist = [1, 2, 3]   # Create a list
    mysum = 0
    for item in mylist:
        mysum += item

The second method works if mylist is replaced by an iterator, but the first method does not. Why?

Note that this is a reason the second method is preferred. If you also need the index, then try using for i, item in enumerate(mylist).

10.3.3 For loops versus BLAS

As figure 10.8 shows, dot products are much faster in numpy than in pure Python. The reason that numpy is fast is that it uses some version of the Basic Linear Algebra Subprograms (BLAS). BLAS is a library of linear algebra functions, mostly written in Fortran. It takes advantage of your machine's cache configuration, and some versions support multi-threading. The version of BLAS that comes with standard numpy is 5-50 times slower than an optimized version, so the user is encouraged to upgrade to a version that
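A quick sketch (ours) of the kind of comparison behind figure 10.8, using timeit:

    import timeit
    import numpy as np

    x = range(10000)
    arr = np.ones(10000)

    def pythondot(a, b):
        # Pure Python dot product
        return sum(ai * bi for ai, bi in zip(a, b))

    print(timeit.timeit(lambda: pythondot(x, x), number=100))
    print(timeit.timeit(lambda: arr.dot(arr), number=100))   # BLAS-backed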
marginal density of $w$. However, there is usually no clear reason why this should be the marginal density, and it is better thought of as our prior guess. Usually, a reasonable and mathematically convenient choice is made.

Exercise 5.4.1. The term $p(Y)$ is determined by an integral involving the prior and likelihood. What is it?

5.2.2 Ideal Gaussian World

Suppose we are given a black box where we can shove input $x$ into it and measure the output $y$. An omnipotent house cat controls this box and tells us that the output is $y = x \cdot w_{true} + E$, where $w_{true}$ is fixed and for each experiment the cat picks a new i.i.d. $E \sim N(0, \sigma^2)$. Suppose further that, while we don't know $w_{true}$, we do know that the cat randomly generated it by drawing it from the normal distribution $N(0, \sigma_w^2 I_K)$ (here $I_K \in \mathbb{R}^{K \times K}$ is the identity matrix). Furthermore, this variable $w_{true}$ is independent of the noise $E$. The cat challenges us to come up with a "best guess" for $w_{true}$ and to predict new output $y$ given some input $x$. If we do this poorly, he will claw us to death.

In this case, prior to taking any measurements of $y$, our knowledge about $w$ is $w \sim N(0, \sigma_w^2 I_K)$. In other words, we think that $w_{true}$ is most likely to be within a width $\sigma_w$ ball around $0 \in \mathbb{R}^K$. We will see that each measurement reduces our uncertainty.³

³Interestingly enough, as $K \to \infty$, $w$ is most likely to be within a shrinking shell around the surface of this ball.
then we are classifying $X_{n'}$ as class 1 precisely when our model tells us that the probability $Y_{n'} = 1$ is greater than 0.5. This is a good choice if we want to be correct most of the time. However, other cutoffs can be used to balance considerations such as true/false positives. See chapter 9.

[Figure 6.2: The hyperplane normal to $w$ separates space into two classes. Whether or not this separation is correct is another story.]

Logistic regression as a classifier uses a hyperplane to separate $\mathbb{R}^K$ into two regions, one of which we classify as "class 0", and the other "class 1". See figure 6.2. This happens because

$$\sigma(x \cdot w) > c \iff x \cdot w > \log\frac{c}{1 - c},$$

and the set of points $x$ for which $x \cdot w$ is greater/less than some constant is the two regions on either side of the hyperplane defined by $x \cdot w = \text{const}$. This fact is important because it is a fundamental limitation of logistic classification. This sort of limitation is not present in other methods such as decision trees. Note that the limitation may be a good thing if you know that the solution structure is roughly approximated by space separated by a hyperplane. In a sense, this is a form of regularization, and prevents logistic regression from giving a "crazy" answer. Also note that you are free to choose nonlinear combinations of variables, e.g. you can square some feature. Then, in the ori
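A small sketch (ours) of cutoff-based classification with scikit-learn; predict_proba gives $P(y = 1 \mid x)$, and we threshold it ourselves instead of using the default 0.5:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    np.random.seed(0)
    X = np.random.randn(300, 2)
    y = (X[:, 0] + X[:, 1] + 0.5 * np.random.randn(300) > 0).astype(int)

    clf = LogisticRegression().fit(X, y)
    p = clf.predict_proba(X)[:, 1]       # P(y = 1 | x)
    cutoff = 0.3                         # trade false negatives for false positives
    yhat = (p > cutoff).astype(int)
    print((yhat == y).mean())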
2. If $X$, $Y$ are the training set, and $X$ contains a constant column, $0 \leq R^2 \leq 1$.

3. If $X$ does not contain a constant column, $R^2$ could in some cases be less than zero.

4. If $X$ and $Y$ are not the training set, $R^2$ could in some cases be less than zero.

$R^2$ has a few different interpretations:

(i) The denominator in the ratio in (5.25) can be thought of as the variability in the data. The numerator can be thought of as the variability unexplained by the model. Subtracting the ratio from one, we get $R^2$, the fraction of variability explained by the model.

(ii) $R^2$ can also be thought of as the improvement from the null model to the fitted model. We will generalize this idea later in the chapter on logistic regression.

(iii) For linear models, $R^2$ is the square of the correlation between the model's predicted values and the actual values.

Since squaring a number smaller than one makes it smaller, and squaring a number larger than one makes it larger, $R^2$ will tend to penalize larger errors more. More generally, define the $L^p$ norm as

$$\|Xw - Y\|_p := \left( \sum_{n=1}^{N} |X_n \cdot w - Y_n|^p \right)^{1/p},$$

and the $L^\infty$ norm as

$$\|Xw - Y\|_\infty := \max_n |X_n \cdot w - Y_n|.$$

If $p = 1$ then we are computing ($N$ times) the mean absolute error.

Exercise 5.25.2. Show that

1. As $p \to \infty$, $\|Xw - Y\|_p \to \|Xw - Y\|_\infty$. In other words, as $p$ increases, we are penalizing the larger deviations more and more.

2. As $p \to 0$,
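These error metrics are one-liners in numpy; a sketch (ours):

    import numpy as np

    def lp_error(X, w, Y, p):
        # L^p norm of the residual Xw - Y
        return (np.abs(X.dot(w) - Y) ** p).sum() ** (1.0 / p)

    def max_error(X, w, Y):
        # L^infinity norm: the single worst deviation
        return np.abs(X.dot(w) - Y).max()

    def r_squared(X, w, Y):
        # One minus (unexplained variability) / (total variability)
        resid = X.dot(w) - Y
        return 1 - (resid ** 2).sum() / ((Y - Y.mean()) ** 2).sum()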
5.3.5 Numerical techniques

The conclusion of theorem 5.19 implies that one can find $w_\delta$ by solving

$$(X^T X + \delta^2 I)\, w_\delta = X^T Y. \tag{5.21}$$

It is easy to show that the smallest eigenvalue of $A := X^T X + \delta^2 I$ is at least as large as $\delta^2$. So when $\delta > 0$ is large enough, the problem is well posed and can be solved directly even for large $N$, $K$ (e.g. $N = K = 1000$ gives little trouble). If $\delta = 0$ and the columns of $X$ are linearly dependent, you will not be able to invert $X^T X$. Even if $X^T X$ is invertible in theory, if $\delta$ is small then numerical error can cause trouble (e.g. when the condition number is in the millions).

Alternatively, one can compute the SVD of $X$ and then use the pseudoinverse $X^\dagger$. This has the advantage of not having to worry about linearly dependent columns (sometimes, however, it is nice to know that your columns are dependent). Moreover, computing the SVD of $X$ can be done in a stable manner even for large matrices.

The "best" alternative for small or medium matrices is to factor the matrix into a simpler form. One example is the QR factorization. We write $X^T X = QR$ with $Q$ orthogonal and $R$ upper triangular. The resultant normal equations, $QRw = X^T Y$, are then trivial to solve. First we multiply by $Q^T$ (which is $Q^{-1}$), then solve the system of equations $Rw = Q^T X^T Y$ (which is trivial since $R$ is triangular). This has the advantage of being numerically very stable and quicker than the SVD.

The methods described
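A sketch (ours) of solving the regularized normal equations (5.21) via QR with numpy:

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(100, 5)
    Y = np.random.randn(100)
    delta = 0.1

    # Normal equations: (X^T X + delta^2 I) w = X^T Y
    A = X.T.dot(X) + delta ** 2 * np.eye(5)
    b = X.T.dot(Y)

    # Factor A = QR, multiply by Q^T, then solve the triangular system
    Q, R = np.linalg.qr(A)
    w = np.linalg.solve(R, Q.T.dot(b))   # scipy.linalg.solve_triangular would
                                         # exploit the triangular structure
    print(np.allclose(A.dot(w), b))      # True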
        | python $SRC/subsample.py -r 0.1 \
        | python $SRC/cleandata.py \
        > $DATA/outputfile.csv

Some points:

- The #!/bin/bash is called a shebang, and in this case tells your machine to run this script using the command /bin/bash. In other words, let bash run this script.

- All other lines starting with # are comments.

- The line SRC=src sets a variable, SRC, to the string src. In this case we are referring to a directory containing our source code. To access the value stored in this variable, we use $SRC.

- The lines that end with a backslash \ are in fact interpreted as one long line with no newlines. This is done to improve readability.

- The first couple of lines under cat start with pipes, and the last line is a redirection.

- The command cat is used on the first line and the output is piped to the first program. This is done rather than simply using, as the first line, python $SRC/subsample.py -r 0.1 $DATA/inputfile.csv. What advantage does this give? It allows one to easily substitute head for cat and have a program that reads only the first 10 lines. This is useful for debugging.

Why write shell scripts to run your programs?

- Shell scripts allow you to tie together any program that reads from stdin and writes to stdout. This includes all the existing Unix utilities.

- You can (and should) add the shell scripts to your repository. This keeps a record of how data was generated.

- Anyone who under
$Av > 0$. A function $f(w)$ is strictly convex if, for all $\lambda \in (0, 1)$ and distinct points $w_1, w_2 \in \mathbb{R}^K$,

$$f(\lambda w_1 + (1 - \lambda) w_2) < \lambda f(w_1) + (1 - \lambda) f(w_2).$$

If $f$ has two continuous derivatives, then, rather than checking the above inequality, one can check that the hessian matrix $\nabla^2 f(w)$ is positive definite at every point $w \in \mathbb{R}^K$. If a function $f : \mathbb{R}^K \to \mathbb{R}$ is strictly convex, then any local minimum is also the global minimum. Note that it is possible for the function to not have any minimum, e.g. it can keep getting smaller and smaller as $|w| \to \infty$. This is very important for optimization, since most algorithms are able to find local minima, but have a hard time verifying that a local minimum is a global minimum.

The coefficient $w$ in (6.1), (6.2) is usually determined by maximum likelihood, although a Bayesian approach may be used. In either case, we need the likelihood. Recalling the notation of chapter 4, the likelihood is

$$p(Y \mid w) = \prod_{n=1}^{N} P(y = 1 \mid x = X_n, w)^{Y_n}\, P(y = 0 \mid x = X_n, w)^{1 - Y_n} = \prod_{n=1}^{N} \sigma(X_n \cdot w)^{Y_n} \left(1 - \sigma(X_n \cdot w)\right)^{1 - Y_n}.$$

This can be checked by considering the cases $Y_n = 0$ and $Y_n = 1$ separately. The maximum likelihood solution is the point $w_{ML}$ that maximizes this. Instead we usually minimize the negative log likelihood, that is,

$$w_{ML} := \arg\min_w L(w), \quad \text{where} \quad L(w) := -\sum_{n=1}^{N} \left[ Y_n \log \sigma(X_n \cdot w) + (1 - Y_n) \log\left(1 - \sigma(X_n \cdot w)\right) \right]. \tag{6.4}$$

Note that $L(w)$ will always be non-negative since $\sigma \in (0, 1)$. Sin
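Equation (6.4) transcribes directly into numpy; a sketch (ours), with scipy.optimize.minimize standing in for a purpose-built solver:

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(z):
        # Clip to avoid overflow in exp for extreme z
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

    def neg_log_likelihood(w, X, Y):
        # L(w) from (6.4); always non-negative
        p = sigmoid(X.dot(w))
        return -np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))

    np.random.seed(0)
    X = np.random.randn(200, 3)
    Y = (X.dot([1.0, -2.0, 0.5]) + np.random.randn(200) > 0).astype(float)

    w_ml = minimize(neg_log_likelihood, np.zeros(3), args=(X, Y)).x
    print(w_ml)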
C/C++ programming languages [Wik].

Chapter 2

Version Control with Git

"Git. That's the VCS that I have to look at Google to use." - Josef Perktold

2.1 Background

The idea of version control is almost as old as writing itself. Authors writing books and manuscripts all needed logical ways to keep track of the various edits they made throughout the writing process. Version control systems like Git, SVN, Mercurial, or CVS allow you to save different versions of your files, and revert to those earlier versions when necessary. The most modern of these four systems are Git and Mercurial. Each of these has many features designed to facilitate working in large groups and keeping track of many versions of files.

2.2 What is Git?

Git is a distributed version control system (DVCS). This means that every user has a complete copy of the repository on their machine. This is nice, since you don't need an internet connection to check out different versions of your code, or save a new version. Multiple users still do need some way to share files. In this class we will use Git along with the website GitHub. GitHub provides you with a clone of your local repository that is accessible via the internet. Thus, when you change a file you will push those changes to GitHub, and then your teammates will pull those changes down to their local machines.

2.3 Setting Up

For Macs, download from mac.github.com. For Lin
Exercise 5.5.1. Show that adding more covariates to $w$ can only decrease $\|Xw - Y\|$. Does this mean that adding more covariates is always a good idea?

Exercise 5.5.2. Derive (5.5).

Exercise 5.5.3. What happens to the MAP estimate as your prior uncertainty about $w$ goes to infinity (e.g. $\sigma_w \to \infty$)? Does this seem reasonable?

The omnipotent house cat was introduced to emphasize the fact that this ideal world does not exist. In real life, our unmodeled response $E_n$ depends on $X_n$, and is certainly not Gaussian. Furthermore, what does $w$ represent in real life? Suppose we take the interpretation that $w$ is a vector of derivatives of the input/output response; then for what reason would our prior guess be $w \sim N(0, \sigma_w^2 I_K)$? All however is not lost. If your model is not too far from reality, then your interpretation of $w$ will have meaning, and your predictions will be accurate. This is what mathematical modeling is. The beauty of the Bayesian approach is that it makes these assumptions explicit. In the next section, we will see how our inevitable misspecification of error, along with data quality issues, will degrade our estimation/prediction, and the prior will take on the role of preventing this degradation from getting out of hand.

5.3 Coefficient Estimation: Optimization Formulation

As we saw in section 5.2, in the case of Gaussian prior and error, finding the "best guess" coefficient $w_{MAP}$ is equivalent to solving the regularized least squar
ML $w_1$ is approximately equal to one. If on the other hand we use millimeters, then the ML $w_1$ will be 1,000 times smaller, and if we do not adjust $\delta$, the penalty term $\delta^2 w_1^2$ will have little influence on the MAP solution. We could adjust $\delta$, making it 1,000 times larger, but then the other penalty term $\delta^2 w_0^2$ would be huge. As a result, if we change our unit of measurement to millimeters, then our model will have no constant. The problem is that we have used the same prior in both cases. Recall that (5.23) is the result of the prior $w_k \sim N(0, \sigma^2/\delta^2)$ for $k = 0, 1$. However, if we are using millimeters, then we expect $w_1$ to be 1,000 times smaller than before, which means we should change $\delta \to 1000\,\delta$.

Consider the more general MAP problem with a data matrix $X$, with a constant column $X_0$ prepended.

As above, the optimal $w_\delta$ will depend strongly on the unit of measurement used for the columns of $X$. To correct for this, we could use a different prior for each column. The prior variance should be inversely proportional to the scale of the variable. One way to achieve this is to let $\delta$ be a vector, with $\delta_k$ proportional to the scale $\sigma_k$ of the $k$th column, for $k = 1, \ldots, K$. Similar (but not identical) ends can be obtained, however, by first standardizing $X$ and then using the same $\delta$ for all variables $w_k$, $k = 1, 2, \ldots, K$. Typically the constant is not regularized. The resultant MAP problem is to find

$$\hat{w}_\delta := \arg\min_w \|Xw - Y\|^2 + \delta^2 \sum_{k=1}^{K} w_k^2. \tag{5.24}$$

The Bayesian interpretation of (5.24) is that we believe a pr
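A sketch (ours) of the standardize-then-regularize recipe with scikit-learn; note that Ridge leaves the intercept out of the penalty, matching the convention above:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge

    np.random.seed(0)
    # Columns on wildly different scales (e.g. meters vs. millimeters)
    X = np.random.randn(100, 3) * np.array([1.0, 1000.0, 0.01])
    Y = X.dot([1.0, 0.001, 50.0]) + np.random.randn(100)

    Xs = StandardScaler().fit_transform(X)        # mean 0, std 1 per column
    model = Ridge(alpha=1.0, fit_intercept=True)  # intercept not regularized
    model.fit(Xs, Y)
    print(model.coef_, model.intercept_)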
$\|Xw - Y\|_p^p$ tends to the number of elements of $Xw$ and $Y$ that differ. In other words, decreasing $p$ penalizes all deviations equally.

Exercise 5.25.3. Suppose we fit two linear regression models using two different datasets, $(X, Y)$ and $(X', Y')$. Both data sets are the same length. We notice that the $R^2$ is bigger with $(X, Y)$ and the $L^2$ error is bigger with $(X', Y')$. How can this be?

5.6 End Notes

An extensive introduction to similar material can be found in "The elements of statistical learning" by Hastie, Tibshirani, and Friedman [HTF09]. The "elements" book covers more of the classical statistical tests (e.g. p-values), which are important to understand since you will be asked to produce them. Our theoretical treatment is designed to complement this classical approach, and to prep you for the numerical problems.

For a more complete treatment of Bayesian prediction/inference, we refer the reader to the statistical text by Gelman [Gel03], and the machine learning text by Bishop [Bis07].

A very theoretical treatment of regularization can be found in the book "Regularization of Inverse Problems" [EHN00].

Chapter 6

Logistic Regression

"The purpose of computing is insight, not numbers." - Richard Hamming

6.1 Formulation

Logistic regression is probably familiar to many of you, so we will try to formulate the problem from a few different angles.

6.1.1 Presenter's viewpoint

Let's consider the viewpoi
$$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = x \cdot w. \tag{6.1}$$

This can be re-written using the logit function, defined by $\mathrm{logit}(z) := \log(z/(1 - z))$. Solving for $P(y = 1 \mid x)$, we arrive at

$$P(y = 1 \mid x) = \frac{e^{x \cdot w}}{1 + e^{x \cdot w}}. \tag{6.2}$$

This can also be re-written using the logistic sigmoid function $\sigma(z) := \exp(z)/(1 + \exp(z))$. In other words, our model assumes $P(y = 1 \mid x) = \sigma(x \cdot w)$. This function has the nice property that it takes values in the interval $(0, 1)$, as a probability should. Moreover, it behaves nicely when differentiated (see exercise 6.2.2).

Exercise 6.2.1. Show that (6.1) implies (6.2).

Exercise 6.2.2. Show that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\sigma(-z) = 1 - \sigma(z)$.

[Figure 6.1: Plot of the sigmoid function $\sigma(z)$ for $z = x \cdot w \in (-5, 5)$.]

6.1.3 Data generating viewpoint

One can devise a data generating process that gives rise to the classical viewpoint. Suppose we have a latent variable $z$ such that our observed variable $y$ is given by $y = 1_{z > 0}$. For example, $y = 1$ could indicate "patient recovers from cancer", and $z$ is the health level of the patient. To tie this to independent variables we consider the model

$$z = x \cdot w + e,$$

where $e$ is a random variable with probability density $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ (see exercise 6.2.2). This implies

$$P(y = 1) = P(z > 0) = P(e > -x \cdot w) = \int_{-x \cdot w}^{\infty} \sigma'(s)\, ds = \sigma(x \cdot w),$$

where the second to last equality is justified by verifying $\sigma'(z) = \sigma'(-z)$. In other words, $y$ has probability mass function given by (6.2).
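A quick numeric sanity check (ours) of the identities in exercise 6.2.2:

    import numpy as np

    def sigma(z):
        return np.exp(z) / (1 + np.exp(z))

    z = np.linspace(-5, 5, 11)
    h = 1e-6
    deriv = (sigma(z + h) - sigma(z - h)) / (2 * h)    # numerical sigma'(z)
    print(np.allclose(deriv, sigma(z) * (1 - sigma(z)), atol=1e-5))  # True
    print(np.allclose(sigma(-z), 1 - sigma(z)))                      # True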
backreferences were introduced. You can find these in the Python re module, as well as when calling grep with the -E flag (for extended).

The basic syntax is quite simple, and is evoked by writing \N to refer to the N'th group. If we refer back to the valid dates example from above, we would replace the second set of []'s with a backreference to the first, ensuring a consistent match:

    (19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])

Note that we are calling \2 because the first group is (19|20).

Of course, even backreferences don't solve all your problems, but they are a big help. If you were trying to match something a bit more flexible, such as balanced parentheses, you would need a counter or just some additional scripting.

Exercise 8.0.5. Use the Python regular expression library to write a script that matches all lines in a file with balanced parentheses.

8.3 Python RE Module

The Python regex library is an amazing tool that combines the power of regular expressions, with backreferences, and the flexibility of the Python language. Most of the syntax is inherited from Unix, as above, but there are a few additions.

There is also a large array of methods that come with the library; a selection of the more notable ones is the following:

- The findall function,

    re.findall(pattern, myString),

which returns all non-overlapping patterns in a string. For example, re.findall(r'\d+', 'My number
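A quick sketch (ours) of the backreference in action; the separator matched by group 2 must repeat:

    import re

    pattern = r'(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])'
    print(re.search(pattern, '2013-01-25'))   # matches: both separators are '-'
    print(re.search(pattern, '2013-01/25'))   # None: the separators disagree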
The difference between re.match() and re.search() is the fact that match looks for patterns at the beginning of a string, and search anywhere within. You can turn a search into a match function by adding a ^ to the beginning of the pattern at hand.

- The sub and split functions.

These substitute a given found regex pattern with a string, or split on a given pattern. The basic syntax is

    re.split(pattern, string)
    re.sub(pattern, stringSub, string)

Digression: Why the "r"?

As you might have noticed, there are two ways to enter a regular expression into a Python re method: either in quotes or with an "r" appended to the front. The r invokes Python's raw string notation. The reason for this is that the use of a backslash in regex as an "escape" character (i.e. to allow special wild characters to be used for a literal match) collides with Python's use of a backslash for the same purpose in string literals. Hence, to match a backslash in a string using re you would have to write

    re.findall('\\\\', myString)

i.e. two backslashes to escape the special meaning in regex and two to escape it in Python strings. If this were left so, you can imagine a complete rigmarole, but luckily we can invoke the raw string notation and arrive at the same function with more sane syntax:

    re.findall(r'\\', myString)

The Python re library is very rich and incredibly powerful. We invite you to explore more on the module website.
computer to communicate with other machines (see figure 1.2). Standard input is used to allow a process to read data from another source. A Python programmer could read from standard in, then print the same thing to standard out, using

    import sys

    for line in sys.stdin:
        sys.stdout.write(line)

If data is flowing into stdin, then this will result in the same data being written to stdout. If you launch a terminal, then stdout is, by default, connected to your terminal display. So if a program sends something to stdout, it is displayed on your terminal. By default stdin is connected to your keyboard. Stderr operates sort of like stdout, but all information carries the special tag: "this is an error message." Stderr is therefore used for printing error/debugging information.

[Figure 1.2: Illustration of the standard streams: keyboard -> 0 stdin; 1 stdout -> display.]

1.3.2 Pipes

The standard streams aren't any good if there isn't any way to access them. Unix provides a very simple means to connect the standard output of one process to the standard input of another. This construct is called a pipe and is written with a vertical bar |. Utilities tied together with pipes form what is known as a pipeline.

Consider the following pipeline:

    cat infile.csv | cut -d, -f1 | sort | uniq -c

The above line reads in a text file and prints it to standard out with cat; the pipe | redirects this standard out to
creation is attributed to Leo Breiman, Adele Cutler, Tin Kam Ho, as well as others. The method aims to utilize the flexibility of a single decision tree, while mitigating the drawbacks mentioned above by spreading them out over an entire collection. Random forests, like ensembles in general, can be thought of as a framework more than a specific model. The essential pieces are the following:

- The shape of each decision tree, i.e. is the threshold for a single decision a linear, quadratic, or other function?

- The type of predictor to use at each leaf, for example, a histogram or constant predictor.

- The splitting objective to optimize at each node, for example, error rate, information gain, etc.

- Some random method which will specify each tree.

For the sake of concreteness, and to set up some of the technicalities which will help us understand why random forests make sense as a classification approach, let's look at Breiman's original definition and algorithm.

Definition 9.2 (Breiman). A random forest is a classifier consisting of a collection $\{h(x, \Theta_k),\ k = 1, \ldots\}$, where the $\{\Theta_k\}$ are independently distributed random vectors and each tree casts a unit vote for the most popular class at input $x$.

For each tree, Breiman specified the following:

1. Let $N$ be the number of training cases and $M$ the number of features in the classifier.
distinction between uncertainty and risk has been talked about quite extensively by Nassim Taleb [Tal05, Tal10].

What is the motivation for, and focus of, this course?

Just as common as the hacker with no domain knowledge, or the domain expert with no statistical know-how, is the traditional academic with meager computing skills. Academia rewards papers containing original theory. For the most part it does not reward the considerable effort needed to produce high-quality, maintainable code that can be used by others and integrated into larger frameworks. As a result, the type of code typically put forward by academics is completely unusable in industry or by anyone else for that matter. It is often not the purpose, or worth the effort, to write production-level code in an academic environment. The importance of this cannot be overstated. Consider a 20-person start-up that wishes to build a smartphone app that recommends restaurants to users. The data scientist hired for this job will need to interact with the company database (they will likely not be handed a neat csv file), deal with falsely entered or inconveniently formatted data, and produce legible reports, as well as a working model for the rest of the company to integrate into its production framework. The scientist may be expected to do this work without much in the way of software support. Now, considering how easy it is to blindly run most predictive software, our hypo
do, and this slows things down. This is only really a problem when you're working with a small number of files, and so this doesn't matter. If chunksize is too small, you risk burdening the processes with the task of pickling and communicating. No kidding here: there are many cases where a small chunksize can result in performance that gets worse as you add more cores. For the files on my computer at this time, chunksize=10 was a decent compromise. In any case, you simply must test out the time with a few different settings to make sure you get some speedup. This can be done with

    time python streamfiles.py > /dev/null

The redirection sends the output to /dev/null, which is a "file" that is erased as soon as it is written. In other words, you never see the output.

10.3.6 Numba

10.3.7 Cython

Bibliography

[Bis07] C. Bishop, Pattern recognition and machine learning, first ed., 2007.

[BV04] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University Press, 2004.

[Cle] W. S. Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics, ISI Review.

[Con] D. Conway, The data science venn diagram, http://www.dataists.com/2010/09/the-data-science-venn-diagram/.

[EHN00] H. Engl, M. Hanke, and A. Neubauer, Regularization of inverse problems, Kluwer, 2000.

[Gel03] A. Gelman, Bayesian data analysis, second ed., 2003.
g:
        for line in f:
            newline = process_line(line)
            g.write(newline)

A closely related problem would be that of opening many small text files, computing something in each, and printing the results to a new file. As a concrete example, consider the case where we have a collection of files and want to count the occurrences of nouns in them. To detect nouns requires some non-trivial (and often slow) NLP. This means that the processing function is likely the bottleneck. In that case it makes sense to parallelize things. Let's start with the serial version of this program:

    import sys
    from os import listdir
    from os.path import isfile, join

    import nltk

    def main():
        basepath = '/home/langmore/jrl/enron/data/raw/enron-spam/all'
        allfiles = [f for f in listdir(basepath)
                    if isfile(join(basepath, f))]
        # The part of speech that we will keep
        pos_type = 'NN'
        for filename in allfiles:
            result = process_file(pos_type, basepath, filename)
            sys.stdout.write(result + '\n')

    def process_file(pos_type, basepath, filename):
        """
        Read one file at a time, extract non-stop words whose part of speech
        is pos_type, return a count.

        Parameters
        ----------
        pos_type : String
            Some nltk part of speech type, e.g. 'NN'
        basepath : String
            Path to the base directory holding files
        filename : String
            Name of the file

        Returns
        -------
        counts : String
            filename  word1 n1 wor
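The parallel version is a short step from here. Below is a sketch of the idea (ours); it assumes process_file from the serial version above, imported from a hypothetical module named serial_version, and uses functools.partial to freeze the extra arguments:

    import sys
    import functools
    from multiprocessing import Pool
    from os import listdir
    from os.path import isfile, join

    from serial_version import process_file   # hypothetical module name

    def main():
        basepath = '/home/langmore/jrl/enron/data/raw/enron-spam/all'
        allfiles = [f for f in listdir(basepath)
                    if isfile(join(basepath, f))]
        # Freeze all arguments except filename, so the pool can map over files
        func = functools.partial(process_file, 'NN', basepath)
        pool = Pool(processes=4)
        # imap streams results back as workers finish; chunksize trades
        # pickling/communication overhead against load balance
        for result in pool.imap(func, allfiles, chunksize=10):
            sys.stdout.write(result + '\n')

    if __name__ == '__main__':
        main()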
in practice the setup is as good as using a train/test set of equal size. If your training set is small and you are worried about leaving some part of it out during the model build, this can be a good approach. If your training set is very large, and you are worried about computation time, you can use smaller bootstrap samples to build a collection of classifiers in less time and ensemble these together. Since the individual classifiers are independent, you can parallelize the whole process, making it highly scalable, which can be very useful if you envision having to retrain your models in the same way but on ever-growing data sets.

[Figure 9.4: The entropy function.]

9.4.4 Maximum Entropy

There are a number of different criteria which you can use to split a decision tree node, and one very popular one, which we will look at now, is called maximum entropy. You can think of entropy as the information content, the measure of uncertainty, or randomness of a random variable.

The entropy of a random variable $X$ is defined as

$$H(X) = -\sum_{i \in \text{Class}} P(X = i) \log_2 P(X = i),$$

where the sum is over all the classes the random variable can output. Figure 9.4 shows a plot of what the function would look like for a two-class variable.

Let's look at an example to see why this function could make sense. Suppose you are flipping a coin and would like to find a way to measure how biased it is. The probability that the coin flips
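A direct implementation of $H(X)$ for a two-class variable (a sketch, ours):

    import numpy as np

    def entropy(p):
        # H for a two-class variable with class probabilities p and 1 - p
        q = 1 - p
        return -(p * np.log2(p) + q * np.log2(q))

    print(entropy(0.5))    # 1.0: a fair coin is maximally uncertain
    print(entropy(0.99))   # ~0.08: a heavily biased coin is nearly predictable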
this code could be parallelized by

1. Splitting mylist up into chunks,

2. Sending each chunk to a separate worker process (hopefully attached to a separate processor core),

3. Letting each process handle its chunk, creating a shorter version of myresults,

4. Sending the results back to the master process, which re-assembles them.

The following code does just that:

    from multiprocessing import Pool

    # Start 4 worker processes
    pool = Pool(processes=4)
    myresults = pool.map(myfun, mylist)

If you use pool.map(), the workers complete their work, then return the results. This can cause memory issues if the results are large. A variant is pool.imap(), which creates an iterator such that results are sent back as soon as they are available. See section 10.3.5 for more details and an example.

10.3.5 Multiprocessing example: Stream processing text files

A common data science task is to read in a large file (that does not fit into memory), compute something with every line (e.g. count word occurrences), and write the results to a new file. The proper way to do this is to read the file in line by line and process each line one at a time. This avoids blowing up your memory, and is parallelizable (see below).

Serial examples

An example is the following:

    def process_line(line):
        # Write some code here
        return newline

    with open('myfile.txt', 'r') as f:
        with open('outfile.txt', 'w') as
lead to floating point underflow, causing $P(l \mid F)$ to be zero, once again.

Floating point numbers don't have infinite precision (open up your Python shell and type 1e-320 if you want to convince yourself) and, hence, after a certain point these are rounded to zero. In general, if you have a large training set (as you should), you will have many instances where the values of $P(f \mid l)\, P(l)$ are small; multiplying these will result in underflow, causing the posterior to be zero. There are ways to handle such issues in the Python decimal module, but we will look at some others below.

9.2.1 Smoothing

Probably the most common solution to the latter two issues above is called arithmetic smoothing. Here we simply add a constant to each likelihood above, i.e. replace $P(f \mid l)\, P(l)$ with

$$\left( P(f \mid l) + \alpha \right) P(l).$$

If $\alpha = 1$ this is referred to as one-smoothing, but generally a smaller value such as 0.5 is chosen.

Another solution is to transform everything onto the logarithmic scale, i.e. effectively replacing multiplication with addition. We can then write our classifier as

$$\arg\max_l \log\left( P(l) \prod_f P(f \mid l) \right) = \arg\max_l \left[ \log P(l) + \sum_f \log P(f \mid l) \right].$$

9.3 Measuring Accuracy

9.3.1 Error metrics and ROC Curves

A classifier either gets something right or wrong. Although looking at various statistical error metrics such as RMSE can be useful for comparing classifier models, the most natural is to look at predicted vs. true positives and ne
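A sketch (ours) of the log-scale trick, with hypothetical per-class tables of already-smoothed likelihoods:

    import numpy as np

    # Hypothetical smoothed likelihoods P(f|l) for two labels, plus priors P(l)
    likelihoods = {'spam': np.array([1e-3, 2e-5, 5e-2]),
                   'ham':  np.array([4e-2, 1e-3, 1e-6])}
    priors = {'spam': 0.4, 'ham': 0.6}

    def classify(features_present):
        # argmax_l [ log P(l) + sum_f log P(f|l) ]: sums of logs never underflow
        scores = {label: np.log(priors[label])
                  + np.log(likelihoods[label][features_present]).sum()
                  for label in priors}
        return max(scores, key=scores.get)

    print(classify([0, 2]))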
    mylist = [1] * arrsize
    myarray = np.ones(arrsize)

    for i in xrange(numiter):
        temp = pythondot(mylist, mylist)
        temp = numpydot(myarray, myarray)

[Figure 10.8: Results of profiling session, viewed with less. See also listing 1.]

the options -n and -r, which determine the number of iterations used to determine the time the function call takes.

While timeit can be used to test small snippets of code, it is not the best tool to test larger functions. The line_profiler is the authors' preferred method. It provides a readout of time spent in each line of a function. The line profiler can be installed with

    pip install line_profiler

The easiest way to use the line_profiler is to follow these steps (see also listing 1):

1. Wrap your code in a function, and then put an @profile decorator at the top of the function.

2. Run this function with some command line argument. This is usually done by writing a script that calls the function (either a unit test, or some script you are using to help run your code). This script can even be the same module as the function, where you have written "run capabilities" into an if __name__ == '__main__': clause.

3. Supposing the script is called myscript.py, call the script with the command line argument

    kernprof.py -l myscript.py

The -l argument invokes line-by-line profiling (and you will get funny errors if you don't use -l).

4. View the results with

    python -m line_profiler myscript.py.lprof
of the data of today. In addition to those of mathematics and statistics, these software skills are domain-transferable, and so it makes sense to create a job title that is also transferable. We could also point to the "data hype" created in industry as a culprit for the term data science, with the "science" creating an aura of validity and facilitating LinkedIn headhunting.¹

What skills are needed?

One neat way we like to visualize the data science skill set is with Drew Conway's Venn Diagram [Con]; see figure 1. Math and statistics is what allows us to properly quantify a phenomenon observed in data. For the sake of narrative, let's take a complex deterministic situation, such as whether or not someone will make a loan payment, and attempt to answer this question with a limited number of variables and an imperfect understanding of those variables' influence on the event we wish to predict. With the exception of your friendly real estate agent, we generally acknowledge our lack of soothsayer ability and make statements about the probability of this event. These statements take a mathematical form, for example

$$P(\text{makes loan payment}) = e^{F(\text{creditscore})}.$$

¹William S. Cleveland decided to coin the term data science and write "Data Science: An action plan for expanding the technical areas of the field of statistics" [Cle]. His report outlined six points for a university to follow in developing a data analyst curriculum.

[Figure 1: Drew Conway's data science Venn diagram.]
results do have their place; we prefer instead to show the simple method of cross-validation.

A typical cross-validation cycle would go as follows:

1. Choose a number of possible values for $\delta$, call them $\delta_1, \ldots, \delta_m$.

2. Segregate your observations into a training set $(X^{train}, Y^{train})$ and a cross-validation set $(X^{cv}, Y^{cv})$. A common split would be 70/30.

3. For every $\delta_m$, use the training data to solve the regularized least squares problem (5.18), obtaining $w_{\delta_1}, \ldots, w_{\delta_m}$.

4. For every $\delta_m$, measure the cross-validation error (here it is a relative root mean square error) $\|X^{cv} w_{\delta_m} - Y^{cv}\| / \|Y^{cv}\|$.

5. Choose $\delta$ to be the $\delta_m$ that minimizes that error.

6. Train your model using the full data set and the $\delta$ from above.

Step 2 chooses a bigger set of data for training than cross-validation. This is typical because you are training $K$ different parameters, but, in this case, only using cross-validation to find one. Thus, we are not worried so much about overfitting a small data set. The $\delta_m$ in step 1 are usually picked by first guessing a few values of $\delta$ and seeing over what range of values the training data misfit isn't too horrible. In step 3 we solve the regularized problem multiple times. Note that if we were to measure the training data (unregularized) least squares error $\|X^{train} w_{\delta_m} - Y^{train}\|$, we would see the training error get worse monotonically as $\delta$ increases. W
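The whole cycle fits in a few lines of numpy; a sketch (ours, with made-up data):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(200, 5)
    Y = X.dot(np.random.randn(5)) + 0.5 * np.random.randn(200)

    # 70/30 split
    idx = np.random.permutation(200)
    train, cv = idx[:140], idx[140:]

    best = None
    for delta in [0.01, 0.1, 1.0, 10.0]:
        # Regularized least squares (5.18) on the training set
        A = X[train].T.dot(X[train]) + delta ** 2 * np.eye(5)
        w = np.linalg.solve(A, X[train].T.dot(Y[train]))
        # Relative RMS error on the cross-validation set
        err = np.linalg.norm(X[cv].dot(w) - Y[cv]) / np.linalg.norm(Y[cv])
        if best is None or err < best[0]:
            best = (err, delta)
    print(best)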
Example 5.17. Another example starts with the data matrix

X = ( 1  1.01 )
    ( 1  1.00 )

This matrix is almost singular because the two rows are almost the same. This happened because, over our measured data set, the two covariates were almost the same. If our model is good, we expect the two measured responses Y_1, Y_2 to be almost the same (either that, or we have some pathological case such as y = 1000(x_1 − x_2)). One finds that the singular values are λ_1 = 2.005, λ_2 = 0.005. The smaller singular value is a problem. It is associated with the singular directions v_2 = (0.71, −0.71), u_2 = (−0.71, 0.71). This means that

0.71 w_1 − 0.71 w_2 ≈ 200 · (0.71 Y_2 − 0.71 Y_1).

In other words, our coefficient is extremely sensitive to Y_2 − Y_1. Small differences between the two will lead to a huge difference in w (note that this will not lead to huge changes in w_1 + w_2, only in w_1 − w_2). The upshot is that our predictive model will work fine so long as future x values have x_1 ≈ x_2 (just like our training data), but if we stray outside the range of our training data we are in for big problems.

5.3.3 L2 regularization

As was mentioned earlier, the Bayesian MAP problem reduces, in the Gaussian case, to the optimization problem

w_δ := argmin_w ‖Xw − Y‖² + δ‖w‖².    (5.18)

With "no math" whatsoever, one can see that the penalty term δ‖w‖² acts to prevent w from becoming too big. As with classical least squares, the SVD provides insight into exactly what is happening.
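A quick numerical check of the sensitivity in example 5.17 (a sketch; the size of the perturbation is arbitrary):

import numpy as np

X = np.array([[1.0, 1.01],
              [1.0, 1.00]])
U, s, Vt = np.linalg.svd(X)
print(s)    # approximately [2.005, 0.005]

Y = np.array([1.0, 1.0])
w = np.linalg.solve(X, Y)                 # X is square here, so solve exactly
Y_perturbed = Y + np.array([0.0, 0.01])   # a tiny change in Y_2
w_perturbed = np.linalg.solve(X, Y_perturbed)
print(w_perturbed - w)    # an order-one change in w from a 0.01 change in Y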
so simple that the appending of an item to a list takes a significant amount of the time, you can get around a 50% speedup. This 50% (or so) speedup usually does not merit a pizza party. So don't be tempted to wrap every for loop into a list comprehension; this type of code would be unreadable. A good rule of thumb is this: if a for loop is not slowing anything down (suppose it is only over a few items), then a for loop is usually more readable and is preferred. Use a list comprehension only if the list comprehension can be put in one 79-character line. The map function is an alternative to list comprehensions:

mylist = []  # create a list

def myfun(item):
    # do something
    return new_item

# same as: [myfun(item) for item in mylist]
myresults = map(myfun, mylist)

Hash tables

Another (sometimes huge) performance tip that should always be followed is to take advantage of the fast hash table lookup capabilities of Python's set and dict objects. Suppose you create a dictionary mydict = {'name': 'ian', 'age': 13}. A hash table is created that stores the keys 'name' and 'age'. These keys are not stored as the actual strings. Instead, they are transformed using a hash function to a location in a table. Many different keys could be transformed to the same location, but it is very, very unlikely that this will happen to you. So you can safely assume that in mydict the keys 'name' and 'age' are mapped to different locations.
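The payoff is easy to see by timing membership tests; a sketch (the sizes are arbitrary):

import time

items = range(100000)
mylist = list(items)
myset = set(items)

start = time.time()
-1 in mylist              # scans the whole list: O(n)
print(time.time() - start)

start = time.time()
-1 in myset               # a single hash lookup: O(1) on average
print(time.time() - start)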
The methods presented so far are not suited for larger problems because they require the full matrix to exist in memory and attempt to completely solve the problem, which is difficult and unnecessary for large problems. In this case, an iterative technique is used. Here we start with a guess w⁰ and iteratively improve it until we have reduced the residual ‖Xwⁿ − Y‖ to a small enough value. These iterative techniques typically require evaluation of Xu for various vectors u. This does not require storing X: you only need to stream X in and out of memory, and multiply things while they are in memory. For the case of a large sparse matrix (most entries are zero), you can store the nonzero values and use these to perform multiplication. Moreover, rather than completely solving the problem, you can stop at any point and obtain an approximate solution.

5.4 Variable Scaling and Transformations

To motivate this section, consider the linear model y = w_0 + w_1 x + e, where y is "wealth" and x is a person's height. If height is measured in millimeters, then x will be around 1,500. If, on the other hand, height is measured in meters, then x will be around 1.5. Intuition tells us that when height is measured in millimeters (and thus x is very large), the optimal w_1 will be much smaller. This is a form of prior belief. This section will conclude:

1. Maximum likelihood estimation requires no special attention to variable scale (except for possible numerical issues) to produce optimal coefficients.
stream, you should view it as a contiguous block that looks like

name,age,weight\nian,1,11\nchang,2,22\ndaniel,3,33

The special character \n is called a newline character and represents the start of a new line. The command cut -d, -f2 data.csv will pick out the second column of data.csv; in other words, it returns

age
1
2
3

or, thought of as a stream, age\n1\n2\n3. This could be accomplished by reading the file in sequence, starting to store the characters in a buffer once the first comma is hit, then printing when the second comma is hit. Since the newline is such a special character, many languages provide some means for the user to process each line as a separate item.

This is a very simple way to think about data processing. This simplicity is advantageous and allows one to scale stream processing to massive scales. Indeed, the popular Hadoop MapReduce implementation requires that all small tasks operate on streams. Another advantage of stream processing is that memory needs are reduced. Programs that are able to read from stdin and write to stdout are known as filters.

1.3.1 Standard streams

While "stream" is a general term, there are three streaming input and output channels available on (almost) every machine. These are standard input (stdin), standard output (stdout), and standard error (stderr). Together, these standard streams provide a means for a process to communicate with other processes, or a
tails is p and heads is 1 − p. If your coin is highly biased and the probability of heads or tails is 1, then H(X) = 0; i.e., the random variable is always the same and has no information content or randomness. If the coin is fair, i.e. P(X) = .5 for both heads and tails, then H(X) = 1; i.e., we have maximum randomness.

From this example, there are a few intuitive conditions that one might come up with for such a function to exist: (a) the function is symmetric, and (b) it has a maximum at .5. There is actually one more technical condition which we won't get into here, but the three together ensure that the definition of H(X) is unique up to a constant. If you are interested in learning more about this, I would recommend reading Claude Shannon's wonderful 1948 paper A Mathematical Theory of Communication, where in a tour de force of utter clarity and brilliance he laid out the foundation for information theory, signal processing, and really data science at large.

Back to decision trees

Given a root node and a training set T:

1. calculate the entropy for each feature of your decision tree classifier;

2. split T into subsets using the feature for which entropy is minimum, i.e. for which we are reducing the randomness of our decision as much as possible;

3. make a decision tree node containing that feature (i.e. split);

4. continue on the above subsets with the remaining features.

Another way to say this is that at each split we will maximize the information gain; a sketch of the computation follows below.
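Here is a small sketch of the entropy and information gain computations that drive these splits, assuming numpy arrays of discrete labels and a single categorical feature (names are illustrative):

import numpy as np

def entropy(labels):
    # H(X) = -sum_i p_i log2(p_i), in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(len(labels))
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    # IG(f) = H(T) - sum over subsets s of p(s) H(s), where each subset
    # collects the labels sharing one value of the feature.
    gain = entropy(labels)
    for value in np.unique(feature):
        subset = labels[feature == value]
        gain -= (float(len(subset)) / len(labels)) * entropy(subset)
    return gain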
Use remark 5.11 to show that our estimate for w is (1, 0.05), which is the exact result in the uncorrelated model. (Note: since v_1, v_2 are the standard basis, the coefficients for w estimated by 5.11 will be in the standard basis as well.)

4. Suppose instead that we measured Y = (1.15, −0.85, 1.15, −0.85). So the noise was correlated. Show that our estimate is w = (1, 0.15). The coefficient for w_2 is three times larger than before.

5. What will happen if we use the second coefficient to predict returns during uncorrelated times? What about during times when the error is negative and correlated? What about positive and correlated?

Note that if errors were always correlated, say with correlation matrix Σ, one should solve the generalized least squares problem

w := argmin_w (Xw − Y)ᵀ Σ⁻¹ (Xw − Y).

This can be seen by reworking the example in section 5.2.2, starting with the assumption E ~ N(0, Σ).

One could also take the viewpoint that, while we cannot hope to specify the error model correctly, we do know a priori that the coefficient of x² should be smaller than that of x. In this case, we could use a prior that penalizes w_2 more strongly than w_1, for example one proportional to

exp(−w_1² / (2σ_1²)) · exp(−w_2² / (2σ_2²)),  with σ_2 < σ_1.

Alternatively, we could fit our model to fake data that was perturbed by different noise models. If the results show wild swings in the coefficients, then we should be cautious.
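As a sketch, the generalized least squares estimate can be computed from the transformed normal equations XᵀΣ⁻¹Xw = XᵀΣ⁻¹Y (the names are illustrative, and Sigma must be positive definite):

import numpy as np

def gls_fit(X, Y, Sigma):
    # Solve X^T Sigma^{-1} X w = X^T Sigma^{-1} Y without forming Sigma^{-1}.
    Si_X = np.linalg.solve(Sigma, X)   # Sigma^{-1} X
    Si_Y = np.linalg.solve(Sigma, Y)   # Sigma^{-1} Y
    return np.linalg.solve(np.dot(X.T, Si_X), np.dot(X.T, Si_Y))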
the standard in of cut; cut in turn extracts the first column and passes the result to sort, which sends its result to uniq; uniq -c counts the number of unique occurrences of each word.

Let's decompose this step by step. First, print infile.csv to stdout (which is, by default, the terminal) using cat:

$ cat infile.csv
ian,1
daniel,2
chang,3
ian,11

Second, pipe this to cut, which will extract the first field (the -f option) in this comma-delimited (the -d, option) file:

$ cat infile.csv | cut -d, -f1
ian
daniel
chang
ian

Third, pipe the output of cut to sort:

$ cat infile.csv | cut -d, -f1 | sort
chang
daniel
ian
ian

Fourth, redirect the output of sort to uniq:

$ cat infile.csv | cut -d, -f1 | sort | uniq -c
1 chang
1 daniel
2 ian

It is important to note that uniq counts unique occurrences in consecutive lines of text. If we did not sort the input to uniq, we would have

$ cat infile.csv | cut -d, -f1 | uniq -c
1 ian
1 daniel
1 chang
1 ian

uniq processes the text stream line by line and does not have the ability to look ahead and see that "ian" will occur a second time.

1.4 Text

One surprising thing to some Unix newcomers is the degree to which simple plain text dominates. The preferred file format for most data files and streams is just plain text.

Why not use a compressed binary format that would be quicker to read/write using a special reader application?
There is one issue with this, however, in that map() will get all results at once, and all the results may be too big to fit in memory. So instead we use a multiprocessing version of imap. imap returns an iterator that allows us to step through the results one by one as they become available. There is an additional parameter, chunksize, that specifies the size of the chunks to send between the workers and the master at one time.

We will re-use the functions process_file() and is_stopword() verbatim, so we don't re-write them here. The main() function is significantly changed and now supports both single- and multi-processing modes. There is an additional function, imap_wrap(), along with a couple of re-definitions, which are necessary if we want to exit with Ctrl-C rather than explicitly killing the process with ps.

import sys
import itertools
from os import listdir
from os.path import isfile, join
from functools import partial
from multiprocessing import Pool
from multiprocessing.pool import IMapUnorderedIterator, IMapIterator

import nltk

def main():
    basepath = '/home/langmore/jrl1/enron/data/raw/enron_spam_all'
    allfiles = [f for f in listdir(basepath) if isfile(join(basepath, f))]
    # Number of slave processes to start
    n_procs = 2
    # The size chunk to send between slave and master
    chunksize = 10
    # The part of speech type that we will keep
    pos_type
The information gain from splitting T on a feature f is

IG(f) = H(T) − Σ_{s ∈ S} p(s) H(s),

where H(T) is the entropy of T, S is the collection of subsets created by splitting T over feature f, and p(s) is the proportion of the number of elements in s to the number of elements in T, i.e. the relative size of s. The above is known as the ID3 algorithm, invented by Ross Quinlan.

Part V: Extras

Chapter 10: High(er) performance Python

To many people, high performance scientific code must be written in a language such as C/C++ or Fortran (or even a GPU language). It is true that these languages will continue to support applications where the highest possible performance is absolutely essential. For example, some climate simulations can take months of time on a supercomputer, and therefore even a modest 2x speedup of code can make a huge difference. However, this is not always the case. In the authors' experience, the time spent writing code (a.k.a. developer time) usually far exceeds the time spent waiting for a computer to crunch numbers (a.k.a. CPU time). Since Python code is generally easier to write and maintain than, e.g., C++, the project goal (e.g. predicting an outcome) can be achieved far quicker and cheaper using Python. Moreover, Python does not require a developer to manage low-level tasks such as memory allocation, so a non-expert programmer (e.g. a person with a PhD in statistics) can often be used.

The fact that Python does not require a programmer to manage memory al
[Figure 9.3: Our example ROC; the horizontal axis is the false positive rate.]

Example 9.1. Suppose you have a test set of 10 cases whose actual class you know, and your classifier outputs the following numbers. The cases are listed below in descending order.

case    classifier output    actual class
1       .97                  True
2       .93                  True
3       .87                  False
4       .70                  False
5       .65                  True
6       .08                  False
7       .43                  True
8       .33                  False
9       .21                  True
10      .05                  False

With the first threshold, +∞, you classify no cases as true and hence have no true or false positives; i.e., the first point on the ROC is (0, 0). With the second threshold you will classify case 1 as true; since this is correct, your true positive rate will now be 1/5 (since there are 5 positives in the test set), and your false positive rate will be 0. With the third threshold you will classify cases 1 and 2 as true, leading to a 2/5 tp rate and a 0 fp rate, etc. In figure 9.3 you can see the ROC plotted for this example.

Below is an efficient ROC algorithm, copied verbatim from its source:

Inputs: L, the set of test examples; f(i), the probabilistic classifier's estimate that example i is positive; P and N, the number of positive and negative examples.

Outputs: R, a list of ROC points increasing by fp rate.

Require: P > 0 and N > 0
  L_sorted ← L sorted decreasing by f scores
  FP ← TP ← 0
  R ← ⟨ ⟩
  f_prev ← −∞
  i ← 1
  while i ≤ |L_sorted| do
    if f(i) ≠ f_prev then
      push (FP/N, TP/P) onto R
      f_prev ← f(i)
    end if
Make sure your numpy is built with an optimized BLAS, e.g. Intel's Math Kernel Library (MKL). This is easy to do if you are using Continuum's Anaconda or Enthought's EPD. There are many reasons that Python for loops are slow. Consider the following loop:

mylist = []  # Create a list
mysum = 0
for item in mylist:
    mysum += item

Since Python is not compiled, the Python interpreter needs to convert this code into machine code every time through the loop. Moreover, since a list can contain a very wide variety of objects, Python needs to figure out how to add these objects (or if they even can be added) every time through. Python also checks that you are not indexing outside the bounds of mylist. None of these checks needs to be done with numpy arrays, since they are of pre-determined shape and data type. Note that it is not efficient to use a for loop on a numpy array; in fact, it is often slower than a list! Numpy is only optimal when you call the built-in numpy functions (e.g. sum, dot, max) that call external BLAS libraries.

10.3.4 Multiprocessing Pools

Many embarrassingly parallel tasks can be thought of as applying a function to a list, and getting back another list. Consider first the following (serial) code:

mylist = []  # create a list

def myfun(item):
    # do something
    return new_item

# same as: [myfun(item) for item in mylist]
myresults = map(myfun, mylist)

So long as myfun(mylist[i]) is independent of myfun(mylist[j]) for i ≠ j, the calls can be evaluated in parallel, as in the sketch below.
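A minimal parallel version of the serial map above; a sketch assuming myfun is defined at a module's top level (a requirement, since the workers must be able to import it):

from multiprocessing import Pool

def myfun(item):
    # do something
    return item * 2

if __name__ == '__main__':
    mylist = range(100)
    pool = Pool(processes=4)              # launch 4 worker processes
    myresults = pool.map(myfun, mylist)   # parallel version of map(myfun, mylist)
    pool.close()
    pool.join()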
7 Models Behaving Well
    7.1 End Notes

III Text Data

8 Processing Text
    8.1 A Quick Introduction
    8.2 Regular Expressions
        8.2.1 Basic Concepts
        8.2.2 Unix Command line and regular expressions
        8.2.3 Finite State Automata and PCRE
        8.2.4 Backreferences
    8.3 Python RE Module
    8.4 The Python NLTK Library
        8.4.1 The NLTK Corpus and Some Fun Things to Do

IV Classification

9 Classification
    9.1 A Quick Introduction
    9.2 Naive Bayes
        9.2.1 Smoothing
    9.3 Measuring Accuracy
        9.3.1 Error metrics and ROC Curves
    9.4 Other classifiers
        9.4.1 Decision Trees
        9.4.2 Random Forest
        9.4.3 Out-of-bag classification
        9.4.4 Maximum Entropy

V Extras

10 High(er) performance Python
    10.1 Memory hierarchy
    10.2 Parallelism
    10.3 Practical performance in Python
        10.3.1 Profiling
        10.3.2 Standard Python rules of thumb
        10.3.3 For loops versus BLAS
This is a bit convoluted, so let's go through it. The first thing to notice is that there are three main groups defined by brackets:

(19|20)

(0[1-9]|1[012])

(0[1-9]|[12][0-9]|3[01])

The first part just makes sure that you are looking for a string that starts with 19 or 20. The second group makes sure that the month starts with either a 0 or a 1 and continues with the appropriate digit, and similarly for the dates. In addition, we have a ^ to signify that we are looking at the start of a line; \d signifies that we are looking for a number between 0 and 9; and the [-/] allows either a dash or a slash.

Note: there are some issues with this particular regular expression, since it will match dates like 2003-02-31, and it will also match dates like 2004/04-12 (with mismatched separators), which you wanted to exclude. We'll see ways to get around this.

8.2.3 Finite State Automata and PCRE

There are a number of different implementations of regular expressions, resulting in varied flexibility and performance. The "original" framework is modelled on finite state machines [WPF], making for a very fast and efficient approach. The complexity of the finite automata implementation (referred to as re2) is O(n), where n is the size of the input, but it does have its limitations. The main reason is that a finite state machine does not "remember" how it arrived at a given state, which prevents it from evaluating a later piece of a regular exp
P(y = 1 | x) = Φ(x · w). In other words, we will set a = 1. From the standpoint of building a model that takes in x and spits out P(y = 1 | x), why doesn't the assumption a = 1 matter?

3. Suppose now that e ~ N(0, (x · v_true)²). Show that our model should be

P(y = 1 | x) = Φ(x · w / (x · v)).

4. Using the notation p_n := Φ(X_n · w / X_n · v), write down the likelihood and negative log likelihood associated to the independent variable matrix X and dependent variable vector Y.

Minimizing the negative log likelihood is obviously more difficult because it involves a nonlinear combination of variables. For that reason, an iterative technique is used, whereby v is fixed and the minimum over w is obtained; then w is fixed and a minimum over v is obtained. This iterative procedure is repeated until the negative log likelihood stops changing very much.

6.3 Multinomial logistic regression

Logistic regression can be generalized to the case where y takes on a number of values. Call each of these classes C_m, for m = 1, ..., M. We can generalize (6.1) to get

log( P(y = C_i | x) / P(y = C_M | x) ) = x · w^(i).

The coefficients w^(i), for i = 1, ..., M − 1, are each vectors in R^K, viz. w^(i) = (w^(i)_0, ..., w^(i)_K). One can solve for the probabilities and arrive at a generalization of (6.2):

P(y = C_i | x) = exp(x · w^(i)) / (1 + Σ_{m=1}^{M−1} exp(x · w^(m))).    (6.6)

The coefficients are determined in a manner similar to two-class logistic regression. That is, we write dow
[htop screenshot: per-core CPU meters and a memory meter.]

Figure 10.4: Screenshot of htop invoked on a laptop with 4 cores and 4 virtual cores (so it appears as 8 cores to the OS). Notice that multiple cores are in use due to a multi-threaded numpy matrix multiply. You can also see that 2821 MB of memory is currently in use. The green memory and CPU usage is "normal usage," and red indicates "kernel usage"; in other words, the OS kernel is invoking this. The most common example of a red CPU line is due to a disk read/write.

You can install htop with sudo apt-get install htop. You can start the utility by typing htop in the command prompt. In addition to monitoring memory usage, htop allows you to see CPU usage, kill processes, and search for processes.

A useful number to keep in mind when calculating memory usage: double precision numbers (the default for Python if your system is 64-bit) take up 64 bits, which is 8 bytes. So an array of numbers that has length 1 million is 8 million bytes, or 8 MB.

10.2 Parallelism

Since, as we saw above, performance gains will not come through increasing the clock speed in the serial pathway of figure 10.1, it is now coming through parallelism. (Footnote: ... the memory usage just doesn't make sense.)

[Figure 10.5: Random forest is an embarrassingly parallel algorithm; each tree is grown from its own random subset of the data.]

Suppose you have a processor with 4 cores and w
You can think of Unix in terms of a grammatical language structure, where

• commands like ls, less, and man are thought of as verbs,

• the objects (files, data) to operate on as nouns, and

• shell operators, such as | or >, as conjunctions,

so what we are missing now are some adjectives; we can think of regular expressions as filling this gap. We've already seen some regular expressions, so let's look at a table of the core syntax.

.        Match any character.
^        Match the start of a string.
$        Match the end of a string or just before the newline character.
\d       Match any decimal digit.
\D       Match any single non-digit character.
\w       Match any single alphanumeric character.
[A-Z]    Match any of uppercase A to Z.
?        Match zero or one occurrence of the preceding regular expression.
*        Match zero or more occurrences of the preceding regular expression.
+        Match one or more occurrences of the preceding regular expression.
{n}      Match exactly n occurrences of the preceding regular expression.
{m,n}    Match from m to n repetitions of the preceding regular expression.

Let's look at some examples. Suppose you have a file people.txt which contains the names of all the people in the class. It looks like:

Kate Student
Jake Student
Ian Professor
Emily Student
Daniel Professor
Sam Student
Chang Professor
Paul Student

If you want something trivial, such as retrieving all lines with professor names, you can use grep Professor people.txt.
and write this as

Y_1 = w_1 + w_2 X_{12} + E_1
  ⋮
Y_N = w_1 + w_2 X_{N2} + E_N

where X_{12} is the age of the first individual, X_{22} the age of the second, and so on.

More generally, we will model each response Y_n as depending on a number of covariates (X_{n0}, ..., X_{nK}), where by convention we select X_{n0} = 1. This gives the matrix equation of linear regression:

Y = Xw + E.    (5.2)

5.2 Coefficient Estimation: Bayesian Formulation

Consider (5.2). The error term E is meant to represent un-modelable aspects of the x-y relationship. The part that we do model is determined by w. It is w, then, that becomes the primary focus of our analysis (although we emphasize that a decent error model is important). To quantify our uncertainty in w, we can take the Bayesian viewpoint that w is a random variable. This leads to insight about how uncertainty in w changes along with the data and model.

5.2.1 Generic setup

Assume we have performed N experiments, where in each one we have taken measurements of K covariates (X_{n1}, ..., X_{nK}). We then seek to characterize the posterior density function p(w | X, Y). This is the probability density function (pdf) of our unknown coefficients w, conditioned on (given that we know) the measurements X, Y. We will always take the viewpoint that X is a fixed set of deterministic measurements, and we therefore will no longer explicitly condition on X. Due to Bayes' rule, this can be decomposed as

p(w | Y) = p(Y | w) p(w) / p(Y) ∝ p(Y | w) p(w).
There are a number of advantages:

• Decision trees are simple and easy to interpret.

• They require comparatively little data cleaning, and can deal well with continuous or categorical variables.

• Evaluation of a data point is fast: O(log n), where n is the number of leaves.

• They can handle mixed-type data.

The main drawbacks are the following:

• Decision trees can easily become complex and lead to over-fitting. One way to deal with this problem is to prune: either by setting the minimum number of samples for a given node to split (if the min samples is high, the tree is forced to make a decision early, fitting the data less), or by setting the max depth, limiting the entire size of the tree.

• The prediction accuracy can become unstable with small perturbations of the data; i.e., a small "shift" can lead to data points falling into the wrong classes. Ensembling many trees, which we will look at below, can help mitigate this problem.

• Finding an optimal tree is NP-complete. Note that the algorithm described above does not guarantee that a globally optimal tree is found, since it simply makes the best decision at any given node, excluding the possibility that a locally less-than-optimal decision might lead to a better overall classifier.

For more information about decision trees, see Python's scikit-learn library.

9.4.2 Random Forest

Random Forest is a popular ensemble method, whose
and it is possible that both training and cross validation error will monotonically increase (see figure 5.3, right). The reader is cautioned, however, that in the "real world" the data is not generated by Y = Xw + E with uncorrelated E, and more complex behaviors can emerge. In particular, the modeler may intend on using the model on real-world inputs that behave differently than the training (or cross validation) input. For example, both the training and cross validation data could be from pre-2008, before the financial crisis, but you will use your model in a post-2012 world. These out-of-time errors can be especially problematic. In this case, the modeler can err on the side of caution and use more regularization than the cross validation cycle above suggests. Another popular method is to reduce the number of variables to a smaller set that is human-interpretable (e.g., a human can check that the coefficient in front of a particular variable is reasonable).

[Figure 5.3: Cross validation and training errors, each plotted against delta. Left: the X matrix has correlated columns, hence the unregularized solution was overfitting. Right: the X matrix has uncorrelated columns, and therefore the unregularized solution was not overfitting.]
The reason is in the question: a special reader application would be needed. As time goes on, many data formats and reader applications come in and then out of favor. Soon your special-format data file needs a hard-to-find application to read it! What about for communication between processes on a machine? The same situation arises. As soon as more than one binary format is used, it is possible for one of them to become obsolete. Even if both are well supported, every process needs to specify what format it is using. Another advantage of working with text streams is the fact that humans can visually inspect them for debugging purposes.

While binary formats live and die on a quick "computer" time scale, human language changes on the scale of at least a generation. In fact, one summary of the Unix philosophy goes:

"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."

This, in addition to the fact that programming in general requires manipulation of text files, means that you are required to master decent text-processing software. Here is a brief overview of some popular programs:

• Vim is a powerful text editor designed to allow quick editing of files and minimal hand movement.

• Emacs is another powerful text editor. Some people find that it requires users to contort their hands a
A singular value decomposition is always possible.

Definition 5.9 (Singular Value Decomposition (SVD)). A singular value decomposition of a matrix X is a set of left singular vectors (u_1, ..., u_N), a set of right singular vectors (v_1, ..., v_K), and a set of singular values (λ_1, ..., λ_{N∨K}), where N∨K is the maximum of N and K, such that

• the v_k form an orthonormal basis for R^K,

• the u_n form an orthonormal basis for R^N,

• Xv_j = λ_j u_j and Xᵀu_j = λ_j v_j for j = 1, ..., N∨K,

• λ_1 ≥ λ_2 ≥ ... ≥ λ_{N∨K} ≥ 0, and if K < N, we have an r ≤ K such that λ_{r+1} = ... = λ_N = 0.

This decomposition is also sometimes written

X = UΣVᵀ,    (5.10)

where the columns of U are the u_j, the columns of V are the v_j, and Σ ∈ R^{N×K} has the λ_j on its diagonal (as far as its diagonal actually goes, since Σ is not necessarily square).

Exercise 5.10.1. Show that (5.10) follows from definition 5.9.

Exercise 5.10.2. Show that the right singular vectors (the v_j) are the eigenvectors of the matrix XᵀX, and the singular values are the square roots of the eigenvalues.

Digression 2 (Simplifying Basis). The SVD of X is a choice of basis under which the operator X acts in a simple manner: Xv_k = λ_k u_k. This "trick" is widely used in mathematics. The most famous example is probably the Fourier series. Here, one chooses a sinusoidal basis to transform functions:

f(x) = Σ_{k=1}^∞ f_k sin(2πkx).

The differential operator d/dx then takes the simple action

f′(x) = Σ_{k=1}^∞ 2πk f_k cos(2πkx).
We augment our data matrix X with a constant zeroth column X_{n0} ≡ 1. Variable standardization is the process of replacing X with X̃, where

X̃_0 = X_0 ≡ 1,
X̃_k = (X_k − μ_k) / σ_k,  k = 1, ..., K.

Solving the ML (maximum likelihood) problem with X̃ gives us the coefficients w̃. They are related to the standard ML coefficients by

w_0 = w̃_0 − Σ_{k=1}^K μ_k w̃_k / σ_k,    w_k = w̃_k / σ_k.

A typical workflow involves first standardizing the data (converting X to X̃), then fitting the coefficients w̃. Second, we inspect the coefficients w̃, looking for irregularities, and re-compute if necessary. If we intend on predicting a new output y given a new input x, we could either standardize x (using the exact same μ_k, σ_k as above), or we could translate w̃ into w and use the un-standardized x with w.

Exercise 5.22.1. For the standardized coefficients of definition 5.22, show that:

1. The constant coefficient w̃_0 is equal to the predicted output y when the input vector x is equal to its mean.

2. The coefficient of X̃_k defined above does not change when the units used to measure X_k change.

3. The translation w̃ ↦ w is as given above.

Moving on to Bayesian MAP estimates, let's revisit the "happiness as a function of height" problem. Now we have

w_δ = argmin_w Σ_{n=1}^N (w_0 + w_1 X_{n1} − Y_n)² + δ(w_0² + w_1²).    (5.23)

Suppose we measure height in meters and find w_δ. In this case, X_{n1} is around 1. Suppose we find that the
Publications, websites, emails, old document scans, social media: the data world today is filled with text, and many times it is up to the data scientist to extract useful information or usable signals. Much of this work falls into the realm of data processing and uses a variety of techniques, from Unix regular expressions to natural language processing. The following three are common examples:

• Text classification: say you start with a collection of newspaper articles, and the task is to properly place each into the appropriate news category. The task would be to first extract useful features from the text (these could be simply words, or nouns, or phrases) and then use these features to build a model.

• Web scraping: your task is to write a program that will crawl a number of websites, process the text, and extract common themes, topics, or sentiment levels.

• Social media trends: your task is to analyze the reaction of Twitter users to particular news events, and to identify those which are currently "trending" or have the potential of going "viral."

8.2 Regular Expressions

Regular expressions [WPR] are an incredibly powerful tool for pattern matching in text. They originated from the automata and formal language theory of the 1950s and, later being integrated into the Unix ed, grep, and awk programs, became an indispensable part of the Unix environment. The power of regular expressions comes from their fle
Since σ′(z) = σ(z)(1 − σ(z)), we find that

∇L(w) = Σ_{n=1}^N (σ(X_n · w) − Y_n) X_n,
                                                                  (6.5)
∇²L(w) = Σ_{n=1}^N σ(X_n · w)(1 − σ(X_n · w)) X_nᵀ X_n,

with ∇L and ∇²L denoting the gradient and hessian matrix (the matrix with kj entry equal to ∂²L/∂w_k∂w_j). One can check that, for any vector v ∈ R^K,

vᵀ ∇²L(w) v = Σ_{n=1}^N (v · X_n)² σ(X_n · w)(1 − σ(X_n · w)),

which is greater than or equal to zero, and strictly greater than zero for all vectors provided the matrix X has rank K. In other words, if the data columns are linearly independent, then the hessian is positive definite for every w, and hence the function L is strictly convex. This implies that any local minimum of L is a global min, which can be shown to be w_true in the limit N → ∞ if the rows X_n are statistically independent samples and the data generating process is as described in section 6.1.3 with w = w_true. To make this last statement plausible, note that in this ideal case the expected value of ∇L(w) is Σ_n (σ(X_n · w) − σ(X_n · w_true)) X_n, of which w = w_true is the minimizing value. Since ∇L(w) is a sum of random variables, we can rescale it by 1/N and see that it should approach its expectation.

Exercise 6.5.1. Show that ∇²L(w) is positive definite for all w if and only if X has rank K.

Digression: Don't truncate linear regression
plt.clf()
plt.plot(x, y, 'b.', ms=18, label='Y')
plt.plot(x_long, p(x_long), 'r-', lw=5, label='3 degree poly')
plt.plot(x_long, p6(x_long), 'g--', lw=6, label='6 degree poly')
plt.xlabel('X')
plt.legend(loc='best')
plt.show()

See figure 5.2. The third degree polynomial fits well, but not perfectly at every point. The six degree polynomial fits the data better, but wiggles in a "crazy" manner and looks like it will "blow up" outside the range of [0, 1]. Furthermore, running this code multiple times shows that all coefficients in the six degree polynomial are highly dependent on the error, and the first three terms are nowhere near the real terms. For statistical inference this is a killer: how can you report these coefficients as "findings" when they so obviously depend on the unmodeled noise?

[Figure 5.2: Polynomial fitting and overfitting; the 3 degree and 6 degree fits plotted over the data.]

For prediction the situation is slightly more nuanced. If we believe future x values will fall outside the interval [0, 1], then we are clearly in trouble. On the other hand, if future values lie inside this interval, then it can be seen that our polynomial, no matter how crazy it is, will predict quite well. Beware, however: in higher dimensions it becomes difficult to determine the range of applicability of your model.
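For completeness, here is a sketch of how x, y, p, p6, and x_long could have been produced; the true cubic and the noise level are made up for illustration:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(20)
y = 1 - 2*x + 3*x**3 + 0.1*np.random.randn(20)   # a noisy cubic

# np.polyfit returns polynomial coefficients; np.poly1d wraps them
# in a callable polynomial.
p = np.poly1d(np.polyfit(x, y, 3))
p6 = np.poly1d(np.polyfit(x, y, 6))

x_long = np.linspace(-0.2, 1.2, 200)   # extends beyond the data range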
N⁻¹L does, however, depend quite a bit on the "difficulty" of the problem, and generally is not interpretable from problem to problem. For this reason, it is usually not a meaningful quantity to share with people. An alternative is the so-called (McFadden's) pseudo R-squared:

R̃² := 1 − L(w) / L_null,

where L_null is the negative log likelihood obtained by using a model with only a constant (the null model). Inspection of the definition reveals that R̃² measures how much our full model improves on the null model. Also, like R-squared, 0 ≤ R̃² ≤ 1. Moreover, just like in linear regression, the ratio L/L_null is a ratio of negative log likelihoods. This means that McFadden's pseudo R-squared is a generalization of R² from linear regression. See bullet point (ii) below (5.25).

Exercise 6.11.3. Suppose your boss says: "Just figure out the R-square of your logistic regression model in the exact same way as you do for linear regression." Tell your boss why this is impossible.

Exercise 6.11.4. Assume our "full model" (the one giving rise to L) is built with a constant and other variables. Show that the in-sample R̃² is between zero and one, with both zero and one as possible values.

6.8 End Notes

A great introduction to convex optimization has been written by Boyd and Vandenberghe. It focuses on problem formulation and is hence applicable for users of the algorithms [BV04].

Chapter 7

Models Behav
and not just the individual characters, we can include a + at the end of the expression; + is special in that it matches one or more occurrences of the preceding regular expression, i.e.

re.findall(r'[a-zA-Z0-9]+', s)

returns the list

['I', 'am', 'going', 'to', 'show', '2', 'or', 'maybe', '10', 'examples', 'of', 'using', 'regular', 'expressions']

Of course, typing character ranges explicitly as we did above can become a bit tedious, so there are special shorthand symbols to make life easier. For example, we could have returned the above list by invoking

re.findall(r'\w+', s)

so now we know that [a-zA-Z0-9] corresponds to \w in RE. If we want all the symbols, including the angled parentheses at the beginning and end of the string, we could call

re.findall(r'.+', s)

If we are looking for all instances where a period appears, we could return that by calling

re.findall(r'\.', s)

Hence, we have learned a few things about regular expression syntax: we have ways of matching certain characters or ranges of characters; there is shorthand notation for common searches; there are special or "wild card" characters; and there are ways of escaping those special characters, namely with a preceding \. Now it's time to look at things a bit more formally.

8.2.2 Unix Command line and regular expressions

We have already seen quite a bit about Unix command-line functions and utilities.
word2:n2 word3:n3
    """
    path = join(basepath, filename)
    with open(path, 'r') as f:
        text = f.read()
    tokens = nltk.tokenize.word_tokenize(text)
    good_words = [t for t in tokens if t.isalpha() and not is_stopword(t)]
    word_pos_tuples = nltk.pos_tag(good_words)
    typed = [wt[0] for wt in word_pos_tuples if wt[1] == pos_type]
    freq_dist = nltk.FreqDist(typed)

    # Format the output string
    outstr = filename + '|'
    for word, count in freq_dist.iteritems():
        outstr += ' ' + word + ':' + str(count)

    return outstr

def is_stopword(string):
    return string.lower() in nltk.corpus.stopwords.words('english')

if __name__ == '__main__':
    main()

Notice how in the above code the I/O is all in main() and the processing is all in process_file(). This is the standard "good practice" of separating interface from implementation. We have also made the choice (again, good standard practice) to push the processing to a function that deals with one single file at a time. This sort of choice is usually necessary for parallelization.

Parallel example

We now write a parallel implementation. We will use the Pool class from the multiprocessing package. This provides an easy way to parallelize embarrassingly parallel programs. Pool launches a "pool" of workers and automatically divides up the work among them. The "work" must be passed to them in the form of an iterable, such as a list. It is meant to mimic the functionality of the map() function.
the set {w : ‖w‖ ≤ M}. Since L is strictly convex and the penalty is convex, L(w) + α Σ_k |w_k| is strictly convex, and this is a global minimum and the unique solution to the optimization problem.

The rest of the proof proceeds by linearizing L(w) around the optimal point w*. The reader is encouraged to consider a one-dimensional problem (w ∈ R), replace L(w) with L(w*) + (w − w*)L′(w*), and consider the consequences.

Continuing with the proof, define f(w) := L(w) + α Σ_k |w_k|. Suppose that w*_k ≠ 0. Then the derivative ∂_k f(w*) exists, and since w* is a minimum it must be equal to zero. This implies (i). To show that (ii) is a possibility, consider the function L(w) = cw, with 0 < c < α, for one-dimensional w. One can verify that w* = 0 and L′(0) = c, so both inequality and equality are possible.

It remains to show that no situation other than (i) or (ii) is possible. This will follow from the fact that we can never have |∂_k L(w*)| > α. To this end, assume that 0 < α < c < ∂_k L(w*) (the case of negative derivative is similar). We then have, with e_k the standard Euclidean basis vector,

f(w* − δ e_k) ≤ f(w*) − δ ∂_k L(w*) + αδ + o(δ) ≤ f(w*) − δ(c − α) + o(δ).

For δ small enough, we must have f(w* − δ e_k) < f(w*), which means w* is not a minimum. This completes the proof.

It should be mentioned that the problem of finding the MAP est
The values (or rather, pointers to the values) are stored at these locations. So, when I write mydict['name'], Python uses a hash function to transform 'name' to a location, and in that location is the string 'ian'. See figure 10.9. The performance significance is that, no matter how long your dictionary is, it takes the same amount of time to find a key in it.

[Figure 10.9: A hash table (of size 8) corresponding to the dictionary {'a': 1, 'b': 2, 'c': 3, ..., 'z': 26}; each key is sent to a slot by hash(key) & 7. In this table, both 'a' and 'z' were mapped to the same table entry, so there was a collision. In reality, this is very, very rare.]

For comparison's sake, you could (and many novices do) implement a slow dictionary by storing two lists: one with the keys and one with the values. Note that a set is also stored as a hash table (with no values, other than an indication that the element is in the set), so finding items in a set is also fast. See figure 10.10.

Iterators

Consider the following code:

mylist = []  # Create a list
for i in range(N):
    mylist[i] = mylist[i] / 10

mylist = range(10000)
myvalues = ['a'] * 10000
for i in xrange(10000):
    mydict[i] = 'a'
myset = set(range(10000))

timeit
Hadoop is an open source implementation of MapReduce. Despite the popularity of Hadoop MapReduce, it is important to realize that this is a highly restrictive method of parallel computing, with a huge overhead required to spawn new processes.

10.3 Practical performance in Python

In this section we introduce you to a few common means to increase the performance of your Python programs.

10.3.1 Profiling

Python code is meant to be easy to understand and maintain. It is generally not good practice to optimize (for performance) everything, since this often results in less straightforward code. To quote Donald Knuth: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." When you do optimize, the proper first step is to profile. Profiling is the practice of examining the time taken by different sections of code. This helps you find the bottlenecks. You then optimize these bottlenecks.

The easiest way to profile a script is with the Unix time utility:

$ time python myscript.py

real    0m24.332s
user    0m47.268s
sys     0m0.792s

The real line is the total wall time of the process invoked directly by the script. This is the time you would measure if you brought out a stopwatch and timed until the process was finished. The user number is the total user CPU time. In the above case, the script launched a master process whic
5.3 Coefficient Estimation: Optimization Formulation
    5.3.1 The least squares problem and the singular value decomposition
    5.3.2 Overfitting examples
    5.3.3 L2 regularization
    5.3.4 Choosing the regularization parameter
    5.3.5 Numerical techniques
5.4 Variable Scaling and Transformations
    5.4.1 Simple variable scaling
    5.4.2 Linear transformations of variables
    5.4.3 Nonlinear transformations and segmentation
5.5 Error Metrics
5.6 End Notes

6 Logistic Regression
    6.1 Formulation
        6.1.1 Presenter's viewpoint
        6.1.2 Classical viewpoint
        6.1.3 Data generating viewpoint
    6.2 Determining the regression coefficient w
    6.3 Multinomial logistic regression
    6.4 Logistic regression for classification
    6.5 L1 regularization
    6.6 Numerical solution
        6.6.1 Gradient descent
        6.6.2 Newton's method
        6.6.3 Solving the L1 regularized problem
        6.6.4 Common numerical issues
    6.7 Model evaluation
    6.8 End Notes
Models such as Naive Bayes provide a strict probability for a class given a set of features; others, such as the popular gradient boosting decision tree classifier, provide a score indicating the level of classification. For example, if you have a two-class problem, the output might be on a scale [a, b], where a and b are some numbers close to 0 and 1, respectively. Here the classifier simply orders your test cases from those closest to class 0 to those closest to class 1. Hence, there is no reason to a priori select some threshold such as .5 as the cutoff; with every new threshold you will get a new set of classifier error quantities, which you can use to understand the general "quality" and robustness of your classifier, as well as an optimal cutoff point. This leads us to Receiver Operator Curves, or ROC.

Given a classifier C, an ROC takes in a threshold value T and returns the tuple (fp rate, tp rate), i.e.

ROC_C(T) = (fp rate, tp rate),

so ROC_C : [−∞, ∞] → [0, 1] × [0, 1], which is a curve in the plane.

There are a couple of points of note here: the point (0, 0) indicates that no actual positives have been predicted, but no false positives either.

[Figure: an example set of ROC points; the horizontal axis is the false positive rate.]
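A small sketch of sweeping thresholds to produce ROC points, following the threshold discussion above (this is the naive version; an efficient algorithm is quoted elsewhere in this section):

import numpy as np

def roc_points(scores, actual):
    # scores: numpy array of classifier outputs; actual: boolean numpy array.
    P = float(np.sum(actual))
    N = float(len(actual)) - P
    points = []
    # Sweep the threshold from +inf down through every observed score.
    for T in [np.inf] + sorted(scores, reverse=True):
        predicted = scores >= T
        tp = np.sum(predicted & actual)
        fp = np.sum(predicted & ~actual)
        points.append((fp / N, tp / P))
    return points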
repository on GitHub, follow the directions at https://help.github.com/articles/creating-a-new-repository.

2.6.1 Linear Move from Working to Remote

To turn in homework, you have to move files from 1 to 4. The basic 1-to-4 workflow would be as follows (note that I use <something> when you must fill in some obvious replacement for the word "something"; if the "something" is optional, I write it in [square brackets]):

• working copy → index: git add <file>. To see the files that differ in index and commit, use git status. To see the differences between working and index files, use git diff [<file>].

• index → HEAD: git commit -m '<message>'. To see the files that differ in index and commit, use git status. To see the differences between your working copy and the commit, use git diff HEAD [<file>].

• HEAD → remote repo: git push [origin master]. This means: "push the commits in your master branch to the remote repo named origin." Note that origin master is the default, so it isn't necessary. This actually pushes all commits to origin, but in particular it pushes HEAD.

You can add and commit at once with git commit -am '<message>'. This will add files that have been previously added. It will not add untracked files; you have to manually add them with git add <file>.

2.6.2 Discarding changes in your working copy

You can replace your working copy with the copy in
[Figure 5.1: The posterior is a compromise between the prior and the likelihood.]

involving Y_1 is a scalar equation, the ‖w‖² term will dominate the posterior unless w is small. Therefore, the w that maximizes the posterior, w_map, will be small. In other words, our best guess will look like a likely draw from the prior. In this way, the posterior is a compromise between the prior and likelihood (see figure 5.1). Specifically:

• If our noise variance σ² was huge, then our posterior would be dominated by the prior, and our best guess w_map would be close to zero.

• If a priori we were certain that w_true was close to zero, then σ_w would be small. As a result, the ‖w‖² term would be highly dominant and w_map ≈ 0.

• The likelihood slightly perturbs w_map in a direction that fits the data.

• It is easy to show that the right-hand side of the above minimization problem is strictly convex, so one unique solution exists.

Once we include all N experiments, our posterior is

p(w | Y) ∝ p(Y | w) p(w) ∝ exp(−‖Xw − Y‖² / (2σ²)) · exp(−‖w‖² / (2σ_w²)).

Since the product of two Gaussians is Gaussian, our posterior is Gaussian. Our MAP solution becomes

w_map = argmin_w ‖Xw − Y‖² / (2σ²) + ‖w‖² / (2σ_w²).    (5.5)

Now the term ‖Xw − Y‖² = Σ_n (X_n · w − Y_n)² is a sum of N terms. If N ≫ K (if we have much more data than unknowns), it will dominate, and w_map will be chosen to fit the data.
It is usually better to be safe and err on the side of underfitting. Note that this is an example of the more general result of equation (5.13).

Exercise 5.15.1 (Misspecification of error correlations). Let's model percentage stock returns among four different stocks (the relative percentage change in stock price over one day). We will model tomorrow's returns as depending on today's returns via the relation y = w_1 x + w_2 x² + e, where e is uncorrelated for every stock (recall that ordinary least squares implicitly makes this assumption). Let's assume the real-life model is y = x + 0.05x² + η, where η for different stocks is sometimes uncorrelated and sometimes correlated. Since, on most days, returns are around ±1%, we have approximately a 10-to-1 signal-to-noise ratio and think that we're OK. Taking measurements of the stock prices on one day, we get our data matrix (first column is the returns, second column is the squared returns):

X = (  1  1 )
    ( −1  1 )
    (  1  1 )
    ( −1  1 )    (5.16)

1. Use exercise 5.10.2 to show that the right singular vectors are v_1 = (1, 0) and v_2 = (0, 1), and the singular values are λ_1 = λ_2 = 2.

2. Use the fact that Xv_j = λ_j u_j to show that u_1 = 0.5 · (1, −1, 1, −1)ᵀ and u_2 = 0.5 · (1, 1, 1, 1)ᵀ.

3. Suppose we measured returns the day after and got Y = (1.25, −1.15, 0.85, −0.75). One could use y = x + 0.05x² + η to explicitly calculate η and infer that the noise was basically uncorrelated (the η_n were not related to one another).
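A quick numpy check of these claims (a sketch):

import numpy as np

X = np.array([[ 1., 1.],
              [-1., 1.],
              [ 1., 1.],
              [-1., 1.]])
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s)    # both singular values equal 2

Y = np.array([1.25, -1.15, 0.85, -0.75])
w = np.linalg.lstsq(X, Y)[0]
print(w)    # approximately [1.0, 0.05]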
There are many uses for this relatively trivial approach.

Let's take a sample problem. Suppose you built a news aggregation site which pulls in article publication data from a myriad of sources. In order to organize this information, you would like to be able to classify the incoming articles automatically and place the corresponding links in the appropriate section on your site. Say, for simplicity, you have three classes of articles, labelled leisure, sports, and politics.

If you had no information at all about the articles themselves, you would be tempted to simply assign the most common label. For example, if on an initial scan you realized you had 20% leisure, 30% sports, and 50% politics, your best bet would be to assign politics to every article and then regularly redo the count. However, suppose you knew that the word "sports" appeared in 2% of the leisure, 95% of the sports, and 3% of the politics articles; you'd likely want to shift the initial (or prior) distribution of labels by this new information. With a few seconds of thought you'd probably write down the following:

P(label | hasword("sports")) ∝ P(hasword("sports") | label) · P(label),

with

P(hasword("sports") | label) = count(hasword("sports") and label) / count(label);

i.e., the new article would receive a classification score of .2 × .02 for leisure, .3 × .95 for sports, and .5 × .03 for politics, which is more sensible given the new information. If you con
the least squares optimization problem

w_map = argmin_w ‖Xw − Y‖² + δ‖w‖²,    (5.6)

with δ = σ²/σ_w². In addition, solving the maximum likelihood problem gives us the classic least squares problem of finding a w such that Xw best approximates Y:

w_LS := argmin_w ‖Xw − Y‖².    (5.7)

Exercise 5.7.1. Suppose X ∈ R^{N×K} and Y ∈ R^N. In other words, suppose you take N measurements and use K covariates (possibly including a constant).

1. What are the conditions on N, K, and the rank of X that ensure we have a unique solution to the unregularized least squares problem?

2. What are the conditions on N, K, and the rank of X that ensure we have an infinite number of solutions to the unregularized least squares problem?

3. What are the conditions on N and K that prevent us from having any solution to Xw = Y for every Y?

Solving the least squares problem can be done explicitly by multiplying both sides by Xᵀ, yielding the normal equations

XᵀX w_LS = XᵀY.

Assuming XᵀX is nonsingular, we can (in principle) invert it to find w_LS (note that at this point we have not proved this actually finds the w that minimizes (5.7)):

w_LS = (XᵀX)⁻¹ XᵀY,  assuming XᵀX is non-singular.

Plugging Y = X w_true + E into this, we have

w_LS = (XᵀX)⁻¹ XᵀX w_true + (XᵀX)⁻¹ XᵀE = w_true + (XᵀX)⁻¹ XᵀE.

The second term is error that we hope is small. Roughly speaking, if X "squashes" some signals, then (XᵀX)⁻¹ will make so
…ession

Suppose we are told to determine the probability that a customer will keep their membership with "Super Fitness Gym" for longer than one year. We ask the general manager for data, and she gives us height, weight, age, and length of gym membership in months for 10,000 customers (ignore the fact that some customers have been there less than a year and have not had the chance to be there a full year). Now we have membership information as a semi-continuous variable (membership length = 1, 2, …), but we are asked to predict a binary outcome (either Yes or No). One approach would be to set Y = 1 if membership length is greater than 12, and 0 if it is less. However, this approach throws out information about how long someone has been at the gym. For example, people who quit in 1 month are treated the same as those who quit in 11. A better approach would probably be to use a linear regression model where we try to predict the number of months that the membership will last. To turn this into a probability, we would use a Bayesian approach to determine p(y | x, Y), as in section 5.2.1, equation (5.4). Since the data is discrete but ordered (1, 2, 3, …), a better approach would be so-called ordinal regression. Since some people have not quit (hence we don't know how long they will be there), the best approach would be right-censored ordinal regression.

The Bayesian approach proceeds as in chapter 5. That is, we choose a prior p(w) and form the posterior p(w | Y) ∝ p(w) p(Y | w).
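For the first (binary) approach described above, the label construction is one line; the array below is a hypothetical set of membership lengths:

    import numpy as np

    # Hypothetical membership lengths, in months.
    membership_months = np.array([1, 11, 14, 30, 6])

    # Y = 1 if the customer stayed longer than one year, else 0.
    # Note how 1 month and 11 months collapse to the same label.
    Y = (membership_months > 12).astype(int)
    print(Y)  # [0 0 1 1 0]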
…exactly what is happening.

Theorem 5.19. If $\delta > 0$ then the solution to (5.18) exists, is unique, and is given by the formula

    w_delta = (X^T X + delta^2 I)^{-1} X^T Y
            = sum_{k=1}^K [lambda_k (u_k . Y) / (lambda_k^2 + delta^2)] v_k.

Proof. The second equality follows from definition 5.9 and exercise 5.10.2. To show the first equality, let $F(w) := \|Xw - Y\|^2 + \delta^2 \|w\|^2$. We then have, for any vector $z$,

    F(w_delta + z) = F(w_delta) + 2 z^T [(X^T X + delta^2 I) w_delta - X^T Y] + ||Xz||^2 + delta^2 ||z||^2
                   = F(w_delta) + ||Xz||^2 + delta^2 ||z||^2.

Since the second term vanishes only when z = 0, we see that $F(w_\delta + z)$ is minimized when z = 0. Thus $w_\delta$ minimizes F and we are done.

Remark 5.20. As $\delta \to 0$, it is easy to see that the regularized solution converges to the minimal norm solution. For $\delta > 0$, the solution components associated with the smaller singular values are attenuated more. This is fortunate, since these components can be quite troublesome, as example 5.17 has shown.

5.3.4 Choosing the regularization parameter

Much has been written on techniques, Bayesian or otherwise, to choose optimal regularization parameters. These (asymptotic) optimality results usually rely on assumptions about the error model, i.e. that your error model is correct, or even that the error is uncorrelated with the covariates. This is unfortunate, since this is usually not the case. There is also the method of hierarchical Bayesian modeling. Here, the prior is left unspecified and the data determines it. While accepting that these
…

    if L_sorted[i] is a positive example then
        TP <- TP + 1
    else                     # i is a negative example
        FP <- FP + 1
    end if
    i <- i + 1
    end while
    push (FP/N, TP/P) onto R     # this is the point (1, 1)

9.4 Other classifiers

9.4.1 Decision Trees

Decision trees form a fundamental part of the classification (as well as the regression) toolbox. They are conceptually simple and in many instances very effective. There are a few basic terms necessary before we dive in: a decision node (or simply node) is a placeholder in the tree which signifies a decision being made (in the case of a binary feature, the split will be on whether an item from the data set satisfies the feature conditions); the leaves of a decision node are class labels.

The basic algorithm is the following:

1. Begin with a decision stump. This is simply a collection of trees consisting of a single node, one for each feature.

2. Pick the best one, i.e. given some sort of accuracy measure (splitting objective), select the single-node tree which is the most accurate.

3. Check the accuracy of each leaf by evaluating the assigned classification on the associated subset of data. If the accuracy is not sufficiently high, i.e. does not satisfy some pre-defined criteria, continue building the tree as in step 1, utilizing the unused features on the associated subset of data.

Geometrically, decision trees tile your data space with hyper-rectangles, where each rectangle is assigned a class label.
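In practice one rarely grows trees by hand; a sketch with scikit-learn's DecisionTreeClassifier (the toy data here is made up):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy training set: two features, binary labels.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.9, 0.9]])
    y = np.array([0, 0, 1, 1, 1])

    # max_depth caps how many times we keep splitting (step 3 above).
    clf = DecisionTreeClassifier(max_depth=2)
    clf.fit(X, y)
    print(clf.predict([[0.2, 0.8], [0.8, 0.2]]))  # e.g. [0 1]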
…ference of coefficients) is clear. Either avoid this "large error" scenario (by modeling better and avoiding directions in which your data is sparse), or give up on modeling these particular coefficients accurately. For prediction, the situation isn't so clear. Suppose we have a new input x. Our "best guess" output is

    y_hat = x . w_LS = sum_{k=1}^K [ w_true . v_k + (E . u_k) / lambda_k ] (x . v_k).        (5.13)

Assume we are in the "large error" scenario as above. Then the term in square brackets will be large. If however the new input x is similar to our old input, then $x \cdot v_k$ will be small (on the order of $\lambda_k$), and the error in the kth term of our prediction will be small. In other words, if your future input looks like your training input, you will be ok.

Exercise 5.13.1. Assume we measure data but one variable is always exactly the same as another. What does this imply about the rank of the variable matrix X? Show that this means we will have at least one singular value $\lambda_k = 0$ for some $k \le K$. Since singular values change continuously with matrix coefficients, this means that if two columns are almost the same, then we will have a singular value $\lambda_k \ll 1$. This is actually more dangerous, since if $\lambda_k = 0$ your solver will either raise an exception or not invert in that direction, but if $\lambda_k \ll 1$ most solvers will go ahead and find a "noisy" solution.

Exercise 5.13.2. Us…
    # Close the input file (but not stdin).
    if infilename:
        infile.close()


    def delete(infile, outfile, deletion_rate):
        """
        Write later, if module interface is needed.
        """
        # Keep every line whose number is not a multiple of deletion_rate.
        for linenumber, line in enumerate(infile):
            if linenumber % deletion_rate != 0:
                outfile.write(line)


    if __name__ == '__main__':
        main()

Note that:

- The interface to the external world is inside main(), and the implementation is put in a separate function, delete(). This separation is useful because interfaces and implementations tend to change at different times. For example, suppose this code was to be placed inside a larger module that no longer read from stdin.

- The OptionParser module provides lots of useful support for other types of options or flags.

- Other, more useful utilities would perform functions such as subsampling, cutting certain columns out of the data, reformatting text, or filling missing values. See the homework.

Part II

The Classic Regression Models

Chapter 4

Notation

4.1 Notation for Structured Data

We establish notation here for structured two-dimensional data, that is, data that could be displayed in a spreadsheet or placed in a matrix.

The most important and possibly confusing distinction in predictive modeling is that between the training set and a new input. We will use capital letters such as X, Y to denote the vectors/matrices of training data, and then lowercase to denote the new data or the model. For example, we would have the model $y = x \cdot w$. Then we c…
…for multi-threading is the OpenMP library. Due to the GIL, multi-threading directly in Python is not usually done, but libraries such as numpy execute C/Fortran code outside of Python and thus avoid the GIL. In this way, numpy is able to perform parallel programming, such as matrix-vector multiplications, through multi-threading, thus avoiding the overhead of spawning new processes. The most popular and versatile multiprocessing library is the Message Passing Interface (MPI). This library provides a number of ways to start new (and communicate between) processes. It also provides a number of so-called reduction operations, whereby data stored on multiple processes is "reduced" to one single number. For example, if each process holds a segment of an array, a possible reduction would involve computing the sum of all the numbers in the array. The processes can each sum their respective sub-arrays, and then send their result to the master process, which adds all the sub-sums. Although MPI is in wide use, it does not provide ways to restart failed tasks. When you have fewer than, say, 100 computers, this is not a big deal. When however you are Google and regularly perform jobs using thousands of machines, this is important. For that reason, the MapReduce paradigm was developed. MapReduce allows mapping of tasks to individual processes (located on different machines). It then provides support for different types of reduction, as well as restarting of failed tasks.
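Python's standard library can express the sum-reduction example directly; a minimal sketch with multiprocessing (four worker processes, made-up array):

    import numpy as np
    from multiprocessing import Pool

    def partial_sum(chunk):
        # Each worker sums its own sub-array.
        return chunk.sum()

    if __name__ == '__main__':
        data = np.arange(1000000, dtype=float)
        chunks = np.array_split(data, 4)
        pool = Pool(processes=4)
        # The "reduction": the master adds up the four sub-sums.
        total = sum(pool.map(partial_sum, chunks))
        pool.close()
        print(total)  # same as data.sum()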
…g, super computing, and web servers. In addition, Mac OSX (which is Unix based) and a variety of user-friendly Linux operating systems represent a significant portion of the personal computer market. To understand the reasons for this success, some history is needed.

In the 1960s, MIT, AT&T Bell Labs, and General Electric developed a time-sharing (meaning different users could share one system) operating system called Multics. Multics was found to be too complicated. This "failure" led researchers to develop a new operating system that focused on simplicity. This operating system emphasized ease of communication among many simple programs. Kernighan and Pike summarized this as "the idea that the power of a system comes more from the relationships among programs than from the programs themselves."

[Figure 1.1: Ubuntu's GUI and CLI. The original shows a desktop screenshot with terminal output of ls -lth.]

The Unix community was integrated with the Internet and networked computing from the beginning. This, along with the solid fundamental design, could have led to Unix becoming the dominant computing paradigm during the 1980's personal computer revolution. Unfortunately, infighting and poor business decisions kept Unix out of
…gatives, i.e. the ratio of how many times the model got something right vs wrong. There are a few essential terms that come along with the territory: an instance that is positive and is classified as such is counted as a true positive; if the same positive instance is classified as negative, then it is called a false negative. Similarly for false positive and true negative. From this you naturally arrive at a 2x2 matrix of true and predicted classes, called a confusion matrix or a contingency table, as you can see below.

                      actual value
                      p                 n
    prediction   Y    True Positive     False Positive
    (outcome)    N    False Negative    True Negative
                      total: P          total: N

Figure 9.1: Predicted vs actual in a 2-class situation.

Given the long history and wide application of classifiers, there is a plethora of terminology for the key quantities:

- True positive rate (or hit rate, recall, sensitivity):

    tp rate = positives correctly classified / total positives

- False positive rate (or false alarm rate):

    fp rate = negatives incorrectly classified / total negatives

- Specificity (true negative rate):

    specificity = true negatives / (false positives + true negatives) = 1 - fp rate

To get a feeling for what these quantities represent, let's just look at a few. Sensitivity, or the true positive rate, is the proportion of the actual positives correctly classified as such, and the specificity, or true negative rate, is the proportion of actual negatives correctly classified as such.
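These counts and rates are a few lines of numpy; the labels below are made up:

    import numpy as np

    actual    = np.array([1, 1, 0, 0, 1, 0, 1, 0])
    predicted = np.array([1, 0, 0, 1, 1, 0, 1, 0])

    tp = np.sum((predicted == 1) & (actual == 1))  # 3
    fn = np.sum((predicted == 0) & (actual == 1))  # 1
    fp = np.sum((predicted == 1) & (actual == 0))  # 1
    tn = np.sum((predicted == 0) & (actual == 0))  # 3

    tp_rate = tp / float(tp + fn)       # sensitivity / recall = 0.75
    fp_rate = fp / float(fp + tn)       # false alarm rate     = 0.25
    specificity = tn / float(fp + tn)   # = 1 - fp_rate        = 0.75
    print('tp rate %.2f, fp rate %.2f, specificity %.2f'
          % (tp_rate, fp_rate, specificity))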
…ginal feature space, your separating hyperplane would be a curve.

6.5 L1 regularization

As in section 5.2, we can choose a prior and form the posterior $p(w|Y) \propto p(w) p(Y|w)$. As with linear regression, a Gaussian prior, $p(w) \propto \exp(-\|w\|^2 / 2\sigma^2)$, is popular. In this section we explore the consequences of choosing a Laplacian prior. The Laplace distribution has density function

    p(w) ∝ exp(-alpha ||w - u||_1),    ||w||_1 = sum_{k=1}^K |w_k|.

As with Gaussian priors, we will usually choose not to regularize the constant, and choose u = 0. In fact, we may wish to weight each term differently, in which case we will use the prior

    p(w) ∝ exp( -sum_{k=1}^K alpha_k |w_k| ),    0 <= alpha_k < infinity.

The MAP estimate is then

    w_MAP = arg min_w  L(w) + sum_{k=1}^K alpha_k |w_k|.        (6.7)

Solving for $w_{MAP}$ is known as using a Laplace prior, Lasso, or L1 regularization, and is related to the field of compressed sensing.

Exercise 6.7.1. Show that the MAP estimate is given by (6.7).

As with the Gaussian MAP in linear regression (5.18), the term $\sum_k \alpha_k |w_k|$ penalizes large w. The effect is different in a number of ways, though. First of all, for large $|w_k|$ we have $|w_k| < |w_k|^2$, so the penalty is smaller. This means that the L1-regularized solution allows for larger $w_k$ than the L2-regularized solution. Second, the optimization problem (6.7) is significantly more difficult than (5.18) because the terms $|w_k|$ are not differentiable. Third, and most importantly, roughly speaking the L1 penalty tends to drive many coefficients exactly to zero, producing sparse solutions.
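A sketch of L1-regularized logistic regression with scikit-learn (synthetic data; in sklearn's parameterization, C plays the role of 1/alpha, so smaller C means stronger regularization):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    np.random.seed(0)
    N = 200
    X = np.random.randn(N, 10)           # 10 covariates...
    z = 2.0 * X[:, 0] - 1.0 * X[:, 1]    # ...but only 2 matter
    y = (z + np.random.randn(N) > 0).astype(int)

    clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
    clf.fit(X, y)
    print(clf.coef_)  # most coefficients driven exactly to zero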
…h in turn started two slaves (three processes total). The master was idle for most of the time, and the slaves worked most of the time. Therefore, the user time was about twice the real time. The sys line is the time spent in system calls, such as I/O.

IPython provides a very nice interface to the timeit module, which we explain with figure 10.7. In IPython, the timeit module runs the code following the command timeit (sometimes %timeit is necessary) a number of times, until it has a good estimate of the time needed for it to run. Figure 10.7 shows that creating a simple list with 10,000 elements takes about 50% longer in a for loop than with a list comprehension. See section 10.3.2 for an explanation of this difference. Note that timeit can also be run with

    def makelist(N):
        mylist = []
        for i in xrange(N):
            mylist.append(letters[i % 3])
        return mylist

    makelist(5)

    timeit makelist(10000)
    100 loops, best of 3: 2.53 ms per loop

    timeit [letters[i % 3] for i in xrange(10000)]
    1000 loops, best of 3: 1.58 ms per loop

Figure 10.7: Figure illustrating use of the timeit module.

    $ kernprof.py -l dottest.py
    Wrote profile results to dottest.py.lprof
    $ python -m line_profiler dottest.py.lprof
    Timer unit: 1e-06 s

    File: dottest.py
    Function: testfuns at line 13
    Total time: 0.005271 s

    Line #   Hits   Time   Per Hit   % Time   Line Contents
             @profile
             def testfuns(arrsize):
…h Linux, this could work for you, or you could be left with a giant headache. Currently there are a number of hardware vendors that ship machines with Linux: System76, ZaReason, and Dell (with their "Project Sputnik" campaign). Mac OSX is built on Unix, and also qualifies as a *nix machine of sorts. The disadvantages of a mac are price, and the fact that the package management system (for installing software) that comes with Ubuntu Linux is the cleanest, easiest ever.

The shell allows you to control your computer using commands entered at a keyboard. This sort of interaction is called a command line interface (CLI). "The shell" in our case will refer to the Bourne-again, or bash, shell. The bash shell provides an interface to your computer's OS, along with a number of utilities and minilanguages. We will introduce you to the shell during the software carpentry bootcamp. For those unable to attend, we refer you to

Why learn the shell?

- The shell provides a number of utilities that allow you to perform tasks such as interacting with your OS or modifying a text file.

- The shell provides a number of minilanguages that allow you to automate these tasks.

- Often programs must communicate with a user or another machine. A CLI is a very simple way to do this. Trust me, you don't want to create a GUI for every script you write.

- Usually the only way to communicate with a remote computer/cluster is using a shell.

Because of this, programs a…
…hat we care about, and measure in step 4, is the cross-validation set error. The hope is that the error $e_{cv}$ will be such that problems associated with overfitting show up here. Note that there is nothing canonical about the choice of error in step 4. If, for example, you care more about large errors, a fourth-order error function could be used. In step 5 we choose $\delta$ as the minimizer of the cross-validation error. Again, different choices could be made. For example, one may have generated the training and cross-validation sets by sampling people in 2010. It may or may not be reasonable to believe people in 2013 will act according to the same model. If they don't, the error function $e(x, z)$ could be much different, and this could cause your model to behave poorly when fed with 2013 data.

Exercise 5.20.1. Use the SVD to show that the mean square training error gets worse monotonically with increasing $\delta$.

Unlike the training (mean square) error, the cross-validation error does not always change monotonically. If, for example, the variable matrix X had highly correlated columns, then the singular values will rapidly decrease. This in turn causes an overfitting effect. In our idealized $Y = Xw + E$ world, this means that w will be far from $w_{true}$. As a result, the cross-validation error will, on average, be convex (see figure 5.3.4, left). If, on the other hand, your variables were uncorrelated, then there is no reason to believe that the model will overfit to the noise.
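The delta sweep is easy to carry out numerically; a sketch (synthetic data with highly correlated columns, ridge solution computed directly from the regularized normal equations):

    import numpy as np

    np.random.seed(1)

    def make_data(N, K, w_true):
        # Highly correlated columns: every column is a noisy copy of one signal.
        base = np.random.randn(N, 1)
        X = base + 0.01 * np.random.randn(N, K)
        Y = X.dot(w_true) + 0.5 * np.random.randn(N)
        return X, Y

    K = 5
    w_true = np.random.randn(K)
    X, Y = make_data(50, K, w_true)          # training set
    X_cv, Y_cv = make_data(50, K, w_true)    # cross-validation set

    for delta in [0.0, 0.01, 0.1, 1.0, 10.0]:
        w = np.linalg.solve(X.T.dot(X) + delta**2 * np.eye(K), X.T.dot(Y))
        train_err = np.mean((X.dot(w) - Y)**2)
        cv_err = np.mean((X_cv.dot(w) - Y_cv)**2)
        print('delta=%6.2f  train=%8.4f  cv=%8.4f' % (delta, train_err, cv_err))

The training error grows monotonically with delta (exercise 5.20.1), while the cross-validation error typically dips and then rises.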
…her or not the article contains a number, and the other if the article contains the numbers 0, 15, 30, 40 (tennis point calls). Whenever you see an article with 0, 15, 30, 40, the posterior will be boosted not only by the fact that these specific numbers are present, but also by the feature that detects numbers at all. This example might seem pathological, but this is essentially what happens when you have dependent features appearing in the naive Bayes setup.

- If the training set is not fully representative, this could lead to many of the $P(l_i | F)$ being equal to zero. Suppose you arrive at a situation where a new article, which should have some particular label $l_i$, contains a feature f which never appears with $l_i$ in the training set. Then the corresponding probability count $P(f | l_i) = 0$, which forces $P(l_i | F) = 0$ no matter what the other counts are telling you. This can lead to misclassification, and frequently does, since it is generally hard to make sure that your training set is truly representative.

  Consider the example in the situation above where you see an incoming article with the phrase "boxing match." This could refer to a sporting event, but can also simply be a simile describing the way the Russian parliament behaves itself. If you have no example of articles labelled politics containing this phrase, your classifier will make a mistake.

- If enough of the probabilities $P(f | l_i)$, $P(l_i)$ are sufficiently small, this can cause numerical underflow when they are multiplied together.
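Two standard remedies for the last two problems (standard practice, not specific to these notes) are add-one (Laplace) smoothing for the zero counts, and summing log-probabilities instead of multiplying probabilities to avoid underflow. A sketch with made-up counts:

    import numpy as np

    # Hypothetical counts: how often each word appears under each label.
    word_counts = {'politics': {'election': 50, 'boxing': 0},
                   'sports':   {'election': 2,  'boxing': 30}}
    label_counts = {'politics': 100, 'sports': 50}
    vocab_size = 2

    def log_score(label, words):
        # Add-one smoothing keeps unseen (word, label) pairs from zeroing
        # out the posterior; logs keep the product from underflowing.
        score = np.log(label_counts[label] / float(sum(label_counts.values())))
        for w in words:
            p = (word_counts[label][w] + 1.0) / (label_counts[label] + vocab_size)
            score += np.log(p)
        return score

    for label in ('politics', 'sports'):
        print('%s %.3f' % (label, log_score(label, ['election', 'boxing'])))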
…igure 6.3: Contours show the level curves of functions. Left: typical gradient descent path. Right: pathological example where gradient descent zig-zags all over the place.

6.6.2 Newton's method

For smooth objective functions where the hessian is not too difficult to compute, Newton's method is a very attractive option. Newton's method finds zeros of functions (points where the function equals zero). We can use Newton's method to find a minimum of L, since if L is smooth, then at a local minimum we will have $\nabla L(w^*) = 0$. So to optimize, we search for a zero of the gradient (and hence have to compute the hessian).

To motivate Newton's method, consider the problem of finding a zero of a one-dimensional function g(w). Suppose we are at a point $w^j$ and want to find a new point $w^{j+1}$. Approximate g with the first terms in its Taylor series,

    g(w^{j+1}) ≈ g(w^j) + (w^{j+1} - w^j) g'(w^j).

Set this equal to zero, and get

    w^{j+1} = w^j - g(w^j) / g'(w^j).

This leads to algorithm 2. In other words, at each point $w^j$ we form a linear approximation to g(w), and use this to find the point at which this linear approximation is zero.

Algorithm 2: Newton's method for root finding (1-D)
1. Choose a starting point w^0, and set j = 0
2. Choose a tolerance tol
3. while err > tol do
4.     Compute g'(w^j)
5.     Set w^{j+1} = w^j - g(w^j) / g'(w^j)
6.     Set err = |w^{j+1} - w^j|
7.     Set j = j + 1
8. end while

In the context of optimization, we are looking for a point where the gradient vanishes, $\nabla L(w) = 0$.
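Algorithm 2 in Python, applied to a toy function (the cube-root example is made up):

    def newton_root(g, gprime, w0, tol=1e-10, maxiter=100):
        """Find a zero of g, starting from w0 (algorithm 2)."""
        w = w0
        for _ in range(maxiter):
            w_new = w - g(w) / gprime(w)
            if abs(w_new - w) < tol:
                return w_new
            w = w_new
        return w

    # Find the zero of g(w) = w**3 - 2, i.e. the cube root of 2.
    root = newton_root(lambda w: w**3 - 2, lambda w: 3 * w**2, w0=1.0)
    print(root)  # 1.2599...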
…ill be singular. This will cause an error in Newton's method. What to do? You could regularize, which would eliminate this error. You could also switch to a solver that does not require inversion of the hessian. Our viewpoint, however, is that a singular hessian points to redundancy in the data, and that finding and eliminating that redundancy should be the first priority. This can be done by eliminating columns from X and checking whether the rank of X changes. If it does not change, then that column was redundant.

6.7 Model evaluation

Logistic regression can be thought of as a classifier that produces a probability of class inclusion. This is no different than a Naive Bayes estimator, and the methods of section 9.3.1 are applicable. In particular, ROC curves are commonly used.

The negative log likelihood is another candidate for measuring the goodness of fit. The first difficulty arising with using the negative log likelihood is that it will increase in magnitude at a rate proportional to N. This means we cannot hope to compare L(w) for different size data sets. We can deal with this by dividing by N, giving the normalized negative log likelihood $N^{-1}L$. This is a good candidate for comparing different models for the same problem. In other words, we can add/subtract variables and see how doing so affects $N^{-1}L$. We can also compare $N^{-1}L$ in the training and test (cross-validation) sets.

The normalized negative log likelihoo…
…ils, we hope to convey the following message: the ability of your model to facilitate prediction and/or inference is only as good as your model's ability to describe the real-world interactions.

Suppose we wish to model height as a function of age. A completely ridiculous model would be

    height = w_0 + w_1 * age.

The constants $w_0$ and $w_1$ are the same for every individual. This model is unrealistic, since it implies first that your age completely determines your height, and second that this relationship is linear. Assuming a deterministic universe, the real model could be written (with y denoting height and x denoting age) $y = f(x, z)$, where z represents a (huge) set of variables, e.g. sex, or every subatomic particle in the universe. Assuming differentiability with respect to x, this can be linearized around some point $(\bar x, \bar z)$ to produce

    y = f(x_bar, z_bar) + (x - x_bar) df/dx(x_bar, z_bar) + R(x, z).

This relation is "almost linear" if the remainder R is small. This would be the case, e.g., if the effects of z were small and x was always close to $\bar x$. In any case, with no assumptions, we can write

    y = w_0 + w_1 x + epsilon(x, z),        (5.1)

with $w_1 = \partial f/\partial x(\bar x, \bar z)$ and $w_0 = f(\bar x, \bar z) - w_1 \bar x$, where the error $\epsilon(x, z)$ accounts for effects not given by the first two terms.

Notice that so far we have not introduced any probabilistic concepts. There is a problem, however, in that we cannot possibly hope to write down the function $\epsilon(x, z)$. To quantify thi…
…imate (6.7) is equivalent to solving the constrained problem

    w* = arg min_w L(w)    subject to    sum_{k=1}^K alpha_k |w_k| <= C,

for some C that depends on both $\alpha$ and the function L. This dependence cannot be known before solving at least one of the problems. However, the set of values taken by the coefficients as C and $\alpha$ are swept from 0 to $\infty$ (a.k.a. the path) is the same. Since normal cross-validation practice involves sweeping the coefficients and evaluating the models, the two methods can often be used interchangeably.

6.6 Numerical solution

Unlike the linear regression problem of chapter 5, which reduced to solving a linear system, logistic regression is a nonlinear optimization problem: the objective function (the function to be minimized), the negative log likelihood L(w), cannot be reduced to solving a linear system. In this chapter we explore iterative methods for finding a solution. These methods give us a sequence of values $w^0, w^1, \dots$ converging to a local minimum $w^*$ of the objective function. Each $w^j$ is found by solving a local approximation to the objective. Note that convexity is needed to assure us that this is a global minimum.

6.6.1 Gradient descent

Perhaps the simplest method for finding a minimum of a differentiable objective function L(w) is gradient descent. This method can be motivated by observing that, locally, a function decreases most in the direction opposite to its gradient (the gradient being the direction of greatest increase).
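A bare-bones sketch of gradient descent on the logistic regression negative log likelihood (synthetic data; the fixed step size is a made-up choice, and a real implementation would use a line search):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    np.random.seed(0)
    N = 200
    X = np.random.randn(N, 2)
    w_true = np.array([1.5, -2.0])
    y = (np.random.rand(N) < sigmoid(X.dot(w_true))).astype(float)

    # Gradient of the negative log likelihood: X^T (sigmoid(Xw) - y).
    w = np.zeros(2)
    step = 0.01
    for j in range(2000):
        grad = X.T.dot(sigmoid(X.dot(w)) - y)
        w = w - step * grad
    print(w)  # roughly w_true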
…ing (5.12), show that if X and E are independent, then

1. $E[w_{LS}] = w_{true}$.

Note that since $\lambda_k^2 = O(N)$ and, in the uncorrelated case, $E \cdot u_k = O(\sqrt N)$, one can show that if the errors are uncorrelated with the covariates, then $w_{LS} \to w_{true}$ as $N \to \infty$.

5.3.2 Overfitting examples

The next exercise and example are instances of overfitting. Overfitting is a general term describing the situation when the noise term in our training data E has too big an influence on our model's fit. In this case, our model will often fit our training data well, but will not perform well on future data. The reason is that we have little understanding of the noise term E, and very little understanding of what it will do in the future.

Example 5.14. The most classic example is that of polynomial fitting. Suppose our actual data is generated by

    y(x) = x + x^2 + x^3 + epsilon(x),    x in {0, 0.1, ..., 1.0}.        (5.15)

We fit this data by minimizing the mean square error of both three and six degree polynomials (equivalently, maximizing the obvious likelihood function). This can be done with the following Python code:

    import scipy as sp
    import matplotlib.pyplot as plt

    x = sp.linspace(0, 1, 10)
    x_long = sp.linspace(-0.1, 1.1, 100)

    y = x + x**2 + x**3 + 0.1 * sp.randn(len(x))

    z = sp.polyfit(x, y, 3)
    p = sp.poly1d(z)
    print '3 degree coefficients: %s' % z

    z6 = sp.polyfit(x, y, 6)
    p6 = sp.poly1d(z6)
    print '6 degree coefficients: %s' % z6

    # (plotting with plt follows)
…ing Well

"All models are wrong, some are useful." - George Box

Stories of modeling gone well and gone wrong.

Define:

- Training model, production model

Tell stories about:

- Newton's model for planetary motion
- Interpretability is more important than precision (story of Black-Scholes)
- Simplicity is more important than precision (Netflix story)
- Those who don't understand the math are doomed to use black boxes
- Your training model should be as close as possible to your production model (MBS modeling)

General remarks on:

- Segregating data into training/cv/test sets
- Variable selection
- Transforming variables, and why a linear-only model isn't enough

7.1 End Notes

The chapter title is a play on Models.Behaving.Badly by Emanuel Derman. This book goes into much detail about the distinction between models and theory.

Part III

Text Data

Chapter 8

Processing Text

8.1 A Quick Introduction

With the influx of information during the last decade came a vast, ever-growing volume of unstructured data. An accepted definition of unstructured data is information that does not have a pre-defined data model or does not fit well into relational tables (if you have not seen relational databases or tables, think of a collection of Python pandas data frames or similar containers). A large part of this category is taken up by text, which is what we will focus on in this chapter.

From news pu…
…ion cycles per second (GHz). The reason why the larger memory is slower has to do with a combination of economics and physics, and is beyond the scope of this text. Note that the actual memory hierarchy is more complicated, with multiple levels of cache and registers (figures 10.2 and 10.3).

[Figure 10.2: Memory hierarchy with more detail, from fastest, most expensive, and smallest (L1 CPU cache) through physical memory (EDO, SD-RAM, DDR-SDRAM, RD-RAM, and more) and solid state memory (SSDs, flash drives) down to mechanical hard drives used as virtual memory: slow, cheap, large capacity. Illustration: Ryan J. Leng.]

What you need to keep in mind is that if the slower component cannot supply the faster upstream component with data, then we have a bottleneck. This memory wall can only be avoided by one of two means. First, the faster upstream component can make repeated use of its data. For example, you can load a training data set from disk into memory and use it to train a logistic regression. The loading of the data is slow because of the disk access. However, this happens only once, and then (for at least a few minutes) the data is repeatedly used in an optimization procedure. The access time difference between
…iori that each coefficient $w_k$, $k = 1, \dots, K$, will have approximately the same magnitude, and that the magnitude of $w_0$ could be much bigger. The choice to not regularize (or to weakly regularize) the constant can be justified by noting that if the un-modeled output $\epsilon$ contained a constant term, then we would likely be better off including this term in our model. Not regularizing the constant allows the constant to vary as much as possible, to capture the constant part of the "noise." The non-constant part of the noise will not affect the coefficient $w_0$ much at all, since constants project strongly only on the first few singular vectors (details are left to the reader).

5.4.2 Linear transformations of variables

In general, one can change basis in each sample $X_{n\cdot}$ to the columns of the K x K invertible matrix A, giving $A^T X_{n\cdot}^T$. This leads to a new variable matrix $(A^T X^T)^T = XA$. Let us set

    w^{A,delta} := arg min_w ||XAw - Y||^2 + delta^2 ||w||^2.

Theorem 5.19 shows that

    w^{A,delta} = (A^T X^T X A + delta^2 I)^{-1} A^T X^T Y.

Setting $\delta$ to zero, we see that $w^{A,0} = A^{-1} (X^T X)^{-1} X^T Y = A^{-1} w_{LS}$.

Exercise 5.24.1. With $v_1, \dots, v_K$ and $\lambda_1, \dots, \lambda_K$ the right singular vectors and first K singular values of X, set A = VC, where the columns of V are the $v_k$ and

    C = ( 1/lambda_1      0       ...      0       )
        (     0       1/lambda_2  ...      0       )
        (    ...         ...      ...     ...      )
        (     0           0       ...  1/lambda_K  )

1. Show that the new variables X' = XA satisfy $X'^T X' = I$. For this reason, A is called a whitening matrix.

2. Show that $w^{A,0} = X'^T Y$.
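A numerical check of the whitening claim (random matrix; note np.linalg.svd returns $V^T$ as its third output):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(20, 3)

    U, lam, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt.T.dot(np.diag(1.0 / lam))   # A = V C with C = diag(1/lambda_k)

    X_white = X.dot(A)
    print(np.allclose(X_white.T.dot(X_white), np.eye(3)))  # True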
…is 212-333-3333 and you can call 5-6pm') will return the matching substrings.

- The search function,

      re.search(pattern, myString)

  scans through a string and returns a re.MatchObject, which always has a boolean value but also a number of methods. For example,

      myString = 'My number is 212-333-3333 and you can call 5-6pm'
      s = re.search(r'\d+', myString)
      if s:
          print s.group(0)

  will print the number 212, since it is the first pattern match in the string. You can also do much more intelligent things with groups. For example, suppose you want to check for the existence of a phone number and a time and, if possible, return both. The following expression will do exactly that:

      s = re.search(r'(\d{3}-\d{3}-\d{4}).*(\d{1}-\d{1})(am|pm)', myString)
      if s:
          print 'this is the whole match: %s' % s.group(0)
          print 'this is the number: %s' % s.group(1)
          print 'this is the time: %s' % s.group(2)
          print 'this is the meridiem: %s' % s.group(3)

  Note, you can do a lot niftier things with groups in Python's re module, such as attributing keys. For example:

      s = re.search(r'(?P<number>\d{3}-\d{3}-\d{4}).*'
                    r'(?P<time>\d{1}-\d{1})(?P<ampm>am|pm)', myString)
      if s:
          print 'this is the number: %s' % s.group('number')
          print 'this is the time: %s' % s.group('time')
          print 'this is the meridiem: %s' % s.group('ampm')

Digression: Difference between match and search

The only difference is that re.match only reports a match beginning at the start of the string, while re.search scans forward through the string for a match anywhere.
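A two-line illustration of the distinction (the string is made up):

    import re

    s = 'call me at 5-6pm'
    print(re.match(r'\d-\d', s))             # None: string does not start with digits
    print(re.search(r'\d-\d', s).group(0))   # '5-6': found mid-string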
…ise.

Figure 1: Drew Conway's Venn Diagram

where the above quantifies the risk associated with this event. Deciding on the best coefficients, and hence P, can be done quite easily by a host of software packages. In fact, anyone with decent hacking skills can achieve the goal. Of course, a simple model such as this would convince no one, and would call for substantive expertise (more commonly called domain knowledge) to make real progress. In this case, a domain expert would note that additional variables, such as the loan-to-value ratio and the housing price index, are needed, as they have a huge effect on payment activity. These variables and many others would allow us to arrive at a "better" model:

    P(makes loan payment) = sigma(x . w).        (1)

Finally we have arrived at a model capable of fooling someone. We could keep adding variables until the model will almost certainly fit the historic risk quite well. BUT, how do we know that this will allow us to quantify risk in the future? To make some sense of our uncertainty about our model, we need to know exactly what (1) means. In particular, did we include too many variables and overfit? Did our method of solving (1) arrive at a good solution, or just numerical noise? Most importantly, how appropriate is the logistic regression model to begin with? Answering these questions is often as much an art as a science, but in our experience, sufficient mathematical understanding is necessary to avoid getting lost.
…ish to perform the axpy operation z = ax + y. A starting point could be to divide the memory address range of x and y into four chunks, and send each chunk to a different core. Each core performs an axpy on each chunk, and then writes the result to the appropriate space in memory (so that the end result is a contiguous array z). If each chunk fits perfectly into each core's cache, and our memory controller can handle these four operations efficiently, then we should see a four-times speedup. This sort of operation is known as embarrassingly parallel, because it does not involve communication between the separate cores.

As another example of an embarrassingly parallel algorithm, consider the random forest. In this case we have n independent trees, each being fit independently (see figure 10.5).

As an example of an algorithm that is not embarrassingly parallel, consider solving the system Ax = b. As mentioned in 5.3.5, for medium to large systems this is usually done with algorithms that repeatedly multiply A by different vectors v, and then add this result to another vector w. Suppose our machine has two cores. Then each multiplication can be decomposed as

    ( A_11  A_12 ) ( v_1 )   ( A_11 v_1 + A_12 v_2 )
    ( A_21  A_22 ) ( v_2 ) = ( A_21 v_1 + A_22 v_2 ).        (10.1)

The first core can handle $A_{11} v_1 + A_{12} v_2$, and thus assemble the top rows of Av. The other core can handle the bottom rows. However, each result must be sent to the location where the corresponding chunk of w is being kept, so that we may perform the addition.
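The decomposition (10.1) is easy to verify with numpy (random blocks):

    import numpy as np

    np.random.seed(0)
    A = np.random.randn(4, 4)
    v = np.random.randn(4)

    # Split A into the four blocks of (10.1) and v into two halves.
    A11, A12 = A[:2, :2], A[:2, 2:]
    A21, A22 = A[2:, :2], A[2:, 2:]
    v1, v2 = v[:2], v[2:]

    top    = A11.dot(v1) + A12.dot(v2)  # what core 1 would compute
    bottom = A21.dot(v1) + A22.dot(v2)  # what core 2 would compute
    print(np.allclose(np.concatenate([top, bottom]), A.dot(v)))  # True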
…ite http://docs.python.org/2/library/re.html#re.MatchObject.

Digression: fnmatch

In case you are wondering if there is a way to run unix shell-style wildcards from within Python, the answer is via the fnmatch module. For example, if you would like to print all filenames in a given directory with a .txt extension (i.e. the equivalent of ls *.txt) you can run

    import fnmatch
    import os

    for file in os.listdir('.'):
        if fnmatch.fnmatch(file, '*.txt'):
            print file

Or, if you would like to convert *.txt to its regex equivalent:

    regex = fnmatch.translate('*.txt')

There are other modules that are great for handling paths and file extensions, such as glob, but the above can be useful from time to time.

8.4 The Python NLTK Library

8.4.1 The NLTK Corpus and Some Fun Things to Do

The NLTK library contains a large corpus, ranging from Moby Dick to a collection of presidential inaugural addresses, which can be used for exploration of the library, model development, etc. You can see the works by typing

    from nltk.book import *
    texts()

and explore individual nltk text objects by their designated text number. For example, text1 is the nltk text object containing Moby Dick. Object methods can be explored as usual by typing text1.<tab> (if you are using IPython or an IDLE with tab completion).

    from nltk.book import *
    text1.name

returns 'Moby Dick by Herman Melville 1851', and f…
…ive rate.

Figure 9.2: The green squares represent classifiers that are better than random, and the red squares represent those worse than random.

…point represents the threshold $\infty$, where no 1's were predicted; the point (1,1) is the other end of the spectrum, where every point was classified as 1, resulting in 100% true positive and false positive rates; the point (0,1) represents the perfect classifier, with no points classified as false and no true classification being wrong. In addition, the diagonal represents the random classifier, since we expect it to get the same proportion right as it gets wrong. Hence, a classifier with an ROC curve above the diagonal is better than random, and below the diagonal is worse than random.

By now you can probably guess how to write a simple ROC algorithm, and we take one verbatim from the literature. Essentially, you order the test cases based on predicted value, from highest to lowest, and then start sequentially including the points in the set of positive classifications. With every new point, if it is in fact positive you will increase your true positive rate, keeping your false positive rate constant; and if it is negative, the true positive rate will stay the same while the false positive rate will increase. Hence, you will either move vertically or horizontally in the plane. Let's look at an example below.

[Figure: ROC example curve; axes are false positive rate (x) and true positive rate (y).]
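A sketch of that algorithm in Python (made-up scores and labels, already sorted by score; each step moves up for a positive, right for a negative):

    import numpy as np

    scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1])
    labels = np.array([1,   1,   0,   1,    1,   0,   0,   0])

    P = labels.sum()
    N = len(labels) - P

    tp, fp = 0, 0
    roc = [(0.0, 0.0)]
    # Sweep the threshold down through the sorted scores.
    for y in labels:
        if y == 1:
            tp += 1   # move up
        else:
            fp += 1   # move right
        roc.append((fp / float(N), tp / float(P)))
    print(roc)  # ends at (1.0, 1.0)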
…l company will be tempted to use a programmer with no statistical knowledge to do this task. Of course, the programmer will fall into analytic traps such as the ones mentioned above, but that might not deter anyone from being content with the output. This anecdote seems contrived, but in reality it is something we have seen time and time again. The current world of data analysis calls for a myriad of skills, and clean programming, database interaction, and understanding of architecture have all become the minimum to succeed.

The purpose of this course is to take people with strong mathematical/statistical knowledge and teach them software development fundamentals(3).

This course will cover:

- Design of small software packages
- Working in a Unix environment
- Designing software in teams
- Fundamental statistical algorithms such as linear and logistic regression
- Overfitting and how to avoid it
- Working with text data (e.g. regular expressions)
- Time series
- And more

(3) Our view of what constitutes the necessary fundamentals is strongly influenced by the team at Software Carpentry [Wila].

Part I

Programming Prerequisites

Chapter 1

Unix

Simplicity is the key to brilliance. - Bruce Lee

1.1 History and Culture

The Unix operating system was developed in 1969 at AT&T's Bell Labs. Today Unix lives on through its open source offspring, Linux. This operating system is the dominant force in scientific computin…
…location does not mean that a fundamental understanding of computer engineering will not help. Used carelessly and without attention to computational limitations, Python can be a very, very slow language.

[Figure 10.1: Simplified memory hierarchy showing disk, main memory (RAM), cache, the computer components they are situated in, and the time (in clock cycles) needed to move data between them: hundreds of cycles between RAM and the CPU cache, millions of cycles between disk and RAM.]

In this chapter we present some basic computer engineering concepts and show how they relate to Python performance. We also present a number of practical examples of what can be done to increase the performance of your code.

10.1 Memory hierarchy

One can think of a computer as a machine that takes data from disk and transfers it to a CPU. The CPU, which is composed of a control unit, arithmetic logic unit, and more, transforms this data, which is then written back to disk or displayed to the user. The term memory hierarchy refers to the hierarchical storage scheme for data. See figure 10.1 for a simplified version that shows memory getting smaller and faster (in terms of the time needed to access it) as you get closer to the CPU core. In figure 10.1 we measure time in CPU clock cycles. The CPU clock is a source of electrical pulses that synchronize CPU operations. As of 2013, typical laptop CPU clocks operate at 2-3 bill…
…m the posterior $p(w|Y) \propto p(w) p(Y|w)$. Gaussian priors are common. Also common is the Laplace prior $p(w) \propto \exp(-\alpha \|w\|_1)$, where $\|w\|_1 = \sum_k |w_k|$ is the L1 norm of w. See section 6.5.

Exercise 6.5.2. Sometimes otherwise well-meaning individuals use linear regression to solve a logistic regression problem. Consider the case of spam filtering. Your independent variables are the number of times certain words appear in the document. For example, the word "V1@gra" is one of them, and of course this word is almost always associated with spam. You are told to produce a model that gives the probability that an email is spam. Your colleague, Randolph Duke, tells you that since the dependent variable y is a number (in this case 0 or 1), he will build a linear regression that takes in x and tries to predict y. Of course the result won't be in the interval [0, 1], but, after training, he will truncate it (or re-scale it) so that it is. What is wrong with Mr. Duke's approach?

Exercise 6.5.3 (Heteroscedastic probit models). A probit regression is just like a logistic regression, but rather than the logistic sigmoid $\sigma(x \cdot w)$, we use the Gaussian cumulative distribution function $\Phi(x \cdot w)$.

1. Following the reasoning in section 6.1.3, show that if $y = 1_{z > 0}$ with $z = x \cdot w_{true} + \epsilon$, with $\epsilon \sim N(0, \Delta^2)$ being iid, then $P(y = 1 | x) = \Phi(x \cdot w_{true} / \Delta)$.

2. If we perform a probit regression, we will assume $P(y = 1 | x) = \Phi(x \cdot w)$.
…main memory and cache can be mitigated in a similar manner. Take for example the axpy operation z = ax + y, where x and y are vectors and a is a scalar. It often happens that x and y are too large to fit in cache. One wrong way to perform the axpy would be to first load as much of x as possible into cache, multiply it by a, then send that result back to memory and load the next bit of x in. After we had multiplied ax, we could start loading both ax and y into cache and add them. A smarter implementation loads smaller chunks of x and y, so that both fit into cache. We then multiply the chunk of x by a and then add it to that chunk of y. This "double usage" of x avoids one cache memory load and speeds up the axpy operation significantly.

[Figure 10.3: Picture of an Intel Core i7 chip ("4th Generation Intel Core Processor Die Map; 22nm Tri-Gate 3-D Transistors"; quad-core die shown; transistor count 1.4 billion; die size 177 mm^2), showing shared L3 cache. L1 and L2 cache, as well as registers, are located on the CPU core.]

The second means to avoid a memory wall is to slow down the CPU clock. Since the clock runs slower, the slower components take fewer CPU cycles to do their work, and thus fewer cycles are wasted. This by itself does not help performance. However, slower clock speeds consume less power.
…me noise terms "blow up." The balance between how our variable selection picks out signals and how our inversion blows up noise is a delicate interplay, best understood in a special basis that we will study in the next section.

5.3.1 The least squares problem and the singular value decomposition

Here we study the singular value decomposition (SVD). This decomposition is useful for analyzing and solving the least squares problem (5.6). It is also the basis of methods such as Principal Component Analysis (PCA). To motivate the SVD, consider the lucky situation where your covariate matrix is diagonal, e.g.

    X = ( 2  0 ),    Xw = Y  <=>  ( 2 w_1 ) = ( Y_1 )
        ( 0  3 )                  ( 3 w_2 )   ( Y_2 ),

from which it easily follows that $w_1 = Y_1/2$ and $w_2 = Y_2/3$. If X were not diagonal but were symmetric, we could find a basis of eigenvectors $v_1, v_2$ such that $X v_k = \lambda_k v_k$. We then write $Y = (Y \cdot v_1) v_1 + (Y \cdot v_2) v_2$ and $w = \tilde w_1 v_1 + \tilde w_2 v_2$. We then have Xw = Y if and only if

    lambda_1 w~_1 v_1 + lambda_2 w~_2 v_2 = (Y . v_1) v_1 + (Y . v_2) v_2,

which implies $\tilde w_1 = (Y \cdot v_1)/\lambda_1$ and $\tilde w_2 = (Y \cdot v_2)/\lambda_2$.

Example 5.8. Consider the matrix

    X = ( 2  1 )
        ( 1  2 ).

- Show that $v_1 = 2^{-1/2}(1, 1)^T$ and $v_2 = 2^{-1/2}(-1, 1)^T$ are an orthonormal basis for $R^2$.
- Show that $v_1$ and $v_2$ are eigenvectors of X.
- Use that fact to find w such that $Xw = (3, 4)^T$.

An eigenvalue decomposition such as in example 5.8 is possible for X only if $X^T X = X X^T$. This is never the case for non-square matrices. Fortunately, a singular value decomposition, which exists for any matrix, plays the same role.
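Example 5.8 can be checked numerically (numpy's eigh returns the eigenvectors of a symmetric matrix as columns):

    import numpy as np

    X = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    Y = np.array([3.0, 4.0])

    lam, V = np.linalg.eigh(X)   # eigenvalues [1, 3]; columns of V are v_k

    # w = sum_k (Y . v_k) / lambda_k * v_k
    w = sum((Y.dot(V[:, k]) / lam[k]) * V[:, k] for k in range(2))
    print(w)         # [0.666..., 1.666...]
    print(X.dot(w))  # recovers Y = [3, 4]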
    timeit myvalues[mylist.index(5000)]
    10000 loops, best of 3: 89.7 us per loop

    timeit mydict[5000]
    10000000 loops, best of 3: 64.3 ns per loop

    timeit 5000 in mylist
    10000 loops, best of 3: 85.7 us per loop

    timeit 5000 in mydict
    10000000 loops, best of 3: 81.2 ns per loop

    timeit 5000 in myset
    10000000 loops, best of 3: 83 ns per loop

Figure 10.10: A performance test showing where a hash table is 1000 times faster than list lookup. Note that mydict will be a dictionary with keys equal to the numbers 0 through 9999, and every value equal to 'a'. The calls myvalues[mylist.index(5000)] and mydict[5000] both return the 5000th entry in mylist/mydict (equal to 'a' in both cases). The list version is 1000 times slower because mylist.index(5000) searches through the list, entry by entry, for the number 5000. The last three calls are boolean tests, each returning True.

This steps through a list and modifies it. We use another list, namely range(N), to provide an iterator i to help us step through. The first time through, i = 0; the second, i = 1; and so on. You can access the iterator by setting (try it):

    >>> mylist = range(10)   # Create a list
    >>> myiter = mylist.__iter__()
    >>> myiter.next()
    0
    >>> myiter.next()
    1

This is more or less what happens when you use for i in range(10): to do ten iterations of some task.
…n a likelihood (or posterior) and maximize it using information about the gradient and possibly the hessian.

Digression: Multinomial versus ordinal

Suppose we build a model for the number of goals scored in a soccer game. Since this number is typically something like 1, 2, 3, or 4, it does not make sense to use linear regression. One approach would be to build a multinomial logistic model where the classes are defined as follows: C1 represents "team scored 1 or fewer goals," C2 and C3 represent "team scored 2, or 3, goals," and C4 represents "team scored 4 or more goals." We could then train the model and recover coefficients for each class, $w^{(1)}, \dots, w^{(4)}$. This however is not a good approach. The main problem lies in the fact that the class probabilities (6.6), and hence the coefficients $w^{(c)}$, are not related in the proper way. (They are related in the sense that they sum to one, which is good.) But this problem is special: an increase in the qualities that allow a team to score 2 points will likely result in them scoring 3 (or more) points. In other words, the quality of a team appears on some sort of continuum. An ordinal model captures this extra structure and allows us to build a better model.

6.4 Logistic regression for classification

Logistic regression can be used for classification by choosing a cutoff $\delta \in (0, 1)$, and classifying input $X_{n\cdot}$ as class 1 (e.g. $\hat y_n = 1$) if $\sigma(X_{n\cdot} \cdot w) > \delta$, and as class 0 if $\sigma(X_{n\cdot} \cdot w) \le \delta$. If $\delta = 0.5$, this amounts to classifying $X_{n\cdot}$ as a 1 exactly when $X_{n\cdot} \cdot w > 0$.
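In code, the cutoff comparison is all there is to it (made-up fitted coefficients and inputs):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    # Hypothetical fitted coefficients and new inputs.
    w = np.array([1.0, -0.5])
    X_new = np.array([[2.0, 1.0], [-1.0, 2.0], [0.5, 0.5]])

    probs = sigmoid(X_new.dot(w))
    labels = (probs > 0.5).astype(int)
    print(probs)   # [0.82, 0.12, 0.56]
    print(labels)  # [1 0 1]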
…nd leads to wrist problems.

- Gedit and Sublime Text are decent text editors available for Linux and Mac. They are not as powerful as Vim/Emacs, but don't require any special skills to use.

- nano is a simple unix text editor available on any system. If nano doesn't work, try pico.

- sed is a text stream processing command line utility available in your shell. It can do simple operations on one line of text at a time. It is useful because of its speed, and the fact that it can handle arbitrarily large files.

- awk is an old-school minilanguage that allows more complex operations than sed. It is often acknowledged that awk syntax is too complex, and that learning to write simple Python scripts is a better game plan.

(Footnote: Any user of Microsoft Word documents from the 90's should be familiar with the headaches that can arise from this situation.)

1.5 Philosophy

The Unix culture carries with it a philosophy about software design. The Unix operating system, and its core utilities, can be seen as examples of this. Let's go over some key rules. With the exception of the rule of collaboration, these appeared previously in [Ray04].

1.5.1 In a nutshell

Rule of Simplicity: Design for simplicity. Add complexity only when you must.

Rule of Collaboration: Make programs that work together. Work together with people to make programs.

1.5.2 More nuts and bolts

We can add more rules to the two main rules above, and p…
…nd workflows that only work in the shell are common. For this reason alone, a modern scientist must learn to use the shell.

Shell utilities have a common format that is almost always adhered to: utilityname options arguments. The utilityname is the name of the utility, such as cut, which picks out a column of a csv file. The options modify the behavior of the program. In the case of cut, this could mean specifying how the file is delimited (tabs, spaces, commas, etc.) and which column to pick out. In general, options should in fact be optional, in that the utility will work without them (but may not give the desired behavior). The arguments come last. These are not optional and can often be thought of as the external input to the program. In the case of cut, this is the file from which to extract a column. Putting this together, if data.csv looks like

    name,age,weight
    ian,1,11
    chang,2,22

then

    cut -d',' -f2 data.csv
    \_/ \________/ \______/
    utilityname  options  arguments

produces (more specifically, prints on the terminal screen):

    age
    1
    2

1.3 Streams

A stream is a general term for a sequence of data elements made available over time. This data is processed one element at a time. For example, consider the data file, which we will call data.csv:

    name,age,weight
    ian,1,11
    chang,2,22
    daniel,3,33

This data may exist in one contiguous block in memory (disk or not). In either case, to process this data as a stream is to read and handle it one element (here, one line) at a time.
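The same stream discipline is easy to mirror in Python (a sketch; reading stdin line by line is how shell utilities consume their input, and memory use stays constant no matter how large the file is):

    import sys

    # Equivalent of "cut -d',' -f2": print the second column of a comma
    # delimited stream, one line at a time.
    for line in sys.stdin:
        fields = line.strip().split(',')
        print(fields[1])

Saved as, say, cut2.py (a hypothetical name), this would run as python cut2.py < data.csv.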
…nt of a data scientist explaining logistic regression to a non-technical audience. One challenge is that the dependent variable in the training data does not explicitly coincide with the model output. For example, consider the data set training.csv:

    age  dosage  recovers
    33   100     1
    22   90      0
    15   90      1
    23   85      0

The variable in the column recovers is one when the subject recovers from cancer, and zero when they do not. dosage is the dosage of some hypothetical drug. A logistic regression could be used to take in age and dosage, and output the probability that the subject recovers. In this case, our model output could look like

    age  dosage  prob_recovers
    33   100     0.85
    22   90      0.6
    15   90      0.7
    23   85      0.4

So our model output is a probability $p \in (0, 1)$, which does not match up with our dependent variable. This is in contrast to the use of logistic regression for classification. There, for example, a cutoff is chosen such that if prob_recovers > cutoff, then we classify the subject as one that will recover. If cutoff = 0.5, then our model output would be (1, 1, 1, 0), which could be compared directly with the training data, as is the case with linear regression.
…om of overfitting. Note however that in one case the second component of $w_{ml}$ is 1000 times larger than in the other case. So we obviously cannot look at raw coefficient magnitude as a symptom of overfitting. Moreover, the normal matrix $X^T X$ will have bottom right entry equal to $\sum_n X_{nK}^2$, which could be huge if we are using millimeters. Many linear algebra packages would have trouble solving the maximum likelihood problem. In other cases this sum could be so large or small that our computer cannot store it. Suppose further that we used both height and health as variables. Then, our choice of units to represent height/health would influence the absolute values of our coefficients. We would not be able to say, "the coefficient of height is 10 times larger, and therefore height is probably more important." Although this is not the proper way to conclusively judge variable importance, it is a good way to get rough guesses that can be used to find model issues (e.g. if the coefficient of health were almost zero, we should be surprised).

The most common solution to this issue is to rescale the columns of X by the sample standard deviation of the columns. In other words, with

    mu_k := (1/N) sum_{n=1}^N X_{nk},    sigma_k^2 := (1/N) sum_{n=1}^N (X_{nk} - mu_k)^2,

we replace the column $X_{\cdot k}$ with $X_{\cdot k}/\sigma_k$. In addition to scaling by $\sigma_k$, it is common to subtract the mean, leading to:

Definition 5.22 (Variable standardization). Assume we have variables as above; the standardized variables are $(X_{\cdot k} - \mu_k)/\sigma_k$.
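In numpy, standardization is two lines (random matrix with wildly different column scales, for illustration):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(100, 3) * np.array([1.0, 1000.0, 0.01])

    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_std = (X - mu) / sigma

    print(X_std.mean(axis=0))  # ~ [0, 0, 0]
    print(X_std.std(axis=0))   # [1, 1, 1]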
…orrectly classified as such. So a perfect classifier will be 100% sensitive and 100% specific. You can interpret the other quantities in a similar fashion. The reason these quantities can be important (as opposed to another standard model error metric, such as RMSE) is that a model and the associated error never live in a vacuum; i.e. the error should not only give you a metric for model comparison, but also convey the model's accuracy and validity as applied to the situation you are trying to predict or understand. For example, suppose that you are asked to build a classifier whose task is to predict which Amazon user is going to open at least one advertisement email in the next week (for, say, better email ad targeting). You might find it acceptable to have a model which gives a true positive and true negative rate of 70%, because you care equally about sending emails to those who will open them and not sending to those who won't. But suppose that next you are asked to build a classifier which is going to predict whether some combination of financial signals should constitute opening a trade position, and you know that the potential loss and gain of any such trade is high. Given that you are likely risk averse, you would probably care most that the false positive rate be as low as possible; i.e. you would be ok with potentially missing a good trade (a false negative) but not ok with losing a large amount on a bad one (a false positive).

Some classification mod…
…we would observe N (x, y) pairs. From this we form the training data sets X, Y, where the n-th row of X is the n-th set of covariates, and the n-th row of Y is the n-th observed output. This training data is used to find a "best fit" $\hat w$, which, given a new input x, can be used to predict an output using $\hat y = x \cdot \hat w$. The hat indicates that $\hat y$ is a prediction, and not the true output for that trial, y. We use the capital letter E to denote error values occurring in the training set, e.g. $Y = Xw + E$, and the Greek letter $\varepsilon$ to denote a single instance of model error, or a new error value concurrent with a prediction, viz. $y = x \cdot w + \varepsilon$.

Suppose we have a data matrix

$$X = \begin{pmatrix} X_{11} & \cdots & X_{1K} \\ \vdots & & \vdots \\ X_{N1} & \cdots & X_{NK} \end{pmatrix}.$$

The n-th row of X is denoted by $X_{n:}$ and the k-th column by $X_{:k}$, the ":" denoting "every element in this dimension."

Notice that, in contrast to some statistics texts, we do not differentiate between random variables and their realizations. We hope the meaning will be clear from the context.

Chapter 5

Linear Regression

"The best material model of a cat is another, or preferably, the same, cat." (Norbert Wiener)

5.1 Introduction

Linear regression is probably the most fundamental statistical model: because of its simplicity, interpretability, and range of applicability, and because more complex models can be studied in "linearized" forms where they reduce to linear regression. Through understanding of mathematical detai…
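The row/column notation maps directly onto numpy slicing. A tiny sketch (our own illustration, not code from the text):

    # X_{n:} and X_{:k} as numpy slices.
    import numpy as np

    X = np.arange(12).reshape(4, 3)  # N = 4 observations, K = 3 covariates

    row_n = X[1, :]   # X_{n:}, the n-th observation (here n = 1)
    col_k = X[:, 2]   # X_{:k}, the k-th covariate across all observations
    print(row_n)
    print(col_k)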
…problem is done (at the minimum they satisfy $u_k = |w_k|$).

Exercise 6.11.1. Show that if $(w^*, u^*)$ is a solution to (6.11), then $w^*$ is also a solution to (6.10).

To solve (6.11), a variety of approaches can be taken. Since the objective function has at least two continuous derivatives, it is possible to replace it with a quadratic approximation (keep the first two terms in a Taylor series), get a best guess, then iterate. This is the same goal as Newton's method, except here we have to deal with constraints. A discussion of how this is done is beyond the scope of this text.

6.6.4 Common numerical issues

Here we discuss common numerical issues encountered when solving the maximum likelihood problem (6.4).

Perfect separation occurs when some hyperplane perfectly separates $\mathbb{R}^K$ into one region where all training points have label 0, and another region where all training points have label 1. As an example, consider a one-dimensional logistic regression where we have two training data points:

    X    Y
    -1   0
     1   1

Before we write any equations, what do you think will happen? Remember that this is not a Bayesian model, and it tries to fit the training data as well as it can. From the model's perspective, it thinks that if x = -1, then y will always be zero. Moreover, if x = 1, the model thinks y will always be 1. What will our model say about the other points? As it turns out, the maximum likelihood sol…
…the optimal coefficients.

2. Bayesian estimates will be flawed, unless the prior is adjusted in accordance with variable scale (or the variables all have the same scale).

3. In all cases, scaled variables are often easier to interpret.

5.4.1 Simple variable scaling

Returning to our height example, suppose we use millimeters and find an optimum $w_{ml}$ using least squares. In other words,

$$w_{ml} = \arg\min_w \sum_{n=1}^N (w_0 + w_1 X_{n1} - Y_n)^2.$$

Suppose we change to meters. Then the heights $X_{n1}$ change to $\tilde X_{n1} = X_{n1} / 1000$. We then seek to find $\tilde w_{ml}$, the maximum likelihood solution in these new variables. Rather than re-solving the problem, we can write the above equation as

$$w_{ml} = \arg\min_w \sum_{n=1}^N \left( w_0 + (w_1 \cdot 1000) \frac{X_{n1}}{1000} - Y_n \right)^2 = \arg\min_w \sum_{n=1}^N \left( w_0 + (w_1 \cdot 1000) \tilde X_{n1} - Y_n \right)^2.$$

Since $w_{ml} = (w_{ml,0}, w_{ml,1})$ minimizes the above sum, we see that the minimizing multiplier of heights in meters is $w_{ml,1} \cdot 1000$. Therefore, $\tilde w_{ml} = (w_{ml,0},\, 1000\, w_{ml,1})$. In other words, when the variable x got 1000 times smaller, its coefficient got 1000 times larger. In either case, solving a simple least squares problem should produce the correct answer, and in both cases the residual error $\|X w_{ml} - Y\|$ will be the same.

Although we are in theory able to ignore variable scaling, practical matters make us reconsider. Recalling our discussion in previous sections, we would like to use huge coefficients as a common sympt…
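A quick numerical check of the scaling argument (our own illustration, not code from the text; the synthetic heights are an assumption of the sketch):

    # Changing height from millimeters to meters multiplies its least
    # squares coefficient by 1000 while leaving the residual unchanged.
    import numpy as np

    rng = np.random.RandomState(0)
    height_mm = rng.uniform(1500, 1900, size=50)
    Y = 0.01 * height_mm + 5 + rng.randn(50)

    def lstsq_fit(x, y):
        A = np.column_stack([np.ones_like(x), x])  # model: w0 + w1 * x
        w, _, _, _ = np.linalg.lstsq(A, y)
        resid = np.linalg.norm(A.dot(w) - y)
        return w, resid

    w_mm, r_mm = lstsq_fit(height_mm, Y)
    w_m, r_m = lstsq_fit(height_mm / 1000.0, Y)  # heights in meters
    print(w_m[1] / w_mm[1])  # ratio is 1000
    print(r_mm)
    print(r_m)               # residuals are equal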
…for women and men. Our approach to teaching is to show exactly how these problems show up numerically, and to allow the reader to decide what should be done.

Digression: Inference and Prediction

Statistical inference is a general term for drawing conclusions from data. This could be testing (and possibly rejecting) the hypothesis that some set of variables (x, y) are independent, or, more subtly, inferring a causal relationship from x to y. Prediction is more specific, and refers to the ability to predict y when given the information x. For example, suppose we model

$$\log\left( \frac{P[\text{Cancer}]}{P[\text{No Cancer}]} \right) = w_0 + w_1 \cdot \text{Age} + w_2 \cdot (\text{cardiovascular health}) + w_3 \cdot (\text{is smoker}).$$

Since smoking is associated with a decrease in cardiovascular health, it is possible that a number of different $(w_2, w_3)$ combinations could fit the data equally well. A typical inference question would be, "can we reject the null hypothesis that smoking does not affect the probability of getting cancer?" In this case, we should worry whether or not $w_3$ truly captures the effect of smoking. In the case of prediction, any combination of $(w_2, w_3)$ that predicts well is acceptable. In this sense, the models used for prediction can often be more "blunt."

In this text we phrase most of our examples as prediction problems, since this avoids adding the messy details of proper statistical interpretation.

If we observe the age of N individuals, we can group the dat…
…we seek the point where $L'(w^*) = 0$, so replace g(w) with $L'(w)$ and you get the iteration step $w^{j+1} \leftarrow w^j - L'(w^j) / L''(w^j)$. In multiple dimensions, this is algorithm 3.

Algorithm 3: Newton's method for optimization
1. Choose a starting point $w^0$ and set j = 0
2. Choose a tolerance tol
3. while err > tol do
4.     Compute $\nabla L(w^j)$ and $\nabla^2 L(w^j)$
5.     Set $w^{j+1} \leftarrow w^j - [\nabla^2 L(w^j)]^{-1} \nabla L(w^j)$
6.     Set err $\leftarrow \|w^{j+1} - w^j\|$
7.     Set $j \leftarrow j + 1$
8. end while

If all is well and $\nabla^2 L(w^*)$ is positive definite, then convergence to $w^*$ is quadratic. In this case that means $\|w^{j+1} - w^*\| \le C \|w^j - w^*\|^2$ for some constant C > 0. If $\nabla^2 L(w^*)$ is not positive definite, then convergence can be very slow. This, along with the need to compute and invert the Hessian, are the major drawbacks of Newton's method for optimization.

6.6.3 Solving the L1 regularized problem

While gradient descent and Newton's method are available for maximum likelihood estimation, the L1 regularized problem requires special care, since it isn't smooth. One technique (of many) is to transform our MAP problem, which is unconstrained and nonsmooth in K unknowns,

$$w^* = \arg\min_w\; f(w) + \alpha \sum_{k=1}^K |w_k|, \qquad (6.10)$$

to the equivalent constrained, smooth problem in 2K unknowns,

$$(w^*, u^*) = \arg\min_{w,u}\; f(w) + \alpha \sum_{k=1}^K u_k, \quad \text{subject to} \quad -u_k \le w_k \le u_k, \quad k = 1, \dots, K. \qquad (6.11)$$

Of course we don't care about the "dummy" variables u, and they can be thrown away once the…
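A minimal sketch of algorithm 3 (our own illustration, not code from the text). The toy objective $L(w) = \|Aw - b\|^2$ and its closed-form gradient and Hessian are assumptions of the sketch; for a quadratic objective a single Newton step lands on the minimizer.

    # Newton's method for optimization on a smooth convex toy objective.
    import numpy as np

    A = np.array([[2.0, 0.5], [0.5, 1.0], [1.0, 1.0]])
    b = np.array([1.0, 2.0, 3.0])

    def grad(w):
        return 2 * A.T.dot(A.dot(w) - b)   # gradient of L

    def hess(w):
        return 2 * A.T.dot(A)              # Hessian of L (constant here)

    w = np.zeros(2)
    tol, err, j = 1e-10, np.inf, 0
    while err > tol:
        # Solve the linear system rather than inverting the Hessian.
        w_new = w - np.linalg.solve(hess(w), grad(w))
        err = np.linalg.norm(w_new - w)
        w, j = w_new, j + 1
    print(w)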
…the direction of greatest local increase. So, in our iterative search for $w^*$, we should move in the direction opposite the gradient at our current point (see algorithm 1 and figure 6.6.1).

Algorithm 1: Gradient descent
1. Initialize $w^0$ to some point and set j = 0
2. Choose $\gamma_0 > 0$ and a tolerance tol
3. while err > tol do
4.     Compute $\nabla L(w^j)$
5.     Set $w^{j+1} \leftarrow w^j - \gamma_j \nabla L(w^j)$
6.     Set err $\leftarrow \|w^{j+1} - w^j\|$
7.     Choose $\gamma_{j+1}$ according to some criteria
8.     Set $j \leftarrow j + 1$
9. end while

The parameter $\gamma_j$ in algorithm 1 can be chosen in a number of ways. To prevent overshooting, it is sometimes shrunk according to some criteria. Other times, we can choose it by minimizing the one-dimensional function $g(\gamma) = L(w^j - \gamma \nabla L(w^j))$. This search is a one-dimensional optimization, and can be done using e.g. Newton's method. Gradient descent has the advantage of being very simple, and only requiring computation of the gradient. It has the disadvantage that, although the negative gradient is locally the direction of biggest decrease in L, it often is not the best global direction. In some cases, a gradient descent method can zig-zag around, missing the optimal solution, and take very long to converge (see figure 6.6.1). Gradient descent should never be used for linear problems, since far superior methods exist there. Another disadvantage is that it does require the gradient, which does not exist for some problems, e.g. L1 regularized regression.
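A minimal sketch of algorithm 1 (our own illustration, not code from the text), with a fixed step size on the same toy objective $L(w) = \|Aw - b\|^2$; both are assumptions of the sketch.

    # Gradient descent with a constant step size gamma.
    import numpy as np

    A = np.array([[2.0, 0.5], [0.5, 1.0], [1.0, 1.0]])
    b = np.array([1.0, 2.0, 3.0])

    def grad(w):
        return 2 * A.T.dot(A.dot(w) - b)

    w = np.zeros(2)
    gamma = 0.05       # fixed here; shrinking it per-iteration is also common
    tol, err = 1e-8, np.inf
    while err > tol:
        w_new = w - gamma * grad(w)
        err = np.linalg.norm(w_new - w)
        w = w_new
    print(w)

Note the contrast with the Newton sketch above: many cheap steps instead of a few expensive ones, and no Hessian required.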
…in one part of the expression based on an earlier one. For example, suppose we have the following simple regex:

    (a|b)c+d

The machine will first check to see if there is an "a" or a "b" in the string, then whether it is followed by one or more "c"s, and then a "d". Now, by the time it makes it to, say, the "d", it doesn't remember why it made it there, how many "c"s it encountered, or that this was preceded by an "a". You can visualize the situation as a finite state machine diagram, with one state per position in the pattern and no memory of the path taken (diagram not reproduced here).

This is precisely the limitation which prevented us from distinguishing between valid dates, i.e. those with either all "-" or all "/" as separators, above. One solution is to extend the regular expression syntax to include backreferences, which is what the PCRE (Perl Compatible Regular Expressions) implementation, on which Python's re module was originally based, does. However, matching regular expressions with backreferences is an NP-hard problem.

[Figure 8.1: Regular Expression Implementation Time Comparison; matching time versus expression/text size n for backtracking implementations (pcregrep, Ruby, Python, Perl, awk, Tcl, GNU grep, GNU awk) against a Thompson NFA implementation. See the references for more details.]

8.2.4 Backreferences

To mitigate the limitation of standard regular expressions described above, …
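As a short illustration of backreferences (our own example, not from the text; the simplified pattern is an assumption of the sketch), a backreference \2 forces the second date separator to match the first, which is exactly what a plain regular expression cannot express:

    # \2 refers back to whatever group 2, ([-/]), actually matched.
    import re

    pattern = r'(19|20)\d\d([-/])(0[1-9]|1[012])\2(0[1-9]|[12]\d|3[01])'

    print(bool(re.match(pattern, '2000-01-15')))  # True:  separators agree
    print(bool(re.match(pattern, '2000/01/15')))  # True:  separators agree
    print(bool(re.match(pattern, '2000-01/15')))  # False: separators differ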
    from nltk.book import *
    text1.similar("whale")

returns

    ship boat sea time captain deck man pequod world other whales air crew
    head water line thing side way body

which are, not surprisingly, the words that appear in a similar context to "whale".

The frequency distribution of words is common to consider when looking at a given text. This can lead to some interesting statistics, which can be used for analysis, classification, or text comparison. The NLTK library provides an instance for such exploration. The following commands will call FreqDist, return the first 10 items (in decreasing count order), and return the count for "whale" as well as its frequency:

    freqDist = nltk.FreqDist(text1)
    freqDist.items()[:10]
    freqDist["whale"]
    freqDist.freq("whale")

There is an additional set of functionalities that come along with nltk.FreqDist, such as max, plot, and hapaxes (words that appear only once), and many others, which are helpful for exploration.

Aside from the additional methods that come along with it, FreqDist is really a sort-and-count operation, and as a useful exercise we challenge you to reproduce it with the Python groupby function from the itertools library (a sketch follows below).

In addition, you can do some fun things like generate random text based on a trigram language model. For example,

    from nltk.book import *
    text1.generate()

generates a random text…
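One possible sketch of the challenged exercise (our own code, not NLTK's): the core of FreqDist as a sort-and-count built on itertools.groupby.

    # groupby collapses runs of equal items, so sorting first groups all
    # occurrences of each word together.
    from itertools import groupby

    def freq_dist(tokens):
        counts = [(word, len(list(group)))
                  for word, group in groupby(sorted(tokens))]
        # Sort by decreasing count, like FreqDist.items()
        return sorted(counts, key=lambda pair: pair[1], reverse=True)

    print(freq_dist(["the", "whale", "the", "sea", "the"]))
    # [('the', 3), ('sea', 1), ('whale', 1)]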
…provide hints as to how they will guide our software development. Our programs will be small, so, hopefully, few compromises will have to be made.

Rule of Simplicity: This is sometimes expressed as K.I.S.S., or "Keep It Simple Stupid." All other philosophical points presented here can be seen as special cases of this. Complex programs are difficult to debug, implement, maintain, or extend. We will keep things simple by, for example, (i) writing CLI utilities that do one thing well, (ii) avoiding objects unless using them results in a simpler, more transparent design, and (iii) including in our modules only features that will be used right now.

Rule of Collaboration: We will make programs that work together by, for example, (i) writing CLI utilities that work as filters, and (ii) choosing common data structures, such as Numpy arrays and Pandas DataFrames. We will work together with people to make programs by, for example, (i) employing Git as a version control system, using GitHub to host our code, and (ii) enforcing code readability standards such as PEP8.

Rule of Modularity: Write simple parts connected by clean interfaces. Humans can hold only a limited amount of information in their head at one time. Make your functions small (simple) enough so that they can be explained in one sentence.

Rule of Clarity: Clarity is better than cleverness. Maintenance and debugging of code is very expensive. Take time to make sure yo…
…this uncertainty, so we model it as a random variable. This is reasonable under the following viewpoint. Suppose we select individuals from the population at large by some random sampling process. Then, for each fixed age x, the probability that $z \in A$ will be given by the fraction of the total population of age x whose hidden variables z lie in A. What is more difficult is to select the distribution of $\varepsilon$. Once again fixing x, we are left with many random effects on height (the effects of the many variables in z). If these effects are all small, then a central limit result allows us to adequately approximate $\varepsilon(x) \sim N(\mu(x), \sigma^2(x))$. A more likely scenario would be that the sex of the individual has a huge effect, and $\varepsilon(x)$ would exhibit (at least) two distinct modes. Suppose however that we fix sex at Female (also fixing ethnicity, parents' height, nutrition, and more...); then we will be left with a number of small effects that can likely be modeled as normal, and we again have $\varepsilon(x) \sim N(\mu(x), \sigma^2(x))$. The dependence of the noise on our covariate x is still impossible to remove; e.g. $\sigma(x)$ should be much larger for teenagers than for adults. So what to do? There are methods for dealing with non-normality and dependence of $\varepsilon$ on x (e.g. generalized linear models and models dealing with heteroscedasticity). What is most commonly done, at least as a first hack, is to add more explicitly modeled variables (e.g. nonlinear functions of x), and to segment the data, e.g. to build a separate model fo…
    # (Imports reconstructed; the top of this script, which defines
    # process_file, basepath, allfiles, n_procs, and chunksize, appears on an
    # earlier page.)
    import itertools
    import sys
    from functools import partial
    from multiprocessing import Pool
    from multiprocessing.pool import IMapIterator, IMapUnorderedIterator

    pos_type = 'NN'

    ## Construct a function of one variable by fixing all but the last
    ## argument: f(filename) = process_file(pos_type, basepath, filename)
    f = partial(process_file, pos_type, basepath)

    ## Construct an iterator that is equivalent to
    ## (f(filename) for filename in allfiles)
    ##
    ## If we are using 1 processor, just use the normal itertools.imap
    ## function.  Otherwise, use the worker_pool.
    if n_procs == 1:
        results_iter = itertools.imap(f, allfiles)
    else:
        worker_pool = Pool(n_procs)
        results_iter = worker_pool.imap_unordered(
            f, allfiles, chunksize=chunksize)

    for result in results_iter:
        sys.stdout.write(result + '\n')


    def imap_wrap(func):
        """
        Wrapper for Pool.imap_unordered that allows exit upon Ctrl-C.  This is
        a fix of the known Python bug bugs.python.org/issue8296 given by
        https://gist.github.com/aljungberg/626518
        """
        def wrap(self, timeout=None):
            return func(self, timeout=timeout if timeout is not None else 1e100)
        return wrap


    ## Redefine IMapUnorderedIterator so we can exit with Ctrl-C
    IMapUnorderedIterator.next = imap_wrap(IMapUnorderedIterator.next)
    IMapIterator.next = imap_wrap(IMapIterator.next)

It is instructive to run this script on a large collection of files while monitoring htop. We see that each core is able to spend most of its time with a full green bar. This indicates that they are being fully used. If chunksize is too large, then one core will end up with significantly more work to…
…maps $\sin(2\pi k x)$ to $\sigma_k \sin(2\pi k x)$. This is useful algebraically, but also intuitively, because nature tends to treat low and high frequencies differently (low frequency sounds travel further in water, for example). The same is true of all compact operators, matrices being one example. For that reason, we often refer to the "tail end" of the singular values, e.g. $u_{N-3}, u_{N-2}, u_{N-1}, u_N$, as higher frequencies.

Assuming one has an SVD of X (the question of computing it will be saved for later), we can solve the unregularized least squares problem. Start by using the fact that the $u_n$ form an orthonormal basis to write

$$Y = \sum_{n=1}^N (u_n, Y)\, u_n.$$

We now seek to find a solution w of the form

$$w = \sum_{k=1}^K \tilde w_k v_k.$$

The coefficients are written $\tilde w_k$ to emphasize that these are not the coefficients in the standard Euclidean basis. For simplicity, let's assume the common case that K < N. Inserting the expressions for Y and w into $\|Xw - Y\|^2$ yields

$$\|Xw - Y\|^2 = \left\| \sum_{k=1}^K \tilde w_k \sigma_k u_k - \sum_{n=1}^N (u_n, Y) u_n \right\|^2 = \sum_{k=1}^r (\sigma_k \tilde w_k - (u_k, Y))^2 + \sum_{n=r+1}^N (u_n, Y)^2.$$

The last equality follows since the $u_n$ are orthonormal (and $\sigma_k = 0$ for k > r).

Remark 5.11. We note a few things:

- If N > K, there is an exact solution to Xw = Y if and only if $(u_n, Y) = 0$ for n > r.
- Solutions to the unregularized least squares problem (5.7)…
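A small numerical check of the derivation (our own illustration, not code from the text): in the full-rank case the minimizer has $\tilde w_k = (u_k, Y)/\sigma_k$ in the singular basis, and rotating back to the standard basis agrees with numpy's least squares solver.

    # Least squares via the SVD, compared with np.linalg.lstsq.
    import numpy as np

    rng = np.random.RandomState(1)
    X = rng.randn(6, 3)          # N = 6 > K = 3, full rank w.p. 1
    Y = rng.randn(6)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    w_tilde = U.T.dot(Y) / s     # coefficients in the singular-vector basis
    w = Vt.T.dot(w_tilde)        # back to the standard basis

    w_lstsq, _, _, _ = np.linalg.lstsq(X, Y)
    print(np.allclose(w, w_lstsq))  # True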
…anyone who understands Unix will be able to understand how your data was generated.

If your script pipes together five programs, then all five can run at once. This is a simple way to parallelize things.

More complex scripts can be written that automate this process.

3.2 Template for a Python CLI Utility

Python can be written to work as a filter. To demonstrate, we write a program that deletes every nth line of a file:

    from optparse import OptionParser
    import sys


    def main():
        r"""
        DESCRIPTION
        -----------
        Deletes every nth line of a file or stdin, starting with the
        first line, and prints to stdout.

        EXAMPLES
        --------
        Delete every second line of a file
        $ python deleter.py -n 2 infile.csv
        """
        usage = "usage: %prog [options] dataset"
        usage += '\n' + main.__doc__
        parser = OptionParser(usage=usage)
        parser.add_option(
            '-n', '--deletion_rate',
            help='Delete every nth line [default: %default]',
            action='store', dest='deletion_rate', type='float', default=2)

        (options, args) = parser.parse_args()

        ### Parse args
        # Raise an exception if the length of args is greater than 1
        assert len(args) <= 1
        infilename = args[0] if args else None

        ## Get the infile
        if infilename:
            infile = open(infilename, 'r')
        else:
            infile = sys.stdin

        ## Call the function that does the real work
        delete(infile, sys.stdout, options.deletion_rate)

        ## Close the infile i…
…consume far less power, and allow more cores, each with their own cache, to be placed on a chip. The overall performance increases, but comes at the cost of added complexity: programmers need to design parallel algorithms. This sort of clock slowing became more or less inevitable when, for years, CPU clock speeds increased at a much higher rate than memory bus speeds.

One final note on memory. While the casual programmer generally thinks very little about main memory vs. cache vs. registers, he or she absolutely needs to be aware of disk versus main memory. If your computer cannot fit all data in main memory, it will use a portion of the disk, called swap space, as a kind of pseudo-memory. Since reading/writing to and from disk is exceptionally slow, this practice of swapping can significantly slow down all operations on your computer. This can crash servers. For this reason, it is your responsibility to keep track of memory usage. While all machines have built-in utilities that do this, the "best" one is htop.¹ In Ubuntu you…

¹ htop does not work on all versions of Mac. There are htop installers, but sometimes…

[Screenshot of htop output omitted: per-core load bars and per-process CPU/memory usage.]
…it is important to realize that these parallel tasks can be performed by both processes and threads. A process is a new "program" that is started. When you start the Firefox web browser, you are starting a new process. That process has access to its own space in memory. Python provides support for starting processes through the subprocess and multiprocessing modules (more on that later). Since these processes have their own memory space, any data used by them must be copied into that space. So, for example, if separate processes handled each of the two blocks in (10.1), then each process would need a copy of the data needed to perform that multiplication. For example, one process would need a copy of A11, A12, and v, and the other process would need a copy of A21, A22, and v. This can result in an unfortunate increase in total memory used. There are multiple ways to copy data between processes. In Python, the preferred method is to serialize it with the cPickle module (turn it into a byte stream) and send it using stdin and stdout. In contrast, multiple threads can share a memory space. This saves memory and eliminates communication overhead. Threads, however, are more restricted in what they can do. For example, if you have two threads running in a Python program, then only one is able to execute code at a time. This is due to the Global Interpreter Lock (GIL). This means that multi-threading in Python is very low performance.

A common library…
…teammate modifies <file>.
2. Your teammate pushes changes.
3. You modify <file>.
4. You pull with git pull.

Git will recognize that you have two versions of the same file that are in "conflict." Git will tell you which files are in conflict. You can open these files and see something like the following:

    <<<<<<< HEAD:filename
    <My work>
    =======
    <My teammate's work>
    >>>>>>> iss53:filename

The lines above the ======= are the version in the commit you most recently made. The lines below are those in your teammate's version. You can do a few things:

- Edit the file (by hand) to get it in the state you want it in
- Keep your version with git checkout --ours <filename>
- Keep their version with git checkout --theirs <filename>

Chapter 3

Building a Data Cleaning Pipeline with Python

One of the most useful things you can do with Python is to quickly build CLI utilities that look and feel like standard Unix tools. These utilities can be tied together, using pipes and a shell script, into a pipeline. These can be used for many purposes. We will concentrate on the task of data cleaning, or data preparation.

3.1 Simple Shell Scripts

A pipeline that sorts and cleans data could be put into a shell script that looks like:

    #!/bin/bash
    # Here is a comment
    SRC=./src
    DATA=./data

    cat $DATA/inputfile.csv \
    ...
…the (sample) mean height as a function of age. One more possibility is to segment your model into three groups: people between zero and eighteen, people between eighteen and sixty, and people older than sixty. Then three different models could be built. Which is best? If you are only using height, then using a mean curve would be better than segmenting, since, at best, the segments can only hope to achieve the mean. When combined with other variables, however, segmentation allows more room for the model to adjust, since e.g. the fit between eighteen and sixty won't affect the fit between zero and eighteen. Segmenting has the disadvantage that it requires splitting the data up into three smaller groups and keeping track of three models.

5.5 Error Metrics

To evaluate your model you need some metrics. In our case, linear regression produces a point estimate $\hat w$. From this it is easy to obtain a prediction $\hat Y = X \hat w$. These can be compared in a few different ways. Note that any method can be performed on the training set or a test (cross-validation) set.

The most popular metric seems to be the coefficient of determination, or $R^2$ (spoken as "R squared"). This is defined as

$$R^2 = 1 - \frac{\|X \hat w - Y\|^2}{\|Y - \bar Y\|^2}, \qquad (5.25)$$

where $\bar Y$ is the N x 1 vector with every entry equal to the mean of Y. $R^2$ enjoys some nice properties, which we leave as an exercise.

Exercise 5.25.1. Show that:

1. A perfect model ($X \hat w = Y$) will lead to $R^2 = 1$. …
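A small sketch of computing (5.25) (our own illustration, not code from the text; the synthetic data is an assumption of the sketch):

    # R^2 for a least squares fit, straight from the definition.
    import numpy as np

    rng = np.random.RandomState(2)
    X = np.column_stack([np.ones(40), rng.randn(40)])
    Y = X.dot([1.0, 2.0]) + 0.3 * rng.randn(40)

    w_hat, _, _, _ = np.linalg.lstsq(X, Y)
    Y_hat = X.dot(w_hat)

    r_squared = 1 - np.sum((Y_hat - Y) ** 2) / np.sum((Y - Y.mean()) ** 2)
    print(r_squared)  # close to 1 when the noise is small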
…the mainstream. Unix found a second life, not so much through better business decisions, but through the efforts of Richard Stallman and the GNU Project. The goal was to produce a Unix-like operating system that depended only on free software. Free in this case meant "users are free to run the software, share it, study it, and modify it." The GNU Project succeeded in creating a huge suite of utilities for use with an operating system (e.g. a C compiler), but was lacking the kernel (which handles communication between e.g. hardware and software, or among processes). It just so happened that Linus Torvalds had developed a kernel, the "Linux" kernel, in need of good utilities. Together, the Linux operating system was born.

1.2 The Shell

Modern Linux distributions, such as Ubuntu, come with a graphical user interface (GUI) every bit as slick as Windows or Mac OSX. Software is easy to install, and with at most a tiny bit of work all non-proprietary applications work fine. The real power of Unix is realized when you start using the shell.

Digression 1: Linux without tears

The easiest way to have access to the bash shell and a modern scientific computing environment is to buy hardware that is pre-loaded with Linux. This way, the hardware vendor takes responsibility for maintaining the proper drivers. Use caution when reading blogs talking about how "easy" it was to get some off-brand laptop working wit…
…continued to explore the language, extracting tags and features as in the text processing chapter, you would expect to increase the accuracy of your classifier.

With the intuition above we have practically arrived at the construction of a naive Bayes classifier, so let's write down the details.

Given a set of labels $L = \{l_1, \dots, l_m\}$ and a set of features F, we can calculate the probability that a given article has label $l_i$ with

$$P(l_i \mid F) \propto P(F \mid l_i)\, P(l_i).$$

This is, of course, just the maximum likelihood calculation coming from Bayes' theorem, but if we add the assumption that the features are conditionally independent, we arrive at our classifier, i.e.

$$P(l_i \mid F) \propto P(l_i) \prod_{f \in F} P(f \mid l_i),$$

with each $P(f \mid l_i) = \mathrm{count}(f, l_i)/\mathrm{count}(l_i)$ as above. The "naivety" comes from the fact that most features are in fact dependent, and sometimes highly so.

Hence, the classifier will assign the label l, where

$$l = \arg\max_{l_i} P(l_i \mid F),$$

to an item with a given feature set F (a code sketch is given below, after the objections).

An astute reader would immediately raise at least three objections to the above setup:

- Over-counting. Imagine that you have an extreme case, where two variables are not just dependent but are actually the same feature repeated. Then its contribution to the likelihood will be double what it should have been.

  As another example, suppose in addition to the three labels above you also consider individual sports. You have two features, one of which detects whet…
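Here is a minimal sketch of the classifier described above (our own illustration, not the text's code). It uses the count-based estimates $P(f \mid l) = \mathrm{count}(f, l)/\mathrm{count}(l)$ and a product-of-probabilities argmax; real implementations work with log probabilities and smoothing, both omitted here for clarity.

    from collections import defaultdict

    def train(labeled_docs):
        label_counts = defaultdict(int)
        feature_counts = defaultdict(int)
        for label, features in labeled_docs:
            label_counts[label] += 1
            for f in features:
                feature_counts[(f, label)] += 1
        return label_counts, feature_counts

    def classify(features, label_counts, feature_counts):
        total = float(sum(label_counts.values()))
        best_label, best_score = None, -1.0
        for label, count in label_counts.items():
            score = count / total  # P(l)
            for f in features:
                score *= feature_counts[(f, label)] / float(count)  # P(f|l)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    docs = [('sports', ['ball', 'team']), ('sports', ['ball', 'score']),
            ('politics', ['vote', 'team'])]
    counts = train(docs)
    print(classify(['ball', 'team'], *counts))  # 'sports'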
…type

    grep Professor people.txt

or even

    grep Pr people.txt

both of which return

    Ian Professor
    Daniel Professor
    Chang Professor

However, suppose you would like to do something slightly less trivial, such as finding all people in the class whose name starts with the letter "P". If you try something like

    grep P people.txt

this will return

    Ian Professor
    Daniel Professor
    Chang Professor
    Paul Student

i.e. every line with a capital "P" in it. You can use regular expressions to help out; to do so, on some systems you will need to invoke the -E flag:

    grep -E '^P' people.txt

which will return

    Paul Student

as desired.

For another, perhaps more natural, example: suppose you are a systems administrator and you need to find all login-related processes, as well as all processes corresponding to a certain user. You can type

    ps -e | grep -E 'login|daniel'

which will return the appropriate PID, time, and command executed.

Let's look at a more interesting example. Suppose you have a big file and you would like to extract all lines which constitute a valid date, where a valid date is of the yyyy-mm-dd format, between 1900-01-01 and 2099-12-31, with a choice of separators. Hence, we would accept character substrings 2000-01-15 and 2000/01/15, but not 2000-01/15. The regular expression that would do this for you is:

    (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])
…your program logic will be clear to someone reading your code some time in the future (this person might be you). Comments are important. Better yet, code can often be written to read like a story, and no comments are necessary:

    for row in reader:
        rowsum = sum_row(row)
        row.append(rowsum)
        writer.write(row)

Rule of Composition: Design programs to be connected to other programs. The Unix command line utilities are an example of this. They (typically) can read from a file or stdin, and write to stdout. Thus, multiple utilities can be tied together with pipes:

    cat infile.csv | cut -f1 | sort | uniq -c

Rule of Least Surprise: Try to do the least surprising thing. We will follow Unix or Python convention whenever possible. For example, our data files will be in common formats such as csv, xml, json, etc.

1.6 End Notes

Revolution OS is a fun movie about the rise of Linux.

[Ray04] gives a comprehensive exposition of the history and philosophy of Unix, and provides most of the material you see in our history and philosophy sections.

The quote by Kernighan and Pike can be found in "The Unix programming environment" [KP84].

Software Carpentry held a bootcamp for students in three courses at Columbia University in 2013 [Wilb].

The impact of the inventions to come out of Bell Labs cannot be overstated. Also developed there were radio astronomy, the transistor, the laser, the CCD, information theory, and the…
…ution is $w = \infty$ (if you can call this a solution, since no computer will ever reach this point), and the model will say that any negative point x will correspond to y = 0 with 100% certainty, and that any positive point x will correspond to y = 1 with 100% certainty.

Exercise 6.11.2. Consider a logistic regression problem with training data as above. Note that we will not use a constant in this model.

1. Show that for any $w \in \mathbb{R}$, $L(w + 1) < L(w)$, and that as $w \to \infty$, L(w) decreases to 0. This means that the maximum likelihood "solution" is $w_{ML} = \infty$.
2. Show that this could not happen if you used L1 or L2 regularization.
3. Draw the function $\sigma(x w)$ for w = 10000000 and $x \in [-3, 3]$.
4. What is a separating hyperplane for this problem?

When perfect separation occurs, the numerical solution cannot converge. A good solver will detect that $\|w\| \to \infty$ and will give you a warning. The question for the modeler is, "what to do next?" It is possible that you included too many variables, since if you have as many variables as you have data points (and all data points are unique), then you will always be able to find a separating hyperplane. In this case, it makes sense to remove variables or increase the number of data points.

The next issue is specific to Newton's method. Recall the expression for the Hessian (6.5) and the discussion following it. This showed that if there is linear dependency in the columns of X, then the Hessian w…
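Returning to the perfectly separated data set X = (-1, 1), Y = (0, 1), here is a small numerical illustration (our own sketch, not code from the text): each gradient step on the log likelihood increases w, and the iteration never converges.

    # With perfect separation the gradient is always positive, so w grows
    # without bound; w_ML = infinity.
    import numpy as np

    X = np.array([-1.0, 1.0])
    Y = np.array([0.0, 1.0])

    def sigma(t):
        return 1.0 / (1.0 + np.exp(-t))

    w = 0.0
    for step in range(10000):
        grad = np.sum((Y - sigma(w * X)) * X)  # d(log likelihood)/dw
        w += 0.5 * grad

    print(w)  # keeps growing (slowly) no matter how many steps you take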
…In Linux, type sudo apt-get install git. After installation, get an account at www.github.com. Then, in your home directory, create a file (or edit it if it already exists) called .gitconfig. It should have the lines:

    [user]
        name = Ian Langmore
        email = ianlangmore@gmail.com
    [credential]
        helper = cache --timeout=3600
    [alias]
        lol = log --graph --decorate --pretty=oneline --abbrev-commit
        lola = log --graph --decorate --pretty=oneline --abbrev-commit --all
    [color]
        branch = auto
        diff = auto
        interactive = auto
        status = auto

Now, when you're in a repository, you can see the project structure by typing git lola.

2.4 Online Materials

Lots of materials are available online. Here we list a few. Be advised that these tutorials are usually written for experienced developers who are migrating from other systems to Git. For example, in this class you will not have to use branches.

- http://git-scm.com/book has complete documentation with examples. I recommend reading section 1 before proceeding.
- http://osteele.com/posts/2008/05/commit-policies is a visualization of how to transport data over the multiple layers of Git.
- http://marklodato.github.com/visual-git-guide/index-en.html provides a more complete visual reference.
- http://learn.github.com has a number of video tutorials.
- http://www.kernel.org/pub/software/scm/git/docs/user-manual.html is a reference for commands.

2.5 Basic Git Concepts…
…flexibility and syntactical compactness, they form a language in their own right, which takes a bit of getting used to. However, with a little practice you become attuned to the internal logic and intelligent design.

Python incorporates the regular expressions toolbox in a standard library called re. You can find most of the information you will need at http://docs.python.org/2/library/re.html. The library allows for added functionality on top of returning search patterns, such as a boolean match function, replacement, etc.

8.2.1 Basic Concepts

The easiest way to understand regular expressions is to begin using them, so let's start with an example. We are going to take a simple string of characters and extract some information from it. Hopefully you will have a terminal open and can follow along.

    import re
    myString = "<I am going to show 2 or maybe 10 examples of using regular expressions>"
    re.findall(r'[a-zA-Z0-9]', myString)

returns a list of every alphanumeric character that appears in the string above, i.e. we get a list of 57 characters, from "I" to "s". As you can probably guess, the code inside the brackets simply asks for any character that is either a letter (any case) or a number. If we would like to include the period in this list, we can simply add it to the list of characters we are interested in:

    re.findall(r'[a-zA-Z0-9.]', myString)

If we are interested in words, numbers included, an…
…each worker must compute Av = w. This process repeats many times, and during each repetition the result must be shared between cores. This communication incurs some overhead, both in terms of programming time and CPU cycles.

[Figure 10.6: Left: Amdahl's law states that the fraction of code that is serial places limits on the speedup that parallelization can achieve. Right: A canonical five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). Cut and paste from Wikipedia.]

More important, at least to our intended audience, than communication overhead is the increase in complexity that comes with parallelizing a program. Because of this, and the fact that some algorithms cannot be parallelized, large portions of your code will likely remain serial, i.e. not parallel. If 50% of your code is serial, then even if you have 1000 cores you can only achieve (slightly less than) a 2x speedup. This fact is generalized in Amdahl's law (see figure 10.6).

It is importan…
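A quick sketch of Amdahl's law (our own illustration, not code from the text): the maximum speedup with n processors, when a fraction p of the program is parallelizable, is $1/((1 - p) + p/n)$.

    # Even with 1000 processors, the serial fraction dominates the speedup.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    for p in (0.5, 0.75, 0.9, 0.95):
        print("p=%.2f  speedup=%.1fx" % (p, amdahl_speedup(p, 1000)))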
….py.lprof. This prints the results to stdout, so it is often useful to pipe them to less.

    import numpy as np

    def pythondot(list1, list2):
        dotsum = 0
        for i in xrange(len(list1)):
            dotsum += list1[i] * list2[i]
        return dotsum

    def numpydot(arr1, arr2):
        return arr1.dot(arr2)

    @profile
    def testfuns(arrsize, numiter):
        mylist = [1] * arrsize
        myarray = np.ones(arrsize)
        for i in xrange(numiter):
            temp = pythondot(mylist, mylist)
            temp = numpydot(myarray, myarray)

    if __name__ == '__main__':
        testfuns(1000, 10)

Listing 1: Profiling various dot products with the line_profiler module (see also figure 10.8).

The profile results show the percentage of time spent in each function (this is usually the first place I look), the number of times each function was called, and the time spent in each function. The output in figure 10.8 shows that the pure Python for-loop dot product is much, much slower than a numpy dot product.

10.3.2 Standard Python rules of thumb

Here we go over some standard Python performance tricks and rules of thumb. These are "standard" in the sense that most developers know of them.

List comprehensions and map

As evidenced in figure 10.7, a list comprehension is often faster than an explicit for loop. The reason for this is that a list comprehension handles the append step in an optimized manner. Thus, if you have a for loop so…
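A small comparison in the spirit of figure 10.7 (our own illustration, not code from the text), using the standard timeit module:

    # Time an explicit for loop against the equivalent list comprehension.
    import timeit

    def square_loop():
        result = []
        for i in range(1000):
            result.append(i * i)
        return result

    def square_comprehension():
        return [i * i for i in range(1000)]

    print(timeit.timeit(square_loop, number=1000))
    print(timeit.timeit(square_comprehension, number=1000))  # typically faster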
…your index using git checkout <file>. You can replace your working copy with the copy in HEAD using

    git checkout HEAD <file>

2.6.3 Erasing changes

If you committed something you shouldn't have, and want to completely wipe out the commit, use git reset --hard HEAD^. This moves your branch back in history and wipes out the most recent commit.

To move the index and HEAD back one commit, use git reset HEAD^.

To move the index to a certain commit (designated by a SHA1 hash), use git reset <hash>.

If you then want to move changes into your working copy, use git checkout <filename>.

To move the contents of a particular commit into your working directory, use git checkout <hash> <filename>.

2.6.4 Remotes

To copy a remote repository, use one of the following:

    git clone <remote url>
    git clone <remote url> -b <branchname>
    git clone <remote url> -b <branchname> <destination>

To get changes from a remote repository and put them into your repo/index/working copy, use git pull. You will get an error if you have uncommitted changes in your index or working copy, so first save your changes with git add <filename>, then git commit -m "<message>".

To send changes to a remote repository, use

    git add <file>
    git commit -m "<message>"
    git push

2.6.5 Merge conflicts

A typical situation is as follows:

1. Your…
    