Rtips. Revival 2014!
Paul E. Johnson <pauljohn @ ku.edu>
March 24, 2014
The original Rtips started in 1999. It became difficult to update because of limitations in the
software with which it was created. Now I know more about R, and have decided to wade in
again. In January, 2012, I took the FaqManager HTML output and converted it to LaTeX with
the excellent open source program pandoc, and from there I've been editing and updating it in
LyX.
From here on out, the latest html version will be at http://pj.freefaculty.org/R/Rtips.html
and the PDF for the same will be http://pj.freefaculty.org/R/Rtips.pdf.
You are reading the New Thing!
The first chore is to cut out the old useless stuff that was no good to start with, correct
mistakes in translation (the quotation mark translations are particularly dangerous, but also
there is trouble with ~, $, and -).
Original Preface
(I thought it was cute to call this StatsRus but the Toystore's lawyer called and, well, you
know. . . )
If you need a tip sheet for R, here it is.
This is not a substitute for R documentation, just a list of things I had trouble remembering
when switching from SAS to R.
Heed the words of Brian D. Ripley, "One enquiry came to me yesterday which suggested that
some users of binary distributions do not know that R comes with two Guides in the doc/manual
directory plus an FAQ and the help pages in book form. I hope those are distributed with all
the binary distributions (they are not made nor installed by default). Windows-specific versions
are available." Please run help.start() in R!
Contents
1 Data Input/Output
1.1 Bring raw numbers into R (05/22/2012)
1.2 Basic notation on data access (12/02/2012)
1.3 Checkout the new Data Import/Export manual (13/08/2001)
1.4 Exchange data between R and other programs (Excel, etc) (01/21/2009)
1.5 Merge data frames (04/23/2004)
1.6 Add one row at a time (14/08/2000)
1.7 Need yet another different kind of merge for data frames (11/08/2000)
1.8 Check if an object is NULL (06/04/2001)
1.9 Generate random numbers (12/02/2012)
1.10 Generate random numbers with a fixed mean/variance (06/09/2000)
1.11 Use rep to manufacture a weighted data set (30/01/2001)
1.12 Convert contingency table to data frame (06/09/2000)
1.13 Write: data in text file (31/12/2001)
2 Working with data frames: Recoding, selecting, aggregating
2.1 Add variables to a data frame (or list) (02/06/2003)
2.2 Create variable names on the fly (10/04/2001)
2.3 Recode one column, output values into another column (12/05/2003)
2.4 Create indicator (dummy) variables (20/06/2001)
2.5 Create lagged values of variables for time series regression (05/22/2012)
2.6 How to drop factor levels for datasets that don't have observations with those values? (08/01/2002)
2.7 Select/subset observations out of a dataframe (08/02/2012)
2.8 Delete first observation for each element in a cluster of observations (11/08/2000)
2.9 Select a random sample of data (11/08/2000)
2.10 Selecting Variables for Models: Don't forget the subset function (15/08/2000)
2.11 Process all numeric variables, ignore character variables? (11/02/2012)
2.12 Sorting by more than one variable (06/09/2000)
2.13 Rank within subgroups defined by a factor (06/09/2000)
2.14 Work with missing values (na.omit, is.na, etc) (15/01/2012)
2.15 Aggregate values, one for each line (16/08/2000)
2.16 Create new data frame to hold aggregate values for each factor (11/08/2000)
2.17 Selectively sum columns in a data frame (15/01/2012)
2.18 Rip digits out of real numbers one at a time (11/08/2000)
2.19 Grab an item from each of several matrices in a List (14/08/2000)
2.20 Get vector showing values in a dataset (10/04/2001)
2.21 Calculate the value of a string representing an R command (13/08/2000)
2.22 Which can grab the index values of cases satisfying a test (06/04/2001)
2.23 Find unique lines in a matrix/data frame (31/12/2001)
3 Matrices and vector operations
3.1 Create a vector, append values (01/02/2012)
3.2 How to create an identity matrix? (16/08/2000)
3.3 Convert matrix m to one long vector (11/08/2000)
3.4 Creating a peculiar sequence (1 2 3 4 1 2 3 1 2 1) (11/08/2000)
3.5 Select every n'th item (14/08/2000)
3.6 Find index of a value nearest to 1.5 in a vector (11/08/2000)
3.7 Find index of nonzero items in vector (18/06/2001)
3.8 Find index of missing values (15/08/2000)
3.9 Find index of largest item in vector (16/08/2000)
3.10 Replace values in a matrix (22/11/2000)
3.11 Delete particular rows from matrix (06/04/2001)
3.12 Count number of items meeting a criterion (01/05/2005)
3.13 Compute partial correlation coefficients from correlation matrix (08/12/2000)
3.14 Create a multidimensional matrix (R array) (20/06/2001)
3.15 Combine a lot of matrices (20/06/2001)
3.16 Create neighbor matrices according to specific logics (20/06/2001)
3.17 Matching two columns of numbers by a key variable (20/06/2001)
3.18 Create Upper or Lower Triangular matrix (06/08/2012)
3.19 Calculate inverse of X (12/02/2012)
3.20 Interesting use of Matrix Indices (20/06/2001)
3.21 Eigenvalues example (20/06/2001)
4 Applying functions, tapply, etc
4.1 Return multiple values from a function (12/02/2012)
4.2 Grab p values out of a list of significance tests (22/08/2000)
4.3 ifelse usage (12/02/2012)
4.4 Apply to create matrix of probabilities, one for each cell (14/08/2000)
4.5 Outer. (15/08/2000)
4.6 Check if something is a formula/function (11/08/2000)
4.7 Optimize with a vector of variables (11/08/2000)
4.8 slice.index, like in S+ (14/08/2000)
5 Graphing
5.1 Adjust features with par before graphing (18/06/2001)
5.2 Save graph output (03/21/2014)
5.3 How to automatically name plot output into separate files (10/04/2001)
5.4 Control papersize (15/08/2000)
5.5 Integrating R graphs into documents: LaTeX and EPS or PDF (20/06/2001)
5.6 Snapshot graphs and scroll through them (31/12/2001)
5.7 Plot a density function (eg. Normal) (22/11/2000)
5.8 Plot with error bars (11/08/2000)
5.9 Histogram with density estimates (14/08/2000)
5.10 How can I overlay several line plots on top of one another? (09/29/2005)
5.11 Create matrix of graphs (18/06/2001)
5.12 Combine lines and bar plot? (07/12/2000)
5.13 Regression scatterplot: add fitted line to graph (03/20/2014)
5.14 Control the plotting character in scatterplots? (11/08/2000)
5.15 Scatterplot: Control Plotting Characters (men vs women, etc) (11/11/2002)
5.16 Scatterplot with size/color adjustment (12/11/2002)
5.17 Scatterplot: adjust size according to 3rd variable (06/04/2001)
5.18 Scatterplot: smooth a line connecting points (02/06/2003)
5.19 Regression Scatterplot: add estimate to plot (18/06/2001)
5.20 Axes: controls: ticks, no ticks, numbers, etc (22/11/2000)
5.21 Axes: rotate labels (06/04/2001)
5.22 Axes: Show formatted dates in axes (06/04/2001)
5.23 Axes: Reverse axis in plot (12/02/2012)
5.24 Axes: Label axes with dates (11/08/2000)
5.25 Axes: Superscript in axis labels (11/08/2000)
5.26 Axes: adjust positioning (31/12/2001)
5.27 Add error arrows to a scatterplot (30/01/2001)
5.28 Time Series: how to plot several lines in one graph? (06/09/2000)
5.29 Time series: plot fitted and actual data (11/08/2000)
5.30 Insert text into a plot (22/11/2000)
5.31 Plotting unbounded variables (07/12/2000)
5.32 Labels with dynamically generated content/math markup (16/08/2000)
5.33 Use math/sophisticated stuff in title of plot (11/11/2002)
5.34 How to color-code points in scatter to reveal missing values of 3rd variable? (15/08/2000)
5.35 lattice: misc examples (12/11/2002)
5.36 Make 3d scatterplots (11/08/2000)
5.37 3d contour with line style to reflect value (06/04/2001)
5.38 Animate a Graph! (13/08/2000)
5.39 Color user-portion of graph background differently from margin (06/09/2000)
5.40 Examples of graphing code that seem to work (misc) (11/16/2005)
6 Common Statistical Chores
6.1 Crosstabulation Tables (01/05/2005)
6.2 t-test (18/07/2001)
6.3 Test for Normality (31/12/2001)
6.4 Estimate parameters of distributions (12/02/2012)
6.5 Bootstrapping routines (14/08/2000)
6.6 BY subgroup analysis of data (summary or model for subgroups) (06/04/2001)
7 Model Fitting (Regression-type things)
7.1 Tips for specifying regression models (12/02/2002)
7.2 Summary Methods, grabbing results inside an output object
7.3 Calculate separate coefficients for each level of a factor (22/11/2000)
7.4 Compare fits of regression models (F test subset B's =0) (14/08/2000)
7.5 Get Predicted Values from a model with predict() (11/13/2005)
7.6 Polynomial regression (15/08/2000)
7.7 Calculate p value for an F stat from regression (13/08/2000)
7.8 Compare fits (F test) in stepwise regression/anova (11/08/2000)
7.9 Test significance of slope and intercept shifts (Chow test?)
7.10 Want to estimate a nonlinear model? (11/08/2000)
7.11 Quasi family and passing arguments to it. (12/11/2002)
7.12 Estimate a covariance matrix (22/11/2000)
7.13 Control number of significant digits in output (22/11/2000)
7.14 Multiple analysis of variance (06/09/2000)
7.15 Test for homogeneity of variance (heteroskedasticity) (12/02/2012)
7.16 Use nls to estimate a nonlinear model (14/08/2000)
7.17 Using nls and graphing things with it (22/11/2000)
7.18 -2Log(L) and hypo tests (22/11/2000)
7.19 logistic regression with repeated measurements (02/06/2003)
7.20 Logit (06/04/2001)
7.21 Random parameter (Mixed Model) tips (01/05/2005)
7.22 Time Series: basics (31/12/2001)
7.23 Time Series: misc examples (10/04/2001)
7.24 Categorical Data and Multivariate Models (04/25/2004)
7.25 Lowess. Plot a smooth curve (04/25/2004)
7.26 Hierarchical/Mixed linear models. (06/03/2003)
7.27 Robust Regression tools (07/12/2000)
7.28 Durbin-Watson test (10/04/2001)
7.29 Censored regression (04/25/2004)
8 Packages
8.1 What packages are installed on Paul's computer?
8.2 Install and load a package
8.3 List Loaded Packages
8.4 Where is the default R library folder? Where does R look for packages in a computer?
8.5 Detach libraries when no longer needed (10/04/2001)
9 Misc. web resources
9.1 Navigating R Documentation (12/02/2012)
9.2 R Task View Pages (12/02/2012)
9.3 Using help inside R (13/08/2001)
9.4 Run examples in R (10/04/2001)
10 R workspace
10.1 Writing, saving, running R code (31/12/2001)
10.2 .RData, .RHistory. Help or hassle? (31/12/2001)
10.3 Save Load R objects (31/12/2001)
10.4 Reminders for object analysis/usage (11/08/2000)
10.5 Remove objects by pattern in name (31/12/2001)
10.6 Save work/create a Diary of activity (31/12/2001)
10.7 Customized Rprofile (31/12/2001)
11 Interface with the operating system
11.1 Commands to system like change working directory (22/11/2000)
11.2 Get system time. (30/01/2001)
11.3 Check if a file exists (11/08/2000)
11.4 Find files by name or part of a name (regular expression matching) (14/08/2001)
12 Stupid R tricks: basics you can't live without
12.1 If you are asking for help (12/02/2012)
12.2 Commenting out things in R files (15/08/2000)
13 Misc R usages I find interesting
13.1 Character encoding (01/27/2009)
13.2 list names of variables used inside an expression (10/04/2001)
13.3 R environment in side scripts (10/04/2001)
13.4 Derivatives (10/04/2001)
1 Data Input/Output
1.1 Bring raw numbers into R (05/22/2012)
This is truly easy. Suppose you've got numbers in a space-separated file myData, with column
names in the first row (that's a header). Run
myDataFrame <- read.table("myData", header=TRUE)
If you type ?read.table it tells about importing files with other delimiters.
Suppose you have tab delimited data with blank spaces to indicate missing values. Do this:
myDataFrame <- read.table("myData", sep="\t", na.strings=" ", header=TRUE)
Be aware that anybody can choose his/her own separator. I am fond of | because it seems
never used in names or addresses (unlike just about any other character I've found).
Suppose your data is in a compressed gzip file. Use the gzfile function to decompress
on the fly. Do this:
myDataFrame <- read.table(gzfile("myData.gz"), header=TRUE)
If you read columns in from separate files, combine them as:
variable1 <- scan("file1")
variable2 <- scan("file2")
mydata <- cbind(variable1, variable2)
# cbind returns a matrix; to get a data frame use instead:
# mydata <- data.frame(variable1, variable2)
# Optionally save the data in a text file with:
write.table(mydata, file="filename3")
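The steps above can be tried without any existing data. This is a minimal sketch: the file contents, the column names, and the use of an empty field for a missing value are all invented for illustration.

```r
# Hypothetical data: header row, "|" as separator, empty field = missing
tmp <- tempfile(fileext = ".txt")
writeLines(c("name|age|income",
             "Smith, John|34|51000",
             "Doe, Jane||62000"), tmp)

myDataFrame <- read.table(tmp, sep = "|", header = TRUE,
                          na.strings = "", stringsAsFactors = FALSE)
str(myDataFrame)

# The same content, compressed, read through a gzfile connection
gz <- paste0(tmp, ".gz")
con <- gzfile(gz, "w")
writeLines(readLines(tmp), con)
close(con)
myDataGz <- read.table(gzfile(gz), sep = "|", header = TRUE,
                       na.strings = "", stringsAsFactors = FALSE)
identical(myDataFrame, myDataGz)
```

Note that the comma inside "Smith, John" causes no trouble here, which is the point of choosing | as the separator.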
1.2 Basic notation on data access (12/02/2012)
To access the columns of a data frame x with the column number, say x[,1], to get the first
column. If you know the column name, say pjVar1, it is the same as x$pjVar1 or x[, "pjVar1"].
Grab an element in a list as x[[1]]. If you just run x[1] you get a list, in which there is a single
item. Maybe you want that, but I bet you really want x[[1]]. If the list elements are named,
you can get them with x$pjVar1 or x[["pjVar1"]].
For instance, if a data frame is nebdata then grab the value in row 1, column 2 with:
nebdata[1, 2]
## or to selectively take from column 2 only when the column Volt equals "ABal"
nebdata[nebdata$Volt == "ABal", 2]
(from Diego Kuonen)
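Here is a small sketch of the notation above. The contents of nebdata are made up (the real data set is not shown in this document), and the list x is likewise hypothetical.

```r
# Hypothetical data frame standing in for nebdata
nebdata <- data.frame(Volt = c("ABal", "BBal", "ABal"),
                      reading = c(1.5, 2.5, 3.5),
                      stringsAsFactors = FALSE)

nebdata[, 1]                         # first column, by position
nebdata$reading                      # a column, by name
nebdata[, "reading"]                 # the same column
nebdata[1, 2]                        # row 1, column 2
nebdata[nebdata$Volt == "ABal", 2]   # column 2 where Volt equals "ABal"

# Lists: [ ] returns a list, [[ ]] returns the element itself
x <- list(pjVar1 = 1:3, pjVar2 = "a")
class(x[1])                          # "list"
class(x[[1]])                        # "integer"
identical(x$pjVar1, x[["pjVar1"]])
```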
1.3 Checkout the new Data Import/Export manual (13/08/2001)
With R-1.2, the R team released a manual about how to get data into R and out of R. That's
the first place to look if you need help. It is now distributed with R. Run
help.start()
1.4 Exchange data between R and other programs (Excel, etc) (01/21/2009)
Patience is the key. If you understand the format in which your data is currently held, chances
are good you can translate it into R with more or less ease.
Most commonly, people seem to want to import Microsoft Excel spreadsheets. I believe there
is an ODBC approach to this, but I think it is only for MS Windows.
In the gdata package, which was formerly part of gregmisc, there is a function that can use Perl
to import an Excel spreadsheet. If you install gdata, the function to use is called read.xls. You
have to specify which sheet you want. That works well enough if your data is set in a numeric
format inside Excel. If it is set with the GENERAL type, I've seen the imported numbers turn
up as all asterisks.
Other packages have appeared to offer Excel import ability, such as xlsReadWrite.
In the R-help list, I also see reference to an Excel program addon called RExcel that can
install an option in the Excel menus called "Put R Dataframe". The home for that project is
http://sunsite.univie.ac.at/rcom/
I usually proceed like this.
Step 1. Use Excel to edit the sheet so that it is RECTANGULAR. It should have variable
names in row 1, and it has numbers where desired and the string NA otherwise. It must have
NO embedded formulae or other Excel magic. Be sure that the columns are declared with the
proper Excel format. Numerical information should have a numerical type, while text should
have text as the type. Avoid General.
Step 2. First, try the easy route. Try gdata's read.xls method. As long as you tell it which
sheet you want to read out of your data set, and you add header=T and whatever other options
you'd like in an ordinary read.table usage, then it has worked well for us.
Step 3. Suppose step 2 did not work. Then you have something irregular in your Excel sheet
and you should proceed as follows. Either open Excel and clean up your sheet and try step 2
again, or take the more drastic step: Export the spreadsheet to text format. File/Save As,
navigate to csv or txt and choose any options like advanced format or text configurable.
Choose the delimiter you want. One option is the tab key. When I'm having trouble, I use
the bar symbol, |, because there's little chance it is in use inside a column of data. If your
columns have addresses or such, usage of the COMMA as a delimiter is very dangerous.
After you save that file in text mode, open it in a powerful flat text editor like Emacs and
look through to make sure that 1) your variable names in the first row do not have spaces or
quotation marks or special characters, and 2) look to the bottom to make sure your spreadsheet
program did not insert any crud. If so, delete it. If you see commas in numeric variables, just
use the Edit/Search/replace option to delete them.
Then read in your tab separated data with
read.table("filename", header=T, sep="\t")
I have tested the foreign package by importing an SPSS file and it worked great. I've had
great results importing Stata data sets too.
Here's a big caution for you, however. If your SPSS or Stata numeric variables have some value
labels, say 98=No Answer and 99=Missing, then R will think that the variable is a factor, and
it will convert it into a factor with 99 possible values. The foreign library commands for reading
spss and dta files have options to stop R from trying to help with factors and I'd urge you to
read the manual and use them.
If you use read.spss, for example, setting use.value.labels=F will stop R from creating factors
at all. If you don't want to go that far, there's an option max.value.labels that you can set to
5 or 10, and stop it from seeing 98=Missing and then creating a factor with 98 values. It will
only convert variables that have fewer than 5 or 10 values. If you use read.dta (for Stata), you
can use the option convert.factors=F.
Also, if you are using read.table, you may have trouble if your numeric variables have any
illegal values, such as letters. Then R will assume you really intend them to be factors and it
will sometimes be tough to fix. If you add the option as.is=T, it will stop that cleanup effort
by R.
At one time, the SPSS import support in foreign did not work for me, and so I worked out a
routine of copying the SPSS data into a text file, just as described for Excel.
I have a notoriously difficult time with SAS XPORT files and don't even try anymore. I've seen
several email posts by Frank Harrell in r-help and he has some pretty strong words about it. I
do have one working example of importing the Annenberg National Election Study into R from
SAS and you can review that at http://pj.freefaculty.org/DataSets/ANES/2002. I wrote a long
boring explanation. Honestly, I think the best thing to do is to find a bridge between SAS and
R, say use some program to convert the SAS into Excel, and go from there. Or just write the
SAS data set to a file and then read into R with read.table() or such.
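The read.table caution about illegal values can be seen in a small sketch. The file below is invented: a tab-separated export in which a "numeric" column contains the stray code DK.

```r
# Hypothetical tab-separated export with a letter code in a numeric column
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tscore", "1\t10", "2\tDK", "3\t30"), tmp)

# as.is = TRUE keeps the polluted column as character instead of a factor
dat <- read.table(tmp, header = TRUE, sep = "\t", as.is = TRUE)
class(dat$score)

# Convert explicitly; entries that are not numbers become NA (with a warning)
dat$score <- suppressWarnings(as.numeric(dat$score))
dat
```

In R 4.0.0 and later, read.table no longer converts strings to factors by default, so as.is=TRUE matters mostly on older installations. The explicit as.numeric step is useful either way.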
1.5 Merge data frames (04/23/2004)
Update: Merge is confusing! But if you study this, you will see everything in perfect clarity:
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
x4 <- rnorm(100)
ds1 <- data.frame(city=rep(1,100), x1=x1, x2=x2)
ds2 <- data.frame(city=rep(2,100), x1=x1, x3=x3, x4=x4)
merge(ds1, ds2, by.x=c("city","x1"), by.y=c("city","x1"), all=TRUE)
The trick is to make sure R understands which are the common variables in the two datasets
so it lines them up, and then all=T is needed to say that you don't want to throw away the
rows that are only in one set or the other. Read the help page over and over, you eventually
get it.
More examples:
experiment <- data.frame(times = c(0,0,10,10,20,20,30,30), expval = c(1,1,2,2,3,3,4,4))
simul <- data.frame(times = c(0,10,20,30), simul = c(3,4,5,6))
I want a merged dataset like:
  times expval simul
1     0      1     3
2     0      1     3
3    10      2     4
4    10      2     4
5    20      3     5
6    20      3     5
7    30      4     6
8    30      4     6
Suggestions
merge(experiment, simul)
(from Brian D. Ripley)
does all the work for you.
Or consider:
exp.sim <- data.frame(experiment, simul=simul$simul[match(experiment$times, simul$times)])
(from Jim Lemon)
I have dataframes like this:
state count1 percent1
CA    19     0.34
TX    22     0.35
FL    11     0.24
OR    34     0.42
GA    52     0.62
MN    12     0.17
NC    19     0.34
state count2 percent2
FL    22     0.35
MN    22     0.35
CA    11     0.24
TX    52     0.62
And I want
state count1 percent1 count2 percent2
CA    19     0.34     11     0.24
TX    22     0.35     52     0.62
FL    11     0.24     22     0.35
OR    34     0.42     0      0
GA    52     0.62     0      0
MN    12     0.17     22     0.35
NC    19     0.34     0      0
(from YuLing Wu)
In response, Ben Bolker said
state1 <- c("CA","TX","FL","OR","GA","MN","NC")
count1 <- c(19,22,11,34,52,12,19)
percent1 <- c(0.34,0.35,0.24,0.42,0.62,0.17,0.34)
state2 <- c("FL","MN","CA","TX")
count2 <- c(22,22,11,52)
percent2 <- c(0.35,0.35,0.24,0.62)
data1 <- data.frame(state1, count1, percent1)
data2 <- data.frame(state2, count2, percent2)
datac <- data1
# unmatched states are left as NA: an index of 0 would be dropped when
# subsetting, which shortens the vector and misaligns the later rows
m <- match(data1$state1, data2$state2)
datac$count2 <- ifelse(is.na(m), 0, data2$count2[m])
datac$percent2 <- ifelse(is.na(m), 0, data2$percent2[m])
If you didn't want to keep all the rows in both data sets (but just the shared rows) you could
use
merge(data1, data2, by=1)
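The same "keep every row of the first table, fill the gaps with 0" result can also be reached with merge itself. This is only a sketch of one way to do it, not the approach from the original thread. It assumes the key column is given the same name, state, in both data frames.

```r
data1 <- data.frame(state = c("CA", "TX", "FL", "OR", "GA", "MN", "NC"),
                    count1 = c(19, 22, 11, 34, 52, 12, 19),
                    percent1 = c(0.34, 0.35, 0.24, 0.42, 0.62, 0.17, 0.34))
data2 <- data.frame(state = c("FL", "MN", "CA", "TX"),
                    count2 = c(22, 22, 11, 52),
                    percent2 = c(0.35, 0.35, 0.24, 0.62))

# all.x = TRUE keeps every row of data1; states missing from data2 get NA
datac <- merge(data1, data2, by = "state", all.x = TRUE)

# Replace the NAs from unmatched states with 0, as in the desired table
datac$count2[is.na(datac$count2)] <- 0
datac$percent2[is.na(datac$percent2)] <- 0
datac
```

merge sorts the result by the key, so the rows come back in alphabetical order of state rather than in the original order of data1.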
1.6 Add one row at a time (14/08/2000)
Question: I would like to create an (empty) data frame with "headings" for every column (column
titles) and then put data row-by-row into this data frame (one row for every computation I will
be doing), i.e.
no. time temp pressure   <-- the headings
1   0    100  80         <-- first result
2   10   110  87         <-- 2nd result .....
Answer: Depends if the cols are all numeric: if they are a matrix would be better. But if you
insist on a data frame, here goes:
If you know the number of results in advance, say, N, do this
df <- data.frame(time=numeric(N), temp=numeric(N), pressure=numeric(N))
df[1, ] <- c(0, 100, 80)
df[2, ] <- c(10, 110, 87)
...
or
m <- matrix(nrow=N, ncol=3)
colnames(m) <- c("time", "temp", "pressure")
m[1, ] <- c(0, 100, 80)
m[2, ] <- c(10, 110, 87)
The matrix form is better since it only needs to access one vector (a matrix is a vector with
attributes), not three.
If you don't know the final size you can use rbind to add a row at a time, but that is substantially
less efficient as lots of re-allocation is needed. It's better to guess the size, fill in and then rbind
on a lot more rows if the guess was too small. (from Brian Ripley)
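A short sketch of both situations follows. The "computation" inside each loop is made up. For the unknown-size case, collecting rows in a list and binding once at the end is a commonly used alternative, and it is not the guess-and-extend strategy described above.

```r
N <- 5   # number of results, assumed known in advance

# Pre-allocated matrix: fill one row per computation
m <- matrix(NA_real_, nrow = N, ncol = 3,
            dimnames = list(NULL, c("time", "temp", "pressure")))
for (i in seq_len(N)) {
  # made-up computation producing one row of results
  m[i, ] <- c((i - 1) * 10, 100 + (i - 1) * 10, 80 + (i - 1) * 7)
}
df <- as.data.frame(m)

# Unknown size: collect rows in a list, then bind them once at the end
rows <- list()
for (i in 1:3) {
  rows[[i]] <- data.frame(time = (i - 1) * 10, temp = 100 + (i - 1) * 10)
}
df2 <- do.call(rbind, rows)
```

Binding once avoids re-allocating the whole data frame on every iteration, which is the cost the answer above warns about.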
1.7 Need yet another different kind of merge for data frames (11/08/2000)
Convert these two
File1
C A T
File2
1 2 34 56
2 3 45 67
3 4 56 78
(from Stephen Arthur)
Into a new data frame that looks like:
C A T 1 2 34 56
C A T 2 3 45 67
C A T 3 4 56 78
This works:
repcbind <- function(x, y) {
  nx <- nrow(x)
  ny <- nrow(y)
  if (nx < ny)
    x <- apply(x, 2, rep, length=ny)
  else if (ny < nx)
    y <- apply(y, 2, rep, length=nx)
  cbind(x, y)
}
(from Ben Bolker)
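To see the function at work, the sketch below repeats its definition (spelling the rep argument out as length.out) and applies it to two small data frames that mimic File1 and File2. The column names V1 to V3 and n1 to n4 are invented.

```r
repcbind <- function(x, y) {
  nx <- nrow(x)
  ny <- nrow(y)
  if (nx < ny) {
    x <- apply(x, 2, rep, length.out = ny)   # recycle rows of the shorter one
  } else if (ny < nx) {
    y <- apply(y, 2, rep, length.out = nx)
  }
  cbind(x, y)
}

file1 <- data.frame(V1 = "C", V2 = "A", V3 = "T")
file2 <- data.frame(n1 = 1:3, n2 = 2:4, n3 = c(34, 45, 56), n4 = c(56, 67, 78))
combined <- repcbind(file1, file2)
combined
```

Because apply works on a matrix, the recycled side is coerced to a single type. That is harmless here, where File1 is all text, but worth checking if the shorter data frame mixes numbers and strings.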
1.8 Check if an object is NULL (06/04/2001)
NULL does not mean that something does not exist. It means that it exists, and it is nothing.
X <- NULL
This may be a way of clearing values assigned to X, or initializing a variable as "nothing".
Programs can check on whether X is null
if (is.null(X)) {
  # then...
}
If you load things, R does not warn you when they are not found, it records them as NULL.
You have the responsibility of checking them. Use
is.null(list$component)
to check a thing named component in a thing named list.
Accessing non-existent dataframe columns with [ does give an error, so you could do that
instead.
> data(trees)
> trees$aardvark
NULL
> trees[, "aardvark"]
Error in "[.data.frame"(trees, , "aardvark") : subscript out of bounds
(from Thomas Lumley)
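The checks above can be put together in a few lines. This is a sketch only: the fallback string and the never-assigned name are placeholders, and the exact wording of the error message differs between R versions.

```r
X <- NULL
is.null(X)                     # TRUE: X exists and holds nothing

data(trees)                    # built-in dataset
is.null(trees$aardvark)        # TRUE: $ returns NULL for a missing column

# [ with a missing column name raises an error, which tryCatch can intercept
result <- tryCatch(trees[, "aardvark"],
                   error = function(e) "no such column")
result

# exists() answers a different question: is the name defined at all?
exists("X")
exists("someNameNeverAssigned")
```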
1.9 Generate random numbers (12/02/2012)
You want randomly drawn integers? Use sample, like so:
# If you mean sampling without replacement:
sample(1:10, 3, replace=FALSE)
# If you mean with replacement:
sample(1:10, 3, replace=TRUE)
(from Bill Simpson)
Included with R are many univariate distributions, for example the Gaussian normal, Gamma,
Binomial, Poisson, and so forth. Run
?runif
?rnorm
?rgamma
?rpois
You will see a distribution's functions are a base name like "norm" with prefix letters r, d,
p, q.
- rnorm: draw pseudo random numbers from a normal
- dnorm: the density value for a given value of a variable
- pnorm: the cumulative probability for a given value
- qnorm: the quantile function: given a probability, what is the corresponding value of the
  variable?
I made a long-ish lecture about this in my R workshop
(http://pj.freefaculty.org/guides/Rcourse/rRandomVariables)
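The four prefixes can be compared side by side for the normal distribution. In this sketch the seed and the particular values passed in are arbitrary choices.

```r
set.seed(1234)                        # arbitrary seed, only for reproducibility

draws <- rnorm(5, mean = 0, sd = 1)   # r: five pseudo-random draws
dens  <- dnorm(0)                     # d: density at 0, i.e. 1/sqrt(2*pi)
prob  <- pnorm(1.96)                  # p: P(X <= 1.96), about 0.975
quant <- qnorm(0.975)                 # q: value with 97.5% of the mass below it

# p and q are inverses of one another
all.equal(qnorm(pnorm(1.5)), 1.5)
```

The same pattern holds for the other families, for example runif, dunif, punif and qunif.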
Multivariate distributions are not (yet) in the base of R, but they are in several packages, such
as MASS and mvtnorm. Note, when you use these, it is necessary to specify a mean vector and
a covariance matrix among the variables. Brian Ripley gave this example: with mvrnorm in
package MASS (part of the VR bundle),
mvrnorm(2, c(0,0), matrix(c(0.25, 0.20, 0.20, 0.25), 2, 2))
If you don't want to use a contributed package to draw multivariate observations, you can
approximate some using the univariate distributions in R itself. Peter Dalgaard observed a less
general solution for this particular case would be
rnorm(1, sd=sqrt(0.20)) + rnorm(2, sd=sqrt(0.05))
1.10 Generate random numbers with a

xed mean/variance (06/09/2000)
If you generate random numbers with a given tool, you don't get a sample with the exact mean
you specify. A generator with a mean of 0 will create samples with varying means, right?
I don't know why anybody wants a sample with a mean that is exactly 0, but you can draw a
sample and then transform it to force the mean however you like. Take a 2 step approach:
> x <- rnorm(100, mean = 5, sd = 2)
> x <- (x - mean(x)) / sqrt(var(x))
> mean(x)
[1] 1.385177e-16
> var(x)
[1] 1
and now create your sample with mean 5 and sd 2:
> x <- x*2 + 5
> mean(x)
[1] 5
> var(x)
[1] 4
(from Torsten.Hothorn)
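The two steps can be wrapped in a small function. This is only a sketch; force_moments and its argument names are hypothetical, not something from the original post.

```r
# Hypothetical helper: rescale x so that its sample mean and sample sd
# are exactly target_mean and target_sd (up to floating point error)
force_moments <- function(x, target_mean = 0, target_sd = 1) {
    z <- (x - mean(x)) / sd(x)    # step 1: standardize the sample
    z * target_sd + target_mean   # step 2: stretch and shift
}

set.seed(42)  # arbitrary seed
x <- force_moments(rnorm(100), target_mean = 5, target_sd = 2)
mean(x)
sd(x)
```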
1.11 Use rep to manufacture a weighted data set (30/01/2001)
x <- c(10, 40, 50, 100)  # income vector for instance
w <- c(2, 1, 3, 2)  # the weight for each observation in x (same length as x)
rep(x, w)
[1]  10  10  40  50  50  50 100 100
(from P. Malewski)
That expands a single variable, but we can expand all of the columns in a dataset one at a time
to represent weighted data.
Thomas Lumley provided an example: Most of the functions that have weights have frequency
weights rather than probability weights: that is, setting a weight equal to 2 has exactly the
same effect as replicating the observation.
expanded.data <- as.data.frame(lapply(compressed.data,
    function(x) rep(x, compressed.data$weights)))
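Here is a small runnable version of that idea. The data frame contents (income, region, weights) are made up for illustration; only the lapply/rep pattern comes from the example above.

```r
# Made-up compressed data: one row per distinct case, with a frequency weight
compressed.data <- data.frame(income  = c(10, 40, 50, 100),
                              region  = c("a", "b", "a", "b"),
                              weights = c(2, 1, 3, 2))

# Repeat every column according to the weights
expanded.data <- as.data.frame(lapply(compressed.data,
    function(x) rep(x, compressed.data$weights)))

nrow(expanded.data)   # sum of the weights
expanded.data$income
```

The weights column itself is expanded too, so you may want to drop it afterwards.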
1.12 Convert contingency table to data frame (06/09/2000)
Given an 8-dimensional crosstab, you want a data frame with 8 factors and 1 column for frequencies
of the cells in the table.
R 1.2 introduced a function as.data.frame.table() to handle this.
This can also be done manually. Here's a function (it's a simple wrapper around expand.grid):
dfify <- function(arr, value.name = "value", dn.names = names(dimnames(arr))) {
    # Version $Id: dfify.sfun,v 1.1 1995/10/09 16:06:12 d3a061 Exp $
    dn <- dimnames(arr <- as.array(arr))
    if (is.null(dn))
        stop("Can't dataframeify an array without dimnames")
    names(dn) <- dn.names
    ans <- cbind(expand.grid(dn), as.vector(arr))
    names(ans)[ncol(ans)] <- value.name
    ans
}
The name is short for data-frame-i-fy.
For your example, assuming your multi-way array has proper dimnames, you'd just do:
my.data.frame <- dfify(my.array, value.name = "frequency")
(from Todd Taylor)
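For the built-in route, a minimal sketch follows. It uses UCBAdmissions, a 3-way table that ships with R, as a stand-in for the 8-way crosstab; the object names tab, df, and df2 are illustrative.

```r
tab <- UCBAdmissions         # built-in 2 x 2 x 6 contingency table
df  <- as.data.frame(tab)    # dispatches to as.data.frame.table()
head(df)

# The frequency column is called Freq by default; responseName changes that
df2 <- as.data.frame(tab, responseName = "frequency")
names(df2)
```

Each dimension of the table becomes a factor column, and there is one row per cell.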
1.13 Write: data in text file (31/12/2001)
Say I have a command that produced a 28 x 28 data matrix. How can I output the matrix into
a txt file (rather than copy/paste the matrix)?
write.table(mat, file = "filename.txt")
Note that the MASS package has a function write.matrix which is faster if you need to write a numerical
matrix, not a data frame. Good for big jobs.
2 Working with data frames: Recoding, selecting, aggregating
2.1 Add variables to a data frame (or list) (02/06/2003)
If dat is a data frame, the column x1 can be added to dat in (at least) 4 ways: dat$x1,
dat[ , "x1"], dat["x1"], or dat[["x1"]]. Observe
dat <- data.frame(a = c(1, 2, 3))
dat[ , "x1"] <- c(12, 23, 44)
dat["x2"] <- c(12, 23, 44)
dat[["x3"]] <- c(12, 23, 44)
dat
  a x1 x2 x3
1 1 12 12 12
2 2 23 23 23
3 3 44 44 44
There are many other ways, including cbind().
Often I plan to calculate variable names within a program, as well as the values of the variables.
I think of this as generating new column names on the fly. In r-help, I asked: "I keep finding
myself in a situation where I want to calculate a variable name and then use it on the left hand
side of an assignment." To me, this was a difficult problem.
Brian Ripley pointed toward one way to add the variable to a data frame:
iteration <- 9
newname <- paste("run", iteration, sep = "")
mydf[newname] <- aColumn
## or, in one step:
mydf[paste("run", iteration, sep = "")] <- aColumn
## for a list, same idea works, use double brackets
myList[[paste("run", iteration, sep = "")]] <- aColumn
And Thomas Lumley added: If you wanted to do something of this sort for which the above
didn't work you could also learn about substitute():
eval(substitute(myList$newColumn <- aColumn,
    list(newColumn = as.name(varName))))
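A runnable sketch of the bracket approach inside a loop is below. The data frame mydf, its id column, and the values stored in each new column are invented for illustration.

```r
mydf <- data.frame(id = 1:4)   # made-up starting data frame

for (iteration in 1:3) {
    newname <- paste("run", iteration, sep = "")   # "run1", "run2", "run3"
    mydf[[newname]] <- iteration * mydf$id         # any computed column would do
}

names(mydf)
mydf
```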
2.2 Create variable names on the fly (10/04/2001)
The previous section showed how to add a column to a data frame on the fly. What if you just want
to calculate a name for a variable that is not in a data frame? The assign function can do that.
Try this to create an object (a variable) named zzz equal to 32.
assign("zzz", 32)
zzz
[1] 32
In that case, I specified zzz, but we can use a function to create the variable name. Suppose you
want a random variable name. Every time you run this, you get a new variable starting with
a.
assign(paste("a", round(rnorm(1, 50, 12), 2), sep = ""), 324)
I got a44.05:
a44.05
[1] 324
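A sketch of the same idea in a loop, with get() to read a variable back by its calculated name. The names result1, result2, result3 and the list results are hypothetical. The second half shows a named list, which is often easier to manage than a pile of generated variables.

```r
# Create result1, result2, result3 with assign()
for (i in 1:3) {
    assign(paste("result", i, sep = ""), i^2)
}
result2
get("result3")   # fetch an object when its name is held in a string

# Alternative: keep the generated names inside one list
results <- list()
for (i in 1:3) {
    results[[paste("result", i, sep = "")]] <- i^2
}
results$result2
```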
2.3 Recode one column, output values into another column (12/05/2003)
Please read the documentation for transform() and replace() and also learn how to use the
magic of R vectors.
The transform() function works only for data frames. Suppose a data frame is called mdf and
you want to add a new variable newV that is a function of var1 and var2:
mdf <- transform(mdf, newV = log(var1) + var2)
I'm inclined to take the easy road when I can. Proper use of indexes in R will help a lot,
especially for recoding discrete valued variables. Some cases are particularly simple because of
the way arrays are processed.
Suppose you create a variable, and then want to reset some values to missing. Go like this:
x <- rnorm(10000)
x[x > 1.5] <- NA
And if you don't want to replace the original variable, create a new one first (xnew <- x) and
then do that same thing to xnew.
You can put other variables inside the brackets, so if you want x to equal 234 if y equals 1, then
x[y == 1] <- 234
Suppose you have v1, and you want to add another variable v2 so that there is a translation.
If v1=1, you want v2=4. If v1=2, you want v2=4. If v1=3, you want v2=5. This reminds me
of the old days using SPSS and SAS. I think it is clearest to do:
v1 <- c(1, 2, 3)
# now initialize v2
v2 <- rep(-9, length(v1))
# now recode v2
v2[v1 == 1] <- 4
v2[v1 == 2] <- 4
v2[v1 == 3] <- 5
v2
[1] 4 4 5
Note that R's ifelse command can work too:
x <- ifelse(x > 1.5, NA, x)
One user recently asked how to take data like a vector of names and convert it to numbers, and
2 good solutions appeared:
y <- c("OLDa", "ALL", "OLDc", "OLDa", "OLDb", "NEW", "OLDb", "OLDa", "ALL", ...)
el <- c("OLDa", "OLDb", "OLDc", "NEW", "ALL")
match(y, el)
[1]  1  5  3  1  2  4  2  1  5 NA
or
f <- factor(x, levels = c("OLDa", "OLDb", "OLDc", "NEW", "ALL"))
as.integer(f)
[1] 1 5 3 1 2 4 2 1 5
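Here is a self-contained sketch comparing the two. The tenth element "other" is an invented value standing in for whatever was elided in the original post; anything not listed in el comes back as NA under either approach.

```r
y  <- c("OLDa", "ALL", "OLDc", "OLDa", "OLDb", "NEW", "OLDb", "OLDa", "ALL",
        "other")   # "other" is a made-up value that is not in el
el <- c("OLDa", "OLDb", "OLDc", "NEW", "ALL")

codes_match  <- match(y, el)                        # position of each name in el
codes_factor <- as.integer(factor(y, levels = el))  # integer codes of the factor

codes_match
identical(codes_match, codes_factor)
```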
I asked Mark Myatt for more examples:
For example, suppose I get a bunch of variables coded on a scale
1 = no, 6 = yes, 8 = tied, 9 = missing, 10 = not applicable.
Recode that into a new variable name with 0=no, 1=yes, and all else NA.
It seems like the replace() function would do it for single values, but you end up with empty
levels in factors. That can be fixed by re-factoring the variable. Here is a basic recode()
function:
recode <- function(var, old, new) {
    x <- replace(var, var == old, new)
    if (is.factor(x)) factor(x)
    else x
}
For the above example:
test <- c(1, 1, 2, 1, 1, 8, 1, 2, 1, 10, 1, 8, 2, 1, 9, 1, 2, 9, 10, 1)
test
test <- recode(test, 1, 0)
test <- recode(test, 2, 1)
test <- recode(test, 8, NA)
test <- recode(test, 9, NA)
test <- recode(test, 10, NA)
test
Although it is probably easier to use replace():
test <- c(1, 1, 2, 1, 1, 8, 1, 2, 1, 10, 1, 8, 2, 1, 9, 1, 2, 9, 10, 1)
test
test <- replace(test, test == 8 | test == 9 | test == 10, NA)
test <- replace(test, test == 1, 0)
test <- replace(test, test == 2, 1)
test
I suppose a better function would take from and to lists as arguments:
recode <- function(var, from, to) {
    x <- as.vector(var)
    for (i in 1:length(from)) {
        x <- replace(x, x == from[i], to[i])
    }
    if (is.factor(var)) factor(x)
    else x
}
For the example:
test <- c(1, 1, 2, 1, 1, 8, 1, 2, 1, 10, 1, 8, 2, 1, 9, 1, 2, 9, 10, 1)
test
test <- recode(test, c(1, 2, 8:10), c(0, 1))
test
and it still works with single values.
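A vectorized variant built on match() is sketched below. The name recode_lookup is hypothetical and this is not Mark's function; it assumes var is a plain numeric or character vector rather than a factor. Because each value is looked up once against from, swapping two codes cannot collide the way a sequence of replacements can.

```r
# Hypothetical lookup-based recode for plain (non-factor) vectors
recode_lookup <- function(var, from, to) {
    idx <- match(var, from)            # position of each value in 'from', NA if absent
    ifelse(is.na(idx), var, to[idx])   # values not in 'from' pass through unchanged
}

test <- c(1, 1, 2, 1, 1, 8, 1, 2, 1, 10, 1, 8, 2, 1, 9, 1, 2, 9, 10, 1)
recoded <- recode_lookup(test, c(1, 2, 8:10), c(0, 1))  # 8, 9, 10 have no 'to', so NA
recoded

# Swapping 1 and 2 in a single call
swapped <- recode_lookup(c(1, 2, 1), c(1, 2), c(2, 1))
swapped
```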
Suppose somebody gives me a scale from 1 to 100, and I want to collapse it into 10 groups. How
do I go about it?
Mark says: Use cut() for this. This cuts into 10 groups:
test <- trunc(runif(1000, 1, 100))
groups <- cut(test, seq(0, 100, 10))
table(test, groups)
To get ten groups without knowing the minimum and maximum value you can use pretty():
groups <- cut(test, pretty(test, 10))
You can specify the cut-points:
groups <- cut(test, c(0, 20, 40, 60, 80, 100))
And they don't need to be even groups:
groups <- cut(test, c(0, 30, 50, 75, 100))
Mark added, I think I will add this sort of thing to the REX pack.
2003-12-01, someone asked how to convert a vector of numbers to characters, such as
if x[i] < 250 then col[i] = "red"
else if x[i] < 500 then col[i] = "blue"
and so forth. Many interesting answers appeared in R-help. A big long nested ifelse would work,
as in:
x.col <- ifelse(x < 250, "red",
    ifelse(x < 500, "blue", ifelse(x < 750, "green", "black")))
There were some nice suggestions to use cut, such as Gabor Grothendieck's advice:
The following results in a character vector:
colours <- c("red", "blue", "green", "black")
colours[cut(x, c(-Inf, 250, 500, 700, Inf), right = FALSE, labels = FALSE)]
While this generates a factor variable:
colours <- c("red", "blue", "green", "black")
cut(x, c(-Inf, 250, 500, 700, Inf), right = FALSE, labels = colours)
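A small check of the cut() approach on invented values of x, chosen so that one of them sits exactly on a break point. With right = FALSE the intervals are closed on the left, so 250 falls in the second group. The findInterval() line is an alternative I am adding for comparison, not one quoted from the thread.

```r
x <- c(100, 250, 600, 900)   # made-up values, one on a break point
colours <- c("red", "blue", "green", "black")
breaks <- c(-Inf, 250, 500, 700, Inf)

asChar   <- colours[cut(x, breaks, right = FALSE, labels = FALSE)]
asFactor <- cut(x, breaks, right = FALSE, labels = colours)

# Same grouping from findInterval(), using only the interior break points
viaInterval <- colours[findInterval(x, c(250, 500, 700)) + 1]

asChar
asFactor
```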
2.4 Create indicator (dummy) variables (20/06/2001)
2 examples:
Suppose x is a column and you want dummy variables, one for each valid value. First, make it a
factor, then use model.matrix():
x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix(~ xf - 1)
  xf2 xf3 xf4 xf5 xf6
1   1   0   0   0   0
2   1   0   0   0   0
3   0   0   0   1   0
4   0   1   0   0   0
5   0   0   0   0   1
6   0   0   0   1   0
attr(,"assign")
[1] 1 1 1 1 1
(from Peter Dalgaard)
Question: I have a variable with 5 categories and I want to create dummy variables for each
category.
Answer: Use row indexing or model.matrix.
ff <- factor(sample(letters[1:5], 25, replace = TRUE))
diag(nlevels(ff))[ff, ]
# or
model.matrix(~ ff - 1)
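If the goal is to carry the indicators along in a data frame, a sketch like the following may help. The data frame dat and its six values are invented; the column names ffa through ffe are what model.matrix builds from the factor name and its levels.

```r
dat <- data.frame(ff = factor(c("a", "b", "c", "a", "e", "d")))  # made-up data

dummies <- model.matrix(~ ff - 1, data = dat)   # one 0/1 column per level
byDiag  <- diag(nlevels(dat$ff))[dat$ff, ]      # same values via row indexing

dat2 <- cbind(dat, dummies)   # original column plus the indicators
dat2
```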
2.5 Create lagged values of variables for time series regression (05/22/2012)
Peter Dalgaard explained, the simple way is to create a new variable which shifts the response,
i.e.
yshft <- c(y[-1], NA)  # pad with missing
summary(lm(yshft ~ x + y))
Alternatively, lag the regressors:
N <- length(x)
xlag <- c(NA, x[1:(N-1)])
ylag <- c(NA, y[1:(N-1)])
summary(lm(y ~ xlag + ylag))
Dave Armstrong (personal communication, 2012/5/21) brought to my attention the following
problem in cross sectional time series data. Simply inserting an NA will lead to disaster
because we need to insert a lag within each unit. There is also a bad problem when the time
points observed for the sub units are not all the same. He suggests the following
dat <- data.frame(
    ccode = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
    year = c(1980, 1982, 1983, 1984, 1981:1984, 1980:1982, 1984),
    x = seq(2, 24, by = 2))
dat$obs <- 1:nrow(dat)
dat$lagobs <- match(paste(dat$ccode, dat$year - 1, sep = "."),
    paste(dat$ccode, dat$year, sep = "."))
dat$lagx <- dat$x[dat$lagobs]
Run this example, be surprised, then email me if you find a better way. I haven't. This seems
like a difficult problem to me and if I had to do it very often, I am pretty sure I'd have to
re-think what it means to lag when there are years missing in the data. Perhaps this is a rare
occasion where interpolation might be called for.
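The match() trick can be wrapped in a function so it is easy to reuse. This is only a sketch of the same technique; the name lag_within and its arguments are hypothetical. A lagged value is NA whenever the previous year is not observed for that unit, which is what happens for ccode 1 in 1982 and ccode 3 in 1984 below.

```r
# Hypothetical helper: one-year lag of x within each unit
lag_within <- function(x, unit, year) {
    key_now  <- paste(unit, year, sep = ".")       # identifies each row
    key_prev <- paste(unit, year - 1, sep = ".")   # the row we want to look up
    x[match(key_prev, key_now)]                    # NA when the prior year is absent
}

dat <- data.frame(
    ccode = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
    year = c(1980, 1982, 1983, 1984, 1981:1984, 1980:1982, 1984),
    x = seq(2, 24, by = 2))
dat$lagx <- lag_within(dat$x, dat$ccode, dat$year)
dat
```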
2.6 How to drop factor levels for datasets that don't have observations with
those values? (08/01/2002)
The best way to drop levels, BTW, is
problem.factor <- problem.factor[, drop = TRUE]
That has the same effect as running the pre-existing problem.factor through the function
factor:
problem.factor <- factor(problem.factor)
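A quick demonstration on an invented factor. The droplevels() call is an addition of mine for comparison; it is in base R (since 2.12.0) and does the same job for a factor or for every factor in a data frame.

```r
f <- factor(c("a", "b", "c", "a"))   # made-up factor with three levels
sub <- f[f != "c"]                   # the level "c" is now unused

levels(sub)                     # still lists "c"
levels(sub[, drop = TRUE])      # unused level removed
levels(factor(sub))             # same result
levels(droplevels(sub))         # same result
```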
2.7 Select/subset observations out of a dataframe (08/02/2012)
If you just want particular named or numbered rows or columns, of course, that's easy. Take
columns x1, x2, and x3.
datSubset1 <- dat[ , c("x1", "x2", "x3")]
If those happen to be columns 44, 92, and 93 in a data frame,
datSubset1 <- dat[ , c(44, 92, 93)]
Usually, we want observations that are conditional.
Suppose you want the observations for which variable Y is greater than A and less than or equal to B:
X[Y > A & Y <= B]
Suppose you want observations with c=1 in df1. This makes a new data frame.
df2 <- df1[df1$c == 1, ]
and note that indexing is pretty central to using S (the language), so it is worth learning all the
ways to use it. (from Brian Ripley)
Or use match to select values from one column of d by taking the rows where another column
matches given values, as in
d <- t(array(1:20, dim = c(2, 10)))
i <- c(13, 5, 19)
d[match(i, d[, 1]), 2]
[1] 14  6 20
(from Peter Wolf)
Till Baumgaertel wanted to select observations for men over age 40, and sex was coded either
m or M. Here are two working commands:
# 1.)
maleOver40 <- subset(d, sex %in% c("m", "M") & age > 40)
# 2.)
maleOver40 <- d[(d$sex == "m" | d$sex == "M") & d$age > 40, ]
To decipher that, do ?match and ?"%in%" to find out about the %in% operator.
If you want to grab the rows for which the variable subject is 15 or 19, try:
df1$subject %in% c(19, 15)
to get a True/False indication for each row in the data frame, and you can then use that output
to pick the rows you want:
indicator <- df1$subject %in% c(19, 15)
df1[indicator, ]
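One difference between subset() and bracket indexing shows up when the condition evaluates to NA. The sketch below uses an invented data frame with one missing age: subset() drops that row, while plain indexing returns a row of NAs for it unless the condition is wrapped in which().

```r
d <- data.frame(sex = c("m", "F", "M", "f", "m"),
                age = c(45, 50, 38, 61, NA))   # made-up data, one missing age

bySubset <- subset(d, sex %in% c("m", "M") & age > 40)
byIndex  <- d[(d$sex == "m" | d$sex == "M") & d$age > 40, ]
byWhich  <- d[which((d$sex == "m" | d$sex == "M") & d$age > 40), ]

nrow(bySubset)   # the row with missing age is dropped
nrow(byIndex)    # includes an all-NA row for the missing age
nrow(byWhich)    # which() discards the NA, matching subset()
```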
How to deal with values that are already marked as missing? If you want to omit all rows for
which one or more column is NA (missing):
x2 <- na.omit(x)
produces a copy of the data frame x with all rows that contain missing data removed. The function
na.exclude could be used also. For more information on missings, check help: ?na.exclude.
For exclusion of missing, Peter Dalgaard likes
subset(x, complete.cases(x)) or x[complete.cases(x), ]
In addition, is.na(x) is preferable to x != NA, because any comparison with NA returns NA.
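A short illustration with an invented data frame, showing that na.omit() and complete.cases() keep the same rows and why a comparison against NA is not useful as a test.

```r
x <- data.frame(a = c(1, NA, 3),
                b = c("u", "v", NA))   # made-up data with two incomplete rows

x2 <- na.omit(x)
x3 <- x[complete.cases(x), ]
x2
x3

is.na(x$a)   # FALSE TRUE FALSE, a usable logical vector
x$a != NA    # NA NA NA, useless for selecting rows
```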
2.8 Delete first observation for each element in a cluster of observations (11/08/2000)
Given data like:
1 ABK 19910711 11.1867461 0.0000000 108
2 ABK 19910712 11.5298979 11.1867461 111
6 CSCO 19910102 0.1553819 0.0000000 106
7 CSCO 19910103 0.1527778 0.1458333 166
remove the first observation for each value of the sym variable (the one coded ABK, CSCO,
etc). If you just need to remove rows 1, 6, and 13, do:
newhilodata <- hilodata[-c(1, 6, 13), ]
To solve the more general problem of omitting the first in each group, assuming sym is a
factor and the rows are sorted by sym, try something like
newhilodata <- subset(hilodata, diff(c(0, as.integer(sym))) == 0)
(actually, the as.integer is unnecessary because the c() will unclass the factor automagically)
(from Peter Dalgaard)
Alternatively, you could use the match function because it returns the first match. Suppose jm
is the data set. Then:
match(unique(jm$sym), jm$sym)
[1] 1 6 13
jm <- jm[-match(unique(jm$sym), jm$sym), ]
(from Douglas Bates)
As Robert pointed out to me privately: duplicated() does the trick
subset(hilodata, duplicated(sym))
has got to be the simplest variant.
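The three approaches can be compared on a toy version of the data. The data frame below is invented (only the sym codes echo the example), and it is sorted by sym, which the diff() approach needs.

```r
hilodata <- data.frame(
    sym   = factor(c("ABK", "ABK", "ABK", "CSCO", "CSCO")),
    price = c(11.2, 11.5, 11.7, 0.155, 0.153))   # made-up prices

viaDuplicated <- subset(hilodata, duplicated(sym))
viaMatch      <- hilodata[-match(unique(hilodata$sym), hilodata$sym), ]
viaDiff       <- subset(hilodata, diff(c(0, as.integer(sym))) == 0)

# Each keeps rows 2, 3, and 5: everything except the first row of each sym
rownames(viaDuplicated)
rownames(viaMatch)
rownames(viaDiff)
```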
2.9 Select a random sample of data (11/08/2000)
sample(N, n, replace = FALSE)
and
seq(N)[rank(runif(N)) <= n]
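Both expressions return row numbers, so to sample rows of a data frame you index with them. A minimal sketch, with an invented data frame dat and an arbitrary seed:

```r
set.seed(2014)   # arbitrary seed, only for reproducibility
dat <- data.frame(id = 1:20, y = rnorm(20))   # made-up data
n <- 5

rowsA <- sample(nrow(dat), n, replace = FALSE)
sampA <- dat[rowsA, ]

# The rank/runif version: the rows holding the n smallest uniform draws
rowsB <- seq(nrow(dat))[rank(runif(nrow(dat))) <= n]
sampB <- dat[rowsB, ]

sampA
sampB
```

The second version returns the chosen rows in their original order, while sample() returns them in random order.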