Vignette overview


WASH Benefits Bangladesh primary outcome analysis using the washb R package functions.

This vignette applies package functions to the WASH Benefits Bangladesh trial data to calculate several unadjusted and adjusted estimates of differences in diarrheal disease and length-for-age z-scores between treatment arms. A subgroup analysis comparing target children and siblings is also presented.


List and description of the functions documented in this vignette:

  • washb_mean() Means estimated with robust standard errors for the WASH Benefits trials. Calculate means for a variable along with robust sandwich SEs and 95% confidence intervals that account for clustering within id.
  • washb_glm() Function to run glm models with robust standard errors for unadjusted and adjusted estimates. Includes conversions of coefficients to relative risks and risk differences, calculates confidence intervals, and prescreens possible adjustment variables using washb_prescreen. Can be used for subgroup analysis. As the unadjusted glm estimates are asymptotically equivalent to the washb_mh() or washb_ttest() results (for binomial and gaussian models, respectively), washb_glm() will probably be the most used and important function in this vignette.
  • washb_ttest() Wrapper function to call the paired t-test for two different arms of the study.
  • washb_mh() Estimates the Mantel-Haenszel prevalence ratio. Note: strata with no outcomes (i.e., missing PR) are dropped. This is consistent with a fixed-effects regression analysis, in which those strata would not contribute to the estimates. The arguments Y, tr, and strat below need to be from the same dataset.
  • washb_permute() Conducts a permutation test of the independence of Y and tr, conditional on randomization block, using the Wilcoxon rank-sum test statistic.
  • Internal package functions (only used internally by the main package functions):
  • washb_prescreen() Function to pre-screen adjustment covariates: restricts to those with a likelihood ratio (LR) test P < 0.2 (or a user-set p-value). Called internally in the washb_glm function, or can be called on its own.
  • sandwichSE() Function to calculate the Huber Sandwich Estimator for robust standard errors.
  • washb_lincom() Function to get Estimates, SEs, CIs, and P-values for Pr>|Z| from a linear combination of regression coefficients.


R resources

For new users of R, the following sources are helpful learning resources:
1. Calendar of UC Berkeley D-Lab workshops (Look for R bootcamps)
2. Cookbook for R
3. Datacamp: R for SAS, SPSS, and Stata users (This course is for those familiar with statistical programming who only need to learn R-specific syntax. Great GUI learning interface; the introductory course is free, and academic discounts are available for paid courses.)
4. Try R codeschool (An interactive introductory course)
5. UCLA Institute For Digital Research and Education: R resources (Great source for topic-specific tutorials)
6. R-tutor introduction
7. Johns Hopkins Coursera course in R programming




Package Loading

Install package from Github

The WASH Benefits R package is easiest to access by installing it directly from Ben Arnold’s washb Github page. To download the R package from the R console, the devtools package is needed first. It can be installed into the R library with the code install.packages("devtools"). This only needs to be done once per computer. Each time R is started, the package needs to be loaded into the R environment with library(devtools). Once the devtools package is attached, the washb package can be installed with the code install_github("ben-arnold/washb").
The package will be updated periodically. When a new package version is available, an email will be sent out to the team, and the code install_github("ben-arnold/washb") will need to be run again to install the new version.
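
Putting those steps together, the full install-and-update workflow described above looks like this:

install.packages("devtools")        # once per computer
library(devtools)                   # each time R is started
install_github("ben-arnold/washb")  # install (or re-install to update) the washb package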


Load in required packages:

The washb package relies on several external functions found in a few different packages. These packages need to be installed on each computer once, and loaded into the R workspace each time R is opened. If you are installing the packages for the first time, use the following code:

install.packages("sandwich")
install.packages("lmtest")
install.packages("coin")
install.packages("plyr")
install.packages("metafor")

Once packages are installed for the first time, they will be automatically loaded along with the washb package when the following code is run:

library(washb)

The library() command can be used to load any installed package. It is not needed for the required packages listed above, but it is needed for suggested packages. For example, to fit a negative binomial glm model, the "MASS" package is required; it can be installed once with install.packages("MASS") and loaded each time R is opened with library("MASS").
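
In code, those two steps look like:

install.packages("MASS")  # once per computer
library("MASS")           # each time R is opened, before fitting a negative binomial model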


Package documentation

Use ??washb to see the full list of package documentation, or use ?(function name) to see the help documentation for any specific function. For example, ?washb_glm returns the help page for the washb_glm function.




Diarrheal disease outcome data cleaning

The following code cleans and merges the enrollment and diarrhea data (not included with the package, but used here for teaching purposes):

# drop svydate and month because they are superseded in the child-level diarrhea data
  washb_bd_enrol$svydate <- NULL
  washb_bd_enrol$month <- NULL

# merge the baseline dataset to the follow-up dataset
ad <- merge(washb_bd_enrol,washb_bd_diar,by=c("dataid","clusterid","block","tr"),all.x=F,all.y=T)

# subset to the relevant measurement
# Year 1 or Year 2
ad <- subset(ad,svy==1|svy==2)

#subset the diarrhea data to children <36 mos at enrollment
### (exclude new births that are not target children)
ad <- subset(ad,sibnewbirth==0)
ad <- subset(ad,gt36mos==0)

# Exclude children with missing data
ad <- subset(ad,!is.na(ad$diar7d))

#Re-order the tr factor for convenience
ad$tr <- factor(ad$tr,levels=c("Control","Water","Sanitation","Handwashing","WSH","Nutrition","Nutrition + WSH"))

#Ensure that month is coded as a factor
ad$month <- factor(ad$month)

#Sort the data for perfect replication when using V-fold cross-validation
ad <- ad[order(ad$block,ad$clusterid,ad$dataid,ad$childid),]


LAZ outcome data cleaning

Load and merge the anthropometry and enrollment datasets, and clean them similarly to the diarrheal disease analysis shown above.

data(washb_bd_anthro)
data(washb_bd_enrol)
  washb_bd_enrol$svydate <- NULL
  washb_bd_enrol$month <- NULL
laz <- merge(washb_bd_enrol,washb_bd_anthro,by=c("dataid","clusterid","block","tr"),all.x=F,all.y=T)

# subset to the endline target children
laz <- subset(laz,svy==2)
laz <- subset(laz,tchild=="Target child")

# Drop children with extreme LAZ values
laz <- subset(laz,laz_x!=1)

laz$tr <- factor(laz$tr,levels=c("Control","Water","Sanitation","Handwashing","WSH","Nutrition","Nutrition + WSH"))
laz$month <- factor(laz$month)
laz <- laz[order(laz$block,laz$clusterid,laz$dataid,laz$childid),]

Now that the dataset is cleaned and merged, the washb package functions can be applied and the results will match the WASH Benefits Bangladesh primary analysis. (With the exception that the primary outcome analysis used TMLE, targeted maximum likelihood estimation, rather than GLM models in estimating the adjusted treatment effects.)




Function: washb_mean

This function is most useful for calculating variable means and confidence intervals – for example, calculating average compliance (uptake) within a given intervention arm, or calculating the average laz by arm or measurement round. In the WASH Benefits trials, the independent unit is typically the cluster, so the id argument should identify the cluster ID.


Arguments:
Y= Outcome variable.
id= ID variable for independent units (in WASH Benefits: cluster ID).
print= Logical. If TRUE (default), the function will print the results.


Usage:

washb_mean(Y,id,print=TRUE)


If you wish to compare means between groups using a difference, prevalence ratio, or incidence ratio (depending on the outcome), use washb_ttest() (for a t-test of unadjusted continuous outcomes), washb_mh() (for a Mantel-Haenszel test of unadjusted binary outcomes), or washb_glm() for unadjusted or adjusted generalized linear models. As the unadjusted glm estimates are asymptotically equivalent to the Mantel-Haenszel or t-test results (for binomial and gaussian models, respectively), washb_glm() will probably be the most used and important function in this vignette.


Using the WASH Benefits enrollment survey data and the washb_mean function, the means and 95% confidence intervals of several maternal characteristics are calculated below:

MomAge<-washb_mean(Y=washb_bd_enrol$momage,id=washb_bd_enrol$clusterid,print=TRUE)

-----------------------------------------
Dropping 20 observations
due to missing values in the outcome
 Final sample size: 5531 
-----------------------------------------
        N     Mean      SD Robust SE Lower 95%CI Upper 95%CI
[1,] 5531 23.76749 5.24887 0.0756264    23.61926    23.91572
MomEduY<-washb_mean(Y=washb_bd_enrol$momeduy,id=washb_bd_enrol$clusterid,print=TRUE)
        N     Mean       SD  Robust SE Lower 95%CI Upper 95%CI
[1,] 5551 5.815889 3.417545 0.05928979    5.699681    5.932097

The <- command assigns the output of the washb_mean() function to the new MomAge and MomEduY objects. Notice that there are 20 missing observations in the momage variable, but none in momeduy.
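
For example, a minimal sketch of calculating mean LAZ by treatment arm, with cluster-robust confidence intervals (this uses the cleaned laz data frame from above; arms and laz.by.arm are hypothetical object names):

# loop over treatment arms and apply washb_mean within each arm
arms <- levels(laz$tr)
laz.by.arm <- lapply(arms, function(a) {
  washb_mean(Y=laz$laz[laz$tr==a], id=laz$clusterid[laz$tr==a], print=FALSE)
})
# stack the per-arm results into one table, labeled by arm
laz.by.arm <- do.call(rbind, laz.by.arm)
rownames(laz.by.arm) <- arms
round(laz.by.arm, 3)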




Function: washb_glm

Generalized linear model function for the WASH Benefits study.

 

Arguments:
Y= Binary, count, or continuous outcome
tr= binary treatment group variable, comparison group first
pair=stratification variable (In WASH Benefits: the pair-matched block variable)
W= Optional data frame that includes adjustment covariates (for adjusted estimates). W will be prescreened and only those associated with the outcome will be adjusted for. See washb_prescreen for more details.
forcedW= Optional vector of variable names to force as adjustment covariates (no screening)
V= Optional variable name for subgroup analyses, which is interacted with tr. Continuous variables can be used and an interaction term will be fit, but the calculated point estimates and confidence intervals for the treatment effect between subgroups will only be returned if V is a factor.
id= id of cluster used in calculating robust standard errors.
contrast= character vector of the format c("contrast1", "contrast2") of the two factor levels within the treatment variable ("tr") to compare. "contrast1" is the reference group and "contrast2" is the active comparison.
family= GLM model family (gaussian, binomial, poisson, or negative binomial). Use binomial(link='log') to return prevalence ratios instead of odds ratios when the outcome is binary. Use "neg.binom" for a negative binomial model.
pval= p-value level to use in prescreening the set of potential adjustment variables W. Any variable in W with a likelihood ratio test p-value below this threshold will be included in the final adjustment set. Defaults to 0.2.
print= logical (TRUE or FALSE, defaults to TRUE) for whether to print output or just store it in the assigned object.


Usage:

washb_glm(Y,tr,pair,W=NULL, forcedW=NULL, V=NULL, id,contrast,family=binomial(link='log'), pval=0.2, print=TRUE)


Example: unadjusted analysis of a binary outcome (diarrhea)

As an example, the following code applies the washb_glm function to compare 7-day recall diarrheal disease prevalence between the sanitation and control arms. Notice that the glm model is fit with a log link, family=binomial(link='log'), to estimate prevalence ratios rather than with a logit link, family="binomial", to estimate odds ratios. Occasionally, a glm model with a non-canonical link function like family=binomial(link='log') will fail to converge. If this occurs, use a modified Poisson regression to estimate the prevalence ratio with the argument family=poisson(link='log'). See Zou 2004 for details. The WASH Benefits Kenya primary outcome analysis used modified Poisson regressions to estimate prevalence ratios for all adjusted contrasts because the non-canonical binomial(link='log') failed to converge for some contrasts.
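
Before the worked example below, a sketch of that modified Poisson fallback, used only if the log-binomial model fails to converge (the object name Diar.pois.C.S is hypothetical):

# modified Poisson regression for the prevalence ratio (robust SEs are still used)
Diar.pois.C.S <- washb_glm(Y=ad$diar7d, tr=ad$tr, pair=ad$block, id=ad$clusterid,
                           contrast=c("Control","Sanitation"), family=poisson(link='log'))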

Diar.glm.C.S <- washb_glm(Y=ad$diar7d,tr=ad$tr,pair=ad$block, id=ad$clusterid, contrast=c("Control","Sanitation"), family=binomial(link='log'))


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                    PR      2.5%     97.5%   Estimate Std. Error   z value     Pr(>|z|)
trSanitation 0.6109162 0.4677257 0.7979433 -0.4927955  0.1362642 -3.616472 0.0002986455

 Type "`modelname'$TR" to return the treatment effect.
 Type "`modelname'$fit" to return full glm model estimates.
 Type "`modelname'$vcv" to return the variance-covariance matrix.
 Type "`modelname'$rowdropped" to return the vector list of observations included in the model fit
 Type "`modelname'$glmModel" to return the glm model fit to be used with predict() to return model predictions of the outcome

In addition to the function’s auto-printed output, the object returned by washb_glm contains a number of components. For example, 'objectname'$TR returns just the treatment effect of the intervention arm, which is useful when saving results to a row of an R object containing only the treatment effects across all the desired arm contrasts.

Diar.glm.C.S$TR
                    PR      2.5%     97.5%   Estimate Std. Error   z value     Pr(>|z|)
trSanitation 0.6109162 0.4677257 0.7979433 -0.4927955  0.1362642 -3.616472 0.0002986455


All returned objects are:

  • 'objectname'$TR returns the treatment effect.
  • 'objectname'$fit returns the full glm model estimates.
  • 'objectname'$vcv returns the variance-covariance matrix.
  • 'objectname'$rowdropped returns the vector of observations included in the model fit.
  • 'objectname'$lincom returns subgroup-specific conditional relative risk estimates if a subgroup V is specified.
  • 'objectname'$glmModel returns the glm model fit, which can be used with predict() to return model predictions of the outcome (see the sketch below). Note that this is the glm model fit without adjusting the standard errors for repeated measures, so you should not use it for inference if the data include repeated observations and/or if you have fit a modified Poisson model. It is safest to use the returned $TR and $fit objects for inference, which are based on robust SEs.
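
For example, a minimal sketch of obtaining predicted outcomes from the stored glm fit (using the Diar.glm.C.S object fit above; pred is a hypothetical name):

# predicted 7-day diarrhea prevalence for the observations used in the fit
# (point predictions only; use $TR and $fit, not this object, for inference)
pred <- predict(Diar.glm.C.S$glmModel, type="response")
head(pred)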


Often, the washb_glm function will be used for many contrasts, with the results saved to an object. If you want to avoid printing all the output from the function, use the argument print=FALSE. If you still want the results printed, but not the names and descriptions of the returned objects, use verbose=FALSE.
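
A sketch of looping over several contrasts and collecting only the treatment-effect rows (the object names h1, glm.h1, and h1.table are hypothetical):

# fit the same unadjusted model for several intervention-vs-control contrasts
h1 <- list(c("Control","Water"), c("Control","Sanitation"), c("Control","Handwashing"))
glm.h1 <- lapply(h1, function(ct) {
  washb_glm(Y=ad$diar7d, tr=ad$tr, pair=ad$block, id=ad$clusterid,
            contrast=ct, family=binomial(link='log'), print=FALSE)
})
# stack the $TR rows into a single table, one row per contrast
h1.table <- do.call(rbind, lapply(glm.h1, function(x) x$TR))
rownames(h1.table) <- c("Water v C","Sanitation v C","Handwashing v C")
round(h1.table, 4)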


Example: risk difference estimation using a gaussian model and robust standard errors

To estimate a risk difference of the intervention on a binary or count outcome, the code is the same except that the family="gaussian" argument is used.

RD.Diar.glm.C.S <- washb_glm(Y=ad$diar7d,tr=ad$tr,pair=ad$block, id=ad$clusterid, contrast=c("Control","Sanitation"), family="gaussian", verbose=FALSE)


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                   Coef.        2.5%      97.5%  Std. Error   z value     Pr(>|z|)
trSanitation -0.02200495 -0.03326159 -0.0107483 0.005743186 -3.831488 0.0001273707


Example: unadjusted analysis of a continuous outcome (LAZ)

For the continuous LAZ outcome analysis, the code is the same, except the family="gaussian" argument is used.

LAZ.glm.C.S <- washb_glm(Y=laz$laz,tr=laz$tr,pair=laz$block, id=laz$clusterid, contrast=c("Control","Sanitation"), family="gaussian", verbose=FALSE)


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                   Coef.       2.5%     97.5% Std. Error    z value  Pr(>|z|)
trSanitation -0.01862361 -0.1080533 0.0708061  0.0456274 -0.4081673 0.6831509


Example: adjusted analysis of a binary outcome (diarrhea)

First, subset the covariates to be screened for inclusion in the adjusted glm models into new data frames:

#Subset dataset to covariates to be screened 
Ws_diar <- subset(ad,select=c("month","agedays","sex","momage","momedu","momheight","hfiacat","Nlt18","Ncomp","watmin","elec","floor","walls","roof","asset_wardrobe","asset_table","asset_chair","asset_khat","asset_chouki","asset_tv","asset_refrig","asset_bike","asset_moto","asset_sewmach","asset_mobile"))

#Subset the LAZ dataset to covariates to be screened for inclusion in adjusted glm models
Ws_laz <- subset(laz,select=c("fracode","month","aged","sex","birthord","momage","momedu","momheight","hfiacat","Nlt18","Ncomp","watmin","elec","floor","walls","roof","asset_wardrobe","asset_table","asset_chair","asset_khat","asset_chouki","asset_tv","asset_refrig","asset_bike","asset_moto","asset_sewmach","asset_mobile"))

Use the W= argument and a dataframe of potential adjustment covariates (here, Ws_diar) to fit an adjusted glm model. The covariates in the W dataframe will be prescreened with the washb_prescreen function and variables with a p-value from a likelihood-ratio test lower than 0.2 are included as adjustment covariates in the final model. The pval= argument can be used to alter the p-value threshold for inclusion.


adj.Diar.glm.C.S <- washb_glm(Y=ad$diar7d,tr=ad$tr,pair=ad$block, W=Ws_diar, id=ad$clusterid, contrast=c("Control","Sanitation"), family=binomial(link='log'), verbose=FALSE)

-----------------------------------------
Dropping 67 observations due to missing values in 1 or more variables
 Final sample size: 5210 
-----------------------------------------

-----------------------------------------
Pre-screening the adjustment covariates:
-----------------------------------------

Likelihood Ratio Test P-values:
      [,1]             [,2]       
 [1,] "month"          "P = 0.000"
 [2,] "agedays"        "P = 0.001"
 [3,] "sex"            "P = 0.693"
 [4,] "momage"         "P = 0.958"
 [5,] "momedu"         "P = 0.000"
 [6,] "momheight"      "P = 0.471"
 [7,] "hfiacat"        "P = 0.026"
 [8,] "Nlt18"          "P = 0.414"
 [9,] "Ncomp"          "P = 0.725"
[10,] "watmin"         "P = 0.075"
[11,] "elec"           "P = 0.006"
[12,] "floor"          "P = 0.056"
[13,] "walls"          "P = 0.637"
[14,] "roof"           "P = 0.949"
[15,] "asset_wardrobe" "P = 0.107"
[16,] "asset_table"    "P = 0.074"
[17,] "asset_chair"    "P = 0.432"
[18,] "asset_khat"     "P = 0.002"
[19,] "asset_chouki"   "P = 0.606"
[20,] "asset_tv"       "P = 0.430"
[21,] "asset_refrig"   "P = 0.025"
[22,] "asset_bike"     "P = 0.480"
[23,] "asset_moto"     "P = 0.288"
[24,] "asset_sewmach"  "P = 0.138"
[25,] "asset_mobile"   "P = 0.599"


Covariates selected (P<0.2):
      [,1]             [,2]       
 [1,] "month"          "P = 0.000"
 [2,] "agedays"        "P = 0.001"
 [3,] "momedu"         "P = 0.000"
 [4,] "hfiacat"        "P = 0.026"
 [5,] "watmin"         "P = 0.075"
 [6,] "elec"           "P = 0.006"
 [7,] "floor"          "P = 0.056"
 [8,] "asset_wardrobe" "P = 0.107"
 [9,] "asset_table"    "P = 0.074"
[10,] "asset_khat"     "P = 0.002"
[11,] "asset_refrig"   "P = 0.025"
[12,] "asset_sewmach"  "P = 0.138"


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                    PR      2.5%     97.5%   Estimate Std. Error  z value      Pr(>|z|)
trSanitation 0.5847469 0.4474329 0.7642015 -0.5365762  0.1365574 -3.92931 0.00008519006

 RR of covariates
                                       PR      2.5%     97.5%
trSanitation                    0.5847469 0.4474329 0.7642015
month2                          0.4849529 0.1413928 1.6633051
month3                          0.4044081 0.1615790 1.0121730
month4                          1.1956219 0.5982246 2.3895905
month5                          1.0343119 0.4347668 2.4606320
month6                          2.0379062 0.7645573 5.4319819
month7                          1.1737215 0.4254977 3.2376728
month8                          1.6673212 0.6132235 4.5333555
month9                          0.3258038 0.1106729 0.9591154
month10                         0.2834488 0.1090672 0.7366392
month11                         0.5983482 0.1399533 2.5581436
month12                         0.6922535 0.1591620 3.0108628
agedays                         0.9993945 0.9990645 0.9997246
momeduPrimary (1-5y)            1.2507804 0.8202193 1.9073577
momeduSecondary (>5y)           0.9339511 0.5755248 1.5155990
hfiacatMildly Food Insecure     1.1719294 0.7229925 1.8996303
hfiacatModerately Food Insecure 1.1108363 0.8102622 1.5229111
hfiacatSeverely Food Insecure   1.3250600 0.7567987 2.3200146
watmin                          0.9383602 0.8234832 1.0692627
elec                            0.8751570 0.6641959 1.1531234
floor                           1.2400760 0.7134696 2.1553663
asset_wardrobe                  1.0590677 0.6852611 1.6367841
asset_table                     0.9294539 0.7110060 1.2150174
asset_khat                      0.8254681 0.6216595 1.0960946
asset_refrig                    0.8144093 0.4157987 1.5951528
asset_sewmach                   0.8804924 0.4795535 1.6166433


Example: forcing covariates into the glm model regardless of prescreening p-value

If there are certain variables to include in the glm model, regardless of prescreening results, use the forcedW option. This argument takes a variable name or a vector of variable names. These variables must already be in the W dataframe of potential adjustment covariates. For example, if adjustment by age and sex is standard, force them into the glm model.

glm.C.S <- washb_glm(Y=ad$diar7d,tr=ad$tr,pair=ad$block, W=Ws_diar, forcedW=c("agedays","sex"), id=ad$clusterid, contrast=c("Control","Sanitation"), family=binomial(link='log'), verbose=FALSE)

-----------------------------------------
Dropping 67 observations due to missing values in 1 or more variables
 Final sample size: 5210 
-----------------------------------------

-----------------------------------------
Include the following adjustment covariates without screening:
-----------------------------------------
[1] "agedays" "sex"    

-----------------------------------------
Pre-screening the adjustment covariates:
-----------------------------------------

Likelihood Ratio Test P-values:
      [,1]             [,2]       
 [1,] "month"          "P = 0.000"
 [2,] "momage"         "P = 0.958"
 [3,] "momedu"         "P = 0.000"
 [4,] "momheight"      "P = 0.471"
 [5,] "hfiacat"        "P = 0.026"
 [6,] "Nlt18"          "P = 0.414"
 [7,] "Ncomp"          "P = 0.725"
 [8,] "watmin"         "P = 0.075"
 [9,] "elec"           "P = 0.006"
[10,] "floor"          "P = 0.056"
[11,] "walls"          "P = 0.637"
[12,] "roof"           "P = 0.949"
[13,] "asset_wardrobe" "P = 0.107"
[14,] "asset_table"    "P = 0.074"
[15,] "asset_chair"    "P = 0.432"
[16,] "asset_khat"     "P = 0.002"
[17,] "asset_chouki"   "P = 0.606"
[18,] "asset_tv"       "P = 0.430"
[19,] "asset_refrig"   "P = 0.025"
[20,] "asset_bike"     "P = 0.480"
[21,] "asset_moto"     "P = 0.288"
[22,] "asset_sewmach"  "P = 0.138"
[23,] "asset_mobile"   "P = 0.599"


Covariates selected (P<0.2):
      [,1]             [,2]       
 [1,] "month"          "P = 0.000"
 [2,] "momedu"         "P = 0.000"
 [3,] "hfiacat"        "P = 0.026"
 [4,] "watmin"         "P = 0.075"
 [5,] "elec"           "P = 0.006"
 [6,] "floor"          "P = 0.056"
 [7,] "asset_wardrobe" "P = 0.107"
 [8,] "asset_table"    "P = 0.074"
 [9,] "asset_khat"     "P = 0.002"
[10,] "asset_refrig"   "P = 0.025"
[11,] "asset_sewmach"  "P = 0.138"


-----------------------------------------
 GLM Fit: Sanitation vs. Control 
-----------------------------------------
                    PR      2.5%     97.5%   Estimate Std. Error   z value      Pr(>|z|)
trSanitation 0.5847484 0.4473223 0.7643945 -0.5365736  0.1366849 -3.925625 0.00008650483

 RR of covariates
                                       PR      2.5%     97.5%
trSanitation                    0.5847484 0.4473223 0.7643945
agedays                         0.9993945 0.9990633 0.9997259
sexmale                         0.9991030 0.7616703 1.3105498
month2                          0.4849484 0.1415027 1.6619819
month3                          0.4043968 0.1615580 1.0122482
month4                          1.1955974 0.5982208 2.3895074
month5                          1.0342857 0.4349656 2.4593830
month6                          2.0377791 0.7636719 5.4376012
month7                          1.1736033 0.4253388 3.2382295
month8                          1.6672885 0.6127804 4.5364550
month9                          0.3258044 0.1107348 0.9585830
month10                         0.2834577 0.1091513 0.7361185
month11                         0.5983104 0.1398073 2.5604916
month12                         0.6922687 0.1593723 3.0070209
momeduPrimary (1-5y)            1.2508399 0.8211047 1.9054822
momeduSecondary (>5y)           0.9340191 0.5768368 1.5123715
hfiacatMildly Food Insecure     1.1718797 0.7215877 1.9031670
hfiacatModerately Food Insecure 1.1108493 0.8104317 1.5226280
hfiacatSeverely Food Insecure   1.3249358 0.7536143 2.3293810
watmin                          0.9383588 0.8235262 1.0692037
elec                            0.8751449 0.6641084 1.1532435
floor                           1.2401087 0.7130610 2.1567155
asset_wardrobe                  1.0590718 0.6851276 1.6371156
asset_table                     0.9294398 0.7110521 1.2149014
asset_khat                      0.8254342 0.6208087 1.0975064
asset_refrig                    0.8143569 0.4153225 1.5967766
asset_sewmach                   0.8805112 0.4798462 1.6157259


Example: subgroup analysis with washb_glm

The V argument can be used to create an interaction term with tr, the treatment variable. V is the name of a variable that must be included in the W data frame. Here is code for a subgroup analysis of the effect of the nutrition treatment on diarrheal disease, stratified by target child (vs. sibling).

#Create a W data frame containing "tchild" (other potential covariates could also be included):
W_tchild <- subset(ad,select=c("tchild"))

#Estimate subgroup analysis glm with washb_glm
glm.C.N.byChildType <- washb_glm(Y=ad$diar7d,tr=ad$tr,pair=ad$block, W=W_tchild, V="tchild", id=ad$clusterid, contrast=c("Control","Nutrition"), family=binomial(link='log'), verbose=FALSE)

-----------------------------------------
Include the following adjustment covariates without screening:
-----------------------------------------
[1] "tchild"


-----------------------------------------
 GLM Fit: Nutrition vs. Control  by Subgroup: ' tchild '
-----------------------------------------
  Tr vs. C by Subgroup    est se.est est.lb est.ub       Z      P
1              Sibling 0.6255 0.3028 0.3455 1.1324 -1.5494 0.1213
2         Target child 0.6421 0.1645 0.4651 0.8865 -2.6926 0.0071


-----------------------------------------
 Significance of effect modification variables 
-----------------------------------------
                                PR  Pr(>|z|)
trNutrition:VTarget child 1.026499 0.9376021
#Examine the treatment effect across subgroups with `objectname'$lincom
glm.C.N.byChildType$lincom
  Tr vs. C by Subgroup    est se.est est.lb est.ub       Z      P
1              Sibling 0.6255 0.3028 0.3455 1.1324 -1.5494 0.1213
2         Target child 0.6421 0.1645 0.4651 0.8865 -2.6926 0.0071





Function: washb_mh

Unadjusted Mantel-Haenszel estimates

Arguments: Y=binary outcome (in this example, Y= 7-day diarrheal disease recall ad$diar7d)
tr=binary treatment group variable, comparison group first
strat=stratification variable (In WASH Benefits: the pair-matched block variable)
contrast= character vector of the format c("contrast1", "contrast2") of the two factor levels within the treatment variable (“tr”) to compare. "contrast1" is the reference group and "contrast2" is the active comparison.
measure=measure of effect. RR = prev ratio, RD = prev difference


Usage: washb_mh(Y,tr,strat,contrast,measure="RR")

Apply washb_mh to the sanitation vs. control arm contrast (With the round() function added to clean output and keep it from spilling across multiple lines).

mh.CvS<-washb_mh(Y=ad$diar7d,tr=ad$tr, contrast=c("Control","Sanitation"), strat=ad$block,measure="RR")
print(round(mh.CvS,4))
     PR,    ci.lb    ci.ub    logPR se.logPR        Z        p 
  0.6085   0.4569   0.8102  -0.4968   0.1461  -3.4001   0.0007 

The function washb_mh can also be used to calculate an unadjusted risk difference:

round(washb_mh(Y=ad$diar7d,tr=ad$tr, contrast=c("Control","Sanitation"), strat=ad$block,measure="RD"),4)
     RD   se.RD   ci.lb   ci.ub       Z       p 
-0.0220  0.0059 -0.0335 -0.0105 -3.7396  0.0002 


Advanced R tip: Use the sapply command to efficiently apply the function to all the treatment arm contrasts

#Create a list of contrasts for the pre-specified hypothesis 1 (each intervention arm vs. control) to facilitate comparisons between arms.
h1.contrasts <- list(c("Control","Water"), c("Control","Sanitation"), c("Control","Handwashing"), c("Control","WSH"), c("Control","Nutrition"), c("Control","Nutrition + WSH"))

#Apply sapply to run the function on each comparison in the h1.contrasts list
diff.h1 <- t(sapply(h1.contrasts,washb_mh,Y=ad$diar7d,tr=ad$tr,strat=ad$block,measure="RR"))
rownames(diff.h1) <- c("Water v C","Sanitation v C","Handwashing v C","WSH v C","Nutrition v C","Nutrition + WSH v C")
print(round(diff.h1,4))
                       PR,  ci.lb  ci.ub   logPR se.logPR       Z      p
Water v C           0.8883 0.6966 1.1326 -0.1185   0.1240 -0.9556 0.3393
Sanitation v C      0.6085 0.4569 0.8102 -0.4968   0.1461 -3.4001 0.0007
Handwashing v C     0.6002 0.4526 0.7958 -0.5105   0.1440 -3.5463 0.0004
WSH v C             0.6917 0.5310 0.9010 -0.3686   0.1349 -2.7324 0.0063
Nutrition v C       0.6438 0.4872 0.8505 -0.4404   0.1421 -3.0990 0.0019
Nutrition + WSH v C 0.6193 0.4709 0.8146 -0.4791   0.1399 -3.4258 0.0006


Asymptotic equivalence of Mantel-Haenszel and GLM:

The washb_mh() estimate and the washb_glm() unadjusted estimate with a family=binomial(link="log") link are asymptotically equivalent. As sample size increases to infinity, the estimates should converge. As seen below, the estimates using the Bangladesh diarrheal data are functionally the same.

#GLM
Diar.glm.C.S$TR
                    PR      2.5%     97.5%   Estimate Std. Error   z value     Pr(>|z|)
trSanitation 0.6109162 0.4677257 0.7979433 -0.4927955  0.1362642 -3.616472 0.0002986455
#Mantel-Haenszel
print(round(mh.CvS,4))
     PR,    ci.lb    ci.ub    logPR se.logPR        Z        p 
  0.6085   0.4569   0.8102  -0.4968   0.1461  -3.4001   0.0007 




Function: washb_ttest

Function to call the paired t-test for two different arms of the study. It estimates the paired t-test for differences in means, paired within randomization blocks. This function should be used for the unadjusted analysis of continuous outcomes.
Arguments: Y= continuous outcome (in this example, Y= child length-for-age z-score laz$laz)
tr=binary treatment group variable, comparison group first
strat=stratification variable (In WASH Benefits: the pair-matched block variable)
contrast= character vector of the format c("contrast1", "contrast2") of the two factor levels within the treatment variable (“tr”) to compare. "contrast1" is the reference group and "contrast2" is the active comparison.


Usage: washb_ttest(Y,tr,strat,contrast)


Below, the washb_ttest is applied to the LAZ outcome across all intervention vs. control contrasts.

#Run washb_ttest on Nutrition vs. control arm comparison
washb_ttest(Y=laz$laz,tr=laz$tr,strat=laz$block, contrast=c("Control","Nutrition"))
          diff          ci.lb          ci.ub         t-stat              p 
0.252588082751 0.147026825613 0.358149339889 4.754463463037 0.000007613508 
#Use sapply to apply across all contrasts
diff.h1LAZ <- t(sapply(h1.contrasts,washb_ttest,Y=laz$laz,tr=laz$tr,strat=laz$block))
rownames(diff.h1LAZ) <- c("Water v C","Sanitation v C","Handwashing v C","WSH v C","Nutrition v C","Nutrition + WSH v C")
print(diff.h1LAZ)
                           diff       ci.lb      ci.ub     t-stat              p
Water v C           -0.06463895 -0.18124459 0.05196670 -1.1014579 0.273667333684
Sanitation v C      -0.02274895 -0.13827579 0.09277788 -0.3912657 0.696536006223
Handwashing v C     -0.06792320 -0.17626035 0.04041394 -1.2457589 0.216122459122
WSH v C              0.01741604 -0.09296438 0.12779645  0.3135094 0.754627284897
Nutrition v C        0.25258808  0.14702683 0.35814934  4.7544635 0.000007613508
Nutrition + WSH v C  0.12918914  0.01658610 0.24179218  2.2796549 0.025018572196


Asymptotic equivalence of the paired t-test and GLM

The washb_ttest() estimate and the washb_glm() unadjusted estimate with family="gaussian" are asymptotically equivalent. As sample size increases to infinity, the estimates should converge. As seen by comparing the t-test estimates above with the glm estimates below, the estimates using the Bangladesh LAZ data are functionally the same.

LAZ.glm.C.N <- washb_glm(Y=laz$laz,tr=laz$tr,pair=laz$block, id=laz$clusterid, contrast=c("Control","Nutrition"), family="gaussian", print=FALSE)
LAZ.glm.C.N$TR
                Coef.      2.5%     97.5% Std. Error  z value          Pr(>|z|)
trNutrition 0.2427187 0.1613852 0.3240522 0.04149668 5.849113 0.000000004942027
LAZ.glm.C.WSHN <- washb_glm(Y=laz$laz,tr=laz$tr,pair=laz$block, id=laz$clusterid, contrast=c("Control","Nutrition + WSH"), family="gaussian", print=FALSE)
LAZ.glm.C.WSHN$TR
                      Coef.       2.5%     97.5% Std. Error  z value    Pr(>|z|)
trNutrition + WSH 0.1239545 0.03821548 0.2096936 0.04374441 2.833608 0.004602573




Function: washb_permute

Non-parametric unadjusted estimates (Wilcoxon Signed Rank permutation test)

WASH Benefits Wilcoxon Signed Rank permutation test function for two treatment arms conditional on randomization block. It conducts a permutation test of the independence of Y and tr, conditional on randomization block using the Wilcoxon rank-sum test statistic.


Introduction:
The washb_glm and related functions (above) enable us to test hypotheses about whether the average outcome differs between trial arms, that is, whether the difference in means is equal to zero. The mean is just one function of the outcome distribution. A sharper test is of the “sharp null hypothesis” first proposed by Ronald Fisher, under which we test whether there is any difference at all in the outcome distributions between intervention arms. Under the null hypothesis of no treatment effect, differences between the outcome distributions should be indistinguishable from random sampling variation. One way to test the sharp null hypothesis is with a permutation test (also called a Fisher randomization test). The intuition behind the test is that there is a single source of random variation in the trial: the random allocation of treatment. If we re-randomize treatment in every possible combination (or a very large number of combinations) and compare arms using a test statistic for each combination, this provides the test statistic’s distribution under the null hypothesis of no effect (since the re-shuffled treatment is completely uninformative in each iteration). We can then compare the observed test statistic, computed with the real group assignments, to the null distribution to see how likely it would be to have occurred by chance. For more details on randomization tests, see the references below.
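
To make the intuition concrete, here is a toy sketch of a permutation test on simulated data, using the difference in means as the test statistic (this is not the washb_permute implementation, which permutes within randomization blocks and uses the Wilcoxon statistic):

set.seed(1)
y  <- rnorm(100)          # simulated outcome with no true treatment effect
tx <- rep(c(0,1), 50)     # simulated treatment assignment
obs.stat <- mean(y[tx==1]) - mean(y[tx==0])
# null distribution: re-shuffle treatment and recompute the statistic many times
null.stat <- replicate(10000, {
  tx.perm <- sample(tx)
  mean(y[tx.perm==1]) - mean(y[tx.perm==0])
})
# two-sided permutation p-value: how extreme is the observed statistic?
mean(abs(null.stat) >= abs(obs.stat))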

In the WASH Benefits primary analysis, we pre-specified that we would test this sharp null hypothesis using a Wilcoxon signed rank test, a non-parametric test statistic that has been shown to have good power against alternatives for outcomes with potentially skewed distributions (Imbens GW, Rubin DB. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press; 2015).


Arguments:
Y= Outcome variable (continuous, such as LAZ, or binary, such as diarrhea).
tr= Binary treatment group variable (ideally a factor), comparison group first.
pair= Pair-matched randomization ID variable (in WASH Benefits: block).
contrast= Vector of length 2 that includes the groups to contrast, e.g., c(“Control”,“Water”).
nreps= Number of permutations to run.
seed= Number for pseudo-random number generation in R.


Usage: washb_permute(Y,tr,pair,contrast,nreps=100000,seed=NULL)


References:
1. Gail, M. H., Mark, S. D., Carroll, R. J., Green, S. B. & Pee, D. On Design Considerations and Randomization-Based Inference for Community Intervention Trials. Statist. Med. 15, 1069–1092 (1996).
2. Braun, T. M. & Feng, Z. Optimal Permutation Tests for the Analysis of Group Randomized Trials. Journal of the American Statistical Association 96, 1424–1432 (2001).
3. Rosenbaum, P. R. Covariance Adjustment in Randomized Experiments and Observational Studies. Statist. Sci. 17, 286–327 (2002).


Apply the washb_permute function to the Sanitation-Control arm comparison:

permute.C.S <- washb_permute(Y = ad$diar7d, tr = ad$tr, pair = ad$block, contrast = c("Control", "Sanitation"), nreps = 100000, seed = 242524)

    Approximative Wilcoxon-Pratt Signed-Rank Test

data:  y by x (pos, neg) 
     stratified by block
Z = 3.551, p-value = 0.00032
alternative hypothesis: true mu is not equal to 0


Adjusted permutation tests

The permutation test can also test for independence of Y and tr, conditional on W, by applying the washb_permute function to the residuals of an adjusted model (adjusted for W, but not tr). The permutation test can have even greater power against alternatives if it is conducted on residuals from an algorithmic fit that removes variability of the outcome due to characteristics other than treatment. The intuition is that by removing outcome variability that is not associated with the treatment of interest, we can more narrowly focus our inference on differences due to treatment. In the example below, we show how to use a simple linear regression to predict LAZ, and then conduct the permutation test on the residuals from the regression. Alternatively, a more flexible machine-learning approach, such as super learner, can be used in the initial prediction step (a future update of this software will provide an example using super learner).

#Use the washb_prescreen function to select adjustment covariates associated with the outcome
adj.W<-washb_prescreen(Y=laz$laz,Ws_laz,family="gaussian", pval=0.2)

#Subset the laz dataset to the control and sanitation arms:
laz.subset <- laz[which(laz$tr=="Control"|laz$tr=="Sanitation"),]

#Subset the LAZ dataset to the selected adjustment covariates and LAZ, as well as tr and block, which will be needed in the permutation test
perm.adj.data <- subset(laz.subset,select=c(adj.W, "laz", "tr", "block"))

#Subset to complete cases
perm.adj.data<-perm.adj.data[complete.cases(perm.adj.data),]

Wselect <- subset(perm.adj.data,select=c(adj.W, "laz"))
#fit the glm model 
fit <- glm(laz~., family="gaussian", data=Wselect)

#Use the predict to return predicted LAZ from the adjusted glm model, and subtract it from the observed LAZ outcome
residuals<-Wselect$laz-predict(fit)

#run the permutation test function
permute.adj.C.S <- washb_permute(Y=residuals,tr=perm.adj.data$tr,pair=perm.adj.data$block,contrast=c("Control","Sanitation"), nreps=100000,seed=1241353)





Internal function: washb_prescreen

Note: The washb_prescreen and washb_lincom functions were written as internal functions (called by washb_glm, above), but we have provided some additional details for how they work in case investigators wish to use them separately.
The washb_prescreen function selects covariates with univariate associations with the outcome of P<0.2 (default) based on a likelihood ratio test. It is called as a part of the washb_glm function to select adjustment covariates, but can also be run as an independent function. The pval= argument can be used to alter the p-value threshold for inclusion.

Arguments:
Y Outcome variable (continuous, such as LAZ, or binary, such as diarrhea).
Ws data frame that includes candidate adjustment covariates to screen.
family GLM model family ("gaussian", "binomial", "poisson", or "neg.binom" for negative binomial).
pval The p-value threshold: any variables with a p-value from the likelihood ratio test below this threshold will be returned. Defaults to 0.2.
print Logical for whether to print function output, defaults to TRUE.


Usage:

washb_prescreen(Y=ad$diar7d,Ws=W,family="binomial", pval=0.2, print=TRUE)


Run the washb_prescreen function

The washb_prescreen function performs a likelihood ratio test on a set of potential covariates and returns all covariates with an associated p-value < 0.2 (unless a custom p-value is specified). It can be called as a function on its own, and is also called internally within the washb_glm function. Unless otherwise specified by the user, the washb_glm function will therefore only include covariates with an LR-test p-value < 0.2 in the model. See the Function: washb_glm section for more details.

prescreened_varnames<-washb_prescreen(Y=ad$diar7d,Ws_diar,family="binomial", pval=0.2)

Likelihood Ratio Test P-values:
      [,1]             [,2]       
 [1,] "month"          "P = 0.000"
 [2,] "agedays"        "P = 0.000"
 [3,] "sex"            "P = 0.159"
 [4,] "momage"         "P = 0.858"
 [5,] "momedu"         "P = 0.001"
 [6,] "momheight"      "P = 0.837"
 [7,] "hfiacat"        "P = 0.000"
 [8,] "Nlt18"          "P = 0.146"
 [9,] "Ncomp"          "P = 0.858"
[10,] "watmin"         "P = 0.017"
[11,] "elec"           "P = 0.002"
[12,] "floor"          "P = 0.009"
[13,] "walls"          "P = 0.173"
[14,] "roof"           "P = 0.446"
[15,] "asset_wardrobe" "P = 0.003"
[16,] "asset_table"    "P = 0.278"
[17,] "asset_chair"    "P = 0.264"
[18,] "asset_khat"     "P = 0.054"
[19,] "asset_chouki"   "P = 0.883"
[20,] "asset_tv"       "P = 0.109"
[21,] "asset_refrig"   "P = 0.015"
[22,] "asset_bike"     "P = 0.005"
[23,] "asset_moto"     "P = 0.233"
[24,] "asset_sewmach"  "P = 0.004"
[25,] "asset_mobile"   "P = 0.713"


Covariates selected (P<0.2):
      [,1]             [,2]       
 [1,] "month"          "P = 0.000"
 [2,] "agedays"        "P = 0.000"
 [3,] "sex"            "P = 0.159"
 [4,] "momedu"         "P = 0.001"
 [5,] "hfiacat"        "P = 0.000"
 [6,] "Nlt18"          "P = 0.146"
 [7,] "watmin"         "P = 0.017"
 [8,] "elec"           "P = 0.002"
 [9,] "floor"          "P = 0.009"
[10,] "walls"          "P = 0.173"
[11,] "asset_wardrobe" "P = 0.003"
[12,] "asset_khat"     "P = 0.054"
[13,] "asset_tv"       "P = 0.109"
[14,] "asset_refrig"   "P = 0.015"
[15,] "asset_bike"     "P = 0.005"
[16,] "asset_sewmach"  "P = 0.004"


Description of output

The function runs a likelihood ratio test between a generalized linear model fit to each screened variable and a null model. The first section of the output includes all screened variables and the p-values from the likelihood ratio tests. The second section lists only those variables with a p-value below the threshold set with the pval argument (<0.2 by default).


Use the following code to return the saved list of variable names for the selected variables. Then, subset the covariate data frame to only those variables:

prescreened_varnames
 [1] "month"          "agedays"        "sex"            "momedu"         "hfiacat"        "Nlt18"          "watmin"         "elec"           "floor"          "walls"          "asset_wardrobe" "asset_khat"     "asset_tv"       "asset_refrig"   "asset_bike"     "asset_sewmach" 
prescreened_vars <- subset(Ws_diar,select=prescreened_varnames)
#Examine the first five observations of the second selected variable, child age in days:
prescreened_vars[1:5,2]
[1]  346  810 1289 1753 1780




Internal function: washb_lincom

Note: The washb_prescreen and washb_lincom functions were written as internal functions (called by washb_glm, above), but we have provided some additional details for how they work in case investigators wish to use them separately.
The subgroup analysis option in washb_glm internally calls the washb_lincom function to calculate estimates, SEs, CIs, and P-values from a linear combination of regression coefficients. The washb_lincom function can also be called on its own to calculate a custom linear combination of coefficients.

Arguments:
lc= Index vector marking which glm model coefficients to include in the linear combination.
varlist= Character vector of variable names to include. Alternative to lc. If lc is specified, this argument is ignored.
fit= Model object returned from the coeftest (class=coeftest) command within the washb_glm function. Accessed with $fit from the washb_glm output.
vcv= Variance-covariance matrix of the coefficients. Accessed with $vcv from the washb_glm output.
measure= Measure of effect. RR = risk ratio, RD = risk difference.
flag= Internal argument used to flag and suppress printing if the washb_lincom function is called within another function.


Usage: washb_lincom(lc=NULL,varlist=NULL,fit,vcv, measure="RR", flag=NULL)


Target child v. sibling subgroup analysis

Below is an example of using the washb_lincom() function to externally replicate the target child/sibling subgroup analysis of the diarrheal disease outcome. First, recall the subgroup-specific prevalence ratios from the glm model of diarrheal disease fit above (Nutrition vs. Control, by target child vs. sibling):

glm.C.N.byChildType$lincom
  Tr vs. C by Subgroup    est se.est est.lb est.ub       Z      P
1              Sibling 0.6255 0.3028 0.3455 1.1324 -1.5494 0.1213
2         Target child 0.6421 0.1645 0.4651 0.8865 -2.6926 0.0071

Next, use the washb_lincom function to verify the subgroup analysis calculated within the washb_glm() function. washb_lincom() takes as arguments the fit and variance-covariance matrix returned by washb_glm, and an index vector lc.
lc= vector indexing the coefficients to be linearly combined. If a single coefficient is chosen, it will return the same coefficient and confidence interval as the glm coefficient. Example:

  #Create lc vector of 0's equal in length to the number of coefficients from the glm model.
lc=rep(0,nrow(glm.C.N.byChildType$fit))
#Examine model coefficients (minus the pair-matched block estimates) to determine the position of coefficients to combine.
glm.C.N.byChildType$fit[1:3,]
                     PR       2.5%      97.5%   Estimate Std. Error   z value       Pr(>|z|)
(Intercept)   0.0138250 0.00255654 0.07476143 -4.2812768  0.8611345 -4.971670 0.000000663786
trNutrition   0.6255283 0.34554354 1.13237715 -0.4691588  0.3027948 -1.549428 0.121278884763
VTarget child 1.6236081 1.14428485 2.30371258  0.4846509  0.1785056  2.715045 0.006626669997
  #Replace the second position in the vector with 1 (the position of the treatment coefficient in the model)
lc[2]<-1
  #Run the lincom function and compare output to the treatment effect from the GLM model.
washb_lincom(lc=lc,fit=glm.C.N.byChildType$fit,vcv=glm.C.N.byChildType$vcv, measure="RR") 

Linear combination of coefficients:
[1] "trNutrition"
         RR  se.RR  RR.lb  RR.ub       Z      P
[1,] 0.6255 0.3028 0.3455 1.1324 -1.5494 0.1213

Because sibling is the reference level of the VTarget child term, running the washb_lincom() function with only the treatment contrast term marked in the lc index vector equals both the prevalence ratio of the trNutrition term and the prevalence ratio of the sibling subgroup returned in the glm.C.N.byChildType$lincom object. To estimate the PR for the target child subgroup, the interaction term trNutrition:VTarget child also needs to be marked in the lc index.

#Add a 1 at the position of the interaction term (the last coefficient in the model fit) to include it in the linear combination.
lc[nrow(glm.C.N.byChildType$fit)]<-1
washb_lincom(lc=lc,fit=glm.C.N.byChildType$fit,vcv=glm.C.N.byChildType$vcv, measure="RR") 

Linear combination of coefficients:
[1] "trNutrition + pair2"
         RR  se.RR  RR.lb   RR.ub      Z      P
[1,] 3.4565 0.9551 0.5317 22.4722 1.2986 0.1941

This estimate equals the target child PR estimated by washb_glm with the V argument specified. Note that the coefficient for the VTarget child term is not included in this linear combination. Including that term would instead calculate the prevalence ratio between nutrition arm target children and control arm siblings, whereas in a subgroup analysis we want the prevalence ratio between arms within, rather than across, subgroups.
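
For illustration, a sketch of that cross-group comparison (nutrition-arm target children vs. control-arm siblings), which is generally not the contrast of interest in a subgroup analysis. The coefficient positions assume, as above, that the interaction term is the last coefficient in the fit; lc2 is a hypothetical name:

lc2 <- rep(0, nrow(glm.C.N.byChildType$fit))
lc2[2] <- 1                               # trNutrition
lc2[3] <- 1                               # VTarget child
lc2[nrow(glm.C.N.byChildType$fit)] <- 1   # trNutrition:VTarget child (assumed last)
washb_lincom(lc=lc2, fit=glm.C.N.byChildType$fit, vcv=glm.C.N.byChildType$vcv, measure="RR")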