Basic statistics exercises (survival analysis)

Tutorial
Lung cancer data
Time to pregnancy data
- Effect of smoking (KaplanMeier & logRank)
- Effect of alcohol (Cox regression)
  - Continuous alcohol variable
  - Adjusted analyses

Tutorial

Lung cancer data

Load the data "lung" from the survival package.

Create a new status variable that fits to the standard coding 0 for censored and 1 for event (death).
Compute the Kaplan-Meier estimate of the overall survival curves and plot the results.
Read-off the median survival time from the graph.
Compute the Kaplan-Meier estimate of the survival curves in strata defined by patients gender and visualize the results
Use a Cox regression model to assess if the survival chances are gender dependent.
State the findings in a sentence using hazard ratio with 95% confidence interval and the corresponing p-value.
Adjust the results for patient age and revise the result sentence

Create a new status variable

library(survival)
library(prodlim)
library(data.table)
data(lung)
setDT(lung)
lung[,Status:=as.numeric(status==2)]

Univariate Kaplan-Meier

Compute the Kaplan-Meier estimate of the overall survival curves and plot the result:

library(prodlim)
KaplanMeierEst = prodlim(Surv(time,Status)~1,data=lung)
plot(KaplanMeierEst,xlab="Days")

quantile(KaplanMeierEst)

X11cairo 
       2
Quantiles of the event time distribution based on the   method.

Table of quantiles and corresponding confidence limits:
     q quantile lower upper
1 0.00       NA    NA    NA
2 0.25      550   457   643
3 0.50      310   284   361
4 0.75      170   145   197
5 1.00        5     5    11
Median time (IQR):310.00 (170.00;550.00)

According to the Kaplan-Meier estimate the median survival time (95% confidence limits) is 310 days (284;361).

stratified Kaplan-Meier

Compute the Kaplan-Meier estimate of the survival curves in strata defined by patients gender and visualize the result:

KaplanMeierEstStrata = prodlim(Surv(time,Status)~sex,data=lung)
plot(KaplanMeierEstStrata,xlab="Days",confint=TRUE)

Univariate Cox regression

Use a Cox regression model to assess if the survival chances are gender dependent:

coxfit = coxph(Surv(time,Status)~sex,data=lung)
summary(coxfit)

X11cairo 
       2
Call:
coxph(formula = Surv(time, Status) ~ sex, data = lung)

  n= 228, number of events= 165 

       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.022 )
Rsquare= 0.046   (max possible= 0.999 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001
Wald test            = 10.09  on 1 df,   p=0.001
Score (logrank) test = 10.33  on 1 df,   p=0.001

lung$Sex <- factor(lung$sex,levels=c("1","2"),labels=c("Male","Female"))
lung$Sex <- relevel(lung$Sex,"Female")
coxfit = coxph(Surv(time,Status)~Sex,data=lung)
publish(coxfit,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
Sex	Female	Ref
	Male	1.70	[1.23;2.36]	0.00149

State the findings in a sentence using hazard ratio with 95% confidence interval and the (Wald test) p-value.

The survival chances were significantly lower for men as compared to women (hazard ratio: 1.7; 95% confidence interval: [1.2;2.4]; Wald test p-value: 0.001).

Age-adjusted Cox regression

coxfit.adj = coxph(Surv(time,Status)~Sex+age,data=lung)
publish(coxfit.adj,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
Sex	Female	Ref
	Male	1.67	[1.20;2.32]	0.00218
age		1.02	[1.00;1.04]	0.06459

Cox regression adjusted for age showed that the survival chances were significantly lower for men as compared to women (hazard ratio: 1.67; 95% confidence interval: [1.2;2.32]; Wald test p-value: 0.002).

Time to pregnancy data

Load the ttp data.

Analyse the effect of smoking (separately for the woman and the man) on the time-to-pregnancy with Kaplan-Meier curves.
Test the smoking effect with a simple Cox regression model. Write a conclusion sentence which includes the hazard ratio and the corresponding confidence interval and p-value.
Study the interaction of the smoking status of the woman and the smoking status of the man on the time to pregancy. In this model, what is the hazard ratio between a non-smoking couple and a couple where both partners smoke?
Analyse the effect of the woman's alcohol drinking behaviour (''alcWoman'') in a Cox regression analysis. Write a conclusion sentence which includes the hazard ratio and confidence limits.
Categorize the number of beverages per week into three groups:
- group 1 (''no''): 0 drinks
- group 2 (''moderate''): 1-10 drinks
- group 3 (''high''): 11 or more drinks

Then analyse the effect of this three group variable on the time to pregnancy. Write conclusion sentences which include pairwise comparisons in terms of hazard ratios of all three groups.

Adjust the previous analysis of alcohol effect for possible confounding by smoking and revise the results.
Adjust the previous analysis of alcohol effect for possible confounding by sperm concentration and revise the results. Find possible explanations for why the results differ from the univariate analysis.

Effect of smoking (KaplanMeier & logRank)

woman smoking

par(mar=c(10,10,5,5))
fw <- prodlim(Surv(ttp,status)~smokeWoman,data=ttp)
plot(fw,
     type="cuminc",
     ylab="Probability of getting pregnant",
     legend.title="Woman smoking",
     atrisk.title="Woman smoking",
     legend.x="topleft",
     atrisk.at=seq(0,200,50),
     atrisk.pos=-20,
     xlab="Days")

logRankWoman <- survdiff(Surv(ttp,status)~smokeWoman,data=ttp)
publish(logRankWoman,org=TRUE)

X11cairo 
       2
 Log-rank test      N Observed Expected squared(O-E)/E squared(O-E)/V 
 smokeWoman=No 300.00   194.00   176.46           1.74           5.64 
 smokeWoman=Yes 123.00    62.00    79.54           3.87           5.64 

 Chisq= 5.6  on 1 degrees of freedom, p= 0.0176

The chance of getting pregnant was significantly higher when the woman was not smoking compared to when the woman was smoking (Log-rank p-value: 0.0176).

man smoking

par(mar=c(10,10,5,5))
fm <- prodlim(Surv(ttp,status)~smokeMan,data=ttp)
plot(fm,
     type="cuminc",
     ylab="Probability of getting pregnant",
     legend.x="topleft",
     legend.title="Man smoking",
     atrisk.title="Man smoking",
     atrisk.at=seq(0,200,50),
     atrisk.pos=-20,
     xlab="Days")

logRankMan <- survdiff(Surv(ttp,status)~smokeMan,data=ttp)
publish(logRankMan,org=TRUE)

X11cairo 
       2
 Log-rank test      N Observed Expected squared(O-E)/E squared(O-E)/V 
 smokeMan=No 290.00   183.00   172.31           0.66           2.04 
 smokeMan=Yes 133.00    73.00    83.69           1.37           2.04 

 Chisq= 2  on 1 degrees of freedom, p= 0.1536

The chance of getting pregnant was higher when the man was not smoking compared to when the man was smoking (see Kaplan-Meier plot), however, this difference was not statistically significant (Log-rank p-value: 0.1536).

Univariate Cox regression

coxWoman <- coxph(Surv(ttp,status)~smokeWoman,data=ttp)
publish(coxWoman,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
smokeWoman	No	Ref
	Yes	0.71	[0.53;0.94]	0.0182

Smoking of the woman decreased the hazard and hence increased the time to pregnancy significantly (hazard ratio=0.71, 95%-CI=[0.53;0.94], p=0.018).

coxMan <- coxph(Surv(ttp,status)~smokeMan,data=ttp)
publish(coxMan,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
smokeMan	No	Ref
	Yes	0.82	[0.63;1.08]	0.154

Smoking of the man did not significantly increase the time to pregnancy (hazard ratio=0.82, 95%-CI=[0.63;1.08], p=0.15).

Interaction

coxInter <- coxph(Surv(ttp,status)~smokeMan*smokeWoman,data=ttp,x=TRUE)
summary(coxInter)

Call:
coxph(formula = Surv(ttp, status) ~ smokeMan * smokeWoman, data = ttp, 
    x = TRUE)

  n= 423, number of events= 256 

                             coef exp(coef) se(coef)      z Pr(>|z|)  
smokeManYes               -0.1178    0.8889   0.1869 -0.630   0.5285  
smokeWomanYes             -0.3695    0.6911   0.2192 -1.686   0.0918 .
smokeManYes:smokeWomanYes  0.1210    1.1287   0.3209  0.377   0.7061  
---
codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                          exp(coef) exp(-coef) lower .95 upper .95
smokeManYes                  0.8889      1.125    0.6162     1.282
smokeWomanYes                0.6911      1.447    0.4497     1.062
smokeManYes:smokeWomanYes    1.1287      0.886    0.6018     2.117

Concordance= 0.536  (se = 0.017 )
Rsquare= 0.015   (max possible= 0.999 )
Likelihood ratio test= 6.31  on 3 df,   p=0.1
Wald test            = 6.01  on 3 df,   p=0.1
Score (logrank) test = 6.07  on 3 df,   p=0.1

publish(coxInter,org=TRUE)

Variable	HazardRatio	CI.95	p-value
smokeMan(No): smokeWoman(Yes vs No)	0.69	[0.45;1.06]	0.0918
smokeMan(Yes): smokeWoman(Yes vs No)	0.78	[0.49;1.23]	0.2892
smokeWoman(No): smokeMan(Yes vs No)	0.89	[0.62;1.28]	0.5285
smokeWoman(Yes): smokeMan(Yes vs No)	1.00	[0.60;1.67]	0.9902

There is no significant effect modification. The hazard ratio for non-smoking couples compared to smoking couples is:

exp(coef_1 + coef_2 + coef_3) = exp(-0.1178-0.3695+0.121)=
 0.8889 * 0.6911 * 1.1287 = 0.693

This is similar to the hazard ratio for smoking woman compared to non-smoking woman when the man is not smoking:

exp(coef_2) = exp(0.3695)= 0.6911

The p-value is 0.7061.

We can obtain the comparison between both smoking and none smoking by including the variable coupleSmoking (which has 4 levels):

publish(coxph(Surv(ttp,status)~coupleSmoking,data=ttp),org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
coupleSmoking	none	Ref
	woman only	0.69	[0.45;1.06]	0.0918
	man only	0.89	[0.62;1.28]	0.5285
	both	0.69	[0.49;0.99]	0.0428

Effect of alcohol (Cox regression)

Continuous alcohol variable

coxDrink <- coxph(Surv(ttp,status)~alcWoman,data=ttp)
publish(coxDrink,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
alcWoman		0.98	[0.95;1.01]	0.132

The effect of alcohol is not statistically significant (p=0.132) but this model assumes that the relation between a unit change of alcohol drinking behaviour (1 more beverage per week) and the hazard ratio is linear. This assumption may be violated.

ttp[,drinkWoman:=factor(cut(alcWoman,c(-1,0,10,100)),labels=c("no","moderate","high"))]
coxDrink2 <- coxph(Surv(ttp,status)~drinkWoman,data=ttp)
publish(coxDrink2,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
drinkWoman	no	Ref
	moderate	0.73	[0.55;0.95]	0.0214
	high	0.53	[0.32;0.89]	0.0163

The probability of pregnancy is significantly lower for women drinking moderately as compared to women not drinking (hazard ratio: 0.73, 95%-CI: [0.55;0.95], p-value: 0.021) and for women drinking excessively as compared to women not drinking (hazard ratio: 0.53, 95%-CI: [0.32;0.89], p-value: 0.016).

The comparison of excessively to moderate is obtained by using relevel:

ttp$drinkWoman2 <- relevel(ttp$drinkWoman,"moderate")
coxDrink3 <- coxph(Surv(ttp,status)~drinkWoman2,data=ttp)
publish(coxDrink3,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
drinkWoman2	moderate	Ref
	no	1.38	[1.05;1.81]	0.0214
	high	0.73	[0.45;1.19]	0.2121

The probability of pregnancy is estimated as being lower, but not statistically significantly lower, for women drinking excessively as compared to women drinking moderately (hazard ratio: 0.73, 95%-CI: [0.45;0.95], p-value: 0.21).

Adjusted analyses

coxDrink4 <- coxph(Surv(ttp,status)~drinkWoman+smokeWoman,data=ttp)
publish(coxDrink4,org=TRUE)

Variable	Units	HazardRatio	CI.95	p-value
drinkWoman	no	Ref
	moderate	0.76	[0.58;1.00]	0.0477
	high	0.55	[0.33;0.92]	0.0240
smokeWoman	No	Ref
	Yes	0.74	[0.55;0.98]	0.0381

After adjusting for smoking the results are similar as above.

coxDrink5 <- coxph(Surv(ttp,status)~drinkWoman+spermConc,data=ttp)
publish(coxDrink5,org=TRUE)

Variable	Units	Missing	HazardRatio	CI.95	p-value
drinkWoman	no	0	Ref
	moderate		0.68	[0.51;0.91]	0.00952
	high		0.47	[0.27;0.83]	0.00881
spermConc		113	1.00	[1.00;1.01]	< 0.001

After adjusting for sperm concentration the results become more significant and the hazard ratios for group comparisons are more extreme. The reasons are

sperm concentration is an important factor which when omitted leads to a systematic underestimation of effects. Like in logistic regression, this occurs even if the variable of interest (i.e. the drinking behavior of the woman) is independent of the factor (i.e. the sperm concentration).
sperm concentration is missing for 113 of the men. The default behavior of the software is to remove all the data of all couples where the sperm concentration is missing. This changes the data set quite a lot and there may be a selection bias.

Basic statistics exercises (survival analysis)

Table of Contents

Tutorial

Lung cancer data

Create a new status variable

Univariate Kaplan-Meier

stratified Kaplan-Meier

Univariate Cox regression

Age-adjusted Cox regression

Time to pregnancy data

Effect of smoking (KaplanMeier & logRank)

woman smoking

man smoking

Univariate Cox regression

Interaction

Effect of alcohol (Cox regression)

Continuous alcohol variable

Adjusted analyses