renvy code: R help request, how to compute the Nelson-Aalen estimator

Austin Rochford - Bayesian Survival Analysis in Python with pymc3
Survival analysis studies the distribution of the time to an event. Its applications span many fields across medicine, biology, engineering, and social science. This post shows how to fit and analyze a Bayesian survival model in Python using pymc3.
We illustrate these concepts by analyzing a mastectomy data set from R's HSAUR package.
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
from pymc3.distributions.timeseries import GaussianRandomWalk
import seaborn as sns
from statsmodels import datasets
from theano import tensor as T
Fortunately, statsmodels.datasets makes it quite easy to load a number of data sets from R.
df = datasets.get_rdataset('mastectomy', 'HSAUR', cache=True).data
df.event = df.event.astype(np.int64)
df.metastized = (df.metastized == 'yes').astype(np.int64)
n_patients = df.shape[0]
patients = np.arange(n_patients)
n_patients
Each row represents observations from a woman diagnosed with breast cancer who underwent a mastectomy. The column time represents the time (in months) post-surgery that the woman was observed. The column event indicates whether or not the woman died during the observation period. The column metastized represents whether the cancer had metastized prior to surgery.
This post analyzes the relationship between survival time post-mastectomy and whether or not the cancer had metastized.
A crash course in survival analysis
First we introduce a (very little) bit of theory. If the random variable \(T\) is the time to the event we are studying, survival analysis is primarily concerned with the survival function
\[S(t) = P(T > t) = 1 - F(t),\]
where \(F\) is the CDF of \(T\). It is mathematically convenient to express the survival function in terms of the hazard rate, \(\lambda(t)\). The hazard rate is the instantaneous probability that the event occurs at time \(t\) given that it has not yet occurred. That is,
\[\begin{align*}
\lambda(t)
& = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t\ |\ T > t)}{\Delta t} \\
& = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t)}{\Delta t \cdot P(T > t)} \\
& = -\frac{1}{S(t)} \cdot \lim_{\Delta t \to 0} \frac{S(t + \Delta t) - S(t)}{\Delta t}
= -\frac{S'(t)}{S(t)}.
\end{align*}\]
Solving this differential equation for the survival function shows that
\[S(t) = \exp\left(-\int_0^t \lambda(s)\ ds\right).\]
This representation of the survival function shows that the cumulative hazard function
\[\Lambda(t) = \int_0^t \lambda(s)\ ds\]
is an important quantity in survival analysis, since we may concisely write \(S(t) = \exp(-\Lambda(t)).\)
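As a quick numerical check of the relationship between the hazard rate, the cumulative hazard, and the survival function, the following sketch evaluates \(\Lambda(t)\) and \(S(t)\) for a piecewise-constant hazard. The hazard values and the interval length are made-up illustrative numbers, not estimates from the mastectomy data.

```python
import numpy as np

# Hypothetical piecewise-constant hazard: one value per 3-month interval.
interval_length = 3.0
hazard = np.array([0.02, 0.05, 0.01])

# Cumulative hazard at the right end of each interval:
# Lambda(t) is the integral of lambda(s) from 0 to t.
cum_hazard = np.cumsum(hazard * interval_length)

# Survival function at the same time points: S(t) = exp(-Lambda(t)).
survival = np.exp(-cum_hazard)

# Going the other way: lambda(t) = -d/dt log S(t), which for a
# piecewise-constant hazard is a difference quotient on each interval.
log_s = np.concatenate([[0.0], np.log(survival)])
recovered_hazard = -np.diff(log_s) / interval_length
```

Recovering the hazard from \(\log S(t)\) returns the values we started with, which is the differential equation above read in the other direction.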
An important, but subtle, point in survival analysis is censoring. Even though the quantity we are interested in estimating is the time between surgery and death, we do not observe the death of every subject. At the point in time that we perform our analysis, some of our subjects will thankfully still be alive. In the case of our mastectomy study, df.event is one if the subject’s death was observed (the observation is not censored) and is zero if the death was not observed (the observation is censored).
df.event.mean()
Just over 40% of our observations are censored. We visualize the observed durations and indicate which observations are censored below.
fig, ax = plt.subplots(figsize=(8, 6))
blue, _, red = sns.color_palette()[:3]
ax.hlines(patients[df.event.values == 0], 0, df[df.event.values == 0].time,
color=blue, label='Censored');
ax.hlines(patients[df.event.values == 1], 0, df[df.event.values == 1].time,
color=red, label='Uncensored');
ax.scatter(df[df.metastized.values == 1].time, patients[df.metastized.values == 1],
color='k', zorder=10, label='Metastized');
ax.set_xlim(left=0);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(-0.25, n_patients + 0.25);
ax.legend(loc='center right');
When an observation is censored (df.event is zero), df.time is not the subject’s survival time. All we can conclude from such a censored observation is that the subject’s true survival time exceeds df.time.
This is enough basic survival analysis theory for the purposes of this post; for a more extensive introduction, consult Aalen et al.
Bayesian proportional hazards model
The two most basic estimators in survival analysis are the Kaplan-Meier estimator of the survival function and the Nelson-Aalen estimator of the cumulative hazard function. However, since we want to understand the impact of metastization on survival time, a risk regression model is more appropriate. Perhaps the most commonly used risk regression model is Cox's proportional hazards model. In this model, if we have covariates \(\mathbf{x}\) and regression coefficients \(\beta\), the hazard rate is modeled as
\[\lambda(t) = \lambda_0(t) \exp(\mathbf{x} \beta).\]
Here \(\lambda_0(t)\) is the baseline hazard, which is independent of the covariates \(\mathbf{x}\). In this example, the covariates are the one-dimensional vector df.metastized.
Unlike in many regression situations, \(\mathbf{x}\) should not include a constant term corresponding to an intercept. If \(\mathbf{x}\) includes a constant term corresponding to an intercept, the model becomes unidentifiable. To illustrate this unidentifiability, suppose that
\[\lambda(t) = \lambda_0(t) \exp(\beta_0 + \mathbf{x} \beta) = \lambda_0(t) \exp(\beta_0) \exp(\mathbf{x} \beta).\]
If \(\tilde{\beta}_0 = \beta_0 + \delta\) and \(\tilde{\lambda}_0(t) = \lambda_0(t) \exp(-\delta)\), then \(\lambda(t) = \tilde{\lambda}_0(t) \exp(\tilde{\beta}_0 + \mathbf{x} \beta)\) as well, making the model with \(\beta_0\) unidentifiable.
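The unidentifiability argument can be checked numerically. The sketch below uses arbitrary, hypothetical values for the baseline hazard, the covariate, the coefficients, and the shift \(\delta\); it only illustrates that the two parameterizations produce the same hazard.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
baseline = np.array([0.02, 0.05, 0.01])  # lambda_0(t) on three intervals
x, beta, beta0, delta = 1.0, 0.4, 0.3, 1.7

# Hazard under the original parameters (lambda_0, beta_0, beta).
hazard = baseline * np.exp(beta0 + x * beta)

# Shift the intercept by delta and rescale the baseline hazard by exp(-delta).
shifted_baseline = baseline * np.exp(-delta)
shifted_hazard = shifted_baseline * np.exp((beta0 + delta) + x * beta)
```

Because `hazard` and `shifted_hazard` agree for any choice of `delta`, the data cannot distinguish between the two parameter sets.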
In order to perform Bayesian inference with the Cox model, we must specify priors on \(\beta\) and \(\lambda_0(t)\). We place a normal prior on \(\beta\), \(\beta \sim N(\mu_{\beta}, \sigma_{\beta}^2),\) where \(\mu_{\beta} \sim N(0, 10^2)\) and \(\sigma_{\beta} \sim U(0, 10)\).
A suitable prior on \(\lambda_0(t)\) is less obvious. We choose a semiparametric prior, where \(\lambda_0(t)\) is a piecewise constant function. This prior requires us to partition the time range in question into intervals with endpoints \(0 \leq s_1 < s_2 < \cdots < s_N\). With this partition, \(\lambda_0 (t) = \lambda_j\) if \(s_j \leq t < s_{j + 1}\). With \(\lambda_0(t)\) constrained to have this form, all we need to do is choose priors for the \(N - 1\) values \(\lambda_j\). We use independent vague priors \(\lambda_j \sim \operatorname{Gamma}(10^{-2}, 10^{-2}).\) For our mastectomy example, we make each interval three months long.
interval_length = 3
interval_bounds = np.arange(0, df.time.max() + interval_length + 1, interval_length)
n_intervals = interval_bounds.size - 1
intervals = np.arange(n_intervals)
We see how deaths and censored observations are distributed in these intervals.
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(df[df.event == 1].time.values, bins=interval_bounds,
color=red, alpha=0.5, lw=0,
label='Uncensored');
ax.hist(df[df.event == 0].time.values, bins=interval_bounds,
color=blue, alpha=0.5, lw=0,
label='Censored');
ax.set_xlim(0, interval_bounds[-1]);
ax.set_xlabel('Months since mastectomy');
ax.set_yticks([0, 1, 2, 3]);
ax.set_ylabel('Number of observations');
ax.legend();
With the prior distributions on \(\beta\) and \(\lambda_0(t)\) chosen, we now show how the model may be fit using MCMC simulation with pymc3. The key observation is that the piecewise-constant proportional hazard model is closely related to a Poisson regression model. (The models are not identical, but their likelihoods differ by a factor that depends only on the observed data and not the parameters \(\beta\) and \(\lambda_j\). For details, see Germán Rodríguez’s WWS 509 course notes.)
We define indicator variables based on whether the \(i\)-th subject died in the \(j\)-th interval,
\[d_{i, j} = \begin{cases}
1 & \textrm{if subject } i \textrm{ died in interval } j \\
0 & \textrm{otherwise}
\end{cases}.\]
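# interval containing each subject's last observed time; cast to int so it can be used as an array index
last_period = np.floor((df.time - 0.01) / interval_length).astype(int)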
death = np.zeros((n_patients, n_intervals))
death[patients, last_period] = df.event
We also define \(t_{i, j}\) to be the amount of time the \(i\)-th subject was at risk in the \(j\)-th interval.
exposure = np.greater_equal.outer(df.time, interval_bounds[:-1]) * interval_length
exposure[patients, last_period] = df.time - interval_bounds[last_period]
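To make the two matrices concrete, here is a small sketch that applies the same construction to three made-up subjects. The times and event indicators are hypothetical and deliberately chosen not to fall exactly on an interval boundary.

```python
import numpy as np

# Toy data (hypothetical): observed times in months and event indicators.
time = np.array([2.0, 7.0, 10.0])
event = np.array([1, 0, 1])

interval_length = 3
interval_bounds = np.arange(0, time.max() + interval_length + 1, interval_length)
n_intervals = interval_bounds.size - 1
subjects = np.arange(time.size)

# Index of the interval containing each subject's last observed time.
last_period = np.floor((time - 0.01) / interval_length).astype(int)

# death[i, j] = 1 if subject i died in interval j, 0 otherwise.
death = np.zeros((time.size, n_intervals))
death[subjects, last_period] = event

# exposure[i, j] = time subject i spent at risk in interval j:
# a full interval for every interval entered, then a partial last interval.
exposure = np.greater_equal.outer(time, interval_bounds[:-1]) * float(interval_length)
exposure[subjects, last_period] = time - interval_bounds[last_period]
```

For these toy inputs each row of `exposure` sums to the subject's observed time, and `death` has a single one for each subject whose death was observed.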
Finally, denote the risk incurred by the \(i\)-th subject in the \(j\)-th interval as \(\lambda_{i, j} = \lambda_j \exp(\mathbf{x}_i \beta)\).
We may approximate \(d_{i, j}\) with a Poisson random variable with mean \(t_{i, j}\ \lambda_{i, j}\). This approximation leads to the following pymc3 model.
SEED = 5078864 # from random.org
with pm.Model() as model:
    # independent vague priors on the baseline hazard in each interval
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)
    sigma = pm.Uniform('sigma', 0., 10.)
    tau = pm.Deterministic('tau', sigma**-2)
    # precisions are passed by keyword so they are not mistaken for standard deviations
    mu_beta = pm.Normal('mu_beta', 0., tau=10**-2)
    beta = pm.Normal('beta', mu_beta, tau=tau)
    # lambda_[i, j] = lambda0[j] * exp(x_i * beta)
    lambda_ = pm.Deterministic('lambda_', T.outer(T.exp(beta * df.metastized), lambda0))
    mu = pm.Deterministic('mu', exposure * lambda_)
    obs = pm.Poisson('obs', mu, observed=death)
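The relationship between the piecewise-constant hazard likelihood and the Poisson likelihood can be checked on a toy example. The sketch below is not part of the model; it uses made-up `death`, `exposure`, and covariate arrays and two arbitrary parameter settings, and compares the two log-likelihoods directly with numpy.

```python
import numpy as np

# Hypothetical toy inputs: 2 subjects, 3 intervals.
death = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])
exposure = np.array([[3.0, 1.5, 0.0],
                     [3.0, 3.0, 2.0]])
x = np.array([1.0, 0.0])


def hazards(lambda0, beta):
    # lambda_{i,j} = lambda_j * exp(x_i * beta)
    return np.outer(np.exp(x * beta), lambda0)


def piecewise_exp_loglik(lambda0, beta):
    # sum over i, j of d_ij * log(lambda_ij) - t_ij * lambda_ij
    lam = hazards(lambda0, beta)
    return np.sum(death * np.log(lam) - exposure * lam)


def poisson_loglik(lambda0, beta):
    # Poisson log-pmf with mean mu_ij = t_ij * lambda_ij; log(d!) = 0
    # because every d_ij is 0 or 1. Cells with zero exposure contribute 0.
    mu = exposure * hazards(lambda0, beta)
    at_risk = exposure > 0
    return np.sum(death[at_risk] * np.log(mu[at_risk]) - mu[at_risk])


params_a = (np.array([0.02, 0.05, 0.01]), 0.4)
params_b = (np.array([0.10, 0.01, 0.03]), -1.2)
gap_a = poisson_loglik(*params_a) - piecewise_exp_loglik(*params_a)
gap_b = poisson_loglik(*params_b) - piecewise_exp_loglik(*params_b)
```

The gap between the two log-likelihoods is the same for both parameter settings (here \(\log 1.5\), the log exposure in the one interval where a death occurred), so it depends only on the data.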
We now sample from the model.
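n_samples = 40000
burn = 20000
thin = 20 # thinning interval used when slicing the traces; this value is an assumption, since the assignment is missing from this copy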
with model:
    step = pm.Metropolis()
    trace_ = pm.sample(n_samples, step, random_seed=SEED)
trace = trace_[burn::thin]
We see that the hazard rate for subjects whose cancer has metastized is about one and a half times the rate of those whose cancer has not metastized.
np.exp(trace['beta'].mean())
pm.traceplot(trace, vars=['beta']);
pm.autocorrplot(trace, vars=['beta']);
We now examine the effect of metastization on both the cumulative hazard and on the survival function.
base_hazard = trace['lambda0']
met_hazard = trace['lambda0'] * np.exp(np.atleast_2d(trace['beta']).T)
def cum_hazard(hazard):
    return (interval_length * hazard).cumsum(axis=-1)
def survival(hazard):
    return np.exp(-cum_hazard(hazard))
def plot_with_hpd(x, hazard, f, ax, color=None, label=None, alpha=0.05):
    mean = f(hazard.mean(axis=0))
    percentiles = 100 * np.array([alpha / 2., 1. - alpha / 2.])
    hpd = np.percentile(f(hazard), percentiles, axis=0)
    ax.fill_between(x, hpd[0], hpd[1], color=color, alpha=0.25)
    ax.step(x, mean, color=color, label=label);
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))
plot_with_hpd(interval_bounds[:-1], base_hazard, cum_hazard,
hazard_ax, color=blue, label='Had not metastized')
plot_with_hpd(interval_bounds[:-1], met_hazard, cum_hazard,
hazard_ax, color=red, label='Metastized')
hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);
plot_with_hpd(interval_bounds[:-1], base_hazard, survival,
surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], met_hazard, survival,
surv_ax, color=red)
surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');
fig.suptitle('Bayesian survival model');
We see that the cumulative hazard for metastized subjects increases more rapidly initially (through about seventy months), after which it increases roughly in parallel with the baseline cumulative hazard.
These plots also show the pointwise 95% highest posterior density interval for each function (approximated in the code above by the 2.5th and 97.5th posterior percentiles). One of the distinct advantages of the Bayesian model fit with pymc3 is the inherent quantification of uncertainty in our estimates.
Another of the advantages of the model we have built is its flexibility. From the plots above, we may reasonably believe that the additional hazard due to metastization varies over time; it seems plausible that cancer that has metastized increases the hazard rate immediately after the mastectomy, but that the risk due to metastization decreases over time. We can accommodate this mechanism in our model by allowing the regression coefficients to vary over time. In the time-varying coefficient model, if \(s_j \leq t < s_{j + 1}\), we let \(\lambda(t) = \lambda_j \exp(\mathbf{x} \beta_j).\) The sequence of regression coefficients \(\beta_1, \beta_2, \ldots, \beta_{N - 1}\) forms a normal random walk with \(\beta_1 \sim N(0, 1)\), \(\beta_j\ |\ \beta_{j - 1} \sim N(\beta_{j - 1}, 1)\).
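To get a feel for this prior, one draw from such a random walk can be simulated with numpy alone. This is only a sketch of the prior on the coefficients; the seed and the number of coefficients are arbitrary choices, not values from the model.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n_coefficients = 10             # hypothetical number of intervals

# beta_1 ~ N(0, 1) and beta_j | beta_{j-1} ~ N(beta_{j-1}, 1), so the walk
# is the cumulative sum of independent standard normal increments.
increments = rng.normal(loc=0.0, scale=1.0, size=n_coefficients)
beta_walk = np.cumsum(increments)
```

Neighbouring coefficients in `beta_walk` differ by a single standard normal step, so the prior favours effects that drift gradually rather than jump between unrelated values.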
We implement this model in pymc3 as follows.
with pm.Model() as time_varying_model:
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)
    beta = GaussianRandomWalk('beta', tau=1., shape=n_intervals)
    lambda_ = pm.Deterministic('h', lambda0 * T.exp(T.outer(T.constant(df.metastized), beta)))
    mu = pm.Deterministic('mu', exposure * lambda_)
    obs = pm.Poisson('obs', mu, observed=death)
We proceed to sample from this model.
with time_varying_model:
    step = pm.Metropolis()
    time_varying_trace_ = pm.sample(n_samples, step, random_seed=SEED)
time_varying_trace = time_varying_trace_[burn::thin]
We see from the plot of \(\beta_j\) over time below that initially \(\beta_j > 0\), indicating an elevated hazard rate due to metastization, but that this risk declines as \(\beta_j < 0\) eventually.
fig, ax = plt.subplots(figsize=(8, 6))
beta_hpd = np.percentile(time_varying_trace['beta'], [2.5, 97.5], axis=0)
beta_low = beta_hpd[0]
beta_high = beta_hpd[1]
ax.fill_between(interval_bounds[:-1], beta_low, beta_high,
color=blue, alpha=0.25);
beta_hat = time_varying_trace['beta'].mean(axis=0)
ax.step(interval_bounds[:-1], beta_hat, color=blue);
ax.scatter(interval_bounds[last_period[(df.event.values == 1) & (df.metastized == 1)]],
beta_hat[last_period[(df.event.values == 1) & (df.metastized == 1)]],
c=red, zorder=10, label='Died, cancer metastized');
ax.scatter(interval_bounds[last_period[(df.event.values == 0) & (df.metastized == 1)]],
beta_hat[last_period[(df.event.values == 0) & (df.metastized == 1)]],
c=blue, zorder=10, label='Censored, cancer metastized');
ax.set_xlim(0, df.time.max());
ax.set_xlabel('Months since mastectomy');
ax.set_ylabel(r'$\beta_j$');
ax.legend();
The coefficients \(\beta_j\) begin declining rapidly around one hundred months post-mastectomy, which seems reasonable, given that only three of the twelve subjects whose cancer had metastized and who lived past this point died during the study.
The change in our estimate of the cumulative hazard and survival functions due to time-varying effects is also quite apparent in the following plots.
tv_base_hazard = time_varying_trace['lambda0']
tv_met_hazard = time_varying_trace['lambda0'] * np.exp(np.atleast_2d(time_varying_trace['beta']))
fig, ax = plt.subplots(figsize=(8, 6))
ax.step(interval_bounds[:-1], cum_hazard(base_hazard.mean(axis=0)),
color=blue, label='Had not metastized');
ax.step(interval_bounds[:-1], cum_hazard(met_hazard.mean(axis=0)),
color=red, label='Metastized');
ax.step(interval_bounds[:-1], cum_hazard(tv_base_hazard.mean(axis=0)),
color=blue, linestyle='--', label='Had not metastized (time varying effect)');
ax.step(interval_bounds[:-1], cum_hazard(tv_met_hazard.mean(axis=0)),
color=red, linestyle='--', label='Metastized (time varying effect)');
ax.set_xlim(0, df.time.max() - 4);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(0, 2);
ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
ax.legend(loc=2);
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))
plot_with_hpd(interval_bounds[:-1], tv_base_hazard, cum_hazard,
hazard_ax, color=blue, label='Had not metastized')
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, cum_hazard,
hazard_ax, color=red, label='Metastized')
hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylim(0, 2);
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);
plot_with_hpd(interval_bounds[:-1], tv_base_hazard, survival,
surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, survival,
surv_ax, color=red)
surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');
fig.suptitle('Bayesian survival model with time varying effects');
We have really only scratched the surface of both survival analysis and the Bayesian approach to survival analysis. More information on Bayesian survival analysis is available in Ibrahim et al. (For example, we may want to account for individual frailty in either our original or time-varying models.)
This post is available as an IPython notebook.
SAS Seminar: Introduction to Survival Analysis in SAS (page 2)
[Table: Product-Limit Survival Estimates. Columns: LENFOL, Number at Risk, Observed Events, Survival, Failure, Survival Standard Error, Number Failed, Number Left. Most cells were lost in extraction; the rows below are the ones that can still be read.]
LENFOL 1.00 (last of 8 rows): Observed Events 8, Survival 0.9840, Failure 0.0160, Survival Standard Error 0.00561, Number Failed 8, Number Left 492
LENFOL 2.00 (last row): Failure 0.0320, Survival Standard Error 0.00787
LENFOL 3.00 (last row): Number at Risk 484, Observed Events 3, Survival 0.9620, Failure 0.0380, Survival Standard Error 0.00855, Number Failed 19, Number Left 481
Above we see the table of Kaplan-Meier estimates of the survival function produced by proc lifetest. Each row of the table corresponds to an interval of time, beginning at the time in the LENFOL column for that row and ending just before the next row that has a different LENFOL value. In the first interval we see that we had 500 people at risk and that no one died [...]. During the next interval, spanning from 1 day to just before 2 days, 8 people died, indicated by 8 rows of [...] where [...]. The probabilities in the Survival column are unconditional, and are to be interpreted as the probability of surviving from the beginning of follow-up time up to the number of days in the LENFOL column.
Let's take a look at later survival times in the table:
[Table: Product-Limit Survival Estimates, later rows, same columns as above. Rows whose LENFOL value is marked with * are censored observations. Recoverable values: the row at LENFOL 359.00 shows Failure 0.2760 and Survival Standard Error 0.0200; the last row shows Number at Risk 354, Observed Events 1, Survival 0.7199, Failure 0.2801, Survival Standard Error 0.0201, Number Failed 140, Number Left 353.]
From [...] indicated by the [...] they drop out of the study, but are not counted as a failure. We can see this reflected in the survival function estimate for [...]. During the interval [382,385), 1 out of 355 subjects at risk died, yielding a conditional probability of failure (the probability of failure in the given interval, given that the subject has survived up to the beginning of the interval) in this interval of \(1/355\), and hence a conditional probability of survival of \((355 - 1)/355 = 0.9972\). We see that the unconditional probability of surviving beyond 382 days is 0.7220. Since \(\hat{S}(382) = 0.7220 = p(\text{surviving up to 382 days}) \times 0.9971831\), we can solve for \(p(\text{surviving up to 382 days}) = 0.7220 / 0.9972 = 0.7240\). In the table above, we see that the probability of surviving beyond 363 days is 0.7240, the same probability as what we calculated for surviving up to 382 days, which implies that the censored observations do not change the survival estimates when they leave the study, only the number at risk.
3.1.3. Graphing the Kaplan-Meier estimate
Graphs of the Kaplan-Meier estimate of the survival function allow us to see how the survival function changes over time and are fortunately very easy to generate in SAS:
By default, proc lifetest graphs the Kaplan-Meier estimate, even without the plot= option on the proc lifetest statement, so we could have used the same code from above that produced the table of Kaplan-Meier estimates to generate the graph.
However, we would like to add confidence bands and the number at risk to the graph, so we add plots=survival(cb).
http://www.ats.ucla.edu/stat/sas/seminars/sas_survival/default.htm
proc lifetest data=whas500 atrisk plots=survival(cb) outs=outwhas500;
time lenfol*fstat(0);
run;
The step function form of the survival function is apparent in the graph of the Kaplan-Meier estimate. When a subject dies at a particular time point, the step function drops, whereas in between failure times the graph remains flat. The survival function drops most steeply at the beginning of the study, suggesting that the hazard rate is highest immediately after hospitalization during the first 200 days. Censored observations are represented by vertical ticks on the graph. Notice the survival probability does not change when we encounter a censored observation. Because the observation with the longest follow-up is censored, the survival function will not reach 0. Instead, the survival function will remain at the survival probability estimated at the previous interval.
The survival function is undefined past this final interval at 2358 days. The blue-shaded area around the survival curve represents the 95% confidence band, here Hall-Wellner confidence bands. This confidence band is calculated for the entire survival function, and at any given interval must be wider than the pointwise confidence interval (the confidence interval around a single interval) to ensure that 95% of all pointwise confidence intervals are contained within this band. Many transformations of the survivor function are available for alternate ways of calculating confidence intervals through the conftype option, though most transformations should yield very similar confidence intervals.
3.2. Nelson-Aalen estimator of the cumulative hazard function
Because of its simple relationship with the survival function, \(S(t) = e^{-H(t)}\), the cumulative hazard function can be used to estimate the survival function. The Nelson-Aalen estimator is a non-parametric estimator of the cumulative hazard function and is given by:
\[\hat{H}(t) = \sum_{t_i \leq t} \frac{d_i}{n_i},\]
where \(d_i\) is the number who failed out of \(n_i\) at risk in interval \(t_i\). The estimator is calculated, then, by summing the proportion of those at risk who failed in each interval up to time \(t\).
The Nelson-Aalen estimator is requested in SAS through the nelson option on the proc lifetest statement. SAS will output both Kaplan-Meier estimates of the survival function and Nelson-Aalen estimates of the cumulative hazard function in one table.
proc lifetest data=whas500 atrisk nelson;
time lenfol*fstat(0);
run;
[Table: Survival Function and Cumulative Hazard Rate. Columns: LENFOL, Number at Risk, Observed Events, Product-Limit Survival, Failure, Survival Standard Error, Nelson-Aalen Cumulative Hazard, Cum Haz Standard Error, Number Failed, Number Left. Most cells were lost in extraction; the rows below are the ones that can still be read.]
LENFOL 1.00 (last of 8 rows): Observed Events 8, Survival 0.9840, Failure 0.0160, Survival Standard Error 0.00561, Cumulative Hazard 0.0160, Cum Haz Standard Error 0.00566, Number Failed 8, Number Left 492
LENFOL 2.00 (last row): Failure 0.0320, Survival Standard Error 0.00787, Cumulative Hazard 0.0323, Cum Haz Standard Error 0.00807
LENFOL 3.00 (last row): Number at Risk 484, Observed Events 3, Survival 0.9620, Failure 0.0380, Survival Standard Error 0.00855, Cumulative Hazard 0.0385, Cum Haz Standard Error 0.00882, Number Failed 19, Number Left 481
Let's confirm our understanding of the calculation of the Nelson-Aalen estimator by calculating the estimated cumulative hazard at day 3: \(\hat{H}(3) = \frac{8}{500} + \frac{8}{492} + \frac{3}{484} = 0.0385\), which matches the value in the table. The interpretation of this estimate is that we expect 0.0385 failures (per person) by the end of 3 days. The estimate of survival beyond 3 days based off this Nelson-Aalen estimate of the cumulative hazard would then be \(\hat{S}(3) = \exp(-0.0385) \approx 0.9622\). This matches closely with the Kaplan-Meier product-limit estimate of survival beyond 3 days of 0.9620. One can request that SAS estimate the survival function by exponentiating the negative of the Nelson-Aalen estimator, also known as the Breslow estimator, rather than by the Kaplan-Meier estimator through the method=breslow option on the proc lifetest statement. In very large samples the Kaplan-Meier estimator and the transformed Nelson-Aalen (Breslow) estimator will converge.
3.3. Calculating median, mean, and other survival times of interest in proc lifetest
Researchers are often interested in estimates of survival time at which 50% or 25% of the population have died or failed. Because of the positive skew often seen with follow-up times, medians are often a better indicator of an [...] estimates of the mean survival time by default from proc lifetest. We see that beyond 1,671 days, 50% of the population is expected to have failed. Notice that the interval during which the first 25% of the population is expected to fail, [0,297), is much shorter than the interval during which the second 25% of the population is expected to fail, [297,1671).
This reinforces our suspicion that the hazard of failure is greater during the beginning of follow-up time.
proc lifetest data=whas500;
time lenfol*fstat(0);
run;
[Table: Quartile Estimates. Columns: Percent, Point Estimate, Transform, 95% Confidence Interval [Lower, Upper). The confidence limits were garbled in extraction.]
Percent 75: Point Estimate 2353.00, Transform LOGLOG
Percent 50: Point Estimate 1627.00, Transform LOGLOG
Percent 25: Point Estimate 296.00, Transform LOGLOG
Mean 1417.21, Standard Error 48.14
3.4. Comparing survival functions using nonparametric tests
Suppose that you suspect that the survival function is not the same among some of the groups in your study (some groups tend to fail more quickly than others). One can also use non-parametric methods to test for equality of the survival function among groups in the following manner:
When provided with a grouping variable in a strata statement in proc lifetest, SAS will produce graphs of the survival function (unless other graphs are requested) stratified by the grouping variable as well as tests of equality of the survival function across strata. For example, we could enter the class (categorical) variable gender on the strata statement to request that SAS compare the survival experiences of males and females.
proc lifetest data=whas500 atrisk plots=survival(atrisk cb) outs=outwhas500;
strata gender;
time lenfol*fstat(0);
run;
[Table: Test of Equality over Strata. Columns: Test, Chi-Square, DF, Pr > Chi-Square. Rows: Log-Rank, Wilcoxon, -2Log(LR). The test statistics and p-values were not recoverable from the extraction.]
In the graph of the Kaplan-Meier estimator stratified by gender below, it appears that females generally have a worse survival experience. This is reinforced by the three significant tests of equality.
3.4.1. Background: Tests of equality of the survival function
In the output we find three chi-square based tests of the equality of the survival function over strata, which support our suspicion that survival differs between genders.
The calculation of the statistic for the nonparametric log-rank and Wilcoxon tests is given by:
\[Q = \frac{\left[\sum_{j=1}^{m} w_j (d_{ij} - \hat{e}_{ij})\right]^2}{\sum_{j=1}^{m} w_j^2 \hat{v}_{ij}},\]
where the sums run over the time points \(t_j\), \(d_{ij}\) is the observed number of failures in stratum \(i\) at time \(t_j\), \(\hat{e}_{ij}\) is the expected number of failures in stratum \(i\) at time \(t_j\), \(\hat{v}_{ij}\) is the estimator of the variance of \(d_{ij}\), and \(w_j\) is the weight of the difference at time \(t_j\) (see Hosmer and Lemeshow (2008) for formulas for \(\hat{e}_{ij}\) and \(\hat{v}_{ij}\)). In a nutshell, these statistics sum the weighted differences between the observed number of failures and the expected number of failures for each stratum at each timepoint, assuming the same survival function of each stratum. In other words, if all strata have the same survival function, then we expect the same proportion to die in each interval. If these proportions systematically differ among strata across time, then the \(Q\) statistic will be large and the null hypothesis of no difference among strata is more likely to be rejected.
The log-rank and Wilcoxon tests in the output table differ in the weights \(w_j\) used. The log-rank or Mantel-Haenszel test uses \(w_j = 1\), so differences at all time intervals are weighted equally. The Wilcoxon test uses \(w_j = n_j\), so that differences are weighted by the number at risk at time \(t_j\), thus giving more weight to differences that occur earlier in follow-up time. Other nonparametric tests using other weighting schemes are available through the test= option on the strata statement. The [...] further discussed in this nonparametric section.
3.5. Nonparametric estimation of the hazard function
Standard nonparametric techniques do not typically estimate the hazard function directly. However, we can still get an idea of the hazard rate using a graph of the kernel-smoothed estimate. As the hazard function \(h(t)\) is the derivative of the cumulative hazard function \(H(t)\), we can roughly estimate the rate of change in \(H(t)\) by taking successive differences in \(\hat{H}(t)\) between adjacent time points, \(\Delta \hat{H}(t) = \hat{H}(t_j) - \hat{H}(t_{j-1})\). SAS computes differences in the Nelson-Aalen estimate of \(H(t)\). We generally expect the hazard rate to change smoothly (if it changes) over time, rather than jump around haphazardly. To accomplish this smoothing, the hazard function estimate at any time interval is a weighted average of differences within a window of time that includes many differences, known as the bandwidth. Widening the bandwidth smooths the function by averaging more differences together. However, widening will also mask changes in the hazard function as local changes in the hazard function are drowned out by the larger number of values that are being averaged together. Below is an example of obtaining a kernel-smoothed estimate of the hazard function across BMI strata with a bandwidth of 200 days:
We request plots of the hazard function with a bandwidth of 200 days with plot=hazard(bw=200).
SAS conveniently allows the creation of strata from a continuous variable, such as bmi, on the fly with the strata statement. We specify the left endpoints of each bmi interval to form 5 bmi categories: 15-18.5, 18.5-25, 25-30, 30-40, and >40.
proc lifetest data=whas500 atrisk plots=hazard(bw=200) outs=outwhas500;
strata bmi(15,18.5,25,30,40);
time lenfol*fstat(0);
run;
The lines in the graph are labeled by the midpoint bmi in each group. From the plot we can see that the hazard function indeed appears higher at the beginning of follow-up time and then decreases until it levels off at around 500 days and stays low and mostly constant. The hazard function is also generally higher for the two lowest BMI categories. The sudden upticks at the end of follow-up time are not to be trusted, as they are likely due to the small number of subjects at risk at the end. The red curve representing the lowest BMI category is truncated on the right because the last person in that group died long before the end of follow-up time.
4. Background: The Cox proportional hazards regression model
4.1. Background: Estimating the hazard function, h(t)
Whereas with non-parametric methods we are typically studying the survival function, with regression methods we examine the hazard function, \(h(t)\). The hazard function for a particular time interval gives the probability that the subject will fail in that interval, given that the subject has not failed up to that point in time. The hazard rate can also be interpreted as the rate at which failures occur at that point in time, or the rate at which risk is accumulated, an interpretation that coincides with the fact that the hazard rate is the derivative of the cumulative hazard function, \(H(t)\).
In regression models for survival analysis, we attempt to estimate parameters which describe the relationship between our predictors and the hazard rate. We would like to allow parameters, the \(\beta\)s, to take on any value, while still preserving the non-negative nature of the hazard rate. A common way to address both issues is to parameterize the hazard function as:
\[h(t|x) = \exp(\beta_0 + \beta_1 x).\]
In this parameterization, \(h(t|x)\) is constrained to be strictly positive, as the exponential function always evaluates to positive, while \(\beta_0\) and \(\beta_1\) are allowed to take on any value. Notice, however, that \(t\) does not appear in the formula for the hazard function, thus implying that in this parameterization, we do not model the hazard rate's dependence on time. A complete description of the hazard rate's relationship with time would require that the functional form of this relationship be parameterized somehow (for example, one could assume that the hazard rate has an exponential relationship with time). However, in many settings, we are much less interested in modeling the hazard rate's relationship with time and are more interested in its dependence on other variables, such as experimental treatment or age.
For studies in which covariate effects are of more interest than the hazard rate's dependence on time, a semi-parametric model, in which we estimate regression parameters as covariate effects but leave the dependence on time unspecified, is appropriate.

Source: SAS Seminar: Introduction to Survival Analysis in SAS, http://www.ats.ucla.edu/stat/sas/seminars/sas_survival/default.htm

4.2. Background: The Cox proportional hazards model

We can remove the dependence of the hazard rate on time by expressing the hazard rate as a product of \(h_0(t)\), a baseline hazard rate which describes the hazard rate's dependence on time alone, and \(r(x, \beta_x)\), which describes the hazard rate's dependence on the other \(x\) covariates:

\[h(t) = h_0(t)\, r(x, \beta_x).\]

In this parameterization, \(h(t)\) will equal \(h_0(t)\) when \(r(x, \beta_x) = 1\). It is intuitively appealing to let \(r(x, \beta_x) = 1\) when all \(x = 0\), thus making the baseline hazard rate, \(h_0(t)\), equivalent to a regression intercept. Above, we discussed that expressing the hazard rate's dependence on its covariates as an exponential function conveniently allows the regression coefficients to take on any value while still constraining the hazard rate to be positive. The exponential function is also equal to 1 when its argument is equal to 0. We will thus let \(r(x, \beta_x) = \exp(x\beta_x)\), and the hazard function will be given by:

\[h(t) = h_0(t) \exp(x\beta_x).\]

This parameterization forms the Cox proportional hazards model. It is called the proportional hazards model because the ratio of hazard rates between two groups with fixed covariates will stay constant over time in this model. For example, the hazard rate at time \(t\) when \(x = x_1\) would be \(h(t|x_1) = h_0(t)\exp(x_1\beta_x)\), and at time \(t\) when \(x = x_2\) would be \(h(t|x_2) = h_0(t)\exp(x_2\beta_x)\). The covariate effect of \(x\), then, is the ratio between these two hazard rates, or a hazard ratio (HR):

\[HR = \frac{h(t|x_2)}{h(t|x_1)} = \frac{h_0(t)\exp(x_2\beta_x)}{h_0(t)\exp(x_1\beta_x)}\]

Notice that the baseline hazard rate, \(h_0(t)\), cancels out, so the hazard ratio does not depend on time \(t\):

\[HR = \exp(\beta_x(x_2 - x_1))\]

The hazard ratio HR will thus stay constant over time with fixed covariates. Because of this parameterization, covariate effects are multiplicative rather than additive and are expressed as hazard ratios, rather than hazard differences. As we see above, one of the great advantages of the Cox model is that estimating predictor effects does not depend on making assumptions about the form of the baseline hazard function, \(h_0(t)\), which can be left unspecified. Instead, we need only assume that whatever the baseline hazard function is, covariate effects multiplicatively shift the hazard function and these multiplicative shifts are constant over time.

Cox models are typically fitted by maximum likelihood methods, which estimate the regression parameters that maximize the probability of observing the given set of survival times. So what is the probability of observing subject \(j\) fail at time \(t_j\)? At the beginning of a given time interval \(t_j\), let \(R_j\) denote the set of subjects still at risk, each subject \(i\) in \(R_j\) having its own hazard rate:

\[h(t_j|x_i) = h_0(t_j)\exp(x_i\beta)\]

The probability of observing subject \(j\) fail out of all remaining at-risk subjects in \(R_j\), then, is the proportion of the sum total of the hazard rates of all subjects in \(R_j\) that is made up by subject \(j\)'s hazard rate. For example, if there were three subjects still at risk at time \(t_j\), the probability of observing subject 2 fail at time \(t_j\) would be:

\[\Pr(\text{subject}=2 \mid \text{failure}=t_j) = \frac{h(t_j|x_2)}{h(t_j|x_1) + h(t_j|x_2) + h(t_j|x_3)}\]

All of those hazard rates are based on the same baseline hazard rate \(h_0(t_j)\), so we can simplify the above expression to:

\[\Pr(\text{subject}=2 \mid \text{failure}=t_j) = \frac{\exp(x_2\beta)}{\exp(x_1\beta) + \exp(x_2\beta) + \exp(x_3\beta)}\]

We can similarly calculate the joint probability of observing each of the \(n\) subjects' failure times, or the likelihood of the failure times, as a function of the regression parameters, \(\beta\), given the subjects' covariate values \(x_j\):

\[L(\beta) = \prod_{j=1}^{n} \left\{ \frac{\exp(x_j\beta)}{\sum_{i \in R_j} \exp(x_i\beta)} \right\}\]

where \(R_j\) is the set of subjects still at risk at time \(t_j\).
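The hazard ratio, the three-subject failure probability and the likelihood above can be written out directly. This is a minimal sketch, assuming a single covariate, no tied failure times and, as in the formula, that every subject is observed to fail; the function names and all numeric values are hypothetical.

```python
import math


def hazard_ratio(beta, x1, x2):
    """HR = exp(beta * (x2 - x1)); the baseline hazard has cancelled out."""
    return math.exp(beta * (x2 - x1))


def failure_probability(beta, x_at_risk, failing_index):
    """Probability that the subject at failing_index is the one who fails,
    given that exactly one of the at-risk subjects fails."""
    weights = [math.exp(beta * x) for x in x_at_risk]
    return weights[failing_index] / sum(weights)


def log_partial_likelihood(beta, times, x):
    """Log of L(beta), assuming no censoring and no tied failure times."""
    total = 0.0
    for j, t_j in enumerate(times):
        # Risk set R_j: subjects who have not failed before t_j.
        risk_set = [i for i, t_i in enumerate(times) if t_i >= t_j]
        denominator = sum(math.exp(beta * x[i]) for i in risk_set)
        total += beta * x[j] - math.log(denominator)
    return total


# Hypothetical coefficient, covariate values and failure times.
beta = 0.7
x = [0.0, 1.0, 2.0]
times = [5.0, 2.0, 9.0]

print("HR for a one-unit increase in x:", hazard_ratio(beta, 0.0, 1.0))
print("Pr(subject 2 fails first):", failure_probability(beta, x, 1))
print("log L(beta):", log_partial_likelihood(beta, times, x))
```

Working on the log scale turns the product into a sum, which is the usual way to keep the computation numerically stable when there are many subjects.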
Maximum likelihood methods attempt to find the \(\beta\) values that maximize this likelihood, that is, the regression parameters that yield the maximum joint probability of observing the set of failure times with the associated set of covariate values. Because this likelihood ignores any assumptions made about the baseline hazard function, it is actually a partial likelihood, not a full likelihood, but the resulting estimates \(\hat{\beta}\) have the same distributional properties as those derived from the full likelihood.

5. Cox proportional hazards regression in SAS using proc phreg

5.1. Fitting a simple Cox regression model

We request Cox regression through proc phreg in SAS. Previously, we graphed the survival functions of males and females in the WHAS500 dataset and suspected that the survival experience after heart attack may be different between the two genders. Perhaps you also suspect that the hazard rate changes with age as well. Below we demonstrate a simple model in proc phreg, where we determine the effects of a categorical predictor, gender, and a continuous predictor, age, on the hazard rate:

To specify that gender is a categorical predictor, we enter it on the class statement.

We also would like survival curves based on our model, so we add plots=survival to the proc phreg statement, although as we shall see this specification is probably insufficient for what we want.

On the model statement, on the left side of the equation, we provide the follow-up time variable, lenfol, and the censoring variable, fstat, with all censoring values listed in parentheses. On the right side of the equation we list all the predictors.
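As a rough Python analogue of the model described in this section, the sketch below maximizes the Cox partial likelihood by Newton-Raphson for one binary and one continuous predictor. It is not the SAS code and does not use the WHAS500 data: the data are synthetic, the names `female` and `age_std` are hypothetical stand-ins for gender and (standardized) age, the "true" coefficients are invented, and tied event times are assumed not to occur.

```python
import numpy as np


def score_and_information(beta, time, event, X):
    """Gradient and negative Hessian of the Cox log partial likelihood.

    Censored subjects (event == 0) contribute only through the risk sets.
    """
    score = np.zeros(beta.size)
    info = np.zeros((beta.size, beta.size))
    for j in np.flatnonzero(event):
        at_risk = time >= time[j]          # risk set R_j
        w = np.exp(X[at_risk] @ beta)
        w /= w.sum()                       # each subject's share of total hazard
        mean = w @ X[at_risk]              # weighted covariate mean in R_j
        centered = X[at_risk] - mean
        score += X[j] - mean
        info += (centered * w[:, None]).T @ centered
    return score, info


def fit_cox(time, event, X, n_iter=25):
    """Newton-Raphson iterations starting from beta = 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        score, info = score_and_information(beta, time, event, X)
        beta = beta + np.linalg.solve(info, score)
    return beta


# Synthetic data (assumption): exponential failure times with hazard
# exp(X @ true_beta), plus independent exponential censoring.
rng = np.random.default_rng(0)
n = 300
female = rng.integers(0, 2, size=n).astype(float)
age_std = rng.normal(0.0, 1.0, size=n)
X = np.column_stack([female, age_std])
true_beta = np.array([0.5, 0.8])           # hypothetical effects
failure = rng.exponential(1.0 / np.exp(X @ true_beta))
censor = rng.exponential(2.0, size=n)
time = np.minimum(failure, censor)
event = (failure <= censor).astype(int)

beta_hat = fit_cox(time, event, X)
print("coefficients:", beta_hat)
print("hazard ratios:", np.exp(beta_hat))
```

The baseline hazard never enters the computation, which reflects the point made above that the partial likelihood leaves \(h_0(t)\) unspecified. Handling of tied event times, standard errors and survival curves are omitted here.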
