Niels: I'm not sure where our communication problem lies, so let me give some examples of questions that ask for something outside the range of the data available.

Wayne Nelson's example: GE bought rotors from another company; the vendor only delivered rotors that exceeded a certain specification; GE wanted to know the percentage of rotors scrapped and the percentage that would be scrapped under a proposed higher specification; the vendor refused to supply that information, so the data was left truncated (no data on how many rotors were "too weak" or on how much "too weak" they were; data on the strengths of all supplied rotors was available). Is there any way to estimate what GE wants to know? Using assumptions, yes. (This is from the Nelson article we cite.)

I do a lot of work in employment discrimination; one issue that arises is promotion discrimination; I have access to the entire histories of all people still with the company as of a certain date; however, I have no information on anyone who left before that date; some who left before that date were promoted and some were not; some still in the data were promoted before the cutoff date while others were not even hired until after the cutoff date; what do I do? There are other examples in the other articles we cite.

The *essential* difference, it seems to me, between left-truncated data and delayed entry data is whether the subject's data are observed at all; in truncated data, the data are *not observed at all*; in delayed entry data, there is data for the subject. I want to emphasize that truncated data means that _no_ data is observed for that subject; in fact, one does not even have a count of how many such subjects there are; one only has reason to believe that there are such subjects. Note that this use of truncated data is consistent with other areas of statistics (and with how it is used in the papers cited in our article).

I understand that for many biostatisticians, but not all, left-truncation has somehow been defined to be the same as "delayed entry". I do not think that this is useful; worse, I think it is confusing to take a term with a long history in statistics of meaning "unobserved" and give it a new meaning. Further, not all biostatisticians equate these two -- see the articles we cited. (Other areas of statistics where truncation is used to mean "unobserved" include: astronomy (e.g., stars that give off so little light they cannot be seen with any instruments we have); many sports (the first occurrence I know of was a study of horse speeds by Francis Galton using sources that only included horses that covered a one-mile course in no more than 2.5 minutes); agriculture (Fisher used maximum likelihood estimators); economics (esp. studies dealing with income which exclude people with incomes above "X"); etc.)

A quick look at the example you cite in your email says to me that it is a delayed entry example since there appears to be some data on every subject. I hope that this is clearer.

Rich

From: Niels Keiding
Subject: Re: left-truncation
To: Richard Goldstein
Date: Thu, 6 Mar 1997 08:12:01 +0100 (MET)
Cc: nk@kubism.ku.dk, fharrell@virginia.edu, pka@kubism.ku.dk

Rich: I (and other biostatisticians) agree completely with your interpretation of left truncation: we do not know how many there were with values less than the truncation point.
In our example on diabetic patients, we know (right-censored) life times for those diabetics alive on 1 July 1973, and we know nothing about those already dead at that date, not even how many there were.

I do not understand what you mean by delayed entry. If you know how many died before the cut-off, you should interpret the data as left CENSORED, which is very different, and not what the packages you speak about are doing. Another matter, carefully discussed in our book, is that one can always analyse left censored data as left truncated (= delayed entry), but that will involve loss of information. Maybe we need to agree on a particular example of what you mean by delayed entry; I have never heard of that interpretation before.

Per just told me that he met Frank last week in Germany. We understand that these matters are more in your focus than in Frank's.

Niels
--
Niels Keiding                     Telephone +45 35 32 79 03
Department of Biostatistics       Fax +45 35 32 79 07
University of Copenhagen
Blegdamsvej 3, DK-2200 Copenhagen N, Denmark

Niels:

"In many contexts individuals only come under observation some time after the beginning of the relevant time scale. There is now a complete theory available for such _delayed entry_ situations." (from the Abstract, Keiding, N., "Delayed Entry, the prevalent cohort study and survival synthesis," 1993.)

Several statistical packages clearly use delayed entry to mean entry sometime after whatever time 0 means in the study (Egret, S-Plus, Stata are examples). Further, I don't know of any software that means left truncation when it says that, and I don't know of any software (other than LIMDEP) that has any adjustment for left truncated data. So where does that leave us?

Rich

Date: Fri, 14 Mar 1997 01:13:13 -0800 (PST)
From: Richard Goldstein
To: Bill Greene, Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: query re: survival analysis
Return-Receipt-to: richgold@netcom.com

Bill Greene (Limdep), Terry Therneau (S-Plus) and Bill Gould (Stata):

Frank Harrell and I have recently written, and had accepted, two articles on software for survival analysis (one for the forthcoming Encyclopedia of Biostatistics (Wiley) and one for The American Statistician). An issue on which I now find myself a bit confused has come up and I am writing to get some clarification on exactly what your software does in a certain situation. The issue relates to "left truncation", a term used in your documentation. My concern here is mechanical, not theoretical and not mathematical. In other words, I want to know exactly how a user estimates a model with left truncation when using your software. Specifically, does one need to have observations in the data set that are the "left-truncated" observations or does one add an option to the command line that tells the software to modify the likelihood function, or both?

An example may help: Say I am estimating a model where survival of people with a particular heart problem is at issue. Say I work at a health care provider that is a referral center for people of this kind (i.e., other providers, including doctors, HMOs and hospitals refer these patients to my center). I think there are three classes of patients that I am interested in:

1. I have data on all those who are referred, and come, alive, to the center.

2. I know there are others who have the same problem but never even receive a referral before they die (maybe they are too scared to see a doctor at all, but an autopsy shows they had the problem of interest).
I don't know how many people are in this situation (even if autopsied, and probably not all are, I may not know of the results of the autopsy).

3. I know there are others who are referred but who don't live long enough to arrive at the center; I may or may not know about all specific cases in this group.

What if I want to estimate survival of all people with this problem, regardless of whether they actually get to my center? Can your software do this? Can your software estimate a model including either class 2 or class 3 or both? If yes, how? Which, if any, of the classes above is what you mean by "left truncation"? If none of the above fits, please give an example of what you mean by left truncation (possibly a subset of one of the classes).

If your software requires that an observation be entered for a left truncated person, please give me a specific example of how this is done (and for the sake of clarity, please add one or more examples of non-left-truncation so I can see the differences); if such an example already appears in your manual, please give the exact page number. If you incorporate left truncation via the inclusion of left truncated observations, does the number of observations in the data set affect the results of the estimation? E.g., if the total data set has 250 people, will the results be different, other things equal, if 1 person is left-truncated as compared with having 25 left-truncated people?

If you also use the phrase "delayed entry" in your documentation, does this mean something different from what you mean by "left truncation"? If yes, how does it differ, especially with respect to the example and questions above?

If the above questions do not appear to make sense in the context of what you mean by "left truncation", please do not hesitate to tell me what you mean, and describe how it is used in your software; feel free to ask questions if anything I have said is not clear to you. I would like to share your answer with my co-author (Frank Harrell) and with the editor with whom I am currently discussing this issue (Niels Keiding for the Encyclopedia); if you have any objection to my sharing your answer, please tell me so.

Thank you very much,

Rich Goldstein

Date: Fri, 14 Mar 1997 07:47:48 -0600
From: "Therneau, Terry M., Ph.D."
To: richgold@netcom.com
Subject: Re: query re: survival analysis
Cc: wgreene@stern.nyu.edu, wgould@stata.com, "Therneau, Terry M., Ph.D.", fharrell@virginia.edu
X-Sun-Charset: US-ASCII

The answer to your example is tied into another question, which is the function you are trying to estimate.

To estimate "Survival from appearance at my institution", use a standard Cox model (or KM or logrank or ....), with 0 = day of arrival.

To estimate "Survival from onset of condition" you need to allow for left truncation - subjects do not arrive on day 0; in this time scale some subjects don't arrive at all.

In Splus/SAS, the solution is the same. Code the data for those subjects who did come to your institution as (start, stop], where "start" is the time they came into your clinic, "stop" is the end of their follow-up, and both numbers are relative to the onset of condition, which is the day 0 in this case. No options to the fit routine are specified, beyond this data manipulation.

    coxph(Surv(start, stop, status) ~ x1 +......

If subject Smith had onset Jan 1, came to your clinic on March 1, and died on April 1, their observation would be (59, 90] with a status of 1. Smith would not be in the risk set on day 27, say. He is left truncated on day 59.
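A minimal sketch of this (start, stop] coding in S/R notation may help make it concrete. Everything here other than Smith's (59, 90] record is invented for illustration -- the data frame d, the covariate age, and the three extra subjects -- and the syntax shown is that of the survival functions named above as they exist in current R/S-Plus, not a transcription of any package's documentation.

    library(survival)   # R; S-Plus users may already have these functions on the search path

    ## Times are days since onset of the condition (day 0 = onset).
    ## Row 1 is Smith: entered the clinic on day 59, died on day 90.
    ## Row 2 is a subject observed from onset onward (start = 0).
    d <- data.frame(start  = c(59,   0,  35,  10),
                    stop   = c(90, 210, 150,  60),
                    status = c( 1,   0,   1,   0),   # 1 = died, 0 = right-censored
                    age    = c(70,  80,  63,  75))   # an invented covariate

    fit <- coxph(Surv(start, stop, status) ~ age, data = d)

    ## Each subject contributes to risk sets only over (start, stop], so Smith
    ## is not in the risk set on day 27; no option beyond the (start, stop]
    ## coding itself is needed.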
As soon as you tell me the time scale "time from onset", I believe that you have also restricted the population to those with an "onset". In your original question to me, if "time from referral to death" is the time scale then those who are never referred are not part of the population, by definition.

Software: the Cox models do this. The KM version exists as a local macro at Mayo, and real-soon-now as a part of the S survival library. (The SAS macro is tightly integrated with some expected survival tables, which are relevant only to Olmsted County, so we do not distribute it). There is a sneaky way to do the KM in Splus:

    fit <- coxph(Surv(start, stop, status) ~ 1, data=....)
    curve <- survfit(fit)
    plot(curve); print(curve); etc...

That is, fit a Cox model with no covariates, then ask for the corresponding survival curve. The more straightforward

    curve <- survfit(Surv(start, stop, status))

still needs some `prettifying' of the printout.

For the Cox model, truncation doesn't have a large effect, numerically, in most examples. The more important issue there is that the "right" time scale for analysis, properly a scientific debate/decision, should be reflected by using that scale in the analysis. As an example consider a Mayo study on the survival impact of L-Dopa in Parkinson disease. Since this is a slowly progressing disease, the time interval from diagnosis to Mayo referral could be more related to distance, affluence, and prior Mayo contact than to any feature of the disease process. The investigators felt that "time from diagnosis" was a more clinically meaningful quantity than "time from appearance at Mayo".

For the KM, truncation can have a large effect on the arithmetic. The reason is that with truncation you can have small n (risk sets) both at the end AND at the beginning of the survival curve. This leads to big step sizes and large SE's at both ends of the curve, and, since the SE in the Kaplan-Meier is cumulative, large SE all the way along the curve. For instance, assume that in your example one of the early referrals (came on day 3 after referral) dies on day 7, and that only 4/2000 total cases were referred before day 7. The K-M will have a multiplicative step of 3/4 on that day, with se increment of 1/12 (Greenwood).

Terry M. Therneau, Ph.D.          (507) 284-3694
Head, Section of Biostatistics    (507) 284-9542 FAX
Mayo Clinic                       therneau.terry@mayo.edu
Rochester, Minn 55905

From: William Gould
To: therneau@mayo.edu, wgreene@stern.nyu.edu, fharrell@virginia.edu
Subject: Re: query re: survival analysis
X-Organization: Stata Corporation
X-URL: http://www.stata.com
Date: Fri, 14 Mar 1997 11:07:20 -0600
Sender: wwg@stata.com

Dear Terry Therneau, Bill Greene, and Frank Harrell,

As Terry was kind enough to copy me on his response to Rich Goldstein's question, attached is the response I just sent. I should have copied all of you at the outset.

Regards,

-- Bill
wgould@stata.com

>From: wgould@stata.com (William Gould)
>To: richgold@netcom.com
>Subject: Re: query re: survival analysis

Dear Rich Goldstein,

I think I do need a clarification of your question. First, you may share my answer and questions with anybody you wish, including Bill Greene (Limdep) and Terry Therneau (S-Plus) and anybody else you wish to include.

One question I can answer: Do we use the phrase "left-truncation"? I suspect we do and I wish we had not because the term is confusing. We do use the phrase "delayed entry" and where we used the term "left-truncation", we mean it as a synonym for "delayed entry".
By "delayed entry", we mean a subject j was under observation from t0_j to t_j and that, as a condition for the subject to enter the sample, the event could not have occurred before t0_j. Had the event occurred before then, the subject would not have been observed.

Let me expand on that and, in the process, make clear my confusion concerning your questions. Although we talk about survival analysis as concerning the time of an event, we are being sloppy because time is not a well-defined concept. Time measured from when? What we really mean is the time *BETWEEN* two different kinds of events, the first event that starts the clock ticking and the second that stops it. Let us call

    event0    The event that starts the clock
    event1    The failure event of interest

As some examples:

1) We are analyzing newborns.
   a) event0 might be birth and event1 death.
   b) event0 might be diagnosis of a problem sometime after birth and event1 death.

2) We are analyzing survival of firms.
   a) event0 might be incorporation and event1 bankruptcy.
   b) event0 might be going public and event1 bankruptcy.
   c) event0 might be 1dec1995, the date a law changed, and event1 the date the firm changed its policies.

3) We are analyzing survival of older patients.
   a) event0 might be referral to the center and event1 death.
   b) event0 might be imputed onset of disease and event1 death.

Regardless of the situation, what we are analyzing is the time between event0 and event1. In computer software, it is common to (arbitrarily) assign event0 a "time" of 0 and so the duration between event0 and event1 becomes the "time" of event1. The establishment of what event0 is and when it occurred, however, is of great substantive importance.

When we get to doing the statistics, we are asserting that the hazard function h() is a function of the duration between the two events. This is a substantive assumption. We are asserting that, other things being equal, two subjects the same time out from event0 face the same hazard of event1. We are asserting that the durations between event0 and event1 are the natural units for the hazard function of the process we are modeling.

In the "usual" case we have a sample and every subject is under observation from the instant of event0. By under observation, I mean that event1 would be observed if it occurred. In the usual case we talk about event0 corresponding to "time" 0 and simply measure t (the time of event1 or censoring) as the duration:

                  event0       event1
                  |<----t----->|
Subject 1 --------|------------|-----------------------> calendar time

                            |<----t----->|
Subject 2 ------------------|------------|-------------> calendar time
                            event0       event1

When we analyze this data, we treat them as

          0            t
          |<----t----->|
Subject 1 |------------|-------------------------------> duration t

Subject 2 |------------|-------------------------------> duration t
          |<----t----->|
          0            t

In the "delayed entry" case, the subject comes under observation at a time later than event0. Had event1 occurred while the subject was not under observation, we would not have observed that event or the subject (assuming the event is death). Remember, it is still t, the time between event0 and event1, that is relevant for our hazard function. We might have:

                         entry
                         |
                  event0 |         event1
                  |<-----|-t------>|
Subject 1 --------|------|---------|-------------------> calendar time

                            |<----t----->|
Subject 2 ------------------|------------|-------------> calendar time
                            event0       event1
                            and entry

We now have a third time, the time of entry.
The problem with subject 1 is that the subject could not have had event1 occur between event0 and entry. If that had happened, the subject would not have been observed.

Such data might arise as follows:

1) I define event0 as date of diagnosis.

2) I start collecting data at a health center. For everyone who is diagnosed after I start my data collection, event0 and entry are the same instant.

3) I enrich the sample by going into their records and pulling the files on some current patients. The files say when a person was diagnosed. However, the files are only for current patients, meaning live ones. In the enriched sample, had the patient died prior to the date I collected my data, the file would have been closed and moved out of the cabinet. Ergo, the subjects in the enriched sample could not possibly have died between diagnosis and the date I collected data on them. They entered my experiment late.

That is the issue, and the difficulty I have in answering your questions is that I do not know when or what event0 is in your example.

> Say I am estimating a model where survival of people with a particular heart
> problem is at issue.

Survival from when? What is event0?

> I think there are three classes of
> patients that I am interested in:
>
> 1. I have data on all those who are referred, and come, alive, to the
> center.

This suggests to me you are thinking of event0 as being referred.

> 2. I know there are others who have the same problem but never even receive
> a referral before they die (maybe they are too scared to see a doctor at
> all, but an autopsy shows they had the problem of interest). I don't
> know how many people are in this situation (even if autopsied, and
> probably not all are, I may not know of the results of the autopsy).

This suggests that you are thinking of event0 as something prior to referral.

> 3. I know there are others who are referred but who don't live long enough
> to arrive at the center; I may or may not know about all specific cases
> in this group.

This suggests the same thing #2 suggested to me.

Case 1
------

To make this problem explicit, I am first going to assume that event0 is "preliminary diagnosis". The process is this:

1) Persons go somewhere and receive a preliminary diagnosis. This is event0, this starts the clock ticking.

2) Patients at this point are referred to your health center. Some of them arrive at your health center some time later.

3) It is only when a patient arrives at the health center that you, the analyst, see them. They enter your data at that point.

4) You follow them continually after that.

So now let's consider each of your three cases.

> 1. I have data on all those who are referred, and come, alive, to the
> center.

Well, you started with the difficult one. The time line is

               arrive at center
      prelim diag.    |
       (event0)       |                       die
            |         |                        |
            |         |<- under observation -> |
            |         |                        |
Subject: ---|---------|------------------------|----------> calendar time
            |<--------------- t -------------->|
            |<-- e -->|

What makes this case (statistically) difficult is that, had the subject died in the interval e, they would never have entered your data. Statistically, this subject must be treated as surviving t conditional on surviving e.

Stata can handle this. Pretend that, in the above diagram, t=20 and e=4. The observation for this person would be coded:

    entry    t   outcome
    -------------------------
        4   20   1

The person would inform Stata that the data looked like this by typing
    . stset t outcome, t0(entry)

From there on, every survival-analysis command would produce properly conditioned results.

> 2. I know there are others who have the same problem but never even receive
> a referral before they die (maybe they are too scared to see a doctor at
> all, but an autopsy shows they had the problem of interest). I don't
> know how many people are in this situation (even if autopsied, and
> probably not all are, I may not know of the results of the autopsy).

In this case -- under my assumptions about event0 -- #2 makes no sense. This would be a person who died prior to event0 and that is logically impossible. I.e., the way I have defined my hazard function gives no meaning to this statement, which suggests I have not defined my hazard function properly. In Case 2, below, I make a different set of assumptions about the meaning of event0 and then this question will have meaning.

> 3. I know there are others who are referred but who don't live long enough
> to arrive at the center; I may or may not know about all specific cases
> in this group.

This is the issue that made #1 statistically difficult. We have handled that problem. Given how I defined event0 and event1 and entry time, you have no data on these people and there is no problem that arises because of that.

Case 2
------

In this case I am going to make a far more complex set of assumptions about event0, event1, and the data-collection effort.

1) Persons get sick with the disease. Maybe they know this, maybe they do not.

2) Some go to see doctors. Some who are sick get referred, some do not. I suppose the doctor imputes the time corresponding to 1.

3) Of those referred, some go to the health center. Some do not because they die too soon or they just don't go.

4) My data collection starts here, at the health center, although I will get some other data, too.

5) Patients are under continual observation once they go to the center.

> 1. I have data on all those who are referred, and come, alive, to the
> center.

The time-line is this:

        stricken
        (event0)
            |     see
            |     doc
            |     |       go to
            |     |       center
            |     |       |                   die
            |     |       |                    |
Subject: ---|-----|-------|--------------------|----------> calendar time
            |<--------------- t -------------->|
            |<---- e ---->|

As a data analyst, I am a little bothered that the time stricken is not observed but instead imputed, but that does not affect the scenario. This is just like #1 in case 1 and my answer is the same. Stata has no problem with this observation. There are statistical issues in dealing with observations like this, but Stata handles them properly.

> 2. I know there are others who have the same problem but never even receive
> a referral before they die (maybe they are too scared to see a doctor at
> all, but an autopsy shows they had the problem of interest). I don't
> know how many people are in this situation (even if autopsied, and
> probably not all are, I may not know of the results of the autopsy).

There are a number of cases here.

        stricken
        (event0)
            |     die
            |     |
Subject: ---|-----|---------------------------------------> calendar time
            |<-t->|

        stricken
        (event0)
            |      see
            |      doc
            |      |       die
            |      |        |
Subject: ---|------|--------|-----------------------------> calendar time
            <-------t------->

If you have data on these people -- which you say you get from autopsy -- then you can just include them:

    entry    t   outcome
    -------------------------
        0    5   1
        0    9   1

Since the autopsy does not condition on not dying over a certain period, these people can be treated as if they were under continual observation.
I do, however, wonder where you will obtain the date stricken. Excluding these people from analysis is also no problem *ASSUMING* that they have the same h(t) as everybody else. Please note, we are now moving far from mechanical issues of statistical software.

> 3. I know there are others who are referred but who don't live long enough
> to arrive at the center; I may or may not know about all specific cases
> in this group.

This case is now really no different than case (2).

Summary
-------

The issue of delayed entry/left truncation is 1) that the data that is collected be conditioned on not having event1 occur during some period and 2) whether the persons who are not observed somehow bias results.

(1) is an appropriate question for software vendors. Stata can handle this. Moreover, Stata can handle any kind of conditioning. Pretend a scenario

          event0                             event1
Subject --|----------|------------|----------|----------- calendar time
                     |<---------->|
                       drops out

During the drop out period the subject was not observed. Had the subject died (event1) during that period, we would never have seen the subject again. In fact, however, the subject showed up again later at our health center. Thus, in analyzing this data, we must condition on the fact that event1 could not happen during the drop out period. Stata can handle this:

    Subject id    t0    t    died
            46     0    4       0
            46     8    9       1

Handling of this issue will be automatic. You can make the conditioning as complicated as you wish.

The second issue, whether there is bias caused by unobserved cases, is a substantive issue that statistical software vendors cannot address. This is something each researcher must think carefully about. This is no different than the issues associated with right censoring. Software manufacturers cannot write software that is resistant to bias in all cases; the issue is substantive.

Anyway, I am not certain I am interpreting your question properly. If so, I need reassurance and if not, I need clarification.

-- Bill
wgould@stata.com

Date: Fri, 14 Mar 97 12:28:13 EST
From: William Greene
To: richgold@netcom.com, therneau.terry@mayo.edu, wgould@stata.com, fharrell@virginia.edu
Subject: LIMDEP/Survival

Rich: You are getting some great answers to your question - my own education continues. I feel a bit guilty at this point; I'm not able to do anything with your question until the weekend, perhaps, but I'll get back to you as soon as I can. From the hip, LIMDEP assumes that the time frame is T0 to T, but the observation occurs, by construction, only at times Tj > T0, so that the survival distribution, though defined from T0 onward, is only measured from Tj onward. The truncation, then, implies something about the measured hazard rate. LIMDEP treats this kind of problem, mathematically, the same as it treats other truncated distributions. Having said that, I think I should study the other answers you've received, because there is clearly more to the truncation issue than that. Back to you later.

/Bill Greene

Date: Fri, 14 Mar 1997 12:11:59 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom23
To: Bill Greene, Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: thanks, and some clarifications and questions
Return-Receipt-to: richgold@netcom.com

Bill, Terry and Bill:

First, thank you very much for your thoughtful, and very quick, responses. Bill Gould's answer completely met my needs, needs which were not completely clear, unfortunately, in my prior message. I will try to be clearer here.
Let me work off of an example from Terry; one of the things he said was:

    If subject Smith had onset Jan 1, came to your clinic on March 1, and died
    on April 1, their observation would be (59, 90] with a status of 1. Smith
    would not be in the risk set on day 27, say. He is left truncated on day 59.

I want to build a little off of this and distinguish the following (unusual) three cases (in all I assume that "event0" or "onset" is 1/1):

1. X had onset 1/1 and died before getting to the clinic; the analyst never hears of the existence of X; temporarily, at least, I call this left truncation.

2. Y had onset 1/1 and died before getting to the clinic; the analyst knows that Y was supposed to arrive on 3/1; when Y does not arrive, inquiries are made and it is learned that Y died sometime prior to 3/1 and after 1/1 but it is not known exactly when Y died; temporarily, at least, I call this left censoring.

3. Z had onset 1/1, came to the clinic on 3/1 and died on 4/1 (this is Terry's scenario); temporarily, at least, I call this delayed entry.

My questions:

a. which of these, if any, does your software handle and what do you call each?

b. regardless of whether your software can deal with none of these, one of these, two of them or all of them, are any of these the same in theoretical terms? That is, there are some people who think that left truncation is exactly the same as delayed entry (but it is not clear that they are using the terms as I have used them above) at least mathematically; is it correct to treat these as equivalent as I have defined them above?

In particular, Terry says:

    To estimate "Survival from onset of condition" you need to allow for left
    truncation - subjects do not arrive on day 0; in this time scale some
    subjects don't arrive at all.

    In Splus/SAS, the solution is the same. Code the data for those subjects
    who did come to your institution as (start, stop], where "start" is the
    time they came into your clinic, "stop" is the end of their follow-up, and
    both numbers are relative to the onset of condition, which is the day 0 in
    this case.

The indented part of what I here quote from Terry has two phrases (clauses?); the first (before the comma) sounds like what I called "delayed entry", while the second sounds like what I called "left truncation"; the material in the last part of what I quoted seems to be saying that these are the same except that I don't see how one can code the data at all for subjects you don't even know about; Terry, I would appreciate it if you would elucidate.

Note that there is a fourth relevant term in some literature (e.g., Andersen, et al. (1993), _Statistical Models Based on Counting Processes_, Springer-Verlag): left filtration; this appears to mean something like what I have called "delayed entry" above: "by filtering, we mean that the individual is not under observation all the time, but only when a suitable indicator process is switched on" (p. 1; see also pp. 166-7 where they discuss left-filtration, left-censoring and left-truncation).

The kernel of my use of "truncation" is that we know the event must have happened to some cases, but we don't know who or how many or anything else about them except that they must exist. This is consistent with what I think truncated data is in other areas of statistics. I hope that the above is clearer.
Thanks again,

Rich

From: William Gould
To: richgold@netcom.com, wgreene@stern.nyu.edu, fharrell@virginia.edu, therneau.terry@mayo.edu
Subject: Re: thanks, and some clarifications and questions
X-Organization: Stata Corporation
X-URL: http://www.stata.com
Date: Fri, 14 Mar 1997 16:49:30 -0600
Sender: wwg@stata.com

> 1. X had onset 1/1 and died before getting to the clinic; the analyst never
> hears of the existence of X; temporarily, at least, I call this left
> truncation.
>
> 2. Y had onset 1/1 and died before getting to the clinic; the analyst knows
> that Y was supposed to arrive on 3/1; when Y does not arrive, inquiries
> are made and it is learned that Y died sometime prior to 3/1 and after
> 1/1 but it is not known exactly when Y died; temporarily, at least, I
> call this left censoring.
>
> 3. Z had onset 1/1, came to the clinic on 3/1 and died on 4/1 (this is
> Terry's scenario); temporarily, at least, I call this delayed entry.

> a. which of these, if any, does your software handle and what do you call
> each?

1. There is no data, so Stata can handle it because there is nothing to code. There is no information that the analyst knows.

2. Stata cannot handle this case; the user would be forced to either exclude the observation (no bias associated with this) or guess an exact date (which might bias results).

3. Stata can handle this case as I have previously explained.

> b. regardless of whether your software can deal with none of these, one of
> these, two of them or all of them, are any of these the same in
> theoretical terms?

No. There is different information content in each of the observations. If, however, the researcher excludes the observation in case 2, then he is treating it as if it has no information, and so in treatment, it would be the same as case 1.

> is it correct to treat these as equivalent as I have defined them above?

It is correct to treat case 2 the same as case 1, but it is inefficient.

> that is, there are some people who think that left truncation is exactly the
> same as delayed entry (but it is not clear that they are using the terms as
> I have used them above)

In our experience, users use the phrases left truncation and delayed entry interchangeably and, in all cases, are referring to case (3). I do not have a name to suggest for either (1) or (2).

-- Bill
wgould@stata.com

Date: Fri, 14 Mar 1997 17:10:13 -0600
From: "Therneau, Terry M., Ph.D."
To: wwg@stata.com
Cc: fharrell@virginia.edu, wgreene@stern.nyu.edu
Subject: More words

I managed to forget to cc this --

> 1. X had onset 1/1 and died before getting to the clinic; the analyst never
> hears of the existence of X; temporarily, at least, I call this left
> truncation.
>
> 2. Y had onset 1/1 and died before getting to the clinic; the analyst knows
> that Y was supposed to arrive on 3/1; when Y does not arrive, inquiries
> are made and it is learned that Y died sometime prior to 3/1 and after
> 1/1 but it is not known exactly when Y died; temporarily, at least, I
> call this left censoring.
>
> 3. Z had onset 1/1, came to the clinic on 3/1 and died on 4/1 (this is
> Terry's scenario); temporarily, at least, I call this delayed entry.

This is a time scale question. When there is the possibility for patients to be lost between time 0 and study entry, we have "left truncation". In a study with left truncation occurring, we will have two kinds of patients, those who were actually truncated (your case 1), and those who did make it into the study (your case 3).
Left truncation is the process that is operative, and my preferred phrase would be "this study used a data collection process that was subject to left truncation". When I see a data set with (15, 28] as the first observation for a subject, I know that there is left truncation in the data. I would not normally describe this particular observation as "left truncated", or as "delayed" or any other thing. In my course, which will be the ASA 2-day in '98, I have been careful to say on my slide "This is known as left truncation" and not "This observation is left truncated" (which it isn't). And I see to my horror that my first note to Rich used exactly this `incorrect' language!

Terry

Date: Mon, 17 Mar 1997 05:45:19 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom22
To: Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: Re: left truncation (fwd)
Return-Receipt-to: richgold@netcom.com

fyi

---------- Forwarded message ----------
Date: Sun, 16 Mar 97 13:49:30 EST
From: William Greene
To: Richard Goldstein
Subject: Re: left truncation

Rich: I think you may be misreading my text (and my answer). Knowing that the distribution is truncated does not automatically imply that if you ignore the truncation, the coefficients you estimated are attenuated. That is known to be true in a small set of cases. Also, LIMDEP does not estimate "attenuated" coefficients. It does exactly what I described in the earlier note.

For example, suppose the true, underlying distribution of survival times is known to be Weibull, with parameter lambda = exp(-x'beta). The true parameters, beta, are of interest. Times are distributed from 0 to +infinity. Now, suppose that, in spite of this known distribution, we only observe individuals with t(i) >= T*. Then a log-likelihood function built up from f(t(i)|t(i)>=T*) = f(t(i))/S(T*), maximized with respect to beta (and sigma), gives consistent and, based on the data in hand, efficient estimates of beta and sigma.

The point here (and this relates to the earlier note that you forwarded to me this morning) is that the known distribution, common both to observed and unobserved individuals, is easily modified to produce an appropriate distribution conditioned on the observation condition t(i) >= T*. In answer to your question, LIMDEP estimates beta, not a scaled version of beta. It turns out that in the special case in which times are lognormally distributed and the covariates are normally distributed, least squares regression of log(time) on the covariates will estimate a scalar multiple of beta. But, this is not what LIMDEP does. It uses maximum likelihood and estimates beta directly. I think this addresses your question, but if not, do let me know.
Cheers,

Bill

Date: Mon, 17 Mar 1997 06:58:56 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom21
To: Frank Harrell
Subject: summary of vendor discussion
Return-Receipt-to: richgold@netcom.com

Frank: I think a summary of what I received goes something like this:

S-Plus and Stata (and epicure and egret also) can deal with left-truncated data (meaning cases not observed at all and not in the data set) only via the assumption that those not in the data set are like those in the data set; for data that is *partly* left truncated (delayed entry: they are observed starting some time after the date of initial interest and so are in the data set but with incomplete information, and don't fail, or become right-censored, until after they enter the data set), these packages can all handle this; if "left-censoring" means that we know they failed prior to entering the data set but we only know they failed prior to some particular "date", then none of the packages can handle this; Limdep handles left-truncated data, in parametric models only, by modifying the log-likelihood.

My questions:

1. do you think this is a fair summary (I will check with the vendors too)?

2. do you think it is too long (e.g., should I drop the left-censoring issue; recall that Keiding brought it up)?

3. Assuming that it is fair, is this a reasonable stand to take with Keiding (after reminding him that our article deals with software, unlike his article)?

Do you still think (I do!) that there is a difference between left-truncation and delayed entry? In particular, I would much rather assume that there is something different about those who don't survive into the data set as compared with those who do survive long enough to get into the data.

Thanks,

Rich

Date: Mon, 17 Mar 1997 12:51:44 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom9
To: Bill Greene, Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: summary of our discussion
Return-Receipt-to: richgold@netcom.com

Gentlemen: Below is a summary; please let me know asap if you have any problems with this. Thank you all very much for your help and your patience.

S-Plus and Stata (and epicure and egret also) can deal with left-truncated data (meaning cases not observed at all and not in the data set) only via the assumption that those not in the data set are like those in the data set; for data that is *partly* left truncated (delayed entry: they are observed starting some time after the date of initial interest and so are in the data set but with incomplete information, and don't fail, or become right-censored, until after they enter the data set), these packages can all handle this; if "left-censoring" means that we know they failed prior to entering the data set but we only know they failed prior to some particular "date", then none of the packages can handle this; Limdep handles left-truncated data, in parametric models only, by modifying the log-likelihood.

Thanks,

Rich

From: William Gould
To: richgold@netcom.com, wgreene@stern.nyu.edu, fharrell@virginia.edu, therneau.terry@mayo.edu
Subject: Re: summary of our discussion
X-Organization: Stata Corporation
X-URL: http://www.stata.com
Date: Mon, 17 Mar 1997 18:49:10 -0600
Sender: wwg@stata.com

I do not think what you wrote correctly summarizes the discussion.
The problem is with the first and last paragraphs:

> S-Plus and Stata (and epicure and egret also) can deal with left-truncated
> data (meaning cases not observed at all and not in the data set) only via
> the assumption that those not in the data set are like those in the data
> set;
>
> [...]
>
> Limdep handles left-truncated data, in parametric models only, by
> modifying the log-likelihood.

The first paragraph can be applied to any package -- Limdep included. If left-truncated data (as you define it) is not a problem, then its existence can be ignored. Reading what Bill Greene wrote, he does not have a solution for this problem nor could he. Surely no package can handle all the biases that could arise were the unobserved data different from the data that was collected. Think about the following thought experiment: I change the data not observed at all and not in the data set so as to vary the biases; do Limdep's answers really change?

Bill Greene should speak for himself, but I interpret what he wrote to be that in the case of parametric survival models, Limdep could condition on t(i) >= T*. This is what you define as *PARTLY* left-truncated data:

> for data that is *partly* left truncated (delayed entry: they are observed
> starting some time after the date of initial interest and so are in the data
> set but with incomplete information, and don't fail, or become
> right-censored, until after they enter the data set), these packages can all
> handle this;

Finally, I have no problem with the middle paragraph

> if "left-censoring" means that we know they failed prior to entering the
> data set but we only know they failed prior to some particular "date", then
> none of the packages can handle this;

If I had to summarize this, here is what I would write:

S-Plus and Stata (and epicure and egret also) can deal with data that is left truncated (delayed entry) in parametric and nonparametric survival models; Limdep can deal with this problem in parametric survival models. Subject j is observed at t_j, some time after onset of the condition leading to failure. The statistical procedure for handling such cases involves conditioning on the fact that the failure could not have occurred prior to t_j.

No package can handle "left-censoring", meaning that we know only that subjects failed at some unknown time prior to some particular time t_j*.

It is not clear to me that I have summarized this correctly because Bill Greene wrote,

    f(t(i)|t(i)>=T*) = ...

and he did not say whether T* can vary from observation to observation; I assume that it can.

-- Bill
wgould@stata.com

Date: Tue, 18 Mar 1997 05:24:03 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom4
To: Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: Re: summary of our discussion (fwd)
Return-Receipt-to: richgold@netcom.com

fyi

---------- Forwarded message ----------
Date: Mon, 17 Mar 97 16:38:21 EST
From: William Greene
To: Richard Goldstein
Subject: Re: summary of our discussion

Rich: Depending on your audience for this summary, I think I'd go one more iteration. It's not clear to me what the summary for S-Plus and Stata means. Do I (the reader) infer that these programs therefore make no special consideration for truncation? Or, do they assume that truncation does not cause any biases? Or what? I think all of the programs you looked at make the assumption that the observations not in the data set are like those that make it in, except, that is, for the truncation. Sounds pretty confusing.
The description for LIMDEP, strictly speaking, is true, but if I were reading this afresh, I can't say I'd know what Greene was doing about truncation, if anything.

On another front: LIMDEP for Windows 95/NT is a few inches away from completion. Are you interested in playing with a nearly final version?

Regards,

Bill

Date: Tue, 18 Mar 1997 10:47:34 -0600
From: "Therneau, Terry M., Ph.D."
To: richgold@netcom.com
Cc: fharrell@virginia.edu, wgould@stata.com, wgreene@stern.nyu.edu
Subject: Summary

1. Splus handles left truncation, but not left censoring, for the Cox model.

2. Splus handles left censoring (also interval censoring) for the parametric survival models, but not left truncation.

3. The left truncation for coxph appears to be more general than that described in Limdep. In Splus each subject in the study has a separate truncation time, while in Limdep it appears that a single overall truncation time can be specified. I could easily be mistaken on this.

Terry Therneau

PS What I have said for Splus also applies to SAS, at this time.
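To close, a sketch of the two other items this summary and Greene's note touch on -- left censoring in a parametric model, and the conditional likelihood f(t | t >= T*) = f(t)/S(T*) -- in the same R/S notation as the earlier (start, stop] sketch. All data values, variable names, and the negll function below are invented for illustration; the interval2 coding is that of the current R survival library and may differ from the 1997 S-Plus syntax, and nothing here is a transcription of LIMDEP's internals.

    library(survival)

    ## Left censoring in a parametric model: subject 1 below is known only to
    ## have failed before day 40.  With type = "interval2", (NA, t] codes a
    ## left-censored time and (t, t] an exact event time.
    d2 <- data.frame(lower = c(NA, 120,  80,  95,  60, 150),
                     upper = c(40, 120,  80,  95,  60, 150),
                     age   = c(61,  59,  72,  66,  70,  57))
    fit.cens <- survreg(Surv(lower, upper, type = "interval2") ~ age,
                        data = d2, dist = "weibull")

    ## Greene's conditional density f(t | t >= Tstar) = f(t)/S(Tstar) is the
    ## parametric analogue of the same truncation adjustment.  A hand-rolled
    ## negative log-likelihood for a Weibull with lambda_i = exp(-x_i'beta)
    ## shows where the S(Tstar) term enters (maximize with, e.g., optim in R).
    negll <- function(par, t, X, Tstar) {
      p      <- exp(par[1])                    # shape, kept positive
      lambda <- exp(-drop(X %*% par[-1]))
      logf   <- log(p) + p * log(lambda) + (p - 1) * log(t) - (lambda * t)^p
      logS   <- -(lambda * Tstar)^p            # log S(Tstar) at the same parameters
      -sum(logf - logS)                        # each term conditioned on t >= Tstar
    }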