Niels: I'm not sure where our communication problem lies, so let me give some examples of questions that ask for something outside the range of the data available.

Wayne Nelson's example: GE bought rotors from another company; the vendor only delivered rotors that exceeded a certain specification; GE wanted to know the percentage of rotors scrapped and the percentage that would be scrapped under a proposed higher specification; the vendor refused to supply that information, so the data was left truncated (no data on how many rotors were "too weak" or on how much "too weak" they were; data on the strengths of all supplied rotors was available). Is there any way to estimate what GE wants to know? Using assumptions, yes. (This is from the Nelson article we cite.)

I do a lot of work in employment discrimination; one issue that arises is promotion discrimination; I have access to the entire histories of all people still with the company as of a certain date; however, I have no information on anyone who left before that date; some who left before that date were promoted and some were not; some still in the data were promoted before the cutoff date while others were not even hired until after the cutoff date; what do I do? There are other examples in the other articles we cite.

The *essential* difference, it seems to me, between left-truncated data and delayed entry data is whether the subject's data are observed at all; in truncated data, the data are *not observed at all*; in delayed entry data, there is data for the subject. I want to emphasize that truncated data means that _no_ data is observed for that subject; in fact, one does not even have a count of how many such subjects there are; one only has reason to believe that there are such subjects. Note that this use of truncated data is consistent with other areas of statistics (and with how it is used in the papers cited in our article).

I understand that for many biostatisticians, but not all, left-truncation has somehow been defined to be the same as "delayed entry". I do not think that this is useful; worse, I think it is confusing to take a term with a long history in statistics of meaning "unobserved" and give it a new meaning. Further, not all biostatisticians equate these two -- see the articles we cited. (Other areas of statistics where truncation is used to mean "unobserved" include: astronomy (e.g., stars that give off so little light they cannot be seen with any instruments we have); many sports (the first occurrence I know of was a study of horse speeds by Francis Galton using sources that only included horses that covered a one-mile course in no more than 2.5 minutes); agriculture (Fisher used maximum likelihood estimators); economics (esp. studies dealing with income which exclude people with incomes above "X"); etc.)

A quick look at the example you cite in your email says to me that it is a delayed entry example since there appears to be some data on every subject. I hope that this is clearer.

Rich

From: Niels Keiding
Subject: Re: left-truncation
To: Richard Goldstein
Date: Thu, 6 Mar 1997 08:12:01 +0100 (MET)
Cc: nk@kubism.ku.dk, fharrell@virginia.edu, pka@kubism.ku.dk

Rich: I (and other biostatisticians) agree completely with your interpretation of left truncation: we do not know how many there were with values less than the truncation point.
In our example on diabetic patients, we know (right-censored) life times for those diabetics alive on 1 July 1973, and we know nothing about those already dead at that date, not even how many there were.

I do not understand what you mean by delayed entry. If you know how many died before the cut-off, you should interpret the data as left CENSORED, which is very different, and not what the packages you speak about are doing. Another matter, carefully discussed in our book, is that one can always analyse left censored data as left truncated (= delayed entry), but that will involve loss of information. Maybe we need to agree on a particular example of what you mean by delayed entry; I have never heard of that interpretation before.

Per just told me that he met Frank last week in Germany. We understand that these matters are more in your focus than in Frank's.

Niels
--
Niels Keiding                     Telephone +45 35 32 79 03
Department of Biostatistics       Fax +45 35 32 79 07
University of Copenhagen
Blegdamsvej 3, DK-2200 Copenhagen N, Denmark

Niels:

"In many contexts individuals only come under observation some time after the beginning of the relevant time scale. There is now a complete theory available for such _delayed entry_ situations." (from the Abstract, Keiding, N., "Delayed Entry, the prevalent cohort study and survival synthesis," 1993.)

Several statistical packages clearly use delayed entry to mean entry sometime after whatever time 0 means in the study (Egret, S-Plus, Stata are examples). Further, I don't know of any software that means left truncation when it says that, and I don't know of any software (other than LIMDEP) that has any adjustment for left truncated data. So where does that leave us?

Rich

Date: Fri, 14 Mar 1997 01:13:13 -0800 (PST)
From: Richard Goldstein
To: Bill Greene, Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: query re: survival analysis
Return-Receipt-to: richgold@netcom.com

Bill Greene (Limdep), Terry Therneau (S-Plus) and Bill Gould (Stata):

Frank Harrell and I have recently written, and had accepted, two articles on software for survival analysis (one for the forthcoming Encyclopedia of Biostatistics (Wiley) and one for The American Statistician). An issue on which I now find myself a bit confused has come up and I am writing to get some clarification on exactly what your software does in a certain situation. The issue relates to "left truncation", a term used in your documentation. My concern here is mechanical, not theoretical and not mathematical. In other words, I want to know exactly how a user estimates a model with left truncation when using your software. Specifically, does one need to have observations in the data set that are the "left-truncated" observations or does one add an option to the command line that tells the software to modify the likelihood function, or both?

An example may help: Say I am estimating a model where survival of people with a particular heart problem is at issue. Say I work at a health care provider that is a referral center for people of this kind (i.e., other providers, including doctors, HMOs and hospitals refer these patients to my center). I think there are three classes of patients that I am interested in:

1. I have data on all those who are referred, and come, alive, to the center.

2. I know there are others who have the same problem but never even receive a referral before they die (maybe they are too scared to see a doctor at all, but an autopsy shows they had the problem of interest).
I don't know how many people are in this situation (even if autopsied, and probably not all are, I may not know of the results of the autopsy).

3. I know there are others who are referred but who don't live long enough to arrive at the center; I may or may not know about all specific cases in this group.

What if I want to estimate survival of all people with this problem, regardless of whether they actually get to my center? Can your software do this? Can your software estimate a model including either class 2 or class 3 or both? If yes, how? Which, if any, of the classes above is what you mean by "left truncation"? If none of the above fits, please give an example of what you mean by left truncation (possibly a subset of one of the classes).

If your software requires that an observation be entered for a left truncated person, please give me a specific example of how this is done (and for the sake of clarity, please add one or more examples of non-left-truncation so I can see the differences); if such an example already appears in your manual, please give the exact page number. If you incorporate left truncation via the inclusion of left truncated observations, does the number of observations in the data set affect the results of the estimation? E.g., if the total data set has 250 people, will the results be different, other things equal, if 1 person is left-truncated as compared with having 25 left-truncated people?

If you also use the phrase "delayed entry" in your documentation, does this mean something different from what you mean by "left truncation"? If yes, how does it differ, especially with respect to the example and questions above?

If the above questions do not appear to make sense in the context of what you mean by "left truncation", please do not hesitate to tell me what you mean, and describe how it is used in your software; feel free to ask questions if anything I have said is not clear to you. I would like to share your answer with my co-author (Frank Harrell) and with the editor with whom I am currently discussing this issue (Niels Keiding for the Encyclopedia); if you have any objection to my sharing your answer, please tell me so.

Thank you very much,

Rich Goldstein

Date: Fri, 14 Mar 1997 07:47:48 -0600
From: "Therneau, Terry M., Ph.D."
To: richgold@netcom.com
Subject: Re: query re: survival analysis
Cc: wgreene@stern.nyu.edu, wgould@stata.com, "Therneau, Terry M., Ph.D.", fharrell@virginia.edu
X-Sun-Charset: US-ASCII

The answer to your example is tied into another question, which is the function you are trying to estimate.

To estimate "Survival from appearance at my institution", use a standard Cox model (or KM or logrank or ....), with 0 = day of arrival.

To estimate "Survival from onset of condition" you need to allow for left truncation - subjects do not arrive on day 0; in this time scale some subjects don't arrive at all.

In Splus/SAS, the solution is the same. Code the data for those subjects who did come to your institution as (start, stop], where "start" is the time they came into your clinic, "stop" is the end of their follow-up, and both numbers are relative to the onset of condition, which is the day 0 in this case. No options to the fit routine are specified, beyond this data manipulation.

    coxph(Surv(start, stop, status) ~ x1 +......

If subject Smith had onset Jan 1, came to your clinic on March 1, and died on April 1, their observation would be (59, 90] with a status of 1. Smith would not be in the risk set on day 27, say. He is left truncated on day 59.
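A minimal sketch of this (start, stop] coding in S/R notation may help make it concrete. Everything here other than Smith's (59, 90] record is invented for illustration -- the data frame d, the covariate age, and the three extra subjects -- and the syntax shown is that of the survival functions named above as they exist in current R/S-Plus, not a transcription of any package's documentation.

    library(survival)   # R; S-Plus users may already have these functions on the search path

    ## Times are days since onset of the condition (day 0 = onset).
    ## Row 1 is Smith: entered the clinic on day 59, died on day 90.
    ## Row 2 is a subject observed from onset onward (start = 0).
    d <- data.frame(start  = c(59,   0,  35,  10),
                    stop   = c(90, 210, 150,  60),
                    status = c( 1,   0,   1,   0),   # 1 = died, 0 = right-censored
                    age    = c(70,  80,  63,  75))   # an invented covariate

    fit <- coxph(Surv(start, stop, status) ~ age, data = d)

    ## Each subject contributes to risk sets only over (start, stop], so Smith
    ## is not in the risk set on day 27; no option beyond the (start, stop]
    ## coding itself is needed.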
As soon as you tell me the time scale "time from onset", I believe that you have also restricted the population to those with an "onset". In your original question to me, if "time from referral to death" is the time scale then those who are never referred are not part of the population, by definition.

Software: the Cox models do this. The KM version exists as a local macro at Mayo, and real-soon-now as a part of the S survival library. (The SAS macro is tightly integrated with some expected survival tables, which are relevant only to Olmsted County, so we do not distribute it). There is a sneaky way to do the KM in Splus:

    fit <- coxph(Surv(start, stop, status) ~ 1, data=....)
    curve <- survfit(fit)
    plot(curve); print(curve); etc...

That is, fit a Cox model with no covariates, then ask for the corresponding survival curve. The more straightforward

    curve <- survfit(Surv(start, stop, status))

still needs some `prettifying' of the printout.

For the Cox model, truncation doesn't have a large effect, numerically, in most examples. The more important issue there is that the "right" time scale for analysis, properly a scientific debate/decision, should be reflected by using that scale in the analysis. As an example consider a Mayo study on the survival impact of L-Dopa in Parkinson disease. Since this is a slowly progressing disease, the time interval from diagnosis to Mayo referral could be more related to distance, affluence, and prior Mayo contact than to any feature of the disease process. The investigators felt that "time from diagnosis" was a more clinically meaningful quantity than "time from appearance at Mayo".

For the KM, truncation can have a large effect on the arithmetic. The reason is that with truncation you can have small n (risk sets) both at the end AND at the beginning of the survival curve. This leads to big step sizes and large SE's at both ends of the curve, and, since the SE in the Kaplan-Meier is cumulative, large SE all the way along the curve. For instance, assume that in your example one of the early referrals (came on day 3 after referral) dies on day 7, and that only 4/2000 total cases were referred before day 7. The K-M will have a multiplicative step of 3/4 on that day, with se increment of 1/12 (Greenwood).

Terry M. Therneau, Ph.D.          (507) 284-3694
Head, Section of Biostatistics    (507) 284-9542 FAX
Mayo Clinic                       therneau.terry@mayo.edu
Rochester, Minn 55905

From: William Gould
To: therneau@mayo.edu, wgreene@stern.nyu.edu, fharrell@virginia.edu
Subject: Re: query re: survival analysis
X-Organization: Stata Corporation
X-URL: http://www.stata.com
Date: Fri, 14 Mar 1997 11:07:20 -0600
Sender: wwg@stata.com

Dear Terry Therneau, Bill Greene, and Frank Harrell,

As Terry was kind enough to copy me on his response to Rich Goldstein's question, attached is the response I just sent. I should have copied all of you at the outset.

Regards,

-- Bill
wgould@stata.com

>From: wgould@stata.com (William Gould)
>To: richgold@netcom.com
>Subject: Re: query re: survival analysis

Dear Rich Goldstein,

I think I do need a clarification of your question. First, you may share my answer and questions with anybody you wish, including Bill Greene (Limdep) and Terry Therneau (S-Plus) and anybody else you wish to include.

One question I can answer: Do we use the phrase "left-truncation"? I suspect we do and I wish we had not because the term is confusing. We do use the phrase "delayed entry" and where we used the term "left-truncation", we mean it as a synonym for "delayed entry".
By "delayed entry", we mean a subject j was under observation from t0_j to t_j and that, as a condition for the subject to enter the sample, the event could not have occurred before t0_j. Had the event occurred before then, the subject would not have been observed.

Let me expand on that and, in the process, make clear my confusion concerning your questions. Although we talk about survival analysis as concerning the time of an event, we are being sloppy because time is not a well-defined concept. Time measured from when? What we really mean is the time *BETWEEN* two different kinds of events, the first event that starts the clock ticking and the second that stops it. Let us call

    event0    The event that starts the clock
    event1    The failure event of interest

As some examples:

1) We are analyzing newborns.
   a) event0 might be birth and event1 death.
   b) event0 might be diagnosis of a problem sometime after birth and event1 death.

2) We are analyzing survival of firms.
   a) event0 might be incorporation and event1 bankruptcy.
   b) event0 might be going public and event1 bankruptcy.
   c) event0 might be 1dec1995, the date a law changed, and event1 the date the firm changed its policies.

3) We are analyzing survival of older patients.
   a) event0 might be referral to the center and event1 death.
   b) event0 might be imputed onset of disease and event1 death.

Regardless of the situation, what we are analyzing is the time between event0 and event1. In computer software, it is common to (arbitrarily) assign event0 a "time" of 0 and so the duration between event0 and event1 becomes the "time" of event1. The establishment of what event0 is and when it occurred, however, is of great substantive importance.

When we get to doing the statistics, we are asserting that the hazard function h() is a function of the duration between the two events. This is a substantive assumption. We are asserting that, other things being equal, two subjects the same time out from event0 face the same hazard of event1. We are asserting that the durations between event0 and event1 are the natural units for the hazard function of the process we are modeling.

In the "usual" case we have a sample and every subject is under observation from the instant of event0. By under observation, I mean that event1 would be observed if it occurred. In the usual case we talk about event0 corresponding to "time" 0 and simply measure t (the time of event1 or censoring) as the duration:

                  event0       event1
                  |<----t----->|
Subject 1 --------|------------|-----------------------> calendar time

                            |<----t----->|
Subject 2 ------------------|------------|-------------> calendar time
                            event0       event1

When we analyze this data, we treat them as

          0            t
          |<----t----->|
Subject 1 |------------|-------------------------------> duration t

Subject 2 |------------|-------------------------------> duration t
          |<----t----->|
          0            t

In the "delayed entry" case, the subject comes under observation at a time later than event0. Had event1 occurred while the subject was not under observation, we would not have observed that event or the subject (assuming the event is death). Remember, it is still t, the time between event0 and event1, that is relevant for our hazard function. We might have:

                         entry
                         |
                  event0 |         event1
                  |<-----|-t------>|
Subject 1 --------|------|---------|-------------------> calendar time

                            |<----t----->|
Subject 2 ------------------|------------|-------------> calendar time
                            event0       event1
                            and entry

We now have a third time, the time of entry.
The problem with subject 1 is that the subject could not have had event1 occur between event0 and entry. If that had happened, the subject would not have been observed.

Such data might arise as follows:

1) I define event0 as date of diagnosis.

2) I start collecting data at a health center. For everyone who is diagnosed after I start my data collection, event0 and entry are the same instant.

3) I enrich the sample by going into their records and pulling the files on some current patients. The files say when a person was diagnosed. However, the files are only for current patients, meaning live ones. In the enriched sample, had the patient died prior to the date I collected my data, the file would have been closed and moved out of the cabinet. Ergo, the subjects in the enriched sample could not possibly have died between diagnosis and the date I collected data on them. They entered my experiment late.

That is the issue, and the difficulty I have in answering your questions is that I do not know when or what event0 is in your example.

> Say I am estimating a model where survival of people with a particular heart
> problem is at issue.

Survival from when? What is event0?

> I think there are three classes of
> patients that I am interested in:
>
> 1. I have data on all those who are referred, and come, alive, to the
> center.

This suggests to me you are thinking of event0 as being referred.

> 2. I know there are others who have the same problem but never even receive
> a referral before they die (maybe they are too scared to see a doctor at
> all, but an autopsy shows they had the problem of interest). I don't
> know how many people are in this situation (even if autopsied, and
> probably not all are, I may not know of the results of the autopsy).

This suggests that you are thinking of event0 as something prior to referral.

> 3. I know there are others who are referred but who don't live long enough
> to arrive at the center; I may or may not know about all specific cases
> in this group.

This suggests the same thing #2 suggested to me.

Case 1
------

To make this problem explicit, I am first going to assume that event0 is "preliminary diagnosis". The process is this:

1) Persons go somewhere and receive a preliminary diagnosis. This is event0, this starts the clock ticking.

2) Patients at this point are referred to your health center. Some of them arrive at your health center some time later.

3) It is only when a patient arrives at the health center that you, the analyst, see them. They enter your data at that point.

4) You follow them continually after that.

So now let's consider each of your three cases.

> 1. I have data on all those who are referred, and come, alive, to the
> center.

Well, you started with the difficult one. The time line is

               arrive at center
      prelim diag.    |
       (event0)       |                       die
            |         |                        |
            |         |<- under observation -> |
            |         |                        |
Subject: ---|---------|------------------------|----------> calendar time
            |<--------------- t -------------->|
            |<-- e -->|

What makes this case (statistically) difficult is that, had the subject died in the interval e, they would never have entered your data. Statistically, this subject must be treated as surviving t conditional on surviving e.

Stata can handle this. Pretend that, in the above diagram, t=20 and e=4. The observation for this person would be coded:

    entry    t   outcome
    -------------------------
        4   20   1

The person would inform Stata that the data looked like this by typing
    . stset t outcome, t0(entry)

From there on, every survival-analysis command would produce properly conditioned results.

> 2. I know there are others who have the same problem but never even receive
> a referral before they die (maybe they are too scared to see a doctor at
> all, but an autopsy shows they had the problem of interest). I don't
> know how many people are in this situation (even if autopsied, and
> probably not all are, I may not know of the results of the autopsy).

In this case -- under my assumptions about event0 -- #2 makes no sense. This would be a person who died prior to event0 and that is logically impossible. I.e., the way I have defined my hazard function gives no meaning to this statement, which suggests I have not defined my hazard function properly. In Case 2, below, I make a different set of assumptions about the meaning of event0 and then this question will have meaning.

> 3. I know there are others who are referred but who don't live long enough
> to arrive at the center; I may or may not know about all specific cases
> in this group.

This is the issue that made #1 statistically difficult. We have handled that problem. Given how I defined event0 and event1 and entry time, you have no data on these people and there is no problem that arises because of that.

Case 2
------

In this case I am going to make a far more complex set of assumptions about event0, event1, and the data-collection effort.

1) Persons get sick with the disease. Maybe they know this, maybe they do not.

2) Some go to see doctors. Some who are sick get referred, some do not. I suppose the doctor imputes the time corresponding to 1.

3) Of those referred, some go to the health center. Some do not because they die too soon or they just don't go.

4) My data collection starts here, at the health center, although I will get some other data, too.

5) Patients are under continual observation once they go to the center.

> 1. I have data on all those who are referred, and come, alive, to the
> center.

The time-line is this:

        stricken
        (event0)
            |     see
            |     doc
            |     |       go to
            |     |       center
            |     |       |                   die
            |     |       |                    |
Subject: ---|-----|-------|--------------------|----------> calendar time
            |<--------------- t -------------->|
            |<---- e ---->|

As a data analyst, I am a little bothered that the time stricken is not observed but instead imputed, but that does not affect the scenario. This is just like #1 in case 1 and my answer is the same. Stata has no problem with this observation. There are statistical issues in dealing with observations like this, but Stata handles them properly.

> 2. I know there are others who have the same problem but never even receive
> a referral before they die (maybe they are too scared to see a doctor at
> all, but an autopsy shows they had the problem of interest). I don't
> know how many people are in this situation (even if autopsied, and
> probably not all are, I may not know of the results of the autopsy).

There are a number of cases here.

        stricken
        (event0)
            |     die
            |     |
Subject: ---|-----|---------------------------------------> calendar time
            |<-t->|

        stricken
        (event0)
            |      see
            |      doc
            |      |       die
            |      |        |
Subject: ---|------|--------|-----------------------------> calendar time
            <-------t------->

If you have data on these people -- which you say you get from autopsy -- then you can just include them:

    entry    t   outcome
    -------------------------
        0    5   1
        0    9   1

Since the autopsy does not condition on not dying over a certain period, these people can be treated as if they were under continual observation.
I do, however, wonder where you will obtain the date stricken. Excluding these people from analysis is also no problem *ASSUMING* that they have the same h(t) as everybody else. Please note, we are now moving far from mechanical issues of statistical software.

> 3. I know there are others who are referred but who don't live long enough
> to arrive at the center; I may or may not know about all specific cases
> in this group.

This case is now really no different than case (2).

Summary
-------

The issue of delayed entry/left truncation is 1) that the data that is collected be conditioned on not having event1 occur during some period and 2) whether the persons who are not observed somehow bias results.

(1) is an appropriate question for software vendors. Stata can handle this. Moreover, Stata can handle any kind of conditioning. Pretend a scenario

          event0                             event1
Subject --|----------|------------|----------|----------- calendar time
                     |<---------->|
                       drops out

During the drop out period the subject was not observed. Had the subject died (event1) during that period, we would never have seen the subject again. In fact, however, the subject showed up again later at our health center. Thus, in analyzing this data, we must condition on the fact that event1 could not happen during the drop out period. Stata can handle this:

    Subject id    t0    t    died
            46     0    4       0
            46     8    9       1

Handling of this issue will be automatic. You can make the conditioning as complicated as you wish.

The second issue, whether there is bias caused by unobserved cases, is a substantive issue that statistical software vendors cannot address. This is something each researcher must think carefully about. This is no different than the issues associated with right censoring. Software manufacturers cannot write software that is resistant to bias in all cases; the issue is substantive.

Anyway, I am not certain I am interpreting your question properly. If so, I need reassurance and if not, I need clarification.

-- Bill
wgould@stata.com

Date: Fri, 14 Mar 97 12:28:13 EST
From: William Greene
To: richgold@netcom.com, therneau.terry@mayo.edu, wgould@stata.com, fharrell@virginia.edu
Subject: LIMDEP/Survival

Rich: You are getting some great answers to your question - my own education continues. I feel a bit guilty at this point; I'm not able to do anything with your question until the weekend, perhaps, but I'll get back to you as soon as I can. From the hip, LIMDEP assumes that the time frame is T0 to T, but the observation occurs, by construction, only at times Tj > T0, so that the survival distribution, though defined from T0 onward, is only measured from Tj onward. The truncation, then, implies something about the measured hazard rate. LIMDEP treats this kind of problem, mathematically, the same as it treats other truncated distributions. Having said that, I think I should study the other answers you've received, because there is clearly more to the truncation issue than that. Back to you later.

/Bill Greene

Date: Fri, 14 Mar 1997 12:11:59 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom23
To: Bill Greene, Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: thanks, and some clarifications and questions
Return-Receipt-to: richgold@netcom.com

Bill, Terry and Bill:

First, thank you very much for your thoughtful, and very quick, responses. Bill Gould's answer completely met my needs, needs which were not completely clear, unfortunately, in my prior message. I will try to be clearer here.
Let me work off of an example from Terry; one of the things he said was:

    If subject Smith had onset Jan 1, came to your clinic on March 1, and died
    on April 1, their observation would be (59, 90] with a status of 1. Smith
    would not be in the risk set on day 27, say. He is left truncated on day 59.

I want to build a little off of this and distinguish the following (unusual) three cases (in all I assume that "event0" or "onset" is 1/1):

1. X had onset 1/1 and died before getting to the clinic; the analyst never hears of the existence of X; temporarily, at least, I call this left truncation.

2. Y had onset 1/1 and died before getting to the clinic; the analyst knows that Y was supposed to arrive on 3/1; when Y does not arrive, inquiries are made and it is learned that Y died sometime prior to 3/1 and after 1/1 but it is not known exactly when Y died; temporarily, at least, I call this left censoring.

3. Z had onset 1/1, came to the clinic on 3/1 and died on 4/1 (this is Terry's scenario); temporarily, at least, I call this delayed entry.

My questions:

a. which of these, if any, does your software handle and what do you call each?

b. regardless of whether your software can deal with none of these, one of these, two of them or all of them, are any of these the same in theoretical terms? That is, there are some people who think that left truncation is exactly the same as delayed entry (but it is not clear that they are using the terms as I have used them above) at least mathematically; is it correct to treat these as equivalent as I have defined them above?

In particular, Terry says:

    To estimate "Survival from onset of condition" you need to allow for left
    truncation - subjects do not arrive on day 0; in this time scale some
    subjects don't arrive at all.

    In Splus/SAS, the solution is the same. Code the data for those subjects
    who did come to your institution as (start, stop], where "start" is the
    time they came into your clinic, "stop" is the end of their follow-up, and
    both numbers are relative to the onset of condition, which is the day 0 in
    this case.

The indented part of what I here quote from Terry has two phrases (clauses?); the first (before the comma) sounds like what I called "delayed entry", while the second sounds like what I called "left truncation"; the material in the last part of what I quoted seems to be saying that these are the same except that I don't see how one can code the data at all for subjects you don't even know about; Terry, I would appreciate it if you would elucidate.

Note that there is a fourth relevant term in some literature (e.g., Andersen, et al. (1993), _Statistical Models Based on Counting Processes_, Springer-Verlag): left filtration; this appears to mean something like what I have called "delayed entry" above: "by filtering, we mean that the individual is not under observation all the time, but only when a suitable indicator process is switched on" (p. 1; see also pp. 166-7 where they discuss left-filtration, left-censoring and left-truncation).

The kernel of my use of "truncation" is that we know the event must have happened to some cases, but we don't know who or how many or anything else about them except that they must exist. This is consistent with what I think truncated data is in other areas of statistics. I hope that the above is clearer.
Thanks again,

Rich

From: William Gould
To: richgold@netcom.com, wgreene@stern.nyu.edu, fharrell@virginia.edu, therneau.terry@mayo.edu
Subject: Re: thanks, and some clarifications and questions
X-Organization: Stata Corporation
X-URL: http://www.stata.com
Date: Fri, 14 Mar 1997 16:49:30 -0600
Sender: wwg@stata.com

> 1. X had onset 1/1 and died before getting to the clinic; the analyst never
> hears of the existence of X; temporarily, at least, I call this left
> truncation.
>
> 2. Y had onset 1/1 and died before getting to the clinic; the analyst knows
> that Y was supposed to arrive on 3/1; when Y does not arrive, inquiries
> are made and it is learned that Y died sometime prior to 3/1 and after
> 1/1 but it is not known exactly when Y died; temporarily, at least, I
> call this left censoring.
>
> 3. Z had onset 1/1, came to the clinic on 3/1 and died on 4/1 (this is
> Terry's scenario); temporarily, at least, I call this delayed entry.

> a. which of these, if any, does your software handle and what do you call
> each?

1. There is no data, so Stata can handle it because there is nothing to code. There is no information that the analyst knows.

2. Stata cannot handle this case; the user would be forced to either exclude the observation (no bias associated with this) or guess an exact date (which might bias results).

3. Stata can handle this case as I have previously explained.

> b. regardless of whether your software can deal with none of these, one of
> these, two of them or all of them, are any of these the same in
> theoretical terms?

No. There is different information content in each of the observations. If, however, the researcher excludes the observation in case 2, then he is treating it as if it has no information, and so in treatment, it would be the same as case 1.

> is it correct to treat these as equivalent as I have defined them above?

It is correct to treat case 2 the same as case 1, but it is inefficient.

> that is, there are some people who think that left truncation is exactly the
> same as delayed entry (but it is not clear that they are using the terms as
> I have used them above)

In our experience, users use the phrases left truncation and delayed entry interchangeably and, in all cases, are referring to case (3). I do not have a name to suggest for either (1) or (2).

-- Bill
wgould@stata.com

Date: Fri, 14 Mar 1997 17:10:13 -0600
From: "Therneau, Terry M., Ph.D."
To: wwg@stata.com
Cc: fharrell@virginia.edu, wgreene@stern.nyu.edu
Subject: More words

I managed to forget to cc this --

> 1. X had onset 1/1 and died before getting to the clinic; the analyst never
> hears of the existence of X; temporarily, at least, I call this left
> truncation.
>
> 2. Y had onset 1/1 and died before getting to the clinic; the analyst knows
> that Y was supposed to arrive on 3/1; when Y does not arrive, inquiries
> are made and it is learned that Y died sometime prior to 3/1 and after
> 1/1 but it is not known exactly when Y died; temporarily, at least, I
> call this left censoring.
>
> 3. Z had onset 1/1, came to the clinic on 3/1 and died on 4/1 (this is
> Terry's scenario); temporarily, at least, I call this delayed entry.

This is a time scale question. When there is the possibility for patients to be lost between time 0 and study entry, we have "left truncation". In a study with left truncation occurring, we will have two kinds of patients, those who were actually truncated (your case 1), and those who did make it into the study (your case 3).
Left truncation is the process that is operative, and my preferred phrase would be "this study used a data collection process that was subject to left truncation". When I see a data set with (15, 28] as the first observation for a subject, I know that there is left truncation in the data. I would not normally describe this particular observation as "left truncated", or as "delayed" or any other thing. In my course, which will be the ASA 2-day in '98, I have been careful to say on my slide "This is known as left truncation" and not "This observation is left truncated" (which it isn't). And I see to my horror that my first note to Rich used exactly this `incorrect' language!

Terry

Date: Mon, 17 Mar 1997 05:45:19 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom22
To: Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: Re: left truncation (fwd)
Return-Receipt-to: richgold@netcom.com

fyi

---------- Forwarded message ----------
Date: Sun, 16 Mar 97 13:49:30 EST
From: William Greene
To: Richard Goldstein
Subject: Re: left truncation

Rich: I think you may be misreading my text (and my answer). Knowing that the distribution is truncated does not automatically imply that if you ignore the truncation, the coefficients you estimated are attenuated. That is known to be true in a small set of cases. Also, LIMDEP does not estimate "attenuated" coefficients. It does exactly what I described in the earlier note.

For example, suppose the true, underlying distribution of survival times is known to be Weibull, with parameter lambda = exp(-x'beta). The true parameters, beta, are of interest. Times are distributed from 0 to +infinity. Now, suppose that, in spite of this known distribution, we only observe individuals with t(i) >= T*. Then a log-likelihood function built up from f(t(i)|t(i)>=T*) = f(t(i))/S(T*), maximized with respect to beta (and sigma), gives consistent and, based on the data in hand, efficient estimates of beta and sigma.

The point here (and this relates to the earlier note that you forwarded to me this morning) is that the known distribution, common both to observed and unobserved individuals, is easily modified to produce an appropriate distribution conditioned on the observation condition t(i) >= T*. In answer to your question, LIMDEP estimates beta, not a scaled version of beta. It turns out that in the special case in which times are lognormally distributed and the covariates are normally distributed, least squares regression of log(time) on the covariates will estimate a scalar multiple of beta. But, this is not what LIMDEP does. It uses maximum likelihood and estimates beta directly. I think this addresses your question, but if not, do let me know.
Cheers,

Bill

Date: Mon, 17 Mar 1997 06:58:56 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom21
To: Frank Harrell
Subject: summary of vendor discussion
Return-Receipt-to: richgold@netcom.com

Frank: I think a summary of what I received goes something like this:

S-Plus and Stata (and epicure and egret also) can deal with left-truncated data (meaning cases not observed at all and not in the data set) only via the assumption that those not in the data set are like those in the data set; for data that is *partly* left truncated (delayed entry: they are observed starting some time after the date of initial interest and so are in the data set but with incomplete information, and don't fail, or become right-censored, until after they enter the data set), these packages can all handle this; if "left-censoring" means that we know they failed prior to entering the data set but we only know they failed prior to some particular "date", then none of the packages can handle this; Limdep handles left-truncated data, in parametric models only, by modifying the log-likelihood.

My questions:

1. do you think this is a fair summary (I will check with the vendors too)?

2. do you think it is too long (e.g., should I drop the left-censoring issue; recall that Keiding brought it up)?

3. Assuming that it is fair, is this a reasonable stand to take with Keiding (after reminding him that our article deals with software, unlike his article)?

Do you still think (I do!) that there is a difference between left-truncation and delayed entry? In particular, I would much rather assume that there is something different about those who don't survive into the data set as compared with those who do survive long enough to get into the data.

Thanks,

Rich

Date: Mon, 17 Mar 1997 12:51:44 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom9
To: Bill Greene, Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: summary of our discussion
Return-Receipt-to: richgold@netcom.com

Gentlemen: Below is a summary; please let me know asap if you have any problems with this. Thank you all very much for your help and your patience.

S-Plus and Stata (and epicure and egret also) can deal with left-truncated data (meaning cases not observed at all and not in the data set) only via the assumption that those not in the data set are like those in the data set; for data that is *partly* left truncated (delayed entry: they are observed starting some time after the date of initial interest and so are in the data set but with incomplete information, and don't fail, or become right-censored, until after they enter the data set), these packages can all handle this; if "left-censoring" means that we know they failed prior to entering the data set but we only know they failed prior to some particular "date", then none of the packages can handle this; Limdep handles left-truncated data, in parametric models only, by modifying the log-likelihood.

Thanks,

Rich

From: William Gould
To: richgold@netcom.com, wgreene@stern.nyu.edu, fharrell@virginia.edu, therneau.terry@mayo.edu
Subject: Re: summary of our discussion
X-Organization: Stata Corporation
X-URL: http://www.stata.com
Date: Mon, 17 Mar 1997 18:49:10 -0600
Sender: wwg@stata.com

I do not think what you wrote correctly summarizes the discussion.
The problem is with the first and last paragraphs:

> S-Plus and Stata (and epicure and egret also) can deal with left-truncated
> data (meaning cases not observed at all and not in the data set) only via
> the assumption that those not in the data set are like those in the data
> set;
>
> [...]
>
> Limdep handles left-truncated data, in parametric models only, by
> modifying the log-likelihood.

The first paragraph can be applied to any package -- Limdep included. If left-truncated data (as you define it) is not a problem, then its existence can be ignored. Reading what Bill Greene wrote, he does not have a solution for this problem nor could he. Surely no package can handle all the biases that could arise were the unobserved data different from the data that was collected. Think about the following thought experiment: I change the data not observed at all and not in the data set so as to vary the biases; do Limdep's answers really change?

Bill Greene should speak for himself, but I interpret what he wrote to be that in the case of parametric survival models, Limdep could condition on t(i) >= T*. This is what you define as *PARTLY* left-truncated data:

> for data that is *partly* left truncated (delayed entry: they are observed
> starting some time after the date of initial interest and so are in the data
> set but with incomplete information, and don't fail, or become
> right-censored, until after they enter the data set), these packages can all
> handle this;

Finally, I have no problem with the middle paragraph

> if "left-censoring" means that we know they failed prior to entering the
> data set but we only know they failed prior to some particular "date", then
> none of the packages can handle this;

If I had to summarize this, here is what I would write:

S-Plus and Stata (and epicure and egret also) can deal with data that is left truncated (delayed entry) in parametric and nonparametric survival models; Limdep can deal with this problem in parametric survival models. Subject j is observed at t_j, some time after onset of the condition leading to failure. The statistical procedure for handling such cases involves conditioning on the fact that the failure could not have occurred prior to t_j.

No package can handle "left-censoring", meaning that we know only that subjects failed at some unknown time prior to some particular time t_j*.

It is not clear to me that I have summarized this correctly because Bill Greene wrote,

    f(t(i)|t(i)>=T*) = ...

and he did not say whether T* can vary from observation to observation; I assume that it can.

-- Bill
wgould@stata.com

Date: Tue, 18 Mar 1997 05:24:03 -0800 (PST)
From: Richard Goldstein
X-Sender: richgold@netcom4
To: Terry Therneau, Bill Gould
cc: Frank Harrell
Subject: Re: summary of our discussion (fwd)
Return-Receipt-to: richgold@netcom.com

fyi

---------- Forwarded message ----------
Date: Mon, 17 Mar 97 16:38:21 EST
From: William Greene
To: Richard Goldstein
Subject: Re: summary of our discussion

Rich: Depending on your audience for this summary, I think I'd go one more iteration. It's not clear to me what the summary for S-Plus and Stata means. Do I (the reader) infer that these programs therefore make no special consideration for truncation? Or, do they assume that truncation does not cause any biases? Or what? I think all of the programs you looked at make the assumption that the observations not in the data set are like those that make it in, except, that is, for the truncation. Sounds pretty confusing.
The description for LIMDEP, strictly speaking, is true, but if I were reading this afresh, I can't say I'd know what Greene was doing about truncation, if anything.

On another front: LIMDEP for Windows 95/NT is a few inches away from completion. Are you interested in playing with a nearly final version?

Regards,

Bill

Date: Tue, 18 Mar 1997 10:47:34 -0600
From: "Therneau, Terry M., Ph.D."
To: richgold@netcom.com
Cc: fharrell@virginia.edu, wgould@stata.com, wgreene@stern.nyu.edu
Subject: Summary

1. Splus handles left truncation, but not left censoring, for the Cox model.

2. Splus handles left censoring (also interval censoring) for the parametric survival models, but not left truncation.

3. The left truncation for coxph appears to be more general than that described in Limdep. In Splus each subject in the study has a separate truncation time, while in Limdep it appears that a single overall truncation time can be specified. I could easily be mistaken on this.

Terry Therneau

PS What I have said for Splus also applies to SAS, at this time.
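To close, a sketch of the two other items this summary and Greene's note touch on -- left censoring in a parametric model, and the conditional likelihood f(t | t >= T*) = f(t)/S(T*) -- in the same R/S notation as the earlier (start, stop] sketch. All data values, variable names, and the negll function below are invented for illustration; the interval2 coding is that of the current R survival library and may differ from the 1997 S-Plus syntax, and nothing here is a transcription of LIMDEP's internals.

    library(survival)

    ## Left censoring in a parametric model: subject 1 below is known only to
    ## have failed before day 40.  With type = "interval2", (NA, t] codes a
    ## left-censored time and (t, t] an exact event time.
    d2 <- data.frame(lower = c(NA, 120,  80,  95,  60, 150),
                     upper = c(40, 120,  80,  95,  60, 150),
                     age   = c(61,  59,  72,  66,  70,  57))
    fit.cens <- survreg(Surv(lower, upper, type = "interval2") ~ age,
                        data = d2, dist = "weibull")

    ## Greene's conditional density f(t | t >= Tstar) = f(t)/S(Tstar) is the
    ## parametric analogue of the same truncation adjustment.  A hand-rolled
    ## negative log-likelihood for a Weibull with lambda_i = exp(-x_i'beta)
    ## shows where the S(Tstar) term enters (maximize with, e.g., optim in R).
    negll <- function(par, t, X, Tstar) {
      p      <- exp(par[1])                    # shape, kept positive
      lambda <- exp(-drop(X %*% par[-1]))
      logf   <- log(p) + p * log(lambda) + (p - 1) * log(t) - (lambda * t)^p
      logS   <- -(lambda * Tstar)^p            # log S(Tstar) at the same parameters
      -sum(logf - logS)                        # each term conditioned on t >= Tstar
    }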