Data management: notes about editing overview files


Please note that updates to patient records are applied via a 'difference mask' strategy, which enables an audit trail to be left and mistakes to be contained. It is dangerous to edit the 'active' file, as errors are difficult to spot and undo and in any case the changes will be lost if the file is regenerated in the future. The correct way to apply updates is as follows:
    1. Take a copy of the latest data file (OVERVIEW:nnn.NEW or OVERVIEW:nnn.DAT) into your private directory and edit all the changes into the local copy;
    2. Run DIFFFE to produce a difference mask (BITnnn.DAT) between the latest data file and the modified local copy, and check the mask against the new information (sometimes it helps to look at the difference log DIFnnn.DAT too);
    3. Incorporate the new mask into the existing update mask OVERVIEW:nnnBIT.NEW or OVERVIEW:nnnBIT.DAT, either by hand or using LAYBIT;
    4. Run UPDATE to produce a new version of the data file (OVERVIEW:nnn.NEW or OVERVIEW:nnn.DAT) and check that it is identical with your local updated version (using the VMS DIFFERENCES utility);
    5. Update the calendar file OVERVIEW:LBCnnn and run the usual checking procedures (STARE, INQUIST, LBASIC, LIFE etc). That's it!
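
The cycle above can be pictured with a small Python sketch. It is illustrative only: the real layouts of the data file and of BITnnn.DAT are not described in these notes, so a data file is assumed here to be a dictionary of records keyed by a patient identifier, and `make_mask`, `merge_masks` and `apply_mask` are hypothetical stand-ins for DIFFFE, LAYBIT and UPDATE rather than descriptions of them.

```python
# Hypothetical sketch of the 'difference mask' cycle; NOT the real DIFFFE/LAYBIT/UPDATE.
# Assumption: a data file is a dict of the form {patient_id: {field: value}}.
import copy


def make_mask(latest, edited):
    """Return {patient_id: {field: new_value}} for every field that was changed."""
    mask = {}
    for pid, record in edited.items():
        old = latest.get(pid, {})
        changes = {field: value for field, value in record.items()
                   if old.get(field) != value}
        if changes:
            mask[pid] = changes
    return mask


def merge_masks(existing, new):
    """Fold a new mask into the cumulative mask; the later change wins."""
    merged = copy.deepcopy(existing)
    for pid, changes in new.items():
        merged.setdefault(pid, {}).update(changes)
    return merged


def apply_mask(base, mask):
    """Build an updated data file from a base file plus a mask; the base is untouched."""
    result = copy.deepcopy(base)
    for pid, changes in mask.items():
        result.setdefault(pid, {}).update(changes)
    return result


latest = {"0001": {"death_date": 0, "death_cause": 0}}
edited = {"0001": {"death_date": 12031998, "death_cause": 10}}  # the private copy

mask = make_mask(latest, edited)            # step 2
cumulative = merge_masks({}, mask)          # step 3
regenerated = apply_mask(latest, cumulative)  # step 4
print(regenerated == edited)                # the step 4 identity check
```

The point of the design is that the 'active' file is never edited directly: only the mask changes, so every update leaves a record of itself and can be backed out. The sketch does not handle removed fields or records.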

Missing data

With the exception of second neoplasm sites, ALLC first remission dates and w.b.c. measurements (below), '0' or 'blank' always means 'not (yet) available'.


Ideally, dates are represented as a 7- or 8-digit number representing DMMYYYY or DDMMYYYY.

Sometimes the day is not available, in which case please just put MYYYY or MMYYYY (ranged right).

Sometimes, too, only the year is available, in which case please just put YYYY (ranged right).

It isn't necessary to guesstimate the missing digits, which can confuse trialists who didn't supply them (missing information is automatically 'best-guessed' by the software when doing analyses).

Because there is a great deal of 'heritage' (pre-2000) information currently on file awaiting update, a 2-digit year 'YY' would currently be interpreted as '19YY'.
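
As a rough illustration of how these right-justified numeric dates can be read, the Python sketch below splits a value into day, month and year, using 0 for any part that was not supplied. The function name `split_date` is hypothetical, and this is not the analysis software's own routine (which also 'best-guesses' missing parts; that step is not shown). Special negative codes are simply treated as missing here.

```python
# Sketch only: split a right-justified numeric date into (day, month, year).
# A component of 0 means 'not supplied'; nothing is guessed.
def split_date(value):
    if value <= 0:
        return (0, 0, 0)  # missing; special negative codes are not interpreted here
    digits = str(value)
    n = len(digits)
    if n in (7, 8):      # DMMYYYY or DDMMYYYY
        return (int(digits[:-6]), int(digits[-6:-4]), int(digits[-4:]))
    if n in (5, 6):      # MYYYY or MMYYYY
        return (0, int(digits[:-4]), int(digits[-4:]))
    if n == 4:           # YYYY
        return (0, 0, int(digits))
    if n <= 2:           # YY, currently read as 19YY
        return (0, 0, 1900 + int(digits))
    raise ValueError(f"cannot interpret date value {value!r}")


print(split_date(5031998))   # (5, 3, 1998)
print(split_date(121998))    # (0, 12, 1998)
print(split_date(87))        # (0, 0, 1987)
```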

There is a special convention for 'date first remission' in the ALLC overview: because there is no separate 'yes/no' variable, '-1' denotes 'definitely no CR' and '-2' denotes 'CR on unknown date'.

When placing dates in the text area, please include leading underscores where necessary to reduce the risk that an incomplete date might be misconstrued as something else, e.g. __MMYYYY or ____YYYY.
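
A minimal Python sketch of the underscore padding, assuming the date parts are held as integers with 0 for 'not supplied'; the helper name `text_date` is invented for illustration.

```python
# Sketch only: render a date for the text area, padding missing leading parts with '_'.
def text_date(day=0, month=0, year=0):
    if not year:
        raise ValueError("a year is required")
    mm = f"{month:02d}" if month else "__"
    # A day without a month is meaningless, so it is dropped in that case.
    dd = f"{day:02d}" if day and month else "__"
    return f"{dd}{mm}{year:04d}"


print(text_date(5, 3, 1941))           # 05031941
print(text_date(month=3, year=1941))   # __031941
print(text_date(year=1941))            # ____1941
```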

Death cause codes

'0' means 'not (yet) available', usually because the trialist neglected to say what the patient died of.

'12' means 'unknown' in the sense that the trialist has tried and failed to find out what the patient died of.

Unless the trialist says 'unknown', please don't put in code '12' as this tells us to abandon hope of pursuing it further. If the trialist says 'unknown but no evidence of (overview) disease at death', please put '10' (i.e. non-disease cause) and enter something like "NED (d. unknown)" in the text field (as described below).

'9' really means 'none of the other categories' whereas '10' just means 'not the overview disease', either because that is all that we know or sometimes because more than one death cause is supplied (in which case please enter the death cause text as described below, so that A.N. Expert can adjudicate later).

[Note added for BC 2000 overview: there is now a proliferation of additional codes for uncertain cases - please consider the coding scheme accordingly].

[Note added for BC 2005 overview: there is now a proliferation of additional codes for uncertain cases - please consider the coding scheme accordingly].

BC additional neoplasms

A site code '199' means 'not (yet) supplied' (same ethos as '0', which could be mistaken for 'no additional neoplasm' if no date is supplied). A site code '1999' means 'unknown' in the sense that the trialist has tried and failed to find out the site.

ER/PR measurements

If the measurement is actually '0' please put '-13'! Otherwise, it would be mistaken for 'missing'. Sorry, this is for historical reasons.

w.b.c. measurements

If the measurement is actually '0' please put '-1'. Otherwise, it would be mistaken for 'missing'.
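
The two zero conventions (ER/PR and w.b.c.) can be summarised in a short Python sketch. The sentinel values are the ones given in these notes; the names `ZERO_SENTINEL`, `encode` and `decode` are hypothetical, and `None` is used here to stand for 'not (yet) available'.

```python
# Sketch only: stored value 0 means 'missing', so a genuine zero needs a sentinel.
ZERO_SENTINEL = {"ER": -13, "PR": -13, "WBC": -1}


def encode(kind, measurement):
    """Value to put on file; measurement=None means 'not (yet) available'."""
    if measurement is None:
        return 0
    if measurement == 0:
        return ZERO_SENTINEL[kind]
    return measurement


def decode(kind, stored):
    """Recover the measurement from the value on file (None if missing)."""
    if stored == 0:
        return None
    if stored == ZERO_SENTINEL[kind]:
        return 0
    return stored


print(encode("ER", 0))     # -13
print(decode("WBC", -1))   # 0
print(decode("PR", 0))     # None
```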


There is an infinitely-extensible space to put text, e.g. patient name, birth date, hospital number, N.H.S. number, life events, causes of death, multiple ICD codes, sundry comments etc etc.

A convention is applied, mainly to make it easier to scan files by eye but also in some cases to enable software to recognise the information (e.g. birth dates and multiple ICD codes), that these items are entered in the following way:

Order of multiple items, where available (suggestion only):

1. Patient name: Given Names Surname (Maiden Surname)
2. Birth date: (b. DDMMYYYY) or (b. __MMYYYY) or (b. ____YYYY)
3. Hospital number: (XYZ456789A)
4. N.H.S. number: NHS[ABC/123]
5. Additional ER measurement: ER[code]
6. Additional PR measurement: PR[code]
7. Sentinel nodes: SN[nodes], where nodes might be a count (n), a fraction (n/n) or a percentage (n%)
8. HER-2 assay: HER2[type of test,HER-2 measurement]
9. FISH assay (HER-2/CEP-17 ratio): FISH[type of FISH,FISH measurement]
10. CISH assay: CISH[type of CISH,CISH measurement]
11. Life events, e.g. additional malignancies, serious diseases: (event, DDMMYYYY) or event (DDMMYYYY) or (event, __MMYYYY) etc
12. Multiple ICD codes for additional malignancies: 3MAL[n,NNN,DDMMYYYY,n,NNN,DDMMYYYY] or 3MAL[n,NNN.N,DDMMYYYY,n,NNN.N,DDMMYYYY] etc, where n is 7, 8, 9 or 10
13. If died definitely 'without disease', i.e. recurrence-free: NED (or NSR, if the trialist prefers)
14. Cause of death: (d. cause, cause) or (d. 1a. cause, 1b. cause, 2. cause)
15. Multiple ICD codes for death: ICDn[NNN,NNN] or ICDn[NNN.N,NNN,NNN.N] etc, where n is 7, 8, 9 or 10
16. Other remarks: emigrated to Australia; flagged
17. Text supplied by trialist, where strange: TEXT[strange illegibilia in tongues]

There is a program ICDADD which will append the text associated with ICD codes to the end of the record, where required, e.g. before printing out raw data for inspection. However, this can make the record too wide for some text editors and needs to be applied with care. It is also a pain if the ICD code is subsequently changed, as the text has to be stripped out again; for these reasons, ICDADD is not applied to the central data set.
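
To show why the bracketed conventions matter to software, here is a hedged Python sketch that picks a birth date and the death ICD codes out of a text area. The regular expressions and the function `parse_text` are assumptions made for illustration; they are not the patterns used by the overview software, and the sample record is invented.

```python
# Sketch only: recognise '(b. DDMMYYYY)' and 'ICDn[...]' items in the text area.
import re

BIRTH = re.compile(r"\(b\. ([_\d]{8})\)")       # digits or leading underscores
ICD = re.compile(r"ICD(7|8|9|10)\[([^\]]*)\]")  # ICD revision, then the code list


def parse_text(text):
    found = {}
    match = BIRTH.search(text)
    if match:
        found["birth"] = match.group(1)
    match = ICD.search(text)
    if match:
        found["icd_revision"] = int(match.group(1))
        found["icd_codes"] = [c.strip() for c in match.group(2).split(",") if c.strip()]
    return found


sample = "A. N. Other (b. __031941) NHS[ABC/123] (d. myocardial infarction) ICD9[410.9,250]"
print(parse_text(sample))
# {'birth': '__031941', 'icd_revision': 9, 'icd_codes': ['410.9', '250']}
```

Items that do not follow the conventions are simply not found, which is one reason free text from the trialist is better wrapped in TEXT[...] than left loose.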

Checklist for creating and amending entries

To add a new centre/group

To add a new trial/stratum

To add new data

To update existing data

Deleting items

The main watchword is 'DON'T'. The nuisance factor of past deletions lives on as a perpetual reminder and can entail the unwelcome tedium of groping through dusty old boxes of papers and near-unintelligible old computer files and media whilst trying to ascertain the provenance of various latter-day data sets and the associated decisions and actions taken down the ages. If the decision to remove a centre/group, trial/stratum or data set is imminent, please consider the alternatives first:

Excluding or 'hiding' items

When a trial/stratum is to be excluded, please note that there are three options in EXTRAS:
  • Reject studies: these are truly rubbish (e.g. empty trials, duplications, forgeries etc etc).
  • Studies with dubious randomisation: 'nonstandard' is a polite but meaningless word used in LIST to avoid possible offence should a trialist see it. These include trials with the possibility of foreknowledge of the treatment allocation, including 'date-of-birth' or 'odd-even' allocations, and trials with significant numbers of missing patients or other serious imbalances between treatment allocations or prognostic variables (whether as a result of deliberate 'exclusions' by the trialists or by mistake or chance, unless exhaustively investigated). Some of these trials are otherwise well-conducted and the data might be useful for incidence and other epidemiological analyses where the allocations are not relevant.
  • Studies not eligible for overview: these are otherwise properly-conducted randomised trials addressing questions not currently of interest to the politburo, e.g. DCIS, advanced disease, some primary treatments and various non-fatal endpoints. Trials which would be ignored only on account of their arcane treatment comparisons are kept as 'eligible', but they are relegated to appropriately arcane steering files such as BAG.

A trial/stratum may, of course, be marked as a 'reject' in more than one of the above categories. It is extremely helpful to append a brief explanatory phrase to each 'reject' item in the comments field of EXTRAS.

Updating LIST

As you will see, there are various conventions in LIST to enable software to read it:

Data from multi-centre trials

Often a centre/group has sent us a dataset from their 'local' part of a multi-centre trial, or at least we have made a placeholder in the hope of obtaining data from them. Similarly, we might have received a dataset from the coördinating centre/group for the multi-centre trial, consisting typically of records collected from various participating centres/groups.

What to do when we have both? The safest answer is probably to operate 'parallel universes', as follows. Datasets supplied directly from participating centres/groups are catalogued and stored under serial numbers appropriate to each centre/group, as would normally be done with single trials. Datasets supplied from the coördinating centre/group are catalogued and stored under serial numbers appropriate to the coördinating centre/group with (if possible) a separate stratum for each participating centre/group.

The advantages of holding datasets of different provenance in parallel include the ease of tracking and updating (one deals directly with whoever supplied each set) and the possibility of comparing the results and using the 'better' data in analyses. Of course, if it is possible to compare paired patient records between the datasets a 'best of both' can be constructed.

By harmonising the 'year numbers' (in SERIAL) and the 'brief names' (in BRIEF) one can represent the various parts of a multi-centre trial as a single entity in forest plots, if required, selecting which datasets to use via the 'reject' option (in EXTRAS).

[End of document, updated to 27 July 2006]