Data management: notes about editing overview files

Contents List

Introduction
Missing data
Dates
Death cause codes
BC additional neoplasms
ER/PR measurements
w.b.c. measurements
Text
Checklist for creating and amending entries
Data from multi-centre trials

Introduction

Please note that updates to patient records are applied via a 'difference mask' strategy, which enables an audit trail to be left and mistakes to be contained. It is dangerous to edit the 'active' file, as errors are difficult to spot and undo and in any case the changes will be lost if the file is regenerated in the future. The correct way to apply updates is as follows:

Take a copy of the latest data file (OVERVIEW:nnn.NEW or OVERVIEW:nnn.DAT) into your private directory and edit all the changes into the local copy;
Run DIFFFE to produce a difference mask (BITnnn.DAT) between the latest data file and the modified local copy, and check the mask against the new information (sometimes it helps to look at the difference log DIFnnn.DAT too);
Incorporate the new mask into the existing update mask OVERVIEW:nnnBIT.NEW or OVERVIEW:nnnBIT.DAT, either by hand or using LAYBIT;
Run UPDATE to produce a new version of the data file (OVERVIEW:nnn.NEW or OVERVIEW:nnn.DAT) and check that it is identical with your local updated version (using the VMS DIFFERENCES utility);
Update the calendar file OVERVIEW:LBCnnn and run the usual checking procedures (STARE, INQUIST, LBASIC, LIFE etc). That's it!
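
The idea behind the steps above can be sketched in a few lines of Python. This is only an illustration of the 'difference mask' principle, not a description of how DIFFFE, LAYBIT or UPDATE actually work: the in-memory record layout (patient identifier mapped to field/value pairs), the field names and the function names make_mask, overlay_masks and apply_mask are all assumptions invented for the example.

```python
from typing import Dict

Record = Dict[str, str]        # field name -> value (hypothetical layout)
DataFile = Dict[str, Record]   # patient identifier -> record (hypothetical layout)


def make_mask(latest: DataFile, edited: DataFile) -> DataFile:
    """Keep only the fields that differ between the latest file and the edited copy."""
    mask: DataFile = {}
    for patient_id, new_record in edited.items():
        old_record = latest.get(patient_id, {})
        changes = {f: v for f, v in new_record.items() if old_record.get(f) != v}
        if changes:
            mask[patient_id] = changes
    return mask


def overlay_masks(existing: DataFile, new: DataFile) -> DataFile:
    """Lay a new mask over the existing one; later changes win field by field."""
    merged = {pid: dict(fields) for pid, fields in existing.items()}
    for pid, fields in new.items():
        merged.setdefault(pid, {}).update(fields)
    return merged


def apply_mask(base: DataFile, mask: DataFile) -> DataFile:
    """Regenerate a data file by applying a mask to an untouched base file."""
    updated = {pid: dict(record) for pid, record in base.items()}
    for pid, fields in mask.items():
        updated.setdefault(pid, {}).update(fields)
    return updated


# Worked example with made-up records.
base = {"0001": {"death_cause": "0", "date_death": "1987"}}
existing_mask = {"0001": {"date_death": "121987"}}
latest = apply_mask(base, existing_mask)

edited = {"0001": {"death_cause": "10", "date_death": "121987"}}
new_mask = make_mask(latest, edited)
combined = overlay_masks(existing_mask, new_mask)

# The regenerated file should be identical with the edited local copy.
regenerated = apply_mask(base, combined)
```

The final comparison corresponds to the check made with the VMS DIFFERENCES utility: because the base file is never edited directly, every change remains visible in the mask and can be backed out.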

Missing data

With the exception of second neoplasm sites, ALLC first remission dates and w.b.c. measurements (below), '0' or 'blank' always means 'not (yet) available'.

Dates

Ideally, a date is entered as a 7- or 8-digit number in the form DMMYYYY or DDMMYYYY.
Sometimes the day is not available, in which case please just put MYYYY or MMYYYY (ranged right).
Sometimes, too, only the year is available, in which case please just put YYYY (ranged right).
It isn't necessary to guesstimate the missing digits, which can confuse trialists who didn't supply them (missing information is automatically 'best-guessed' by the software when doing analyses).
Because there is a great deal of 'heritage' (pre-2000) information currently on file awaiting update, a 2-digit year 'YY' would currently be interpreted as '19YY'.
There is a special convention for 'date first remission' in the ALLC overview: because there is no separate 'yes/no' variable, '-1' denotes 'definitely no CR' and '-2' denotes 'CR on unknown date'.
When placing dates in the text area, please include leading underscores where necessary to reduce the risk that an incomplete date might be misconstrued as something else, e.g. __MMYYYY or ____YYYY.
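
By way of illustration, the date conventions above might be decoded along the following lines. This is a sketch under stated assumptions rather than the logic of any existing program: the function names parse_date_number and format_text_date are invented here, a bare 1- or 2-digit value is assumed to be a year 'YY', and the ALLC codes '-1' and '-2' are deliberately left for the caller to handle.

```python
from typing import Optional, Tuple

# (day, month, year); day and month are None when they were not supplied.
PartialDate = Tuple[Optional[int], Optional[int], int]

# Special ALLC 'date first remission' codes described above.
NO_CR = -1
CR_UNKNOWN_DATE = -2


def parse_date_number(value: int) -> Optional[PartialDate]:
    """Decode a right-ranged date number; 0 means 'not (yet) available'."""
    if value == 0:
        return None
    if value < 0:
        raise ValueError("negative values are special codes, not dates")
    digits = str(value)
    day: Optional[int] = None
    month: Optional[int] = None
    if len(digits) in (7, 8):        # DMMYYYY or DDMMYYYY
        digits = digits.zfill(8)
        day, month, year = int(digits[:2]), int(digits[2:4]), int(digits[4:])
    elif len(digits) in (5, 6):      # MYYYY or MMYYYY
        digits = digits.zfill(6)
        month, year = int(digits[:2]), int(digits[2:])
    elif len(digits) == 4:           # YYYY
        year = value
    elif len(digits) <= 2:           # heritage 'YY', read as '19YY'
        year = 1900 + value
    else:
        raise ValueError(f"unrecognised date form: {value}")
    if month is not None and not 1 <= month <= 12:
        raise ValueError(f"impossible month in {value}")
    if day is not None and not 1 <= day <= 31:
        raise ValueError(f"impossible day in {value}")
    return day, month, year


def format_text_date(date: PartialDate) -> str:
    """Render a date for the text area, with underscores for missing parts."""
    day, month, year = date
    dd = f"{day:02d}" if day is not None else "__"
    mm = f"{month:02d}" if month is not None else "__"
    return f"{dd}{mm}{year:04d}"
```

For example, parse_date_number(31987) gives (None, 3, 1987), which format_text_date renders as '__031987'. Note that nothing is guessed: missing days and months stay missing.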

Death cause codes

'0' means 'not (yet) available', usually because the trialist neglected to say what the patient died of.
'12' means 'unknown' in the sense that the trialist has tried and failed to find out what the patient died of.
Unless the trialist says 'unknown', please don't put in code '12' as this tells us to abandon hope of pursuing it further. If the trialist says 'unknown but no evidence of (overview) disease at death', please put '10' (i.e. non-disease cause) and enter something like "NED (d. unknown)" in the text field (as described below).
'9' really means 'none of the other categories' whereas '10' just means 'not the overview disease', either because that is all that we know or sometimes because more than one death cause is supplied (in which case please enter the death cause text as described below, so that A.N. Expert can adjudicate later).
[Note added for BC 2000 overview: there is now a proliferation of additional codes for uncertain cases - please consider the coding scheme accordingly].
[Note added for BC 2005 overview: there is now a proliferation of additional codes for uncertain cases - please consider the coding scheme accordingly].

BC additional neoplasms

A site code '199' means 'not (yet) supplied' (same ethos as '0', which could be mistaken for 'no additional neoplasm' if no date is supplied). A site code '1999' means 'unknown' in the sense that the trialist has tried and failed to find out the site.

ER/PR measurements

If the measurement is actually '0' please put '-13'! Otherwise, it would be mistaken for 'missing'. Sorry, this is for historical reasons.

w.b.c. measurements

If the measurement is actually '0' please put '-1'. Otherwise, it would be mistaken for 'missing'.
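
The 'true zero' conventions for ER/PR and w.b.c. measurements can be summarised in a short Python sketch. The two code values are taken from the notes above; the function and constant names are assumptions made up for the example.

```python
from typing import Optional

MISSING = 0            # '0' (or blank) means 'not (yet) available'
ER_PR_ZERO_CODE = -13  # stored in place of a genuine ER/PR measurement of 0
WBC_ZERO_CODE = -1     # stored in place of a genuine w.b.c. measurement of 0


def encode_measurement(value: Optional[float], zero_code: int) -> float:
    """Convert a real measurement (or None if missing) to its stored form."""
    if value is None:
        return MISSING
    if value == 0:
        return zero_code
    return value


def decode_measurement(stored: float, zero_code: int) -> Optional[float]:
    """Convert a stored value back to a real measurement (None if missing)."""
    if stored == MISSING:
        return None
    if stored == zero_code:
        return 0
    return stored
```

So encode_measurement(0, ER_PR_ZERO_CODE) gives -13, whereas encode_measurement(None, ER_PR_ZERO_CODE) gives 0, and decoding reverses both without confusing the two cases.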

Text

There is an infinitely-extensible space to put text, e.g. patient name, birth date, hospital number, N.H.S. number, life events, causes of death, multiple ICD codes, sundry comments etc etc.
A convention is applied, mainly to make it easier to scan files by eye but also in some cases to enable software to recognise the information (e.g. birth dates and multiple ICD codes), that these items are entered in the following way:
Order of multiple items, where available (suggestion only):

1.	Patient name	Given Names Surname (Maiden Surname)
2.	Birth date	(b. DDMMYYYY) or (b. __MMYYYY) or (b. ____YYYY)
3.	Hospital number	(XYZ456789A)
4.	N.H.S. number	NHS[ABC/123]
5.	Additional ER measurement	ER[code]
6.	Additional PR measurement	PR[code]
7.	Sentinel nodes	SN[nodes] where nodes might be a count (n), a fraction (n/n) or a percentage (n%)
8.	HER-2 assay	HER2[type of test,HER-2 measurement]
9.	FISH assay (HER-2/CEP-17 ratio)	FISH[type of FISH,FISH measurement]
10.	CISH assay	CISH[type of CISH,CISH measurement]
11.	Life events, e.g. additional malignancies, serious diseases	(event, DDMMYYYY) or event (DDMMYYYY) or (event, __MMYYYY) etc
12.	Multiple ICD codes for additional malignancies	3MAL[n,NNN,DDMMYYYY,n,NNN,DDMMYYYY] or 3MAL[n,NNN.N,DDMMYYYY,n,NNN.N,DDMMYYYY] etc, where n is 7, 8, 9 or 10
13.	If died definitely 'without disease', i.e. recurrence-free	NED (or NSR, if the trialist prefers)
14.	Cause of death	(d. cause, cause) or (d. 1a. cause, 1b. cause, 2. cause)
15.	Multiple ICD codes for death	ICDn[NNN,NNN] or ICDn[NNN.N,NNN,NNN.N] etc, where n is 7, 8, 9 or 10
16.	Other remarks	emigrated to Australia; flagged
17.	Text supplied by trialist, where strange	TEXT[strange illegibilia in tongues]

There is a program ICDADD which will append the text associated with ICD codes to the end of the record, where required, e.g. before printing out raw data for inspection. However, this can make the record too wide for some text editors and needs to be applied with care. It is also a pain if the ICD code is subsequently changed, as the text has to be stripped out again; for these reasons, ICDADD is not applied to the central data set.
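
The bracketed conventions in the table above are what make the text area machine-readable. As a rough sketch only (the regular expressions, the function name parse_text_field and the sample record are all invented for illustration, and this is not the code used by ICDADD or any other existing program), the birth date, the multiple ICD death codes and the simple TAG[...] items could be picked out like this:

```python
import re
from typing import Dict, List, Optional, Tuple

# (b. DDMMYYYY), (b. __MMYYYY) or (b. ____YYYY)
BIRTH_RE = re.compile(r"\(b\.\s*([_\d]{8})\)")
# ICDn[NNN,NNN.N,...] where n is 7, 8, 9 or 10
ICD_RE = re.compile(r"\bICD(7|8|9|10)\[([^\]]*)\]")
# Simple bracketed items listed in the table
TAG_RE = re.compile(r"\b(NHS|ER|PR|SN|HER2|FISH|CISH|TEXT)\[([^\]]*)\]")


def parse_text_field(text: str) -> Dict[str, object]:
    """Extract the machine-readable items from a free-text field."""
    birth = BIRTH_RE.search(text)
    birth_date: Optional[str] = birth.group(1) if birth else None

    icd_death: List[Tuple[int, List[str]]] = []
    for revision, codes in ICD_RE.findall(text):
        code_list = [c.strip() for c in codes.split(",") if c.strip()]
        icd_death.append((int(revision), code_list))

    tags: Dict[str, List[str]] = {}
    for name, body in TAG_RE.findall(text):
        tags.setdefault(name, []).append(body)

    return {"birth_date": birth_date, "icd_death": icd_death, "tags": tags}


# Made-up record, following the suggested order of items.
sample = ("Jane Mary Doe (Smith) (b. __031945) NHS[ABC/123] "
          "NED (d. unknown) ICD9[174.9,162]")
parsed = parse_text_field(sample)
```

Here parsed["birth_date"] is '__031945' (the underscores survive, so the missing day is still visibly missing) and parsed["icd_death"] is [(9, ['174.9', '162'])].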

Checklist for creating and amending entries

To add a new centre/group

Select the next free serial number in LIST

Allocate a slot in the relevant filing cabinet in the Overviews Office

Create a project file LBCnn, where nn is the centre/group number (the file template is LBC) and add in whatever details are available, including main personages and their contact details, general information and relevant publications; begin the calendar of events and action items (e.g. 'get data')

Add the main personages' contact details to ZAA

Add a banner line in LIST, including the centre/group name, country, main personages and date

Add details for at least one trial/stratum, as described below

Add a generic entry to BRIEF

(Optionally) run TERSE and RGBRIEF to update the YLIST and RGBRIEF files

Run RUGRAT and put the print-out into the filing cabinet together with any other related bumf

To add a new trial/stratum

Select the next free serial number for the relevant centre/group in LIST

Add a banner line in LIST, together with at least two treatment groups - (Unknown) suffices as a place-holder

Add a corresponding entry to BC:PLANET, if relevant, and check that the treatment agent descriptions in LIST are all 'generic', follow the usual machine-readable precepts and are represented in JG$DK:PLANET

Add entries to BRIEF, FINISH, GUESS, LURCHE, SERIAL and UPDATE, insofar as possible and where relevant

If data are available, process and add to the database (as described below)

Add appropriate entries to EXTRAS - at the very least, 'trials with no data' or 'rejects'

(Optionally) run TERSE and RGBRIEF to update the YLIST and RGBRIEF files

Run RUGRAT and put the print-out into the filing cabinet together with any other related bumf

Make sure that all of the treatment comparisons are added to the relevant steering files for subsequent analysis

To add new data

Normally, add a data conditioning module to TRAUMA and iterate with STARE, INQUIST and LIFE until the new data set is as close to perfect as can be achieved without pestering the trialist. Sometimes it is more convenient to write an ad hoc program than to use TRAUMA. In desperate cases, some help might be found by using external tabulation packages. If no individual patient data are available, although some tabular information has been obtained - typically from publications or reports - the module or program would instead generate a set of synthetic patient records to enable the analysis to do its best with whatever information is available

Should any questions arise, or any data errors be spotted, make a carefully-considered list of key points and edited highlights from the STARE and perhaps other print-outs to refer back to the trialist. It is paramount to 'measure twice and cut once' because most trialists and data managers tend to respond best to a few very clearly-framed and minimalist questions - if at all!

Amend all the relevant details in BRIEF, EXTRAS, FINISH, GUESS, LBCnn, LIST, LURCHE, SERIAL, UPDATE and other files mentioned in the 'to add a new trial/stratum' section (above)

(Optionally) run TERSE and RGBRIEF to update the YLIST and RGBRIEF files

Run INQUIST, LBASIC, LIFE, RUGRAT and STARE and put the print-outs into the filing cabinet together with any other related bumf. Send carefully-selected highlights to the relevant trialist(s) and/or data manager(s) unless questions have arisen which need to be dealt with first

To update existing data

The procedure follows the same general course as described in the 'to add new data' section (above), except that the new and existing records for each patient are carefully compared. This enables the quality of all baseline variables to be maintained and helps to prevent errors in the new data from degrading existing, usually well-verified results. Please see the Introduction (above)

Deleting items

The main watchword is 'DON'T'. The nuisance factor of past deletions lives on as a perpetual reminder and can entail the unwelcome tedium of groping through dusty old boxes of papers and near-unintelligible old computer files and media whilst trying to ascertain the provenance of various latter-day data sets and the associated decisions and actions taken down the ages. If the decision to remove a centre/group, trial/stratum or data set is imminent, please consider the alternatives first:

Centre/group: sometimes it is necessary to rationalise a collection of trials/strata which have been allocated to different centres/groups, or a centre/group might be subsumed by another or even cease to exist. Even if there is no possibility of data becoming 'orphaned' when doing so, please consider instead 'copying' all the relevant data and associated information into the new centre/group (don't forget to make certain that the trial/stratum identifiers on the patient records are all changed correctly), then marking the 'dead' trials/strata as 'rejects' (with appropriate pointers to where the 'live' ones can now be located) and leaving them intact, together with the old project file (LBCnn) and the other associated materials. If the bumf in the filing cabinet is moved and renumbered, please leave a note in the old slot too.

Trial/stratum: sometimes it is necessary to delete a duff stratum during the course of improving the description of a trial, often whilst its data set is being rebuilt. This isn't too bad, unless the serial number is subsequently recycled to refer to a different trial. Normally the number of strata increases during this process, so the problem can easily be avoided. Conversely, if a complete trial is to be removed for any reason, however 'nonexistent', 'ineligible' or 'duplicated' it might be, STOP! There is a real danger that someone in the future will 'rediscover' it and waste a great deal of time putting it back into the system before finally arriving at the same conclusions. Please just leave it in situ and mark as 'reject' with an explanatory note (as described below).

Patient record: of course, the numbers of patients in data sets can go down as well as up - for example, some might prove not to have been properly randomised. Beware when making subsequent updates that such patients are not inadvertently reintroduced just because their information is resupplied by the trialist!

Excluding or 'hiding' items

When a trial/stratum is to be excluded, please note that there are three options in EXTRAS:

Reject studies: these are truly rubbish (e.g. empty trials, duplications, forgeries etc etc).

Studies with dubious randomisation: 'nonstandard' is a polite but meaningless word used in LIST to avoid possible offence should a trialist see it. These include trials with the possibility of foreknowledge of the treatment allocation, including 'date-of-birth' or 'odd-even' allocations, and trials with significant numbers of missing patients or other serious imbalances between treatment allocations or prognostic variables (whether as a result of deliberate 'exclusions' by the trialists or by mistake or chance, unless exhaustively investigated). Some of these trials are otherwise well-conducted and the data might be useful for incidence and other epidemiological analyses where the allocations are not relevant.

Studies not eligible for overview: these are otherwise properly-conducted randomised trials addressing questions not currently of interest to the politburo, e.g. DCIS, advanced disease, some primary treatments and various non-fatal endpoints. Trials which would be ignored only on account of their arcane treatment comparisons are kept as 'eligible', but they are relegated to appropriately arcane steering files such as BAG.

A trial/stratum may, of course, be marked as a 'reject' in more than one of the above categories. It is extremely helpful to append a brief explanatory phrase to each 'reject' item in the comments field of EXTRAS.

Updating LIST

As you will see, there are various conventions in LIST to enable software to read it:

The centre/group entry line consists of the centre/group's name together with, optionally, the city and, mandatorily, the country, with a section at the end enclosed in {curly brackets} indicating the principal personages and the date any records were last modified, e.g.:

11   Texas University, U.S.A. {Buzdar, Gutterman, Vassilopolou-Sellin: DEC-1994 (SECOND REVISED VERSION)}

The trial/stratum entry line consists of the trial/stratum's name with, optionally, the principal personages enclosed in (normal brackets) and followed by a colon. After the colon are listed entry criteria, prognostic factors and other remarks including the range of entry dates in MMM-YYYY format, each item separated by semicolons. At the end is a section enclosed in (normal brackets) indicating the number of patients and where we don't have data. Further optional comments in {curly brackets} can be selectively hidden when viewing or printing the file using CUTTER and other programs, e.g.:

1101   Study 77-30A (Buzdar): Entry JUN-1977 to JUL-1980; N+; 2-Way (141 patients were entered)

The treatment entry lines consist of the "full names" of all agents together with doses [in square brackets] and other details (in normal brackets). The "full names", known proprietary names and "official abbreviations" are listed in JG$AK:PLANET. The point of this is to avoid any possibility of ambiguity (if required, a "compact" version of LIST with abbreviations can be generated using TERSE). Where we really don't know whether an agent was Prednisone or Prednisolone, or Vincristine or Vinblastine, or 5-Fluoro-Uracil or Ftorafur for example, it is necessary to fall back on the "official abbreviation" (Pr, V and F respectively) until the ambiguity can be resolved, e.g.:

1   (5-Fluoro-Uracil + Doxorubicin + Cyclophosphamide) x 2yr
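
To show why the LIST conventions matter to software, here is a minimal Python sketch that reads the two example header lines quoted above. It is an assumption-laden illustration, not the parser used by TERSE, CUTTER or any other program: the regular expressions and the function names parse_centre_line and parse_trial_line are invented, and only the simple forms shown in the examples are handled.

```python
import re
from typing import Dict, List, Optional

CENTRE_RE = re.compile(r"^(\d+)\s+(.*?)\s*\{([^{}]*)\}\s*$")
TRIAL_RE = re.compile(
    r"^(\d+)\s+([^:]*?)(?:\s*\(([^)]*)\))?:\s*(.*?)\s*\(([^()]*)\)\s*$"
)
CURLY_RE = re.compile(r"\s*\{[^{}]*\}")   # optional, hideable comments
DATE_RE = re.compile(r"[A-Z]{3}-\d{4}")   # MMM-YYYY


def parse_centre_line(line: str) -> Dict[str, object]:
    """Split a centre/group line into serial, name, personages and date."""
    match = CENTRE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a centre/group line: {line!r}")
    serial, name, braces = match.groups()
    people, _, remainder = braces.partition(":")
    date = DATE_RE.search(remainder)
    modified: Optional[str] = date.group(0) if date else None
    return {
        "serial": int(serial),
        "name": name,
        "personages": [p.strip() for p in people.split(",") if p.strip()],
        "modified": modified,
    }


def parse_trial_line(line: str) -> Dict[str, object]:
    """Split a trial/stratum line into its colon- and semicolon-delimited parts."""
    visible = CURLY_RE.sub("", line).strip()   # drop {curly} comments first
    match = TRIAL_RE.match(visible)
    if match is None:
        raise ValueError(f"not a trial/stratum line: {line!r}")
    serial, name, people, remarks, patients = match.groups()
    personages: List[str] = (
        [p.strip() for p in people.split(",") if p.strip()] if people else []
    )
    return {
        "serial": int(serial),
        "name": name.strip(),
        "personages": personages,
        "items": [item.strip() for item in remarks.split(";") if item.strip()],
        "entry_dates": DATE_RE.findall(remarks),
        "patients": patients.strip(),
    }


centre = parse_centre_line(
    "11   Texas University, U.S.A. {Buzdar, Gutterman, Vassilopolou-Sellin: "
    "DEC-1994 (SECOND REVISED VERSION)}"
)
trial = parse_trial_line(
    "1101   Study 77-30A (Buzdar): Entry JUN-1977 to JUL-1980; N+; 2-Way "
    "(141 patients were entered)"
)
```

A line that strays from the conventions raises an error instead of being silently misread, which is the behaviour one would want from anything that reads LIST automatically.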

Data from multi-centre trials

Often a centre/group has sent us a dataset from their 'local' part of a multi-centre trial, or at least we have made a placeholder in the hope of obtaining data from them. Similarly, we might have received a dataset from the coördinating centre/group for the multi-centre trial, consisting typically of records collected from various participating centres/groups.
What to do when we have both? The safest answer is probably to operate 'parallel universes', as follows. Datasets supplied directly from participating centres/groups are catalogued and stored under serial numbers appropriate to each centre/group, as would normally be done with single trials. Datasets supplied from the coördinating centre/group are catalogued and stored under serial numbers appropriate to the coördinating centre/group with (if possible) a separate stratum for each participating centre/group.
The advantages of holding datasets of different provenance in parallel include the ease of tracking and updating (one deals directly with whoever supplied each set) and the possibility of comparing the results and using the 'better' data in analyses. Of course, if it is possible to compare paired patient records between the datasets a 'best of both' can be constructed.
By harmonising the 'year numbers' (in SERIAL) and the 'brief names' (in BRIEF) one can represent the various parts of a multi-centre trial as a single entity in forest plots, if required, selecting which datasets to use via the 'reject' option (in EXTRAS).


[End of document, updated to 27 July 2006]