Zero Filling
There are many occasions when it is desirable to create a template data set containing variables initialised to zero. Where as this technique is possible using traditional data step techniques it can be more efficiently performed by using Proc SQL.
The example we will demonstrate looks at joining quality of life questionnaire data which in this case has a fixed structure of 10 questions. When filled in by 10 subjects in our intent to treat population we would ideally like to end up with a data set with 100 observations. Unfortunately, there may well be situations where a subject only answers 6 or 8 questions and in these circumstances we need to ensure we still end up with 10 observations for that subject, with the remaining questions initialised to 0.
We start with a data set containing our actual quality of life responses):
data qol_actual;
input subjid question_no question;
datalines;
1 1 2
1 2 10
1 3 4
1 4 3
1 5 10
1 6 10
1 7 6
1 8 6
2 1 9
2 2 6
2 3 9
2 4 1
2 5 10
3 1 3
3 2 3
3 3 7
3 4 10
3 5 3
3 6 7
3 7 5
3 8 6
3 9 3
3 10 5
4 1 3
4 2 4
4 3 5
4 4 7
4 5 2
4 6 2
4 7 9
4 8 3
4 9 10
4 10 10
5 1 2
5 2 7
6 1 3
6 2 7
6 3 5
6 4 1
6 5 4
6 6 8
6 7 2
6 8 2
6 9 6
6 10 3
7 1 6
7 2 8
7 3 10
7 4 6
7 5 2
8 1 4
8 2 7
8 3 2
8 4 2
8 5 3
8 6 7
8 7 5
8 8 1
8 9 3
8 10 5
9 1 9
9 2 9
9 3 9
9 4 3
9 5 4
9 6 4
9 7 4
9 8 8
9 9 6
9 10 7
10 1 9
10 2 6
10 3 8
10 4 2
10 5 1
10 6 9
10 7 7
10 8 1
10 9 9
10 10 4
run;
Knowing the total number of questions we can construct a template data set with each question initialised to missing:
data qol_template;
length question_no 8;
retain question 0;
do question_no = 1 to 10;
output;
end;
run;
Quality of life template
question_
Obs no question
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
We can now use Proc SQL to perform a Cartesian product using the CROSS JOIN with our demographic data set. Cartesian products merge each row from one table with each row from the other. This will produce a data set which repeats the structure created in Qol_template for each subject.
data demog;
input age gender $ height weight subjid;
datalines;
22 M 64 60 1
26 F 87 74 2
29 M 55 62 3
28 M 84 74 4
23 F 84 88 5
30 M 87 76 6
34 F 78 72 7
38 F 69 56 8
56 F 52 60 9
52 F 66 69 10
run;
proc sql;
create table allscores as
select d.subjid, q.*
from in.demog as d cross join qol_template as q
order by subjid, question_no;
quit;
Using the data step we can now join our actual quality of life data back onto this template, overwriting the initial values where present but otherwise ensuring that questions not answered are initialised to 0. This ensures that our final table always has 10 observations per subject.
data actualscores;
merge allscores in.Qol_actual;
by subjid question_no;
run;
Actual Scores
question_
Obs subjid no question
1 001 1 2
2 001 2 10
3 001 3 4
4 001 4 3
5 001 5 10
6 001 6 10
7 001 7 6
8 001 8 6
9 001 9 1
10 001 10 1
11 002 1 9
12 002 2 6
13 002 3 9
14 002 4 1
15 002 5 10
16 002 6 0
17 002 7 0
18 002 8 0
19 002 9 0
20 002 10 0
21 003 1 3
22 003 2 3


