Skip to main Content

A View to a Kill: Solutions

Once again, thank you to the SAS community who not only sent in their solutions, but have also been blogging about possible options for solving this problem. 

The problem posed was:

Imagine you work in a department that collects a large volume of data each month.  To analyse the data and feed valuable insight back to your business, you run an analysis procedure which requires that all the data from the current year are available in a single table.

There are no issues in the first few months of the year, however your business is growing and ever more data are accumulated throughout the year.  Recently you have found the cumulative data set is too big for the disk you have available. 

A colleague mentioned that an Amadeus Consultant (naturally) had recommended considering something called a SAS View.
The following program found here illustrates how you have been appending the data together (albeit a very small sample of the problem!). 

Solution 1:  SAS View

A better solution would be a VIEW (i.e. separating the descriptor portion from the data portion)...;

data all_months / view=all_months;
set  mon:;
run;

  

If the 'table' already existed then building a view will error.  Hence, the best defensive way would be to check for its existence To delete (or kill-off) the table first.  Proc DATASETS can perform this task.  One of its options is the KILL keyword.  Use with extreme caution as it will delete all members of the prescribed type in the library!

%macro viewtoakill;
  %if %sysfunc(exist(work.all_months,data)) %then %do;
   
    proc datasets lib=work mt=data /*kill*/ nodetails nolist;
      delete all_months;
    quit;
  %end;
  data all_months / view=all_months;
  set  mon:;
  run
%mend;
%viewtoakill 

Solution 2:  Views with Discussion

The following solution and discussion was sent in and illustrates clear experience of analysing the strengths and weaknesses of using SAS Views as an efficiency method for this problem:

General considerations:
- views cannot be executed on a different platform to the one on which they are created: in a client server architecture a Windows client couldn't access a Unix view created on a Linux sever, for example (a remote session running on the server would be able to access the view since the remote session would be running on the same platform;
- a view reflects the state of the data at run time, rather than at build time – is this desirable;
- a view is reliant on the stability and availability of the mon: datasets on which it depends;
- a view saves disk space, but is likely to increase the amount of processing required of the system since the code must run each time the view is accessed;
- could the storage or performance of the input datasets be improved – could they in turn be a view, could they be aggregated, is their page size optimized, should they be indexed or sorted;
- as usage of a view becomes business as usual consider reviewing its use
   - if users always summarize or sort the view, consider adapting the view to reflect this, or present a view (or table) that produces an aggregation of the data;
   - if joining/merging, rather than concatenation, is required, or is a next step then consider adding this into the view, a hash table lookup may be easiest;
- if the mon: datasets are not in the work library consider defining the source library within the DATA Step view:

   data <out> / view=<out>;
      _rc = libname(<lib>, <path>, <options);
      do until (_done);
         set <lib>.mon: end=_done;
      end;
      stop;
   run;

  

   if we really wanted to we could:
      > test for the existence of source libref using the libref() function,
      > capture the path() referenced by any such library,
      > clear the libref – eg use libref() again
      > assign the libref we want using libname()
      > add an end= variable, as above, either test for this using an if or do until loop
      > clear libref and reassign original library spec when end is reached

Specific considerations:
- I found using the exist() function to be quicker than scanning dictionary.tables or sashelp.vtable, validate in target environment – likely to be the case, especially where a large number of remote libraries is defined in an autoexec, etc;
- the DATA Step view could be made more complex for performance reasons, e.g. a macro generated code with DoW loops, but there is a cost attached to making it harder to read and maintain;
- the complexity in the early stages of the code allows it to be more robust, and should avoid the appearance of ERRORs in the log, as would be seen with something like:
      proc sql; drop table all_months; quit;
      proc sql; drop view all_months; quit;
- the format is probably overkill and could be replaced with conditional logic – I think the underlying idea of giving legible feedback is sound;
- the SQL Proc and the %sysfunc DATA Step both demonstrate different ways of defensively coding SAS to do nothing;
- limited parameterization by macro variables gives some flexibility and reusability, but could be taken further by exposing more parameters (eg input ds name, number of input ds, etc) or wrapping in a macro;
- the memtypes array gives more scalability than hard coding conditional logic, a format may have been more justified here;

Thank you for posting this entertaining quiz.