A Tutorial on SAS language


  • Author: Ajay Ohri
  • Date: 07 Sep 2016
  • Copyright: Image appears courtesy of Getty Images

SAS (pronounced "sass") once stood for "Statistical Analysis System" but is now known simply as SAS. It is a computer language for statistical computing. Long before the terms data science, business analytics and business intelligence were coined, SAS was created at North Carolina State University to do the important and useful task of turning raw data into analysis using code and statistics.

The SAS System is a suite of products that SAS Institute has been selling since 1976.

Jim Goodnight has been the CEO of SAS Institute and the vision behind the growth of the SAS language for almost 40 years. John Sall has made additional contributions to statistics by creating JMP (he was interviewed here at StatisticsViews).

SAS Institute has consistently ranked as one of the best employers in the USA. Recently, however, the R and Python languages have challenged the traditional share of the SAS language in statistical computing, while SPSS has been acquired by IBM.


Tutorial Overview

This tutorial is here to help a reader learn the simple SAS language, and perhaps to inspire other languages to be both simple and responsive to a wide diversity of users, from beginners to advanced, from academics to enterprises of various sizes and across geographies. The tutorial is based on the SAS Studio interface provided for free in the SAS University Edition.

The tutorial is divided into the following parts to facilitate understanding and navigation for the reader.

  • Introduction: Here you will learn about SAS and how to install it
  • Data Input: Here you will learn about getting your data into the SAS system
  • Data Inspection: Here you check that the data was imported correctly and check for errors, especially those due to column formats or delimiters
  • Data Manipulation: Here you will learn about shaping the data into the correct format and type required for analysis
  • Advanced Data Querying using PROC SQL
  • Exploratory Data Analysis: Here you will examine the data for exploration and summarization
  • Data Visualization: Here you will explore the data visually
  • Advanced Topics: Here you will learn about additional tools in the SAS language ecosystem and how to keep your knowledge current

Introduction

Download: You can download and install SAS University Edition for free from http://www.sas.com/en_in/software/university-edition/download-software.html

It is basically a 2 GB virtual machine, which I am running using the Oracle VM VirtualBox version. I just imported the appliance using File > Import Appliance.

Selection_110.png

The actual SAS software interface runs within a browser window at a specific local address that is shown at startup of the virtual machine (for me it is 192.168.1.34 or localhost:10080, as you can see at the top of the browser in the screenshots that follow).

Interface: SAS has three basic windows where we can write code, see progress and look at output. These three parts are:

  • Code Editor, where we write and modify code
  • Log Window. The log is a record of everything that we do in a SAS session or SAS program. It includes the following:
      • Program statements or code
      • Messages that include NOTE, INFO, WARNING, ERROR, or an error number
      • Processing time
  • Results, which displays printed output

Apart from this, we have the SAS Explorer. The SAS libraries, folders and files that include data, saved code and saved output can be found here.

A SAS Library is a virtual reference to a place in a file system. SAS libraries contain SAS Datasets. Datasets contain data in the form of observations (rows) and variables (columns).

A SAS script consists of two main types of program steps, known as DATA steps and PROC steps.

The DATA step is the primary method for creating or manipulating a SAS data set. A DATA step is a group of SAS language statements that begins with a DATA statement and contains programming statements that create SAS data sets from raw data files or manipulate existing SAS data sets. SAS data sets consist of data stored in particular formats recognised by the SAS System. A SAS data set consists of two key parts: descriptor information and data values. The descriptor information describes the contents of the SAS data set to the SAS System, and the data values are data which have been collected or calculated. The data are organised into a table, which consists of rows, called observations, and columns, called variables.

Apart from the DATA step we have the PROC steps (procedures): pre-written routines that analyze and process data in a SAS data set and then produce a report. Some PROCs that are very useful for the statistical programmer are PROC IMPORT, PROC PRINT, PROC MEANS, PROC SUMMARY, PROC CORR, PROC SORT, PROC UNIVARIATE, PROC SQL, PROC REG, PROC LOGISTIC, PROC SGPLOT and PROC EXPORT. We will cover most of these in our tutorial.

A SAS program can consist of a DATA step or a PROC step or any combination of DATA and PROC steps. A DATA step or PROC step normally ends with a RUN statement (a few procedures, such as PROC SQL, end with QUIT instead).

To comment, use a slash followed by an asterisk, i.e. /* Insert your comment here */ (or select the text and press Ctrl + /).

SAS is thus different from object-oriented languages like Python and R. Each SAS statement ends with a semicolon (;), and a missing semicolon is quite often the source of mistakes, especially for beginners in the language. SAS statements are free-format: they can begin and end anywhere on a line, and one statement can continue over several lines.
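As a small sketch of these rules, the program below combines one DATA step and one PROC step, spreads a single statement over several lines, and uses comments. It assumes the SASHELP.CLASS sample data set that ships with SAS; the data set name teens and the variable bmi are made up for illustration.

```sas
/* DATA step: read a sample data set and add a derived variable */
data teens;
    set sashelp.class;
    /* one statement spread over several lines, ended by one semicolon */
    bmi = 703 * weight
          / (height * height);
run;

/* PROC step: print the first five rows of the new data set */
proc print data=teens (obs=5);
run;
```

Leaving out the semicolon after the SET statement is a typical beginner error; the log will then report an ERROR near the next line.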

Data Input

We can input raw data using the INPUT and DATALINES statements in the DATA step. The INPUT statement defines the variables to be read from each line of data. The DATALINES statement tells SAS that the DATA step statements are complete and that the next lines contain the actual data. Notice that the lines of data do not end in a semicolon. For character variables we put a $ sign after the variable name.

data first;
input name $ age height;
datalines;
Ajay 38 177
Vijay 35 164
Kumar 43 156
John 26 182
Steve 38 187
;
run;

Congrats! You just wrote your first SAS language program!

How do we check whether our data is correct or not? We use PROC PRINT to print out the dataset. The results appear in the RESULTS window. Note that we SUBMIT the program after writing it in the Editor window, using F3 or the RUN button at the top.

proc print data=first;
run;

Obs    name     age    height
1      Ajay     38     177
2      Vijay    35     164
3      Kumar    43     156
4      John     26     182
5      Steve    38     187

Wow! Our data is neatly printed in rows and columns. The column names are the variables and the rows are the records, as we mentioned before. The DATA step above created a dataset called first. It is a temporary dataset, since we did not assign it to a permanent library (one that refers to an actual file path). A temporary dataset is deleted when we close the SAS session. The temporary datasets can be listed using PROC DATASETS (the temporary library is WORK).

proc datasets lib=work;
run;
quit;

Directory

Libref               WORK
Engine               V9
Physical Name        /tmp/SAS_workD0DB00001605_localhost.localdomain/SAS_work885900001605_localhost.localdomain
Filename             /tmp/SAS_workD0DB00001605_localhost.localdomain/SAS_work885900001605_localhost.localdomain
Inode Number         144775
Access Permission    rwx------
Owner Name           sasdemo
File Size            4KB
File Size (bytes)    4096

#    Name       Member Type    File Size    Last Modified
1    FIRST      DATA           128KB        05/23/2016 20:49:45
2    REGSTRY    ITEMSTOR       32KB         05/23/2016 20:40:44
3    SASGOPT    CATALOG        12KB         05/23/2016 20:49:45
4    SASMAC1    CATALOG        188KB        05/23/2016 20:40:49

One complaint some R users have is that SAS programs print a lot of information. This is a deliberate design philosophy, so that users do not end up searching for that information.

The PROC DATASETS output lists all temporary datasets. But how do we save our data permanently? We assign it to a permanent library. First we create a library using the LIBNAME statement. Then we copy the temporary dataset into this library.

libname ajay3 '/folders/myfolders/';

data ajay3.first;
set first;
run;

Did it work? Let's check using PROC DATASETS! Notice that the physical path of the library is shown.
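The check described above can be sketched as follows. It repeats the LIBNAME statement so that it runs on its own; QUIT is added because PROC DATASETS is an interactive procedure that stays active after RUN.

```sas
/* Assign the permanent library (same path as used in this tutorial) */
libname ajay3 '/folders/myfolders/';

/* List the members of the permanent library ajay3 */
proc datasets lib=ajay3;
run;
quit;
```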

It is much easier to write code in the actual SAS Studio interface of SAS University Edition. This is because every time you type part of a keyword (like LIB for LIBNAME) it prompts you with the SYNTAX (the proper language code for the program). You don't have to memorize a ton of new words.

Secondly, the SYNTAX is automatically color coded, which helps you understand the program.

Of course you should write comments at each step using /* comments */. (To comment automatically, select some lines and press Ctrl + /. To uncomment, do the same again.)

Directory

Libref               AJAY3
Engine               V9
Physical Name        /folders/myfolders
Filename             /folders/myfolders
Inode Number         268650
Access Permission    rwxr-xr-x
Owner Name           sasdemo
File Size            4KB
File Size (bytes)    4096

#    Name     Member Type    File Size    Last Modified
1    FIRST    DATA           128KB        05/23/2016 20:53:45

The DATA step above used the SET statement to copy one dataset to another. SET is quite useful when we want to subset small parts of the data or retain only the useful or cleaned parts, but it can clutter up our WORK library. Unlike R, which holds its data in memory, SAS is not so memory hungry. Even so, it is a bad programming habit to use SET excessively and NOT delete the datasets you no longer need.
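A minimal sketch of both habits, subsetting with SET and cleaning up afterwards. It uses the SASHELP.CLASS sample data set so that it runs on its own; the dataset name tall_students and the cut-off of 60 are invented for illustration.

```sas
/* Copy only the rows we need into a new temporary dataset */
data tall_students;
    set sashelp.class;
    where height > 60;
run;

proc print data=tall_students;
run;

/* Remove the temporary dataset once we are done with it */
proc datasets lib=work nolist;
    delete tall_students;
run;
quit;
```

The NOLIST option simply suppresses the directory listing in the log while the DELETE statement does its work.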

Let's have a look at our CODE window at this point. You can see how the interface looks, how code is color coded, the three different windows for CODE, LOG and RESULTS, and the tabs for Libraries (and for creating Files and Folders). You can also see the important RUN button (an icon that looks like a running man; F3 does the same). To run any code you need to select it and press the running man icon.

Screenshot from 2016-04-26 15:10:14.png

Now let's make our data input more complex, since not all data can be typed in by hand. To get some data into my SAS University Edition, I do the following:

1) First use the Upload File option (which is limited to 10 MB since it is free).

2) Then import the data from the file (also tweaking the options for column names, delimiters, etc.)

Let's try a dataset that you can also practice on. It is called the adult dataset and you can download it from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

Screenshot from 2016-04-26 15:30:51.png

I simply navigate to the local file system (I am using Ubuntu Linux 32 bit) and upload the file. It gets uploaded to the folder I selected above.

Selection_115.png

Let's import data from a file. I go to the top left, select Files and Folders, and within that I choose Import.

Screenshot from 2016-04-26 15:14:26.png

I then just dragged and dropped the file to the right and voila! Data import is that easy in SAS. Note this was a .txt file, so the import was easy. Next I have to verify that I got all my data in and that the data is in the correct formats. Note that the import code is generated automatically, greatly simplifying my task.

Selection_117.png

I unchecked the check box for Generate SAS variable names since the first row did not have the column names. I then ran it using the familiar running man icon and voila, my data import was complete.

Selection_116.png

Note how I can change some of the parameters with a very easy drop-down: for delimiters (suppose it was a csv file and not a txt file), for file types, and even for path names and the dataset name (using the Change tab above).

Screenshot from 2016-05-23 17:21:23.png

Screenshot from 2016-04-26 16:02:59.png
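For readers who prefer code to the point-and-click route, the import settings described above correspond roughly to a PROC IMPORT step like the sketch below. This is not the code SAS Studio generated for the author; the file path is an assumption (your uploaded file may have a different name or folder), and the library name ajay3 is the one used in this tutorial.

```sas
/* Library used throughout this tutorial */
libname ajay3 '/folders/myfolders/';

/* Assumed location of the uploaded file; adjust to your own folder */
proc import datafile='/folders/myfolders/adult.data'
    out=ajay3.adult
    dbms=dlm
    replace;
    delimiter=',';   /* values are separated by commas */
    getnames=no;     /* first row holds data, not column names */
run;
```

With GETNAMES=NO, SAS names the columns VAR1, VAR2 and so on.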

In this section we did DATA import both manually and using the features within SAS Studio (for the SAS University Edition). I can save this import as a .CTL file in case I do a lot of data import from the same data sources. Now let's inspect the imported data.

Data Inspection

I run PROC CONTENTS to see what is in the data, and I use the VARNUM option to list the variables in order of creation (otherwise they would be listed alphabetically). I see the variable formats as well as the number of observations. This is equivalent to the str command in R.

proc contents data=ajay3.adult varnum;
run;

The CONTENTS Procedure

Data Set Name          AJAY3.ADULT            Observations            32562
Member Type            DATA                   Variables               15
Engine                 V9                     Indexes                 0
Created                05/23/2016 17:21:34    Observation Length      184
Last Modified          05/23/2016 17:21:34    Deleted Observations    0
Protection                                    Compressed              NO
Data Set Type                                 Sorted                  NO
Label
Data Representation    SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64
Encoding               utf-8 Unicode (UTF-8)

Engine/Host Dependent Information

Data Set Page Size            65536
Number of Data Set Pages      92
First Data Page               1
Max Obs per Page              355
Obs in First Data Page        335
Number of Data Set Repairs    0
Filename                      /folders/myfolders/adult.sas7bdat
Release Created               9.0401M3
Host Created                  Linux
Inode Number                  268715
Access Permission             rw-rw-r--
Owner Name                    sasdemo
File Size                     6MB
File Size (bytes)             6094848

Variables in Creation Order

#     Variable    Type    Len    Format      Informat
1     VAR1        Num     8      NLNUM12.    NLNUM32.
2     VAR2        Char    17     $17.        $17.
3     VAR3        Num     8      NLNUM12.    NLNUM32.
4     VAR4        Char    13     $13.        $13.
5     VAR5        Num     8      NLNUM12.    NLNUM32.
6     VAR6        Char    22     $22.        $22.
7     VAR7        Char    18     $18.        $18.
8     VAR8        Char    14     $14.        $14.
9     VAR9        Char    19     $19.        $19.
10    VAR10       Char    7      $7.         $7.
11    VAR11       Num     8      NLNUM12.    NLNUM32.
12    VAR12       Num     8      NLNUM12.    NLNUM32.
13    VAR13       Num     8      NLNUM12.    NLNUM32.
14    VAR14       Char    14     $14.        $14.
15    VAR15       Char    5      $5.         $5.

After I run PROC CONTENTS, I print out 5 observations. This is done using the obs=5 option. (Use this option carefully on the SET statement when copying datasets, or you will end up copying only the first few rows.)

proc print data=ajay3.adult (obs=5);
run;
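The caution about obs= can be seen in a small sketch. It uses the SASHELP.CLASS sample data set (19 rows) rather than the adult data so that it runs on its own; the dataset names class_head and class_all are made up for illustration.

```sas
/* obs= on PROC PRINT only limits what is displayed */
proc print data=sashelp.class (obs=5);
run;

/* obs= on SET limits what is copied: class_head gets only 5 rows */
data class_head;
    set sashelp.class (obs=5);
run;

/* Without obs=, all 19 rows are copied */
data class_all;
    set sashelp.class;
run;
```

The log NOTE after each DATA step reports how many observations the new dataset has, which is a quick way to catch an accidental obs=.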

Selection_118.png

Some values still contain the comma. To fix that, I need to set the delimiter to a comma (,) in the import options, even though it is a .txt file.

I then rename some variables and keep only a few variables for my analysis. This is done with the rename and keep options, as below. To drop some variables I could use the drop option.

data ajay3.adult2;
set ajay3.adult;
rename var1=age
       var2=employer
       var4=educational_level
       var5=education_in_years
       var6=marital_status
       var7=occupational_status
       var9=racial_status
       var10=gender
       var14=ethnic_origin
       var15=income;
run;

Using obs=5 I can print only five observations, not the entire dataset.

proc print data=ajay3.adult2 (obs=5);
run;

data ajay3.adult3 (keep=age gender income);
set ajay3.adult2;
run;

proc print data=ajay3.adult3 (obs=5);
run;

Selection_119.png 
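The drop option mentioned earlier works the same way as keep. A minimal sketch, using the SASHELP.CLASS sample data set so that it runs on its own (the output dataset name class_small is invented):

```sas
/* Keep every variable except height and weight */
data class_small (drop=height weight);
    set sashelp.class;
run;

proc print data=class_small (obs=5);
run;
```

Use keep when you want only a few variables and drop when you want to remove only a few; the result is the same either way.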

I delete the other intermediate datasets using PROC DELETE.

proc delete data=ajay3.adult ajay3.adult2;
run;

Data Manipulation

I can manipulate data values quite easily using these functions.

  • The INT function converts numeric values with decimal places to integers by dropping the fractional part.
  • The SUBSTR function extracts a substring of a character variable.
  • The TRANWRD function replaces every occurrence of a substring within a character string with another string.
  • The INDEX function looks for a specific string and returns its starting position. It returns 0 if that string is not found.
  • The functions LOWCASE and UPCASE can be used to change the case of a string.
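A minimal sketch of these functions in one DATA step. The dataset name, variable names and sample text are invented for illustration; the comments show the value each call returns for this sample.

```sas
data func_demo;
    text     = 'SAS is simple';
    whole    = int(12.75);                       /* 12 */
    first3   = substr(text, 1, 3);               /* SAS */
    swapped  = tranwrd(text, 'simple', 'fast');  /* SAS is fast */
    pos      = index(text, 'simple');            /* 8 */
    no_match = index(text, 'Python');            /* 0: string not found */
    lower    = lowcase(text);                    /* sas is simple */
    upper    = upcase(text);                     /* SAS IS SIMPLE */
run;

proc print data=func_demo;
run;
```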

A very common data manipulation is converting text to date and date to numeric. You can do it using the DATEPART and PUTN functions. (Hat tip: the Stack Overflow answer for this, http://stackoverflow.com/questions/7741282/converting-format-from-date-to-numeric-using-sas, has been viewed 27000 times, showing how SAS is still very popular.)
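One way to sketch these conversions is shown below. The INPUT and DHMS calls, the yymmddn8. format and all variable names are my own choices for illustration and are not taken from the linked answer.

```sas
data date_demo;
    text_date = '07SEP2016';
    /* text -> SAS date (stored as days since 01JAN1960) */
    sas_date  = input(text_date, date9.);
    /* build a datetime value from the date: 10:30:00 on that day */
    dt        = dhms(sas_date, 10, 30, 0);
    /* datetime -> date */
    only_date = datepart(dt);
    /* date -> character string '20160907' */
    yyyymmdd_text = putn(only_date, 'yymmddn8.');
    /* character string -> plain number 20160907 */
    yyyymmdd_num  = input(yyyymmdd_text, 8.);
    format sas_date only_date date9. dt datetime20.;
run;

proc print data=date_demo;
run;
```

The FORMAT statement only changes how the date and datetime values are displayed; the stored values remain plain numbers.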

Conditional Manipulation is quite easily done in SAS’s powerful DATA step.

data ajay3.adult3;
set ajay3.adult3;
if income="<=50K" then inc_50k=0; else inc_50k=1;
run;

I then check it with PROC FREQ to see if it was done correctly. A typical error at this point is not matching the exact condition: if I typed "<50k" instead of "<=50K", no row would match, every row would fall into the ELSE branch, and inc_50k would be 1 everywhere.

proc freq data=ajay3.adult3;
tables income*inc_50k /norow nocol nopercent nocum;
run;

The FREQ Procedure

Table of income by inc_50k (cell values are frequencies)

income    inc_50k=0    inc_50k=1    Total
<=50K     24720        0            24720
>50K      0            7841         7841
Total     24720        7841         32561

Frequency Missing = 1

Data Querying using Proc SQL

The SAS language can use PROC SQL to query a dataset with SQL and explore it. This greatly simplifies things for people who know SQL and are new to the SAS language.

The syntax of PROC SQL is very simple: you simply add PROC SQL in front of the SQL query you would use, treating the SAS dataset as a database table, and end with QUIT.

proc sql;
create table test
as select mean(age) as age,
gender
from ajay3.adult3
group by gender;
quit;

proc print data=test;
run;

Obs    age        gender
1      .
2      36.8582    Female
3      39.4335    Male

Screenshot from 2016-05-28 12:42:21.png
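For readers coming from SQL, here is one more hedged sketch showing that WHERE, GROUP BY and ORDER BY work as expected inside PROC SQL. It uses the SASHELP.CLASS sample data set instead of the adult data so that it runs on its own; the column aliases n_students and avg_height are made up.

```sas
proc sql;
    /* Average height and head count by sex, for students aged 12 or more */
    select sex,
           count(*)     as n_students,
           mean(height) as avg_height
    from sashelp.class
    where age >= 12
    group by sex
    order by avg_height desc;
quit;
```

Without a CREATE TABLE clause, the query result goes straight to the RESULTS window instead of being saved as a dataset.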

Exploratory Data Analysis

Exploratory Data Analysis was a term coined by the great statistician John Tukey. Tukey used to say “Doing statistics is like doing crosswords except that one cannot know for sure whether one has found the solution”.

In the SAS language we use PROC FREQ for frequency tabulations (including cross tabulations), PROC MEANS for looking at measures of central tendency, and PROC UNIVARIATE for looking at distributions.

Now let's do some PROC FREQ for frequencies. Of course I should use this for character variables only, since a numerical variable has too many unique values for a frequency table to be useful. I can specify which variables I want in the frequency tabulation using TABLES, and I can choose options like norow, nocol, nocum and nopercent to suppress row, column, cumulative and percentage statistics.

proc freq data=ajay3.adult3;
tables income /norow nocol nocum nopercent;
run;

The FREQ Procedure

income    Frequency
<=50K     24720
>50K      7841

Frequency Missing = 1

Suppose I used the defaults and did not specify norow, nocol, etc. This is what the code and output would look like.

proc freq data=ajay3.adult3;
tables income;
run;

The FREQ Procedure

income    Frequency    Percent    Cumulative Frequency    Cumulative Percent
<=50K     24720        75.92      24720                   75.92
>50K      7841         24.08      32561                   100.00

Frequency Missing = 1

Let's do a cross tabulation between two character variables by putting a * sign between them. I just want to look at percentages and not the actual counts, which I can do using the nofreq option.

proc freq data=ajay3.adult3;
tables income*gender /norow nocol nofreq nocum;
run;

The FREQ Procedure

Table of income by gender (cell values are percentages)

income    Female    Male     Total
<=50K     29.46     46.46    75.92
>50K      3.62      20.46    24.08
Total     10771     21790    32561
          33.08     66.92    100.00

Frequency Missing = 1

I can run PROC MEANS to do numerical analysis. I can specify statistics like N, NMISS, MEDIAN and MEAN to choose and control my output, use the VAR statement to pick the variables I want to analyze, and use the CLASS statement to do a group-by analysis. Let's look at the code here.

proc means data=ajay3.adult3 n nmiss mean median min max;
var age;
class gender;
run;

The MEANS Procedure

Analysis Variable : age

gender    N Obs    N        N Miss    Mean          Median        Minimum       Maximum
Female    10771    10771    0         36.8582304    35.0000000    17.0000000    90.0000000
Male      21790    21790    0         39.4335475    38.0000000    17.0000000    90.0000000

Suppose I wanted to look at how much age varies, and I want to look at the distribution at various percentiles and quartiles. I can use PROC UNIVARIATE for it.

Copying and pasting code from the output/results window to the code window and then commenting it out is one way to further speed up your pace of learning. Some people even copy and paste output to the code window so they have a way to refer to previous results. Of course R has an option for knitting or weaving results, and Python has Jupyter Notebooks, which keep output and code in the same window.

proc univariate data=ajay3.adult3;
var age;
/* class gender ; */
run;

The UNIVARIATE Procedure

Variable: age

Moments

N                  32561         Sum Weights         32561
Mean               38.5816468    Sum Observations    1256257
Std Deviation      13.6404326    Variance            186.0614
Skewness           0.55874337    Kurtosis            -0.1661275
Uncorrected SS     54526623      Corrected SS        6058159.19
Coeff Variation    35.3547184    Std Error Mean      0.0755926

Basic Statistical Measures

Location              Variability
Mean      38.58165    Std Deviation          13.64043
Median    37.00000    Variance               186.06140
Mode      36.00000    Range                  73.00000
                      Interquartile Range    20.00000

Tests for Location: Mu0=0

Test           Statistic        p Value
Student's t    t    510.3892    Pr > |t|     <.0001
Sign           M    16280.5     Pr >= |M|    <.0001
Signed Rank    S    2.6506E8    Pr >= |S|    <.0001

Quantiles (Definition 5)

Level         Quantile
100% Max      90
99%           74
95%           63
90%           58
75% Q3        48
50% Median    37
25% Q1        28
10%           22
5%            19
1%            17
0% Min        17

Extreme Observations

Lowest            Highest
Value    Obs      Value    Obs
17       32448    90       28464
17       32283    90       31031
17       31960    90       31697
17       31865    90       32278
17       31773    90       32368

Missing Values

Missing Value    Count    Percent Of All Obs    Percent Of Missing Obs
.                1
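The commented-out CLASS statement in the PROC UNIVARIATE code hints at a grouped analysis. A hedged sketch of that idea follows, using the SASHELP.CLASS sample data set so that it runs on its own. The output dataset name height_pct, the chosen percentiles and the prefix p are my own choices for illustration.

```sas
/* Percentiles of height, computed separately for each value of sex */
proc univariate data=sashelp.class noprint;
    class sex;
    var height;
    /* write the 10th, 50th and 90th percentiles to a dataset */
    output out=height_pct pctlpts=10 50 90 pctlpre=p;
run;

proc print data=height_pct;
run;
```

NOPRINT suppresses the long default report, so only the small table of requested percentiles (p10, p50, p90) is printed.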

Data Visualization

Data visualization is the science of presenting data concisely as easy-to-understand visual information. Since people grasp trends and outliers faster from graphs than from tables, data visualization is an important part of analytics.

Let's get inspired by the movie Moneyball and do some data visualization using the baseball dataset bundled in the SASHELP library.

data ajay3.baseball;
set sashelp.baseball;
run;

proc print data=ajay3.BASEBALL (obs=5);
/* var salary; */
run;

proc means data=ajay3.baseball n nmiss mean sum;
var salary;
class league;
run;

The University Edition / SAS Studio provides ways to do data visualization using the interface.

Navigate in the leftmost margin to TASKS and then to GRAPHS, as below.

Screenshot from 2016-08-08 19-04-13.png

And

Screenshot from 2016-08-08 19-03-55.png

Let's make a Histogram using this interface.

We set the data (library.dataset), the where clause (if a condition is used), the roles (variable) and density curves (if needed), and finally run it. A very nice data visualization is produced for analysis, and the underlying code (PROC SGPLOT) is also generated to help us customize it.

Screenshot from 2016-08-08 19-12-26.png
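The code behind the histogram task is built on PROC SGPLOT. A hand-written sketch of a comparable plot, not the exact code SAS Studio produces, might look like this. It reads SASHELP.BASEBALL directly; the title text is made up.

```sas
title 'Distribution of player salaries';

proc sgplot data=sashelp.baseball;
    histogram salary;                /* bars for the salary distribution */
    density salary / type=normal;    /* overlay a fitted normal curve */
run;

title;  /* clear the title so it does not carry over to later output */
```

Adding a WHERE statement inside the step would restrict the plot to a subset of rows, which mirrors the where clause field in the task interface.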

Saving code

Click on the Save Icon to save your code

Screenshot from 2016-04-26 18:00:03.png

As software, SAS has more than 200 components. Some of them are:

  • Base SAS – Basic procedures and data management
  • SAS/STAT – Statistical analysis
  • SAS/GRAPH – Graphics and presentation
  • SAS/OR – Operations research
  • SAS/ETS – Econometrics and Time Series Analysis
  • SAS/IML – Interactive matrix language

Though challenged over the years by SPSS and now by the R language, the SAS language remains the most widely used language by businesses. Every year SAS customers vote with their checkbooks to renew their license to use the SAS language, which is a formidable achievement over 40 years in statistical computing. This is a simple fact: enterprises prefer SAS, PhDs prefer R, and many now prefer Python.

What makes this language so popular and enduring? After all, everything else in computer science has changed and evolved.

One key thing is simplicity, thanks to the original design. The other key area is how SAS Institute has managed the health of the language ecosystem through very good and consistent documentation, timely user support, and user communities built through conferences, papers and relationships. Enterprise software can have bad interfaces, but SAS UIs are always slick, including in its free version for students.

Note this is just a basic tutorial in the SAS language. What other topics can you learn in SAS? They are as follows:

  • SAS Enterprise Miner
  • JMP Software in combination with R
  • Data Export and ODS
  • SAS Macro language
  • Model Building using Proc Logistic and Proc reg
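To give a first taste of two of these topics, here is a minimal sketch of a macro and of a data export. The macro name peek, its parameters and the output file path are all invented for illustration; the path assumes the shared folder used by SAS University Edition.

```sas
/* A tiny macro: print the first n rows of any dataset */
%macro peek(ds, n=5);
    proc print data=&ds (obs=&n);
    run;
%mend peek;

%peek(sashelp.class, n=3)

/* Export a dataset to a CSV file (assumed output path) */
proc export data=sashelp.class
    outfile='/folders/myfolders/class.csv'
    dbms=csv
    replace;
run;
```

The macro call generates an ordinary PROC PRINT step, which is the basic idea behind the SAS Macro language: code that writes code.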

There are 6620 questions tagged sas on Stack Overflow, and the SAS documentation at https://support.sas.com/documentation/ is excellent compared to other documentation in terms of interface and ease of use. To join the SAS Community see https://communities.sas.com/t5/SAS-Analytics-U/bd-p/sas_analytics_u

This is just a basic tutorial on the SAS language for people who want to add this valuable career skill in today's world, where there is a shortage of rational data scientists.


This website is provided by John Wiley & Sons Limited, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ (Company No: 00641132, VAT No: 376766987)

Published features on StatisticsViews.com are checked for statistical accuracy by a panel from the European Network for Business and Industrial Statistics (ENBIS)   to whom Wiley and StatisticsViews.com express their gratitude. This panel are: Ron Kenett, David Steinberg, Shirley Coleman, Irena Ograjenšek, Fabrizio Ruggeri, Rainer Göb, Philippe Castagliola, Xavier Tort-Martorell, Bart De Ketelaere, Antonio Pievatolo, Martina Vandebroek, Lance Mitchell, Gilbert Saporta, Helmut Waldl and Stelios Psarakis.