# Contents

- Preface
- Introduction
- About Stata
- The Stata user interface
- How to communicate with Stata
- Working process for data analyses
- Open dataset
- Explore the dataset
- Get an overview of the whole dataset
- Explore a categorical variable
- Explore two categorical variables simultaneously ("cross tabulating two variables")
- Explore a continuous variable
- Explore a continuous variable under a subgroup of a categorical variable
- About do file
- Data management
- Rename a variable name
- Operators in Stata
- Missing values in Stata
- Generate a new variable
- Categorize a continuous variable as a categorical variable
- Recode a categorical variable
- Drop variables
- Drop observations
- Variable label
- Value label for a categorical variable
- Get help from Stata
- Tips and tricks
- A list of Stata commands

# Preface

This document is written for users who have little experience with Stata. I assume, however, that you have already downloaded and installed Stata on your computer. My general recommendation is to read the document and try the examples in order, starting from the beginning.

There are certainly mistakes left in this document. Be careful, and if you find a mistake, drop me a message at feedBack@medical-statistics.dk.

Before you start, please make sure you have a big cup of coffee and perhaps some good music. If so, we are ready to start our journey.

## Introduction

### About Stata

### The Stata user interface

When you open Stata, the interface looks slightly different depending on your computer's operating system (Windows or Mac).
- Figure 1: Stata user interface for Windows (The following page can be downloaded)
- Menu
- Toolbar
- Results window
- Command window
- History window
- Variables window
- Properties window
- Find where the five windows are
- Move your cursor to each item of the menu and click on it; a drop-down menu will appear. Skim through each drop-down menu. Pay particular attention to two items: File and Help.
- Hold your cursor over each button of the toolbar for a moment, moving from left to right, and a description of that button will appear. Read the descriptions.
- In the toolbar, find the Do-file Editor button.
### How to communicate with Stata?

Imagine that you have just hired a new secretary called Stata, because you have heard that Stata can do many complicated things for you. The only problem is that Stata understands only its own language: commands written in English. To convey your message to Stata, you have to learn commands. Fortunately, if you master roughly 10 commands you can already do many things, and if you master roughly 20 commands you become very powerful.
As users, we can communicate with Stata in three ways:
- Clicking on menu
- Typing commands in Command window
- Typing commands in do-file editor
In the Command window, you type one command and execute it, type another command and execute it, and continue in this way until you are done.

Alternatively, you may leave Stata a note. The note can be short or long; the main point is that the secretary does exactly what you have written. The note contains a list of commands, and Stata executes all of them. In Stata, this note is called a do-file: a collection of commands that you ask Stata to carry out for you.

Furthermore, the commands in a do-file can be saved and re-used later. The do-file can also serve as a record of everything you have done. In contrast, commands executed via the Command window disappear as soon as you shut down Stata.

### Working process for data analyses

- Open the dataset
- Investigate the dataset
- Data management or data manipulation
- Data analyses
- Opening a dataset
- Investigating the contents of the dataset
- Data management
- Graphics
- Data analysis

#### What is Stata?

"Stata is a complete, integrated software package that provides all of your data science needs: data manipulation, visualization, statistics, and reproducible reporting."

#### Why should we learn Stata?

Why should we learn Stata on top of the many things we have already learned, for example Excel?

Based on my own experience, there is a big difference between Stata and Excel. It is perfectly fine to use Excel to, for example, make tables and some nice figures. But, again in my personal experience, Stata can do many cool and complicated things that Excel either cannot do or cannot do as easily, especially when it comes to statistical analyses. You will probably experience the same along the way.

Stata interface starts with

Stata consists of 5 windows:

Figure 3: The data working process

The data working process can be simplified into four steps:

Based on the above working process, this document is organized as the following topics:

However, we will reserve graphics and data analysis for another part, to avoid putting too much burden on you.

Before you continue, please make sure you fill up your coffee. If so, we are ready to move forward.

## Open dataset

- Stata classifies datasets into two categories
  - Internal dataset: a dataset created by Stata, with the extension .dta
  - External dataset: a dataset not created by Stata, for example an Excel spreadsheet with the extension .xls or .xlsx
- Download the dataset for practice: it is very important to know where the dataset is located.
- Open an internal dataset
  - Start Stata
  - In Stata, click on the folder icon at the far left of the toolbar
  - Find the dataset (fake_birthcohort.dta)
  - Click on Open
- Import an external dataset
  - Start Stata
  - In Stata, click on File at the far left of the menu
  - Find and click on Import
  - Click on Excel spreadsheet, and a dialog box will pop up
  - Inside the dialog box, click on Browse and find the dataset (fake_bcohortExcel.xls)
  - Select "Import first row as variable names"
  - Click on OK
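
The same two steps can also be done with commands instead of the menu. The following is a minimal sketch, not the course material itself: it builds a tiny made-up dataset and uses the hypothetical file names toy_birthcohort.dta and toy_birthcohort.xlsx, so that it runs without the downloaded practice files.

```stata
* Build a tiny practice dataset (made-up values, for illustration only)
clear
input id smoking birthweight
1 0 3500
2 1 3100
3 0 3300
end

* Save it as an internal Stata dataset, then open it again
save "toy_birthcohort.dta", replace
use "toy_birthcohort.dta", clear

* Write it to an Excel file, then import it back as an external dataset;
* firstrow tells Stata to treat the first row as variable names
export excel using "toy_birthcohort.xlsx", firstrow(variables) replace
import excel using "toy_birthcohort.xlsx", firstrow clear
describe
```

For the practice files, the equivalent commands would presumably be `use "fake_birthcohort.dta", clear` and `import excel using "fake_bcohortExcel.xls", firstrow clear`, run from the folder where you saved the files.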

## Explore the dataset

Whenever you open a dataset, it is very important to explore the data thoroughly before doing any analysis.

### Open the internal dataset

### Get an overview of the whole dataset

- Describe the whole dataset
  - Stata command: describe
  - Example
    - In the Command Window type: describe
    - Press Enter
  - Figure 4: Stata output
  - Interpretation
    - Number of observations: obs=29
    - Number of variables: vars=7
    - Variable names, such as id, smoking, etc.
    - Variable labels, which provide further information on the variables. For example, smoking means "maternal smoking during pregnancy"
  - Command syntax: describe [varlist]
- Browse the whole dataset
  - Stata command: browse
  - Example
    - Hold your cursor over each button of the toolbar for a moment, moving from left to right, until you find the button Data Editor (Browse).
    - Click on Data Editor (Browse)
  - Figure 5: Stata output
  - Interpretation
    - The main area on the left shows the whole dataset
    - The first row shows the names of all variables. In the top right, you may select or filter variables by clicking on the small box in front of each variable name
    - Each column holds the values of a single variable.
    - Each row holds the values of all variables for a single observation.
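
If you do not have the practice dataset at hand, the overview commands can be tried on a toy dataset. The sketch below uses made-up values and only three of the variable names from this document; it is an illustration, not the real data.

```stata
* Toy stand-in for the practice dataset (made-up values)
clear
input id smoking weight
1 0 62
2 1 70
3 0 55
end
label variable smoking "Maternal smoking during pregnancy"

* Overview of every variable: name, storage type, and label
describe

* Overview restricted to a list of variables
describe id smoking

* browse opens the Data Editor window, so try it interactively:
* browse
```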
### Explore a categorical variable

- Stata command: tab1
- Example:
- In the Command Window type: tab1 smoking
- Press Enter
- Figure 6: Stata output
- Interpretation
- The frequency count and percent for non-smokers are 23 and 79.31%, respectively.
- The frequency count and percent for smokers are 6 and 20.69%, respectively.

tab1 produces a one-way table of frequency counts and percent.

- Command syntax: tab1 [varlist]
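
A small sketch of tab1 on made-up data (the values below are invented for illustration). It also shows that tab1 accepts several variables at once, producing one table per variable.

```stata
* Toy data: 1 = yes, 0 = no (made-up values)
clear
input smoking coffee
0 1
0 0
1 1
0 1
1 0
0 0
end

* One one-way table per variable in the list
tab1 smoking coffee

* tabulate gives the same one-way table for a single variable
tabulate smoking
```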
### Explore two categorical variables simultaneously ("cross tabulating two variables")

- Stata command: tab2
- Example:
- In the Command Window type: tab2 smoking coffee
- Press Enter
- Figure 7: Stata output
- Interpretation

tab2 produces a two-way table of frequency counts. To display percentages as well, add the option row, column, or cell.

- Command syntax: tab2 varname1 varname2
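
The following sketch cross-tabulates two made-up variables. The row option is an addition that is not used in the example above; it asks Stata to print row percentages under the counts.

```stata
* Toy data: 1 = yes, 0 = no (made-up values)
clear
input smoking coffee
0 1
0 0
1 1
0 1
1 0
0 0
end

* Two-way table of frequency counts
tab2 smoking coffee

* Same table with row percentages added
tab2 smoking coffee, row
```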
### Explore a continuous variable

- Stata command: summarize
- Example:
- In the Command Window type: summarize weight height
- Press Enter
- Figure 8: Stata output
- Interpretation

summarize calculates and displays a variety of summary statistics, such as the number of observations, mean, standard deviation, minimum, and maximum.

- Command syntax: summarize [varlist]
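
Here is a short sketch of summarize on made-up weights (kg) and heights (cm). The detail option is an addition to the example above; it prints extra statistics such as percentiles.

```stata
* Toy data (made-up values)
clear
input weight height
62 168
70 172
55 160
80 165
68 170
end

* Basic summary: Obs, Mean, Std. dev., Min, Max
summarize weight height

* More detail for one variable, including the median
summarize weight, detail
```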
### Explore a continuous variable under a subgroup of a categorical variable

- Stata command: summarize, if
- Example:
- In the Command Window type: summarize birthweight if smoking==0
- Make sure that you have typed a double == instead of a single =
- Press Enter
- In the Command Window type: summarize birthweight if smoking==1
- Press Enter
- Figure 9: Stata output
- Interpretation

summarize combined with if calculates and displays summary statistics for the specified subgroup of observations in the dataset.
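
The sketch below repeats the subgroup summary on made-up data. The last command uses the bysort prefix, which is not covered in this document; it is shown only as one possible shortcut for running the same summary in every group.

```stata
* Toy data: smoking 1 = yes, 0 = no; birthweight in grams (made-up values)
clear
input smoking birthweight
0 3500
1 3100
0 3300
0 3650
1 2900
end

* Summary for each subgroup, one command per group
summarize birthweight if smoking == 0
summarize birthweight if smoking == 1

* Shortcut: repeat the command for every value of smoking
bysort smoking: summarize birthweight
```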

### About do file

- Do-file Editor
- Open a new Do-file Editor: In the toolbar, click on Do-file Editor
- Explore the Do-file Editor: click on menus and go through toolbar
- Find two buttons: Save and Execute (do) (hint: see the following figures)
- Figure 10: Do-file Editor for PC
- Figure 11: Do-file Editor for Mac
- Explore the dataset via Do-file
- Download the Stata Do-file:
- In the Do-file Editor, click on the first button to Open the Do-file (stataIntro.do)
- The text in green inside the Do-file is notes/explanations.
- The text in blue inside the Do-file is the Stata commands.
- Highlight the first Stata command in blue, click on Execute (do), and observe what appears in the Results Window
- Repeat the above step

At this moment, I assume that you should be able to open the internal dataset (fake_birthcohort.dta). Please open the data now (hint: see the above "open an internal dataset").

describe produces a summary of the dataset, including

browse shows the data in an Excel-style spreadsheet.

## Data management

### Import the external data

- Please import the data: fake_bcohortExcel.xls (hint: see the above "import an external dataset").
- Explore the dataset
- The dataset logbook
### Data management via Do-file

### Rename a variable name

- Background
- Stata command: rename
- Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
- describe
- rename var1 id
- rename var2 smoking
- rename var3 coffee
- rename var4 weight
- rename var5 height
- rename var6 gender
- rename var7 birthweight
- describe
- Execute the command lines: highlight the command lines above with your cursor and click on the Do button (Execute (do); see Figure 10 above for Windows and Figure 11 above for Mac)
- Compare the output from the command describe in the beginning and at the end
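
A runnable sketch of rename on a toy dataset with three made-up columns. The grouped form in the second rename command is an addition to the example above and assumes a reasonably recent Stata version (it is not available in very old releases).

```stata
* Toy data with uninformative names (made-up values)
clear
input var1 var2 var3
1 0 1
2 1 0
end
describe

* Rename one variable at a time, as in the example above
rename var1 id

* Or rename several at once: old names first, new names second
rename (var2 var3) (smoking coffee)
describe
```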
### Operators in Stata

### Missing values in Stata

- Background:
It is not uncommon for a dataset to contain missing values. Stata handles missing values in several ways; for now, we focus on the most common one. For a numerical variable, a period "." means that the value is missing. It is necessary and important to handle missing values whenever you clean or analyse data.

- Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
- summarize
- Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
- Go through the numbers under "Obs" and think about why they are 27 for coffee and height but 29 for all other variables.
- browse
- Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
- Go through the two variables (coffee and height) and identify where the missing values occur.
- tab coffee
- count if coffee > 0
- Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
- Why is the frequency 12 after "tab coffee" but 14 after "count if coffee > 0"? (Hint: in Stata, the numeric missing value "." is treated as larger than any number, so missing values also satisfy coffee > 0.)
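
The puzzle above can be reproduced on a toy variable with two missing values (made-up data). The last command sketches one common way to avoid the trap, using the missing() function to exclude missing values explicitly.

```stata
* Toy data: 1 = yes, 0 = no, . = missing (made-up values)
clear
input coffee
1
0
.
1
.
0
1
end

* tab leaves out missing values unless the missing option is added
tab coffee
tab coffee, missing

* Missing counts as larger than any number, so both "." rows are counted here
count if coffee > 0

* Excluding missing values explicitly gives the intended count
count if coffee > 0 & !missing(coffee)
```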
### Generate a new variable

- Background:
We often generate a new variable based on existing variables. For example, we generate the variable BMI (body mass index) from height and weight.

- Stata command: generate
- Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
- generate bmi=weight/(height/100)^2
- summarize bmi
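
A self-contained sketch of the BMI calculation on made-up data, with weight in kilograms and height in centimeters as in this document. The third row has a missing height, to show that a calculation involving a missing value gives a missing result.

```stata
* Toy data (made-up values); the third height is missing
clear
input weight height
62 168
70 172
55 .
end

* BMI = weight (kg) divided by height (m) squared
generate bmi = weight / (height / 100)^2

summarize bmi
list weight height bmi
```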
### Categorize a continuous variable as a categorical variable

- Background:
We often categorize a continuous variable as a categorical variable. For example, we categorize BMI into groups based on the standard WHO categories.

- Stata command: generate, replace, if
- generate bmi_g3=.
- replace bmi_g3=1 if bmi<18.5
- replace bmi_g3=2 if bmi>=18.5 & bmi<25
- replace bmi_g3=3 if bmi>=25.0 & bmi!=.
- tab1 bmi_g3
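
The sketch below runs the same steps on made-up BMI values, including one missing value. The extra variable bmi_unguarded is hypothetical and exists only to show what happens when the condition bmi != . is left out: the missing BMI is wrongly placed in the highest group.

```stata
* Toy data (made-up values); the last BMI is missing
clear
input bmi
17
18.5
22
25
31
.
end

generate bmi_g3 = .
replace bmi_g3 = 1 if bmi < 18.5
replace bmi_g3 = 2 if bmi >= 18.5 & bmi < 25
replace bmi_g3 = 3 if bmi >= 25 & bmi != .

* For comparison only: without the guard, missing BMI lands in group 3
generate bmi_unguarded = 3 if bmi >= 25

list
tab1 bmi_g3
```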
### Recode a categorical variable

- Background:
Sometimes we would like to regroup a categorical variable using different categories.

- Stata command: generate, replace, if
- generate bmi_g2=.
- replace bmi_g2=0 if bmi_g3==1 | bmi_g3==2
- replace bmi_g2=1 if bmi_g3==3
- tab2 bmi_g2 bmi_g3
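
Here is a runnable sketch of the regrouping on made-up data. It also shows Stata's recode command as a possible alternative; recode is not used elsewhere in this document, and the variable name bmi_g2_alt is made up for the comparison.

```stata
* Toy data: BMI group 1, 2, or 3, with one missing value (made-up values)
clear
input bmi_g3
1
2
3
.
2
3
end

* Approach used in this document: generate, then replace with if
generate bmi_g2 = .
replace bmi_g2 = 0 if bmi_g3 == 1 | bmi_g3 == 2
replace bmi_g2 = 1 if bmi_g3 == 3

* Alternative: recode into a new variable in one step
recode bmi_g3 (1 2 = 0) (3 = 1), generate(bmi_g2_alt)

* Check the result: the two versions should agree in every row
tab2 bmi_g2 bmi_g3
assert bmi_g2 == bmi_g2_alt
```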
### Drop variables

- Background:
Sometimes a dataset has many variables that are not needed. In this case, it is recommended to drop them, because they can be very distracting.

- Stata command: drop
- drop weight height
### Drop observations

- Background:
Sometimes we need to drop (remove) observations because of errors, outliers, missing values, etc.

- Stata command: drop and if
- drop if weight==.
- drop if coffee==.
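
A small sketch of dropping observations with missing values, on made-up data. The second drop uses the missing() function instead of ==. as an equivalent way to write the condition for these data. Dropped observations are gone from memory, so it is sensible to keep the original file unchanged.

```stata
* Toy data with missing values in coffee and height (made-up values)
clear
input id coffee height
1 1 168
2 . 172
3 0 .
4 1 165
end

* Number of observations before dropping
count

drop if coffee == .
drop if missing(height)

* Number of observations after dropping
count
list
```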
### Variable label

- Background:
- Stata command: label variable
- describe
- label variable id "ID number for each child in the cohort"
- label variable smoking "Maternal smoking status during pregnancy"
- label variable coffee "Maternal coffee drinking during pregnancy"
- label variable weight "Maternal weight (kilogram) at the beginning of the pregnancy"
- label variable height "Maternal height (centimeter) at the beginning of the pregnancy"
- label variable gender "Gender for the child"
- label variable birthweight "Birthweight for the child"
- describe
- Compare the output from the command describe in the beginning and at the end
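
A self-contained sketch of variable labels on a toy dataset with two made-up variables. Compare the column "variable label" in the two describe outputs.

```stata
* Toy data (made-up values)
clear
input id smoking
1 0
2 1
end

* Before: no variable labels
describe

label variable id "ID number for each child in the cohort"
label variable smoking "Maternal smoking status during pregnancy"

* After: labels appear next to the variable names
describe
```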
### Value label for a categorical variable

- Background:
When we have categorical variables, it is difficult to remember what the values mean. For example, given a categorical variable gender coded as 0 and 1, it is difficult to judge whether 0 or 1 stands for boy/male. Therefore, it is crucial to label the values of categorical variables.

- Stata command: label define and label values
- tab1 smoking coffee gender
- Step 1: define a label name, together with the text explaining the meaning of each value, for each variable
- lab define labForSmoking 1 "Yes" 0 "No"
- lab define labForCoffee 1 "Yes" 0 "No"
- lab define labForGender 0 "boy" 1 "girl"
- Step 2: connect each existing variable name to its label name
- lab value smoking labForSmoking
- lab value coffee labForCoffee
- lab value gender labForGender
- tab1 smoking coffee gender
- Compare the output from the command tab1 in the beginning and at the end
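
The two steps can be tried on a toy dataset (made-up values). The sketch spells the commands out in full as label define and label values; lab define and lab value, as used above, are abbreviations of the same commands. The final label list command is an addition that prints a label definition so you can check it.

```stata
* Toy data: coded 0/1 (made-up values)
clear
input smoking gender
0 0
1 1
0 1
end

* Step 1: define the label names and the text for each value
label define labForSmoking 0 "No" 1 "Yes"
label define labForGender 0 "boy" 1 "girl"

* Step 2: attach each label name to its variable
label values smoking labForSmoking
label values gender labForGender

* The tables now show the text instead of 0 and 1
tab1 smoking gender

* Show a label definition
label list labForGender
```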

At this moment, I assume that you are able to import an external dataset. Furthermore, I assume that you are able to explore the dataset.

Current names | Desired names | Value labels | Variable labels |
---|---|---|---|
var1 | id | | ID number for each child in the cohort |
var2 | smoking | 0: no, 1: yes | Maternal smoking status during pregnancy |
var3 | coffee | 0: no, 1: yes | Maternal coffee drinking during pregnancy |
var4 | weight | | Maternal weight (kg) at the beginning of the pregnancy |
var5 | height | | Maternal height (cm) at the beginning of the pregnancy |
var6 | gender | 0: boy, 1: girl | Gender for the child |
var7 | birthweight | | Birthweight for the child (grams) |

It is not uncommon to rename variables to make them more readable and understandable. For example, once var7 is renamed as birthweight, one can immediately understand what the variable means.

Arithmetic | Logical | Relational |
---|---|---|
`+` addition | `&` and | `>` greater than |
`-` subtraction | `\|` or | `<` less than |
`*` multiplication | `!` not | `>=` greater than or equal |
`/` division | `~` not | `<=` less than or equal |
`^` power | | `==` equal |
`-` negation | | `!=` not equal |
`=` assignment | | `~=` not equal |
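
The operators in the table can be tried on a toy dataset. The following sketch uses made-up values and combines arithmetic, relational, and logical operators in commands that appear in this document (generate and count); display is an extra command used here as a simple calculator.

```stata
* Toy data (made-up values)
clear
input weight height smoking coffee
62 168 0 1
70 172 1 0
55 160 0 0
80 165 1 1
end

* Arithmetic: division, power, and = for assignment
generate bmi = weight / (height / 100)^2

* Relational == combined with logical & (and)
count if smoking == 1 & coffee == 1

* Logical | (or)
count if smoking == 1 | coffee == 1

* Relational != (not equal) and logical ! (not)
count if smoking != 1
count if !(weight >= 70)

* Arithmetic on plain numbers: power first, then multiplication, then addition
display 2 + 3 * 4^2
```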

## Get help from Stata

Besides Google, you can get help in at least three ways:
- At the end of the Stata menu, click on Help and skim through the drop-down menu
- To get help for a particular Stata command, type in the Command Window: help commandname
- Visit the Stata YouTube channel

## Tips and tricks

- Review Window (renamed as "History" in Stata 16)
  - Copy a past command from the Review Window (History) to the Command Window: single-click on the past command in the Review Window
  - Re-run a past command from the Review Window: double-click on the past command in the Review Window
  - Copy past commands to the Do-file Editor: select the commands in the Review Window, right-click, and click on "Send selected to Do-File Editor"
- Variables Window
  - Copy a single variable from the Variables Window to the Command Window: double-click on the variable in the Variables Window
  - Copy several variables from the Variables Window to the Command Window: select the variables in the Variables Window, right-click on them, and click on "Send varlist to Command Window"

## A list of Stata commands

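The description of the command | Name of command | Example |
---|---|---|
Summarize the contents of the dataset | describe | describe |
Show the dataset as a spreadsheet | browse | browse |
One-way table of frequency counts and percent | tab1 | tab1 smoking |
Two-way table of frequency counts | tab2 | tab2 smoking coffee |
Summary statistics for continuous variables | summarize | summarize weight height |
Summary statistics for a subgroup | summarize, if | summarize birthweight if smoking==0 |
Rename a variable | rename | rename var1 id |
Generate a new variable | generate | generate bmi=weight/(height/100)^2 |
Replace values under a condition | replace, if | replace bmi_g3=1 if bmi<18.5 |
Attach a label to a variable | label variable | label variable gender "Gender for the child" |
Define value labels and attach them to a variable | lab define, lab value | lab define labForGender 0 "boy" 1 "girl", then lab value gender labForGender |
Drop variables | drop | drop weight height |
Drop observations | drop, if | drop if coffee==. |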