Medical statistics and Data Science: Statistics

How to get started with Stata

     

Preface

This document is intended for users who have little experience with Stata. I assume, however, that you have already downloaded and installed Stata on your computer. As a general recommendation, read the document and try the examples in order, starting from the beginning.

There are certainly mistakes left in this document. Be careful, and if you find a mistake, drop me a message at feedBack@medical-statistics.dk.

Before you start, please make sure you have a big cup of coffee and perhaps some good music. If so, we are ready to start our journey.

Introduction

  1. About Stata

    What is Stata?

    ``Stata is a complete, integrated software package that provides all of your data science needs—data manipulation, visualization, statistics, and reproducible reporting.''

    Why should we learn Stata?

    Why should we learn Stata on top of the many tools we already know, such as Excel?

    In my own experience, there is a big difference between Stata and Excel. It is perfectly fine to use Excel to, for example, make tables and some nice figures. However, Stata can do many powerful and complicated things that Excel either cannot do or cannot do as easily, especially statistical analyses. You will probably experience the same along the way.

  2. The Stata user interface

    When you open Stata, the interface looks slightly different depending on your computer's operating system (Windows or Mac).
    • Figure 1: Stata user interface for Windows (The following page can be downloaded)
    • Figure 2: Stata user interface for Mac (The following page can be downloaded)
    Whichever of the two systems you use, you will notice the following:

    The Stata interface starts with

    • Menu
    • Toolbar

    The Stata interface consists of five windows:

    • Results window
    • Command window
    • History window
    • Variables window
    • Properties window

    We will go through each part along the way. For now, you should at least do the following four things:
    1. Find where the five windows are.
    2. Move your cursor to each button of the menu and click on it; a drop-down menu will appear. Skim through the drop-down menu. Pay particular attention to two buttons: File and Help.
    3. Move your cursor from left to right along the toolbar and hold it over each button for a moment; a description of that button will appear. Read the descriptions.
    4. In the toolbar, find the Do-file Editor button.
  3. How to communicate with Stata?

    Imagine that you have just hired a new secretary called Stata because you have heard that Stata can do many complicated things for you. The only problem is that Stata understands only its own language: commands in English. To convey your message to Stata, you have to learn commands. Fortunately, if you master roughly 10 commands you can do many things, and if you master roughly 20 commands you become very powerful. As users, we can communicate with Stata in three ways:
    1. Clicking on the menus
    2. Typing commands in the Command window
    3. Typing commands in the Do-file Editor
    In my experience, it is easy to get started by clicking on the menus because we are used to clicking around with the mouse. However, the more you use Stata, the more you will prefer typing commands, because commands are generally faster and more efficient. The main difference between typing commands in the Command window and in a do-file is the following. Imagine again that Stata is your secretary. When you ask the secretary to do something for you, you can either give instructions one sentence at a time or leave a written note. Communicating with Stata works the same way.
    1. You may type one command in the Command window and execute it, then type another command and execute it, and continue in this way until you are done. This is how typing commands in the Command window works.

    2. You may leave a note for Stata. The note can be very short or very long, but the main point is that the secretary does exactly what you have written. The note contains a list of commands, and Stata executes all of them. In Stata, this note is called a do-file: a collection of commands that you ask Stata to carry out for you.

      Furthermore, the commands in a do-file can be saved and re-used later. The do-file can also serve as a record of everything you have done. In contrast, commands executed via the Command window are lost as soon as you close Stata.

    In general, we prefer to type and execute commands via a do-file rather than the Command window, because it is much easier to execute a large number of commands at once. If do-files do not make much sense to you yet, do not worry; just keep going. You will experience the difference during practice, we will come back to do-files later, and I promise that you will fall in love with them in the end.
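To make the "note" idea concrete, here is a minimal sketch of what a do-file might contain. The file name is hypothetical, and the auto dataset loaded with sysuse ships with Stata; it is used here only so that the sketch runs without the course files and is not part of this document's material.

```stata
* sketch_intro.do (hypothetical file name)
* Lines starting with an asterisk are comments; Stata ignores them.

* Load a small example dataset that ships with Stata
sysuse auto, clear

* Each line below is one command; Stata runs them from top to bottom
describe
summarize price mpg
tab1 foreign
```

Running the whole file executes the three commands in order, which is the "leave a note" way of working described above.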
  4. Working process for data analyses

    Figure 3: The data working process

    The data working process can be simplified into four steps:

    1. Open the dataset
    2. Investigate the dataset
    3. Data management or data manipulation
    4. Data analysis
    You will usually go back and forth between steps 2 and 3 (investigating the dataset and data management/manipulation) many times, depending on how complex the dataset is.

    Based on this working process, this document is organized into the following topics:

    1. Opening a dataset
    2. Investigating the contents of the dataset
    3. Data management
    4. Graphics
    5. Data analysis
    There will be brief explanations along the way, but I try my best to balance simplicity and sufficiency, and leave the complications to the next level. The key aim of this document is to guide you through the whole process and build up your confidence.

    However, we will save graphics and data analysis for another part, to avoid overloading you.

    Before you continue, please make sure you fill up your coffee. If so, we are ready to move forward.

Open dataset

  • Stata classifies any dataset into one of two categories:
    • Internal dataset: a dataset created by Stata; its extension is .dta
    • External dataset: a dataset not created by Stata, for example an Excel spreadsheet; its extension could be .xls or .xlsx
  • Download the dataset for practice: it is very important to know where the dataset is located.
  • Open an internal dataset
    1. Start your Stata
    2. In Stata, click on the folder symbol at the far left of the toolbar
    3. Find the dataset (fake_birthcohort.dta)
    4. Click on Open
  • Import an external dataset
    1. Start your Stata
    2. In Stata, click on File at the far left of the menu
    3. Find and click on Import
    4. Click on Excel spreadsheet; a dialog box will pop up
    5. Inside the dialog box, click on Browse and find the dataset (fake_bcohortExcel.xls)
    6. Select "Import first row as variable names"
    7. Click on OK
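The menu steps above also have command equivalents: use for a .dta file and import excel for a spreadsheet. The sketch below is a hedged illustration, not the course material: it builds a tiny toy dataset and writes it to the current working directory under the hypothetical names toy_cohort.dta and toy_cohort.xls, so that it runs without the downloaded files. With the real files you would put their names (and paths) in place of the toy ones.

```stata
* Build a tiny toy dataset so the sketch runs without the course files
clear
input id smoking
1 0
2 1
3 0
end

* Save it as a Stata file and as an Excel file (hypothetical file names)
save "toy_cohort.dta", replace
export excel using "toy_cohort.xls", firstrow(variables) replace

* Open an internal dataset (roughly what the folder button and Open do)
use "toy_cohort.dta", clear

* Import an external dataset; firstrow corresponds to the
* "Import first row as variable names" tick box
import excel using "toy_cohort.xls", firstrow clear
describe
```

The clear option tells Stata to discard the dataset currently in memory before loading the new one.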

Explore the dataset

Whenever you open a dataset, it is very important to explore the data thoroughly before doing any analysis.

  • Open the internal dataset

    At this point, I assume that you are able to open the internal dataset (fake_birthcohort.dta). Please open it now (hint: see "Open an internal dataset" above).

  • Get an overview of the whole dataset

    • Describe the whole dataset
      • Stata command: describe
      • Example
        1. In the Command Window type: describe
        2. Press Enter
        3. Figure 4: Stata output
        4. Interpretation
        5. describe produces a summary of the dataset, including

          • Number of observations: obs=29
          • Number of variables: vars=7
          • Variable names, such as id, smoking, etc.
          • Variable labels, which provide further information on the variables. For example, smoking means "maternal smoking during pregnancy"

      • Command syntax: describe [varlist]
    • Browse the whole dataset
      • Stata command: browse
      • Example
        1. Move and hold your cursor from left to right over each button of the toolbar for a moment, until you find the button Data Editor (Browse).
        2. Click on Data Editor (Browse)
        3. Figure 5: Stata output
        4. Interpretation
        5. browse displays the data in an Excel-style spreadsheet.

          • The main area on the left shows the whole dataset.
          • The first row shows all variable names. At the top right, you can select/filter the variables by clicking on the small box in front of each variable name.
          • Each column holds the values of a single variable.
          • Each row holds the values of all variables for a single observation.

  • Explore a categorical variable

    • Stata command: tab1
    • Example:
      1. in the Command Window type: tab1 smoking
      2. Press Enter
      3. Figure 6: Stata output
      4. Interpretation
      5. tab1 produces a one-way table of frequency counts and percentages.

        • The frequency count and percentage for non-smokers are 23 and 79.31%, respectively.
        • The frequency count and percentage for smokers are 6 and 20.69%, respectively.

    • Command syntax: tab1 [varlist]
  • Explore two categorical variables simultaneously ("cross tabulating two variables")

    • Stata command: tab2
    • Example:
      1. in the Command Window type: tab2 smoking coffee
      2. Press Enter
      3. Figure 7: Stata output
      4. Interpretation
      5. tab2 produces a two-way table of frequency counts. Percentages are shown only if you request them with an option such as row or column.

    • Command syntax: tab2 varname1 varname2
  • Explore a continuous variable

    • Stata command: summarize
    • Example:
      1. in the Command Window type: summarize weight height
      2. Press Enter
      3. Figure 8: Stata output
      4. Interpretation
      5. summarize calculates and displays a variety of summary statistics: the number of observations, mean, standard deviation, minimum, and maximum.

    • Command syntax: summarize [varlist]
  • Explore a continuous variable under a subgroup of a categorical variable

    • Stata command: summarize, if
    • Example:
      1. in the Command Window type: summarize birthweight if smoking==0
      2. Make sure that you have typed a double == instead of a single =
      3. Press Enter
      4. in the Command Window type: summarize birthweight if smoking==1
      5. Press Enter
      6. Figure 9: Stata output
      7. Interpretation
      8. summarize combined with if calculates and displays summary statistics for the specified subgroup of observations in the dataset.
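The subgroup summary can be tried on a small scale with the sketch below. The five toy observations are invented for illustration only (they are not taken from fake_birthcohort.dta); the variable names follow the document.

```stata
* Toy data: smoking status (0/1) and birthweight in grams (invented values)
clear
input smoking birthweight
0 3500
0 3300
1 3000
1 2900
0 3700
end

* Use a double == to compare; a single = is for assignment
summarize birthweight if smoking == 0
summarize birthweight if smoking == 1
```

The first command summarizes the three non-smokers (mean 3500) and the second the two smokers (mean 2950).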

  • About do-files

    • Do-file Editor
      1. Open a new Do-file Editor: in the toolbar, click on Do-file Editor
      2. Explore the Do-file Editor: click on the menus and go through the toolbar
      3. Find two buttons: Save and Execute (do) (hint: see the following figures)
    • Figure 10: Do-file Editor for PC
    • Figure 11: Do-file Editor for Mac
    • Explore the dataset via Do-file
      • Download the Stata Do-file:
      • In the Do-file Editor, click on the first button to Open the Do-file (stataIntro.do)
      • The text in green inside the Do-file is notes/explanatory text.
      • The text in blue inside the Do-file is Stata commands.
      • Highlight the first Stata command in blue, click on Execute (do), and observe what appears in the Results Window
      • Repeat the above step

Data management

  • Import the external data

  • At this moment, I assume that you are able to import an external dataset. Furthermore, I assume that you are able to explore the dataset.

    1. Please import the data: fake_bcohortExcel.xls (hint: see the above "import an external dataset").
    2. Explore the dataset
    3. The dataset logbook

      Current names  Desired names  Value labels      Variable labels
      var1           id                               ID number for each child in the cohort
      var2           smoking        0: no, 1: yes     Maternal smoking status during pregnancy
      var3           coffee         0: no, 1: yes     Maternal coffee drinking during pregnancy
      var4           weight                           Maternal weight (kg) at the beginning of the pregnancy
      var5           height                           Maternal height (cm) at the beginning of the pregnancy
      var6           gender         0: boy, 1: girl   Gender of the child
      var7           birthweight                      Birthweight of the child (grams)

  • Data management via Do-file

    1. Download the Do-file:
    2. Open a new Do-file
    3. Open the downloaded Do-file:
  • Rename a variable

    • Background
    • It is common to rename variables to make them more readable and understandable. For example, once var7 is renamed birthweight, one can immediately understand what the variable means.

    • Stata command: rename
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • describe
      • rename var1 id
      • rename var2 smoking
      • rename var3 coffee
      • rename var4 weight
      • rename var5 height
      • rename var6 gender
      • rename var7 birthweight
      • describe
    • Execute the command lines: use the cursor to highlight the above command lines and click on the Do button (called Do or Execute (do); for Windows see Figure 10 above, for Mac see Figure 11 above)
    • Compare the output from the command describe in the beginning and at the end
  • Operators in Stata

  • Arithmetic              Logical      Relational
    +  addition             &  and       >   greater than
    -  subtraction          |  or        <   less than
    *  multiplication       !  not       >=  greater than or equal
    /  division             ~  not       <=  less than or equal
    ^  power                             ==  equal
    -  negation                          !=  not equal
    =  equal (assignment)                ~=  not equal
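A few of these operators in action, as a minimal sketch. The four toy observations are invented for illustration; display and count are standard Stata commands, used here only to show each type of operator.

```stata
* Arithmetic operators: display works like a pocket calculator
display 7 + 2
display 7/2
display 2^3

* Toy data (invented values)
clear
input smoking coffee weight
0 0 55
0 1 62
1 1 70
1 0 48
end

* Logical "and" combined with relational operators
count if smoking == 1 & weight >= 60

* Logical "or"
count if smoking == 1 | coffee == 1

* "Not equal": != and ~= are interchangeable
count if smoking != 1
```

In this toy data, one observation is a smoker weighing at least 60, three are smokers or coffee drinkers, and two are non-smokers.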
  • Missing values in Stata

    • Background:

      It is common for a dataset to contain missing values. Stata represents missing values in several ways; for now, we focus on the most common one. For a numerical variable, a period "." means that the value is missing. It is important to handle missing values whenever you clean or analyze data.

    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • summarize
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
    • Look at the number of observations under "Obs" and think about why it is 27 for coffee and height but 29 for all other variables.
      • browse
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
    • Go through the two variables (coffee and height) and identify where the missing values occur.
      • tab coffee
      • count if coffee > 0
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
    • Why is the frequency 12 after "tab coffee" but 14 after "count if coffee>0"? (Hint: in Stata, the numeric missing value "." is treated as larger than any number, so a missing value also satisfies coffee>0.)
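The effect described in the hint can be reproduced with a small sketch. The toy values are invented for illustration, and the missing() function is a standard Stata function that is not used elsewhere in this document.

```stata
* Toy data: two coffee drinkers, one non-drinker, two missing values
clear
input coffee
0
1
1
.
.
end

* The table leaves out the missing values
tab1 coffee

* Missing counts as larger than any number, so this counts 4, not 2
count if coffee > 0

* Two ways to exclude missing values explicitly; both count 2
count if coffee > 0 & !missing(coffee)
count if coffee > 0 & coffee != .
```

Whenever a condition uses > or >=, it is worth asking whether missing values should be excluded explicitly.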
  • Generate a new variable

    • Background:

      We often generate a new variable from existing variables. For example, we can generate the variable BMI (body mass index) from height and weight.

    • Stata command: generate
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • generate bmi=weight/(height/100)^2
      • summarize bmi
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
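As a worked check of the formula, BMI is weight in kilograms divided by the square of height in metres, so a weight of 60 kg and a height of 165 cm give 60/1.65^2, which is about 22.04. The sketch below uses three invented observations to show this, including what happens when height is missing.

```stata
* Toy data: weight in kg, height in cm (invented values)
clear
input weight height
60 165
80 180
70 .
end

* height/100 converts centimetres to metres
generate bmi = weight/(height/100)^2

* The third observation has a missing height, so its bmi is missing too
list
```

Stata reports how many missing values were generated, which is a useful reminder to check the input variables.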
  • Categorize a continuous variable as a categorical variable

    • Background:

      We often categorize a continuous variable as a categorical variable. For example, we can categorize BMI according to the WHO standard categories.

    • Stata command: generate, replace, if
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • generate bmi_g3=.
      • replace bmi_g3=1 if bmi<18.5
      • replace bmi_g3=2 if bmi>=18.5 & bmi<25
      • replace bmi_g3=3 if bmi>=25.0 & bmi!=.
      • tab1 bmi_g3
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
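The condition bmi!=. in the last replace line matters because a missing value counts as larger than any number. The sketch below uses four invented BMI values to show what would happen without it; the variable wrong_g3 is hypothetical and exists only for the comparison.

```stata
* Toy data: three BMI values and one missing value (invented)
clear
input bmi
17.0
22.0
27.5
.
end

* Categorization as in the example above
generate bmi_g3 = .
replace bmi_g3 = 1 if bmi < 18.5
replace bmi_g3 = 2 if bmi >= 18.5 & bmi < 25
replace bmi_g3 = 3 if bmi >= 25.0 & bmi != .

* Without "& bmi != ." the missing BMI is wrongly placed in group 3
generate wrong_g3 = 3 if bmi >= 25.0

list
```

In the listing, the last observation has a missing bmi_g3 but a wrong_g3 of 3.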
  • Recode a categorical variable

    • Background:

      Sometimes we want to regroup a categorical variable into different categories.

    • Stata command: generate, replace, if
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • generate bmi_g2=.
      • replace bmi_g2=0 if bmi_g3==1 | bmi_g3==2
      • replace bmi_g2=1 if bmi_g3==3
      • tab2 bmi_g2 bmi_g3
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
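For comparison only, Stata also has a dedicated recode command that can do the same regrouping in one line. This is an alternative to the generate/replace approach used in this document, not part of it; the toy values and the variable name bmi_g2_alt are invented for the sketch.

```stata
* Toy data: the three BMI groups plus one missing value
clear
input bmi_g3
1
2
3
.
end

* The approach used in this document
generate bmi_g2 = .
replace bmi_g2 = 0 if bmi_g3 == 1 | bmi_g3 == 2
replace bmi_g2 = 1 if bmi_g3 == 3

* Alternative: recode groups 1 and 2 to 0, group 3 to 1, into a new variable
recode bmi_g3 (1 2 = 0) (3 = 1), generate(bmi_g2_alt)

* Both variables should agree in every observation
tab2 bmi_g2 bmi_g2_alt
```

Either way, cross-tabulating the new variable against the old one is a quick check that the regrouping did what was intended.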
  • Drop variables

    • Background:

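      Sometimes a dataset contains many variables that are not needed. In that case it is recommended to drop them, because they clutter the dataset. Be aware that a dropped variable is removed from the dataset in memory: later examples that refer to weight or height will only work if you re-import the data and rerun the earlier steps first.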

    • Stata command: drop
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • drop weight height
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
  • Drop observations

    • Background:

      Sometimes we need to drop/remove observations because of errors, outliers, missing values, etc.

    • Stata command: drop and if
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • drop if height==.
      • drop if coffee==.
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
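A small sketch of dropping observations, using five invented observations. Running count before and after shows how many observations were removed.

```stata
* Toy data: two of the five observations have a missing coffee value
clear
input id coffee
1 0
2 .
3 1
4 .
5 1
end

* Number of observations before dropping
count

* Remove the observations where coffee is missing
drop if coffee == .

* Number of observations after dropping
count
```

Dropped observations are gone from the dataset in memory, so it is safest to keep the original data file untouched and do the dropping in a do-file.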
  • Variable label

    • Background:

    • Stata command: label variable
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • describe
      • label variable id "ID number for each child in the cohort"
      • label variable smoking "Maternal smoking status during pregnancy"
      • label variable coffee "Maternal coffee drinking during pregnancy"
      • label variable weight "Maternal weight (kilogram) at the beginning of the pregnancy"
      • label variable height "Maternal height (centimeter) at the beginning of the pregnancy"
      • label variable gender "Gender for the child"
      • label variable birthweight "Birthweight for the child"
      • describe
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
    • Compare the output from the command describe in the beginning and at the end
  • Value label for a categorical variable

    • Background:

      With categorical variables, it is difficult to remember what the values mean. For example, given a categorical variable gender coded as 0 and 1, it is hard to tell whether 0 or 1 stands for boy/male. Therefore, it is crucial to label the values of categorical variables.

    • Stata command: label define and label values
    • Example: try out the example in the downloaded Do-file or copy, paste, and run the following command lines into a new Do-file Editor
      • tab1 smoking coffee gender
      1. Step 1: define a label name together with the text explaining what each value means
        • lab define labForSmoking 1 "Yes" 0 "No"
        • lab define labForCoffee 1 "Yes" 0 "No"
        • lab define labForGender 0 "boy" 1 "girl"
      2. Step 2: connect the existing variable name to the label name
        • lab value smoking labForSmoking
        • lab value coffee labForCoffee
        • lab value gender labForGender
      • tab1 smoking coffee gender
    • Execute the command lines: use the computer cursor to highlight the above command lines and click on the Do button
    • Compare the output from the command tab1 in the beginning and at the end

Get help from Stata

  • Besides Google, you can get help in at least three ways:
    1. At the end of the Stata menu, click on Help and skim through the drop-down menu
    2. To get help for a particular Stata command, type in the Command Window: help commandname
    3. Visit the Stata YouTube channel

Tips and tricks

  • Review Window (renamed as "History" in Stata 16)
    1. Copy a past command from the Review-Window (History) to Command-Window: single-click on the past command in the Review-Window
    2. Re-run a past command in the Review-Window: double-click on the past command in the Review-Window
    3. Copy past commands to Do-file editor: Select the commands in the Review-Window and right-click, and further click on "Send selected to Do-File Editor"
  • Variables Window
    1. Copy a single variable from the Variables Window to the Command Window: double-click on the variable in the Variables Window
    2. Copy several variables from the Variables Window to the Command Window: select the variables in the Variables Window, right-click on them, and then click on "Send varlist to Command Window"

A list of Stata commands

The description of the command Name of command Example
describe
browse
tab1
tab2
summarize
summarize, if
rename
generate
replace, if
label variable
lab define, lab value
drop
drop, if