Model (Statistical; Regression)

What It Is

Many clinical or observational research studies involve regression models. These models employ statistical methods to examine whether particular factors are associated with an outcome. If there are multiple factors of interest in the model, the statistical methods allow the researcher to determine whether each factor is related to the outcome even after accounting for the values or effects of the other factors in the model (“controlling for confounding”). There are different types of statistical models, depending on the outcomes, factors, and study design, e.g., linear regression, logistic regression, the Cox proportional hazards model, and the linear mixed effects model.
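
For readers who work in Python, here is a minimal sketch, using the statsmodels package and entirely made-up data and variable names, of how the type of outcome steers the choice of model. It illustrates the idea only; it is not a template for a real analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Entirely made-up data: 200 participants, one row each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(40, 80, 200),
    "treatment": rng.integers(0, 2, 200),
})
df["blood_pressure"] = 110 + 0.5 * df["age"] - 5 * df["treatment"] + rng.normal(0, 10, 200)
df["has_disease"] = rng.binomial(1, 1 / (1 + np.exp(-0.05 * (df["age"] - 60))))

# Continuous outcome (blood pressure) -> linear regression
linear_fit = smf.ols("blood_pressure ~ age + treatment", data=df).fit()

# Yes/no outcome (has the disease or not) -> logistic regression
logistic_fit = smf.logit("has_disease ~ age + treatment", data=df).fit(disp=False)

# Time-to-event outcomes would call for a Cox proportional hazards model
# (e.g., the lifelines package), and repeated measurements on the same person
# for a mixed effects model (smf.mixedlm); neither is shown here.
```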

Basic regression model structure

Outcome = intercept + (Part 1 * Part 2) + … + (Part 1 * Part 2) + error
Y = c + β1*X1 + … + βj*Xj + ε
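
As a purely illustrative numeric example of the equation, here is a short Python snippet with made-up values for the intercept, the parameters, the factors, and the error.

```python
# Hypothetical numbers: intercept c, two parameters (Part 1), two factors (Part 2).
c = 100.0                  # intercept: what Y would be if every X were zero
beta1, beta2 = 0.5, -3.0   # Part 1: parameters the model estimates
x1, x2 = 60, 1             # Part 2: this person's factor values (e.g., age 60, treated = 1)
epsilon = 2.4              # error: person-to-person scatter the model cannot explain

y = c + beta1 * x1 + beta2 * x2 + epsilon   # 100 + 30 - 3 + 2.4 = 129.4
print(y)
```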

Outcome: Y. Its values are supplied by the data you collected on the outcome during your study.

Intercept: c, what Y (the outcome) would be if all X’s were zero. It is a constant (it does not change from one person to the next within a model). Most of the time the intercept does not play a role in interpreting the results of the study.

Part 2: X, a factor you collected data on during your study (e.g., age, sex, disease severity, treatment given) and whose relationship with the outcome you want to examine. You can have multiple factors (X’s) in your model, but not too many. If there are too many X’s, the model will either fail to produce results or produce nonsense results. The maximum number of X’s depends on aspects of the specific study, such as the type of outcome and how many people are in it. The subscript “1” means it is the first factor in your model; the subscript “j” means it is the last. The data in your study supply the values for the X’s.
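
A small, contrived simulation (in Python, with statsmodels, and with numbers chosen only for illustration) of what “too many X’s” can look like: with almost as many factors as people, the model will appear to explain an outcome that is pure random noise.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20                                  # only 20 people in the study
X = rng.normal(size=(n, 18))            # 18 factors that are pure random noise
y = rng.normal(size=n)                  # an outcome unrelated to any of them

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.rsquared)                     # close to 1: the model appears to "explain" noise
```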

Part 1: β, the “parameter,” such as a slope, which your model is trying to find (estimate). The parameter represents the amount and type of influence the factor (X) has on the outcome (Y). Each β is “paired” with, and relates to, a specific X. If the estimated parameter is zero, or its standard error is so wide that it cannot be shown to be different from zero, the conclusion is that X does not have an effect on Y, at least based on the data you collected.
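
Here is a hypothetical sketch of how the estimated βs and their standard errors are read off a fitted model; the data are simulated and the variable names are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"age": rng.uniform(40, 80, 300),
                   "treated": rng.integers(0, 2, 300)})
# Simulated outcome: age has a real effect (true beta = 0.8); treatment has none (true beta = 0).
df["score"] = 20 + 0.8 * df["age"] + rng.normal(0, 5, 300)

fit = smf.ols("score ~ age + treated", data=df).fit()
print(fit.params)      # the estimated betas (plus the intercept)
print(fit.bse)         # their standard errors
print(fit.conf_int())  # 95% confidence intervals; "treated" should straddle zero
print(fit.pvalues)     # tests of whether each beta differs from zero
```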

Error: ε, the difference between an outcome that was actually observed and what the model predicted the outcome would be. This is an acknowledgement that even though there may be a relationship between a factor and an outcome, you do not see exactly the same relationship from one person to the next. For an individual, the difference between what the model said the outcome should be (the prediction) and what was actually recorded in the study (the observation) is called a “residual.” The model is fit by making the residuals, taken together, as small as possible; in ordinary least squares regression, for example, by minimizing the sum of the squared residuals. Statisticians often examine the residuals to see if they display patterns indicating that the model is biased or does not adequately represent the data that was “fed” to it.
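
A brief sketch of examining residuals from a fitted linear regression (again with simulated data). Note that in ordinary least squares with an intercept the residuals sum to essentially zero by construction, so the informative check is whether they show patterns.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 5 + 2 * x + rng.normal(0, 1, 200)      # a true straight-line relationship plus noise

fit = sm.OLS(y, sm.add_constant(x)).fit()
residuals = fit.resid                       # observed minus predicted, one per person
print(residuals.sum())                      # essentially zero for OLS with an intercept

# A common diagnostic: plot residuals against the fitted (predicted) values and look
# for curvature, funnels, or other patterns suggesting the model is misspecified.
# import matplotlib.pyplot as plt
# plt.scatter(fit.fittedvalues, residuals); plt.axhline(0); plt.show()
```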

Why It’s Important

In non-experimental situations many interrelationships exist between a set of factors (such as age, BMI, genetic makeup, smoking behavior, dietary habits) and outcomes of interest. For example, a factor may appear to have a relationship with an outcome, but the relationship is “fake” or distorted because another factor, which has the “real” relationship with the outcome, is also associated with the first factor. This is called confounding. Regression models can help tease apart which relationships are more likely to be real and which are more likely to be fake. For example, a study could include a factor measuring the amount of gray hair someone has, another factor for the person’s age, and an outcome of cardiovascular disease (the person has it or does not). If a researcher did not consider age, a statistical test or model might show a strong relationship between having a lot of gray hair and having cardiovascular disease, and this naïve researcher might conclude, “gray hair causes cardiovascular disease.” However, if the researcher then adds the person’s age to the statistical model, age would show a strong relationship with cardiovascular disease while having a lot of gray hair would not.
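
The gray-hair example can be mimicked with a small, entirely made-up simulation: gray hair is made to track age, and only age actually drives cardiovascular disease. The unadjusted model makes gray hair look important; adding age makes that apparent effect shrink away.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2000
age = rng.uniform(30, 85, n)
gray_hair = ((age + rng.normal(0, 5, n)) > 60).astype(int)  # gray hair mostly tracks age
p_cvd = 1 / (1 + np.exp(-(-8 + 0.1 * age)))                 # CVD risk driven by age only
cvd = rng.binomial(1, p_cvd)
df = pd.DataFrame({"age": age, "gray_hair": gray_hair, "cvd": cvd})

naive = smf.logit("cvd ~ gray_hair", data=df).fit(disp=False)
adjusted = smf.logit("cvd ~ gray_hair + age", data=df).fit(disp=False)

print(naive.params["gray_hair"])     # large: gray hair looks strongly related to CVD
print(adjusted.params["gray_hair"])  # shrinks toward zero once age is in the model
print(adjusted.params["age"])        # age carries the real relationship
```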

Caveat: developing statistical regression models is a rather complex undertaking. The model building process is best done after the researchers have carefully considered the causal pathway from the factor(s) of interest to the outcome and designed their study accordingly. The model is only as good as the thoughtfulness and thoroughness of this process. Models do whatever the statistical programmer tells them to do. They cannot give feedback such as “that’s an intermediate outcome and not a confounder,” “you’re missing an important interaction effect that you should account for in the analysis,” “you’ve included too many factors,” or “you need to specify that each person is contributing more than one observation to the dataset.” Garbage in, garbage out.