Linear modeling is a highly effective tool for biomedical research due to its simplicity and interpretability. In these models, coefficients represent the amount of change expected in the dependent variable for a one-unit change in the corresponding independent variable, while holding all other variables constant.
For example, in a study exploring the relationship between exercise and cholesterol level, the amount of exercise constitutes the independent variable, and cholesterol level is the dependent variable.
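As a minimal sketch of how such a coefficient is read, the snippet below fits a straight line to a handful of invented exercise and cholesterol values. The numbers, the variable names, and the use of NumPy's `polyfit` are assumptions made for illustration; they are not data or code from any study.

```python
import numpy as np

# Hypothetical values invented for illustration only (not study data).
exercise_hours = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])            # hours per week
cholesterol = np.array([221.0, 214.0, 211.0, 204.0, 201.0, 194.0])   # mg/dL

# Ordinary least-squares fit of: cholesterol = slope * exercise_hours + intercept
slope, intercept = np.polyfit(exercise_hours, cholesterol, deg=1)

# The slope is the expected change in cholesterol (mg/dL) for each
# additional hour of exercise per week in this toy data set.
print(f"slope = {slope:.2f} mg/dL per hour, intercept = {intercept:.2f} mg/dL")
```

With several independent variables, each coefficient would be read the same way, with the other variables held constant.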
Biomedical studies often collect many measurements from a small number of samples. Consequently, even a few outlying samples can strongly influence a fitted linear model.
Outliers are observations that deviate markedly from the other values. They can arise from measurement error (for example, inaccurate lab equipment or human error during data collection) or from inherent biological variability.
In correlation analysis, outliers can dramatically distort the perceived relationship between two variables. If not properly managed, these outliers can generate a misleading correlation coefficient, overestimating or underestimating the true relationship.
This distortion can lead to incorrect interpretations and potentially misguided subsequent actions.
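The effect can be illustrated with a small, made-up example: ten points that follow a nearly perfect linear trend, and the same points with a single value replaced by a gross error. The data are invented, and `np.corrcoef` is used here simply as one common way to compute the Pearson correlation coefficient.

```python
import numpy as np

# Hypothetical data: a nearly perfect linear trend with a small alternating wobble.
x = np.arange(10, dtype=float)
wobble = np.array([0.5, -0.5] * 5)
y_clean = 2.0 * x + 1.0 + wobble

# Same data, but the last observation is replaced by a gross error.
y_outlier = y_clean.copy()
y_outlier[-1] = -20.0

# Pearson correlation coefficient with and without the outlier.
r_clean = np.corrcoef(x, y_clean)[0, 1]
r_outlier = np.corrcoef(x, y_outlier)[0, 1]

print(f"r without outlier: {r_clean:.3f}")
print(f"r with one outlier: {r_outlier:.3f}")
```

In this toy case a single corrupted value out of ten is enough to turn a strong positive correlation into a weak one of the opposite sign.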
To address this problem, we developed the RANSAC-lm webtool, which lets users perform correlation analysis that is robust to outliers. RANSAC (RANdom SAmple Consensus) is a robust, iterative machine learning algorithm that estimates the parameters of a mathematical model from observed data that may be contaminated with outliers.
It works by repeatedly drawing random subsets of the original data, fitting a model to each subset, and counting how many observations fall within a defined error tolerance of that model (its inliers). The model supported by the largest set of inliers is then selected.
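The procedure can be sketched for a straight-line model as follows. This is a minimal illustration of the general RANSAC idea under stated assumptions, not the implementation used by RANSAC-lm: the function name `ransac_line`, its parameters (`n_iter`, `tol`, `seed`), the final least-squares refit on the inliers, and the synthetic data are all hypothetical choices.

```python
import numpy as np

def ransac_line(x, y, n_iter=200, tol=1.0, seed=0):
    """Fit y = slope * x + intercept with a basic RANSAC loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(x.size, dtype=bool)
    for _ in range(n_iter):
        # 1. Randomly select a minimal subset: two points define a line.
        i, j = rng.choice(x.size, size=2, replace=False)
        if x[i] == x[j]:
            continue  # vertical line, cannot be written as y = slope * x + intercept
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # 2. Observations within the error tolerance are this candidate's inliers.
        mask = np.abs(y - (slope * x + intercept)) <= tol
        # 3. Keep the candidate supported by the largest set of inliers.
        if mask.sum() > best_mask.sum():
            best_mask = mask
    # Refit by ordinary least squares using only the selected inliers.
    slope, intercept = np.polyfit(x[best_mask], y[best_mask], deg=1)
    return slope, intercept, best_mask

# Synthetic data: true line y = 2x + 1 with small noise and six gross outliers.
data_rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 * x + 1.0 + data_rng.normal(0.0, 0.2, size=x.size)
y[::7] += 15.0  # corrupt every seventh observation

ols_slope, ols_intercept = np.polyfit(x, y, deg=1)  # ordinary fit on all points
slope, intercept, inliers = ransac_line(x, y)

print(f"OLS:    slope = {ols_slope:.2f}, intercept = {ols_intercept:.2f}")
print(f"RANSAC: slope = {slope:.2f}, intercept = {intercept:.2f}, inliers = {inliers.sum()}")
```

The tolerance `tol` determines which observations count as inliers, so it should be chosen relative to the expected measurement noise; the number of iterations trades run time against the chance of drawing at least one outlier-free subset.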