Latest at Infometrix
Thank you for subscribing to our quarterly newsletter, where we bring you current information about chemometrics methods, training and relevant applications for industry, regulation and research. Our feature article is part one of a two-part series discussing robust statistical analysis, "What is it and Why do We Care?" We also have an excellent tech tip about the modeling process and the use of additional diagnostics in Pirouette. In the After Further Review piece, Brian Rohrback ponders the question of quality analytics in a changing world of instruments and mathematical methods.
We concluded another three-day chemometrics training course this past October 19-21. Infometrix is in its 22nd year of teaching the course, and we are pleased to offer it again along with our advanced methods course. Below you will find our offerings for 2017. Please mark your calendar and register now to reserve a spot.
Advanced Spectroscopic Techniques Course
- April 4-5, 2017
General Chemometrics Course
- October 18-20, 2017
In the latest volume of INFORM journal, the article "The Highs and Lows of Cannabis Testing" references work by Infometrix, Inc. and GW Pharmaceuticals on quality control testing with chemometrics fingerprint analysis.
Infometrix, Inc. recently presented papers at the Gulf Coast Conference in Houston, TX. Here are the abstracts for the papers.
Abstract # 135
Contrasting Spectroscopy and Chromatography for Motor Fuel Assessments
Michael Roberto, Infometrix, Inc.
Abstract # 147
Making SimDist Faster and More Robust
Brian Rohrback, Infometrix, Inc.
Abstract # 198
Re-engineering Calibration in Optical Spectroscopy
Michael Roberto, Infometrix, Inc.
Abstract # 199
The Chemometrics Role in Data Analysis
Brian Rohrback, Infometrix, Inc.
If you would like additional information relating to the presentations, please email your request to email@example.com
Robust Statistical Analysis Part 1 - What is it and Why do We Care?
Robust statistical analysis makes use of measures that accurately describe the majority of the data. In doing so, it more readily reveals outliers, that is, data points that do not conform to the majority of the data. Robust measures are available for the most basic statistics: for example, the location (central tendency) can be estimated by the median rather than the mean, and the usual estimator of scale, the standard deviation, can be replaced by the MAD (median absolute deviation from the median). Robust measures are also available for regression analysis and, more recently, even multivariate calibration algorithms have been robustified.
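As a minimal sketch of these basic robust measures, the Python snippet below computes the median and the MAD and uses them to score each observation. The readings are made up for illustration, the factor 1.4826 is the usual constant that makes the MAD comparable to the standard deviation for normally distributed data, and the cutoff of 3 is a common convention rather than anything prescribed here.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def classical_scores(values):
    """Distance from the mean in units of the standard deviation."""
    center = statistics.mean(values)
    scale = statistics.stdev(values)
    return [(v - center) / scale for v in values]

def robust_scores(values):
    """Distance from the median in units of the scaled MAD."""
    center = statistics.median(values)
    scale = 1.4826 * mad(values)  # consistency factor for normal data
    return [(v - center) / scale for v in values]

# Hypothetical readings with one suspicious value (25.0).
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 25.0]
cutoff = 3.0  # conventional threshold, an assumption of this sketch

classical_flags = [v for v, s in zip(readings, classical_scores(readings))
                   if abs(s) > cutoff]
robust_flags = [v for v, s in zip(readings, robust_scores(readings))
                if abs(s) > cutoff]

print("mean / stdev :", statistics.mean(readings), statistics.stdev(readings))
print("median / MAD :", statistics.median(readings), mad(readings))
print("flagged (classical):", classical_flags)
print("flagged (robust)   :", robust_flags)
```

With these made-up readings, the single high value inflates the standard deviation enough that it slips under the classical cutoff, while the median and MAD are barely moved and the same value is flagged clearly.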
Why should we care about robust analysis? This question is best addressed via an example. Let’s assume we have 30 values measured from a flow meter and we want to summarize this measure in a single statistic. The standard approach is to compute the mean value. The figure below shows the effect of outliers on the mean and median.
With no outliers the mean and median are almost the same, but with just one outlier the mean has been shifted by 64%. With more outliers there is an even greater effect on the mean, while the median is mostly unaffected. In fact, a single sufficiently extreme observation can move the mean arbitrarily far from the value found without that observation. The same effect is observed for the standard deviation. So we should care about robust analysis if we want to generate reasonable representations of our data when outliers are present. And it is really a question of when, not if, outliers are present, as most real-world data sets contain outliers.
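The "arbitrarily far" behavior is easy to reproduce. The sketch below uses 30 hypothetical flow-meter readings (not the data behind the figure) and replaces one of them with increasingly extreme values, recording the mean and median each time.

```python
import statistics

# 30 hypothetical flow-meter readings scattered around 50.0.
baseline = [50.0 + 0.1 * (i % 7 - 3) for i in range(30)]

def with_outlier(data, value):
    """Return a copy of the data with the last reading replaced by `value`."""
    return data[:-1] + [value]

# (outlier value, mean, median) for increasingly extreme replacements.
results = []
for outlier in (100.0, 1_000.0, 1_000_000.0):
    contaminated = with_outlier(baseline, outlier)
    results.append((outlier,
                    statistics.mean(contaminated),
                    statistics.median(contaminated)))

print("clean: mean=%.3f median=%.3f"
      % (statistics.mean(baseline), statistics.median(baseline)))
for outlier, mean, median in results:
    print("outlier=%g: mean=%.3f median=%.3f" % (outlier, mean, median))
```

In this sketch the median stays at 50.0 no matter how extreme the replaced reading becomes, while the mean grows without limit along with it.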
Why is the mean so affected by outliers? The mean is a least squares estimator of location, that is, the point with the minimum sum of squared differences from all the observations. To keep that sum small when there are very low or very high valued observations, the mean is pulled low or high and thus no longer represents the majority of the data. The median simply describes the middle point of the data and is unaffected by how far other data may be from this middle point.
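For readers who want the argument in symbols, here is a short worked version. The notation (observations x_1, ..., x_n, candidate location m, and the functions S and A) is introduced here for illustration only.

```latex
% Least squares location: minimizing S gives the mean.
\[
S(m) = \sum_{i=1}^{n} (x_i - m)^2,
\qquad
\frac{dS}{dm} = -2 \sum_{i=1}^{n} (x_i - m) = 0
\;\Longrightarrow\;
m = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}.
\]

% Least absolute deviation location: minimizing A gives the median.
\[
A(m) = \sum_{i=1}^{n} \lvert x_i - m \rvert,
\qquad
\frac{dA}{dm} = -\sum_{i=1}^{n} \operatorname{sign}(x_i - m)
\quad (m \neq x_i \text{ for all } i).
\]
```

In the first case every observation pulls on m in proportion to its distance, so one distant point can dominate the balance. In the second case every observation contributes only +1 or -1 regardless of distance, so the sum is balanced when as many observations lie above m as below it, which is the median.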
A measure of the robustness of a statistic is the breakdown point. The breakdown point is the smallest fraction of anomalous data that can render the statistic useless. The formal mathematical definition states that the mean has a breakdown of 0%, reflecting the very sensitive nature of this estimator to aberrant data. In fact, all least squares estimators have a breakdown of 0%. The fact that least squares estimators have 0% breakdown is the reason standard regression analysis is so sensitive to outliers. Shown below is an example of a standard least squares regression analysis in which three points 'a', 'b' and 'c' appear from the predicted versus known plot on the left to be aberrant. But which point or points are truly outliers?
A further examination of the standardized residuals vs. predicted plot on the right indicates that observation 'a' is abnormal while 'b' and 'c' are normal. If a robust regression analysis is conducted the following is observed.
Sample 'a' is in fact consistent with the majority of the data, while samples 'b' and 'c' are extreme outliers. This is an example of masking, in which outlier samples work together to hide one another. It is also an example of swamping, which is even more insidious than masking: the outlier samples have made a normal observation appear to be abnormal. For simple regression this problem may be apparent from a visual examination, but if the data are multivariate in nature, the problems of masking and swamping are much more difficult to detect without the aid of robust techniques. So again, we should care about robust analysis because it more readily reveals the outliers that are undoubtedly in your data, and because it can prevent us from removing good samples in our hunt for outliers when using standard methods.
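Masking and swamping can be reproduced on a small scale. The sketch below is not the analysis behind the figures, and the article does not name the robust regression method used; as a stand-in it uses the Theil-Sen estimator (the median of all pairwise slopes), one simple robust alternative to least squares. The data are hypothetical: ten points on the line y = x plus two outlying points at large x.

```python
import statistics
from itertools import combinations

def ols_fit(x, y):
    """Ordinary least squares slope and intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen_fit(x, y):
    """Theil-Sen fit: median pairwise slope, then median intercept."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    slope = statistics.median(slopes)
    intercept = statistics.median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept

def residuals(x, y, slope, intercept):
    return [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Hypothetical data: indices 0-9 follow y = x, indices 10-11 are outliers.
x = [float(v) for v in range(1, 11)] + [20.0, 21.0]
y = [float(v) for v in range(1, 11)] + [5.0, 5.0]

ols_slope, ols_intercept = ols_fit(x, y)
ts_slope, ts_intercept = theil_sen_fit(x, y)
ols_res = residuals(x, y, ols_slope, ols_intercept)
ts_res = residuals(x, y, ts_slope, ts_intercept)

# Index of the worst-fitting point under each method.
ols_worst = max(range(len(x)), key=lambda i: abs(ols_res[i]))
ts_worst_two = sorted(range(len(x)), key=lambda i: abs(ts_res[i]))[-2:]

print("OLS slope:", round(ols_slope, 3), "worst point index:", ols_worst)
print("Theil-Sen slope:", ts_slope, "worst point indices:", sorted(ts_worst_two))
```

In this sketch the least squares line is dragged toward the two outlying points, so they end up with modest residuals (masking) while a perfectly good point shows the largest residual (swamping). The Theil-Sen line follows the majority, and the two outlying points stand out with by far the largest residuals.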
Next time we will discuss techniques for robust multivariate calibration and the unique approach Infometrix is taking to optimize the use of these robust techniques.
After Further Review...
OK, pop quiz: The check engine light goes on in your car. Truth be told, it isn’t clear how long that light has been on. What should you do?
- Place a square of duct tape over the light so it doesn’t bother you again.
- Blame the car manufacturer for installing a faulty light and ignore it; maybe later, you will pen a strongly-worded letter.
- Buy a new car.
- Take the car to a mechanic to identify what is wrong and fix it.
I talk to a lot of companies about their quality control systems, both those that employ chemometrics and those that should. It is tough to embrace change. Even well-designed quality assessment measures can drift over time, and if routine system evaluation and maintenance are not performed, unexpected variance (in the ingredients, in the process, in the instrumentation, in the caretakers) can lead to the equivalent of an illuminated check-engine light. Amazingly, the corporate response to the pop quiz is most commonly one of the first three choices. Why is that? The first two are non-action items and by far the easiest course of (in)action in the short run. The third is an overreaction, most often tied to a persistent perceived fault in an instrument or class of instruments (e.g., we have not been successful with FTNIR, let’s switch to Raman).
Often the quality assessment is based on an older heuristic that doesn’t really apply to the current manufacturing procedure or the modern slate of products. Sometimes there is a blame game going on that has devolved into an R&D versus manufacturing engineering conflict. It also can be challenging to drag a few unbudgeted dollars (this sort of thing is rarely budgeted) out of management (see any recent Dilbert comic strip) that has trained itself to look harder at expenses than at the benefits. To be fair, it is often difficult to balance reasonably-well-documented costs against the less-certain advantages of an improved set of diagnostics. The knee-jerk reaction virtually always favors saying “No”.
The first step toward solving this dilemma is to recognize when it is happening. It is also helpful to encourage a broader understanding that the analyzer and chemometrics are just like the other parts of the system; they need to be maintained too.
Brian G. Rohrback
We Need Your Help
We want to supply you with valuable information in our quarterly newsletters and monthly updates, and we hope that we have. Your input is very valuable to us. Please help us improve the content of our messages by providing comments, good or bad. Visit our web site, www.infometrix.com or email us at firstname.lastname@example.org and tell us how we can better serve you.