The Premier Chemometrics Company

Nov 2016
No. 12

Latest at Infometrix

Thank you for subscribing to our quarterly newsletter, where we bring you current information about chemometrics methods, training and relevant applications for industry, regulation and research. Our feature article is part one of a two-part series on robust statistical analysis: what is it and why do we care? We also have an excellent tech tip about the modeling process and the use of additional diagnostics in Pirouette. In the After Further Review piece, Brian Rohrback ponders the question of quality analytics in a changing world of instruments and mathematical methods.
We concluded another three-day chemometrics training course this past October 19th-21st. Infometrix is in its 22nd year of teaching this course, and we are pleased to offer it again along with our advanced methods course. Our offerings for 2017 are listed below; please mark your calendar and register now to reserve a spot.

Advanced Spectroscopic Techniques Course - April 4-5, 2017
General Chemometrics Course - October 18-20, 2017
In the latest volume of INFORM journal, the article "The Highs and Lows of Cannabis Testing" references work by Infometrix, Inc. and GW Pharmaceuticals on quality control testing with chemometrics fingerprint analysis.
Infometrix, Inc. recently presented papers at the Gulf Coast Conference in Houston, TX. Here are the abstracts for the papers.

Abstract # 135
Contrasting Spectroscopy and Chromatography for Motor Fuel Assessments
Michael Roberto, Infometrix, Inc.

Abstract # 147
Making SimDist Faster and More Robust
Brian Rohrback, Infometrix, Inc.

Abstract # 198
Re-engineering Calibration in Optical Spectroscopy
Michael Roberto, Infometrix, Inc.

Abstract # 199
The Chemometrics Role in Data Analysis
Brian Rohrback, Infometrix, Inc.

If you would like additional information relating to the presentations, please email us your request.
Robust Statistical Analysis Part 1 - What is it and Why do We Care?

Robust statistical analysis makes use of measures that accurately describe the majority of the data. In doing so, it more readily reveals outliers, that is, data points that do not conform to the majority of the data. Robust measures are available for the most basic of statistics: for example, the location or central tendency estimate can be given by the median rather than the mean, and as the estimator of scale, the standard deviation can be replaced by the MAD (median absolute deviation from the median). Robust measures are also available for regression analysis and, more recently, even multivariate calibration algorithms have been robustified.
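The robust counterparts mentioned above are easy to compute. Below is a minimal sketch using Python's standard statistics module; the mad helper and the example data are illustrative (the 1.4826 factor makes the MAD consistent with the standard deviation for normally distributed data):

```python
import statistics

def mad(values, scale=1.4826):
    """Median absolute deviation from the median, a robust scale estimate.

    The default scale factor of 1.4826 makes the MAD comparable to the
    standard deviation when the data are normally distributed."""
    med = statistics.median(values)
    return scale * statistics.median(abs(v - med) for v in values)

# Hypothetical data: five consistent readings and one gross outlier
data = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]

print(statistics.mean(data))    # ≈ 17.5 — dragged far from the majority
print(statistics.median(data))  # ≈ 10.05 — still describes the majority
print(statistics.stdev(data))   # inflated by the outlier
print(mad(data))                # ≈ 0.22 — reflects the true spread
```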

Why should we care about robust analysis? This question is best addressed via an example. Let’s assume we have 30 values measured from a flow meter and we want to summarize this measure in a single statistic. The standard approach is to compute the mean value. The figure below shows the effect of outliers on the mean and median.
With no outliers the mean and median are almost the same, but with just one outlier the mean is shifted by 64%. With more outliers the effect on the mean is even greater, while the median is mostly unaffected. In fact, a single extreme observation can pull the mean arbitrarily far from the value found without that point included. The same effect is observed for the standard deviation. So we should care about robust analysis if we want to generate reasonable representations of our data when outliers are present. And it is really a question of when, not if, outliers are present, as most real-world data sets contain them.
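The flow-meter scenario can be simulated in a few lines. This sketch uses invented numbers (30 readings near a hypothetical value of 100) purely to show that the mean follows a single outlier without bound while the median barely moves:

```python
import random
import statistics

random.seed(1)
# 30 simulated flow-meter readings near 100 (hypothetical units)
readings = [100 + random.gauss(0, 1) for _ in range(30)]
clean_mean = statistics.mean(readings)
clean_median = statistics.median(readings)

# Replace one reading with an ever more extreme value: the mean is
# dragged arbitrarily far, while the median stays near 100.
for bad in (200, 1_000, 1_000_000):
    corrupted = readings[:-1] + [bad]
    print(f"{bad:>9}: mean={statistics.mean(corrupted):12.2f} "
          f"median={statistics.median(corrupted):8.2f}")
```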

Why is the mean so affected by outliers? The mean is the least squares estimator of location, that is, the point that minimizes the sum of squared differences from all the observations. To keep that sum at a minimum when there are very low or very high valued observations, the mean is pulled low or high and thus does not represent the majority of the data. The median simply describes the middle point of the data and is unaffected by how far other data may lie from this middle point.
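This least-squares property can be verified numerically. The sketch below (with an invented data set containing one outlier) scans candidate locations and confirms that the sum of squared differences is minimized at the mean, while the sum of absolute differences is minimized at the median:

```python
data = [1.0, 2.0, 3.0, 4.0, 100.0]  # four consistent points, one outlier

def sse(c):
    """Sum of squared differences from candidate location c."""
    return sum((v - c) ** 2 for v in data)

def sad(c):
    """Sum of absolute differences from candidate location c."""
    return sum(abs(v - c) for v in data)

# Scan candidate locations from 0.0 to 100.0 in steps of 0.1.
candidates = [c / 10 for c in range(0, 1001)]
best_sse = min(candidates, key=sse)  # 22.0 — the mean, pulled by the outlier
best_sad = min(candidates, key=sad)  # 3.0 — the median, unaffected
print(best_sse, best_sad)
```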

A measure of the robustness of a statistic is its breakdown point: the smallest fraction of anomalous data that can render the statistic useless. Formally, the mean has a breakdown point of 0%, reflecting this estimator's extreme sensitivity to aberrant data. In fact, all least squares estimators have a 0% breakdown point, which is why standard regression analysis is so sensitive to outliers. Shown below is an example of a standard least squares regression analysis in which three points, 'a', 'b' and 'c', appear from the predicted versus known plot on the left to be aberrant. But which point or points are truly outliers?
A further examination of the standardized residuals vs. predicted plot on the right indicates that observation 'a' is abnormal while 'b' and 'c' are normal. If a robust regression analysis is conducted the following is observed.
Sample 'a' is in fact consistent with the majority of the data, while samples 'b' and 'c' are extreme outliers. This is an example of masking, in which outlier samples work together to hide each other. It is also an example of swamping, even more insidious than masking, in which the outlier samples make a normal observation appear abnormal. For simple regression this problem may be apparent from a visual examination, but if the data are multivariate the problems of masking and swamping are much more difficult to detect without the aid of robust techniques. So again, we should care about robust analysis because it more readily reveals the outliers that are undoubtedly in our data, and because it can keep us from removing good samples in our hunt for outliers when using standard methods.
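The contrast between least squares and a robust fit can be demonstrated with a small simulation. This sketch uses Theil-Sen regression (median of pairwise slopes) as one simple robust alternative; it is an illustrative choice, not necessarily the method discussed above. The data are invented: eight points on the line y = 2x plus two colluding outliers, analogous to 'b' and 'c':

```python
import statistics
from itertools import combinations

def ols(x, y):
    """Ordinary least squares slope and intercept (0% breakdown point)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

def theil_sen(x, y):
    """Median of all pairwise slopes: a simple robust line (~29% breakdown)."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    slope = statistics.median(slopes)
    intercept = statistics.median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept

# Eight points on y = 2x, plus two colluding outliers at x = 9 and 10
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 40.0, 42.0]

print(ols(x, y))        # slope ≈ 4.13: pulled up by the two outliers
print(theil_sen(x, y))  # slope 2.0, intercept 0.0: matches the majority
```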

Next time we will discuss techniques for robust multivariate calibration and the unique approach Infometrix is taking to optimize the use of these robust techniques.
After Further Review...

OK, pop quiz: The check engine light goes on in your car. Truth be told, it isn't clear how long that light has been on. What should you do?
  • Place a square of duct tape over the light so it doesn’t bother you again.
  • Blame the car manufacturer for installing a faulty light and ignore it; maybe later, you will pen a strongly-worded letter.
  • Buy a new car.
  • Take the car to a mechanic to identify what is wrong and fix it.
I talk to a lot of companies about their quality control systems, both those that employ chemometrics and those that should. It is tough to embrace change. Even well-designed quality assessment measures can drift over time, and if routine system evaluation and maintenance is not performed, unexpected variance (in the ingredients, in the process, in the instrumentation, in the caretakers) can lead to the equivalent of an illuminated check-engine light. Amazingly, the corporate response to the pop quiz is most commonly one of the first three choices. Why is that? The first two are non-actions and by far the easiest course of (in)action in the short run. The third is an overreaction, most often tied to a persistent perceived fault in an instrument or class of instruments (e.g., we have not been successful with FTNIR, let's switch to Raman).

Often the quality assessment is based on an older heuristic that no longer applies to the current manufacturing procedure or the modern slate of products. Sometimes there is a blame game going on that has devolved into an R&D versus manufacturing engineering conflict. It can also be challenging to drag a few unbudgeted dollars (this sort of thing is rarely budgeted) out of management (see any recent Dilbert comic strip) that has trained itself to look harder at expenses than at benefits. To be fair, it is often difficult to balance reasonably-well-documented costs against the less-certain advantages of an improved set of diagnostics. The knee-jerk reaction virtually always favors saying "No".
The first step toward solving this dilemma is to recognize when it is happening. It is also helpful to encourage a broader understanding that the analyzer and chemometrics are just like the other parts of the system; they need to be maintained too.

Brian G. Rohrback
We Need Your Help

We want to supply you with valuable information in our quarterly newsletters and monthly updates, and we hope that we have. Your input is very valuable to us. Please help us improve the content of our messages by providing comments, good or bad. Visit our web site or email us to tell us how we can better serve you.

In This Newsletter
-Feature: Robust Statistical Analysis Part 1-What is it and Why do We Care?
-Upcoming Events
-Tech Tip: Believe Your Model - or Not
-After Further Review...

Upcoming Events
February 27 - March 2, 2017

Pittcon 2017
March 5-9, 2017

Infometrix Advanced Training Course
April 4-5, 2017

ISA-AD (Instrument Society of America-Analyzer Division)
April 23-27, 2017

CPAC Spring Meeting
May 1-2, 2017

Tech Tip: Believe Your Model - or Not
You’ve collected what you believe are good spectra, and you trust that the reference method values are reliable. After evaluating different pre-treatments, you decide to optimize the data using a second derivative. Removing a couple of outlier samples improves the model even further, such that the Y Fit plot (predicted vs. measured property values) is remarkable.

Maybe. But maybe not. As with all multivariate methods, it is imperative that you not treat the model building phase as a black box. Several important diagnostics are computed during the calibration process, and each of them should be examined to be sure that what you see makes sense, in effect a form of visual validation.

The above scenario actually occurred with a user who was very pleased with such a remarkable fit. Although the spectra themselves did not have the strongest signal, with very broad bands, the derivative seemed to improve the model. One caveat with derivatives is that each one can reduce the signal-to-noise ratio by as much as two orders of magnitude. In this user's case, the signal remaining in the second derivative did show some structure, but at the expense of dramatically increased noise.
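The noise amplification is easy to see with a simulated broad band. This sketch uses a simple point-wise second difference as a stand-in for a derivative pre-treatment (real filters such as Savitzky-Golay smooth as they differentiate, but the trend is the same); the band shape, width and noise level are all invented for illustration:

```python
import math
import random

random.seed(0)
n = 500
# Hypothetical broad spectral band (Gaussian peak, width ~80 points)
signal = [math.exp(-((i - 250) / 80) ** 2) for i in range(n)]
noise = [random.gauss(0, 0.01) for _ in range(n)]
spectrum = [s + e for s, e in zip(signal, noise)]

def second_diff(y):
    """Point-wise second difference, a crude derivative filter."""
    return [y[i - 1] - 2 * y[i] + y[i + 1] for i in range(1, len(y) - 1)]

def snr(sig, noisy):
    """Crude signal-to-noise: peak signal amplitude over noise std dev."""
    resid = [a - b for a, b in zip(noisy, sig)]
    mu = sum(resid) / len(resid)
    sd = math.sqrt(sum((r - mu) ** 2 for r in resid) / len(resid))
    return max(abs(v) for v in sig) / sd

snr_raw = snr(signal, spectrum)
snr_d2 = snr(second_diff(signal), second_diff(spectrum))
# The broad band's second derivative is tiny while the noise grows,
# so the S/N collapses by orders of magnitude.
print(snr_raw, snr_d2)
```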

In fact, a quick look at the regression vector—which showed no structure whatsoever, rather a continuous vector of noise spikes—made it clear that this regression was completely spurious. It is known that coupling large numbers of measurements with a small sample set runs a real risk of spurious correlations. Combine that with a set of measurements containing relatively low signal to noise, and the risk increases even further. Bottom line: even though a calibration may look great, the model must be evaluated thoroughly. This includes spending time with all of the computed diagnostics as well as validating with an external data set.
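The spurious-correlation risk is simple to demonstrate. In this hypothetical simulation, 200 pure-noise "wavelength" variables are generated for only 10 samples; some noise variable ends up strongly correlated with the property vector purely by chance:

```python
import random

random.seed(42)
n_samples, n_vars = 10, 200

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a)
           * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

# A random "property" vector and many pure-noise predictor variables
y = [random.gauss(0, 1) for _ in range(n_samples)]
X = [[random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_vars)]

# With far more variables than samples, the best chance correlation is large
best_r = max(abs(pearson(col, y)) for col in X)
print(best_r)
```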

"The greater danger for most of us lies not in setting our aim too high and falling short; but in setting our aim too low, and achieving our mark."

- Michelangelo, artist (1475-1564)

The Infometrix mission is to provide high quality, easy-to-use software for the handling of multivariate data.

Publications - Application of chemometrics to problems in a variety of research areas



Copyright © 2016 Infometrix, Inc., All rights reserved.