The Premier Chemometrics Company

May 2016
No. 10

Latest with Infometrix
In this issue of the Infometrix quarterly newsletter, we emphasize the practical use of chemometrics with an article about profiling complex samples for quality in a manufacturing process, and we review Infometrix’ years of experience with analytical instruments and software.

Infometrix has long been a supplier of consulting services and tools for quality analysis. We have taken our experience with chemometrics tools, calibration and classification modeling services, data organization and project management, and wrapped it up into a project-oriented product called the “Profiler”. The benefits of such a product are explained below in “Objective Profiling of Complex Samples”, where we cover the quality control issues of a pharmaceutical company manufacturing medical marijuana products.

Getting to the Profiler has been a long journey in Infometrix’ 38-year history. Our president, Brian Rohrback, explains in his “After Further Review” article how important it has been for us to have had relationships with nearly all of the major instrument manufacturers and software vendors in the analytical fields. That experience of working with data from all types of equipment, in nearly any field, has built our knowledge base to the point that we can now create and maintain turnkey quality analysis systems for many food, pharmaceutical, chemical, petroleum and materials manufacturers.

Our tech tip this quarter is on choosing the number of factors in a model and is a must-read for anyone maintaining models in a quality control environment. This and other tech tips are covered in detail in our chemometrics training courses, offered twice a year. Please visit our web site to find out more about our course offerings and dates.
Objective Profiling of Complex Samples
Infometrix has been involved in many projects that require processing of chromatographic data beyond the identification and quantitation of one or a few analytes. One of our favorite collaborations has been with GW Pharmaceuticals, the leading supplier of medicinal cannabinoid products, who posed a question that caused us to rethink how best to process data for a single substance originating from a set of instrumented analyses. The issue is to merge the information content from a series of chromatographic analyses into a single release metric: automating an objective pass-fail system.

In this case, we were charged with examining both the major and minor constituents of a botanical extract to ensure that product composition was “the same” from batch to batch. With several hundred components in several compound classes, a traditional upper/lower control limit scheme is exceedingly difficult (read: impossible) to build. We decided to use whole chromatograms of four different fractions as if they were spectra. Assembling a collection of previously analyzed, high-quality batches allows us to build PCA models that define good product.

The first task is to eliminate retention time variability that is intrinsic to all chromatographic runs. LineUp™ performs this function with a combination of technologies, including the chemometric algorithm known as Correlation Optimized Warping. The value is obvious in the chromatograms shown below.
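LineUp™'s implementation is proprietary, and the full Correlation Optimized Warping algorithm warps the time axis piecewise. As a minimal sketch of the underlying idea only (hypothetical function names, Python/NumPy), the rigid-shift alignment below slides one chromatogram against a reference and keeps the shift that maximizes correlation:

```python
import numpy as np

def align_by_cross_correlation(reference, signal, max_shift=50):
    """Shift `signal` to best match `reference` by maximizing correlation.

    A rigid-shift simplification of retention-time alignment; the real
    Correlation Optimized Warping algorithm warps segments piecewise.
    """
    best_shift, best_corr = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        candidate = np.roll(signal, shift)
        corr = np.corrcoef(reference, candidate)[0, 1]
        if corr > best_corr:
            best_corr, best_shift = corr, shift
    return np.roll(signal, best_shift), best_shift

# Example: a synthetic peak shifted by 10 points is realigned
t = np.linspace(0, 1, 500)
ref = np.exp(-((t - 0.5) ** 2) / 0.001)
drifted = np.roll(ref, 10)
aligned, shift = align_by_cross_correlation(ref, drifted)
```

Real chromatograms drift nonlinearly along the run, which is why a segment-wise warp such as COW, rather than a single global shift, is needed in practice.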

The next step is to create PCA models for each fraction that will be used to spot unusual profiles in subsequent QC runs. We employed both a 95% and a 99% confidence interval to label samples as “good” (inside the 95% zone) or “bad” (outside the 99% interval). Points falling between these cutoffs form a warning track, which can assist in identifying manufacturing situations that may lead to unacceptable product. This interpretation is applied to both the in-model and out-of-model diagnostics. Reporting of the two metrics, with the break points from good to warn and warn to bad, is depicted in the dashboard below.
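As a minimal sketch of the good/warn/bad scheme (not the actual limit calculation, which uses parametric confidence intervals), the Python/NumPy code below fits a PCA model on good batches, computes the out-of-model Q residual for new runs, and uses empirical 95% and 99% training-set percentiles as stand-in break points:

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit a PCA model on training chromatograms (rows = samples)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T          # center, loadings

def q_residual(X, mean, P):
    """Out-of-model diagnostic: squared residual after projection."""
    Xc = np.atleast_2d(X) - mean
    resid = Xc - Xc @ P @ P.T
    return (resid ** 2).sum(axis=1)

def classify(q, warn_limit, fail_limit):
    """Good inside the 95% zone, bad outside the 99%, warn between."""
    return np.where(q <= warn_limit, "good",
           np.where(q <= fail_limit, "warn", "bad"))

rng = np.random.default_rng(0)
good_batches = rng.normal(size=(50, 200))     # stand-in training data
mean, P = pca_fit(good_batches, n_components=3)
q_train = q_residual(good_batches, mean, P)
warn, fail = np.percentile(q_train, 95), np.percentile(q_train, 99)

outlier = good_batches[0] + 5.0               # grossly shifted profile
labels = classify(q_residual(outlier, mean, P), warn, fail)
```

A full implementation would apply the same logic to the in-model diagnostic (Hotelling's T²) as well, so that a sample must pass both tests to be released.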

The above represents only one of the four fractions used to release product. To complete the release process, we need to combine the evaluation of every replicate for each of the fractions obtained for the product. In a process known as model fusion, we combine the evaluations and weight them according to their importance to the drug’s efficacy. A system known as the Profiler™ was designed to perform this task. The Profiler™ represents an approach to chemometrics that will be discussed in a subsequent newsletter. If you want information prior to the next issue of this quarterly missive, please contact us and we can give you additional detail.
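The Profiler™'s actual fusion logic is not detailed here; the sketch below only illustrates the general idea of a weighted combination. The fraction names, weights and release threshold are all hypothetical:

```python
def fuse_fractions(results, weights):
    """Combine per-fraction verdicts into one release score.

    results: dict of fraction -> score in [0, 1]
             (1 = well inside the good zone, 0 = clearly bad)
    weights: dict of fraction -> importance to drug efficacy
    Returns the weighted-average release score in [0, 1].
    """
    total = sum(weights.values())
    return sum(results[f] * weights[f] for f in results) / total

# Hypothetical fraction names, scores and efficacy weights
score = fuse_fractions(
    {"acidic": 0.95, "neutral": 0.90, "basic": 0.80, "polar": 1.00},
    {"acidic": 4.0, "neutral": 2.0, "basic": 1.0, "polar": 1.0},
)
release = score >= 0.9   # illustrative release threshold
```

Weighting by efficacy means a marginal result in a critical fraction can block release even when the less important fractions look perfect.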
After Further Review - Support of Analytical Instruments I: A Little History
The first instance of chemometrics designed and deployed for a particular analytical instrument came in 1983. This was the Infometrix software product called MCR-2 (the second version of Multivariate Curve Resolution; no one from Madison Avenue was involved in this name). The product was a follow-on to the VAX-based routine built for the ARTHUR software (Infometrix’ first software package). The idea was to take full advantage of Hewlett-Packard’s newly introduced 1040 diode array HPLC detector to provide a means of separately identifying and quantitating unresolved peaks. Now, if you can reset your mind for the moment to 1983 and consider the computing firepower then available (an HP-85), the algorithm was not particularly fast, but it was effective and paved the way for the QuickRes product built for HP, Waters and Beckman, thus launching Infometrix on a journey to focus on effective solutions for analytical instruments.

From time to time in this newsletter, we will focus on interesting implementations of chemometrics in the analytical instrument and process analyzer worlds. All of this effort is to highlight the use of computational technology to help process raw data into actionable information. We seek to turn these instruments into appliances that provide you with interpretations, tell you how certain they are that the values they report are correct, flag upcoming maintenance problems, and tie seamlessly into the workflow of a laboratory or plant. There are great examples in all the expected places: GC, HPLC, mass spectrometry, NMR, FTNIR, Raman, … There are also implementations that are not normally associated with the chemometrics field. Let’s see how many we can cover!

Brian G. Rohrback
We Need Your Help
We want to supply you, our loyal readers, with valuable information in our quarterly newsletters and monthly updates, and we hope that we have. Input from our readers and customers is very valuable to us. Please help us improve the content of our messages by providing comments, good or bad. Visit our web site or email us to tell us how we can better serve you.

In This Newsletter
-Feature: Objective Profiling of Complex Samples
-Upcoming Events
-Tech Tip: Choosing Number of Factors
-After Further Review...

Upcoming Events

CAC 2016
June 6-10, 2016

AAPG 2016
June 19-22, 2016

FACSS SciX2016
September 18-23, 2016

Infometrix General Training Course
October 5-7, 2016

Gulf Coast Conference 2016
October 11-12, 2016

Tech Tip: Choosing Number of Factors
When creating a regression model, it is critical to choose the number of factors that best represents the correlation of information in the data block to the property of interest. Many methods suggest looking at the shape of, or values in, a plot of eigenvalues: finding where the curve bends, like a hockey stick; stopping when the eigenvalues drop below 1; or simply ensuring that 95% (or another subjective cutoff) of the variance is explained. But these methods look only at variance, not at the correlation to the Y value. And they pertain only to modeling, not prediction.
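These variance-only rules are easy to state in code. The sketch below (Python/NumPy, with hypothetical eigenvalues) applies the eigenvalue-greater-than-1 rule and the 95%-variance cutoff; the hockey-stick bend has no closed form and is judged visually from the same sorted eigenvalues:

```python
import numpy as np

def factor_counts_from_eigenvalues(eigenvalues, variance_cutoff=0.95):
    """Apply the two computable variance-only rules from the text."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    kaiser = int((ev > 1.0).sum())              # eigenvalue > 1 rule
    cum = np.cumsum(ev) / ev.sum()              # cumulative variance
    var95 = int(np.searchsorted(cum, variance_cutoff) + 1)
    return kaiser, var95

# Hypothetical eigenvalues from autoscaled data
kaiser, var95 = factor_counts_from_eigenvalues(
    [5.2, 2.1, 0.9, 0.4, 0.2, 0.2])
```

Note that the two rules already disagree on this toy example, which is part of why variance-based criteria alone are an unreliable guide to model size.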

We could look at errors following a prediction on the calibration data: as the PRESS (the prediction residual error sum of squares) or as the RMSEC (the root mean square error of calibration), which corrects PRESS for the number of samples and the number of factors in the model. However, this approach is biased and optimistic because the error diagnostic is derived from the data used to make the model in the first place. Ideally, we want to evaluate the model size based on how well it predicts a set of unknowns. But which unknowns?
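In code, the two diagnostics look like the sketch below. Conventions for the degrees-of-freedom correction vary; the divisor n − k − 1 used here is one common choice (k factors, one degree for mean centering), not necessarily the one any particular package uses:

```python
import numpy as np

def press(y_true, y_pred):
    """Prediction residual error sum of squares."""
    r = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float((r ** 2).sum())

def rmsec(y_true, y_pred, n_factors):
    """RMSEC: PRESS corrected for sample count and model size.
    Divisor n - k - 1 is one common convention; others use n."""
    n = len(y_true)
    return float(np.sqrt(press(y_true, y_pred) / (n - n_factors - 1)))

# Calibration-set example: measured vs. fitted values, 1-factor model
p = press([1, 2, 3, 4, 5], [1.1, 1.9, 3.0, 4.2, 4.8])
e = rmsec([1, 2, 3, 4, 5], [1.1, 1.9, 3.0, 4.2, 4.8], n_factors=1)
```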

One way to simplify this selection is to use cross-validation. In this procedure, one or more samples are excluded from the data set, a model is made on the remaining data, then this model is used to predict the excluded samples. Error diagnostics from this prediction are saved, then different samples are excluded and the process repeated. After every sample has been excluded once, a summary of the prediction errors can be shown as the root mean square error of cross-validation, or RMSECV. This diagnostic is less biased than the RMSEC.
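A minimal sketch of leave-one-out cross-validation, using principal component regression as a stand-in for whichever factor-based method is actually in use (Python/NumPy, synthetic data with two underlying factors):

```python
import numpy as np

def pcr_fit_predict(X_train, y_train, X_test, n_factors):
    """Principal component regression: fit on train, predict test."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_factors].T                       # loadings
    T = Xc @ P                                 # scores
    b = np.linalg.lstsq(T, y_train - y_mean, rcond=None)[0]
    return (X_test - x_mean) @ P @ b + y_mean

def rmsecv(X, y, n_factors):
    """Leave-one-out cross-validation error for one model size."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out sample i
        y_hat = pcr_fit_predict(X[mask], y[mask], X[i:i+1], n_factors)
        errors.append((y[i] - y_hat[0]) ** 2)
    return float(np.sqrt(np.mean(errors)))

# Synthetic data: X and y both driven by two latent factors
rng = np.random.default_rng(0)
T_lat = rng.normal(size=(30, 2))
X = T_lat @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(30, 10))
y = T_lat @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=30)
err1, err2 = rmsecv(X, y, 1), rmsecv(X, y, 2)  # 2 factors fit better
```

Repeating the `rmsecv` call for each candidate model size traces out the error curve discussed next.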

Ideally, the RMSECV curve, plotted as a function of the number of factors, will fall and then rise once too many factors are chosen, because such a model has incorporated noise which cannot be predicted. Such a model is overfitting the data. Thus, the minimum in this curve is a good choice for the number of factors.

It is often the case that the RMSECV for a model with fewer factors than at the minimum of the curve is not statistically different. In general, we recommend using the more parsimonious model, which further reduces the risk of overfitting. In addition to the RMSECV, you should also examine other computed objects, such as the loadings, the regression vector and the spectral residuals.

For spectroscopic data, neighboring values of the X data are highly correlated, so the spectra are quite smooth. Similarly, the regression vector should be smooth. When too many factors are retained in the model, the regression vector begins to show a jagged appearance, indicating that noise is being added to the model. Another diagnostic, simply called jaggedness, attempts to characterize this tendency of the regression vector to become noisy when the model is overfit; it usually has a U shape whose minimum suggests the model optimum. This metric seems to suggest the optimal number of factors more consistently, and we now routinely use jaggedness as our preferred metric. Jaggedness can also be computed from the regression vector during calibration, so it does not require a cross-validation to suggest a reasonable number of factors for the model.
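The exact formula is not spelled out here; one common way to quantify jaggedness is the sum of squared successive differences of the regression vector, sketched below on a smooth curve versus a noisy one:

```python
import numpy as np

def jaggedness(regression_vector):
    """Sum of squared successive differences: small for a smooth
    vector, large once noise creeps in (one common definition)."""
    b = np.asarray(regression_vector, dtype=float)
    return float((np.diff(b) ** 2).sum())

# A smooth "regression vector" versus the same vector plus noise
smooth = np.sin(np.linspace(0, np.pi, 100))
noisy = smooth + np.random.default_rng(1).normal(scale=0.1, size=100)
j_smooth, j_noisy = jaggedness(smooth), jaggedness(noisy)
```

Computed across a range of model sizes, this single number turns the visual "does the regression vector look noisy?" check into a curve whose minimum can be located automatically.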

It is not only what we do, but also what we do not do, for which we are accountable.

-Molière, actor and playwright (1622-1673)

The Infometrix mission is to provide high quality, easy-to-use software for the handling of multivariate data.

Publications - Application of chemometrics to problems in a variety of research areas

CAC 2016 - June 6-10, Barcelona


Copyright © 2016 Infometrix, Inc., All rights reserved.