The Premier Chemometrics Company

Aug 2016
No. 11

Latest at Infometrix
Our three-day General Chemometrics Training Course is approaching. If you haven't already done so, sign up now to reserve a spot. Also, a reminder that the dates for the course have changed from October 5-7 to October 19-21, 2016.

The 2016 Excellence in Analytical Technical Innovation Award will be presented to Brian Rohrback, President of Infometrix, Inc., at the Fall Leaders' Meeting of the Instrument Society of America (ISA). The award, from ISA's Analytical Division, recognizes what LineUp™ can do to improve the routine use of chromatography, particularly in process control and quality assurance settings.

Join us as we celebrate the brightest stars in automation: The 53rd Annual Honors & Awards Gala

The Importance of Data Exploration!
Some analysts who have been performing routine analysis for years tend to skip a critical step in data analysis: data exploration. At Infometrix's semi-annual training course, we spend a third of the time discussing how to do data exploration, including visualization of data, cluster analysis, principal components analysis, and other techniques. We advocate that every dataset requires at least some amount of data exploration to ensure that the data being used are appropriate for the task. From the analyst's perspective, this is how you:
  • identify outliers before building a model
  • reveal clustering of data
  • flag potential process upsets
  • understand any time, operator, or plant biases, or variations arising from other causes
Over time, an analyst can see similar patterns for so long that they skip straight to building a model or producing results without taking the time to truly investigate each dataset from scratch. Whenever Infometrix joins a project, it is usually this initial data exploration phase that yields important observations, identifies issues that our clients have missed, and determines our approach to modeling. When the resident analysts are no longer actively exploring their data, our very unfamiliarity with the project becomes an advantage.
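As an illustration of the outlier-spotting step described above, here is a minimal sketch of PCA-based data exploration using only NumPy. The synthetic data, function name, and distance criterion are our own assumptions for the example, not part of any Infometrix product:

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project mean-centered data onto its leading principal components."""
    Xc = X - X.mean(axis=0)
    # SVD gives scores (U*S) and loadings (Vt) in one step
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Synthetic example: 30 well-behaved samples plus one gross outlier
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
X = np.vstack([X, 10 * np.ones(10)])        # sample index 30 is far from the rest

scores = pca_scores(X)
dist = np.sqrt((scores ** 2).sum(axis=1))   # distance from origin in score space
print(int(dist.argmax()))                   # flags sample 30 as the outlier
```

Plotting the two score columns against each other is the usual next step; clusters, trends, and process upsets tend to show up in the same picture.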

After Further Review...
Over the years, I have been involved in a lot of discussions tied to chemometric algorithms and their usefulness for all sorts of applications.  In the early days (dating to the 1980s for me), the focus was always on chemometrics as a set of tools, and the emphasis was on balancing the migrating platform variants (both computer and instrument software), training scientists to use the tools appropriately, and proving the value of the techniques in the application space.  Now, well into our fourth decade of software development, training, and servicing applications, a new balancing act is upon us.  Infometrix will never ignore the software tools that form the technical core of our business. But we find that many of the problems these tools are applied to, even after the techniques are shown to be effective, languish in an R&D environment without ever seeing action in a quality control setting (where there is money to be made).

For chemometrics to play to its potential in manufacturing and production settings, we must focus on its role as a part (sometimes a small part) of the system as a whole.  It takes a systems engineering approach; if we relegate the chemometrics to an isolated task status, deployment of an optimized system and its associated advantages are much less likely to happen.

So, it is not like building a wall.  The better analogy is the Mandelbrot set: a fractal pattern that is complex and asymmetrical and lets you dive in as deep as you want to find the optimal view of your process.  The chemometrics segment needs to tie in, sometimes in several different places, and it has to play nicely with the other system components, which means considering not just the instrument source but also the database and control components.  Getting the right pieces of data-derived information to the right place at the right time and in the right form (think dashboards and visualizations) is significantly easier given the advances in other areas of technology. Let's do it.

Brian G. Rohrback
We Need Your Help
We want to supply you with valuable information in our quarterly newsletters and monthly updates, and we hope that we have. Your input is very valuable to us. Please help us improve the content of our messages by providing comments, good or bad. Visit our web site, or email us, and tell us how we can better serve you.

In This Newsletter
-Feature: The Importance of Data Exploration!
-Upcoming Events
-Tech Tip: Choosing the Number of Samples
-After Further Review...

Upcoming Event

FACSS SciX 2016
September 18-23, 2016

Instrument Society of America Fall Leaders' Meeting
September 24-26, 2016

IFPAC Cortona
October 2-6, 2016

Gulf Coast Conference 2016
October 11-12, 2016

Infometrix General Training Course
October 19-21, 2016

February 27 - March 2, 2017

Pittcon 2017
March 5-9, 2017

ISA-AD (Instrument Society of America-Analyzer Division)
April 23-27, 2017

Tech Tip: Choosing the Number of Samples
Q: I have read with great interest about the NIR application on your website. One concern of my statistician colleagues, however, is the relatively small number of samples and large number of data points used in the approach. They insist that we need about 10 times more samples than independent variables, while in the octane example it's the other way around (in our case, 57 calibration samples and 700 variables). How do I reconcile the approaches of chemometricians and statisticians?

A: While it is true that, in general, you need more samples than variables to proceed with a regression, that rule dates from work with uncorrelated variables. In spectroscopic data such as NIR, all variables are correlated, both with their neighboring variables (similar wavelengths) and with wavelengths that are 'overtones' of other wavelengths. The result is that the number of spectroscopic variables greatly exceeds the true rank of the data.
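The rank argument can be seen directly with the numbers from the question. In this hypothetical sketch (synthetic data, NumPy only; the band shapes and component count are our own assumptions), 57 "spectra" of 700 correlated wavelengths are generated as mixtures of just three underlying components, and the matrix rank comes out as 3, not 700:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_wavelengths, n_components = 57, 700, 3

# Synthetic "spectra": each sample is a mixture of 3 underlying component
# spectra (smooth Gaussian bands), so neighboring wavelengths are correlated.
x = np.linspace(0.0, 1.0, n_wavelengths)
centers = np.array([[0.2], [0.5], [0.8]])
bands = np.exp(-((x[None, :] - centers) ** 2) / 0.01)   # shape (3, 700)
conc = rng.uniform(size=(n_samples, n_components))      # mixture fractions
spectra = conc @ bands                                  # shape (57, 700)

rank = np.linalg.matrix_rank(spectra)
print(spectra.shape, rank)   # (57, 700), but the rank is only 3
```

Real spectra carry noise and baseline effects that raise the numerical rank somewhat, but the point stands: the effective dimensionality tracks the underlying chemistry, not the wavelength count.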

Thus, factor-based regression methods compress the highly correlated spectral data into a smaller number of principal components, and it is assumed that these 'latent' variables will be fewer in number than the samples, allowing the algorithm to proceed. With PCR, in particular, the principal components are also orthogonal to one another: they are truly independent.

So, back to the original question: how many samples does one need to perform a multivariate regression? Given that the algorithm (PCR or PLS) effectively reduces the dimensionality to a small number of latent variables, you could in principle argue that you need at least as many samples as there are factors in the regression model. Yes, this is a bit of a chicken-and-egg situation, because you don't yet have the data to perform the regression. To better represent the variability in each principal component, a common rule of thumb is to present at least 3 or 4 times as many samples as factors in the model. The statisticians' suggestion of 10 times more samples than independent variables (read: principal components) is even more conservative but also reasonable, though it may mean more work to obtain the data. The advantage of collecting more data is that it allows parsing the data set into calibration and validation sets for model evaluation.
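To make the factor-compression idea concrete, here is a bare-bones principal components regression with a calibration/validation split. This is a hedged sketch on synthetic data (the function names, band shapes, and 40/17 split are our own assumptions, not Infometrix code), sized to match the question's 57 samples and 700 variables:

```python
import numpy as np

def pcr_fit(X, y, n_factors):
    """Principal components regression: compress X, then regress y on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_factors].T                    # (n_vars, n_factors)
    scores = Xc @ loadings                         # orthogonal regressors
    coefs = np.linalg.lstsq(scores, y - y_mean, rcond=None)[0]
    return x_mean, y_mean, loadings, coefs

def pcr_predict(model, X):
    x_mean, y_mean, loadings, coefs = model
    return (X - x_mean) @ loadings @ coefs + y_mean

# Synthetic data: 57 samples, 700 wavelengths, 3 latent factors
rng = np.random.default_rng(2)
conc = rng.uniform(size=(57, 3))
x = np.linspace(0.0, 1.0, 700)
bands = np.exp(-((x[None, :] - np.array([[0.2], [0.5], [0.8]])) ** 2) / 0.01)
X = conc @ bands + rng.normal(scale=1e-4, size=(57, 700))
y = conc @ np.array([1.0, 2.0, 3.0])   # property that depends on composition

# Calibrate on 40 samples, hold out the last 17 for validation
model = pcr_fit(X[:40], y[:40], n_factors=3)
rmse = np.sqrt(np.mean((pcr_predict(model, X[40:]) - y[40:]) ** 2))
print(rmse)   # small validation error despite 700 variables and only 40 samples
```

Note that the regression itself sees only 3 score columns per sample, which is why 40 calibration samples are ample here even though the raw matrix is 40 by 700.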

Even if you're on the right track, you'll get run over if you just sit there.

-Will Rogers, humorist (1879-1935)

The Infometrix mission is to provide high quality, easy-to-use software for the handling of multivariate data.

Publications - Application of chemometrics to problems in a variety of research areas



Copyright © 2016 Infometrix, Inc., All rights reserved.