Tuesday, February 1, 2022
Arrays: Pandas or Numpy?
Context: I’m working on a project to reimplement a MATLAB+SIMCA workflow for 1D and 2D NMR spectra in Python. The 1D and 2D NMR data, combined with an independent variable, are converted to a matrix and processed with principal component analysis (PCA) or orthogonal projections to latent structures (OPLS) to determine the spectral components that are predictive of that independent variable. The predictive components (in a matrix) are converted back to a 1D or 2D spectrum, and the peaks most useful in the predictive model can then be identified.
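As a rough illustration of that round trip, here is a minimal sketch using PCA only (OPLS is not part of scikit-learn). The data are random numbers standing in for real spectra, and the sizes and variable names are assumptions made for the example, not part of the original workflow.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 12 samples, each a 2D spectrum of 16 x 32 points.
n_samples, n_f1, n_f2 = 12, 16, 32
spectra = rng.normal(size=(n_samples, n_f1, n_f2))

# Flatten each 2D spectrum into one row of the data matrix.
X = spectra.reshape(n_samples, n_f1 * n_f2)

pca = PCA(n_components=3)
scores = pca.fit_transform(X)

# Each component is a row vector with one loading per spectral point,
# so it can be reshaped back into the shape of a 2D spectrum.
loadings_2d = pca.components_.reshape(-1, n_f1, n_f2)

# Locate the point with the largest absolute loading in the first component.
peak_f1, peak_f2 = np.unravel_index(
    np.argmax(np.abs(loadings_2d[0])), (n_f1, n_f2)
)
```

With real spectra, the indices of the largest loadings would map back to chemical shift positions through the spectrum's axes.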
Issue: Several PCA examples use pandas to create a data frame from spreadsheet data prior to PCA with scikit-learn. In my case, I am using a numpy array as the matrix because it is trivially easy to reshape the array for a single spectrum into a row vector, which can then be appended to the matrix of all spectra. I wondered whether that was the optimal approach, given the examples using data frames.
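The reshape-and-append step might look something like the sketch below. The spectra are random placeholders and the 8 x 10 size is an assumption for illustration; only numpy calls are used.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2D spectra, 8 x 10 points each, standing in for real NMR data.
spectrum_shape = (8, 10)
spectra = [rng.normal(size=spectrum_shape) for _ in range(5)]

# Reshape each spectrum into a 1 x N row vector and collect the rows.
rows = [s.reshape(1, -1) for s in spectra]

# Stack all rows in one call; calling np.vstack inside the loop would
# copy the growing matrix on every iteration.
X = np.vstack(rows)

# A single row can be reshaped back into the original spectrum shape.
recovered = X[2].reshape(spectrum_shape)
```

Collecting the rows in a list and stacking once is a design choice for speed; appending one row at a time also works for a small number of spectra.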
Comparison:
- pandas data frames are, by default, built on top of numpy arrays, adding row and column labels.
- pandas is designed to treat arrays, conceptually, like spreadsheet data, which makes it easy to add and drop rows and columns.
- scikit-learn was designed to work with numpy arrays, although it also accepts data frames as input.
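To make the comparison concrete, here is a small sketch that passes the same values to scikit-learn as a numpy array and as a data frame. The data and the labels (`sample_0`, `pt_0`, and so on) are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 20))  # hypothetical: 6 spectra, 20 points each

# Wrap the same values in a data frame with labeled rows and columns.
df = pd.DataFrame(
    X,
    index=[f"sample_{i}" for i in range(6)],
    columns=[f"pt_{j}" for j in range(20)],
)

# Dropping a column by label is where pandas is convenient.
df_trimmed = df.drop(columns=["pt_0"])

# scikit-learn accepts either container; with default settings the
# scores come back as a plain numpy array in both cases.
scores_from_array = PCA(n_components=2).fit_transform(X)
scores_from_frame = PCA(n_components=2).fit_transform(df)

# Convert back to a plain array when reshaping is needed.
X_back = df.to_numpy()
```

Since the output of PCA is a numpy array either way, starting from a data frame mainly adds a conversion step when the next operation is a reshape back to a spectrum.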
Conclusion: If you are comfortable processing numpy arrays and don’t need other pandas features, use numpy arrays for ease in handling the data before and after PCA. It also avoids the confusion of switching between data handling styles.