Arrays: Pandas or Numpy? « Hakidashimado

Tuesday, February 1, 2022

Context: I’m working on a project to reimplement a MATLAB+SIMCA workflow for 1D and 2D NMR spectra in Python. The 1D or 2D NMR data, combined with an independent variable, are converted to a matrix and processed with principal component analysis (PCA) or orthogonal projections to latent structures (OPLS) to determine the spectral components that predict that independent variable. The predictive components (in a matrix) are converted back to a 1D or 2D spectrum, and the peaks most useful in the predictive model can then be identified.
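As a rough sketch of that round trip, assuming scikit-learn's PCA and purely synthetic data (the array sizes and variable names here are made up for illustration, and OPLS is not shown):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 10 synthetic "2D spectra" of 16 x 8 points each.
n_spectra, n_rows, n_cols = 10, 16, 8
rng = np.random.default_rng(0)
spectra = rng.normal(size=(n_spectra, n_rows, n_cols))

# Flatten every spectrum into one row of the data matrix.
X = spectra.reshape(n_spectra, -1)            # shape (10, 128)

# Fit PCA; each row of components_ is one loading vector.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)                 # shape (10, 2)

# Reshape the loading vectors back into 2D spectra.
loadings_2d = pca.components_.reshape(-1, n_rows, n_cols)

# Position of the largest-magnitude point in the first loading spectrum.
peak_row, peak_col = np.unravel_index(
    np.argmax(np.abs(loadings_2d[0])), (n_rows, n_cols)
)
```

Because the flattening and the reshape back use the same row-major order, each point in a loading spectrum lines up with the same point in the original spectra.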

Issue: Several PCA examples use pandas to create a data frame from spreadsheet data before running PCA with scikit-learn. In my case, I am using a 2D numpy array as the matrix, because it is trivially easy to reshape the array for a single spectrum into a row vector, which can then be appended to the matrix of all spectra. I wondered whether that was the optimal approach, given the examples using data frames.
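A minimal sketch of that reshape-and-append step, using synthetic spectra with made-up dimensions rather than real NMR data:

```python
import numpy as np

# Hypothetical data: five synthetic 2D spectra of 64 x 32 points each.
rng = np.random.default_rng(0)
spectra = [rng.normal(size=(64, 32)) for _ in range(5)]

# Flatten each spectrum into a row vector and stack the rows into one matrix.
X = np.vstack([s.reshape(1, -1) for s in spectra])

# A further spectrum can be appended as one more row.
new_spectrum = rng.normal(size=(64, 32))
X = np.vstack([X, new_spectrum.reshape(1, -1)])

# Any row can be turned back into its 2D spectrum.
recovered = X[-1].reshape(64, 32)
```

Note that np.vstack copies the whole matrix on every call, so with many spectra it is probably cheaper to collect the row vectors in a list and stack them once at the end.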

Comparison:

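pandas data frames are built on top of numpy arrays, adding row and column labels and per-column data types.
pandas is optimized to, conceptually, treat arrays like spreadsheet data, in terms of ease of adding and dropping rows and columns.
scikit-learn was designed to work with numpy arrays; it accepts data frames as well, but converts them to arrays internally.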
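To make the comparison concrete, here is a small sketch, again on synthetic data with made-up labels, that runs the same PCA on a numpy array and on a data frame wrapping the same values:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: 8 synthetic "spectra" of 20 points each.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 20))

# The same values wrapped in a data frame with illustrative labels.
df = pd.DataFrame(
    X,
    index=[f"sample_{i}" for i in range(8)],
    columns=[f"pt_{j}" for j in range(20)],
)

# scikit-learn accepts either container.
scores_np = PCA(n_components=3).fit_transform(X)
scores_df = PCA(n_components=3).fit_transform(df)

# With default settings both results are plain numpy arrays with equal values.
same = np.allclose(scores_np, scores_df)

# Going from a data frame back to a numpy array is a single call.
back = df.to_numpy()
```

With scikit-learn's default output settings the scores come back as a plain array in both cases, so the row and column labels are not carried through the PCA.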

Conclusion: If you are comfortable processing numpy arrays and don’t need other pandas features, use numpy arrays, which makes the data easier to handle before and after the PCA. It also avoids the confusion of switching between two data handling styles.

Posted by jodellfp at 23:53:02 in
