A summary of tools for data science for Python
- Contact:
- dkuhlman (at) reifywork (dot) com
- Address:
http://www.reifywork.com
- Revision:
- 1.0.1
- Date:
- October 30, 2024
- Copyright:
Copyright (c) 2018 Dave Kuhlman. All Rights Reserved. This software is subject to the provisions of the MIT License http://www.opensource.org/licenses/mit-license.php.
- Abstract:
This document attempts to give a survey of data science tools for Python programming, along with brief introductions to help you get started with some of those tools.
1 Introduction and preliminaries
In this document I'll try to describe and summarize some significant tools that are available to Python programmers for data science, numerical processing, statistics, and visualizing numerical data. For each tool or package, I'll also try to give a brief overview of:
What the tool does.
What to use it for, along with a few use cases.
How to do a few common things that the tool supports.
When appropriate, a comparison with other similar tools.
All these packages are available in the Anaconda distribution of Python, which makes Anaconda a very good option for data analytics and visualization. (See the section on installing the tools, below, for Anaconda links.)
It's likely that they are also available at http://pypi.python.org and can be installed with pip. If you plan on doing some exploration (and do not want to use the Anaconda distribution), you will want to consider using virtualenv (https://virtualenv.pypa.io/en/stable/) and, for even more convenience in trying out various packages and configurations, look at virtualenvwrapper (https://virtualenvwrapper.readthedocs.io/en/latest/).
More information:
There is another summary of Python packages for data science here: https://elitedatascience.com/r-vs-python-for-data-science. It covers tools for the R programming language, too.
Many of the examples in this document use the somewhat standard import statements, for example:
import numpy as np
import scipy as sp
import pandas as pd
2 Some helpers
2.1 ipython
IPython is an enhanced interactive Python shell. It has tab completion, gives more convenient access to help for Python modules and objects, enables you to edit and rerun previous commands, and much more.
For more information, see: https://ipython.org.
Anaconda ships with QtConsole, which wraps IPython in a graphical console for even more convenience.
2.1.1 IPython profiles
If you use IPython, then consider creating a data science profile. Use something like this:
$ ipython profile create datasci
Then, consider putting something like the following in ~/.ipython/profile_datasci/startup/50-config.py:
import sys
import numpy as np
import scipy as sp


def pdir(obj):
    """Print information about obj, including `dir(obj)`."""
    if isinstance(obj, type):
        print('class: {}'.format(obj.__name__))
    else:
        print('instance class name: {}'.format(obj.__class__.__name__))
    if obj.__doc__:
        print('doc string: {}'.format(obj.__doc__))
    else:
        print('doc string: no doc string')
    print(dir(obj))


def read_file_contents(filename):
    with open(filename, 'r') as infile:
        content = infile.read()
    return content
You can have multiple startup files. See the startup/README file in your profile directory.
Also, consider doing some customization in ~/.ipython/profile_datasci/ipython_config.py.
And, in order to use that profile, start IPython with this:
$ ipython --profile=datasci
You can find more help with profiles by running something like the following:
$ ipython help profile
Or, see the following: http://ipython.readthedocs.io/en/stable/config/intro.html#profiles
2.1.2 Getting (interactive) help and docs
Inside the standard Python interactive shell, you can get help on some_object with this:
>>> help(some_object)
Inside the IPython interactive shell, you can use the above, or you can do:
In [9]: import scipy.fftpack
In [10]: scipy.fftpack?
In [11]:
In [11]: from scipy import fftpack
In [12]: fftpack?
In [13]: fftpack.fft?
You can use pydoc to get help at the command line. For example:
$ pydoc numpy.arange
You can also use pydoc to run an HTTP server, and view the documentation in a Web browser. Do the following for help with that:
$ pydoc --help
And, of course, documentation is available for the Scipy suite of tools at: http://www.scipy.org.
2.2 Installing the tools
Unless otherwise noted, each of the tools described in this document can be installed with pip install ... (the standard Python install tool) or, for those who are using the Anaconda Python distribution, with conda install ....
2.2.1 pip and virtualenv
If you use pip, I'd recommend using virtualenv, at the least, and even virtualenvwrapper, for extra convenience and flexibility. virtualenv enables you to install Python packages (and therefore, the tools discussed in this document) in a separate environment, separate from your standard Python installation, and without polluting that standard installation. Since that separate installation is in its own directory, you can remove it by simply deleting that directory. virtualenvwrapper extends virtualenv by enabling you to create, manage, and switch between different virtualenv environments easily. For example, you might want to switch (1) between one virtualenv for text processing and another for data science or (2) between one installation for Python 2 and another for Python 3. See:
virtualenv -- https://pypi.python.org/pypi/virtualenv
virtualenvwrapper -- https://virtualenvwrapper.readthedocs.io/en/latest/
2.2.2 Anaconda
The Anaconda installation of Python provides most of the tools discussed in this document in the standard Anaconda installation. Additional tools can be installed with conda install ..., and the installation can be kept up-to-date with conda update --all. In the event that you need a Python package that is not provided by Anaconda, you can use pip.
The Anaconda distribution of Python -- https://continuum.io/
conda, the package manager for Anaconda -- https://conda.io/docs/index.html
2.2.3 Other Python distributions for data science
For more options on installing Python with a slant toward data science and scientific programming (but much else besides), see: https://www.scipy.org/install.html.
3 Analytics
3.1 Numpy
Help with Numpy:
See the documentation page: http://www.numpy.org.
A tutorial: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
Some lecture notes: http://www.scipy-lectures.org/intro/numpy/numpy.html
There are (at least) two aspects to Numpy:
Primitive Numpy numeric types or scalars, for example: np.int32, np.int64, np.float32, np.float64, etc. See the following for information on these primitive types and others: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html.
Array objects (instances of np.ndarray) along with ways to deal with them.
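For example, here is a minimal sketch of creating and inspecting a small ndarray, assuming the standard import numpy as np convention used throughout this document:

import numpy as np

# Create a 2x3 array of 64-bit floats.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]], dtype=np.float64)

print(a.shape)        # (2, 3)
print(a.dtype)        # float64
print(a * 2)          # element-wise multiplication
print(a.T)            # transpose, shape (3, 2)
print(a.sum(axis=0))  # column sums -> array([5., 7., 9.])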
Operations on Numpy arrays -- For information on these, see the Numpy reference manual: https://docs.scipy.org/doc/numpy/reference/index.html. Here is a quick summary (a short example sketch follows this list):
Array creation routines -- Create arrays of different kinds, e.g. all ones, all zeros, identity, from an existing array, as a copy of an array, etc.
Array manipulation routines -- Routines that reshape an array, transpose an array, change the number of dimensions, join arrays (concatenate, stack, etc.), tile arrays (build an array by repeating another), split arrays, and so on.
Binary operations -- Logical binary operations on arrays, packing arrays into bits, bit-shifting operations, etc.
String operations
C-Types Foreign Function Interface (numpy.ctypeslib)
Datetime Support Functions
Data type routines
Optionally Scipy-accelerated routines (numpy.dual) -- Routines possibly accelerated by Scipy, but available in Numpy if Scipy is not installed. For example, routines for eigenvalues, Fourier transforms, solving linear equations, etc. Use:
>>> from numpy import dual
Mathematical functions with automatic domain (numpy.emath)
Floating point error handling
Discrete Fourier Transform (numpy.fft) -- Use:
>>> from numpy import fft
Or, just:
>>> np.fft.fft( ... ) # etc.
Financial functions -- Loan, payment, and interest calculations.
Functional programming -- Routines and classes that assist with doing functional programming. For example, np.vectorize creates a "vectorized" function; np.frompyfunc creates a Numpy ufunc. (Note that vectorized functions and universal functions can be applied to arrays. For help with the difference between vectorized and universal functions, see: https://stackoverflow.com/questions/6768245/difference-between-frompyfunc-and-vectorize-in-numpy.)
Also, remember to look at functools and itertools in the standard Python library: https://docs.python.org/3/library/functional.html
And, if you need parallelism across multiple CPUs and cores, look at ipyparallel: https://ipyparallel.readthedocs.io/en/latest/
Numpy-specific help functions -- Functions for getting information about objects and help with Numpy. (Also, if you are using IPython, the "?" operator gives help with a function or object, for example, enumerate? gives help on the enumerate function.)
Indexing routines
Input and output -- Routines for saving and loading arrays. (But, you may also want to explore HDF5 and h5py or pytables. Both h5py and pytables are in the Anaconda Python distribution.) Also, routines for formatting arrays as strings, converting arrays to and from strings, etc.
Linear algebra (numpy.linalg) -- Routines for the following:
Matrix and vector products
Decompositions
Matrix eigenvalues
Norms and other numbers
Solving equations and inverting matrices
Exceptions
Linear algebra on several matrices at once
Logic functions -- Functions for performing various tests on elements of Numpy arrays.
Masked array operations -- Support for creating and using masked arrays. A masked array is an array with a mask that marks some elements of the array as invalid. You can find some help with masked arrays in this document: http://www.scipy-lectures.org/intro/numpy/numpy.html.
Mathematical functions -- Functions for:
Trigonometric functions
Hyperbolic functions
Rounding
Sums, products, differences
Exponents and logarithms
Other special functions
Floating point routines
Arithmetic operations
Handling complex numbers
etc
Matrix library (numpy.matlib) -- Functions for creating and using matrices, as opposed to numpy.ndarray. Use from numpy import matlib. See this for a bit of help on the differences between arrays and matrices in Numpy: https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u
Miscellaneous routines
Padding Arrays
Polynomials
Random sampling (numpy.random)
Set routines
Sorting, searching, and counting
Statistics
Test Support (numpy.testing)
Window functions
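Here is a minimal sketch touching a few of the routine groups listed above (functional programming with np.vectorize, the FFT, linear algebra, and masked arrays); the specific values are made up for illustration:

import numpy as np

# Functional programming: turn a scalar Python function into one
# that applies element-wise to arrays.
step = np.vectorize(lambda x: 1.0 if x >= 0 else 0.0)
print(step(np.array([-2.0, 0.0, 3.5])))      # [0. 1. 1.]

# Discrete Fourier transform of a short signal.
signal = np.sin(2 * np.pi * np.arange(8) / 8.0)
print(np.fft.fft(signal))

# Linear algebra: solve the system A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))                  # [2. 3.]

# Masked arrays: mark some elements as invalid and ignore them.
data = np.ma.masked_array([1.0, 2.0, -999.0, 4.0],
                          mask=[False, False, True, False])
print(data.mean())                            # ignores the masked value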
3.2 Scipy
Note that Scipy, Numpy, Pandas, Matplotlib, IPython, and Sympy are all under the Scipy umbrella. For information about any of these, see: https://www.scipy.org/.
What is Scipy? It is many things to many people but, more seriously, it is a large collection of functions for performing operations on arrays of numerical data. Think of it this way: Numpy (and Pandas) give you ways to structure and manipulate multi-dimensional arrays of numbers; Scipy gives you many functions that perform operations on those multi-dimensional arrays of numbers.
What kinds of operations? Here are some categories with descriptions:
Basic functions
Special functions (scipy.special)
3.2.1 Integration (scipy.integrate)
For help with this set of functions, do the following:
>>> from scipy import integrate
>>> help(integrate)
Or, in IPython, do integrate?
Here is the list you will see (a short usage sketch follows at the end of this sub-section):
Integrating functions, given function object
quad -- General purpose integration
dblquad -- General purpose double integration
tplquad -- General purpose triple integration
nquad -- General purpose n-dimensional integration
fixed_quad -- Integrate func(x) using Gaussian quadrature of order n
quadrature -- Integrate with given tolerance using Gaussian quadrature
romberg -- Integrate func using Romberg integration
quad_explain -- Print information for use of quad
newton_cotes -- Weights and error coefficient for Newton-Cotes integration
IntegrationWarning -- Warning on issues during integration
Integrating functions, given fixed samples
trapz -- Use trapezoidal rule to compute integral.
cumtrapz -- Use trapezoidal rule to cumulatively compute integral.
simps -- Use Simpson's rule to compute integral from samples.
romb -- Use Romberg Integration to compute integral from (2**k + 1) evenly-spaced samples.
Solving initial value problems for ODE systems
The solvers are implemented as individual classes which can be used directly (low-level usage) or through a convenience function.
solve_ivp -- Convenient function for ODE integration.
RK23 -- Explicit Runge-Kutta solver of order 3(2).
RK45 -- Explicit Runge-Kutta solver of order 5(4).
Radau -- Implicit Runge-Kutta solver of order 5.
BDF -- Implicit multi-step variable order (1 to 5) solver.
LSODA -- LSODA solver from ODEPACK Fortran package.
OdeSolver -- Base class for ODE solvers.
DenseOutput -- Local interpolant for computing a dense output.
OdeSolution -- Class which represents a continuous ODE solution.
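Here is a minimal usage sketch of quad and solve_ivp from the lists above; the integrand and the ODE are made up for illustration:

import numpy as np
from scipy import integrate

# Integrate sin(x) from 0 to pi; the exact answer is 2.
value, abs_error = integrate.quad(np.sin, 0.0, np.pi)
print(value, abs_error)

# Solve dy/dt = -0.5 * y on [0, 10] with y(0) = 2.
solution = integrate.solve_ivp(lambda t, y: -0.5 * y,
                               t_span=(0.0, 10.0),
                               y0=[2.0])
print(solution.t[-1], solution.y[0, -1])   # y(10) should be close to 2 * exp(-5)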
3.2.2 Optimization (scipy.optimize)
Remember that for each of the following (or any other) functions, you can get help in the usual ways: help(some_func) or (in IPython) some_func?. A short example sketch follows at the end of this sub-section.
Local Optimization:
minimize -- Unified interface for minimizers of multivariate functions
minimize_scalar -- Unified interface for minimizers of univariate functions
OptimizeResult -- The optimization result returned by some optimizers
OptimizeWarning -- The optimization encountered problems
General-purpose multivariate methods:
fmin -- Nelder-Mead Simplex algorithm
fmin_powell -- Powell's (modified) level set method
fmin_cg -- Non-linear (Polak-Ribiere) conjugate gradient algorithm
fmin_bfgs -- Quasi-Newton method (Broyden-Fletcher-Goldfarb-Shanno)
fmin_ncg -- Line-search Newton Conjugate Gradient
Constrained multivariate methods:
fmin_l_bfgs_b -- Zhu, Byrd, and Nocedal's constrained optimizer
fmin_tnc -- Truncated Newton code
fmin_cobyla -- Constrained optimization by linear approximation
fmin_slsqp -- Minimization using sequential least-squares programming
differential_evolution -- stochastic minimization using differential evolution
Univariate (scalar) minimization methods:
fminbound -- Bounded minimization of a scalar function
brent -- 1-D function minimization using Brent method
golden -- 1-D function minimization using Golden Section method
Equation (Local) Minimizers:
leastsq -- Minimize the sum of squares of M equations in N unknowns
least_squares -- Feature-rich least-squares minimization.
nnls -- Linear least-squares problem with non-negativity constraint
lsq_linear -- Linear least-squares problem with bound constraints
Global Optimization:
basinhopping -- Basinhopping stochastic optimizer
brute -- Brute force searching optimizer
differential_evolution -- stochastic minimization using differential evolution
Rosenbrock function:
rosen -- The Rosenbrock function.
rosen_der -- The derivative of the Rosenbrock function.
rosen_hess -- The Hessian matrix of the Rosenbrock function.
rosen_hess_prod -- Product of the Rosenbrock Hessian with a vector.
Fitting:
curve_fit -- Fit curve to a set of points
Root finding -- Scalar functions:
brentq -- quadratic interpolation Brent method
brenth -- Brent method, modified by Harris with hyperbolic extrapolation
ridder -- Ridder's method
bisect -- Bisection method
newton -- Secant method or Newton's method
Fixed point finding:
fixed_point -- Single-variable fixed-point solver
General nonlinear solvers:
root -- Unified interface for nonlinear solvers of multivariate functions
fsolve -- Non-linear multi-variable equation solver
broyden1 -- Broyden's first method
broyden2 -- Broyden's second method
Large-scale nonlinear solvers:
newton_krylov
anderson
Simple iterations:
excitingmixing
linearmixing
diagbroyden
Additional information on the nonlinear solvers can be obtained from the help on scipy.optimize.nonlin.
Linear Programming -- General linear programming solver:
linprog -- Unified interface for minimizers of linear programming problems
The simplex method supports callback functions, such as:
linprog_verbose_callback -- Sample callback function for linprog (simplex)
Assignment problems:
linear_sum_assignment -- Solves the linear-sum assignment problem
Utilities:
approx_fprime -- Approximate the gradient of a scalar function
bracket -- Bracket a minimum, given two starting points
check_grad -- Check the supplied derivative using finite differences
line_search -- Return a step that satisfies the strong Wolfe conditions
show_options -- Show the specific options of the optimization solvers
LbfgsInvHessProduct -- Linear operator for L-BFGS approximate inverse Hessian
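Here is a minimal example sketch of minimize and curve_fit from the lists above; the line-fitting data are made up for illustration:

import numpy as np
from scipy import optimize

# Minimize the Rosenbrock function; the minimum is at (1, 1).
result = optimize.minimize(optimize.rosen, x0=[0.0, 0.0], method='Nelder-Mead')
print(result.x)

# Fit a straight line y = a * x + b to some noisy sample data.
def line(x, a, b):
    return a * x + b

xdata = np.linspace(0.0, 10.0, 50)
ydata = line(xdata, 2.0, 1.0) + 0.1 * np.random.randn(50)
params, covariance = optimize.curve_fit(line, xdata, ydata)
print(params)    # should be close to [2.0, 1.0]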
3.2.3 Interpolation (scipy.interpolate)
Sub-package for objects used in interpolation.
As listed below, this sub-package contains spline functions and classes, one-dimensional and multi-dimensional (univariate and multivariate) interpolation classes, Lagrange and Taylor polynomial interpolators, and wrappers for FITPACK and DFITPACK functions. (A short example sketch follows at the end of this sub-section.)
Univariate interpolation
interp1d
BarycentricInterpolator
KroghInterpolator
PchipInterpolator
barycentric_interpolate
krogh_interpolate
pchip_interpolate
Akima1DInterpolator
CubicSpline
PPoly
BPoly
Multivariate interpolation
Unstructured data:
griddata
LinearNDInterpolator
NearestNDInterpolator
CloughTocher2DInterpolator
Rbf
interp2d
For data on a grid:
interpn
RegularGridInterpolator
RectBivariateSpline
See also: scipy.ndimage.map_coordinates
Tensor product polynomials:
NdPPoly
1-D Splines
BSpline
make_interp_spline
make_lsq_spline
Functional interface to FITPACK routines:
splrep
splprep
splev
splint
sproot
spalde
splder
splantider
insert
Object-oriented FITPACK interface:
UnivariateSpline
InterpolatedUnivariateSpline
LSQUnivariateSpline
2-D Splines
For data on a grid:
RectBivariateSpline
RectSphereBivariateSpline
For unstructured data:
BivariateSpline
SmoothBivariateSpline
SmoothSphereBivariateSpline
LSQBivariateSpline
LSQSphereBivariateSpline
Low-level interface to FITPACK functions:
bisplrep
bisplev
Additional tools
lagrange
approximate_taylor_polynomial
pade
See also:
scipy.ndimage.map_coordinates,
scipy.ndimage.spline_filter,
scipy.signal.resample,
scipy.signal.bspline,
scipy.signal.gauss_spline,
scipy.signal.qspline1d,
scipy.signal.cspline1d,
scipy.signal.qspline1d_eval,
scipy.signal.cspline1d_eval,
scipy.signal.qspline2d,
scipy.signal.cspline2d.
Functions existing for backward compatibility (should not be used in new code):
spleval
spline
splmake
spltopp
pchip
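Here is a minimal example sketch of interp1d and CubicSpline from the listing above, applied to a few sample points of sin(x):

import numpy as np
from scipy import interpolate

# A few sample points of sin(x).
x = np.linspace(0.0, 2.0 * np.pi, 10)
y = np.sin(x)

# Piecewise-linear interpolation.
f_linear = interpolate.interp1d(x, y)
print(f_linear(1.0))

# Cubic spline interpolation -- smoother between the samples.
spline = interpolate.CubicSpline(x, y)
print(spline(1.0))            # close to sin(1.0) ~ 0.8415
print(spline(1.0, 1))         # first derivative, close to cos(1.0)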
3.2.4 Fourier Transforms (scipy.fftpack)
There is help and a number of examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/fftpack.html.
Here is an example, copied from the documentation in the above link:
import numpy as np
from scipy.fftpack import fft
from scipy.signal import blackman
import matplotlib.pyplot as plt


def test():
    # Number of sample points
    N = 600
    # Sample spacing
    T = 1.0 / 800.0
    x = np.linspace(0.0, N * T, N)
    y = np.sin(50.0 * 2.0 * np.pi * x) + 0.5 * np.sin(80.0 * 2.0 * np.pi * x)
    yf = fft(y)
    # Apply a Blackman window before the second transform.
    w = blackman(N)
    ywf = fft(y * w)
    # Note: the number of points passed to linspace must be an integer (N // 2).
    xf = np.linspace(0.0, 1.0 / (2.0 * T), N // 2)
    plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(yf[1:N // 2]), '-b')
    plt.semilogy(xf[1:N // 2], 2.0 / N * np.abs(ywf[1:N // 2]), '-r')
    plt.legend(['FFT', 'FFT w. window'])
    plt.grid()
    plt.show()


test()
Here is a summary of the Discrete Fourier transforms support in scipy.fftpack:
Fast Fourier Transforms (FFTs)
fft - Fast (discrete) Fourier Transform (FFT)
ifft - Inverse FFT
fft2 - Two dimensional FFT
ifft2 - Two dimensional inverse FFT
fftn - n-dimensional FFT
ifftn - n-dimensional inverse FFT
rfft - FFT of strictly real-valued sequence
irfft - Inverse of rfft
dct - Discrete cosine transform
idct - Inverse discrete cosine transform
dctn - n-dimensional Discrete cosine transform
idctn - n-dimensional Inverse discrete cosine transform
dst - Discrete sine transform
idst - Inverse discrete sine transform
dstn - n-dimensional Discrete sine transform
idstn - n-dimensional Inverse discrete sine transform
Differential and pseudo-differential operators
diff - Differentiation and integration of periodic sequences
tilbert - Tilbert transform: cs_diff(x,h,h)
itilbert - Inverse Tilbert transform: sc_diff(x,h,h)
hilbert - Hilbert transform: cs_diff(x,inf,inf)
ihilbert - Inverse Hilbert transform: sc_diff(x,inf,inf)
cs_diff - cosh/sinh pseudo-derivative of periodic sequences
sc_diff - sinh/cosh pseudo-derivative of periodic sequences
ss_diff - sinh/sinh pseudo-derivative of periodic sequences
cc_diff - cosh/cosh pseudo-derivative of periodic sequences
shift - Shift periodic sequences
Helper functions
fftshift - Shift the zero-frequency component to the center of the spectrum
ifftshift - The inverse of fftshift
fftfreq - Return the Discrete Fourier Transform sample frequencies
rfftfreq - DFT sample frequencies (for usage with rfft, irfft)
next_fast_len - Find the optimal length to zero-pad an FFT for speed
Convolutions (scipy.fftpack.convolve)
convolve
convolve_z
init_convolution_kernel
destroy_convolve_cache
3.2.5 Signal Processing (scipy.signal)
Use this module with either of the following:
>>> import scipy.signal
>>> from scipy import signal
Here is a summary (a short example sketch follows at the end of this sub-section):
Convolution
convolve -- N-dimensional convolution.
correlate -- N-dimensional correlation.
fftconvolve -- N-dimensional convolution using the FFT.
convolve2d -- 2-dimensional convolution (more options).
correlate2d -- 2-dimensional correlation (more options).
sepfir2d -- Convolve with a 2-D separable FIR filter.
choose_conv_method -- Chooses faster of FFT and direct convolution methods.
B-splines
bspline -- B-spline basis function of order n.
cubic -- B-spline basis function of order 3.
quadratic -- B-spline basis function of order 2.
gauss_spline -- Gaussian approximation to the B-spline basis function.
cspline1d -- Coefficients for 1-D cubic (3rd order) B-spline.
qspline1d -- Coefficients for 1-D quadratic (2nd order) B-spline.
cspline2d -- Coefficients for 2-D cubic (3rd order) B-spline.
qspline2d -- Coefficients for 2-D quadratic (2nd order) B-spline.
cspline1d_eval -- Evaluate a cubic spline at the given points.
qspline1d_eval -- Evaluate a quadratic spline at the given points.
spline_filter -- Smoothing spline (cubic) filtering of a rank-2 array.
Filtering
order_filter -- N-dimensional order filter.
medfilt -- N-dimensional median filter.
medfilt2d -- 2-dimensional median filter (faster).
wiener -- N-dimensional wiener filter.
symiirorder1 -- 2nd-order IIR filter (cascade of first-order systems).
symiirorder2 -- 4th-order IIR filter (cascade of second-order systems).
lfilter -- 1-dimensional FIR and IIR digital linear filtering.
lfiltic -- Construct initial conditions for lfilter.
lfilter_zi -- Compute an initial state zi for the lfilter function that corresponds to the steady state of the step response.
filtfilt -- A forward-backward filter.
savgol_filter -- Filter a signal using the Savitzky-Golay filter.
deconvolve -- 1-d deconvolution using lfilter.
sosfilt -- 1-dimensional IIR digital linear filtering using a second-order sections filter representation.
sosfilt_zi -- Compute an initial state zi for the sosfilt function that corresponds to the steady state of the step response.
sosfiltfilt -- A forward-backward filter for second-order sections.
hilbert -- Compute 1-D analytic signal, using the Hilbert transform.
hilbert2 -- Compute 2-D analytic signal, using the Hilbert transform.
decimate -- Downsample a signal.
detrend -- Remove linear and/or constant trends from data.
resample -- Resample using Fourier method.
resample_poly -- Resample using polyphase filtering method.
upfirdn -- Upsample, apply FIR filter, downsample.
Filter design
bilinear -- Digital filter from an analog filter using the bilinear transform.
findfreqs -- Find array of frequencies for computing filter response.
firls -- FIR filter design using least-squares error minimization.
firwin -- Windowed FIR filter design, with frequency response defined as pass and stop bands.
firwin2 -- Windowed FIR filter design, with arbitrary frequency response.
freqs -- Analog filter frequency response from TF coefficients.
freqs_zpk -- Analog filter frequency response from ZPK coefficients.
freqz -- Digital filter frequency response from TF coefficients.
freqz_zpk -- Digital filter frequency response from ZPK coefficients.
sosfreqz -- Digital filter frequency response for SOS format filter.
group_delay -- Digital filter group delay.
iirdesign -- IIR filter design given bands and gains.
iirfilter -- IIR filter design given order and critical frequencies.
kaiser_atten -- Compute the attenuation of a Kaiser FIR filter, given the number of taps and the transition width at discontinuities in the frequency response.
kaiser_beta -- Compute the Kaiser parameter beta, given the desired FIR filter attenuation.
kaiserord -- Design a Kaiser window to limit ripple and width of transition region.
minimum_phase -- Convert a linear phase FIR filter to minimum phase.
savgol_coeffs -- Compute the FIR filter coefficients for a Savitzky-Golay filter.
remez -- Optimal FIR filter design.
unique_roots -- Unique roots and their multiplicities.
residue -- Partial fraction expansion of b(s) / a(s).
residuez -- Partial fraction expansion of b(z) / a(z).
invres -- Inverse partial fraction expansion for analog filter.
invresz -- Inverse partial fraction expansion for digital filter.
BadCoefficients -- Warning on badly conditioned filter coefficients
Lower-level filter design functions:
abcd_normalize -- Check state-space matrices and ensure they are rank-2.
band_stop_obj -- Band Stop Objective Function for order minimization.
besselap -- Return (z,p,k) for analog prototype of Bessel filter.
buttap -- Return (z,p,k) for analog prototype of Butterworth filter.
cheb1ap -- Return (z,p,k) for type I Chebyshev filter.
cheb2ap -- Return (z,p,k) for type II Chebyshev filter.
cmplx_sort -- Sort roots based on magnitude.
ellipap -- Return (z,p,k) for analog prototype of elliptic filter.
lp2bp -- Transform a lowpass filter prototype to a bandpass filter.
lp2bs -- Transform a lowpass filter prototype to a bandstop filter.
lp2hp -- Transform a lowpass filter prototype to a highpass filter.
lp2lp -- Transform a lowpass filter prototype to a lowpass filter.
normalize -- Normalize polynomial representation of a transfer function.
Matlab-style IIR filter design
butter -- Butterworth
buttord
cheby1 -- Chebyshev Type I
cheb1ord
cheby2 -- Chebyshev Type II
cheb2ord
ellip -- Elliptic (Cauer)
ellipord
bessel -- Bessel (no order selection available -- try buttord)
iirnotch -- Design second-order IIR notch digital filter.
iirpeak -- Design second-order IIR peak (resonant) digital filter.
Continuous-Time Linear Systems
lti -- Continuous-time linear time invariant system base class.
StateSpace -- Linear time invariant system in state space form.
TransferFunction -- Linear time invariant system in transfer function form.
ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form.
lsim -- continuous-time simulation of output to linear system.
lsim2 -- like lsim, but scipy.integrate.odeint is used.
impulse -- impulse response of linear, time-invariant (LTI) system.
impulse2 -- like impulse, but scipy.integrate.odeint is used.
step -- step response of continuous-time LTI system.
step2 -- like step, but scipy.integrate.odeint is used.
freqresp -- frequency response of a continuous-time LTI system.
bode -- Bode magnitude and phase data (continuous-time LTI).
Discrete-Time Linear Systems
dlti -- Discrete-time linear time invariant system base class.
StateSpace -- Linear time invariant system in state space form.
TransferFunction -- Linear time invariant system in transfer function form.
ZerosPolesGain -- Linear time invariant system in zeros, poles, gain form.
dlsim -- simulation of output to a discrete-time linear system.
dimpulse -- impulse response of a discrete-time LTI system.
dstep -- step response of a discrete-time LTI system.
dfreqresp -- frequency response of a discrete-time LTI system.
dbode -- Bode magnitude and phase data (discrete-time LTI).
LTI Representations
tf2zpk -- transfer function to zero-pole-gain.
tf2sos -- transfer function to second-order sections.
tf2ss -- transfer function to state-space.
zpk2tf -- zero-pole-gain to transfer function.
zpk2sos -- zero-pole-gain to second-order sections.
zpk2ss -- zero-pole-gain to state-space.
ss2tf -- state-space to transfer function.
ss2zpk -- state-space to pole-zero-gain.
sos2zpk -- second-order sections to zero-pole-gain.
sos2tf -- second-order sections to transfer function.
cont2discrete -- continuous-time to discrete-time LTI conversion.
place_poles -- pole placement.
Waveforms
chirp -- Frequency swept cosine signal, with several freq functions.
gausspulse -- Gaussian modulated sinusoid
max_len_seq -- Maximum length sequence
sawtooth -- Periodic sawtooth
square -- Square wave
sweep_poly -- Frequency swept cosine signal; freq is arbitrary polynomial
unit_impulse -- Discrete unit impulse
Window functions
get_window -- Return a window of a given length and type.
barthann -- Bartlett-Hann window
bartlett -- Bartlett window
blackman -- Blackman window
blackmanharris -- Minimum 4-term Blackman-Harris window
bohman -- Bohman window
boxcar -- Boxcar window
chebwin -- Dolph-Chebyshev window
cosine -- Cosine window
exponential -- Exponential window
flattop -- Flat top window
gaussian -- Gaussian window
general_gaussian -- Generalized Gaussian window
hamming -- Hamming window
hann -- Hann window
hanning -- Hann window
kaiser -- Kaiser window
nuttall -- Nuttall's minimum 4-term Blackman-Harris window
parzen -- Parzen window
slepian -- Slepian window
triang -- Triangular window
tukey -- Tukey window
Wavelets
cascade -- compute scaling function and wavelet from coefficients
daub -- return the low-pass filter coefficients that produce Daubechies wavelets
morlet -- Complex Morlet wavelet.
qmf -- return quadrature mirror filter from low-pass
ricker -- return ricker wavelet
cwt -- perform continuous wavelet transform
Peak finding
find_peaks_cwt -- Attempt to find the peaks in the given 1-D array
argrelmin -- Calculate the relative minima of data
argrelmax -- Calculate the relative maxima of data
argrelextrema -- Calculate the relative extrema of data
Spectral Analysis
periodogram -- Compute a (modified) periodogram
welch -- Compute a periodogram using Welch's method
csd -- Compute the cross spectral density, using Welch's method
coherence -- Compute the magnitude squared coherence, using Welch's method
spectrogram -- Compute the spectrogram
lombscargle -- Computes the Lomb-Scargle periodogram
vectorstrength -- Computes the vector strength
stft -- Compute the Short Time Fourier Transform
istft -- Compute the Inverse Short Time Fourier Transform
check_COLA -- Check the COLA constraint for iSTFT reconstruction
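Here is a minimal example sketch of the filter design and filtering routines above: it designs a Butterworth low-pass filter with butter and applies it with filtfilt to a made-up noisy signal:

import numpy as np
from scipy import signal

# A 5 Hz sine wave sampled at 500 Hz, with added noise.
fs = 500.0
t = np.arange(0.0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 5.0 * t) + 0.5 * np.random.randn(t.size)

# Design a 4th-order Butterworth low-pass filter with a 20 Hz cutoff.
b, a = signal.butter(4, 20.0 / (fs / 2.0), btype='low')

# Apply it forward and backward for zero phase distortion.
filtered = signal.filtfilt(b, a, x)
print(filtered[:5])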
3.2.6 Linear Algebra (scipy.linalg)
Use this module with either of the following:
>>> import scipy.linalg
>>> from scipy import linalg
Here is a summary (a short example sketch follows at the end of this sub-section):
Basics
inv -- Find the inverse of a square matrix
solve -- Solve a linear system of equations
solve_banded -- Solve a banded linear system
solveh_banded -- Solve a Hermitian or symmetric banded system
solve_circulant -- Solve a circulant system
solve_triangular -- Solve a triangular matrix
solve_toeplitz -- Solve a toeplitz matrix
det -- Find the determinant of a square matrix
norm -- Matrix and vector norm
lstsq -- Solve a linear least-squares problem
pinv -- Pseudo-inverse (Moore-Penrose) using lstsq
pinv2 -- Pseudo-inverse using svd
pinvh -- Pseudo-inverse of hermitian matrix
kron -- Kronecker product of two arrays
tril -- Construct a lower-triangular matrix from a given matrix
triu -- Construct an upper-triangular matrix from a given matrix
orthogonal_procrustes -- Solve an orthogonal Procrustes problem
matrix_balance -- Balance matrix entries with a similarity transformation
subspace_angles -- Compute the subspace angles between two matrices
LinAlgError -- Generic Python-exception-derived object raised by linalg functions.
Eigenvalue Problems
eig -- Find the eigenvalues and eigenvectors of a square matrix
eigvals -- Find just the eigenvalues of a square matrix
eigh -- Find the e-vals and e-vectors of a Hermitian or symmetric matrix
eigvalsh -- Find just the eigenvalues of a Hermitian or symmetric matrix
eig_banded -- Find the eigenvalues and eigenvectors of a banded matrix
eigvals_banded -- Find just the eigenvalues of a banded matrix
eigh_tridiagonal -- Find the eigenvalues and eigenvectors of a tridiagonal matrix
eigvalsh_tridiagonal -- Find just the eigenvalues of a tridiagonal matrix
Decompositions
lu -- LU decomposition of a matrix
lu_factor -- LU decomposition returning unordered matrix and pivots
lu_solve -- Solve Ax=b using back substitution with output of lu_factor
svd -- Singular value decomposition of a matrix
svdvals -- Singular values of a matrix
diagsvd -- Construct matrix of singular values from output of svd
orth -- Construct orthonormal basis for the range of A using svd
cholesky -- Cholesky decomposition of a matrix
cholesky_banded -- Cholesky decomp. of a sym. or Hermitian banded matrix
cho_factor -- Cholesky decomposition for use in solving a linear system
cho_solve -- Solve previously factored linear system
cho_solve_banded -- Solve previously factored banded linear system
polar -- Compute the polar decomposition.
qr -- QR decomposition of a matrix
qr_multiply -- QR decomposition and multiplication by Q
qr_update -- Rank k QR update
qr_delete -- QR downdate on row or column deletion
qr_insert -- QR update on row or column insertion
rq -- RQ decomposition of a matrix
qz -- QZ decomposition of a pair of matrices
ordqz -- QZ decomposition of a pair of matrices with reordering
schur -- Schur decomposition of a matrix
rsf2csf -- Real to complex Schur form
hessenberg -- Hessenberg form of a matrix
See also: scipy.linalg.interpolative -- Interpolative matrix decompositions
Matrix Functions
expm -- Matrix exponential
logm -- Matrix logarithm
cosm -- Matrix cosine
sinm -- Matrix sine
tanm -- Matrix tangent
coshm -- Matrix hyperbolic cosine
sinhm -- Matrix hyperbolic sine
tanhm -- Matrix hyperbolic tangent
signm -- Matrix sign
sqrtm -- Matrix square root
funm -- Evaluating an arbitrary matrix function
expm_frechet -- Frechet derivative of the matrix exponential
expm_cond -- Relative condition number of expm in the Frobenius norm
fractional_matrix_power -- Fractional matrix power
Matrix Equation Solvers
solve_sylvester -- Solve the Sylvester matrix equation
solve_continuous_are -- Solve the continuous-time algebraic Riccati equation
solve_discrete_are -- Solve the discrete-time algebraic Riccati equation
solve_continuous_lyapunov -- Solve the continuous-time Lyapunov equation
solve_discrete_lyapunov -- Solve the discrete-time Lyapunov equation
Sketches and Random Projections
clarkson_woodruff_transform -- Applies the Clarkson Woodruff Sketch (a.k.a CountMin Sketch)
Special Matrices
block_diag -- Construct a block diagonal matrix from submatrices
circulant -- Circulant matrix
companion -- Companion matrix
dft -- Discrete Fourier transform matrix
hadamard -- Hadamard matrix of order 2**n
hankel -- Hankel matrix
helmert -- Helmert matrix
hilbert -- Hilbert matrix
invhilbert -- Inverse Hilbert matrix
leslie -- Leslie matrix
pascal -- Pascal matrix
invpascal -- Inverse Pascal matrix
toeplitz -- Toeplitz matrix
tri -- Construct a matrix filled with ones at and below a given diagonal
Low-level routines
get_blas_funcs
get_lapack_funcs
find_best_blas_type
See also:
scipy.linalg.blas -- Low-level BLAS functions
scipy.linalg.lapack -- Low-level LAPACK functions
scipy.linalg.cython_blas -- Low-level BLAS functions for Cython
scipy.linalg.cython_lapack -- Low-level LAPACK functions for Cython
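Here is a minimal example sketch of a few of the basics listed above (solve, det, inv, and eig) on a small made-up matrix:

import numpy as np
from scipy import linalg

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

print(linalg.solve(A, b))    # solution of A x = b -> [2. 3.]
print(linalg.det(A))         # determinant -> 5.0
print(linalg.inv(A))         # inverse of A

eigenvalues, eigenvectors = linalg.eig(A)
print(eigenvalues)           # complex-valued array of eigenvalues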
3.2.7 Sparse Eigenvalue Problems with ARPACK
There are examples in the Scipy documentation, here: https://docs.scipy.org/doc/scipy/reference/tutorial/arpack.html
And, here is a summary copied from that document:
"ARPACK is a Fortran package which provides routines for quickly finding a few eigenvalues/eigenvectors of large sparse matrices. In order to find these solutions, it requires only left-multiplication by the matrix in question. This operation is performed through a reverse-communication interface. The result of this structure is that ARPACK is able to find eigenvalues and eigenvectors of any linear function mapping a vector to a vector.
"All of the functionality provided in ARPACK is contained within the two high-level interfaces scipy.sparse.linalg.eigs and scipy.sparse.linalg.eigsh. eigs provides interfaces to find the eigenvalues/vectors of real or complex nonsymmetric square matrices, while eigsh provides interfaces for real-symmetric or complex-hermitian matrices."
3.2.8 Compressed Sparse Graph Routines (scipy.sparse.csgraph)
There is an example that implements a search for the shortest path between two words of equal length in a word ladder (i.e., changing just one letter in each step) in the Scipy documentation. You can find it here: https://docs.scipy.org/doc/scipy/reference/tutorial/csgraph.html.
You can get documentation with the following:
$ pydoc scipy.sparse.csgraph
And, in IPython, do something like this:
In [41]: from scipy.sparse import csgraph
In [42]: csgraph.connected_components?
Here is a summary of the contents (a short example sketch follows this list):
connected_components -- determine connected components of a graph.
laplacian -- compute the laplacian of a graph.
shortest_path -- compute the shortest path between points on a positive graph.
dijkstra -- use Dijkstra's algorithm for shortest path.
floyd_warshall -- use the Floyd-Warshall algorithm for shortest path.
bellman_ford -- use the Bellman-Ford algorithm for shortest path.
johnson -- use Johnson's algorithm for shortest path.
breadth_first_order -- compute a breadth-first order of nodes.
depth_first_order -- compute a depth-first order of nodes.
breadth_first_tree -- construct the breadth-first tree from a given node.
depth_first_tree -- construct a depth-first tree from a given node.
minimum_spanning_tree -- construct the minimum spanning tree of a graph.
reverse_cuthill_mckee -- compute permutation for reverse Cuthill-McKee ordering.
maximum_bipartite_matching -- compute permutation to make diagonal zero free.
structural_rank -- compute the structural rank of a graph.
construct_dist_matrix -- Construct distance matrix from a predecessor matrix.
csgraph_from_dense -- Construct a CSR-format sparse graph from a dense matrix.
csgraph_from_masked -- Construct a CSR-format graph from a masked array.
csgraph_masked_from_dense -- Construct a CSR-format sparse graph from a dense matrix.
csgraph_to_dense -- Convert a sparse graph representation to a dense representation.
csgraph_to_masked -- Convert a sparse graph representation to a masked array representation.
reconstruct_path -- Construct a tree from a graph and a predecessor list.
NegativeCycleError -- Exception raised when a negative cycle is detected in the graph
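Here is a minimal example sketch of shortest_path and connected_components from the list above on a small made-up graph:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path, connected_components

# A small directed graph given as a weight matrix (0 means no edge).
graph = csr_matrix(np.array([[0, 1, 2, 0],
                             [0, 0, 0, 1],
                             [0, 0, 0, 3],
                             [0, 0, 0, 0]]))

# All-pairs shortest path distances, using Dijkstra's algorithm.
distances = shortest_path(graph, method='D')
print(distances[0, 3])    # 2.0, via the path 0 -> 1 -> 3

# Number of weakly connected components and a component label per node.
n_components, labels = connected_components(graph, directed=True,
                                             connection='weak')
print(n_components, labels)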
Note that there are other sparse graph libraries for Python. One is Another Python Graph Library: https://pythonhosted.org/apgl/index.html.
3.2.9 Spatial data structures and algorithms (scipy.spatial)
Provides spatial algorithms and data structures.
Here is an example, copied from the documentation:
import numpy as np
from scipy.spatial import Delaunay
import matplotlib.pyplot as plt


def test():
    points = np.array([[0, 0], [0, 1.1], [1, 0], [1, 1]])
    tri = Delaunay(points)
    #
    # We can visualize it:
    plt.triplot(points[:, 0], points[:, 1], tri.simplices.copy())
    plt.plot(points[:, 0], points[:, 1], 'o')
    #
    # And add some further decorations:
    for j, p in enumerate(points):
        # label the points
        plt.text(p[0] - 0.03, p[1] + 0.03, j, ha='right')
    for j, s in enumerate(tri.simplices):
        p = points[s].mean(axis=0)
        # label triangles
        plt.text(p[0], p[1], '#%d' % j, ha='center')
    plt.xlim(-0.5, 1.5)
    plt.ylim(-0.5, 1.5)
    plt.show()
    #
    # The structure of the triangulation is encoded in the following way:
    # the simplices attribute contains the indices of the points in the
    # points array that make up the triangle.  For instance:
    i = 1
    tri.simplices[i, :]
    points[tri.simplices[i, :]]
    return tri, points
Here is a summary of the contents of scipy.spatial (obtained by doing $ pydoc scipy.spatial):
Nearest-neighbor Queries (a short example sketch follows at the end of this sub-section):
KDTree -- class for efficient nearest-neighbor queries
cKDTree -- class for efficient nearest-neighbor queries (faster impl.)
distance -- module containing many different distance measures
Rectangle -- Hyperrectangle class. Represents a Cartesian product of intervals.
Delaunay Triangulation, Convex Hulls, and Voronoi Diagrams:
Delaunay -- compute Delaunay triangulation of input points
ConvexHull -- compute a convex hull for input points
Voronoi -- compute a Voronoi diagram from input points
SphericalVoronoi -- compute a Voronoi diagram from input points on the surface of a sphere
HalfspaceIntersection -- compute the intersection points of input halfspaces
Plotting Helpers:
delaunay_plot_2d -- plot 2-D triangulation
convex_hull_plot_2d -- plot 2-D convex hull
voronoi_plot_2d -- plot 2-D voronoi diagram
Simplex representation:
The simplices (triangles, tetrahedra, ...) appearing in the Delaunay tessellation (N-dim simplices), convex hull facets, and Voronoi ridges (N-1 dim simplices) are represented in the following scheme:
tess = Delaunay(points)
hull = ConvexHull(points)
voro = Voronoi(points)

# coordinates of the j-th vertex of the i-th simplex
tess.points[tess.simplices[i, j], :]         # tessellation element
hull.points[hull.simplices[i, j], :]         # convex hull facet
voro.vertices[voro.ridge_vertices[i, j], :]  # ridge between Voronoi cells
For Delaunay triangulations and convex hulls, the neighborhood structure of the simplices satisfies the condition:
tess.neighbors[i,j] is the neighboring simplex of the i-th simplex, opposite to the j-th vertex. It is -1 in case of no neighbor.
Convex hull facets also define a hyperplane equation:
(hull.equations[i,:-1] * coord).sum() + hull.equations[i,-1] == 0
Similar hyperplane equations for the Delaunay triangulation correspond to the convex hull facets on the corresponding N+1 dimensional paraboloid.
The Delaunay triangulation objects offer a method for locating the simplex containing a given point, and barycentric coordinate computations.
Functions:
tsearch
distance_matrix
minkowski_distance
minkowski_distance_p
procrustes
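Here is a minimal example sketch of a nearest-neighbor query with cKDTree on some made-up random points:

import numpy as np
from scipy.spatial import cKDTree

# Build a k-d tree over some random 2-D points.
points = np.random.rand(100, 2)
tree = cKDTree(points)

# Find the 3 points nearest to the query point (0.5, 0.5).
distances, indices = tree.query([0.5, 0.5], k=3)
print(distances)
print(points[indices])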
3.2.10 Statistics (scipy.stats)
This module contains a large number of probability distributions as well as a growing library of statistical functions.
Each univariate distribution is an instance of a subclass of rv_continuous (rv_discrete for discrete distributions):
rv_continuous
rv_discrete
rv_histogram
Here is a summary of the items in scipy.stats:
Continuous distributions
alpha -- Alpha
anglit -- Anglit
arcsine -- Arcsine
argus -- Argus
beta -- Beta
betaprime -- Beta Prime
bradford -- Bradford
burr -- Burr (Type III)
burr12 -- Burr (Type XII)
cauchy -- Cauchy
chi -- Chi
chi2 -- Chi-squared
cosine -- Cosine
crystalball -- Crystalball
dgamma -- Double Gamma
dweibull -- Double Weibull
erlang -- Erlang
expon -- Exponential
exponnorm -- Exponentially Modified Normal
exponweib -- Exponentiated Weibull
exponpow -- Exponential Power
f -- F (Snedecor F)
fatiguelife -- Fatigue Life (Birnbaum-Saunders)
fisk -- Fisk
foldcauchy -- Folded Cauchy
foldnorm -- Folded Normal
frechet_r -- Deprecated. Alias for weibull_min
frechet_l -- Deprecated. Alias for weibull_max
genlogistic -- Generalized Logistic
gennorm -- Generalized normal
genpareto -- Generalized Pareto
genexpon -- Generalized Exponential
genextreme -- Generalized Extreme Value
gausshyper -- Gauss Hypergeometric
gamma -- Gamma
gengamma -- Generalized gamma
genhalflogistic -- Generalized Half Logistic
gilbrat -- Gilbrat
gompertz -- Gompertz (Truncated Gumbel)
gumbel_r -- Right Sided Gumbel, Log-Weibull, Fisher-Tippett, Extreme Value Type I
gumbel_l -- Left Sided Gumbel, etc.
halfcauchy -- Half Cauchy
halflogistic -- Half Logistic
halfnorm -- Half Normal
halfgennorm -- Upper half of a generalized normal
hypsecant -- Hyperbolic Secant
invgamma -- Inverse Gamma
invgauss -- Inverse Gaussian
invweibull -- Inverse Weibull
johnsonsb -- Johnson SB
johnsonsu -- Johnson SU
kappa4 -- Kappa 4 parameter
kappa3 -- Kappa 3 parameter
ksone -- Kolmogorov-Smirnov one-sided (no stats)
kstwobign -- Kolmogorov-Smirnov two-sided test for Large N (no stats)
laplace -- Laplace
levy -- Levy
levy_l
levy_stable
logistic -- Logistic
loggamma -- Log-Gamma
loglaplace -- Log-Laplace (Log Double Exponential)
lognorm -- Log-Normal
lomax -- Lomax (Pareto of the second kind)
maxwell -- Maxwell
mielke -- Mielke's Beta-Kappa
nakagami -- Nakagami
ncx2 -- Non-central chi-squared
ncf -- Non-central F
nct -- Non-central Student's T
norm -- Normal (Gaussian)
pareto -- Pareto
pearson3 -- Pearson type III
powerlaw -- Power-function
powerlognorm -- Power log normal
powernorm -- Power normal
rdist -- R-distribution
reciprocal -- Reciprocal
rayleigh -- Rayleigh
rice -- Rice
recipinvgauss -- Reciprocal Inverse Gaussian
semicircular -- Semicircular
skewnorm -- Skew normal
t -- Student's T
trapz -- Trapezoidal
triang -- Triangular
truncexpon -- Truncated Exponential
truncnorm -- Truncated Normal
tukeylambda -- Tukey-Lambda
uniform -- Uniform
vonmises -- Von-Mises (Circular)
vonmises_line -- Von-Mises (Line)
wald -- Wald
weibull_min -- Minimum Weibull (see Frechet)
weibull_max -- Maximum Weibull (see Frechet)
wrapcauchy -- Wrapped Cauchy
Multivariate distributions
multivariate_normal -- Multivariate normal distribution
matrix_normal -- Matrix normal distribution
dirichlet -- Dirichlet
wishart -- Wishart
invwishart -- Inverse Wishart
multinomial -- Multinomial distribution
special_ortho_group -- SO(N) group
ortho_group -- O(N) group
unitary_group -- U(N) group
random_correlation -- random correlation matrices
Discrete distributions
bernoulli -- Bernoulli
binom -- Binomial
boltzmann -- Boltzmann (Truncated Discrete Exponential)
dlaplace -- Discrete Laplacian
geom -- Geometric
hypergeom -- Hypergeometric
logser -- Logarithmic (Log-Series, Series)
nbinom -- Negative Binomial
planck -- Planck (Discrete Exponential)
poisson -- Poisson
randint -- Discrete Uniform
skellam -- Skellam
zipf -- Zipf
Statistical functions -- Several of these functions have a similar version in scipy.stats.mstats that works on masked arrays. (A short example sketch follows this list.)
describe -- Descriptive statistics
gmean -- Geometric mean
hmean -- Harmonic mean
kurtosis -- Fisher or Pearson kurtosis
kurtosistest -- Test whether a dataset has normal kurtosis.
mode -- Modal value
moment -- Central moment
normaltest -- Test whether a sample differs from a normal distribution
skew -- Skewness
skewtest -- Test whether the skewness differs from that of a normal distribution
kstat -- Return the nth k-statistic
kstatvar -- Unbiased estimator of the variance of the k-statistic
tmean -- Truncated arithmetic mean
tvar -- Truncated variance
tmin -- Truncated minimum
tmax -- Truncated maximum
tstd -- Truncated standard deviation
tsem -- Truncated standard error of the mean
variation -- Coefficient of variation
find_repeats
trim_mean
cumfreq
itemfreq
percentileofscore
scoreatpercentile
relfreq
binned_statistic -- Compute a binned statistic for a set of data.
binned_statistic_2d -- Compute a 2-D binned statistic for a set of data.
binned_statistic_dd -- Compute a d-D binned statistic for a set of data.
obrientransform
bayes_mvs
mvsdist
sem
zmap
zscore
iqr
sigmaclip
trimboth
trim1
f_oneway
pearsonr
spearmanr
pointbiserialr
kendalltau
weightedtau
linregress
theilslopes
ttest_1samp
ttest_ind
ttest_ind_from_stats
ttest_rel
kstest
chisquare
power_divergence
ks_2samp
mannwhitneyu
tiecorrect
rankdata
ranksums
wilcoxon
kruskal
friedmanchisquare
combine_pvalues
jarque_bera
ansari
bartlett
levene
shapiro
anderson
anderson_ksamp
binom_test
fligner
median_test
mood
boxcox
boxcox_normmax
boxcox_llf
entropy
wasserstein_distance
energy_distance
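Here is a minimal example sketch of a few of the statistical functions above (describe, ttest_ind, and pearsonr) on made-up samples:

import numpy as np
from scipy import stats

# Two made-up samples drawn from normal distributions.
sample_a = np.random.normal(loc=0.0, scale=1.0, size=200)
sample_b = np.random.normal(loc=0.3, scale=1.0, size=200)

# Descriptive statistics: n, min/max, mean, variance, skewness, kurtosis.
print(stats.describe(sample_a))

# Two-sample t-test for equal means.
t_statistic, p_value = stats.ttest_ind(sample_a, sample_b)
print(t_statistic, p_value)

# Pearson correlation between two (related) samples.
print(stats.pearsonr(sample_a, sample_a + np.random.normal(size=200)))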
Circular statistical functions
circmean
circvar
circstd
Contingency table functions
chi2_contingency
contingency.expected_freq
contingency.margins
fisher_exact
Plot-tests
ppcc_max
ppcc_plot
probplot
boxcox_normplot
Masked statistics functions -- Module scipy.stats.mstats contains statistical functions for masked arrays.
For more information in IPython, do:
In [1]: from scipy.stats import mstats In [2]: mstats?
Or, from the command line do $ pydoc scipy.stats.mstats.
Univariate and multivariate kernel density estimation (scipy.stats.kde)
gaussian_kde -- Representation of a kernel-density estimate using Gaussian kernels.
Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable in a non-parametric way. gaussian_kde works for both uni-variate and multi-variate data. It includes automatic bandwidth determination. The estimation works best for a unimodal distribution; bimodal or multi-modal distributions tend to be oversmoothed.
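Here is a minimal example sketch of gaussian_kde on a made-up bimodal sample:

import numpy as np
from scipy import stats

# A made-up bimodal sample.
sample = np.concatenate([np.random.normal(-2.0, 0.5, size=300),
                         np.random.normal(2.0, 0.5, size=300)])

# Fit a Gaussian kernel density estimate (bandwidth chosen automatically).
kde = stats.gaussian_kde(sample)

# Evaluate the estimated PDF on a grid of points.
grid = np.linspace(-5.0, 5.0, 11)
print(kde(grid))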
For many more statistics-related functions, install the R software and the Python interface package rpy.
3.2.11 Multidimensional image processing (scipy.ndimage)
The module scipy.ndimage contains various functions for multi-dimensional image processing.
For information on these functions, do (for example, in IPython):
In [6]: from scipy import ndimage
In [7]: ndimage?
In [8]: ndimage.convolve?
Or, from the command line, do: $ pydoc scipy.ndimage.convolve.
Here is an example -- it computes the multi-dimensional convolution of a Numpy ndarray:
import numpy as np
from scipy import ndimage


def test():
    a = np.array([[1, 2, 0, 0],
                  [5, 3, 0, 4],
                  [0, 0, 0, 7],
                  [9, 3, 0, 0]])
    k = np.array([[1, 1, 1],
                  [1, 1, 0],
                  [1, 0, 0]])
    result = ndimage.convolve(a, k, mode='constant', cval=0.0)
    return result
Here is a summary of the contents of scipy.ndimage:
Filters
convolve -- Multi-dimensional convolution
convolve1d -- 1-D convolution along the given axis
correlate -- Multi-dimensional correlation
correlate1d -- 1-D correlation along the given axis
gaussian_filter -- Multi-dimensional Gaussian filter
gaussian_filter1d -- 1-D Gaussian filter along the given axis
gaussian_gradient_magnitude -- Gradient magnitude using Gaussian derivatives
gaussian_laplace -- Laplace filter using Gaussian second derivatives
generic_filter -- Multi-dimensional filter using a given function
generic_filter1d -- 1-D generic filter along the given axis
generic_gradient_magnitude
generic_laplace
laplace -- n-D Laplace filter based on approximate second derivatives
maximum_filter
maximum_filter1d
median_filter -- Calculates a multi-dimensional median filter
minimum_filter
minimum_filter1d
percentile_filter -- Calculates a multi-dimensional percentile filter
prewitt
rank_filter -- Calculates a multi-dimensional rank filter
sobel
uniform_filter -- Multi-dimensional uniform filter
uniform_filter1d -- 1-D uniform filter along the given axis
Fourier filters
fourier_ellipsoid
fourier_gaussian
fourier_shift
fourier_uniform
Interpolation
affine_transform -- Apply an affine transformation
geometric_transform -- Apply an arbitrary geometric transform
map_coordinates -- Map input array to new coordinates by interpolation
rotate -- Rotate an array
shift -- Shift an array
spline_filter
spline_filter1d
zoom -- Zoom an array
Measurements
center_of_mass -- The center of mass of the values of an array at labels
extrema -- Minima and maxima of an array at labels, with their positions
find_objects -- Find objects in a labeled array
histogram -- Histogram of the values of an array, optionally at labels
label -- Label features in an array
labeled_comprehension
maximum
maximum_position
mean -- Mean of the values of an array at labels
median
minimum
minimum_position
standard_deviation -- Standard deviation of an n-D image array
sum -- Sum of the values of the array
variance -- Variance of the values of an n-D image array
watershed_ift
Morphology
binary_closing
binary_dilation
binary_erosion
binary_fill_holes
binary_hit_or_miss
binary_opening
binary_propagation
black_tophat
distance_transform_bf
distance_transform_cdt
distance_transform_edt
generate_binary_structure
grey_closing
grey_dilation
grey_erosion
grey_opening
iterate_structure
morphological_gradient
morphological_laplace
white_tophat
Utility
imread -- Load an image from a file
3.2.12 File IO (scipy.io)
Scipy provides routines to read/write a number of special file formats. Here are some of them (a short example sketch follows this list):
MATLAB® files:
loadmat -- Read a MATLAB style mat file (version 4 through 7.1)
savemat -- Write a MATLAB style mat file (version 4 through 7.1)
whosmat -- List contents of a MATLAB style mat file (version 4 through 7.1)
IDL® files:
readsav -- Read an IDL 'save' file
Matrix Market files:
mminfo -- Query matrix info from Matrix Market formatted file
mmread -- Read matrix from Matrix Market formatted file
mmwrite -- Write matrix to Matrix Market formatted file
Unformatted Fortran files:
FortranFile -- A file object for unformatted sequential Fortran files
Netcdf:
netcdf_file -- A file object for NetCDF data
netcdf_variable -- A data object for the netcdf module
Harwell-Boeing files:
hb_read -- read H-B file
hb_write -- write H-B file
Wav sound files (scipy.io.wavfile):
read -- Return the sample rate (in samples/sec) and data from a WAV file.
write -- Write a numpy array as a WAV file.
WavFileWarning -- Warning issued for problems encountered while reading a WAV file.
Arff files (scipy.io.arff):
loadarff -- Read an arff file.
MetaData -- Small container to keep useful information on a ARFF dataset.
ArffError -- Error raised for problems reading an ARFF file.
ParseArffError -- Error raised while parsing an ARFF file.
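Here is a minimal example sketch of the MATLAB-style routines listed above (savemat, whosmat, and loadmat); the file name is made up:

import numpy as np
from scipy import io

# Save a couple of arrays to a MATLAB-style .mat file.
data = {'a': np.arange(10), 'b': np.eye(3)}
io.savemat('example_data.mat', data)

# Inspect and reload it.
print(io.whosmat('example_data.mat'))   # list of (name, shape, dtype) entries
loaded = io.loadmat('example_data.mat')
print(loaded['a'], loaded['b'])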
3.3 Pandas
Pandas vs. Numpy -- Pandas raises Numpy data structures to a higher level. In particular, see the DataFrame object.
For documentation on Pandas, see: http://pandas.pydata.org/pandas-docs/stable/. There are tutorials, get-started guides, cookbook docs, and more.
10 Minutes to pandas seems especially helpful, although it contains a lot more than 10 minutes' worth of material. It gives basic instructions on how to use the Pandas data types.
And, be sure to look at the various Pandas tutorials.
There are also cookbooks full of code snippets at the Pandas documentation site.
Perhaps it's advisable to view Pandas as being as much about learning techniques for (1) cleaning up your data, (2) exploring and finding significant aspects of your data, and (3) viewing and displaying your data as it is about performing calculations and analysis on it. Pandas provides such a rich set of techniques for working with your data that you should expect to spend a reasonable amount of time learning to do the tasks you need, rather than just quickly learning some small set of functions.
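For example, here is a minimal sketch (the column names and values are made up) of that kind of clean-up and quick exploration on a small DataFrame with missing values:

import numpy as np
import pandas as pd

# A small made-up DataFrame with some missing values.
df = pd.DataFrame({'A': [1.0, np.nan, 3.0, 4.0],
                   'B': [10.0, 20.0, np.nan, 40.0]})

# Clean up: fill missing values in A with 0, drop rows still missing data.
cleaned = df.fillna({'A': 0.0}).dropna()

# Explore: summary statistics and a quick look at the data.
print(cleaned.describe())
print(cleaned.head())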
3.3.1 Create Pandas data structures
Here is an example that creates several of the Pandas data structures that are used in the "10 Minutes to pandas" document referenced above:
def make_sample_dataframe():
    """Make sample dates and DataFrame.  Returns (dates, df)."""
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    return dates, df
And, here is an example of the use of the above function:
In [117]: import utils01
In [118]: dates, df = utils01.make_sample_dataframe()
In [119]:
In [119]: dates
Out[119]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [120]:
In [120]: df
Out[120]:
                   A         B         C         D
2013-01-01  0.521515  1.006002 -1.408913 -0.218981
2013-01-02 -0.517541 -0.190499  0.397701  0.895858
2013-01-03  0.068253  0.499286 -1.098401 -1.323183
2013-01-04 -0.086779  0.025269  0.459892  0.588754
2013-01-05  1.384825 -1.141312  0.097294  0.169665
2013-01-06 -0.391738 -0.072600  0.196514  0.799174
3.3.2 View Pandas data structures
View the first and last rows of a DataFrame:
In [34]: df.head(n=2)
Out[34]:
                   A         B         C         D
2013-01-01 -0.557541  1.016474  0.933149 -0.524661
2013-01-02  1.682318 -1.605635 -0.324727  2.057636
In [35]:
In [35]: df.tail(n=3)
Out[35]:
                   A         B         C         D
2013-01-04  0.696414  0.538999  1.131596 -0.960681
2013-01-05 -0.175765 -0.494210  1.111779 -0.670209
2013-01-06 -1.615098  0.018027  0.584815 -1.508152
Get the shape, column (labels), and actual data from a DataFrame:
In [38]: df.shape
Out[38]: (6, 4)
In [39]: df.columns
Out[39]: Index(['A', 'B', 'C', 'D'], dtype='object')
In [40]: df.values
Out[40]:
array([[-0.55754086,  1.01647419,  0.93314867, -0.52466119],
       [ 1.68231758, -1.60563477, -0.32472655,  2.05763649],
       [-0.45481149, -0.09087637, -1.1383327 , -0.7950994 ],
       [ 0.69641379,  0.53899898,  1.13159619, -0.96068123],
       [-0.17576451, -0.49421043,  1.11177912, -0.67020918],
       [-1.61509837,  0.01802738,  0.58481469, -1.50815216]])
In [41]: type(df.values)
Out[41]: numpy.ndarray
Note that df.values returns an ndarray.
3.3.3 Access the contents of a DataFrame
Access a row or range of rows -- Use .iloc with a single index or a slice. Examples:
In [72]: df.iloc[1]
Out[72]:
A    0.721339
B    0.733763
C   -1.153457
D   -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [73]: df.iloc[1:2]
Out[73]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
In [74]: df.iloc[1:4]
Out[74]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
2013-01-04  0.737643 -1.539155  0.410927  0.038682
Access a row or range of rows -- Use .loc with index labels. Examples:
In [64]: df.loc[dates[1]]
Out[64]:
A    0.721339
B    0.733763
C   -1.153457
D   -1.345582
Name: 2013-01-02 00:00:00, dtype: float64
In [65]: df.loc[dates[1]:dates[2]]
Out[65]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
In [66]: df.loc[dates[1]:dates[1]]
Out[66]:
                   A         B         C         D
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
In [67]: df.loc['2013-01-01']
Out[67]:
A    1.373992
B   -0.080698
C   -0.018425
D   -0.424205
Name: 2013-01-01 00:00:00, dtype: float64
In [68]: df.loc['2013-01-01':'2013-01-03']
Out[68]:
                   A         B         C         D
2013-01-01  1.373992 -0.080698 -0.018425 -0.424205
2013-01-02  0.721339  0.733763 -1.153457 -1.345582
2013-01-03  2.047318  0.406103 -1.893892  0.065913
Notes:
dates was used to create the index for df:
def make_sample_dataframe1():
    """Make sample dates and DataFrame.  Returns (dates, df)."""
    dates = pd.date_range('20130101', periods=6)
    df = pd.DataFrame(
        np.random.randn(6, 4),
        index=dates,
        columns=list('ABCD'))
    return dates, df
Access the rows where the content of an item (column) in that row satisfies a condition or test:
In [10]: df.loc[df.B > 0].head() Out[10]: Unnamed: 0 A B C D 2 2013-01-03 0.986316 1.870495 -1.598345 -2.551315 5 2013-01-06 1.385534 1.328005 1.741578 -0.409209 7 2013-01-08 -0.820344 0.318531 0.278434 -0.898119 9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832 20 2013-01-21 -0.567319 1.784550 -0.114723 0.315661
Or:
In [9]: df.loc[df.B.apply(lambda x: x > 0)].head() Out[9]: Unnamed: 0 A B C D 2 2013-01-03 0.986316 1.870495 -1.598345 -2.551315 5 2013-01-06 1.385534 1.328005 1.741578 -0.409209 7 2013-01-08 -0.820344 0.318531 0.278434 -0.898119 9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832 20 2013-01-21 -0.567319 1.784550 -0.114723 0.315661
Notes:
The use of .apply() along with lambda (or a named Python function) enables us to select rows based on an arbitrarily complex condition.
Also, consider using functools.partial(). The following selects rows where the value in column B is in the range -0.1 to 0.1:
In [25]: import functools In [26]: f = functools.partial(lambda x, y, z: z > x and z < y, -0.1, 0.1) In [27]: In [27]: df.loc[df.B.apply(f)].head() Out[27]: Unnamed: 0 A B C D 9 2013-01-10 -2.342766 0.048417 -0.352930 -0.134832 27 2013-01-28 -0.673330 0.075427 -0.477715 -0.475463 33 2013-02-03 -0.776301 0.015220 0.518606 -0.286090 38 2013-02-08 0.894722 0.005027 -0.763636 -0.150279 44 2013-02-14 -0.403519 -0.059570 0.929560 -1.065283
Access a column or several columns -- Use the Python indexing operator ([]), with a column label or iterable of column labels. Or, for a single column, use dot notation. Examples:
In [98]: df['B'] Out[98]: 2013-01-01 -0.080698 2013-01-02 0.733763 2013-01-03 0.406103 2013-01-04 -1.539155 2013-01-05 -0.963585 2013-01-06 0.934215 Freq: D, Name: B, dtype: float64 In [99]: df[['B', 'D']] Out[99]: B D 2013-01-01 -0.080698 -0.424205 2013-01-02 0.733763 -1.345582 2013-01-03 0.406103 0.065913 2013-01-04 -1.539155 0.038682 2013-01-05 -0.963585 -0.449162 2013-01-06 0.934215 1.473294 In [100]: In [100]: df.C Out[100]: 2013-01-01 -0.018425 2013-01-02 -1.153457 2013-01-03 -1.893892 2013-01-04 0.410927 2013-01-05 -1.627970 2013-01-06 0.240306 Freq: D, Name: C, dtype: float64
Access individual elements by index relative to zero -- Use .iloc[r, c]:
In [42]: df.iloc[0] Out[42]: A 1.373992 B -0.080698 C -0.018425 D -0.424205 Name: 2013-01-01 00:00:00, dtype: float64 In [43]: df.iloc[0, 1] Out[43]: -0.08069801201343964 In [44]: df.iloc[0, 1:3] Out[44]: B -0.080698 C -0.018425 Name: 2013-01-01 00:00:00, dtype: float64 In [45]: df.iloc[0:4, 1] Out[45]: 2013-01-01 -0.080698 2013-01-02 0.733763 2013-01-03 0.406103 2013-01-04 -1.539155 Freq: D, Name: B, dtype: float64 In [46]: df.iloc[0:4, 1:-1] Out[46]: B C 2013-01-01 -0.080698 -0.018425 2013-01-02 0.733763 -1.153457 2013-01-03 0.406103 -1.893892 2013-01-04 -1.539155 0.410927 In [47]: df.iloc[0:4, 1:] Out[47]: B C D 2013-01-01 -0.080698 -0.018425 -0.424205 2013-01-02 0.733763 -1.153457 -1.345582 2013-01-03 0.406103 -1.893892 0.065913 2013-01-04 -1.539155 0.410927 0.038682
3.3.4 Iterate over a DataFrame
There are several ways to do this. Here are some examples:
import utils01

def test():
    dates, df = utils01.make_sample_dataframe1()
    # iterate over column labels.
    print("*\n* column labels --\n*")
    print([x for x in df])
    # iterate over items
    print("*\n* items --\n*")
    print([x for x in df.head(n=2).iteritems()])
    # iterate over rows
    print("*\n* rows --\n*")
    print([x for x in df.head(n=2).iterrows()])
    # iterate over rows as named tuples.
    print("*\n* named tuples --\n*")
    print([x for x in df.head(n=2).itertuples()])
    # iterate over rows as named tuples returning one column from each tuple.
    print("*\n* column \"B\" from named tuple --\n*")
    print([x.B for x in df.head(n=2).itertuples()])
Here is the output from the above function:
In [67]: test() * * column labels -- * ['A', 'B', 'C', 'D'] * * items -- * [('A', 2013-01-01 -2.443710 2013-01-02 -1.003475 Freq: D, Name: A, dtype: float64), ('B', 2013-01-01 -0.320540 2013-01-02 -1.020769 Freq: D, Name: B, dtype: float64), ('C', 2013-01-01 0.010302 2013-01-02 0.115615 Freq: D, Name: C, dtype: float64), ('D', 2013-01-01 0.935831 2013-01-02 -0.514601 Freq: D, Name: D, dtype: float64)] * * rows -- * [(Timestamp('2013-01-01 00:00:00', freq='D'), A -2.443710 B -0.320540 C 0.010302 D 0.935831 Name: 2013-01-01 00:00:00, dtype: float64), (Timestamp('2013-01-02 00:00:00', freq='D'), A -1.003475 B -1.020769 C 0.115615 D -0.514601 Name: 2013-01-02 00:00:00, dtype: float64)] * * named tuples -- * [Pandas(Index=Timestamp('2013-01-01 00:00:00', freq='D'), A=-2.4437103289150857, B=-0.32054023603910436, C=0.01030189942471081, D=0.9358311337233644), Pandas(Index=Timestamp('2013-01-02 00:00:00', freq='D'), A=-1.0034752077816913, B=-1.0207687970125863, C=0.11561494820245698, D=-0.5146012044818192)] * * column "B" from named tuple -- * [-0.32054023603910436, -1.0207687970125863]
Iterating over a pandas.DataFrame produces the column labels, which can be used to access the columns of the DataFrame. Example:
In [92]: for column in df:
    ...:     print("{}[0]: {:7.3f}".format(column, getattr(df, column)[0]))
    ...:
A[0]:  -0.368
B[0]:   1.122
C[0]:  -0.890
D[0]:   0.076
An easier (and cleaner?) way to access a column would be: df[column].
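For example, the loop above could be rewritten as follows (a minimal sketch; it assumes the same sample df as above):

for column in df:
    # df[column] selects the column as a Series; .iloc[0] is its first element.
    print("{}[0]: {:7.3f}".format(column, df[column].iloc[0]))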
In contrast, iterating over a pandas.Series produces the items in the Series. Example (note that dates is actually a DatetimeIndex, which iterates over its items in the same way):
In [112]: for date in dates:
     ...:     print('date: {}/{}/{}'.format(date.month, date.day, date.year))
     ...:
date: 1/1/2013
date: 1/2/2013
date: 1/3/2013
date: 1/4/2013
date: 1/5/2013
date: 1/6/2013
Here is a simple bit of code that iterates over each of the items (cells) in a Pandas DataFrame. This function prints out elements column by column:
def show_df(df):
    for idx1, label in enumerate(df):
        print('{}. Column: {}'.format(idx1, label))
        for idx2, item in enumerate(df[label]):
            print('    {}.{}. {:+6.4f}'.format(idx1, idx2, item))
And, here is what the above (function show_df) might display:
In [78]: show_df(df.head(n=2))
0. Column: A
    0.0. +0.9590
    0.1. -3.6568
1. Column: B
    1.0. +1.1409
    1.1. -0.4395
2. Column: C
    2.0. +1.2634
    2.1. -0.3644
3. Column: D
    3.0. +0.0824
    3.1. +1.1789
And, here is a function that prints out elements row by row (i.e. one row after another):
def show_df_by_rows(df):
    columns = df.columns
    for row, index in enumerate(df.index):
        print('{}. Row: {}'.format(row, index))
        for idx, item in enumerate(df.loc[index]):
            print('    {}.{}. {:+6.4f}'.format(idx, columns[idx], item))
Here is a sample printout from the above function:
0. Row: 2013-01-01 00:00:00
    0.A. +0.9590
    1.B. +1.1409
    2.C. +1.2634
    3.D. +0.0824
1. Row: 2013-01-02 00:00:00
    0.A. -3.6568
    1.B. -0.4395
    2.C. -0.3644
    3.D. +1.1789
You can do something analogous with list comprehensions or generator expressions. For example, consider this code:
def show_dataframe(df):
    generator = (
        (index, b.items())
        for (index, b) in ((index, df.loc[index]) for index in df.index))
    for date, data in generator:
        print('date: {}'.format(date))
        for col, item in data:
            print('    col: {} item: {:12.4f}'.format(col, item))
When we run the above, calling show_dataframe, we might see:
In [90]: show_dataframe(df.tail(2))
date: 2013-01-05 00:00:00
    col: A item:       0.2175
    col: B item:       0.1573
    col: C item:      -0.2240
    col: D item:       0.2395
date: 2013-01-06 00:00:00
    col: A item:       0.1440
    col: B item:      -0.9796
    col: C item:      -2.2432
    col: D item:      -0.7182
Notes:
In the above example, we produced generator expressions. Note the parentheses around the outer expression and inner expression used to produce generator. If we had used square brackets instead of parentheses, that expression would have produced lists.
The function show_dataframe contains a nested loop: the outer loop iterates over the outer generator expression, and within that outer loop, an inner loop iterates over each nested inner generator expression.
3.3.5 Grouping items in a DataFrame
You can group items in a DataFrame according to some criteria, then process only items in that group. For example:
In [363]: dates, df = utils01.make_sample_dataframe1() In [364]: df Out[364]: A B C D 2013-01-01 0.286823 -0.490076 1.876985 0.900970 2013-01-02 0.338896 -0.111205 -1.516295 1.344511 2013-01-03 -1.045215 -0.155277 -0.238831 0.763586 2013-01-04 0.911923 0.383383 -1.838096 -0.233212 2013-01-05 -0.424031 -0.396694 -1.260573 1.912463 2013-01-06 1.198149 -0.729439 1.578052 -1.139293 In [365]: f1 = lambda x: 0 if x < 0.0 else 1 In [366]: df["E"] = [f1(x) for x in df.A] In [367]: df Out[367]: A B C D E 2013-01-01 0.286823 -0.490076 1.876985 0.900970 1 2013-01-02 0.338896 -0.111205 -1.516295 1.344511 1 2013-01-03 -1.045215 -0.155277 -0.238831 0.763586 0 2013-01-04 0.911923 0.383383 -1.838096 -0.233212 1 2013-01-05 -0.424031 -0.396694 -1.260573 1.912463 0 2013-01-06 1.198149 -0.729439 1.578052 -1.139293 1 In [368]: groups = df.groupby("E") In [369]: In [369]: len(groups) Out[369]: 2 In [371]: groups.get_group(0) Out[371]: A B C D E 2013-01-03 -1.045215 -0.155277 -0.238831 0.763586 0 2013-01-05 -0.424031 -0.396694 -1.260573 1.912463 0 In [372]: In [372]: groups.get_group(1) Out[372]: A B C D E 2013-01-01 0.286823 -0.490076 1.876985 0.900970 1 2013-01-02 0.338896 -0.111205 -1.516295 1.344511 1 2013-01-04 0.911923 0.383383 -1.838096 -0.233212 1 2013-01-06 1.198149 -0.729439 1.578052 -1.139293 1
Notes:
We use the function/lambda f1 to distinguish between values that are less than zero and those that are greater than or equal to zero.
We create a list of keys depending on the values in column "A".
We create a new column in our DataFrame containing these keys.
We group the DataFrame depending on the values in this new column.
Next we can determine the number of groups (using len(groups)).
And we can access each group individually (with groups.get_group(n)).
Notice that all the items in the first group have negative values in column "A", and all the items in the second group have non-negative values in column "A".
An alternative way to do the above task would pass a function to the .groupby method. That function could assign or select rows in arbitrarily complex ways. For example, the following function could assign items to two groups depending on whether the value in column "A" is negative or positive:
In [33]: def f1(index): ...: return 1 if df.loc[index].A < 0.0 else 0 ...: ...: In [34]: In [34]: a = df.groupby(f1) In [35]: In [35]: len(a) Out[35]: 2 In [36]: In [36]: a.get_group(0) Out[36]: A B C D E 2013-01-01 0.823745 1.259863 0.099038 2.401296 0 2013-01-03 1.067624 1.106958 1.616902 0.939021 0 2013-01-04 1.152899 0.190998 -0.062540 -1.786131 0 2013-01-06 0.680271 1.307369 -0.024296 -0.973855 0 In [37]: In [37]: a.get_group(1) Out[37]: A B C D E 2013-01-02 -0.358235 -1.920455 -0.553173 0.580201 1 2013-01-05 -0.226727 0.180529 0.900700 -1.835082 1
3.3.6 Applying functions to a DataFrame
You can do this in a variety of ways:
Element-wise -- Use .map for Series and .applymap for DataFrame:
In [171]: dates.map(lambda x: x.day)
Out[171]: Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

In [172]: df.applymap(lambda x: 0.0 if x < 0.0 else x * 10.0)
Out[172]:
                   A          B          C         D
2013-01-01  0.000000  11.222224   0.000000  0.764820
2013-01-02  8.165304   0.000000   8.425176  0.000000
2013-01-03  0.000000   7.066568  10.162480  0.000000
2013-01-04  7.097722   0.000000  10.544352  2.593139
2013-01-05  0.000000   0.000000  10.031058  6.354610
2013-01-06  5.629199   1.180783   0.000000  0.000000
Row-wise and column-wise -- Use one of:
df.apply(fn) -- Apply function to each column.
df.apply(fn, axis=1) -- Apply function to each row. (See the short example after this list.)
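For example, here is a minimal sketch (assuming the sample df created above and the usual np import) that computes a per-column and a per-row summary with .apply:

import numpy as np

# Apply the function once per column (the default, axis=0).
col_means = df.apply(np.mean)
# Apply the function once per row.
row_means = df.apply(np.mean, axis=1)
print(col_means)
print(row_means)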
For functions that take and return a DataFrame or that take and return a Series, use .pipe. Example:
In [197]: fn = lambda x: np.abs(x)

In [198]: df.pipe(fn)
Out[198]:
                   A         B         C         D
2013-01-01  0.368409  1.122222  0.889764  0.076482
2013-01-02  0.816530  0.963447  0.842518  1.371106
2013-01-03  0.164827  0.706657  1.016248  0.474849
2013-01-04  0.709772  1.695648  1.054435  0.259314
2013-01-05  0.057673  0.713738  1.003106  0.635461
2013-01-06  0.562920  0.118078  1.904701  0.149196
And, remember that there may be use cases where it is useful to create a "vectorized" function with numpy.vectorize.
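Here is a minimal sketch of that idea (the name clip_negative is just an illustration), applying a vectorized scalar function to the underlying ndarray of a DataFrame:

import numpy as np
import pandas as pd

# Vectorize a plain Python scalar function so it works element-wise on arrays.
clip_negative = np.vectorize(lambda x: 0.0 if x < 0.0 else x)
clipped = pd.DataFrame(
    clip_negative(df.values), index=df.index, columns=df.columns)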
3.3.7 Sorting a DataFrame or a Series
You can sort by index, value, etc. See: http://pandas.pydata.org/pandas-docs/stable/basics.html#sorting.
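Here is a brief sketch of the most common calls (assuming the sample df from above):

df.sort_index(ascending=False)        # sort rows by the (date) index, descending
df.sort_values(by='B')                # sort rows by the values in column "B"
df.sort_values(by=['B', 'C'])         # sort by column "B", then by column "C"
df['B'].sort_values(ascending=False)  # sort a single Series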
3.3.8 Statistical analysis
You can do preliminary and rudimentary statistical analysis. See: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics.
For more complex work, consider using the Scipy tools.
Examples:
In [65]: df.describe()
Out[65]:
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.255717 -0.067143  0.211290 -0.127586
std    1.102925  0.651381  0.663725  0.691202
min   -0.746677 -1.277578 -0.445694 -1.101834
25%   -0.415984 -0.110226 -0.142937 -0.473979
50%   -0.111748  0.004162 -0.060588 -0.210746
75%    0.545268  0.374949  0.470344  0.363150
max    2.257601  0.516208  1.357676  0.765088

In [66]: sp.mean(df.A)
Out[66]: 0.2557174574376679

In [67]: sp.std(df.A, ddof=1)
Out[67]: 1.102925321931004
4 Visualization and graphing
4.1 Matplotlib
4.2 Bokeh
See: https://bokeh.pydata.org/en/latest/
Here are Bokeh examples taken from the documentation:
#!/usr/bin/env python

from bokeh.plotting import figure, output_file, show

def test01():
    # prepare some data
    x = [1, 2, 3, 4, 5]
    y = [6, 7, 2, 4, 5]
    # output to static HTML file
    output_file("lines.html")
    # create a new plot with a title and axis labels
    p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')
    # add a line renderer with legend and line thickness
    p.line(x, y, legend="Temp.", line_width=2)
    # show the results
    show(p)

def test02():
    # prepare some data
    x = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
    y0 = [i**2 for i in x]
    y1 = [10**i for i in x]
    y2 = [10**(i**2) for i in x]
    # output to static HTML file
    output_file("log_lines.html")
    # create a new plot
    p = figure(
        tools="pan,box_zoom,reset,save",
        y_axis_type="log", y_range=[0.001, 10**11],
        title="log axis example",
        x_axis_label='sections', y_axis_label='particles'
    )
    # add some renderers
    p.line(x, x, legend="y=x")
    p.circle(x, x, legend="y=x", fill_color="white", size=8)
    p.line(x, y0, legend="y=x^2", line_width=3)
    p.line(x, y1, legend="y=10^x", line_color="red")
    p.circle(
        x, y1, legend="y=10^x", fill_color="red", line_color="red", size=6)
    p.line(x, y2, legend="y=10^x^2", line_color="orange", line_dash="4 4")
    # show the results
    #show(p, browser="firefox")
    show(p)

def main():
    test01()
    test02()

if __name__ == '__main__':
    main()
There are more examples in the Bokeh "Quickstart" document: https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart
4.3 Altair
See: https://pypi.python.org/pypi/altair
Note that Altair is not in the Anaconda distribution, but is easy to install with pip.
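Here is a minimal sketch of what an Altair chart looks like (it assumes a Pandas DataFrame df with columns "A" and "B", like the samples earlier in this document); see the Altair documentation for current details:

import altair as alt

# Build a scatter-plot specification from a DataFrame and save it as HTML.
chart = alt.Chart(df).mark_point().encode(x='A', y='B')
chart.save('chart.html')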
5 Optimization, parallel processing, access to C/C++, etc.
5.1 Numba
See: http://numba.pydata.org/numba-doc/dev/index.html.
And, here is an interesting article related to Numba: https://www.anaconda.com/blog/developer-blog/parallel-python-with-numba-and-parallelaccelerator/.
From the Numba user manual:
Numba is a compiler for Python array and numerical functions that gives you the power to speed up your applications with high performance functions written directly in Python. Numba generates optimized machine code from pure Python code using the LLVM compiler infrastructure. With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time optimized to performance similar to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba's main features are:
- on-the-fly code generation (at import time or runtime, at the user's preference)
- native code generation for the CPU (default) and GPU hardware
- integration with the Python scientific software stack (thanks to Numpy)
Here is some sample test code, copied from the Numba documentation:
# file: numba_test01.py

import numba

@numba.jit
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result

def plain_sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result
And, here is an example that calls the two above functions, one optimized by Numba and the other not. Notice the timings. The Numba optimized version is more than two orders of magnitude faster:
In [30]: import numba_test01 as nt

In [31]: a = np.ones((1000, 1200))

In [32]: time nt.plain_sum2d(a)
CPU times: user 621 ms, sys: 0 ns, total: 621 ms
Wall time: 622 ms
Out[32]: 1200000.0

In [33]: time nt.sum2d(a)
CPU times: user 3.68 ms, sys: 0 ns, total: 3.68 ms
Wall time: 3.7 ms
Out[33]: 1200000.0
There is lots more that can be done with Numba in the way of optimizing code. See the docs.
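For example, one commonly used feature is "nopython" mode, which fails loudly at compile time if Numba cannot compile the whole function to machine code (rather than silently falling back to slower object mode). A minimal sketch, using numba.njit as a shorthand for numba.jit(nopython=True):

import numba
import numpy as np

@numba.njit
def sum_positive(arr):
    # Compiled entirely to machine code; raises at compile time if it cannot be.
    total = 0.0
    for x in arr:
        if x > 0.0:
            total += x
    return total

print(sum_positive(np.random.randn(1000)))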
5.2 Dask
The documentation on Dask can be found here: http://dask.pydata.org/en/latest/docs.html.
This summary of Dask is from the Dask documentation:
Dask is a flexible parallel computing library for analytic computing. Dask is composed of two components:
1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
2. "Big Data" collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
If you are beginning to learn Dask, you might want some sample data:
The dask tutorial contains a script for generating sample data files. You can find the tutorial repository here: https://github.com/dask/dask-tutorial.
And, here is a script that will generate a few HDF5 files. I copied it from the Dask Web site (http://dask.pydata.org/en/latest/examples/dataframe-hdf5.html), and made a few minor modifications:
#!/usr/bin/env python

"""
synopsis:
    generate sample dask data files.
usage:
    python generate_dask_data.py <file_prefix>
options:
    -h, --help      Display this help.
"""

import sys
import string
import random
import pandas as pd
import numpy as np

def generate(prefix):
    # dict to keep track of hdf5 filename and each key
    fileKeys = {}
    for i in range(10):
        # randomly pick letter as dataset key
        groupkey = random.choice(list(string.ascii_lowercase))
        # randomly pick a number as hdf5 filename
        filename = prefix + str(np.random.randint(100)) + '.h5'
        # Make a dataframe; 26 rows, 2 columns
        df = pd.DataFrame({'x': np.random.randint(1, 1000, 26),
                           'y': np.random.randint(1, 1000, 26)},
                          index=list(string.ascii_lowercase))
        # Write hdf5 to current directory
        df.to_hdf(filename, key='/' + groupkey, format='table')
        fileKeys[filename] = groupkey
    # prints hdf5 filenames and keys for each
    print(fileKeys)

def main():
    args = sys.argv[1:]
    if len(args) != 1:
        sys.exit(__doc__)
    if args[0] in ('-h', '--help'):
        sys.exit(__doc__)
    prefix = args[0]
    generate(prefix)

if __name__ == '__main__':
    main()
I used the above script to build sample data files as follows:
$ ./generate_dask_data.py "data02/sample_"
Then I read these HDF5 files into a Dask DataFrame by using the following:
In [38]: df = dd.read_hdf('./data02/sample_*.h5', key='/*')

In [39]: df
Out[39]:
Dask DataFrame Structure:
                    x      y
npartitions=10
                int64  int64
                  ...    ...
...               ...    ...
                  ...    ...
Dask Name: concat, 22 tasks
After which, I can do the following, for example:
In [40]: df.x.mean().compute()
Out[40]: 501.53076923076924
We can do something that shows how our data has been broken down into separate partitions, using this function:
def test(df):
    results = []
    for idx in range(df.npartitions):
        mean = df.get_partition(idx).x.mean().compute()
        print('partition: {} mean: {}'.format(idx, mean))
        results.append((idx, mean))
    return results
Which produces something like the following:
In [10]: test(df)
partition: 0 mean: 473.7692307692308
partition: 1 mean: 436.5769230769231
partition: 2 mean: 501.2692307692308
partition: 3 mean: 565.4230769230769
partition: 4 mean: 516.8846153846154
partition: 5 mean: 501.34615384615387
partition: 6 mean: 531.3076923076923
partition: 7 mean: 428.61538461538464
partition: 8 mean: 565.2307692307693
partition: 9 mean: 494.88461538461536
Out[10]:
[(0, 473.7692307692308),
 (1, 436.5769230769231),
 (2, 501.2692307692308),
 (3, 565.4230769230769),
 (4, 516.8846153846154),
 (5, 501.34615384615387),
 (6, 531.3076923076923),
 (7, 428.61538461538464),
 (8, 565.2307692307693),
 (9, 494.88461538461536)]
5.2.1 Dask for big data
Dask enables you to divide a large data structure or data set, for example, a Pandas DataFrame, into smaller structures, for example, smaller DataFrames, then load those smaller chunks from disk and process them.
Example:
First we'll create a data set, a Pandas DataFrame, that we can divide up into smaller chunks. Here is a Python script that we can use to create a sample CSV (comma separated values) file:
#!/usr/bin/env python
# file: write_csv.py

"""
synopsis:
    Write sample CSV file from Pandas DataFrame.
usage:
    python write_csv.py <outfilename> <num_rows>
example:
    python write_csv.py test_data.csv 200
"""

import sys
import numpy as np
import pandas as pd

def make_sample_dataframe(periods):
    """Make sample dates and DataFrame.  Returns (dates, df)."""
    dates = pd.date_range('20130101', periods=periods)
    df = pd.DataFrame(
        np.random.randn(periods, 4),
        index=dates,
        columns=list('ABCD'))
    return dates, df

def create_data(outfilename, count):
    dates, df = make_sample_dataframe(count)
    df.to_csv(outfilename)

def main():
    args = sys.argv[1:]
    if len(args) != 2:
        sys.exit(__doc__)
    outfilename = args[0]
    count = int(args[1])
    create_data(outfilename, count)

if __name__ == '__main__':
    main()
And, from within IPython, we can run it to create a CSV file as follows:
In [113]: %run write_csv.py tmp2.csv 200
Now, we can read that file to create a Dask DataFrame with the following:
In [115]: import dask.dataframe as dd

In [116]: daskdf = dd.read_csv('tmp2.csv')
We can look at our data with daskdf.head() and daskdf.tail():
In [117]: daskdf.head() Out[117]: Unnamed: 0 A B C D 0 2013-01-01 1.719008 0.168998 -0.582670 -0.199597 1 2013-01-02 0.947192 1.449137 -0.701263 0.342353 2 2013-01-03 1.321397 0.035692 0.147275 1.551782 3 2013-01-04 -0.286258 0.592772 1.770504 1.752572 4 2013-01-05 1.695924 0.159782 2.150698 -0.060106 In [118]: daskdf.tail() Out[118]: Unnamed: 0 A B C D 195 2013-07-15 0.303020 0.710051 -0.904407 -0.451793 196 2013-07-16 -0.703248 -0.973423 -0.830585 0.183094 197 2013-07-17 0.886046 1.530008 1.319875 -0.318807 198 2013-07-18 0.021749 2.570984 0.572013 1.249558 199 2013-07-19 -0.570810 -0.240768 2.203662 -0.014111
Also see the Pandas section for ways to view structures, for example: View Pandas data structures
Next, we'll divide it up -- this is an important capability of Dask: it enables us to process DataFrames/arrays that are either too large to fit comfortably in memory, or of which we need only sub-slices. In this case, we'll specify a block size (or partition size) when we read the CSV file and create a Dask DataFrame:
In [58]: %run write_csv.py tmp4.csv 500 In [59]: In [59]: df3 = dd.read_csv('tmp3.csv', blocksize=600) In [60]: In [60]: df3.head() Out[60]: Unnamed: 0 A B C D 0 2013-01-01 1.907704 0.317188 0.779075 0.327731 1 2013-01-02 -0.936242 -0.679869 -0.817254 -0.810020 2 2013-01-03 -1.465717 -0.775163 -0.621830 -0.171773 3 2013-01-04 0.878534 -0.910678 -0.363762 0.462970 4 2013-01-05 -0.182779 0.174225 -1.483841 -0.062528 In [61]: df3.tail() Out[61]: Unnamed: 0 A B C D 0 2013-07-15 0.426699 -2.126057 -0.784172 0.780982 1 2013-07-16 -0.727647 -1.552699 0.750276 -0.788475 2 2013-07-17 0.452168 -0.525214 0.003892 -0.029953 3 2013-07-18 -1.135117 0.626181 -0.895456 2.096875 4 2013-07-19 1.365505 -0.208806 0.115254 -1.210855 In [62]: In [62]: df3.A.mean().compute() Out[62]: 0.04365032375682896 In [63]:
And, now, we'll process that data chunk by chunk:
In [63]: for idx in range(df3.npartitions):
    ...:     data = df3.get_partition(idx)
    ...:     mean = data.A.mean().compute()
    ...:     print('partition: {} mean: {}'.format(idx, mean))
    ...:
partition: 0 mean: 0.1307434691610682
partition: 1 mean: -0.10723637021736673
partition: 2 mean: 0.47059788011488657
partition: 3 mean: -0.029706498960742605
partition: 4 mean: 0.06754303873144374
partition: 5 mean: 0.1604556981338858
partition: 6 mean: -0.4161510144675041
partition: 7 mean: 0.6799116374415602
partition: 8 mean: 0.6303390153859068
partition: 9 mean: 0.6517677726166038
partition: 10 mean: -0.02111769936010994
  o
  o
  o

In [64]:
Notes:
Keep in mind that Dask is capable of "parallelizing" the above operation. It can process multiple partitions in parallel on a multi-core/multi-CPU machine. See the next section for help with that.
5.2.2 Dask for optimized (and parallel) computing
Dask enables you to describe a complex process in terms of an execution graph: a digraph (directed graph) whose nodes are sub-processes. The valuable thing about being able to do so is that Dask can schedule the execution of that larger process so that some sub-processes are executed in parallel. On multi-CPU/multi-core hardware, this can be a big win.
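A small sketch of that idea, using dask.delayed to build such a graph explicitly (the functions inc and add are just placeholders):

from dask import delayed

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Nothing is computed yet; these calls only build a task graph.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)

# The scheduler may run the independent tasks (a and b) in parallel.
print(total.compute())       # => 5
# total.visualize()          # renders the graph (requires graphviz)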
Dask supports parallel processing both on a single machine and on multiple, distributed machines. In what follows, however, I will discuss parallel computation on a single machine.
To learn more about this, you will want to read the following:
Scheduling -- http://dask.pydata.org/en/latest/scheduling.html
Single Machine with Dask.distributed -- http://dask.pydata.org/en/latest/setup/single-distributed.html
Dask.distributed -- https://distributed.readthedocs.io/en/latest/index.html
Controlling parallelism in Dask requires understanding Dask schedulers, how they are used by Dask, and how to use them.
Note that Dask has default schedulers. If you do nothing to change or set the scheduler, you will be using the default, which is most often what you want. The notes that follow attempt to help you determine when and under what conditions you might want to use a different scheduler, and how to do that.
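For example, in recent versions of Dask (this is an assumption about your installed version; check the scheduling documentation linked above), you can select a scheduler per call or globally:

import dask
import dask.dataframe as dd

df = dd.read_csv('tmp5.csv')

# Per-call: use the threaded or the process-based scheduler for this computation.
df.A.mean().compute(scheduler='threads')
df.A.mean().compute(scheduler='processes')

# Globally: make a scheduler the default for subsequent .compute() calls.
dask.config.set(scheduler='processes')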
Also, keep in mind two concepts that are both related to optimization in Dask: (1) parallelism is what you want when you have multiple tasks and want to speed them up by running/computing them in parallel; (2) breaking your data and your Dask data collections into chunks is what you want when your data set is very large and will not fit in memory. Keep in mind that breaking your data into chunks may slow down processing. Here is a session that shows some of those differences:
In [57]: df1 = dd.read_csv('tmp5.csv', blocksize=1000000)

In [58]: df2 = dd.read_csv('tmp5.csv', blocksize=8000)

In [59]: df1.npartitions
Out[59]: 1

In [60]: df2.npartitions
Out[60]: 12

In [61]: df1.get_partition(0).size.compute()
Out[61]: 5000

In [62]: df2.get_partition(0).size.compute()
Out[62]: 450

In [63]: time df1.A.mean().compute()
CPU times: user 15.8 ms, sys: 7.5 ms, total: 23.3 ms
Wall time: 22.3 ms
Out[63]: 0.02893067882172706

In [64]: time df2.A.mean().compute()
CPU times: user 167 ms, sys: 9.85 ms, total: 177 ms
Wall time: 164 ms
Out[64]: 0.028930678821727045
Notes:
We create df1 with a single partition (or chunk) and df2 with multiple partitions (in this case 12).
The size of a single partition of df1 is much larger than the first partition of df2 (5000 vs 450).
Computing the mean of a single column of df1 takes significantly less time than the same operation on df2.
Synchronous processing on the local machine -- The default scheduler does that.
Let's figure out how to do that in parallel. For example, we'll try to compute the mean of each of the columns of our dataframe (four columns: "A", "B", "C", and "D") in parallel.
Here are two functions. One computes the mean for each column in our DataFrame, one column after another. The other attempts to use dask.distributed to schedule these four tasks so that they make use of more than one CPU core:
def compute_means_sequential(df):
    """
    Sequentially compute the means of columns of dataframe.
    Args:
        df (dask.dataframe.DataFrame) -- A dataframe containing
            columns A, B, C, and D.
    Return:
        The means
    """
    meanA = df.A.mean().compute()
    meanB = df.B.mean().compute()
    meanC = df.C.mean().compute()
    meanD = df.D.mean().compute()
    return meanA, meanB, meanC, meanD

def compute_means_parallel(client, df):
    """
    Compute in parallel the means of columns of dataframe.
    Args:
        client (dask.distributed.Client) -- The client to schedule
            the computation.
        df (dask.dataframe.DataFrame) -- A dataframe containing
            columns A, B, C, and D.
    Return:
        The means
    """
    meanA = client.submit(df.A.mean().compute)
    meanB = client.submit(df.B.mean().compute)
    meanC = client.submit(df.C.mean().compute)
    meanD = client.submit(df.D.mean().compute)
    client.gather((meanA, meanB, meanC, meanD))
    return meanA.result(), meanB.result(), meanC.result(), meanD.result()
You can find a file containing these snippets here: snippets.py.
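The parallel version above needs a dask.distributed Client; here is a minimal sketch of how such a client and the DataFrame used below might be created (the exact arguments depend on your machine and your Dask version):

import dask.dataframe as dd
from dask.distributed import Client

# With no arguments, Client() starts a local "cluster" of worker
# processes on the current machine.
client = Client()

df1 = dd.read_csv('tmp5.csv')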
Here is a test that uses the above on a 2-core machine:
In [17]: time snippets.compute_means_sequential(df1)
CPU times: user 167 ms, sys: 21.3 ms, total: 189 ms
Wall time: 379 ms
Out[17]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)

In [18]: time snippets.compute_means_parallel(client, df1)
CPU times: user 189 ms, sys: 16.9 ms, total: 206 ms
Wall time: 281 ms
Out[18]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
Here is a test that uses the above on a 4-core machine:
In [15]: time snippets.compute_means_sequential(df1)
CPU times: user 160 ms, sys: 9.5 ms, total: 169 ms
Wall time: 303 ms
Out[15]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)

In [16]: time snippets.compute_means_parallel(client, df1)
CPU times: user 164 ms, sys: 5.03 ms, total: 169 ms
Wall time: 224 ms
Out[16]:
(0.02893067882172706,
 -0.05704419047235241,
 -0.03281851829891229,
 -0.029845199428518945)
Notes:
Parallel execution on a 4-core machine takes measurably less time. On a large data structure, this might be significant and noticeable.
My original test had four calls to print() in each of the above two functions. That partially masked the time difference between calls to these functions.
As with any work on optimization, you will need to test with your data, your machine, your configuration, etc. YMMV (your mileage may vary).
5.3 Cython
See: http://cython.org/.
Cython enables us to write or produce C code while writing code in the style of Python. There's more to it than that, but you get the idea. We can write code that looks a lot like Python code, and then use Cython to turn it into C code.
Cython has another important use -- Because (1) Cython gives us easy access to libraries of compiled C code and (2) it is easy to write functions in Cython that can be called from Python, we can use it to easily "wrap" C functions for use in Python. In fact, if you look inside some Python packages, for example Lxml, you will see wrappers for underlying C code that were produced with Cython; Lxml makes calls into the libxml XML libraries provided by http://www.xmlsoft.org.
Here is a bit more description from http://cython.org/:
"Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language (based on Pyrex). It makes writing C extensions for Python as easy as Python itself.
- "Cython gives you the combined power of Python and C to let you
write Python code that calls back and forth from and to C or C++ code natively at any point.
easily tune readable Python code into plain C performance by adding static type declarations.
use combined source code level debugging to find bugs in your Python, Cython and C code.
interact efficiently with large data sets, e.g. using multi-dimensional NumPy arrays.
quickly build your applications within the large, mature and widely used CPython ecosystem.
integrate natively with existing code and data from legacy, low-level or high-performance libraries and applications."
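As a small illustration of adding static type declarations, here is a minimal, hypothetical sketch; fib.pyx and its setup.py are not from the original document, just an example of the general pattern:

# file: fib.pyx -- Cython source: Python syntax plus static type declarations.
cpdef long cfib(int n):
    cdef long a = 0, b = 1
    cdef int i
    for i in range(n):
        a, b = b, a + b
    return a

# file: setup.py -- build the extension with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fib.pyx"))

After building, the compiled module can be used from ordinary Python, for example: import fib; fib.cfib(30).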
6 Machine learning
6.1 Scikit-Learn
The scikit-learn documentation page is here: http://scikit-learn.org/stable/user_guide.html.
EliteDataScience has an introduction to machine learning here: https://elitedatascience.com/learn-machine-learning
EliteDataScience has provided a Scikit-Learn tutorial here: https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn.
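To give a feel for the library's style, here is a minimal sketch (random fake data, not taken from the tutorials above) of scikit-learn's fit/predict pattern:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Fake data: y is a noisy linear function of three features.
X = np.random.randn(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = LinearRegression()
model.fit(X_train, y_train)         # estimators are trained with .fit()
print(model.score(X_test, y_test))  # and evaluated/used with .score()/.predict()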
6.2 tensorflow
Question: Is there support for tensorflow in Anaconda? Answer: Yes, but currently, installing it is tricky. For example, see this: https://gist.github.com/johndpope/187b0dd996d16152ace2f842d43e3990
7 Multiprocessing and parallelization
7.1 ipyparallel
7.2 Dask and Dask schedulers
Also see the section on Dask elsewhere in the current document: Dask for optimized (and parallel) computing.
8 Data store -- HDF5, h5py, Pytables, asdf, etc
8.1 HDF5
8.1.1 h5py
You can store Pandas DataFrames and Dask DataFrames in HDF5 archives; h5py is one Python library for working with HDF5. You can read about h5py here:
Also see: https://dask.pydata.org/en/doc-test-build/array-overview.html#construct
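For a sense of the h5py API itself (independent of Pandas and Dask), here is a minimal sketch that writes and reads a NumPy array:

import numpy as np
import h5py

data = np.random.randn(1000, 4)

# Write a dataset into an HDF5 file, under a group/key of our choosing.
with h5py.File('sample01.h5', 'w') as outfile:
    outfile.create_dataset('/Version1/data', data=data)

# Read it back as a NumPy array.
with h5py.File('sample01.h5', 'r') as infile:
    arr = infile['/Version1/data'][:]

print(arr.shape)        # => (1000, 4)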
Here is an example that saves and retrieves a Dask DataFrame:
In [62]: df1, df2 = snippets.read_csv_files('tmp5.csv') In [63]: df1.to_hdf('tmp01.hdf5', '/Version1/tmp5') Out[63]: ['tmp01.hdf5'] In [64]: In [64]: df1a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5') In [65]: In [65]: df1.A.mean().compute() Out[65]: 0.02893067882172706 In [66]: df1a.A.mean().compute() Out[66]: 0.02893067882172706 In [68]: df2.to_hdf('tmp01.hdf5', '/Version1/tmp5_2') Out[68]: ['tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5', 'tmp01.hdf5'] In [69]: In [69]: df2a = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2') In [70]: In [70]: df2.npartitions Out[70]: 12 In [71]: df2a.npartitions Out[71]: 1 In [72]: df2.B.su df2.B.sub df2.B.sum In [72]: df2.B.sum().compute() Out[72]: -57.04419047235241 In [73]: df2a.B.sum().compute() Out[73]: -57.04419047235241
Notes:
We save a Dask DataFrame (df1) to HDF5, then read it back into a separate variable (df1a).
We compute the mean of column A of both DataFrames so as to show that the one we wrote to HDF5 and the one we read back in from HDF5 contain the same data.
Notice that in the case of df2 and df2a, the read_hdf function did not preserve the chunk size and number of partitions. However, read_hdf has an optional parameter (chunksize) that enables you to read a DataFrame from HDF5 with multiple partitions and a smaller chunk size. Example:
In [80]: df2b = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2')

In [81]: df2b.npartitions
Out[81]: 1

In [82]: df2c = dd.read_hdf('tmp01.hdf5', '/Version1/tmp5_2', chunksize=100)

In [83]: df2c.npartitions
Out[83]: 10
8.1.2 h5serv
There is also an HTTP server for HDF5 archives. It presents a REST-ful interface that enables you to add, list, and retrieve data objects from HDF5 archives on a remote machine. The data returned in response to a retrieval request is formatted as JSON.
You can learn more about h5serv here: http://h5serv.readthedocs.io/en/latest/.
And, you can learn about the JSON representation of HDF5 here: http://hdf5-json.readthedocs.io/en/latest/index.html.
8.1.3 Pytables
8.2 asdf
The documentation is here: https://asdf.readthedocs.io/en/latest/.
And, a bit more documentation: https://www.sciencedirect.com/science/article/pii/S2213133715000645
8.3 CSV -- comma separated values
A CSV module is in the Python standard library. See: https://docs.python.org/3/library/csv.html
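Here is a minimal sketch that reads the kind of CSV file produced by the write_csv.py script above (the file name tmp2.csv is the one used in the Dask examples):

import csv

with open('tmp2.csv', 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        # Each row is a dict keyed by the CSV header fields.
        print(row['A'], row['B'])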