A collection of important Python modules for data scientists

This is part of the Python Knowledge and Resources List.

  1. Pandas

    Pandas is a library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas is free software released under the three-clause BSD license.

    Website: http://pandas.pydata.org/


    Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.

    The easiest way to install pandas is to install it as part of the Anaconda distribution. 

    pandas can also be installed via pip from PyPI:

    pip install pandas

    This will likely pull in a number of dependencies, including NumPy. If no prebuilt package is available for your platform, pip needs a compiler to build parts of the code, and the installation can take a few minutes to complete.
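
    As a minimal sketch of the "numerical tables and time series" operations mentioned above, the snippet below builds a small table and summarizes it two ways. The sales figures, column names and dates are made up for illustration only.

```python
import pandas as pd

# Hypothetical daily sales figures, invented for this example.
dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.DataFrame(
    {"region": ["north", "south"] * 3, "units": [10, 7, 12, 9, 11, 8]},
    index=dates,
)

# Table-style manipulation: total units sold per region.
units_by_region = sales.groupby("region")["units"].sum()

# Time-series manipulation: total units in each 2-day window.
units_per_2_days = sales["units"].resample("2D").sum()

print(units_by_region)
print(units_per_2_days)
```

    The same DataFrame serves both views here because its index holds the dates while the region is an ordinary column.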

  2. Statsmodels

    Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.

    Website: http://statsmodels.sourceforge.net/


    You can obtain source distributions and Windows binaries from PyPI. Alternatively, you can install statsmodels with pip (easy_install, the older setuptools command, is deprecated):

    pip install statsmodels

    or upgrade with:

    pip install -U statsmodels

    Statsmodels can be installed from source the usual way with the command:

    python setup.py install

  3. scikit-learn

    scikit-learn is an open source machine learning library for Python. It features various classification, regression and clustering algorithms including support vector machines, logistic regression, naive Bayes, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

    Website: http://scikit-learn.org/stable/


    At this time scikit-learn does not provide official binary packages for Linux so you have to build from source.

    Installing from source requires you to have installed the scikit-learn runtime dependencies, Python development headers and a working C/C++ compiler. Under Debian-based operating systems, which include Ubuntu, if you have Python 2 you can install all these requirements by issuing:

    sudo apt-get install build-essential python-dev python-setuptools \
                         python-numpy python-scipy \
                         libatlas-dev libatlas3gf-base
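
    Once the library is installed, two of the algorithms named above (logistic regression for classification, k-means for clustering) might be used roughly as follows. This is a sketch on the bundled iris dataset; the split size, random seeds and iteration limit are arbitrary choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small example dataset shipped with scikit-learn, returned as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Classification: hold out a quarter of the samples to measure accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Clustering: group the same samples into 3 clusters without using labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print("test accuracy:", accuracy)
print("first ten cluster labels:", cluster_labels[:10])
```

    Both estimators follow the same fit/predict pattern, which is what lets them be swapped for other scikit-learn algorithms with little change.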

  4. Mlpy

    mlpy is a Python machine learning library built on top of NumPy/SciPy and the GNU Scientific Library. It provides a wide range of machine learning methods for supervised and unsupervised problems. mlpy is multiplatform and works with Python 2 and 3.

    Website: http://mlpy.sourceforge.net/


    Download the latest version for your OS from http://sourceforge.net/projects/mlpy/files/

    You need GCC, Python, NumPy, SciPy and GSL preinstalled.

    Then, from the terminal, run:

    python setup.py install

  5. NumPy

    NumPy is an open source extension module for Python that provides fast precompiled functions for numerical routines.

    It adds support to Python for large, multi-dimensional arrays and matrices, and supplies a large library of high-level mathematical functions to operate on these arrays.

    Website: http://www.numpy.org/


    Most of the major Linux distributions provide packages for NumPy, but these can lag behind the most recent NumPy release. Pre-built binary packages for Ubuntu are available on the SciPy PPA. Red Hat binaries are available in Enthought Canopy.
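
    A brief sketch of the multi-dimensional arrays and high-level mathematical functions described above. The array contents and variable names are arbitrary examples.

```python
import numpy as np

# A 3x4 array holding the integers 0..11.
grid = np.arange(12).reshape(3, 4)

# Reductions along an axis: the mean of each column.
column_means = grid.mean(axis=0)

# Element-wise arithmetic applies to the whole array at once, without a loop.
scaled = grid * 2 + 1

# Matrix product of the array with its transpose gives a 3x3 result.
product = grid @ grid.T

# Mathematical functions operate element-wise on arrays.
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))

print(column_means)
print(product)
print(roots)
```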

  6. SciPy

    SciPy is widely used in scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.

    Website: http://www.scipy.org/


    Users on Linux can quickly install the necessary packages from repositories.

    For example, Ubuntu users can install the dependencies by running:

    sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
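
    Two of the modules listed above, integration and optimization, can be tried with a short sketch like the following. The integrand and the function being minimized (shifted_parabola, a hypothetical name) are simple examples chosen so the answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)


def shifted_parabola(x):
    """Example objective with its minimum value 1 at x = 3."""
    return (x - 3.0) ** 2 + 1.0


# Optimization: find the minimizer of a one-dimensional function.
result = optimize.minimize_scalar(shifted_parabola)

print("integral:", area, "estimated error:", abs_error)
print("minimum at x =", result.x, "with value", result.fun)
```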

  7. matplotlib

    matplotlib is a plotting library for Python and its numerical extension NumPy.

    Website: http://matplotlib.org/


    On Debian-based systems such as Ubuntu, it can be installed with:

    sudo apt-get install python-matplotlib
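
    A minimal plotting sketch, assuming a machine without a display (hence the non-interactive Agg backend). The output file name sine.png is a placeholder chosen for this example.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render to files only, no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Sample one period of a sine wave from a NumPy array.
x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.set_title("A sine wave")
ax.legend()

# Placeholder output path; change it to suit your project.
output_path = Path("sine.png")
fig.savefig(output_path)
plt.close(fig)
```

    In an interactive session you would typically drop the matplotlib.use("Agg") line and call plt.show() instead of saving to a file.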

  8. NLTK

    The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for statistical natural language processing (NLP) in Python. NLTK includes graphical demonstrations and sample data. NLTK has been used successfully as a platform for prototyping and building research systems.

    Website: http://www.nltk.org/


    On Ubuntu:

    sudo pip install -U numpy

    sudo pip install -U nltk
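
    After installation, a first experiment might look like the sketch below. It deliberately sticks to parts of NLTK that need no extra corpus downloads (a regular-expression tokenizer, a frequency distribution and the Porter stemmer). The sample sentence is adapted from the description above.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = (
    "NLTK includes graphical demonstrations and sample data. "
    "NLTK has been used for prototyping and building research systems."
)

# Split into alphabetic and punctuation tokens, then keep only the words.
tokens = wordpunct_tokenize(text.lower())
words = [token for token in tokens if token.isalpha()]

# Count how often each word occurs.
freq = FreqDist(words)

# Reduce each word to a crude stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

print(freq.most_common(3))
print(stems)
```

    Tokenizers and taggers that rely on trained models, such as word_tokenize, require downloading the matching data with nltk.download first.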

  9. Theano

    Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.

    Website: http://deeplearning.net/software/theano/

  10. nolearn

    This package contains a number of utility modules that are helpful with machine learning tasks. Most of the modules work together with scikit-learn, others are more generally useful.


  11. PyBrain

    PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for machine learning tasks and a variety of predefined environments to test and compare your algorithms.

  12. Orange

    Orange is a component-based data mining and machine learning software suite, featuring a visual programming front-end for explorative data analysis and visualization, and Python bindings and libraries for scripting. It includes a set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. It is implemented in C++ and Python. Its graphical user interface builds upon the cross-platform Qt framework.

    Unlike its competitors scikit-learn and mlpy, Orange does not tie into NumPy and its ecosystem of tools; it focuses on traditional, symbolic algorithms more than numeric ones.

  13. Keras

    Keras is a minimalist, highly modular neural network library in the spirit of Torch, written in Python, that uses Theano under the hood for fast tensor manipulation on GPU and CPU. It was developed with a focus on enabling fast experimentation.

  14. Hebel

    Hebel is a library for deep learning with neural networks in Python using GPU acceleration with CUDA through PyCUDA. It implements the most important types of neural network models and offers a variety of different activation functions and training methods such as momentum, Nesterov momentum, dropout, and early stopping.