In this article, Sunila Gollapudi, author of Practical Machine Learning, introduces the key aspects of machine learning semantics and various toolkit options in Python.
Machine learning has been around for many years now and all of us, at some point in time, have been consumers of machine learning technology. One of the most common examples is facial recognition software, which can identify if a digital photograph includes a particular person. Today, Facebook users can see automatic suggestions to tag their friends in their uploaded photos. Some cameras and software such as iPhoto also have this capability.
What is learning?
Let’s spend some time understanding what the “learning” in machine learning means. We are referring to learning from some kind of observation or data to automatically carry out further actions. An intelligent system cannot be built without using learning to get there. The following are some questions that you’ll need to answer to define your learning problem:
- What do you want to learn?
- What is the required data and where does it come from?
- Is the complete data available in one shot?
- What is the goal of learning or why should there be learning at all?
Before we plunge into understanding the internals of each learning type, let’s quickly understand a simple predictive analytics process for building and validating models that solve a problem with maximum accuracy:
- Check that the raw dataset has been validated and cleansed, and split it into training, testing, and evaluation datasets.
- Pick the model that best suits the problem and that has an error function to be minimized over the training set.
- Make sure this model works on the testing set.
- Iterate this process with other machine learning algorithms and/or attributes until performance on the test set is reasonable.
- The resulting model can now be applied to new inputs to predict the output.
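The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a definitive implementation: the toy data, the 80/20 split, the straight-line model, and the helper names (split_dataset, fit_line, mean_squared_error, predict) are all hypothetical, and the separate evaluation dataset is left out for brevity.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Fit y = slope * x + intercept by minimizing squared error over rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


def mean_squared_error(model, rows):
    """Average squared difference between predictions and observed values."""
    return sum((predict(model, x) - y) ** 2 for x, y in rows) / len(rows)


# Hypothetical toy data: y is roughly 2x + 1 with a little uniform noise.
noise = random.Random(42)
data = [(x, 2 * x + 1 + noise.uniform(-0.5, 0.5)) for x in range(50)]

train, test = split_dataset(data)            # break the data into subsets
model = fit_line(train)                      # minimize error on the training set
test_error = mean_squared_error(model, test) # check the model on the testing set
prediction = predict(model, 100)             # apply the model to a new input
```

If the test error were unacceptable, the iteration step would amount to swapping fit_line for a different model or adding attributes, and then repeating the same split, fit, and check cycle.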
The following diagram depicts how learning can be applied to predict behavior:
Key aspects of machine learning semantics
The following concept map shows the key aspects of machine learning semantics:
Python is one of the most widely adopted programming or scripting languages in the field of machine learning and data science. Python is known for its ease of learning, implementation, and maintenance. Python is highly portable and can run on Unix, Windows, and Mac platforms. With the availability of libraries such as Pydoop and SciPy, its relevance in the world of big data analytics has increased tremendously.
Some of the key reasons for the popularity of Python in solving machine learning problems are as follows:
- Python is well suited for data analysis
- It is a versatile scripting language that can be used to write quick-and-dirty scripts to test basic functions, or in real-time applications that leverage its full-featured toolkits
- Python comes with mature machine learning packages and can be used in a plug-and-play manner
Toolkit options in Python
Before we go deeper into what toolkit options we have in Python, let’s first look at the trade-offs to consider before choosing one:
- What are my performance priorities? Do I need offline or real-time processing implementations?
- How transparent are the toolkits? Can I customize the library myself?
- What is the community status? How quickly are bugs fixed, and how good are community support and access to experts?
There are three options in Python:
- Python external bindings. These are interfaces to popular packages such as MATLAB, R, Octave, and so on. This option works well if you already have implementations in these frameworks.
- Python-based toolkits. There are a number of toolkits written in Python which come with a bunch of algorithms.
- Write your own logic/toolkit.
Python has two core toolkits that are more like building blocks. Almost all the following specialized toolkits use these core ones:
- NumPy: Fast and efficient arrays for Python
- SciPy: A bunch of algorithms for standard operations built on NumPy
There are also C/C++ based implementations such as LIBLINEAR, LIBSVM, OpenCV, and others.
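As a small illustration of the two building blocks, the sketch below uses a NumPy array for element-wise arithmetic and then hands the same arrays to a SciPy routine. The data is a hypothetical toy example chosen so that the expected fit is known in advance, and scipy.stats.linregress is just one of many SciPy functions that could be used here.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data: y = 3x + 2 exactly, so the expected fit is known.
x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0

# NumPy: arithmetic applies to the whole array without an explicit Python loop.
centered = x - x.mean()

# SciPy: a ready-made least-squares line fit that operates on NumPy arrays.
fit = stats.linregress(x, y)
slope, intercept = fit.slope, fit.intercept
```

Here slope and intercept should recover the values 3 and 2 used to generate the data, up to floating-point error.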
Some of the most popular Python toolkits are as follows:
- nltk: The natural language toolkit. This focuses on natural language processing (NLP).
- mlpy: The machine learning algorithms toolkit that comes with support for some key machine learning algorithms such as classification, regression, and clustering, among others.
- PyML: This toolkit focuses on support vector machines (SVMs).
- PyBrain: This toolkit focuses on neural networks and related functions.
- mdp-toolkit: The focus of this toolkit is on data processing and it supports scheduling and parallelizing the processing.
- scikit-learn: This is one of the most popular toolkits and has been highly adopted by data scientists in the recent past. It has support for supervised and unsupervised learning and some special support for feature selection and visualizations. A large team is actively building this toolkit, which is known for its excellent documentation.
- Pydoop: Python integration with the Hadoop platform. Pydoop and SciPy are heavily deployed in big data analytics.
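To give a flavour of the supervised and unsupervised support in scikit-learn mentioned in the list above, here is a minimal sketch. The choice of the bundled Iris dataset, a k-nearest-neighbours classifier, k-means clustering, and all parameter values are assumptions made for illustration only; other estimators would follow the same fit and predict pattern.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The Iris dataset ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Supervised learning: fit on the training split, score on the held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)

# Unsupervised learning: group the same measurements without using the labels.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = clusterer.fit_predict(X)
```

The accuracy value is the fraction of held-out samples classified correctly, while cluster_labels assigns each of the 150 samples to one of three clusters found without reference to y.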