An Introduction to Machine Learning
Machine Learning has become one of the most discussed and important fields in technology. It has proven indispensable to the growth and efficiency of other fields of study and application, both within and outside the tech world. Its use in our day-to-day lives is evident, and its relevance increases daily, which has made machine learning a field of interest to organizations and individuals from many walks of life. This piece hopes to simplify and explain the basic requirements needed to begin the journey into machine learning.
For individuals hoping to start the journey into machine learning, whether as a career or as a hobby, a few prerequisites are needed. You must know an appropriate programming language; the most popular language for machine learning is Python, so the basics of Python should be learnt first. A basic knowledge of statistics is an important prerequisite as well.
Machine Learning can be divided into three processes: Data Collection, Data Modelling, and Deployment.
Data Collection is the process of gathering, collating, and grouping information. This could be done via observations, interviews, questionnaires, and records.
After the data has been collated, the next step is Data Modelling: the process of creating a model for the collated data. A machine learning model is a file that has been trained to recognize certain types of patterns. A model is trained over a set of data using an algorithm that lets it reason over and learn from that data. This is a critical, if not the most important, process in Machine Learning. To create a model, there are six major considerations to be made. They are:
1. Problem Definition.
This is a concise description of the problem to be solved, and it is important to define it clearly. Depending on the type of problem, Machine Learning is generally divided into four kinds: supervised, unsupervised, semi-supervised, and reinforcement learning.
The nature of the problem has to be defined to know which algorithm/model is to be used and how to evaluate the data given.
2. Data.
Since Machine Learning requires algorithms to find patterns in data, data is the basis for any Machine Learning project. Data comes in many shapes and sizes, but there are two main kinds: Structured and Unstructured Data. Structured Data is data organized into rows and columns, typically in formats such as .csv or Excel files, while Unstructured Data is unorganized data such as images or audio. Data can be viewed and explored with notebooks such as the Jupyter Notebook.
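As a rough sketch of what viewing structured data looks like in practice (assuming pandas is available; the car-sales table here is hypothetical, standing in for a real .csv file):

```python
import pandas as pd

# A small, hypothetical car-sales table standing in for a real .csv file.
# With a real file you would instead use: df = pd.read_csv("car_sales.csv")
df = pd.DataFrame({
    "type": ["sedan", "suv", "truck"],
    "odometer": [45000, 82000, 130000],
    "color": ["red", "blue", "white"],
    "price": [15000, 22000, 18000],
})

print(df.head())    # first rows: a quick look at the structured data
print(df.dtypes)    # column types: numerical vs categorical
```

In a notebook such as Jupyter, `df.head()` renders the first rows as a table, which is usually the first step in getting a feel for a new dataset.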
3. Evaluation.
Before modeling, there is a need to set a target accuracy to define the expectation of your model. A feasible accuracy should be set for the problem given, because a model cannot be 100% accurate, but it can be trained to give its best possible accuracy. The process of determining the accuracy of a model is called evaluation. For example, a 95% accurate model may work well in some areas, but when predicting heart disease, you might want a more accurate model. Evaluation metrics can be put in place to measure how well a Machine Learning algorithm predicts outcomes. As progress is made on the project, the evaluation metrics might change due to certain circumstances.
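As a minimal sketch of evaluation, assuming scikit-learn is available and using made-up labels for a hypothetical heart-disease model:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels vs. a model's predictions (1 = heart disease).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

# Accuracy: the fraction of predictions that match the true labels.
acc = accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.0%}")  # 8 of 10 correct, i.e. 80%
```

For a problem like heart disease, accuracy alone can be misleading, which is one reason the metrics chosen may change as the project progresses.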
4. Features.
Features are the different forms of data within a dataset; they describe what is known about the data given, and insights are drawn from them during Data Analysis. For example, in a car sales .csv file, the column names (e.g. type, odometer, color) are all features of the car sales data. They are also referred to as feature variables, and they are used to predict the target variable. A feature variable can be Numerical or Categorical. The process of deriving features out of given data is called Feature Engineering.
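A small sketch of feature variables and one simple form of feature engineering, assuming pandas and a hypothetical car-sales table:

```python
import pandas as pd

# Hypothetical car-sales data: "odometer" is a numerical feature,
# "color" is a categorical feature, and "price" is the target variable.
df = pd.DataFrame({
    "odometer": [45000, 82000, 130000],
    "color": ["red", "blue", "red"],
    "price": [15000, 22000, 18000],
})

# One simple form of feature engineering: turn the categorical "color"
# column into numerical indicator columns so an algorithm can use it.
features = pd.get_dummies(df[["odometer", "color"]])
print(features.columns.tolist())  # ['odometer', 'color_blue', 'color_red']
```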
5. Modelling.
Based on the problem statement and data, a model is picked. This step is further divided into three parts:
- Choosing and Training.
When it is time to model, data is often split into three parts: Training, Validation, and Testing. The ability of a Machine Learning model to perform well on data it hasn’t seen before is called Generalization. There are several kinds of algorithms to choose from when modeling, and some work better than others depending on the type of data. When choosing a model, factors like the size and type of data come into play. For Structured Data, algorithms like XGBoost and Random Forest are commonly used, while for Unstructured Data, Deep Learning and Transfer Learning can be used. Training may take a while depending on how complex the model is and the algorithm used.
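The three-way split and a Random Forest on structured data could be sketched like this, assuming scikit-learn and NumPy are available and using randomly generated stand-in data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical structured dataset: 100 rows, 4 feature columns, binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split: 70% training, then halve the remainder into validation and testing.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Random Forest is a common choice for structured data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"validation accuracy: {model.score(X_val, y_val):.2f}")
```

The model only ever sees the training split during fitting; the validation and test splits are held back so its ability to generalize can be measured honestly.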
- Tuning.
Tuning takes place on the validation data split, and a model can be tuned for different kinds of data. Hyperparameters are the settings adjusted to tune an algorithm to better suit the problem.
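A sketch of hyperparameter tuning, assuming scikit-learn; note that GridSearchCV uses cross-validation internally rather than a single fixed validation split, but the idea is the same: try settings, keep the best:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical dataset standing in for real training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = (X[:, 0] > 0).astype(int)

# Hyperparameters are chosen before training; a grid search tries each
# combination and keeps the best-scoring one.
grid = {"n_estimators": [10, 50], "max_depth": [2, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```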
- Comparison.
A good model yields similar results on the validation and test sets during comparison. If the model does not generalize well during comparison, this could be caused by data leakage or a data mismatch, and corrections can be made to fix such problems.
6. Experimentation.
This is an iterative run through steps 2 - 5. Here, you search for what else can be done to improve the model, and for other models that can be tried to improve its accuracy and make it better.
After the model has been built, the next step is deployment. Deploying a Machine Learning model simply means integrating it into a production environment so that it can take input and return output to be used in making decisions. A model could be deployed as an API with Python frameworks such as Flask or Django, or to the frontend with TensorFlow.js. There are many other ways to do this, depending on where the model is needed.
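A minimal sketch of deploying a model as an API with Flask; the `predict` function here is a hypothetical stand-in for a real trained model, which in practice would be loaded from a saved file (e.g. with joblib or pickle):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a real trained model: predicts 1 when the sum of the
# input features is positive. A real deployment would load a saved model.
def predict(features):
    return 1 if sum(features) > 0 else 0

@app.route("/predict", methods=["POST"])
def predict_route():
    # Expects JSON like {"features": [0.5, -0.1]}
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

# app.run() would start the server; here we exercise the endpoint
# with Flask's built-in test client instead.
client = app.test_client()
response = client.post("/predict", json={"features": [0.5, -0.1]})
print(response.get_json())  # {'prediction': 1}
```

Once such an endpoint is live, any application can POST input to it and use the returned prediction to make decisions.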
Machine Learning isn’t the solution to all problems. For simple problems that can be easily fixed with a few lines of code, Machine Learning isn’t needed, as it would only add unnecessary complexity to the system.
As time goes on, I'll write more articles on the steps above. I hope you enjoyed this!