Building Your First Machine Learning Model: Using Colab, Pandas, and Sklearn

Using data on tourist preferences, we can train a machine learning model that predicts which city a tourist is likely to choose for a trip. Recommendation systems built by professional data scientists rest on the same idea: predictions based on data help offer each user the most suitable options.

To achieve this goal, we will go through three steps:

  • We'll create a training dataset in the form of Pandas dataframes (read our article about this library).
  • We'll train a model from the Sklearn library on the resulting dataset.
  • We'll write Python code for further predictions (read our article about the Python minimum for data science).

The table contains data on a thousand tourists, including their age, income, and preferences. The main column, labeled "target", holds the city each tourist chose for their trip. Our model will be trained to predict this value for new tourists.

Apart from the target, the table includes only numerical values, which makes it convenient for a model to work with. For example, if the city_Yekaterinburg column contains a value of 1, and the other columns with names beginning with city_ contain zeros, this indicates that the tourist is a resident of Yekaterinburg.
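Columns such as city_Yekaterinburg are typical of one-hot encoding. The article does not say how the table was prepared, so the sketch below is only an assumption: it shows how the Pandas function get_dummies turns a text column into 0/1 columns of this kind. The three-row table and the column name "city" are made up for illustration.

```python
import pandas as pd

# Made-up raw data; the column name "city" is an assumption for illustration.
raw = pd.DataFrame({
    "age": [25, 31, 47],
    "city": ["Yekaterinburg", "Krasnodar", "Yekaterinburg"],
})

# Replace "city" with one 0/1 column per distinct value, named city_<value>.
encoded = pd.get_dummies(raw, columns=["city"], dtype=int)
print(encoded)
```

Each row ends up with a 1 in exactly one city_ column, which is the pattern described above.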

Download the table from the provided link and upload it to Google Colab—a convenient online service for writing code and analyzing data without installing software on your computer. We recommend first reading the article, which describes the basic principles of working with Google Colab.

Reading the data

To work with data, you must first read it from a file and convert it to a format convenient for further processing. In Google Colab, add a new code cell by clicking the "+ Code" button at the top of the interface. In this cell, you will be able to write the corresponding code to perform this task.

We started by importing the popular Pandas library, which is widely used for working with tabular data in the field of data analysis. For convenience and shortening the code, following the traditions of data science, we will use the abbreviation pd instead of the full name of the library. This makes the code easier to write and more readable.

In the second line, we declared a variable named df. Then, using the Pandas library, abbreviated as pd, we loaded the data from the trips_data_for_ML.xlsx file using the .read_excel() function. The index_col parameter was set to 0, which means that the index column in the df table will be column number 0 from the loaded Excel table. This allows for efficient data management and convenient access to table rows based on their indices.

In the third line, we called the .head() method on our df variable. It returns the first rows of the DataFrame, 5 by default, which is a quick way to check that the data was read correctly and has the expected structure.
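The three steps described above (the import, the .read_excel() call, and .head()) might look like the sketch below. To keep it runnable when trips_data_for_ML.xlsx has not been uploaded, it falls back to a tiny made-up table. That fallback, its four rows, and the "City A"/"City B" labels are assumptions for illustration only; the real file has a thousand rows and more columns.

```python
import os

import pandas as pd

FILE_NAME = "trips_data_for_ML.xlsx"

if os.path.exists(FILE_NAME):
    # Read the Excel file; column number 0 of the sheet becomes the index.
    df = pd.read_excel(FILE_NAME, index_col=0)
else:
    # Made-up stand-in so the cell still runs without the file.
    df = pd.DataFrame({
        "age": [25, 31, 47, 52],
        "city_Yekaterinburg": [1, 0, 1, 0],
        "city_Krasnodar": [0, 1, 0, 1],
        "target": ["City A", "City B", "City A", "City B"],
    })

# Show the first rows (5 by default) to check the structure.
print(df.head())
```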

Creating a dataset

Now we need to transform the dataframe with our data into a dataset that will be used to train the machine learning model. This transformation will allow us to efficiently prepare the data and ensure that it meets the requirements of the training algorithms.

Let's split the dataframe into two parts, denoting them as X and y. X will contain all data about tourists, except for the target column, which contains information about the cities they selected. y will contain only the target column with these cities. This approach allows us to clearly identify the features for analysis and the target variable for further forecasting.

The result resembles a problem book: the larger part, X, holds the problem statements (the data on tourists), and the smaller part, y, holds the correct answers (the cities they chose). The model will be trained on this "problem book" so that it can later give answers for tourists it has not seen.

Add a code cell and enter the code that performs the split.

In the first line of code, we created the variable X and placed part of our DataFrame df in it. The .drop() method removes the target column. The axis parameter is set to 1, which tells Pandas that "target" is a column label rather than a row label, so the whole column is dropped. X therefore keeps only the features the model will learn from.

In the second line of code, we created a variable y, into which we placed the data from the target column of our DataFrame df. This allows us to isolate the target variable needed for further analysis and model building.

Our source data is now split into two parts, the DataFrame X and the Series y, and is ready for model training.
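A minimal sketch of the split described above. The small df here is a made-up stand-in for the real tourist table, so its values are assumptions; the last two assignments are the two lines the text walks through.

```python
import pandas as pd

# Made-up stand-in for the tourist table (the real one has a thousand rows).
df = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "city_Yekaterinburg": [1, 0, 1, 0],
    "city_Krasnodar": [0, 1, 0, 1],
    "target": ["City A", "City B", "City A", "City B"],
})

# Features: every column except "target" (axis=1 means "drop a column").
X = df.drop("target", axis=1)
# Answers: only the "target" column.
y = df["target"]

print(X.shape, y.shape)
```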

Creating a Model

Sklearn, also known as scikit-learn, is one of the most popular machine learning libraries in Python; among data scientists it is arguably second only to Pandas. The library provides a wide range of tools for classification, regression, and clustering, and also includes functions for data processing and model evaluation. Thanks to its ease of use and extensive documentation, Sklearn is a good choice for both beginners and experienced machine learning professionals.

This code imports the RandomForestClassifier class from the ensemble module of the sklearn library. The underlying algorithm, known as a "random forest", combines the votes of many decision trees. This class applies it to classification; Sklearn offers a separate RandomForestRegressor for regression.

We created a variable named model, which contains the Random Forest Classifier model with default parameters. This variable represents our machine learning model, ready to be trained and used to classify data.
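The import and the model creation described above can be sketched in two lines. No parameters are passed, so the classifier uses Sklearn's defaults.

```python
from sklearn.ensemble import RandomForestClassifier

# A random forest classifier with default parameters, not yet trained.
model = RandomForestClassifier()
print(model)
```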

Training the Model

Training the model takes a single line. We hand the model the dataset X together with the column of answers y, as if to say: "If X, then y. Got it?" During training the model fits its internal parameters to this data; we won't go into their interpretation today. In Colab, the cell's output is simply the trained model object.
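The training step can be sketched as follows. The table is again a made-up stand-in for the real data, an assumption for illustration; the line that matters is the .fit() call.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up stand-in for the tourist table.
df = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "city_Yekaterinburg": [1, 0, 1, 0],
    "city_Krasnodar": [0, 1, 0, 1],
    "target": ["City A", "City B", "City A", "City B"],
})
X = df.drop("target", axis=1)
y = df["target"]

model = RandomForestClassifier()
# The single training line: learn the relationship between X and y.
model.fit(X, y)

# After training, the model knows which answers are possible.
print(model.classes_)
```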

We begin the prediction process.

A tourist is a dictionary

From the model's point of view, a tourist is a set of feature values, one for each column of X.

The example variable is a dictionary, a Python data structure made up of key-value pairs. In this case, the key "age" has the value [31], indicating that the tourist is 31 years old. The key "city_Krasnodar" has the value one, while the other keys, which are city keys, have the value zero. This allows us to conclude that our new tourist is from Krasnodar.
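A shortened sketch of such a dictionary is shown below. It uses only the keys named in this article; the real dictionary needs one key for every column of X (including income and preferences), and those column names are not given in the text, so they are left out here.

```python
# One new tourist: each value is a one-element list, i.e. one table row.
example = {
    "age": [31],                # the tourist is 31 years old
    "city_Yekaterinburg": [0],  # not from Yekaterinburg
    "city_Krasnodar": [1],      # from Krasnodar
}
print(example)
```

Wrapping each value in a list matters for the next step: Pandas can then treat every key as a column with a single row.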

Copy the code above into a new cell and execute it. This will declare the variable "example" and initialize it with a dictionary containing information about the new tourist.

Prediction

This line of code converts the dictionary example to a Pandas DataFrame and stores it in the variable example_df. The model was trained on a DataFrame, so a new tourist has to be passed to it in the same tabular form, with the same columns.
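The conversion can be sketched like this. The dictionary is a shortened, illustrative version with only the keys mentioned in the article; the real one has a key for every feature column.

```python
import pandas as pd

# Shortened, illustrative tourist dictionary.
example = {"age": [31], "city_Yekaterinburg": [0], "city_Krasnodar": [1]}

# Keys become column names; each one-element list becomes the single row.
example_df = pd.DataFrame(example)
print(example_df)
```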

Now for the prediction itself.

We applied the .predict() method to our model using data from the example_df variable. I wonder where a 31-year-old resident of Krasnodar, who enjoys shopping and cars, will go? Only those who worked on the model with us know the answer to this question.

By changing age, city, and other values in the example dictionary, you can get the model's predictions for different tourists. Combine all three commands into a single cell and rerun it whenever you change the data and need a new prediction. The model itself stays the same; only the input changes.
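Put together, a single prediction cell might look like the sketch below. So that it runs on its own, it first trains a model on a tiny made-up table; that table and the "City A"/"City B" labels are assumptions, and in the notebook the model trained on the real data would already exist. The last three commands are the ones to rerun after each change.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up stand-in for the tourist table, used only to have a trained model.
df = pd.DataFrame({
    "age": [25, 31, 47, 52],
    "city_Yekaterinburg": [1, 0, 1, 0],
    "city_Krasnodar": [0, 1, 0, 1],
    "target": ["City A", "City B", "City A", "City B"],
})
X = df.drop("target", axis=1)
y = df["target"]
model = RandomForestClassifier()
model.fit(X, y)

# The three commands: describe a tourist, convert to a DataFrame, predict.
# The keys must match the columns of X, in the same order.
example = {"age": [31], "city_Yekaterinburg": [0], "city_Krasnodar": [1]}
example_df = pd.DataFrame(example)
prediction = model.predict(example_df)
print(prediction)
```

The result is an array with one predicted label per row of example_df, so a single tourist yields a single city.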