
Math for Juniors: What to Review Before a Data Science Interview

In the future, data scientists may not need deep knowledge of mathematics, and their work will simply consist of clicking the "Data Analysis" button. However, currently, to successfully start in the field of Data Science, it is necessary to have at least basic mathematical knowledge. Understanding mathematical concepts is the basis for data analysis and model building, which in turn leads to more accurate and effective conclusions.

The main reason is that the skill level of applicants for junior positions is often similar. Therefore, when choosing between two candidates, preference will most likely be given to the one familiar with the fundamental principles underlying data analysis, machine learning, and deep learning tools. This knowledge allows you to better navigate modern technologies and increases your chances of successful employment.

We have selected seven key topics that often come up in interviews for Junior Data Scientist positions. They will help you prepare for the interview and understand what is expected of a candidate for this role. The topics come from four areas:

  • probability theory,
  • linear algebra,
  • mathematical analysis,
  • mathematical statistics.

For convenience, each topic comes with a short cheat-sheet answer that presents the essentials in concise form.

3. Gradient is a direction

A gradient can be explained simply as a vector that shows the direction and rate of change of a quantity. The vector's direction is the direction in which the quantity increases fastest, and its length reflects how fast it changes in that direction. Gradients play an important role in mathematics and physics, particularly in function analysis and optimization, and they underpin applications ranging from machine learning to engineering.
Height gradient: the faster the height of a mountain increases, the longer the vector
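
To make the mountain picture concrete, here is a minimal sketch that estimates a gradient numerically. The `height` function, the sample point, and the step size `h` are assumptions chosen for illustration; they are not taken from the article.

```python
import math

def height(x, y):
    # Hypothetical smooth "mountain" with its peak at (0, 0).
    return math.exp(-(x**2 + y**2))

def numerical_gradient(f, x, y, h=1e-6):
    # Central differences approximate the two partial derivatives.
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy

# Standing at (1, 0), east of the peak.
gx, gy = numerical_gradient(height, 1.0, 0.0)

# The vector's length is the steepness of the slope at this point.
steepness = math.hypot(gx, gy)
print(gx, gy, steepness)
```

At the point (1, 0) the x-component is negative and the y-component is close to zero, so the vector points back toward the peak at the origin, which is the direction of fastest ascent.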

Gradients are more often studied not at a single point but across an entire region, which gives a more complete picture. Drawing the gradient vectors at points that cover the region of interest adds depth and volume to the image. When color is also used to emphasize the magnitude and direction of the vectors, the result can look like an abstract composition.

In deep neural networks, learning consists of repeatedly adjusting the weights of neurons, which tunes their influence on subsequent layers. Each neuron's weight, or authority, changes depending on the magnitude of the error it passes on. These errors form a surface in the multidimensional space of weights, and the main task is to find the minimum of this surface: the "hole" or "trough" where the error is lowest. Reaching it improves the model's accuracy and its ability to generalize to new data.

The gradient is used to find the minimum of a function. Although it points toward the steepest increase, nothing prevents us from moving in the opposite direction, along the antigradient. This is why such methods are called gradient descent. Gradient descent is an effective optimization approach for finding the minima of a wide range of functions.
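
Moving along the antigradient can be sketched in a few lines. The one-weight `loss` function, the learning rate, and the number of steps below are illustrative assumptions; a real network has many weights and a far more complex error surface.

```python
def loss(w):
    # Hypothetical one-weight "error surface" with its minimum at w = 3.
    return (w - 3.0) ** 2

def loss_gradient(w):
    # Analytic derivative of the loss above.
    return 2.0 * (w - 3.0)

def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    w = w0
    for _ in range(steps):
        # Step against the gradient, i.e. along the antigradient.
        w -= learning_rate * grad(w)
    return w

w_min = gradient_descent(loss_gradient, w0=0.0)
print(w_min, loss(w_min))
```

Each step shrinks the distance to the minimum by a constant factor here, so the weight settles very close to 3. With a learning rate that is too large, the same loop would overshoot and diverge instead.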

4. Bayes' Theorem: The effect indicates the cause

The task complexity is 1 out of 3 possible points, so it is suitable for beginners who are only starting to work with this topic.

This theorem explains how our expectations change when we receive new information or observe new evidence. It highlights the importance of updating our knowledge and adjusting our predictions based on new data, which is a key aspect in making informed decisions.

An entomologist has found a beetle with a distinctive pattern on its body and hypothesizes that it belongs to a rare subspecies. The hypothesis rests on statistics: 98% of beetles of the rare subspecies have the pattern, while among common beetles only 5% do. The rare subspecies is truly rare: it makes up only 0.1% of the entire population.

How likely is it that the patterned beetle actually belongs to the rare subspecies? The pattern alone does not settle the question: the answer depends both on how often the pattern occurs in each group and on how rare the subspecies is in the population.

Bayes' theorem and the formula for total probability are important concepts in probability theory, and they are best mastered through specific problems like this one. The main challenges you'll face will be calculating fractions and constructing tree diagrams. For a deeper understanding of the topic, including proofs of the formula and many examples, you can familiarize yourself with additional materials at the link.

In any unclear situation, draw a total probability diagram. Sorry, entomologist, but the beetle is most likely an ordinary one.
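
The beetle problem can be checked with a short calculation. The sketch below applies the formula for total probability and then Bayes' theorem to the three numbers given in the problem; the variable names are illustrative.

```python
# Numbers from the problem statement.
p_rare = 0.001                 # share of the rare subspecies in the population
p_pattern_given_rare = 0.98    # pattern frequency among rare beetles
p_pattern_given_common = 0.05  # pattern frequency among common beetles

p_common = 1 - p_rare

# Total probability of seeing the pattern on a randomly chosen beetle.
p_pattern = (p_pattern_given_rare * p_rare
             + p_pattern_given_common * p_common)

# Bayes' theorem: probability of "rare" given that the pattern was observed.
p_rare_given_pattern = p_pattern_given_rare * p_rare / p_pattern

print(f"{p_rare_given_pattern:.3f}")  # prints 0.019
```

The posterior is about 1.9%: finding the pattern raises the probability of the rare subspecies from 0.1% to roughly 2%, but common beetles with the pattern still vastly outnumber rare ones.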

Assessing potential borrowers through credit scoring, determining the accuracy of medical tests, identifying signs of bank fraud, and protecting corporate networks from break-ins are all applied data science tasks that can be solved effectively with a Bayesian approach. Bayesian methods make it possible not only to analyze and interpret data but also to make informed predictions, which makes them useful in fields from finance to medicine.

7. Statistical significance - confidence threshold

The task complexity is 2 out of 3 possible points.

A statistically significant result is one that is highly unlikely to have arisen by chance. How unlikely it has to be is decided at the study design stage, using established data analysis methods. Statistical significance lets researchers draw informed conclusions and make decisions based on the data, and interpreting it correctly helps avoid errors in both research and practical applications.

Statistical significance plays a key role in testing statistical hypotheses. Consider an example: we hypothesize that green groans predominantly grunt. This is our primary (null) hypothesis. As the alternative hypothesis, we assume that green groans, on the contrary, do not grunt but poke. For this test we set the significance level at 5%. This means that if the probability of obtaining data at least as extreme as what we observed, given that the null hypothesis is true, is less than 5%, we reject the null hypothesis in favor of the alternative.

Consider the results of observing green groans. If 95% or more of the observed green groans grunt, we can say that the primary hypothesis stands at a significance level of α (alpha) = 0.05. In this case, the number of alternative (non-grunting) green groans is not considered statistically significant.

The α level is the probability of rejecting the null hypothesis when it is in fact true. In other words, if green groans really do grunt, α is the probability of a Type I error: erroneously rejecting a true hypothesis, which amounts to a false alarm. If the null hypothesis is false and most green groans do not grunt, failing to reject it is a Type II error, or a miss. The probability of a Type II error is denoted β and is not given by α. Understanding these concepts is critical to interpreting statistical results correctly.
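
A decision rule of this kind can be sketched with an exact one-sided binomial test. The concrete setup is an assumption made for illustration and does not come from the article: the null hypothesis is read as "95% of green groans grunt" (`P0 = 0.95`), the sample holds 100 animals, and the counts of grunters are invented.

```python
from math import comb

ALPHA = 0.05  # significance level from the text
P0 = 0.95     # assumed share of grunters under the null hypothesis
N = 100       # assumed sample size

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p).
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def p_value(grunters, n=N, p0=P0):
    # Probability of seeing this few grunters or fewer if the null is true.
    return binom_cdf(grunters, n, p0)

def reject_null(grunters):
    # Reject when the observed count is too unlikely under the null.
    return p_value(grunters) < ALPHA

for observed in (92, 90):
    print(observed, round(p_value(observed), 3), reject_null(observed))
```

Under these assumptions, 92 grunters out of 100 is compatible with the null hypothesis, while 90 is unlikely enough to reject it. If the null is true, results falling in the rejection region occur with probability at most α, which is exactly the Type I error rate the test is designed to control.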