Data cleaning is a crucial step in algorithmic trading, ensuring that stock market datasets are accurate and reliable. Descriptive statistics help analyze the cleaned data, providing insights for better trading decisions.
In this blog, we’ll cover handling missing values, data standardization, and statistical analysis using Python.
Why Is Data Cleaning Essential in Algorithmic Trading?
Before applying descriptive statistics to financial data, cleaning is necessary. Poor data quality can lead to:
- Inaccurate trading signals, causing wrong investment decisions.
- Higher risk due to inconsistent or missing data.
- Incorrect trend analysis, affecting stock price predictions.
To avoid these issues, traders clean and preprocess their data using Python.
Hello friends, welcome back to Day 37 of this 100-day Python algo trading series. Today we will look at univariate, bivariate, and multivariate analysis. We will cover univariate and bivariate analysis first and, after a few more concepts, move on to multivariate analysis.
In the previous session we covered what statistics is, its two types (descriptive and inferential), and the various measures of central tendency and dispersion. This video continues from there.
In trading, data usually comes as a table of rows and columns, that is, a DataFrame, and each column can be treated as a variable. For example, take a DataFrame imported from Yahoo Finance, up to the Volume column only, so it holds the OHLCV data (Open, High, Low, Close, Volume) explained in the previous videos. Open is a variable, High is a variable, and so on, so the table contains multiple variables.
Univariate analysis means analyzing a single column of this DataFrame, bivariate analysis means analyzing two columns, and multivariate analysis means analyzing more than two columns. Let’s start with univariate analysis.
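The idea that each column is a variable can be sketched in a few lines of pandas. This is a minimal illustration, not the video’s notebook: the table below is synthetic (every number is made up) and only stands in for a downloaded OHLCV DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a downloaded OHLCV table (all values are made up).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2024-01-02", periods=5)
close = 180 + rng.normal(0, 2, len(idx)).cumsum()
data = pd.DataFrame(
    {
        "Open": close - 0.5,
        "High": close + 1.0,
        "Low": close - 1.0,
        "Close": close,
        "Volume": rng.integers(40_000_000, 90_000_000, len(idx)),
    },
    index=idx,
)

univariate = data["Close"]                             # one column
bivariate = data[["Close", "Volume"]]                  # two columns
multivariate = data[["Open", "High", "Low", "Close"]]  # more than two columns

print(univariate.describe())  # summary of a single variable
print(bivariate.corr())       # relationship between two variables
print(multivariate.shape)     # (5, 4)
```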






The difference A − B keeps the elements of A that are not in B: we drop the common element, Tesla, and are left with Apple and Google. If we write B − A instead, we again drop Tesla and are left with Amazon and Microsoft (MSFT). That’s the basic idea of difference.
Next is the symmetric difference, which is also very easy: ignore the common elements and write down everything that remains. Tesla is the common element, so we ignore it, and the symmetric difference of A and B is Apple, Google, Amazon, and Microsoft.
There are a few more properties. Two sets are disjoint when they have no common elements. Our sets share Tesla, so if someone asks whether they are disjoint, the answer is the Boolean value False. Whenever a property has the prefix "is", just think in terms of True and False.
Next is the subset check. Say set A contains Apple and Google, and set B contains Apple, Google, and Amazon. If someone asks whether A is a subset of B, the answer is True, because every element of A exists in B. Similarly, if someone asks whether B is a superset of A, the answer is also True, for the same reason.
Next is uniqueness. Suppose we add another Apple to B. If someone asks whether B is unique, the answer is False, because Apple now appears twice: a collection with duplicates is not unique. (A Python set drops duplicates automatically, so this check only makes sense for collections that can hold repeats, such as a list.)
That is the basic theory of sets. I hope you understood it well, and if you still have any doubts you can connect with us. Now let’s quickly work through some examples for better understanding, and then we’ll proceed to dictionaries.
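The set operations above can be tried directly with Python’s built-in `set`. The ticker sets are an assumption reconstructed from the spoken example (Tesla as the only shared element), and the duplicate check uses a list because a `set` cannot hold repeats.

```python
# Ticker sets reconstructed from the spoken example (assumed contents).
a = {"AAPL", "GOOGL", "TSLA"}
b = {"TSLA", "AMZN", "MSFT"}

a_minus_b = a - b           # elements only in a
b_minus_a = b - a           # elements only in b
sym_diff = a ^ b            # everything except the common elements
disjoint = a.isdisjoint(b)  # False, because TSLA is shared

small = {"AAPL", "GOOGL"}
big = {"AAPL", "GOOGL", "AMZN"}
is_sub = small.issubset(big)      # True: every element of small is in big
is_super = big.issuperset(small)  # True: big contains all of small

# Uniqueness check on a list, since a set silently drops duplicates.
tickers = ["AAPL", "GOOGL", "AMZN", "AAPL"]
all_unique = len(tickers) == len(set(tickers))  # False: AAPL is repeated

print(sorted(a_minus_b), sorted(b_minus_a), sorted(sym_diff))
print(disjoint, is_sub, is_super, all_unique)
```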
The difference between a histogram and a bar chart is that the bars of a bar chart are separated, while the bars of a histogram touch each other. They touch because the bins are contiguous ranges: one range ends at 110 and the next starts at 110. That’s the basic difference between a bar chart and a histogram.
So in univariate analysis you can have two types of data, categorical and numerical. For categorical data you can build a frequency table or a bar chart, and for numerical data you can draw a histogram. We will see these in code shortly.
Now let’s talk about bivariate analysis. Here there are three possible combinations: categorical with categorical, categorical with numerical, and numerical with numerical.
In bivariate analysis we are trying to find a pattern, a relationship between two variables, that is, two columns. Say we have a table with one categorical column and several numerical columns.
When both columns are categorical, we can create a contingency table, also called a cross-tab, which shows the joint frequency distribution of the two variables. We will see this in code shortly.
When both columns are numerical, we can create a scatter plot, which shows the relationship between the two numerical columns. I’ll explain later how to create scatter plots in Python; for now, just focus on which kind of plot goes with which kind of data.
When we have one categorical and one numerical column, we can create a box plot. To read a box plot you first need a few concepts, so for now just remember which plot goes with which kind of data, because this will really help you in real-life scenarios.
Last comes multivariate analysis, which we’ll also cover later, after a few topics that are important for understanding it. In multivariate analysis we create box plots as well, so before that we will cover percentiles and the interquartile range (IQR), which you need to understand a box plot. The plan is: go through bivariate analysis up to the scatter plot, then cover those concepts, and then proceed with the rest of the analysis.
Let’s get started with the graphs in Python. First I import the data from Yahoo Finance with `yf.download`, taking Apple as the ticker, starting in January 2024 and ending on June 30, 2024. When I hit Shift+Enter, we get the OHLCV data along with the Adjusted Close, and I store it in the variable `data`.
We said that univariate analysis can involve two types of data, categorical and numerical, but this DataFrame has no categorical column, so we will create one.
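The pairing of data types and summaries described so far can be written down as a small lookup. This is only a sketch for remembering the rule: `suggest_summary` and `SUGGESTIONS` are hypothetical names, not part of pandas, and the sample values are made up.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical lookup: column-type combination -> summary named in this tutorial.
SUGGESTIONS = {
    ("categorical",): "frequency table, bar chart",
    ("numerical",): "histogram",
    ("categorical", "categorical"): "contingency table (cross-tab)",
    ("categorical", "numerical"): "box plot",
    ("numerical", "numerical"): "scatter plot",
}


def suggest_summary(*columns: pd.Series) -> str:
    """Return the summary this tutorial pairs with the given column types."""
    kinds = tuple(sorted(
        "numerical" if is_numeric_dtype(col) else "categorical" for col in columns
    ))
    return SUGGESTIONS.get(kinds, "multivariate analysis (covered later)")


# Made-up sample rows, only to exercise the lookup.
data = pd.DataFrame({
    "Close": [185.6, 184.3, 181.9, 181.2],
    "Volume": [82_000_000, 58_000_000, 72_000_000, 62_000_000],
    "Day Name": ["Tuesday", "Wednesday", "Thursday", "Friday"],
})

print(suggest_summary(data["Close"]))                    # histogram
print(suggest_summary(data["Close"], data["Volume"]))    # scatter plot
print(suggest_summary(data["Day Name"], data["Close"]))  # box plot
```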
To get one, I apply `pct_change()` to the Close column and store the result in a new column, Daily Returns. When I check the DataFrame, the new column is there. Next I categorize it: on any day where the return is greater than zero the category is Positive, otherwise it is Negative. For that I call `apply` on Daily Returns with a lambda that returns Positive if x is greater than zero, else Negative. When I hit Shift+Enter, we get the Negative and Positive labels, and I store them in another column named Category of Daily Returns.
On this column we can apply `value_counts()`, which gives us a table known as the frequency table. To make it a little clearer, I also call `reset_index()` on it.
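The steps above can be sketched as follows. To keep the example runnable offline, a synthetic random-walk Close series stands in for the downloaded data, and the column name `Category` is shortened for illustration. Note that the first return is NaN and that a return of exactly zero fails the `x > 0` test, so both land in Negative.

```python
import numpy as np
import pandas as pd

# Synthetic Close prices standing in for the downloaded data (values are made up).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2024-01-02", "2024-06-28")
data = pd.DataFrame({"Close": 180 + rng.normal(0, 2, len(idx)).cumsum()}, index=idx)

# Day-over-day percentage change; the first row has no previous day, so it is NaN.
data["Daily Returns"] = data["Close"].pct_change()

# Label each day. NaN > 0 is False, so the first row is labelled "Negative".
data["Category"] = data["Daily Returns"].apply(
    lambda x: "Positive" if x > 0 else "Negative"
)

# Frequency table: how many days fall in each category.
freq = data["Category"].value_counts().reset_index()
print(freq)
```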
The result is a nice-looking frequency table: 63 days are Positive and 61 days are Negative. We imported the data from January to June, about 180 calendar days, but after removing holidays and weekends we are left with 124 trading days, of which 63 are Positive and 61 are Negative. That is how you use a frequency table.
Now, how do we find the relative frequency? First store the frequency table in a variable, say `freq`, and print it. Then take the same `value_counts()` call and pass `normalize=True`: the resulting `proportion` column is the relative frequency. If you want the cumulative frequency, which is the running total of these values, there is a very handy function for that, `cumsum()`.
You can also draw plots. For the frequency table, write `freq.plot` with `kind` set to `bar` and you get a bar chart; it looks better when there are more categories. You can also set the x label and y label, as discussed previously in the Matplotlib section.
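A minimal sketch of the relative frequency, cumulative frequency, and bar chart. Instead of downloading prices, the category labels are rebuilt from the counts quoted above (63 Positive, 61 Negative), and the axis label text is an illustrative choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Category labels rebuilt from the counts quoted in the text.
category = pd.Series(["Positive"] * 63 + ["Negative"] * 61, name="Category")

freq = category.value_counts()                    # absolute frequency
rel_freq = category.value_counts(normalize=True)  # relative frequency, sums to 1
cum_freq = freq.cumsum()                          # running total of the counts

print(freq)
print(rel_freq.round(3))
print(cum_freq)

# Bar chart of the absolute frequencies.
ax = freq.plot(kind="bar")
ax.set_xlabel("Category")
ax.set_ylabel("Number of days")
# plt.show()  # uncomment when running interactively
```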
For the relative frequency, you can draw a pie chart. Here, do not reset the index; plot directly with `kind` set to `pie`. The pie chart shows the distribution of the Positive and Negative percentages. Such a chart gives a good visualization when there are many categories; with only a few it is not really needed.
So far, for univariate analysis of a categorical column, we have covered the bar chart, the pie chart, the relative frequency, and the cumulative frequency. Now let’s see what we can do with a numerical column.
Take the numerical column Close and cut it into a few pieces with `pd.cut`, passing the column and the number of bins, say 10. The values are now grouped into 10 bins, and calling `value_counts()` on the result gives the number of observations in each bin, which is also a kind of frequency distribution table.
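Both steps can be sketched as below, again under stated assumptions: the category labels are rebuilt from the 63/61 counts quoted earlier, and the Close series is a synthetic random walk rather than real Apple prices.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Pie chart of the relative frequency (index left in place, not reset).
category = pd.Series(["Positive"] * 63 + ["Negative"] * 61, name="Category")
rel_freq = category.value_counts(normalize=True)
ax = rel_freq.plot(kind="pie", autopct="%1.1f%%")
ax.set_ylabel("")  # hide the default y label
# plt.show()  # uncomment when running interactively

# Synthetic Close prices (values are made up), cut into 10 equal-width bins.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2024-01-02", "2024-06-28")
close = pd.Series(180 + rng.normal(0, 2, len(idx)).cumsum(), index=idx, name="Close")

binned = pd.cut(close, bins=10)
bin_counts = binned.value_counts().sort_index()  # observations per price range
print(bin_counts)
```

Sorting by index keeps the bins in price order; without it, `value_counts()` orders the rows by count.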
If you apply `reset_index()` here, you get a table similar to the one we saw for the categorical column, so you can draw a bar chart or pie chart from it as well. For a bar chart, call `plot` with `kind` set to `bar`.
We can also draw a histogram. I take the Close column again and pass it to `plt.hist` with `bins` set to 10, and when we hit Shift+Enter we get a histogram. How the histogram looks depends on how many bins you give. With too many bins it does not give a good picture, and with too few the bars become very wide, so you need a balance; six looks reasonable here. It depends on the situation and on the kind of plot you want. This was just a demonstration, and it is up to you to use this functionality creatively.
Now let’s see how to plot graphs for bivariate analysis, where we can have categorical and categorical, categorical and numerical, or numerical and numerical columns. First we need another categorical column, so I’ll create the days of the week: `data.index` gives all the dates, and calling `day_name()` on it gives the day name for each date. I store this in a new column, Day Name.
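The effect of the bin count, and the Day Name column, can be sketched like this. The prices are synthetic, and the bin counts 3, 6, and 50 are illustrative picks for "too few", "balanced", and "too many"; they are not the exact values tried in the video.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic Close prices standing in for the downloaded data (values are made up).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2024-01-02", "2024-06-28")
data = pd.DataFrame({"Close": 180 + rng.normal(0, 2, len(idx)).cumsum()}, index=idx)

# Same data, three bin counts: very wide bars, a balanced view, and a noisy view.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
totals = []
for ax, bins in zip(axes, (3, 6, 50)):
    counts, edges, _ = ax.hist(data["Close"], bins=bins)
    ax.set_title(f"bins={bins}")
    totals.append(int(counts.sum()))  # every histogram covers all observations
# plt.show()  # uncomment when running interactively

# Second categorical column: the weekday of each date in the index.
data["Day Name"] = data.index.day_name()
print(data["Day Name"].value_counts())
```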
When I check the DataFrame again, it shows the day of the week next to each Positive or Negative label. If we count these values, we can create a contingency table, or cross-tab, using the function `pd.crosstab`. Pass the Category of Daily Returns column first and the Day Name column second. The resulting contingency table shows, for example, 13 Negative and 12 Positive on Fridays and 10 Negative and 12 Positive on Mondays. From this you can analyze on which days you get more positive returns and on which days more negative returns. It’s very easy: apply a little logic and it becomes magic.
Now let’s see what to do when both columns are numerical, for example to find the relationship between Volume and Daily Returns. I first tried `pd.scatter`, passing Volume and Daily Returns, but that raises an error saying pandas has no attribute `scatter`. The function lives in Matplotlib, so it should be `plt.scatter`, and with that we get a scatter plot. I’ll give a deeper explanation of the theory and mathematics behind it in the next session; for now, just remember that with two numerical columns we can draw a scatter plot.
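A minimal sketch of both bivariate cases, assuming synthetic prices and volumes (so the counts will not match the Friday and Monday figures quoted above) and the shortened column name `Category`.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic prices and volumes standing in for the downloaded data (values are made up).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2024-01-02", "2024-06-28")
data = pd.DataFrame(
    {
        "Close": 180 + rng.normal(0, 2, len(idx)).cumsum(),
        "Volume": rng.integers(40_000_000, 90_000_000, len(idx)),
    },
    index=idx,
)
data["Daily Returns"] = data["Close"].pct_change()
data["Category"] = data["Daily Returns"].apply(
    lambda x: "Positive" if x > 0 else "Negative"
)
data["Day Name"] = data.index.day_name()

# Categorical vs categorical: contingency table of category by weekday.
table = pd.crosstab(data["Category"], data["Day Name"])
print(table)

# Numerical vs numerical: scatter plot (the first row has a NaN return, so drop it).
clean = data.dropna(subset=["Daily Returns"])
plt.scatter(clean["Volume"], clean["Daily Returns"])
plt.xlabel("Volume")
plt.ylabel("Daily Returns")
# plt.show()  # uncomment when running interactively
```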
The last case is one categorical and one numerical column, for which you can draw a box plot. To understand a box plot you need the percentile and IQR concepts, but let me draw it first. `boxplot` takes the column to plot, here Daily Returns, through `column`, and the grouping column through `by`. Because `boxplot` is applied on the DataFrame, I have to call it on `data`. When I run it, we get the box plot.
To interpret it we first need to understand percentiles and the IQR. We will cover those, then come back to this plot, and also cover the remaining part, multivariate analysis.
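The box plot call can be sketched as follows. The data is synthetic, and grouping by Day Name is an assumption: the transcript does not make clear which categorical column was passed to `by`.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic Close prices standing in for the downloaded data (values are made up).
rng = np.random.default_rng(0)
idx = pd.bdate_range("2024-01-02", "2024-06-28")
data = pd.DataFrame({"Close": 180 + rng.normal(0, 2, len(idx)).cumsum()}, index=idx)
data["Daily Returns"] = data["Close"].pct_change()
data["Day Name"] = data.index.day_name()

# One box per weekday; "Day Name" as the grouping column is an assumption.
fig, ax = plt.subplots()
data.boxplot(column="Daily Returns", by="Day Name", ax=ax)
ax.set_ylabel("Daily Returns")
# plt.show()  # uncomment when running interactively

# The line inside each box is the group median, which can be checked numerically.
medians = data.groupby("Day Name")["Daily Returns"].median()
print(medians)
```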
Watch this Day 37 video tutorial
Day 37: Descriptive Statistics Part 2