The other day I made a blog post with my notes on making scatterplots in matplotlib. One big reason to make scatterplots, though, is that you are interested in a predictive relationship. Typically you want to look at the conditional value of the Y variable based on the X variable. Here are some example exploratory data analysis plots to accomplish that task in python.

I have posted the code to follow along on github; in particular smooth.py has the functions of interest, and below I have various examples (that are saved in the Examples_Conditional.py file).

Data Prep

First, to get started, I am importing my libraries and loading up some of the data from my dissertation on crime in DC at street units. My functions are in the smooth set of code. I also change the default matplotlib theme using smooth.change_theme(). The only difference from my prior posts is that I don't have gridlines by default here (they can be a bit busy).

    mydir = r'D:\Dropbox\Dropbox\PublicCode_Git\Blog_Code\Python\Smooth'
    #Dissertation dataset, can read from dropbox

For the first set of examples, I bin the data and estimate the conditional means and standard deviations. So in this example I estimate E[Y|X=0], E[Y|X=1], etc., where Y is the total number of part 1 crimes and X is the total number of alcohol licenses on the street unit. The function name is mean_spike, and you pass in at a minimum the dataframe, x variable, and y variable. By default I plot the spikes as +/- 2 standard deviations, but you can set that via the mult argument.

    #Example binning and making mean/std dev spike plots
    smooth.mean_spike(DC_crime,'TotalLic','TotalCrime')
    mean_lic = smooth.mean_spike(DC_crime,'TotalLic','TotalCrime',plot=False,ret_data=True)

This example works out because licenses are just whole numbers, so they can be binned. You can pass in any X variable that can be binned in the end, so you could pass in a string for the X variable. If you don't like the resulting format of the plot, though, you can just pass plot=False,ret_data=True as arguments, and you get back the aggregated data that I use to build the plots.

Another example I am frequently interested in is proportions and confidence intervals. Here the function uses exact binomial confidence intervals at the 99% confidence level. I clip the burglary data to 0/1 values and then estimate proportions.

    #Example with proportion confidence interval spike plots
    DC_crime['BurgClip'] = DC_crime[...].clip(0,1)
    smooth.prop_spike(DC_crime,'TotalLic','BurgClip')

A few things to note: I clip out bins with only 1 observation in them for both of these plots. I also do not have an argument to save the plot. This is because I typically only use these for exploratory data analysis; it is pretty rare that I use these plots in a final presentation or paper. I will need to update these in the future to jitter the data slightly to be able to superimpose the original data observations. The next plots make that a bit easier to show, though.

Restricted Cubic Spline Plots

Binning like I did prior works out well when you have only a few bins of data. If you have continuous inputs, though, it is tougher. In that case, typically what I want to do is estimate a functional relationship in a regression equation, e.g. Y ~ f(x), where f(x) is pretty flexible to identify potential non-linear relationships. Many analysts are taught the loess linear smoother for this. But I do not like loess very much; it is often both locally too wiggly and globally too smooth in my experience, and the weighting function has no really good default.

Another popular choice is to use generalized additive model smoothers. My experience with these (in R) is better than loess, but they IMO tend to be too aggressive, and identify overly complicated functions by default.
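For readers who want to see what the binned mean and standard deviation summaries in this post boil down to, here is a minimal sketch in plain Python. It is not the mean_spike function from smooth.py: the helper name binned_mean_sd, its return format, and the toy data are all assumptions made for illustration, and it uses lists instead of a dataframe so it runs without any dependencies. It follows the description in the post, though: group Y by the distinct X values, drop bins with only one observation, and report the mean plus and minus mult standard deviations.

```python
from collections import defaultdict
from statistics import mean, stdev


def binned_mean_sd(x, y, mult=2):
    """Summarize y within each distinct value of x.

    Hypothetical helper for illustration only (not the smooth.py API).
    Returns one dict per bin with the count, the mean, and the
    mean -/+ mult * sample standard deviation.
    """
    bins = defaultdict(list)
    for xi, yi in zip(x, y):
        bins[xi].append(yi)

    rows = []
    for key in sorted(bins):
        vals = bins[key]
        # A standard deviation needs at least two points, so bins with
        # a single observation are dropped.
        if len(vals) < 2:
            continue
        m = mean(vals)
        s = stdev(vals)
        rows.append({
            "x": key,
            "n": len(vals),
            "mean": m,
            "low": m - mult * s,
            "high": m + mult * s,
        })
    return rows


# Made-up toy data: license counts (x) and crime counts (y)
lic = [0, 0, 0, 1, 1, 2, 2, 2, 3]
crime = [1, 3, 2, 4, 6, 5, 9, 7, 20]

for row in binned_mean_sd(lic, crime):
    print(row)
```

The single street unit with 3 licenses is left out of the result, which is the same reason bins with one observation are clipped from the plots. The rows that come back are the kind of aggregated data you would feed to an error bar plot if you wanted to draw the spikes yourself.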