Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf
Ancestry in Brazil
/r/MapPorn
https://redd.it/yhe9h5
[Q] What is wrong with interpreting a Bayesian probability from a Frequentist CI?
I know the interpretation for a Frequentist 95% CI is that if we were to construct infinitely many similarly constructed intervals, 95% of them would contain the true population value.
Generally, many people ("wrongly") say that if they generate a 95% CI of (5,10), there is a 95% chance that the population value is between 5 and 10. But how is this wrong? There is some large number X of possible CIs from samples of size n, 95% of which contain the true value, and this is one of those X intervals, so there is a 95% chance that it is one of the intervals that contains the true value. And if that is the case, there is a 95% chance that it contains the true value.
Now, what I have also heard is that by doing this I'm interpreting a Frequentist construct with a Bayesian probability. Because to the Frequentists, it doesn't make sense to talk about the probability of whether this interval contains the true value or not: *it does or it doesn't*. But what is wrong with saying "I've created this Frequentist 95% CI, (5,10), so there is a 95% chance (from the Bayesian definition of probability rather than the Frequentist one) that the true population value is between 5 and 10"?
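The long-run reading described above can be checked with a small simulation. This is only a sketch under assumed values: the true mean, the known sigma, the sample size and the function name `coverage` are all made up for illustration. It shows that the 95% describes the interval-building procedure over repeated samples; it does not by itself give a posterior probability for one realized interval such as (5,10), which would also need a prior.

```python
import random
from statistics import NormalDist


def coverage(true_mean=7.5, sigma=2.0, n=30, trials=10_000, level=0.95, seed=0):
    """Fraction of z-intervals (sigma assumed known) that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # Draw a fresh sample and build one interval around its mean.
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials


# The fraction of intervals that cover the true mean should land near 0.95.
print(coverage())
```

Each individual interval in the loop either covers `true_mean` or it doesn't; only the fraction across trials is close to 0.95.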
/r/statistics
https://redd.it/yhf0fw
Countries with the most Olympic medals per capita
/r/MapPorn
https://redd.it/yhcsim
[OC] Total Number of Births Since 1850
/r/dataisbeautiful
https://redd.it/yg8som
How would I figure this out
/r/mathpics
https://redd.it/ygp7ah
[Question] R-Squared: biased and invalid for small samples?
I've been running regressions with different samples, and I have the impression that the smaller the sample, the larger the R-squared. For instance, R-squared with n = 2 is always 100%. Is sample R-squared a biased estimator of population R-squared? Is R-squared invalid for small samples?
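One way to probe this impression is a quick simulation where x and y are generated independently, so the population R-squared is 0 and any sample R-squared above 0 is pure noise. This is a minimal sketch with assumed settings (standard normal data, one predictor plus intercept, hypothetical function names). It also reports the usual adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - 2) for one predictor, which is undefined at n = 2.

```python
import random


def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def mean_r2(n, trials=5000, seed=0):
    """Average (R-squared, adjusted R-squared) when x and y are unrelated."""
    rng = random.Random(seed)
    total, total_adj = 0.0, 0.0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        r2 = r_squared(x, y)
        total += r2
        if n > 2:  # adjusted R-squared divides by n - 2, so skip n = 2
            total_adj += 1 - (1 - r2) * (n - 1) / (n - 2)
    adj = total_adj / trials if n > 2 else None
    return total / trials, adj


for n in (2, 3, 5, 10, 50):
    print(n, mean_r2(n))
```

In this null setup the average sample R-squared should shrink toward 0 as n grows (roughly like 1/(n - 1) for normal data), while the average adjusted R-squared should stay near 0.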
/r/statistics
https://redd.it/ygrv5i
[OC] US colleges attract a lot of foreign students
/r/dataisbeautiful
https://redd.it/ygtrah
[R] ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts + Gradio Demo
https://redd.it/ygj11f
@datascientology
Why I like Halloween
/r/funnycharts
https://redd.it/yewrzk
An interesting visual jump
/r/dataisugly
https://redd.it/ygg7di
UN vote to end US embargo against Cuba
/r/MapPorn
https://redd.it/ygjf47
Difficulty transitioning between R and Python?
I’m using R in grad school, as required, and Python at work, due to availability. I’m reasonably comfortable with both, but switching back and forth within the same day is a bit rough. At what level of fluency will it get easier? Any suggestions? Just keep at it?
/r/datascience
https://redd.it/yg38or
[OC] The average colour of each US state flag.
/r/dataisbeautiful
https://redd.it/yfqdjy
Map of The 13 British Overseas Territories
/r/MapPorn
https://redd.it/yhedox
The Stack - A 3TB Dataset of permissively-licensed code in 30 languages
https://twitter.com/bigcodeproject/status/1585631176353796097?s=46&t=mLrACB0pej1c7ge2uX2vKg
/r/datasets
https://redd.it/yfhxnb
Infinite monkey Theorem [oc]
/r/mathpics
https://redd.it/ygjxgi
Ancient Data Viz? I guess pictographs, cave arts, ethnomath count but are there spiritual or indigenous knowledge that exists? Pls lmk if u kno any orgs, groups, articles, vids, or academic papers on this!!
https://redd.it/yh8c6n
@datascientology
[OC] Number of Costco vs Sam's Club Stores across US States
/r/dataisbeautiful
https://redd.it/ygnt2l
[OC] Number of packs required to fill the FIFA World Cup Qatar 2022 Album (670 stickers without trading)
/r/dataisbeautiful
https://redd.it/ygrwhg
There is a lake in Finland, that looks like Finland.
/r/MapPorn
https://redd.it/ygl4tf
[OC] GOT and HOTD Episodes by IMDb User Ratings
/r/dataisbeautiful
https://redd.it/ygkmmu
data_irl
https://imgur.com/gallery/y9rdpXZ
/r/data_irl
https://redd.it/yglpst
The Top 100 Most Valuable Brands in 2022 (according to Brand Finance)
/r/Infographics
https://redd.it/ygpil4
[OC] The average colour of each European country flag.
/r/dataisbeautiful
https://redd.it/ygjn50
[OC] Contributing factors to price inflation. Another representation of the chart Rep. Katie Porter showed. With a bonus example to demonstrate.
https://www.epi.org/blog/corporate-profits-have-contributed-disproportionately-to-inflation-how-should-policymakers-respond/
/r/dataisbeautiful
https://redd.it/yg6ccq
[OC] How much time do men and women spend caring for children in the US?
/r/dataisbeautiful
https://redd.it/yfvmtd
Electrical grids of Canada/USA
/r/MapPorn
https://redd.it/yg0dnu
A critical reflection on Jupyter notebooks
In my experience notebooks are a surprisingly controversial topic. I've seen things ranging from Databricks building tools for data scientists and data engineers that can seemingly only run on notebooks (unless you install the notoriously buggy Databricks Connect) to people using the word "notebook" as the antithesis of good programming habits.
Recently I've been listening to more talks about interactive vs batch programming and have just been reflecting on how I write code myself. Here's my own set of hot takes:
1. The name of notebooks explains what they are meant for. They are for experimenting, prototyping, potentially automating reports with markdown, etc. Essentially, you use them to jot down ideas as you would on a piece of paper.
2. You should build systems/features/... with notebooks and not with regular scripts to save time. You should treat your notebook as a debugger that is always on. Writing code in notebooks is a great way to build code interactively and incrementally. Sitting around waiting on IO to load data out of your DB on every run, just to train a model, doesn't make sense.
3. Notebooks DO encourage poor programming standards if you don't watch out. People say this as a buzzword without ever clarifying what they mean. The biggest problems here are the (ab)use of global variables and the fact that notebooks are typically self-contained units. Having a proper project structure and reusing code across your project is important. Ideally you define building blocks in functions/classes somewhere in your project and run them in your notebooks. If you do, your notebook is equivalent to production code.
4. Using a notebook as a scratchpad and porting it to "production code" is faster than writing production code in a .py file. This is the summary of the previous 3 points and what I personally do. The overhead of porting a notebook to 4-5 different files in a clear directory structure, with a main somewhere that runs them in the same order as the notebook would, is imo just less than building it from scratch like that.
5. Don't delete your scratchpads, keep them around as documentation for your streamlined production workflow. Why? Because running each block in a notebook ime gives you more freedom than fighting with your debugger if/when something does go wrong.
Sidenote: this is why I think people have issues transitioning from R to Python or from Spyder to another IDE/text editor. R (RStudio) and Spyder are a lot closer to interactive programming because you can run your code line by line and not lose your variables. This is how programming should be, but it is not how the vast majority of people learnt it, and people don't like change.
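To make point 3 a bit more concrete, here is a rough sketch of what "building blocks in functions, run from the notebook" could look like. Everything in it is hypothetical: the function name `add_centered_column`, the idea that it lives in a project module, and the toy data. The point is only that the function takes its inputs as arguments and returns a new value instead of reading or mutating notebook globals, so the same call works in a notebook cell and in a main script.

```python
from statistics import mean


# Would live in a project module (hypothetical, e.g. a features.py file).
def add_centered_column(rows, column):
    """Return new rows with `<column>_centered` added; the input is not mutated."""
    center = mean(row[column] for row in rows)
    return [{**row, f"{column}_centered": row[column] - center} for row in rows]


# Notebook cell (or main script): data is loaded once and passed in explicitly.
rows = [{"price": 10.0}, {"price": 14.0}, {"price": 12.0}]
centered = add_centered_column(rows, "price")
print(centered[0])  # {'price': 10.0, 'price_centered': -2.0}
```

Because `rows` stays untouched, the cell can be re-run in any order without earlier state leaking in, which is the usual complaint about notebook globals.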
/r/datascience
https://redd.it/yfsxrn