Q In business, it seems like we care much more often about Type II errors than Type I errors.
I often seem to encounter situations in business where Type I errors don't seem very important.
Say we're testing two images on our website and want to know which one causes more conversions, so I use a two-tailed test with the null hypothesis that there is no difference in conversion rates between the images. In this scenario, it seems like I shouldn't care much about making a Type I error. If the null hypothesis is true but I incorrectly conclude that it can be rejected and implement the "winner", there's actually no downside for my business: we don't lose any conversions, because the null hypothesis is true, so it never mattered which image I chose. Whichever one I picked, there would be no change in conversions. I suppose I miss the opportunity to realise that this test was a waste of time and perhaps to avoid running pointless tests in the future, but considering only this one test, if we assume the null is true, then my choice is unimportant and therefore so is the statistical significance of my results.
What does seem to matter far more is power. If there is a true difference between these alternatives, even a very small one, sometimes in business it's really important that I be able to detect it. If improving conversion by a few percentage points can earn my company millions of dollars, then I need to set up my test so that we can find those tiny effects.
So... is this correct? Am I missing something about the practical interpretation of significance here? If this is the case, why is it that so much time and literature and tooling focuses on the significance of results rather than the power?
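To make the power side concrete, here is a minimal sketch of the sample-size arithmetic, assuming statsmodels; the 5% baseline rate and 0.5-point lift are made-up numbers for illustration, not anything from the post:
```python
# Rough sketch: visitors per variant needed to detect a small conversion lift.
# Assumes statsmodels; the baseline and lift values below are purely illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.050   # hypothetical current conversion rate (5.0%)
lift = 0.055       # hypothetical rate under the new image (+0.5 percentage points)

effect = proportion_effectsize(lift, baseline)  # Cohen's h for two proportions
n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,             # Type I error rate
    power=0.80,             # 1 - Type II error rate
    alternative="two-sided",
)
print(f"~{n_per_group:,.0f} visitors per variant")
```
Tiny effects need very large samples, which is exactly why power, not just significance, ends up driving how a test like this is designed.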
/r/statistics
https://redd.it/yfubif
7M+ Venmo transactions scraped from the public API
Transactions scraped from the [Venmo](https://venmo.com/) public API by [Dan Salmon](https://danthesalmon.com/about/)
This data was collected during the following date ranges:
* July 2018 - September 2018
* October 2018
* January 2019 - February 2019
While there is no data for the amount transferred, it's interesting to look at most frequently occurring target (receiver) / actor (sender) pairs.
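For instance, a rough sketch of that pair counting, assuming the transactions have been exported to a CSV with (hypothetical) `actor` and `target` username columns; adapt the loading step to however you store the data:
```python
# Rough sketch: most frequent sender/receiver pairs.
# The column names 'actor' and 'target' are assumptions about an exported CSV,
# not the exact schema of the original data dump.
import pandas as pd

df = pd.read_csv("venmo_transactions.csv", usecols=["actor", "target"])
top_pairs = (
    df.groupby(["actor", "target"])
      .size()
      .sort_values(ascending=False)
      .head(20)
)
print(top_pairs)
```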
Source: [https://github.com/sa7mon/venmo-data](https://github.com/sa7mon/venmo-data)
\[Self promotion\] We've re-hosted the data on Gigasheet for exploration before downloading [https://app.gigasheet.com/spreadsheet/Venmo-Transactions-by-Dan-Salmon-github-com-sa7mon-venmo-data/56db56e2\_acb7\_4cc9\_9d7a\_ae308f5a2a06?public=true](https://app.gigasheet.com/spreadsheet/Venmo-Transactions-by-Dan-Salmon-github-com-sa7mon-venmo-data/56db56e2_acb7_4cc9_9d7a_ae308f5a2a06?public=true)
/r/datasets
https://redd.it/yfn8sz
kaggle is wild (・o・)
/r/datascience
https://redd.it/yfnbab
Russian tank T72-B3M, Obr. 2016
/r/Infographics
https://redd.it/yfm3of
Ethnic Map of Canada, 2021 [OC]
/r/dataisbeautiful
https://redd.it/yfn8s5
The 12 Largest US Cities in 2020
/r/MapPorn
https://redd.it/yf29cz
It’s pretty, but what does it mean?
/r/dataisugly
https://redd.it/ye0y6q
What's that humming sound? The World Hum Database
The World Hum Database has thousands of user-submitted reports of "The Hum", an "unusual unidentified low-frequency sound that scientists now call the Worldwide Hum."
https://thehum.info/
https://thehum.info/ewExternalFiles/march22dbase.csv
/r/datasets
https://redd.it/ye4rhc
[OC] How Meta made (or struggled to make) money in Q3 👇
/r/dataisbeautiful
https://redd.it/yeo103
Some of England's most relevant brands by county of origin.
/r/MapPorn
https://redd.it/yeow7f
[OC] Where do Democrats and Republicans stand on free speech and the internet?
/r/dataisbeautiful
https://redd.it/yeu2r4
List of large numbers up to TREE(3)
https://youtu.be/RYMjOwH_bWg
/r/mathpics
https://redd.it/y3qhlw
Europe: How willing would you be to help another country in a crisis?
/r/MapPorn
https://redd.it/yehe70
Drawing Europe from memory on a sequin pillow, a little stretched towards the north-east but not bad.
/r/MapPorn
https://redd.it/ydwta2
I made a family tree of the Olympian gods, featuring classical artworks
/r/Infographics
https://redd.it/ye1i3k
[OC] The absolute quality of Better Call Saul.
/r/dataisbeautiful
https://redd.it/yfphz4
Vietnamese diaspora worldwide as a share of local population.
/r/MapPorn
https://redd.it/yfjket
Ethnicities of Slovakia (Slovaks, Hungarians, Rusyns, Roma), based on the 2021 census
/r/MapPorn
https://redd.it/yfi514
An 'Undoing' of the *Culprit* Hard Unknot
/r/mathpics
https://redd.it/y0r3jb
New paper on Automatically Detecting Label Errors in Entity Recognition Data
Hi Redditors!
I think you guys will find this very useful. Anyone who uses entity recognition datasets has probably come across labels that are incorrect. Our newest research investigates automated methods to find sentences with mislabeled words in such datasets. Mislabeling is especially common in ML tasks like token classification, where labels must be chosen on a fine-grained basis. It is exhausting to get every single word labeled right!
We benchmarked a bunch of possible algorithms on real data (with actual label errors rather than synthetic errors often considered in academic studies) and identified one straightforward approach that can find mislabeled words with better precision/recall than others.
This algorithm is now available for you to run on your own text data in one line of open-source code. We ran this method on the famous CoNLL-2003 entity recognition dataset and found it has hundreds of label errors.
Blogpost: https://cleanlab.ai/blog/entity-recognition/
Paper: https://arxiv.org/abs/2210.03920
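For context, the open-source code is cleanlab (per the blog post above); here is a hedged sketch of what that call looks like for token classification, with the module path taken from my reading of the cleanlab 2.x docs and toy placeholder data rather than a real dataset:
```python
# Hedged sketch of flagging likely label errors in token-classification data.
# Module path and signature follow the cleanlab 2.x docs; labels and pred_probs
# below are toy placeholders, not real model output.
import numpy as np
from cleanlab.token_classification.filter import find_label_issues

# One list of per-token class labels per sentence...
labels = [[0, 0, 1], [0, 0, 0, 0]]   # second sentence: token 1 is deliberately mislabeled
# ...and one (num_tokens x num_classes) array of predicted probabilities per sentence.
pred_probs = [
    np.array([[0.90, 0.05, 0.05], [0.80, 0.10, 0.10], [0.10, 0.85, 0.05]]),
    np.array([[0.70, 0.20, 0.10], [0.10, 0.10, 0.80], [0.90, 0.05, 0.05], [0.85, 0.10, 0.05]]),
]

issues = find_label_issues(labels, pred_probs)
print(issues)   # (sentence index, token index) pairs flagged as likely label errors
```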
/r/datasets
https://redd.it/yewllw
All the metals we have mined this year.
/r/Infographics
https://redd.it/yeole2
U.S. Senators Ranked by Their Ability to Legislate in 2022 [OC]
https://redd.it/yewqtl
America's 3 deadliest drugs are legal. Underlying cause of death in America in 2015, by drug
/r/visualization
https://redd.it/yeq34i
Most common baby names in London, 2021
/r/MapPorn
https://redd.it/yetos9
Some Cute Little Sequences of Figures Beautifully Evincing the (Possibly Not Altogether Obvious @ First Glance) Topological Equivalence of Linked & Unlinked 'Handcuffs'
/r/mathpics
https://redd.it/y1j292
6 Things You May Not Know About Pumpkins
/r/Infographics
https://redd.it/yekklc
D Why can't we say "we are 95% sure"? Still don't follow this "misunderstanding" of confidence intervals.
If someone asks me "who is the actor in that film about blah blah" and I say "I'm 95% sure it's Tom Cruise", then what I mean is that for 95% of these situations where I feel this certain about something, I will be correct. Obviously he is already in the film or he isn't, since the film already happened.
I see confidence intervals the same way. Yes, the true value is already either inside the interval or not, but why can't we say we are 95% sure it lies in the interval (a, b), with the INTENDED MEANING being "95% of the time, our estimation procedure will produce an interval that contains the true parameter"? Like, what the hell else could "95% sure" mean for events that already happened?
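One way to make that intended meaning concrete is a coverage simulation: run the interval procedure many times on fresh samples and count how often the interval contains the true parameter. A small sketch, assuming a toy normal-mean setting with made-up numbers:
```python
# Coverage simulation: across repeated samples, roughly 95% of the t-intervals
# produced by this procedure contain the true mean. Toy example; numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 10.0, 2.0, 50, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mu, sigma, size=n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    covered += (sample.mean() - half_width <= true_mu <= sample.mean() + half_width)

print(f"Coverage over {trials:,} intervals: {covered / trials:.3f}")  # ~0.95
```
That long-run frequency is exactly the "95% of the time our estimation procedure will contain the true parameter" reading described above.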
/r/statistics
https://redd.it/yeccnw
Data Science Book Club
I’ve been a data scientist for 3 years and love it. I have come across some essential textbooks and books that would supplement my knowledge and career. I’ve made a list elsewhere and was wondering if others would like to join me as I try to read and discuss these books. I can host it in discord and we can read 75 pages a week, meeting for an hour virtually to discuss the ideas within. Any takers?
/r/datascience
https://redd.it/ye8626
Islam in Canada, 2021 vs. 2011
https://redd.it/ye8nxg