Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf
[Topic][Open] Open Discussion Thread: Anybody can post a general visualization question or start a fresh discussion!
Anybody can post a question related to data visualization or discussion in the monthly topical threads. Meta questions are fine too, but if you want a more direct line to the mods, click here
If you have a general question you need answered, or a discussion you'd like to start, feel free to make a top-level comment.
Beginners are encouraged to ask basic questions, so please be patient responding to people who might not know as much as yourself.
/r/dataisbeautiful
https://redd.it/yj6tsc
Please fill out this survey about medical assisted death for a research paper in my college class (Everyone)
https://forms.gle/vrgFLeEbNXTC13YX9
/r/SampleSize
https://redd.it/ykmmca
Peak map of Ireland
/r/MapPorn
https://redd.it/yk5l5d
Relative PornHub searches in the UK.
/r/MapPorn
https://redd.it/ykimz7
Broken McDonald's Ice cream machines worldwide
https://mcbroken.com/
/r/datasets
https://redd.it/yk0o85
[D] Simple Questions Thread
Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
Thanks to everyone for answering questions in the previous thread!
/r/MachineLearning
https://redd.it/ybjvk5
I tried bending the rules of Perler beads!
https://redd.it/y0c0uk
@datascientology
Map that shows if you were to drill straight through the earth where you would pop out on the other side.
/r/MapPorn
https://redd.it/yk2sgi
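For anyone curious how a map like this is computed: the point where you would pop out is the antipode, found by negating the latitude and shifting the longitude by 180 degrees. A minimal sketch, assuming a perfectly spherical Earth and coordinates in decimal degrees (the function name `antipode` is made up for illustration):

```python
def antipode(lat: float, lon: float) -> tuple[float, float]:
    """Return the point diametrically opposite (lat, lon), in degrees.

    Assumes a spherical Earth; lat is in [-90, 90], lon is in [-180, 180].
    """
    anti_lat = -lat
    # Shift by 180 degrees while keeping the result inside [-180, 180].
    anti_lon = lon - 180.0 if lon > 0 else lon + 180.0
    return anti_lat, anti_lon


# A point at 40 N, 75 W comes out at 40 S, 105 E.
print(antipode(40.0, -75.0))
```

Most land has open ocean as its antipode, which is why maps like this look so empty.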
[Q] If you had 3-5 years to prep for a PhD Stats, what would you do?
Given this time, what topics would you study, how would you research programs, what people would you reach out to, etc.?
My background: graduated with my BS Stats this past spring, really enjoyed my upper electives (Bayesian, unsupervised learning, stochastic processes), and now I work at a Big Four firm
My plan before the program: refresh + learn interesting/relevant topics (linear algebra, real analysis, data structures + algorithms), research many many programs (really like UWashington atm), and work 3-5 years to get experience and pocket cash
My plan after the program: I'd love to get into a soccer (football) analytics role - I really enjoy reading published statistical papers related to soccer and would like to do much the same (preferably for a club)
Any help/critique is appreciated even if it doesn't directly fit the background or timeline I'm working with. Also happy to explain more if it helps. Thanks!
/r/statistics
https://redd.it/yjt95c
Travel times from London in 1914 (Source: RealLifeLore)
/r/MapPorn
https://redd.it/yjyqo6
Request: Data sets of pharmaceutical drugs and which substances they have interactions with
I'm trying to find data sets of pharmaceutical drugs and the substances they interact with. For example, if I search for "ambien", I want to see a list of all of the drugs that you shouldn't be taking with it. This might differ from country to country, so I want as many of these as I can find.
/r/datasets
https://redd.it/yityxq
[N] Meta AI | Evolutionary-scale prediction of atomic level protein structure with a language model
Paper: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v2
Meta's Tweet: https://twitter.com/MetaAI/status/1587467591068459008
Abstract
>Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing billions of protein sequences revealed by large scale gene sequencing experiments would necessitate a breakthrough in the speed of folding. Here we show that direct inference of structure from primary sequence using a large language model enables an order of magnitude speed-up in high resolution structure prediction. Leveraging the insight that language models learn evolutionary patterns across millions of sequences, we train models up to 15B parameters, the largest language model of proteins to date. As the language models are scaled they learn information that enables prediction of the three-dimensional structure of a protein at the resolution of individual atoms. This results in prediction that is up to 60x faster than state-of-the-art while maintaining resolution and accuracy. Building on this, we present the ESM Metagenomic Atlas. This is the first large-scale structural characterization of metagenomic proteins, with more than 617 million structures. The atlas reveals more than 225 million high confidence predictions, including millions whose structures are novel in comparison with experimentally determined structures, giving an unprecedented view into the vast breadth and diversity of the structures of some of the least understood proteins on earth.
/r/MachineLearning
https://redd.it/yjdt78
Countries that recognise Kosovo
/r/MapPorn
https://redd.it/yjcdz0
Race of players in major professional team sports leagues
/r/dataisbeautiful
https://redd.it/yjgqvr
Deforestation In The Amazon Has Increased Significantly Over the Past Decade [OC]
/r/dataisbeautiful
https://redd.it/yjctv6
The cost of 1 gigabyte of mobile data in every country around the world
/r/Infographics
https://redd.it/yki6v5
Sentiment analysis of customer support tickets
Hi folks
I was wondering if there are any free sentiment analysis tools that are pre-trained (on typical customer support queries), so that I can run some text through them to get a general idea of positivity or negativity? It’s not a whole lot of text, maybe several thousand paragraphs.
Thanks.
/r/datascience
https://redd.it/ykmpgt
How many U.S. counties have a population greater than the state of Wyoming?
/r/MapPorn
https://redd.it/ykcd1q
Riemann n-sphere
/r/mathpics
https://redd.it/y060t6
[P] Implementation of MagicMix from ByteDance researchers: a new way to interpolate concepts with much more natural, geometric coherency (implemented with Stable Diffusion!)
Hi. Today I came across this interesting paper https://arxiv.org/abs/2210.16056 that proposes an interesting method to combine the semantics of text and image in the diffusion process.
In short, this mixes "layout" with "content", however unlike style transfer,
>"...semantic mixing aims to fuse multiple semantics into one single object."
I was surprised by the examples they showed, so I wanted to try it but the code wasn't available. I've implemented the method myself, and I wanted to share it here!
https://github.com/cloneofsimo/magicmix
Layout of "realistic photo of a rabbit" with content of "tiger"
I hope my implementation helps anyone who is reading the paper!
Note: I'm not the author of the paper, and this is not an official implementation
/r/MachineLearning
https://redd.it/ykiuq0
Help hosting trillions of rows of new health insurance public price data
As of July 1st this year, all health insurers in the US were required to publish files on their websites listing the prices they have negotiated for every possible medical procedure with every doctor in the country. In total, this data set amounts to trillions of rows and hundreds of TB of data.
I'm interested in building out a collaborative effort to aggregate all this data, but the cost of hosting seems to be a huge problem. What's the cheapest, effective way to host all this data in such a way that it's publicly accessible?
/r/datascience
https://redd.it/yk9gye
Religions of Canada, 2021 [OC]
/r/dataisbeautiful
https://redd.it/yk2pqh
The beginning of national anthems.
/r/MapPorn
https://redd.it/yk1o3n
[OC] Different types of government systems
/r/dataisbeautiful
https://redd.it/yk1028
Hilbert Curve pumpkin carving
/r/mathpics
https://redd.it/yhuudg
US Child Pedestrian Deaths by Day of the Year: 2006-2020 [OC]
/r/dataisbeautiful
https://redd.it/yjarqg
[OC] Is Nuclear Energy Dangerous? A comparison of Chinese coal mining related fatalities to worldwide nuclear and radiation related fatalities
/r/dataisbeautiful
https://redd.it/yjovks
Can you specialize in data cleaning?
Context: I'm studying Data Analytics right now through the Google certification, and once I land a job (any job 😂, not necessarily specific to data analysis) I intend on pursuing a degree in Data Science.
Ran across Data Cleaning, as you'd expect, pretty early on. And everything about it sounds really interesting and like something that I'd enjoy. I looked more into it, but over and over again things kept coming up about how the work keeps being forced on data scientists who don't want anything to do with data cleaning.
So my question is, is it possible to specialize specifically in data cleaning? And if so, are there specific certifications or other relevant education that I should pursue to do that? Is data cleaning at risk of being automated?
/r/datascience
https://redd.it/yjkjlx
[D] Pedagogy: Thoughts on this (old) blog post by Andrew Gelman on de-emphasizing the sampling distributions of the sample mean in intro Stats classes?
I'm teaching stats and have tried to come up with the most intuitive explanations of the sampling distribution of the sample mean, running simulations with the students and so on, to try to inculcate the idea, thinking it would build up and be useful moving into inference for regression and other topics later. In office hours, I found that many students still aren't getting it (which I don't blame them for; I didn't in intro stats either). Then I came across this post, and I don't know how I feel about it when thinking about how I would change my class in the next iteration. Curious what all you statisticians, educators, and stats practitioners think!
Edit: added link
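For readers who have not seen the kind of classroom simulation mentioned above, here is a minimal sketch of one. It is not the poster's own code: the exponential population, the sample sizes, and the helper name `sample_means` are all assumptions chosen for illustration. It draws many samples of size n, takes each sample's mean, and compares the spread of those means with the theoretical standard error sigma / sqrt(n).

```python
import random
import statistics


def sample_means(population, n, draws, seed=0):
    """Draw `draws` samples of size n (with replacement) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]


# Hypothetical skewed population: 100,000 exponential values with rate 1.
pop_rng = random.Random(1)
population = [pop_rng.expovariate(1.0) for _ in range(100_000)]
pop_sd = statistics.pstdev(population)

for n in (5, 30, 200):
    means = sample_means(population, n, draws=2000)
    simulated_se = statistics.stdev(means)  # spread of the sample means
    theoretical_se = pop_sd / n ** 0.5      # sigma / sqrt(n)
    print(f"n={n:>3}  simulated SE={simulated_se:.3f}  theory={theoretical_se:.3f}")
```

The two columns should track each other closely as n grows, even though the population itself is strongly skewed, which is the point such simulations are usually meant to make.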
/r/statistics
https://redd.it/yiuj4z
World's biggest employers
/r/Infographics
https://redd.it/yj1kp5