Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf
[Topic][Open] Open Discussion Thread: Anybody can post a general visualization question or start a fresh discussion!
Anybody can post a question related to data visualization or discussion in the monthly topical threads. Meta questions are fine too, but if you want a more direct line to the mods, click here
If you have a general question you need answered, or a discussion you'd like to start, feel free to make a top-level comment.
Beginners are encouraged to ask basic questions, so please be patient responding to people who might not know as much as yourself.
---
To view all Open Discussion threads, click here.
To view all topical threads, click here.
Want to suggest a topic? Click here.
/r/dataisbeautiful
https://redd.it/z9n9d5
Looking for gas price data by ZIP Code
As the subject says, I'm looking for historical gas price data by ZIP Code going back at least a year and, if possible, 8 years. The more detailed, the better. Daily data would be great.
/r/datasets
https://redd.it/zcpg7j
Politically Exposed Persons (PEPs) Data Set
This data comes from OpenSanctions.org: "A politically exposed person (PEP) is a person that has been entrusted with a prominent public function. PEPs include elected officials, members of government.
Integrating data about political actors is an essential step in making an open source due diligence database. However, it is a much more intricate task than collecting sanctions lists (of which there are only a few dozen), and fully addressing it will be the focus of a later stage of this project."
You can find the data here on more than 197,000 entities: https://www.opensanctions.org/datasets/peps/
I've re-hosted the "Targets as simplified CSV" with more than 170,000 people records for exploration:
https://app.gigasheet.com/spreadsheet/Politically-Exposed-Persons--PEPs---opensanctions-org/862d21cf_6eb3_44db_953b_33b9324527e6?public=true
/r/datasets
https://redd.it/z7y9ap
FIFA World Cup 2022 in Qatar saw many surprising results. Too many, compared with the previous tournaments. I'm running an experiment where I consistently bet on the least likely outcome and track how my fictional balance changes over time. This simple strategy would have worked fantastically in 2022 [OC]
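A minimal sketch of how such a fictional balance could be tracked, assuming a flat stake per match and decimal odds. The function name, the stake size, and the three example matches are made up for illustration and are not the poster's data or method.

```python
from typing import Iterable, List, Tuple

# One entry per match: (decimal odds of the least likely outcome,
# whether that outcome actually happened). Hypothetical structure.
Match = Tuple[float, bool]


def track_balance(matches: Iterable[Match], stake: float = 1.0,
                  start: float = 0.0) -> List[float]:
    """Return the running balance after each flat-stake bet on the long shot."""
    balance = start
    history = []
    for odds, longshot_won in matches:
        if longshot_won:
            # Decimal odds include the stake, so the profit is stake * (odds - 1).
            balance += stake * (odds - 1.0)
        else:
            balance -= stake
        history.append(balance)
    return history


# Made-up odds and results, purely for illustration.
example = [(5.0, False), (9.0, True), (4.0, False)]
print(track_balance(example))
```

With these toy inputs, one win at long odds outweighs two lost stakes, which is the effect the experiment is looking for.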
/r/dataisbeautiful
https://redd.it/zcm4r0
JSON data of ingredient combinations into recipe outputs
Hi there,
I've done a search of the sub and looked at a few sources but none seem to quite fit what I am looking for.
I am looking for a dataset (preferably JSON, but I can convert from other formats) which has:
1. Ingredients as inputs.
2. Recipes as output, for example:
{"scrambled egg":
{"ingredients": [
{"name": "milk", "amount": 100, "units": "ml"},
{"name": "egg", "amount": 2, "units": "whole"}]}}
The format is flexible. My use case is a cooking system for a non-commercial game, so I need a dataset that is machine-readable without having to use neural networks for language processing.
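For illustration, here is a rough sketch of how a game could consume data in that shape. It assumes the ingredients are stored as a JSON array and that pantry amounts use the same units as the recipe. The `craftable` helper and the pantry dictionary are hypothetical names, not part of any existing dataset.

```python
import json
from typing import Dict, List

# Sample data in the shape described above (ingredients as a JSON array).
RECIPES_JSON = """
{
  "scrambled egg": {
    "ingredients": [
      {"name": "milk", "amount": 100, "units": "ml"},
      {"name": "egg", "amount": 2, "units": "whole"}
    ]
  }
}
"""


def craftable(recipes: Dict[str, dict], pantry: Dict[str, float]) -> List[str]:
    """Return the recipes whose every ingredient is in the pantry in sufficient amount.

    Assumes pantry amounts are expressed in the same units as the recipe.
    """
    result = []
    for name, recipe in recipes.items():
        if all(pantry.get(item["name"], 0) >= item["amount"]
               for item in recipe["ingredients"]):
            result.append(name)
    return result


recipes = json.loads(RECIPES_JSON)
print(craftable(recipes, {"milk": 250, "egg": 6}))
```

Keying recipes by name and keeping each ingredient as a small object makes the lookup a plain dictionary walk, with no language processing involved.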
Thanks in advance, all
/r/datasets
https://redd.it/zckme1
I hate everything about this
/r/dataisugly
https://redd.it/zce8az
The most popular video games in Europe (2021)
/r/MapPorn
https://redd.it/zc9njh
Has Russia been at war with different European Countries?
/r/MapPorn
https://redd.it/zc6n87
[OC] Ski Resorts in North America
/r/dataisbeautiful
https://redd.it/zc8f1d
[R][P] I made a Hugging Face gradio demo for text-to-3D paper Score Jacobian Chaining
/r/MachineLearning
https://redd.it/zbwwmc
2 Twitter Datasets for Finance-related Tweets Have Been Open-Sourced
Hi 👋,
I want to share two datasets I built for multi-class text classification. One classifies finance-related tweets by sentiment (bullish, bearish, neutral) and the other classifies them by finance topic (20 topics). Both are released under the MIT license. Feel free to explore!
topic: https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic
sentiment: https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment
/r/datasets
https://redd.it/zbhzcz
Hot take: Kaggle for entry level CVs is very mid-2010s. Here's what I'd do instead.
Kaggle can be fun, but don't do it because you think it'll land you a job; that strategy has peaked and the noise is too high. People don't want to know you can apply some canned ML to a canned problem, and the frontier of ML research is deep into AI at this point, to the point where it's just straight up a different career. What I'd suggest instead is to practice asking questions and finding answers, which for this purpose should be as eye-catching as possible.
Download some city data and make a hilariously detailed plan for how to get good parking. Good can mean the cheapest or you can really have fun and try to optimize getting free parking at the risk of getting fined. Really learn about the domain, like be able to explain why it looks different on weekends because they allow alternate side parking or something. Bonus points for driving to the city and trying it out for real. Explain why your model's oversimplified.
This is just an example. IMHO it gets more to the heart of what data science really is today.
/r/datascience
https://redd.it/zby4e4
Languages of Britain and Ireland, 400 AD - 1900 AD.
/r/MapPorn
https://redd.it/zbph8v
[OC] The Highest Streaming Spotify Artist/Band From Each State
/r/Infographics
https://redd.it/zcv8bo
[E] Stats Professor teams up with nameless GHOUL to play a parody of SQUARE HAMMER in class
My professor is a rock star and made a stats parody of Square Hammer by Ghost, then played it live in class
Here is the link:
https://www.youtube.com/watch?v=hupRyzrFRrg
/r/statistics
https://redd.it/zcp2ax
[D] NeurIPS 2022 Outstanding Paper modified results significantly in the camera ready
The paper is "A Neural Corpus Indexer for Document Retrieval"
According to the revision record on OpenReview, this is how Table 1 read at the final modification of the rebuttal phase:
https://preview.redd.it/75ibpthipw3a1.png?width=720&format=png&auto=webp&s=fd5c6071db4eb3f47b8e41ded08aa253cbc07c4a
But in the camera-ready version, the results of the same experiment in Table 1 are obviously different from the original submission, and the difference is huge:
https://preview.redd.it/quwdju9npw3a1.png?width=720&format=png&auto=webp&s=421c6b48803331945610b27e6acd649563614d32
/r/MachineLearning
https://redd.it/zcdw0k
Looking for yearly lime price data in Mexico
I am currently doing a project for my master's thesis that requires yearly data of lime prices in Mexico from at least 1990 to 2017. I am looking to merge this with an already existing data set on crime in various Mexican municipalities. Any data set you can provide me that would aid in my project would be super helpful, thanks!
/r/datasets
https://redd.it/za694y
This is the farthest place on earth from any ocean
/r/MapPorn
https://redd.it/zclgxn
[OC] The F word in Popular Movies
/r/dataisbeautiful
https://redd.it/zcchs5
Europe/North Africa overlaid on US/Canada by latitude
/r/MapPorn
https://redd.it/zc9ry9
Mapped: Global Energy Prices, by Country in 2022
/r/visualization
https://redd.it/zc3cch
help utilizing the healthcare price transparency data from insurers (that have been published since July 1)
I've been playing around with the data the last couple of weeks and wondering if anyone here has had any success. The problems I'm facing are
1. The files are insanely large. Eventually the only way I've been able to open them is using the DADROIT large JSON viewer: http://dadroit.com (but even this only worked when I got an M1 Mac)
2. There's a ton of them and I have to trawl through hundreds of files which take ages to load, in order to get to one that contains the data that I need (I have a list of providers by NPI or TIN, and I want to find all the plans that have contracts with those doctors or provider groups, and what the negotiated rates are)
Has anyone had success or found any tools that allow you to browse the files and search them without having to do stuff locally?
Looking at files like these for Aetna: https://health1.aetna.com/app/public/#/one/insurerCode=AETNACVS_I&brandCode=ALICFI/machine-readable-transparency-in-coverage and these for United: https://transparency-in-coverage.uhc.com/?_gl=1*1arv6qk*_ga*NjE5NTQzMTM2LjE2NzAwODEwMDY.*_ga_HZQWR2GYM4*MTY3MDA4MTAwNy4xLjEuMTY3MDA4MTcwMy4wLjAuMA..
From what I can tell, the files are pretty consistent, so it's going to be relatively straightforward to do the work. It just takes a gigaton of time to find something that can do it.
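As a rough sketch of the filtering step, the function below walks in-network entries one at a time and yields the negotiated rates tied to a set of NPIs. The field names (`negotiated_rates`, `provider_groups`, `npi`, `negotiated_prices`, `negotiated_rate`, `billing_code`) are assumed from the published Transparency in Coverage in-network schema and should be checked against the actual files. The function name and sample records are made up, and files that reference provider groups indirectly are not handled.

```python
from typing import Dict, Iterable, Iterator, Set


def rates_for_npis(in_network_items: Iterable[dict],
                   wanted_npis: Set[int]) -> Iterator[Dict]:
    """Yield one record per negotiated price that involves any wanted NPI.

    Takes any iterable of entries, so they can arrive one at a time
    from a streaming parser instead of a fully loaded file.
    """
    for item in in_network_items:
        for rate in item.get("negotiated_rates", []):
            matched = set()
            for group in rate.get("provider_groups", []):
                matched.update(wanted_npis.intersection(group.get("npi", [])))
            if not matched:
                continue
            for price in rate.get("negotiated_prices", []):
                yield {
                    "billing_code": item.get("billing_code"),
                    "npis": sorted(matched),
                    "negotiated_rate": price.get("negotiated_rate"),
                }


# Made-up sample entries, only to show the expected shape.
sample = [
    {"billing_code": "99213",
     "negotiated_rates": [{
         "provider_groups": [{"npi": [1111111111, 2222222222]}],
         "negotiated_prices": [{"negotiated_rate": 85.5}]}]},
    {"billing_code": "99214",
     "negotiated_rates": [{
         "provider_groups": [{"npi": [3333333333]}],
         "negotiated_prices": [{"negotiated_rate": 120.0}]}]},
]
for row in rates_for_npis(sample, {1111111111}):
    print(row)
```

Because it only needs an iterable, the same function could be fed by an incremental JSON parser (the third-party `ijson` package is one option), so a multi-gigabyte file never has to fit in memory.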
/r/datasets
https://redd.it/zbizft
Critique My Resume- Don’t Hold Back- Thank You
/r/datascience
https://redd.it/zbr91d
Provinces and Territories of Canada as European countries of similar population
/r/MapPorn
https://redd.it/zbydt3
[OC] Most Medals Won at the World Cup Football
/r/dataisbeautiful
https://redd.it/zbimgy
[OC] The basics of CSS
/r/dataisbeautiful
https://redd.it/zc18dl
A map of all of the world’s lakes
/r/MapPorn
https://redd.it/zbqg6v
Population difference between Europe and Africa
/r/MapPorn
https://redd.it/zbjr0c