% of people who think one of their main goals in life is to make their parents proud
/r/dataisugly
https://redd.it/zsypae
[P] Annotated History of Modern AI and Deep Learning (Schmidhuber)
https://people.idsia.ch/~juergen/deep-learning-history.html
/r/MachineLearning
https://redd.it/ztxyui
[OC] NFL wide receiver Justin Jefferson is on pace to break the single-season receiving yards record
/r/dataisbeautiful
https://redd.it/ztwgd5
Compared to your other family members, how would you rate yourself as a gift giver on a scale of 0-100? [OC]
/r/dataisbeautiful
https://redd.it/ztjykq
Inequality in annual earnings worsens in 2021: Top 1% of earners get a larger share of the earnings pie while the bottom 90% lose ground
https://www.epi.org/publication/inequality-2021-ssa-data/?utm_source=sillychillly
/r/dataisbeautiful
https://redd.it/ztrxaf
Take a moment. How certain are you that you are currently awake and not dreaming? (Everyone)
https://forms.gle/B16YJVwH5feK4HQKA
/r/SampleSize
https://redd.it/ztpcdv
Q If randomization can't assure balanced proportions of potentially influential characteristics within experimental groups, how useful are quasi-causal analyses like matching?
/r/statistics
https://redd.it/ztf5cv
Master u/etoipiplus1, I salute you with this Desmos homage to your fractal creations
/r/mathpics
https://redd.it/zpquyd
[OC] Animation of Arctic blast sweeping down and across Canada and the US
/r/dataisbeautiful
https://redd.it/ztkgvh
Is your country in green or slightly different green?
/r/dataisugly
https://redd.it/ztik6h
P App that Determines Whether You've Been Naughty or Nice Based on Your Reddit Comments
Hex Application
Since we are heading into the holiday season, I thought it would be interesting to see whether you could build a model that assesses morality from users' Reddit comments. I used Scikit-Learn's Logistic Regression model for this.
I started by downloading around 750 comments from Social Grep's website. They have pulled Reddit comments from different sets of subreddits; I drew from their datasets for confession-like subreddits, the irl subreddits, and the dataset subreddit. I classified the comments manually against a fixed morality rubric. Once they were scored, I trained and tested the logistic model on those comments.
For per-user testing, I used PRAW to pull the 50 most recent comments from the username provided in the Hex application. I ran the trained model to get the probability of each comment being nice, averaged those probabilities, and used that value to determine whether the user was naughty or nice. A script then emails the user a CSV with all of the tested comments and the final score.
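The scoring step described above can be sketched as follows. This is a minimal illustration, not the author's actual code: the function name, the 0.5 cutoff, and the input probabilities are all assumptions.

```python
# Hypothetical sketch of the naughty/nice verdict: average the per-comment
# "nice" probabilities from the trained model and apply a 0.5 cutoff.
def verdict(nice_probs, threshold=0.5):
    score = sum(nice_probs) / len(nice_probs)
    return ("nice" if score >= threshold else "naughty"), score

# Example with made-up per-comment probabilities:
label, score = verdict([0.9, 0.7, 0.4])
```
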
Based on the results that have come through so far, the model is definitely biased towards giving the user a nice verdict. I believe that's because the training data is around 70% nice versus naughty. Does anyone have a way to keep the model from being biased like that?
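One common answer to the imbalance question is scikit-learn's `class_weight="balanced"` option, which reweights each class inversely to its frequency so the minority class isn't drowned out. A minimal sketch with a toy corpus (the tiny comment list and labels below are invented for illustration; the real training data would be the ~750 labeled comments):

```python
# Sketch: countering a ~70/30 nice/naughty class imbalance with
# class_weight="balanced" in scikit-learn's LogisticRegression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented corpus, deliberately imbalanced toward "nice" (label 1).
comments = ["you are wonderful", "happy holidays to all", "great job everyone",
            "what a kind gesture", "thanks for sharing", "lovely work",
            "you are terrible", "this is awful", "go away"]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(comments)

# "balanced" sets each class weight to n_samples / (n_classes * class_count),
# boosting the minority "naughty" class during fitting.
clf = LogisticRegression(class_weight="balanced")
clf.fit(X, labels)

probs = clf.predict_proba(vec.transform(["this is awful"]))[0]
```

Other options worth trying: resampling the training set toward 50/50, or simply raising the probability threshold required for a "nice" verdict.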
Feel free to try the app out and let me know what you think!
/r/MachineLearning
https://redd.it/zspu96
[OC] The cost of Christmas varies widely across the world, from less than $100 to over $2000
/r/dataisbeautiful
https://redd.it/ztbovn
How a spider builds an orb web: generate high tensile strength anchor, bridge & frame threads; thin radius threads, and a sticky capture spiral
/r/Infographics
https://redd.it/zta47e
Trinidad and Tobago’s Annual Air Visitor Arrivals
/r/Infographics
https://redd.it/zsrbss
Just because I know everyone here loves these kinds of charts: an American press publication showing the differences in troop strengths among the warring countries, 1914.
/r/dataisugly
https://redd.it/ztiwmb
ISO: Data for the exact numbers that people imagine corresponding to words like "a few," "several," "lots," etc.
My ask is mostly in the title. I'm looking for data on what people infer from quantifier words in English, such as "a few," "many," and "several."
Conceptually, I'm looking for how people think about these kinds of questions:
* For 10 of something, would you say "several", "many", or "something else"...?
* If someone said they had "several" of something, what's the most and least they might reasonably mean?
* Could 3 of something be "a couple"?
I think I saw an infographic with this a while back...?
/r/datasets
https://redd.it/zth26y
[OC] Yeah Science! Scientific Output vs. National Wealth
/r/dataisbeautiful
https://redd.it/ztjxbn
Interactive map: Coldest day of the year across the United States
https://www.climate.gov/news-features/understanding-climate/interactive-map-coldest-day-year-across-united-states
/r/dataisbeautiful
https://redd.it/zu1vuj
[OC] Big fan of IASIP here! Put together a fun little visual. Happy holidays!
/r/dataisbeautiful
https://redd.it/ztrxrf
US Prison Population and Crime Rate over time [OC]
/r/dataisbeautiful
https://redd.it/ztzln1
[OC] Mexico welcomes nearly half of all international arrivals into Latin America + the Caribbean each year.
/r/dataisbeautiful
https://redd.it/ztrbdj
P Extracting and Structuring Recipes Using GPT3
I've been experimenting with GPT3 for different use cases over the past few weeks. The latest was seeing how well it could parse structured data out of recipe free text, and how well it could further enrich that data.
The general idea was to chain a few different prompts to the model, with the output of one prompt feeding into the next:
1. Extract ingredients and instructions from the recipe
2. Given the ingredients, group them together into categories
3. Given the full structured recipe generated above, enrich it further with additional metadata (time to cook, healthiness, etc)
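The three-step chain above can be sketched as plain prompt plumbing. This is not the author's code: the function, the prompt wording, and the `complete` callable (which would wrap an actual GPT-3 API call) are all illustrative assumptions.

```python
# Sketch of chained prompting: each step's output is spliced into the
# next step's prompt. `complete` stands in for a GPT-3 completion call.
def run_pipeline(recipe_text, complete):
    # Step 1: extract ingredients and instructions from the free text.
    extracted = complete(
        "Extract the ingredients and instructions from this recipe:\n" + recipe_text)
    # Step 2: group the extracted ingredients into categories.
    grouped = complete(
        "Group these ingredients into categories (grains, dairy, ...):\n" + extracted)
    # Step 3: enrich the full structured recipe with extra metadata.
    enriched = complete(
        "Given this structured recipe, add metadata such as cook time "
        "and healthiness:\n" + extracted + "\n" + grouped)
    return enriched

# Usage with a stub completion function (no API call):
result = run_pipeline("Pumpkin Pie: mix, bake.",
                      lambda prompt: "[model output for] " + prompt[:30])
```

In practice each `complete` call would hit the model with a few-shot prompt; the stub here just shows how the outputs thread together.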
This worked out better than I expected - given an input recipe I'm able to consistently (and accurately) extract the constituent parts and group the ingredients together logically (like grains, dairy, etc).
I wrote about it here: https://binal.pub/2022/12/extracting-and-structuring-recipes-using-gpt3/
One thing that surprised me as well was that this turned out to be a decent recipe generator. Instead of a full recipe, I could input just "Pumpkin Pie," and the structured response at the end would be the ingredients and instructions to bake a pumpkin pie, with quantities and timings about what you'd expect.
/r/MachineLearning
https://redd.it/zrzhtq
[OC] Top 10 most used words by subreddit in 2022 (reuploaded)
https://redd.it/zt0zab
@datascientology
Discussion Anyone else having a hard time not getting mad/cringing at the general public anthropomorphizing the hell out of chatGPT?
It was one thing with DALLE-2, but at least it couldn't talk back to them. I have been in board meetings where powerful people in leadership positions that have nothing to do with tech have absolutely horrendous ideas about what ChatGPT is. I am not lying: I have genuinely heard them say they believe it's basically conscious, and they use screenshot excerpts of it saying it hates humans as a basis for business decisions about the future of AI in their company. Like... WHAT? Have other people heard absurd things like this too?
I think it's just hard to watch the professional reality of machine learning become so divorced from the general public's idea of it. I'm sure as we all get even better at our jobs it's only going to get much, much worse. I wouldn't be surprised if soon we are the new magical witches of the world. I'll see you guys on the pyres in 20 years. (OK, really, I'm just joking about that last part.)
What do you all think?
/r/MachineLearning
https://redd.it/ztbsf5
Wheel of Emotional Granularity [by Abby VanMuijen with Michelle McGhee]
/r/Infographics
https://redd.it/zs39x4
Sample Peyote: generate multi-table synthetic data on any topic using GPT-3
Last weekend, I created a tool that uses GPT-3 to create synthetic datasets. I call it Sample Peyote, because it hallucinates sample data sets.
Here's a Star Wars dataset that it generated. There are several more examples linked from the README on github. Source code is there, too.
This was mostly a kick-the-tires project to understand what GPT is capable of, but I wanted it to be based in a real workflow with nontrivial requirements:
Start from scratch: Most synthetic data generators work by taking a sample of real data, and generating a fake dataset that has similar properties. I want to generate (aka "hallucinate") data starting from just an idea.
Cover any topic: I want to be able to generate data related to many different topics.
Generate a database, not just a table: I don't just want to generate a table. I want to generate a realistic-feeling database, with multiple tables and realistic use of things like foreign keys, ENUMs, and timestamps.
Pass the **Enhance That! test**: Generate data that "feels authentic."
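A two-stage prompting approach is one plausible way to satisfy the "database, not just a table" requirement: ask the model for a schema first, then for rows per table, so foreign keys stay consistent. The functions and prompt wording below are hypothetical illustrations, not Sample Peyote's actual implementation.

```python
# Hypothetical sketch: generate a schema first, then rows per table,
# so the second stage can reference consistent keys from the first.
def schema_prompt(topic):
    return (
        f"Design a small relational database about {topic}. "
        "List 3-5 tables with columns, types (including ENUMs and "
        "timestamps), and foreign keys, as JSON.")

def rows_prompt(topic, table_json):
    return (
        f"Generate 10 realistic sample rows for this table from a "
        f"{topic} database, as CSV with a header:\n{table_json}")

print(schema_prompt("Star Wars"))
```

Each prompt string would be sent to GPT-3, with the schema JSON from the first stage fed into the second.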
I'd love feedback, and ideas for use cases.
/r/datasets
https://redd.it/zrr2yr
TIL: the Uruguay river is mostly not in Uruguay and it doesn't start there.
/r/MapPorn
https://redd.it/zt3azv