datascientology | Education

Telegram channel datascientology - Data Scientology

1234

Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf

Data Scientology

[OC] The cost of Christmas varies widely across the world, from less than $100 to over $2000

/r/dataisbeautiful
https://redd.it/ztbovn

Data Scientology

How a spider builds an orb web: generate high tensile strength anchor, bridge & frame threads; thin radius threads, and a sticky capture spiral

/r/Infographics
https://redd.it/zta47e

Data Scientology

Trinidad and Tobago’s Annual Air Visitor Arrivals

/r/Infographics
https://redd.it/zsrbss

Data Scientology

Countries where there were executions in 2022

https://redd.it/zsy3lt
@datascientology

Data Scientology

[P] A self-driving car using Nvidia Jetson Nano, with movement controlled by a pre-trained convolutional neural network (CNN) written in Taichi

Intro & source code: https://github.com/houkensjtu/taichi-hackathon-akinasan

1. The circuit of an ordinary RC toy car is modified so that the Jetson Nano can control the car's movement through its GPIO port. A motor drive controller is needed here, because the Jetson Nano's maximum output current is not enough to drive the car motor directly.
2. The convolutional neural network (CNN) is implemented in the Taichi programming language.
3. Road data was collected, classified, and labeled, then used to train the CNN models.
4. The pre-trained model is imported onto the Jetson Nano and predicts an action for each image captured during driving.
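
The link between step 4 (action prediction) and step 1 (GPIO control) can be sketched roughly as below. This is only an illustration, not the project's code: the three action classes, the pin names, and the example `scores` list (a stand-in for the CNN output on one frame) are all assumptions, and a real car would write the levels through a GPIO library instead of returning a dictionary.

```python
import math

# Hypothetical action classes; the real project may use different ones.
ACTIONS = ["left", "forward", "right"]

# Hypothetical mapping from an action to logic levels on the motor-driver inputs.
PIN_LEVELS = {
    "left": {"steer_left": 1, "steer_right": 0, "drive": 1},
    "forward": {"steer_left": 0, "steer_right": 0, "drive": 1},
    "right": {"steer_left": 0, "steer_right": 1, "drive": 1},
}


def softmax(scores):
    """Convert raw class scores into probabilities."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(scores):
    """Pick the most probable action for one camera frame."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ACTIONS[best], probs[best]


def pins_for(action):
    """Look up the pin levels the motor driver would receive."""
    return PIN_LEVELS[action]


# Assumed scores standing in for the network output on one frame.
action, confidence = choose_action([0.2, 2.5, -1.0])
print(action, round(confidence, 3), pins_for(action))
```

Keeping the action-to-pin table separate from the prediction code means the wiring can change without touching the model side.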

Demo:

https://reddit.com/link/zshrlv/video/pcm3f6id3f7a1/player

/r/MachineLearning
https://redd.it/zshrlv

Data Scientology

[D] When ChatGPT stops being free: run a SOTA LLM in the cloud

TL;DR: I found GPU compute to be generally cheap: spot or on-demand instances with over 100 GB of VRAM can be launched on AWS for a few USD per hour. So I thought it would make sense to run your own inference endpoint for a SOTA LLM like Bloomz 176B whenever you need a few questions answered. That still seemed to make more sense than pouring money into a closed walled garden like "not-so-OpenAI" once they charge $$$ for ChatGPT or GPT-4. But I am struggling due to a lack of tutorials and resources.

Therefore, I carefully checked benchmarks, model parameters and sizes as well as training sources for all SOTA LLMs here.

Since reading the Chinchilla paper, I have known that model scaling according to OpenAI was wrong and that more params != better generation quality. So I was looking for the best-performing openly available LLM, in terms of quality and breadth, to use for multilingual everyday questions, code completion, and reasoning, similar to what ChatGPT provides (minus the fine-tuning for chat-style conversations).

My choice fell on Bloomz because it handles multilingual questions well and has good zero-shot performance for instructions and Q&A-style text generation. Confusingly, Galactica seems to outperform Bloom on several benchmarks. But since Galactica had a very narrow training set of only scientific papers, I guess its usefulness is probably limited for answers on non-scientific topics.

Therefore I tried running the original Bloom 176B, and alternatively Bloomz 176B, on AWS SageMaker JumpStart, which should be a one-click deployment. This fails after 20 minutes. On Azure ML, I tried using DeepSpeed-MII, which also supports Bloom, but it also fails, I guess due to the maximum instance size of 12 GB of VRAM.

From my understanding to save costs on inference, it's probably possible to use one or multiple of the following solutions:

- Precision: int8 instead of fp16.
- Microsoft/DeepSpeed-MII, for an up to 40x reduction in inference cost on Azure. It also supports int8 and fp16 Bloom out of the box, but it fails on Azure due to instance size.
- facebookresearch/xformers: not sure, but if I remember correctly this brought inference requirements down to 4 GB of VRAM for StableDiffusion, and DreamBooth fine-tuning down to 10 GB. No idea whether it is useful for reducing Bloom(z) inference cost, though.
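
To put the precision option in numbers, here is a back-of-the-envelope sketch of the memory needed just to hold the weights of a 176B-parameter model. It assumes 2 bytes per parameter for fp16 and 1 byte for int8, and it ignores activations, caches, and framework overhead, so real requirements would be higher.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


BLOOM_PARAMS = 176e9  # 176B parameters, as stated in the post

for name, nbytes in [("fp16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(BLOOM_PARAMS, nbytes):.0f} GB")
```

This gives 352 GB for fp16 and 176 GB for int8. Either figure is far above a 12 GB instance, and even int8 exceeds a single 100 GB device, so the weights would have to be split across several GPUs.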

I have a CompSci background but I am not familiar with most of this stuff, except that I have been running StableDiffusion since day one on my RTX 3080 under Linux and also fine-tuning with DreamBooth. But that was all just following YouTube tutorials. I can't find a single post or YouTube video of anyone explaining a full BLOOM / Galactica / BLOOMZ inference deployment on cloud platforms like AWS/Azure using one of the optimizations mentioned above, let alone deployment of the raw model. :(

I still can't figure it out by myself after 3 days.

TL;DR2: Trying to find like-minded people who are interested in running open-source SOTA LLMs, for when ChatGPT becomes paid or just for fun.

Any comments, inputs, rants, counter-arguments are welcome.

/end of rant

/r/MachineLearning
https://redd.it/zstequ

Data Scientology

In 2021 I created these infographics to show the vast difference between one and a trillion. I never really did anything with them so I wanted to share them here to get some feedback.

https://redd.it/zs1bez
@datascientology

Data Scientology

Map of percentages of respondents who say they would fight for their country

/r/MapPorn
https://redd.it/zshlqt

Data Scientology

[OC] English Words of Spanish Origin and the Number of Mentions in Wikipedia

/r/dataisbeautiful
https://redd.it/zs48my

Data Scientology

Are data science jobs affected by the tech bubble burst?

I teach a probability and statistics course to mostly computer science students, and I like to start the semester by talking about all the awesome job opportunities they'll have when they graduate.

I search Indeed for "data scientist" and share the number of active positions. Last semester there were substantially more than there are now - from roughly 24,000 down to around 14,000.

I imagine some students may worry that job opportunities are dwindling due to the supposedly bursting tech bubble, but I'm not sure whether this mainly affects pure programming jobs (my background is not in data science).

I'd love to hear from people in the field, especially if you've been on the job hunt lately - any words of encouragement to current students?

Thanks for any info!

/r/datascience
https://redd.it/zrc1b5

Data Scientology

[OC] I made these posters in late 2021 to illustrate the vast difference between one and a trillion. It served to help me better understand how massive some numbers really are. I never shared this out before so wanted to before the data become too out of date.

https://redd.it/zs1kow
@datascientology

Data Scientology

[Q] Getting a bachelor's in statistics as a female over 30 years old!

Hey y’all! My wife is considering getting a statistics degree. She really likes statistics and even passed college statistics with an A while most of her classmates had to retake the class.

Our question is: is getting a degree in statistics as a female in her early 30s a good idea?

Is the R.O.I there?

Will employers overlook her due to age and/or being a female?

All replies and advice are welcome.

Thanks!

/r/statistics
https://redd.it/zs0mgi

Data Scientology

Is it normal to be quite forgetful of techniques/methods in data science?

I’m currently working as a Data Analyst. My background is in Physics, so whilst I have a strong mathematical background and I’m used to remembering and working with a lot of equations, I’ve never had any “formal” statistics/data science training.

In my work, I’ve found myself using a range of analytical techniques. There’s the stuff I do every day, like computing basic summary statistics since I work mainly with categorical data, but also things like linear regression and various significance tests (t-test, chi-squared), as well as more “complicated” techniques such as decision trees, and even things like forecasting.

However, every time I spend a few weeks away from one of these things (like decision trees), I completely forget how they work. I can remember things like there’s nodes and branches and it makes splits based on entropy, but beyond that it’s like I’ve forgotten everything I’ve read. Same with forecasting - I know that ARIMA models exist and that there’s different terms calculated which take into account trend and seasonality, but beyond that I’ve forgotten.
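
As a quick refresher on the "splits based on entropy" part, here is a minimal sketch of how a candidate split can be scored by information gain. The toy labels are made up, and this is only one common criterion: some libraries default to Gini impurity instead.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (
        len(right) / total
    ) * entropy(right)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly.
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))
# A split that leaves both children as mixed as the parent.
print(information_gain(parent, ["yes", "no"], ["yes", "no"]))
```

At each node the tree tries the candidate splits and keeps the one with the highest gain, then repeats the process on the resulting children.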

Is this normal?

/r/datascience
https://redd.it/zrtzf4

Data Scientology

[OC] Mexico now leads the OG European beer countries in exports

/r/dataisbeautiful
https://redd.it/zrw8bb

Data Scientology

FIFA World Cup 2026 — Host cities

/r/MapPorn
https://redd.it/zrn0zk

Data Scientology

Wheel of Emotional Granularity [by Abby VanMuijen with Michelle McGhee]

/r/Infographics
https://redd.it/zs39x4

Data Scientology

Sample Peyote: generate multi-table synthetic data on any topic using GPT-3

Last weekend, I created a tool that uses GPT-3 to create synthetic datasets. I call it Sample Peyote, because it hallucinates sample data sets.

Here's a Star Wars dataset that it generated. There are several more examples linked from the README on GitHub. Source code is there, too.

This was mostly a kick-the-tires project to understand what GPT is capable of, but I wanted it to be based in a real workflow with nontrivial requirements:

- Start from scratch: Most synthetic data generators work by taking a sample of real data, and generating a fake dataset that has similar properties. I want to generate (aka "hallucinate") data starting from just an idea.
- Cover any topic: I want to be able to generate data related to many different topics.
- Generate a database, not just a table: I don't just want to generate a table. I want to generate a realistic-feeling database, with multiple tables and realistic use of things like foreign keys, ENUMs, and timestamps.
- Pass the **Enhance That! test**: Generate data that "feels authentic."
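
To make the "database, not just a table" requirement concrete, here is a small sketch of what such a target schema could look like, using Python's built-in `sqlite3`. It is not output from the tool: the table names, columns, and rows are placeholders invented for illustration, and a `CHECK` constraint stands in for an ENUM because SQLite has no ENUM type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE planets (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE characters (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        allegiance TEXT NOT NULL
                   CHECK (allegiance IN ('rebel', 'empire', 'neutral')),
        planet_id  INTEGER NOT NULL REFERENCES planets(id),
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    """
)

# Placeholder rows, invented for this sketch.
conn.execute("INSERT INTO planets (id, name) VALUES (1, 'Tatooine')")
conn.execute(
    "INSERT INTO characters (name, allegiance, planet_id) "
    "VALUES ('Example Pilot', 'rebel', 1)"
)


def insert_is_rejected(sql):
    """Return True if the database refuses the statement as inconsistent."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# A character that points at a planet that does not exist.
print(insert_is_rejected(
    "INSERT INTO characters (name, allegiance, planet_id) "
    "VALUES ('Orphan', 'rebel', 99)"
))
# A character whose allegiance is outside the allowed set.
print(insert_is_rejected(
    "INSERT INTO characters (name, allegiance, planet_id) "
    "VALUES ('Pirate', 'pirate', 1)"
))
```

Generated rows that violate the foreign key or the allowed-values check are refused, which is one cheap way to test whether hallucinated data is at least internally consistent.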

I'd love feedback, and ideas for use cases.

/r/datasets
https://redd.it/zrr2yr

Data Scientology

TIL: the Uruguay river is mostly not in Uruguay and it doesn't start there.

/r/MapPorn
https://redd.it/zt3azv

Data Scientology

I made repurrsive Sierpinski triangle cat stacks

/r/mathpics
https://redd.it/zq6d6r

Data Scientology

Hey guys, I have 19 days to prepare for a 2-hour onsite Data Science interview. How should I prepare to maximise my chances?

The company is in the **aerospace and defense sector**.

Job Requirements:

* Good understanding of Python
* Solid foundation in statistics and machine learning, in particular in forecasting methods
* Solid knowledge of database operations (SQL and NoSQL)
* A strong taste for creating impactful visualizations (dataviz)
* Experience, if possible in an industrial context, with at least one dashboard creation software/library and data visualization libraries (PowerBI, Grafana, Tableau, Dash, Shiny, D3.js, ...)

If you have roadmaps for preparing for this kind of interview, please share them with me. This is my dream job and I am willing to work hard to get it. I have some basics in data science, statistics, and probability, but I want to start from scratch.

/r/datascience
https://redd.it/zsvcys

Data Scientology

[OC] US states sorted by life expectancy, colored by Biden's share of the 2020 Presidential Election

/r/dataisbeautiful
https://redd.it/zslrnq

Data Scientology

The Life of a Carbon Atom.

/r/Infographics
https://redd.it/zsn0zp

Data Scientology

These data have been displayed much better than this

/r/dataisugly
https://redd.it/zq0bvx

Data Scientology

Rural Equivalent Of New York City, Los Angeles And Chicago

/r/MapPorn
https://redd.it/zs1ptz

Data Scientology

Pandas 1.5.0 or later has copy-on-write (CoW), which can be optionally enabled, removes inconsistencies, and speeds up many operations.
https://towardsdatascience.com/a-solution-for-inconsistencies-in-indexing-operations-in-pandas-b76e10719744
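
A minimal sketch of what opting in looks like, assuming pandas 1.5.0 or later is installed. The DataFrame contents are made up for the example.

```python
import pandas as pd

# Opt in to copy-on-write (available as an option since pandas 1.5.0).
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

subset = df["a"]      # behaves like a copy; the actual copy is deferred
subset.iloc[0] = 100  # triggers the copy, so only subset is modified

print(df["a"].tolist())
print(subset.tolist())
```

With the option enabled, writing to `subset` leaves `df` untouched, so whether a selection was a view or a copy no longer changes the result.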

/r/datascience
https://redd.it/zsbxov

Data Scientology

[D] Using "duplicates" during training?

I have collected experimental data for various conditions. In order to ensure repeatability, each test is replicated 5 times: which means same input but slightly different output due to experimental variability.

If you were to build a machine learning algorithm, would you use all 5 data points for each given test, hoping that your algorithm will learn to converge towards the mean response? Or is it advisable to pre-compute the means and feed only these to the model (so that one input can only have one output)?
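
One way to compare the two options is to try both on a toy case. The sketch below uses made-up numbers and a plain least-squares line fit, and it assumes every condition has the same number of replicates. Under those assumptions the two fits give the same parameters. That need not hold for other loss functions, for regularized models, or for unequal replicate counts.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: 3 conditions, 5 replicates each.
replicates = {
    1.0: [2.1, 1.9, 2.0, 2.2, 1.8],
    2.0: [4.1, 3.8, 4.0, 4.3, 3.9],
    3.0: [6.2, 5.9, 6.1, 5.8, 6.0],
}

# Option 1: keep every replicate as its own training point.
all_x = [x for x, ys in replicates.items() for _ in ys]
all_y = [y for ys in replicates.values() for y in ys]

# Option 2: collapse each condition to its mean response.
mean_x = list(replicates)
mean_y = [sum(ys) / len(ys) for ys in replicates.values()]

print(fit_line(all_x, all_y))
print(fit_line(mean_x, mean_y))
```

Both calls print the same slope and intercept up to floating-point rounding. Keeping the replicates still preserves the spread at each condition, which matters if the noise level itself is of interest.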

I can see pros and cons to both approaches and would welcome feedback. Thank you.

/r/MachineLearning
https://redd.it/zsbivc

Data Scientology

‘Cause who wants column charts or marimekkos?

/r/dataisugly
https://redd.it/zr5xxb

Data Scientology

'It is a duty towards society to have children' % that agree

/r/MapPorn
https://redd.it/zs1zbl

Data Scientology

Most spoken language in the USA

/r/MapPorn
https://redd.it/zrtmq1

Data Scientology

[OC] Among Big Tech, Amazon spends the most on R&D

/r/dataisbeautiful
https://redd.it/zrsfko
