datascientology | Education

Telegram channel datascientology - Data Scientology

1234

Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf


Data Scientology

Countries where there were executions in 2022

https://redd.it/zsy3lt
@datascientology


Data Scientology

[P] A self-driving car using Nvidia Jetson Nano, with movement controlled by a pre-trained convolutional neural network (CNN) written in Taichi

Intro & source code: https://github.com/houkensjtu/taichi-hackathon-akinasan

1. The circuit of an ordinary RC toy car is modified so that the Jetson Nano can control the car's movement through its GPIO port. A motor driver controller is needed here, because the maximum output current of the Jetson Nano is not enough to drive the car's motor directly.
2. The convolutional neural network (CNN) is implemented in the Taichi programming language.
3. Road data was collected, then classified and labeled, and finally used to train the CNN models.
4. The pre-trained model is loaded onto the Jetson Nano and used to predict an action for each image captured while driving.
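
Step 4 can be illustrated with a minimal sketch of how raw CNN outputs might be turned into a driving action. This is not the project's code: the action labels, the `choose_action` helper, and the example logits are all hypothetical, and plain Python stands in for the Taichi model.

```python
import math

# Hypothetical action set; the original project may use different labels.
ACTIONS = ["forward", "left", "right", "stop"]


def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(logits):
    """Pick the most probable action and report its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ACTIONS[best], probs[best]


# Made-up logits standing in for the CNN output on one camera frame.
action, confidence = choose_action([0.2, 2.5, -1.0, 0.1])
print(action, round(confidence, 3))
```

The chosen action would then be translated into GPIO signals for the motor driver; that part depends on the wiring and is left out here.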

Demo:

https://reddit.com/link/zshrlv/video/pcm3f6id3f7a1/player

/r/MachineLearning
https://redd.it/zshrlv


Data Scientology

[D] When ChatGPT stops being free: Run SOTA LLM in cloud

TL;DR: I found GPU compute to be generally cheap: spot or on-demand instances with over 100GB of VRAM can be launched on AWS for a few USD/hour. So I thought it would make sense to run your own inference endpoint for a SOTA LLM like Bloomz 176B whenever you have a few questions to answer. That still seems to make more sense than shoving money into a closed walled garden like "not-so-OpenAI" once they make ChatGPT or GPT-4 available for $$$. But I'm struggling due to the lack of tutorials/resources.

Therefore, I carefully checked benchmarks, model parameters and sizes as well as training sources for all SOTA LLMs here.

Since reading the Chinchilla paper, I know that model scaling according to OpenAI was wrong and that more params != better generation quality. So I was looking for the best-performing openly available LLM in terms of quality and breadth, to use for multilingual everyday questions/code completion/reasoning similar to what ChatGPT provides (minus the fine-tuning for chat-style conversations).

My choice fell on Bloomz, because it handles multilingual questions well and has good zero-shot performance for instructions and Q&A-style text generation. Confusingly, Galactica seems to outperform Bloom on several benchmarks. But since Galactica had a very narrow training set, using only scientific papers, I guess its usefulness is probably limited for answers on non-scientific topics.

Therefore I tried running the original Bloom 176B and, alternatively, Bloomz 176B on AWS SageMaker JumpStart, which should be a one-click deployment. This fails after 20 min. On Azure ML, I tried using DeepSpeed-MII, which also supports Bloom, but it also fails, I guess due to the maximum instance size of 12GB VRAM.

From my understanding, to save costs on inference it's probably possible to use one or more of the following solutions:

- Precision: int8 instead of fp16
- Microsoft/DeepSpeed-MII, for up to a 40x reduction in inference cost on Azure. It also supports int8 and fp16 Bloom out of the box, but it fails on Azure due to instance size.
- facebookresearch/xformers: not sure, but if I remember correctly this brought inference requirements down to 4GB VRAM for StableDiffusion and DreamBooth fine-tuning down to 10GB. No idea if this is useful for Bloom(z) inference cost reduction though.
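
To see why precision matters here, a rough back-of-the-envelope sketch of the memory needed just to hold the weights of a 176B-parameter model at different precisions. The `weight_memory_gb` helper is made up for illustration, and the estimate deliberately ignores activations, the attention cache, and framework overhead, so real requirements are higher.

```python
# Parameter count taken from the model name (176B).
BLOOM_PARAMS = 176_000_000_000

# Bytes needed to store a single parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params, precision):
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes activations."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp16", "int8"):
    print(precision, weight_memory_gb(BLOOM_PARAMS, precision), "GB")
```

Under these assumptions, int8 halves the weights-only footprint compared to fp16, which directly affects how many GPUs an instance needs.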

I have a CompSci background but I am not familiar with most of this stuff, except that I have been running StableDiffusion since day one on my RTX 3080 using Linux and also doing fine-tuning with DreamBooth. But that was all just following YouTube tutorials. I can't find a single post or YouTube video of anyone explaining a full BLOOM / Galactica / BLOOMZ inference deployment on cloud platforms like AWS/Azure using one of the optimizations mentioned above, let alone deployment of the raw model. :(

I still can't figure it out by myself after 3 days.

TL;DR2: Trying to find like-minded people who are interested in running open-source SOTA LLMs for when ChatGPT becomes paid, or just for fun.

Any comments, inputs, rants, counter-arguments are welcome.

/end of rant

/r/MachineLearning
https://redd.it/zstequ


Data Scientology

In 2021 I created these infographics to show the vast difference between one and a trillion. I never really did anything with them so I wanted to share them here to get some feedback.

https://redd.it/zs1bez
@datascientology


Data Scientology

Map of percentages of respondents who say they would fight for their country

/r/MapPorn
https://redd.it/zshlqt


Data Scientology

[OC] English Words of Spanish Origin and the Number of Mentions in Wikipedia

/r/dataisbeautiful
https://redd.it/zs48my


Data Scientology

Are data science jobs affected by the tech bubble burst?

I teach a probability and statistics course to mostly computer science students, and I like to start the semester by talking about all the awesome job opportunities they'll have when they graduate.

I search Indeed for "data scientist" and share the number of active positions. Last semester there were substantially more than there are now - from roughly 24,000 down to around 14,000.

I imagine some students may be concerned that job opportunities are dwindling due to the supposedly bursting tech bubble, but I'm not sure whether this mostly affects pure programming jobs (my background is not in data science).

I'd love to hear from people in the field, especially if you've been on the job hunt lately - any words of encouragement to current students?

Thanks for any info!

/r/datascience
https://redd.it/zrc1b5


Data Scientology

[OC] I made these posters in late 2021 to illustrate the vast difference between one and a trillion. It served to help me better understand how massive some numbers really are. I never shared this out before so wanted to before the data become too out of date.

https://redd.it/zs1kow
@datascientology


Data Scientology

[Q] Getting a bachelor's in statistics as a female over 30 years old!

Hey y’all! My wife is considering getting a statistics degree. She really likes statistics and even passed college statistics with an A while most of her classmates had to retake the class.

Our question is:

Is getting a degree in statistics as a female in her early 30s a good idea?

Is the R.O.I there?

Will employers overlook her due to age and/or being a female?

All replies and advice are welcome.

Thanks!

/r/statistics
https://redd.it/zs0mgi


Data Scientology

Is it normal to be quite forgetful of techniques/methods in data science?

I’m currently working as a Data Analyst. My background is in Physics, so whilst I have a strong mathematical background and I’m used to remembering and working with a lot of equations, I’ve never had any “formal” statistics/data science training.

In my work, I’ve found myself using a range of analytical techniques. There’s the stuff I do every day, like computing basic summary statistics since I work mainly with categorical data, but also things like linear regression, various significance tests (t-test, chi squared), to more “complicated” techniques such as decision trees, and even things like forecasting.

However, every time I spend a few weeks away from one of these things (like decision trees), I completely forget how they work. I can remember that there are nodes and branches and that splits are made based on entropy, but beyond that it’s like I’ve forgotten everything I’ve read. Same with forecasting: I know that ARIMA models exist and that there are different terms which take trend and seasonality into account, but beyond that I’ve forgotten.
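
As a quick refresher on the "splits based on entropy" part, here is a minimal sketch of how a decision tree might score a candidate split using information gain. The helper names and the toy labels are made up for illustration; real libraries add many refinements (other impurity measures, pruning, handling of continuous features).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]

# A split that separates the classes perfectly removes all uncertainty.
perfect_gain = information_gain(parent, ["yes", "yes"], ["no", "no"])

# A split that leaves both children as mixed as the parent gains nothing.
useless_gain = information_gain(parent, ["yes", "no"], ["yes", "no"])

print(perfect_gain, useless_gain)
```

At each node, the tree evaluates the candidate splits and keeps the one with the highest gain, then repeats the process on the children.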

Is this normal?

/r/datascience
https://redd.it/zrtzf4


Data Scientology

[OC] Mexico now leads the OG European beer countries in exports

/r/dataisbeautiful
https://redd.it/zrw8bb


Data Scientology

FIFA World Cup 2026 — Host cities

/r/MapPorn
https://redd.it/zrn0zk


Data Scientology

[N] Point-E: a new DALL-E-like model that generates 3D Point Clouds from Prompts

It's only been a month since OpenAI released ChatGPT, and yesterday they launched Point-E, a new DALL-E-like model that generates 3D Point Clouds from Complex Prompts. As someone who is always interested in the latest advancements in machine learning, I was really excited to dig into this paper and see what it had to offer.

One of the key features of Point-E is its use of diffusion models to generate synthetic views and 3D point clouds. These models use text input to generate an image, which is then used as a reference for generating the 3D point cloud. This process takes only 1-2 minutes on a single GPU, making it much faster than previous state-of-the-art methods.

While the quality of the samples produced by Point-E may be lower than those produced by other methods, the speed of generation makes it a practical option for certain use cases.

If you're interested in learning more about this new model and how it was developed, I highly recommend giving the full paper a read. But if you'd rather read the gist of it, I added a link to an overview blog post I published about it.

The blog: https://dagshub.com/blog/overview-of-point-e/

The paper: https://arxiv.org/abs/2212.08751

I'm sure I didn't capture every insight while writing the blog post, and I'd love to get your thoughts about the model and how OpenAI developed it.

/r/MachineLearning
https://redd.it/zrfy75


Data Scientology

Got my first Data Science job!!!

I just graduated with a master's in Data Science last Friday and I got my first job in my degree field. I applied for the position on December 1st, and after 2 interviews I got the call this afternoon. My best advice is don’t get hung up on the job title, look at the description. Mine was listed as a programmer but it is working with SQL, Python and Tableau. I wouldn’t have found it based on the title.

/r/datascience
https://redd.it/zr3xli


Data Scientology

I am so happy with this purchase.

/r/mathpics
https://redd.it/zrnmih


Data Scientology

I made repurrsive Sierpinski triangle cat stacks

/r/mathpics
https://redd.it/zq6d6r


Data Scientology

Hey guys, I have 19 days to prepare for a 2-hour onsite Data Science interview. How should I prepare for it to maximise my chances?

The company is in the **aerospace and defense sector**.

Job Requirements:

* Good understanding of Python
* Solid foundations in statistics and Machine Learning, in particular in forecasting methods
* Solid knowledge of database operations (SQL and NoSQL)
* A strong taste for creating impactful visualizations (dataviz)
* Experience, if possible in an industrial context, with at least one dashboard creation software/library and data visualization libraries (PowerBI, Grafana, Tableau, Dash, Shiny, D3.js, ...)

If you have roadmaps to prepare for this kind of interview, please share them with me. This is my dream job and I am willing to work hard to get it. I have some basics in Data Science, statistics, and probability, but I want to start from scratch.

/r/datascience
https://redd.it/zsvcys


Data Scientology

[OC] US states sorted by life expectancy, colored by Biden's share of the 2020 Presidential Election

/r/dataisbeautiful
https://redd.it/zslrnq


Data Scientology

The Life of a Carbon Atom.

/r/Infographics
https://redd.it/zsn0zp


Data Scientology

These data have been displayed much better than this

/r/dataisugly
https://redd.it/zq0bvx


Data Scientology

Rural Equivalent Of New York City, Los Angeles And Chicago

/r/MapPorn
https://redd.it/zs1ptz


Data Scientology

Pandas 1.5.0 or later has copy-on-write (CoW), which can be optionally enabled, removes inconsistencies, and speeds up many operations.
https://towardsdatascience.com/a-solution-for-inconsistencies-in-indexing-operations-in-pandas-b76e10719744
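
A minimal sketch of what opting in might look like, assuming pandas 1.5.0 or later is installed. The `copy_on_write` option is the documented switch; the DataFrame and column names are made up for illustration.

```python
import pandas as pd

# Opt in to copy-on-write (available as an option since pandas 1.5.0).
pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Take a column and modify it. With CoW enabled, the write is expected to
# trigger a copy, so the parent DataFrame is not changed as a side effect.
subset = df["a"]
subset.iloc[0] = 100

print(df["a"].tolist())
print(subset.tolist())
```

Without the option, whether such a write also changes `df` depends on the operation and the pandas version, which is the kind of inconsistency the linked article discusses.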

/r/datascience
https://redd.it/zsbxov


Data Scientology

[D] Using "duplicates" during training?

I have collected experimental data for various conditions. In order to ensure repeatability, each test is replicated 5 times: which means same input but slightly different output due to experimental variability.

If you were to build a machine learning algorithm, would you use all 5 data points for each given test, hoping that your algorithm will learn to converge towards the mean response? Or is it advisable to pre-compute the means and only feed these to the model (so that you ensure that one input can only have one output)?

I can see pros and cons to both approaches and would welcome feedback. Thank you.
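
One way to reason about the first option: under a squared-error loss, the prediction that best fits several replicates of the same input is their mean. A minimal sketch, assuming MSE loss and made-up replicate values, with a single constant standing in for the model's output at one input:

```python
from statistics import fmean

# Five hypothetical replicate outputs measured for the same input.
replicates = [9.8, 10.1, 10.0, 10.3, 9.8]
target_mean = fmean(replicates)

# Fit one constant prediction by gradient descent on the mean squared error.
prediction = 0.0
learning_rate = 0.1
for _ in range(200):
    # d/dc of mean((c - y)^2) is 2 * mean(c - y)
    gradient = 2 * fmean(prediction - y for y in replicates)
    prediction -= learning_rate * gradient

print(prediction, target_mean)
```

So with MSE, training on all replicates pulls predictions towards the per-input mean anyway. Other losses behave differently (absolute error targets the median), and pre-averaging discards the information about variability across replicates.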

/r/MachineLearning
https://redd.it/zsbivc


Data Scientology

‘Cause who wants column charts or marimekkos?

/r/dataisugly
https://redd.it/zr5xxb


Data Scientology

'It is a duty towards society to have children' % that agree

/r/MapPorn
https://redd.it/zs1zbl


Data Scientology

Most spoken language in the USA

/r/MapPorn
https://redd.it/zrtmq1


Data Scientology

[OC] Among Big Tech, Amazon spends the most on R&D

/r/dataisbeautiful
https://redd.it/zrsfko


Data Scientology

Tucker Carlson with some ugly data

/r/dataisugly
https://redd.it/zrp300


Data Scientology

Petition to bring back images in posts.

Look at all time top rated posts. We like images, and some of them are actually really good. Statistics need visualization.


You can also vote in this survey

/r/SampleSize
https://redd.it/zrbz8f


Data Scientology

[OC] Popularity of Social Media in US

/r/dataisbeautiful
https://redd.it/zqscip
