datascientology | Образование

Telegram-канал datascientology - Data Scientology

1234

Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf

Подписаться на канал

Data Scientology

[OC] The Highest Paid Twitch Streamers

/r/visualization
https://redd.it/yje17p

Читать полностью…

Data Scientology

[OC] How Harvard admissions rates Asian American candidates relative to White American candidates

/r/dataisbeautiful
https://redd.it/yjbefg

Читать полностью…

Data Scientology

[OC] The GDP of California in the G7 instead of the United States

/r/dataisbeautiful
https://redd.it/yig9js

Читать полностью…

Data Scientology

D Using JavaScript for ML Training/Research (not in the browser)

Accelerators are getting faster but Python remains slow. The investment into a true JIT for Python (CPython, the only one folks use for ML) seems to be far away.

Projects like JAX, Triton and Taichi are awesome but remain somewhat brittle to use. We believe this is because there isn't a true JIT in Python (like the one found in JavaScript) that these projects can leverage. On top of full endorsement from the language's community, building a true JIT takes millions of lines of heavily interconnected code and likely hundreds of thousands if not millions of engineer hours.[1\][2\] There is very little to indicate this extremely expensive type of investment will come from CPython.

As a hedge against CPython never becoming fast, we're creating a project called Shumai that attempts to deeply integrate with a new JavaScript runtime (Bun[3\]).

The Bun JS runtime (and other JS runtimes) have had true JIT compilation for over a decade and are fiercely competitive with each other. This has lead to a prosperous and efficiency-driven language that rapidly adopts new features and is very open to improvement. JavaScript has dropped semicolons, added `import` statements (learning from Python), made asynchronous programming native (this is very important for scaling models, but may not be relevant for small training), added a much faster FFI than Python (for C extensions, 10-100x less overhead), and even has things like importing directly from a URL (no need for `pip install`).

This type of ecosystem, in our minds, is a very natural fit for research.

It's also markedly faster than Python for anything that isn't large matrix multiplications that saturate the GPU[4\]. The work being done in kernel fusion (e.g. AITemplate, Triton, XLA) is still very relevant and necessary for the best performance. However, *other* work, such as doing complicated bookkeeping, simulating games for RL, running very dynamic networks, interleaving complex network code with computation, etc. would all benefit from a faster host language.

There are other more obvious benefits to using JS for ML (which TF.js has done a great job of capturing), but theses are not the current focus of the Shumai project. Instead, we are focusing on single node performance and multi-node scaling (with JS-style functional + reactive programming).

If you're interested in checking out the current state of the code, here it is: https://github.com/facebookresearch/shumai

\~\~\~

[1\] V8, a JIT engine and nothing else: 2.3 million lines of code https://www.openhub.net/p/v8-js/analyses/latest/languages\_summary

[2\] WebKit: 26 million lines of code https://www.openhub.net/p/WebKit/analyses/latest/languages\_summary (note that this includes a browser not just a JIT)

[3\] Bun: https://github.com/oven-sh/bun

[4\] Bun's JS is 41x faster than the newest Python, \~120x faster than 3.8: https://jott.live/markdown/py3.11\_vs\_3.8

/r/MachineLearning
https://redd.it/yin6ho

Читать полностью…

Data Scientology

To whom it’s scary

/r/mathpics
https://redd.it/yioy60

Читать полностью…

Data Scientology

D Diffusion vs MCMC as sampling algorithms

I was recently reviewing the diffusion methods used in Stable Diffusion and my mind wandered to Markov Chain Monte Carlo, which got me thinking - are there important theoretical similarities / differences between these methods?

A bit of background:

Intro to Stable Diffusion: A nice illustrated guide by Jay Alammar [https://jalammar.github.io/illustrated-stable-diffusion/](https://jalammar.github.io/illustrated-stable-diffusion/)
Intro to MCMC: Stanford CS168 notes by Tim Roughgarden and Gregory Valiant http://timroughgarden.org/s17/l/l14.pdf
The Metropolis-Hastings (MH) algorithm, a specific MCMC algorithm: [https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings\_algorithm](https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm)

My own stream-of-consciousness thoughts:

1. Function: Both diffusion and MH are sampling-based generative models that learns to produce data from a given distribution.
2. Iterative sampling: Diffusion works by predicting a de-noising term, which is progressively added to a random noise sample. This is similar to the MH proposal function g(x' | x) which generates candidate next state in the Markov chain. For diffusion, it converges to the training distribution, whereas MCMC converges to the stationary distribution.
3. Biasing: Diffusion can be conditioned on exact boundary conditions, or 'guided' towards certain types of outputs by modifying the diffusion gradient in the vein of [Janner et al](
https://arxiv.org/abs/2205.09991). In the same way the MH algorithm can respect hard and soft constraints by configuring the acceptance ratio f(x') / f(x) to encode the desired properties of the stationary distribution
4. Overall, we may say that in the literature, diffusion is largely 'learned' whereas MCMC is 'designed', but reversing that scenario may introduce interesting results (in terms of learnable MCMC or designed diffusion).

This thought raises a lot of important questions for me.

Can we interpret diffusion as some variant of MCMC and therefore derive theoretical properties of it?
The basic discussion above analyses MH and diffusion in terms of 2 properties, which are their iterative sampling and biasing procedures. Can we 'mix-and-match' to get new algorithms which might be better?
What are the important theoretical differences between these two methods?

It's not clear to me what the answers are, I'm hoping to have a good discussion with the smart minds in this forum!

/r/MachineLearning
https://redd.it/yieq8c

Читать полностью…

Data Scientology

News The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022

ServiceNow and Hugging Face have released a 3.1TB dataset of permissively licensed code in 30 programming languages. This is about 4x larger than the dataset used to train GPT-3 (though obviously ‘code only’), and 3x the size of CodeParrot, the next largest released code dataset.

Paper: https://drive.google.com/file/d/17J-0KXTDzY9Esp-JqXYHIcy--i\_7G5Bb/view

https://wandb.ai/telidavies/ml-news/reports/The-Stack-BigCode-s-New-3-TB-Dataset-Of-Permissively-Licensed-Code--VmlldzoyODY1MDUy

Hugging Face: https://huggingface.co/datasets/bigcode/the-stack

Twitter: https://twitter.com/BigCodeProject/status/1585631176353796097

Download The Stack: https://hf.co/BigCode

Source: https:\/\/twitter.com\/BigCodeProject\/status\/1585631176353796097

Source: https:\/\/twitter.com\/BigCodeProject\/status\/1585631176353796097

​

Source: https:\/\/twitter.com\/BigCodeProject\/status\/1585631176353796097

/r/MachineLearning
https://redd.it/yijfkw

Читать полностью…

Data Scientology

Dataset: Walmart Coffee Listings from 500 stores

In total, there're 1399 coffee listings.

Dataset: https://www.kaggle.com/datasets/dimitryzub/walmart-coffee-listings-from-500-stores

Data:
- Title
- Coffee Type
- Rating
- Reviews
- Seller Name
- Thumbnail (of the product)
- Price

This dataset was created while I was working on a SerpApi demo project to showcase Walmart Search Engine Results API.

GitHub repository with a data extraction/analysis scripts and plots: https://github.com/dimitryzub/walmart-stores-coffee-analysis

A blog post that shows data extraction steps: https://serpapi.com/blog/serpapi-demo-project-walmart-coffee-exploratory-data-analysis/#extracting-walmart-data

/r/datasets
https://redd.it/yi3ewr

Читать полностью…

Data Scientology

Sex Ratio at Birth: USA, China, Japan, Germany, & India [OC]

/r/dataisbeautiful
https://redd.it/yids7i

Читать полностью…

Data Scientology

Legality of consensual sex between siblings

/r/MapPorn
https://redd.it/yi6uvc

Читать полностью…

Data Scientology

What should kids really be afraid of? [OC]

/r/dataisbeautiful
https://redd.it/yhxmqy

Читать полностью…

Data Scientology

[P] Explain Paper - A Better Way to Read Academic Papers
https://twitter.com/amanjha__/status/1584628485510733825

/r/MachineLearning
https://redd.it/yhx3g3

Читать полностью…

Data Scientology

PYTHON CHARTS: a new visualization website feaaturing matplotlib, seaborn and plotly Over 500 charts with reproducible code

I've recently launched "PYTHON CHARTS", a website that provides lots of matplotlib, seaborn and plotly easy-to-follow tutorials with reproducible code, both in English and Spanish.


Link: https://python-charts.com/
Link (spanish): https://python-charts.com/es/

​

https://preview.redd.it/v4kwjk5hn0x91.png?width=939&format=png&auto=webp&s=e2b92d7db2d6c63ce4bff55dabe34e96236d646e

The posts are filterable based on the chart type and library:

https://preview.redd.it/4tfvn5prn0x91.png?width=898&format=png&auto=webp&s=e7cba3f1bda4ec05fcf7f1a21489d1811c3e4a30

Each tutorial will guide the reader step by step from a basic to more styled chart:

https://preview.redd.it/yrsnxpdwn0x91.png?width=694&format=png&auto=webp&s=ea772dda73588bbf87326e8ef384d002e0355f76

The site also provides some color tools to copy matplotlib colors both in HEX or by its name. You can also convert HEX to RGB in the page:

https://preview.redd.it/hxhdctl2o0x91.png?width=890&format=png&auto=webp&s=5cc280970d2112986d5ba35205e6aa6f224689e5

​

I created this website on my spare time for all those finding the original docs difficult to follow.
This site has its equivalent in R: https://r-charts.com/

Hope you like it!

/r/datascience
https://redd.it/yhrlpj

Читать полностью…

Data Scientology

The results of the second round of the 2022 Presidential Election in Brazil, by state

/r/MapPorn
https://redd.it/yhspmj

Читать полностью…

Data Scientology

Time change in Europe

/r/MapPorn
https://redd.it/yhnv97

Читать полностью…

Data Scientology

US House of Representatives vote on the 1964 Civil Rights Act

/r/MapPorn
https://redd.it/yiz0f0

Читать полностью…

Data Scientology

Each color contains an almost equal population size in both of these regions of Africa.

/r/MapPorn
https://redd.it/yj8jsu

Читать полностью…

Data Scientology

PN Learn Rust as ML practitioner

Hi,

we are developing a ML practitioner-friendly library to help learn Rust mimicking the sklearn API: Smartcore.
You can see some examples in these notebooks.


Please give feedback and consider contributing. Algorithms are implemented from scratch to be easy to understand and read through with room to grow in efficiency and portability. There are some newcomers-friendly issues and TODOs.

/r/MachineLearning
https://redd.it/yi86he

Читать полностью…

Data Scientology

American wages haven't budged for the last 50 years
https://www.weforum.org/agenda/2019/04/50-years-of-us-wages-in-one-chart/

/r/dataisbeautiful
https://redd.it/yipdc3

Читать полностью…

Data Scientology

Data_irl

/r/data_irl
https://redd.it/yj1wnb

Читать полностью…

Data Scientology

"Last Call" times for bars in the United States

/r/Infographics
https://redd.it/yid7kj

Читать полностью…

Data Scientology

Serial killers with the highest number of victims by European countries

/r/MapPorn
https://redd.it/yi7279

Читать полностью…

Data Scientology

With the Brazilian election results, Latin America's left leaning governments.

/r/MapPorn
https://redd.it/yhv1da

Читать полностью…

Data Scientology

Clown heads and shitty upscale

/r/dataisugly
https://redd.it/yi7agn

Читать полностью…

Data Scientology

Is there any remote job board that focus solely (or more) on data related roles?

Most remote job boards list mostly developer roles. While I get there are more openings, I am wondering if there’s any that focus mostly on data related roles.

/r/datascience
https://redd.it/yhgoxl

Читать полностью…

Data Scientology

I have SO many questions

/r/dataisugly
https://redd.it/yhv1xd

Читать полностью…

Data Scientology

[OC] CEO Compensation Growth Outpaces Stock Market since 1978

/r/dataisbeautiful
https://redd.it/yhqhym

Читать полностью…

Data Scientology

Map of U.S. ZIP Code zones

/r/MapPorn
https://redd.it/yhyrv2

Читать полностью…

Data Scientology

[OC] Breaking down revenue and profit sources of Saudi Aramco, one of the largest energy firms in the world

/r/dataisbeautiful
https://redd.it/yhdzaa

Читать полностью…

Data Scientology

Meanders of the Mississippi, 1944

/r/MapPorn
https://redd.it/yhjqfn

Читать полностью…
Подписаться на канал