datascientology | Education

Telegram channel datascientology - Data Scientology

Hot data science related posts every hour. Chat: https://telegram.me/r_channels Contacts: @lgyanf

Data Scientology

How does Qwen3-Next Perform in Complex Code Generation & Software Architecture?

https://redd.it/1opwl9h
@datascientology

Data Scientology

masters in computational linguistics uppsala or tübingen

hi all

i'm planning to apply for a masters in computational linguistics / language technology as an international (non EU/EEA) student. i've done research on programs and have narrowed down on these few:

1. uppsala's MA language technology masters
2. tübingen's MA computational linguistics
3. stockholm's MA AI and language
4. stuttgart's MSc Computational Linguistics
5. konstanz's MA speech and language processing
6. helsinki's MA linguistic diversity and digital humanities (language technology track)
7. potsdam's MSc cognitive systems

coming from a linguistic background (bachelor with honours), i'm looking at 2 year programs as i believe i'd be able to learn more programming theory + technical skills that would better equip me for an industry role in the tech sector. i'm thus not as keen on 1 year programs such as leiden's linguistics (comp ling track), VU's linguistics language and AI, or groningen's speech technology programs. i'm learning python online to gain some basic proficiency in programming before starting the masters.

uppsala and tübingen are my top 2 choices if i were to be accepted, particularly because they seem more accessible to prospective students from a linguistic background based on my research. i'm hoping to gain more information about these two cities and their programs based on people's personal experience so that i can make an informed choice. these are my questions:

1. ACCESSIBILITY: how accessible is the program for those with a linguistic background? accessible could mean being less CS-intensive, or that there are foundational classes in programming/ML/AI to help those with a humanities background ease into the program with less difficulty
2. TEACHING QUALITY: what's your experience with the quality of teaching, how well organised the course is, helpfulness of professors, whether studying resources are provided or you'd have to source for your own materials, etc
3. JOB OPPORTUNITIES: in which city would an international student find it easier to get a job after graduating?
4. HEALTHCARE: how easy is it to get a medical appointment for minor and major illnesses in the city, both as a student and after graduation?
5. SOCIAL LIFE: how open people are to making new (local) friends, especially if one is not fluent in Swedish (for uppsala) or German (for tübingen)?
6. ACTIVITIES: which city has more options for activities if i'm not a huge fan of partying, alcohol, pub crawls? (occasional outings for special occasions are fine, but it's not something i would do frequently or particularly enjoy) i'm open to hiking, bouldering, music events, board games, reading, or any other activity
7. TRANSPORT: how well-connected and accessible is public transport within these cities, and also from the city to other cities?
8. COST OF LIVING: it seems like living costs (on numbeo) are generally lower in uppsala than tübingen (which is counter to my initial impression that CoL is higher in nordic countries) and i'm wondering if this is really the case? i've also read comments that tübingen is an expensive city to live in - would this make the cost of living in tübingen 'comparable' to uppsala?
9. QUALITY OF LIFE: how would you describe the overall quality of life in uppsala/tübingen, and if you have experience living in both, is the quality of life noticeably better in one of the cities? (my impression is that anywhere in the nordics would have a better quality of life but i'd like to hear your experience if you've lived there)

i'd be grateful if you could share your experience in uppsala and/or tübingen, or if you have experience with the other programs (and countries). thanks so much!

TLDR: international student (non EU/EEA) with BA (Honours) in Linguistics looking for advice on whether to choose uppsala or tübingen for masters in computational linguistics/language technology

/r/LanguageTechnology
https://redd.it/1omgy7r

Data Scientology

How was this achieved? They are able to track movements and complete steps automatically

/r/computervision
https://redd.it/1oeosd3

Data Scientology

D Self-Promotion Thread

Please post your personal projects, startups, product placements, collaboration needs, blogs etc.

Please mention the payment and pricing requirements for products and services.

Please do not post link shorteners, link aggregator websites, or auto-subscribe links.

--

Any abuse of trust will lead to bans.

Encourage others who create new posts for questions to post here instead!

Thread will stay alive until next one so keep posting after the date in the title.

--

Meta: This is an experiment. If the community doesn't like this, we will cancel it. This is to encourage those in the community to promote their work without spamming the main threads.

/r/MachineLearning
https://redd.it/1nvrmw5

Data Scientology

D Monthly Who's Hiring and Who wants to be Hired?

For Job Postings please use this template

>Hiring: [Location], Salary: [], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]

For Those looking for jobs please use this template

>Want to be Hired: [Location], Salary Expectation: [], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]


Please remember that this community is geared towards those with experience.

/r/MachineLearning
https://redd.it/1nuwj5t

Data Scientology

I built TagiFLY – a lightweight open-source labeling tool for computer vision (feedback welcome!)

Hi everyone,

Most annotation tools I’ve used felt too heavy or cluttered for small projects. So I created TagiFLY – a lightweight, open-source labeling app focused only on what you need.

🔹 What it does

- 6 annotation tools (box, polygon, point, line, mask, keypoints)
- 4 export formats: JSON, YOLO, COCO, Pascal VOC
- Light & dark theme, keyboard shortcuts, multiple image formats (JPG, PNG)
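
The export formats above mainly differ in how boxes are encoded; YOLO, for instance, stores normalized center coordinates and size. A minimal sketch of that conversion (illustrative only, not TagiFLY's actual code):

```python
def to_yolo(box, img_w, img_h):
    """Convert a pixel-space (x_min, y_min, x_max, y_max) box to the
    YOLO label format: normalized (center_x, center_y, width, height)."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return cx, cy, w, h

# A 200x200 box in a 640x480 image:
print(to_yolo((100, 50, 300, 250), 640, 480))  # (0.3125, 0.3125, 0.3125, 0.4166...)
```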

🔹 Why I built it
I wanted a simple tool to create datasets for:

🤖 Training data for ML
🎯 Computer vision projects
📊 Research or personal experiments

Export Window (screenshot): https://preview.redd.it/y28zj1mftyrf1.png?width=1748&format=png&auto=webp&s=6aeb618cdfd4276c07c1e94c79763a4074ff8334

🔹 Demo & Code
👉 GitHub repo: https://github.com/dvtlab/tagiFLY


⚠️ It’s still in beta – so it may have bugs or missing features.
I’d love to hear your thoughts:

Which features do you think are most useful?
What would you like to see added in future versions?

Thanks a lot 🚀

/r/computervision
https://redd.it/1nsyv0h

Data Scientology

D Is non-DL related research a poor fit for ICLR?

I was one of the lucky people rejected from NeurIPS with 6/4/4/4 scores but a cranky AC, so I'm looking to resubmit now. Since it got good reviews at NeurIPS, I'm considering submitting to ICLR, incorporating the suggested changes.

However, my paper proposes a linear dimensionality reduction technique based on information geometry. It is my understanding that ICLR is very focused on neural networks and deep learning, so I am worried that my paper is not a good fit; I am also considering AISTATS.

Is a novel linear dimensionality reduction technique too out of scope for ICLR? I am an outsider to the field, so would very much appreciate opinions.

/r/MachineLearning
https://redd.it/1nn56yu

Data Scientology

D Monthly Who's Hiring and Who wants to be Hired?

For Job Postings please use this template

>Hiring: [Location], Salary: [], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]

For Those looking for jobs please use this template

>Want to be Hired: [Location], Salary Expectation: [], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]


Please remember that this community is geared towards those with experience.

/r/MachineLearning
https://redd.it/1n4jdo7

Data Scientology

Tried building an explainable Vision-Language Model with CLIP to spot and explain product defects!

/r/deeplearning
https://redd.it/1n6lpte

Data Scientology

AI research is drowning in papers that can’t be reproduced. What’s your biggest reproducibility challenge?

Curious — what’s been your hardest challenge recently? Sharing your own outputs, reusing others’ work?

We’re exploring new tools to make reproducibility proofs verifiable and permanent (with web3 tools, e.g. IPFS), and would love to hear your input.

The post sounds a little formal, as we are reaching a bunch of different subreddits, but please share your experiences if you have any, I’d love to hear your perspective.

Mods, if I'm breaking some rules, I apologize, I read the subreddit rules, and I didn't see any clear violations, but if I am, delete my post and don't ban me please :c.



/r/LanguageTechnology
https://redd.it/1mzxfkz

Data Scientology

The best tools I’ve found for evaluating AI voice agents

I’ve been working on a voice agent project recently and quickly realized that building the pipeline (STT → LLM → TTS) is the easy part. The real challenge is evaluation, making sure the system performs reliably across accents, contexts, and multi-turn conversations.

I went down the rabbit hole of voice eval tools and here are the ones I found most useful:

1. **Deepgram Eval**
* Strong for transcription accuracy testing.
* Provides detailed WER (word error rate) metrics and error breakdowns.
2. **Speechmatics**
* I used this mainly for multilingual evaluation.
* Handles accents/dialects better than most engines I tested.
3. **Voiceflow Testing**
* Focused on evaluating conversation flows end-to-end.
* Helpful when testing dialogue design beyond just turn-level accuracy.
4. **Play.ht Voice QA**
* More on the TTS side, quality and naturalness of synthetic voices.
* Useful if you care about voice fidelity as much as the NLP part.
5. **Maxim AI**
* This stood out because it let me run *structured evals on the whole voice pipeline*.
* Latency checks, persona-based stress tests, and pre/post-release evaluation of agents.
* Felt much closer to “real user” testing than just measuring WER.
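
Since several of these tools report WER, it helps to have the metric itself pinned down: word-level edit distance divided by reference word count. A quick reference implementation (a generic sketch, not any vendor's exact scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(round(wer("the cat sat", "the cat sat on"), 3))  # 0.333 (1 insertion / 3 words)
```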

I’d love to hear if anyone here has explored other approaches to **systematic evaluation of voice agents,** especially for multi-turn robustness or human-likeness metrics.

/r/LanguageTechnology
https://redd.it/1mufzbv

Data Scientology

Open Sourced Research Repos Mostly Garbage

I'm doing my MSc thesis right now, so I'm going through a lot of paper reading and, if lucky enough, find some implementations too. However, most of them look like the author was coding for the first time, with lots of unanswered, pretty fundamental issues about the repo (env setup, reproduction problems, crashes…). I saw a latent diffusion repo that requires separate env setups for the VAE and the diffusion model; how is this even possible (they're not saving latents to be read by the diffusion module later)?! Or the results reported in the paper and the repo differ. At some point I start to doubt that most of these works, especially ones from not-well-known research groups, are kind of bloated/dishonest. Because how can you not have a functioning piece of software for a method you published?

What do you guys think?

/r/deeplearning
https://redd.it/1mt9osc

Data Scientology

The AI spam has been overwhelming - conversations with ChatGPT and pseudo-research are now bannable offences. Please help the sub by reporting the spam!

Pseudo-research AI conversations about prompt engineering and recursion have been testing all of our patience, and I know we've seen a massive dip in legitimate activity because of it.

Effective today, AI-generated posts & pseudo-research will be a bannable offence.

I'm trying to keep up with post removals with automod rules, but the bots are constantly adjusting to it and the human offenders are constantly trying to appeal post removals.

Please report any rule breakers, which will flag the post for removal and mod review.

/r/LanguageTechnology
https://redd.it/1mf7igt

Data Scientology

Estimating Distance of Ships from PTZ Camera (Only Bounding Box + PTZ Params)

/r/computervision
https://redd.it/1mgp2gr

Data Scientology

Is it possible to do something like this with Nvidia Jetson?

/r/computervision
https://redd.it/1ma4gvj

Data Scientology

Where's the Best Place to Rent a GPU for Model Training?

I'm planning some AI model training and want to rent a powerful GPU like an RTX 4090 instead of buying one, just curious:
Which platforms do you usually use? How's the pricing and availability in your area?

/r/deeplearning
https://redd.it/1onzsyi

Data Scientology

D Google PhD Fellowship recipients 2025

Google has just announced the 2025 recipients.

What are the criteria to get this fellowship?

https://research.google/programs-and-events/phd-fellowship/recipients/

/r/MachineLearning
https://redd.it/1ogy6z9

Data Scientology

Dual 3D vision | software/library - synced TEMAS modules

/r/computervision
https://redd.it/1o8uyy4

Data Scientology

Face Reidentification Project 👤🔍🆔

/r/computervision
https://redd.it/1o0e88e

Data Scientology

basketball players recognition with RF-DETR, SAM2, SigLIP and ResNet

/r/computervision
https://redd.it/1nv4d8u

Data Scientology

R ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

We released ShinkaEvolve, a new state-of-the-art and fully open-source framework for program optimization, which we specifically designed to be easily integrated into any scientific codebase.

Open source code: https://github.com/SakanaAI/ShinkaEvolve

Technical report: https://arxiv.org/abs/2509.19349

Blog: https://sakana.ai/shinka-evolve/

You can start playing with ShinkaEvolve without even downloading any code, all inside a remote Google Colab instance: https://colab.research.google.com/github/SakanaAI/ShinkaEvolve/blob/main/examples/shinka_tutorial.ipynb

In our technical report, we show how ShinkaEvolve can be easily applied across different problem domains. On the canonical circle packing task, ShinkaEvolve discovers a new solution with state-of-the-art performance beyond the recent closed-source AlphaEvolve using only 150 program evaluations. We even apply ShinkaEvolve to small-scale LLM pretraining, discovering a new load-balancing loss for MoE architectures with remarkable stabilization properties.
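
For readers new to the paradigm, the mutate-evaluate-select loop underlying evolutionary program optimization can be caricatured in a few lines. This is a generic hill-climbing sketch under a fixed evaluation budget, not ShinkaEvolve's actual algorithm; `objective` is a hypothetical stand-in for an expensive program evaluation:

```python
import random

random.seed(0)  # reproducible run

def objective(x: float) -> float:
    # stand-in for an expensive program evaluation; best at x = 3
    return -(x - 3.0) ** 2

# Mutate the current best, evaluate, keep only improvements,
# within a 150-evaluation budget (echoing the budget quoted above).
best = 0.0
best_score = objective(best)
for _ in range(150):
    candidate = best + random.gauss(0.0, 0.5)  # mutation
    score = objective(candidate)               # evaluation
    if score > best_score:                     # selection
        best, best_score = candidate, score

print(best_score >= objective(0.0))  # True: never worse than the start
```

Frameworks like ShinkaEvolve replace the random mutation with LLM-proposed program edits and add machinery for sample efficiency, but the outer loop has this shape.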

ShinkaEvolve also comes with a detailed and lightweight WebUI to monitor its discoveries in real-time!

/r/MachineLearning
https://redd.it/1nq856v

Data Scientology

R NeurIPS rejected paper resubmission

My paper just got rejected (scores: 4, 4, 3, 3). I’m considering resubmitting it to IEEE SatML. What’s your opinion on SatML? Would it be better to aim for a journal like IEEE TIFS instead? Any other recommendations? I’m not really interested in ICLR since I feel it might get rejected there too. Field: AI Security.

/r/MachineLearning
https://redd.it/1nkrmzr

Data Scientology

D Self-Promotion Thread

Please post your personal projects, startups, product placements, collaboration needs, blogs etc.

Please mention the payment and pricing requirements for products and services.

Please do not post link shorteners, link aggregator websites, or auto-subscribe links.

--

Any abuse of trust will lead to bans.

Encourage others who create new posts for questions to post here instead!

Thread will stay alive until next one so keep posting after the date in the title.

--

Meta: This is an experiment. If the community doesn't like this, we will cancel it. This is to encourage those in the community to promote their work without spamming the main threads.

/r/MachineLearning
https://redd.it/1n67lft

Data Scientology

M4 Mac Mini for real time inference

Nvidia Jetson Nanos are 4x costlier here than in the United States, so I was thinking of handling some edge deployments with an M4 Mac Mini, which is 50% cheaper with double the VRAM and all the plug-and-play benefits, though it lacks the NVIDIA accelerator ecosystem.

I use an M1 Air for development (with heavier work happening in cloud notebooks) and can run RF-DETR Small at 8 fps at its native resolution of 512x512 on my laptop. This was fairly unoptimized.

I was wondering if anyone has had the chance to run it, or any other YOLO or detection transformer model, on an M4 Mac Mini and seen better performance -- 40-50 fps would be totally worth it overall.

Also, my current setup just involves calling the model.predict function; what is the way forward for optimized MPS deployments? Do I convert my model to MLX? Will that give me a performance boost? A lazy question, I admit, but I will report the outcomes in the comments later when I try it out.
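
For comparing fps numbers across machines consistently, a tiny stdlib harness is enough; a sketch (where `infer` stands in for whatever predict call you use, e.g. a hypothetical `model.predict`):

```python
import time

def measure_fps(infer, frames, warmup=3):
    """Time `infer` over `frames` and return frames per second.
    Warmup runs are excluded so one-time graph/compile costs
    don't skew the number."""
    for f in frames[:warmup]:
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# usage sketch: fps = measure_fps(lambda img: model.predict(img), test_images)
```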

Thank you for your attention.

/r/computervision
https://redd.it/1n563nj

Data Scientology

I built a tool to benchmark tokenizers across 100+ languages and found some wild disparities R

TL;DR: Created tokka-bench to compare tokenizers across languages. Turns out your fine-tune's multilingual performance might suck because of tokenization, not architecture. Also explains why proprietary models (Claude, GPT, Gemini) are so much better at non-English tasks.

Links:

[Live dashboard](https://tokka-bench.streamlit.app/)
Full blog post
[GitHub repo](https://github.com/bgub/tokka-bench)

https://preview.redd.it/7i03jela9elf1.png?width=1724&format=png&auto=webp&s=95378457970e6337b147e71d7a8f0ab2dd67cb91

# The Problem Nobody Talks About

I started this as a side quest while pretraining a multilingual model, but tokenization turned out to be way more important than expected. There are two hidden layers creating massive efficiency gaps:

UTF-8 encoding differences:

English: ~1 byte per character
Arabic: 2+ bytes per character
Chinese: 3+ bytes per character
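
This encoding gap is easy to verify directly:

```python
# Bytes needed per character in UTF-8, by script:
for label, ch in [("Latin", "a"), ("Arabic", "ب"), ("Chinese", "中")]:
    print(label, len(ch.encode("utf-8")))  # 1, 2, 3 respectively
```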

Tokenization bias: Most tokenizers are trained on English-heavy data, so they allocate way more vocabulary to English patterns. These compound into serious problems.

# Why This Affects Performance

During training: If you allocate tokens proportionally (10M English, 1M Khmer), the Khmer text has WAY less semantic content because it needs more tokens per word. Plus Khmer tokens end up being character-level instead of semantic units, making concept storage much harder.

During inference: Low-resource languages need 2-3x more tokens per sentence:

Slower throughput (costs more to serve)
Context windows fill up faster
More chances to mess up during generation

# What I Built

tokka-bench measures four key things:

1. Efficiency - bytes per token (compression quality)
2. Coverage - unique tokens used (script representation)
3. Word splitting - how often semantic units get fragmented
4. Subword fertility - average tokens per semantic unit
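
The first and last of these are easy to make concrete. A self-contained sketch with a toy chunking tokenizer standing in for a real one (swap `encode` for any real tokenizer's encode function to approximate the tokka-bench numbers; this is an illustration, not the benchmark's actual code):

```python
def encode(text: str) -> list[str]:
    # toy tokenizer: split on whitespace, then into 4-char chunks
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]

def efficiency(text: str) -> float:
    """Bytes per token: higher means better compression for this text."""
    return len(text.encode("utf-8")) / len(encode(text))

def fertility(text: str) -> float:
    """Subword fertility: average tokens per whitespace-delimited word."""
    return len(encode(text)) / len(text.split())

sample = "tokenizers allocate vocabulary unevenly across languages"
print(round(efficiency(sample), 2), fertility(sample))  # 3.73 2.5
```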

# Interesting Findings

You can actually reverse-engineer training data from tokenizer performance:

Kimi K2: Exceptional Mandarin coverage (obviously Chinese-trained)
Gemma 3: Strong Urdu/Hindi performance
gpt-oss: Good Arabic/Gujarati coverage

Weirdest finding: Programming languages show almost identical efficiency across all tokenizers. Probably because everyone trains on GitHub with similar language distributions.

# Technical Details

Built on high-quality datasets (FineWeb, FineWeb-2, StarCoder). Samples 2MB per language and calculates per-language metrics. Has some limitations around cross-linguistic comparison due to UTF-8 differences, but great for comparing tokenizers on the same language.

Shoutout to Judit Ács for the original subword fertility metrics and Rust et al.'s ACL paper that laid the groundwork.

PS: if you're from an AI lab and want to contribute your tokenizer's metrics (even if proprietary), please reach out! The community would benefit a lot from understanding how SOTA systems handle this stuff.

Posted this on LinkedIn/Twitter already but figured r/MachineLearning would appreciate the technical details. Happy to answer questions about methodology or findings!

/r/MachineLearning
https://redd.it/1n0r8b7

Data Scientology

D Conferences need to find better venues

Better = venues that are virtually accessible for any researcher/author to go to.

Just this morning, I was denied the U.S. B1 visa. I'm supposed to present my work at ICCV 2025 in Hawaii, and during my in-person interview, the Visa Officer did not even bother to ask for the invitation letter.

This really blows cause it's supposed to be my first time and I was so excited about attending it. Would love to hear your thoughts about this.

/r/MachineLearning
https://redd.it/1mtfikh

Data Scientology

HR have been using AI against you for years


HR scans, ranks, and rejects your CV without a single human glance.
They filter using AI, all automatically.
Most decisions are made before an actual recruiter even sees your name.

---

For this reason I built **Laboro** to flip the game.
It applies to jobs for you:
it scrapes listings from 70k+ company career pages and matches them to your real experience;
then the Laboro AI Agent opens the browser, finds the forms, understands the fields, and fills them out with your CV.

---

So now it’s AI vs AI.
Their algorithm decides who gets in; mine makes sure you get there.
That's not cheating, that's leveling the field.

When both sides have the same tech power, the only thing left to decide is merit: no insider referrals, no gatekeeping, no "friend of the hiring manager", just skills vs the system.

---

If HR hates that, maybe it’s because for the first time… they don’t get to choose who plays.

/r/deeplearning
https://redd.it/1mops78

Data Scientology

[D] Is modern academic publishing zero-sum?

It seems the current state of publishing in A* venues (CVPR, NeurIPS, ICML, ICCV/ECCV) is zero-sum. One person's rejection is another person's acceptance. Reviewers seem to reject papers just for the sake of rejection. There's a sense that some reviewers reject papers not on substantive grounds, but out of an implicit obligation to limit acceptance rates. Rebuttals appear to be pointless, as reviewers take stubborn positions and do not acknowledge their misunderstandings during this period. Good science just doesn't appear to be as valued as the next flashiest LLM/VLM that gets pretty results.

/r/MachineLearning
https://redd.it/1miq2y4

Data Scientology

D - NeurIPS 2025 Reviews

Hey everyone,

NeurIPS 2025 reviews should be dropping soon (July 24th AoE), and I thought it might be a good idea to start a thread where we can share our thoughts, experiences, and reactions.

Feel free to post your initial impressions, any surprises (good or bad), questions about rebuttals, or just how you’re feeling about the process this year. Whether it’s your first submission or your tenth, you’re not alone in the rollercoaster.

Let’s keep things constructive and supportive. Good luck to all!

/r/MachineLearning
https://redd.it/1m74ugv

Data Scientology

D Monthly Who's Hiring and Who wants to be Hired?

For Job Postings please use this template

>Hiring: [Location], Salary: [], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]

For Those looking for jobs please use this template

>Want to be Hired: [Location], Salary Expectation: [], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]


Please remember that this community is geared towards those with experience.

/r/MachineLearning
https://redd.it/1loqe5e
