RelPose++: Recovering 6D Poses from Sparse-view Observations
https://amyxlase.github.io/relpose-plus-plus/
By: Amy Lin, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani, Carnegie Mellon University
> Estimating 6D Camera Poses from Sparse Views. RelPose++ extracts per-image features (with positionally encoded image index and bounding box parameters) and jointly processes these features using a Transformer. We use an energy-based framework to recover coherent sets of camera rotations, with a score predictor for pairs of relative rotations. RelPose++ also predicts camera translations by defining an appropriate coordinate system that decouples the ambiguity in rotation estimation from translation prediction. Altogether, RelPose++ is able to predict accurate 6D camera poses from 2-8 images.
/r/computervision
https://redd.it/13co28p
Multiple GPUs
I want to train a large network and I have 4 GPUs on the server. My model trains on just one GPU and fails with a "CUDA out of memory" error. How can I train my model on all available GPUs? Is it complex or easy to do? Do you have any ideas to solve this error?
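For context, the quickest thing to try in PyTorch is data parallelism, which replicates the model on every GPU and splits each batch across them. A minimal sketch, with `MyModel` standing in for your own network:
```
import torch
import torch.nn as nn

model = MyModel()                        # placeholder: your own nn.Module
if torch.cuda.device_count() > 1:
    # Replicates the model on all visible GPUs and splits each batch across them.
    # DistributedDataParallel is the faster, recommended variant, but it needs
    # one process per GPU and a bit more setup.
    model = nn.DataParallel(model)
model = model.cuda()
```
Note that data parallelism only shards the batch, so if a single sample plus the model weights already overflow one GPU, you would instead need model parallelism/FSDP, gradient checkpointing, mixed precision, or a smaller batch size.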
/r/deeplearning
https://redd.it/13awigr
I did a beginners mistake
Sharing this so maybe it'll save at least one of you the same trouble...
I used to use OpenCV a lot, and OpenCV's resize function takes the target size as (width, height).
I'm doing my graduation internship. I've been benchmarking SOTA segmentation models and trying things to improve the company's algorithm, and I managed to gain 8% in performance (mIoU) and cut inference time by 30%.
Only problem: my tutor ran some trainings with his code, just to verify, and his results were better than mine (with my architecture implementation). Everything was the same: loss function, hyperparameters, data augmentation... everything but one thing. I used the torchvision resize function, and guess what, for torchvision the size is (height, width).
I was training my models on a fucked up dataset cuz I inverted height and width.
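For anyone hitting the same thing, the argument order really is flipped between the two libraries. A quick illustration (`img`, `tensor_img`, `new_h`, `new_w` are placeholders):
```
import cv2
import torchvision.transforms.functional as F

# OpenCV: dsize is (width, height)
resized_cv = cv2.resize(img, (new_w, new_h))

# torchvision: size is (height, width)
resized_tv = F.resize(tensor_img, [new_h, new_w])
```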
The good news is that I have more than 10% improvement now... the bad news is that tomorrow I have to tell my director that I fucked up from the start and that I have to redo the benchmarks.
I feel like shit tbh, stupidest mistake ever
/r/computervision
https://redd.it/1360a14
Your First Recommendation System: From Data Preparation to ML Debugging and Improvements Assessment
https://towardsdatascience.com/your-first-recommendation-system-from-data-preparation-to-ml-debugging-and-improvements-assessment-eb628573436
/r/deeplearning
https://redd.it/131bigx
Training YOLOV8 on Custom Dataset in batches of 128!
yolo task=detect mode=train model=yolov8x.pt epochs=100 batch=128 data=data/trash_piles.yaml device='0,1'
Training with some rather large batches on Dual RTX 6000 Ada.
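For reference, roughly the same run can be launched from Python with the ultralytics package (a sketch mirroring the CLI flags above):
```
from ultralytics import YOLO

# Same settings as the CLI command above
model = YOLO("yolov8x.pt")
model.train(data="data/trash_piles.yaml", epochs=100, batch=128, device=[0, 1])
```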
/r/deeplearning
https://redd.it/12vr6ag
[R] 🚀🧠 Introducing 3 New LoRA Models Trained with LLaMA on the OASST Dataset at 2048 seq length! 📊🔥
We are super excited to announce the release of 3 brand new LoRA models trained using the LLaMA model! These state-of-the-art models have been trained on the full 2048 sequence length for 4 epochs, using the OASST dataset. 🌐💡
Shoutout to LAION and Open-Assistant for giving us early research access to the dataset 🎉
Check out this and more over on our FREE Gumroad (serpai/chat-llama) if you want to sign up for future releases and guides as well.
Check out our website for a post with more info: https://serp.ai/chat-llama/
- LoRA-7B 🚀
- LoRA-13B 💥
- LoRA-30B 🌌
We can't wait to see what amazing things you'll be able to accomplish with these new models! 🌟 So, feel free to share your experiences, ask questions, or discuss the potential applications for these models. 🧪🔬
Happy experimenting, and let's revolutionize the world of machine learning together! 💻🌍
Check out our GitHub for LLaMA LoRA training repos, inference GUIs, chat plugins (that you can also use with LLaMA), and more.
Cheers! 🍻
/r/MachineLearning
https://redd.it/12rds2h
How we picked a vector database for our open-source app
https://stablecog.com/blog/the-best-vector-database-for-stablecogs-semantic-search
/r/deeplearning
https://redd.it/12krpup
Why Do You Think a Model Like GPT-4 Works So Well in non-English Languages?
I don't know what other "3rd party data" GPT-4 was trained on, but I wonder if part of the success of these models is that languages, when projected into a large embedding space, are fairly isomorphic. Humans (generally) have universal aspects: senses, family structures, being diurnal, etc. There's variation in how you apologize or express social closeness in each language, but human beings have to do a lot of similar things, like apologize or express social distance.
There's some direct evidence of this in language pragmatics, where an English rhetorical taxonomy that is useful for many NLP applications has been bootstrapped into Arabic and functioned well for clustering. The idea is that the sociocultural moves we make in English (sounding more or less certain, expressing emotion, distinguishing between the abstract and the concrete) are made in all languages: different at the lexical level but the same at the lexicogrammatical level. (There's a Russian version that works well in testing, but there are no publicly available papers on it.)
What's your take on this?
/r/LanguageTechnology
https://redd.it/12glwwh
The PyTorch code for the model can be found in the [repo](https://github.com/Saswatm123/3D-Brain-Tumor-Segmentation-PyTorch), and the [README](https://github.com/Saswatm123/3D-Brain-Tumor-Segmentation-PyTorch) has more in-depth images of everything explained here. Thanks for reading, and I am open to hearing about job opportunities at the moment :)
/r/computervision
https://redd.it/12bhdfz
3D Brain Tumor Segmentation in PyTorch using U-Net & Eigenvector projection for color invariance
[https://github.com/Saswatm123/3D-Brain-Tumor-Segmentation-PyTorch](https://github.com/Saswatm123/3D-Brain-Tumor-Segmentation-PyTorch)
**Preface: I'm looking for a CV or ML/MLE job since getting my last offer rescinded in December.**
I created a 3D Brain Tumor segmentation model using PyTorch that takes in human brain MRI scan data as input, and outputs a segmentation map of the part of the brain that has a tumor, if it is present.
[Input images example](https://i.redd.it/e4pf5o20mura1.gif)
[Predicted segmentation map for previous input](https://i.redd.it/dii6ijo3mura1.gif)
There are many more pictures/GIFs on the README, but I would like to briefly go over the project here.
I settled on a U-Net architecture after experimenting with a couple other architecture styles. The skip connections really make a difference in the pixel-by-pixel accuracy of the segmentation map produced.
One interesting issue that took a bit of extra math to solve was regarding image color. The images would often be random hues since these MRI images come from different machines. For example, one brain image would be blue-on-black, while another one might be orange-on-green. Tumorous tissue is detected by a deviation in color, as can be seen in the first GIF, where even an untrained eye can pick out the tumor location, and the specific hue does not matter. Examples of this multiple hue effect can be seen in the README. This not only increases the dimensionality of the problem space, but also overactivates the residual connections often. For example, if they are used to lower-intensity color schemes (blue on black), a brighter color scheme (orange on green) would create an almost fully activated segmentation map since the first skip-connection would simply forward the image to the last couple layers.

I needed a way to create color invariance in images. This is usually solved through grayscaling the image, which takes the L2 norm per pixel and uses that as a "brightness" value. However, this does not work for this use case. An L2 norm takes the shell of a 3D sphere & compresses it to a single point. This means that points on the same sphere shell, but separate from each other ([0,1,1], [0,1,0], [1,1,0]), would all be considered the same, and a tumor would go undetected. We need to maintain 3D distance between points while ignoring the actual color.
Solution: We view each image as a 5D point cloud of (x, y, R, G, B), where (x, y) are the coordinates per pixel, and (R, G, B) are the values for the pixel. We may ignore (x, y) for now and focus on the (R, G, B) values. Color invariance while maintaining shape is now simply a problem of scale, translation, and rotation invariance of a 3D point cloud.
Translation invariance is trivial - we simply center the means per axis. This means that any configuration of this point cloud that has the same shape, but is translated differently, maps to the same position.
Rotation invariance then can be achieved by taking the Eigenvectors of our centered point cloud, ordering them by length, and mapping them to axes (largest EV = axis 0, second largest = axis 1, etc.) We can then simply rotate our point cloud according to our eigenvector projections. This ends up being a 1-sample PCA, where the sample is the point cloud image. The README shows a table of various images with this technique applied to it, along with their point cloud representations.
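A rough NumPy sketch of this per-image normalization (not the author's actual code, which lives in the repo; it assumes an (H, W, 3) float image):
```
import numpy as np

def hue_invariant(img):
    '''Center the per-pixel RGB point cloud and rotate it onto its
    principal axes (a 1-sample PCA), ordered by eigenvalue.'''
    pts = img.reshape(-1, 3).astype(np.float64)
    pts -= pts.mean(axis=0)                                # translation invariance
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts, rowvar=False))
    order = np.argsort(eigvals)[::-1]                      # largest eigenvector -> axis 0
    return (pts @ eigvecs[:, order]).reshape(img.shape)    # rotation invariance
```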
This technique helps my model beat the human accuracy benchmark. The problem of residual channels being overwhelmed/thrown off by various color schemes is not an issue anymore.
I prefer solutions involving invariance & explicit bias over augmentation because augmentation is exponential in time & space. If there are 5 factors with 3 levels each that we wish to make our model robust to, the extra multiplier is 3^5, and we can get rid of this with some ML craftiness. The augmented solution is also much more vulnerable to adversarial attacks in a way that an explicitly invariant model is not.
Estimation of Z height of a ball in flight.
/r/computervision
https://redd.it/129s8jr
OpenAI API for text extraction
Hi, I have a corpus of several extracted and labeled items. I want to use these to find similar items in an unseen long text document using an openAI endpoint. Is there something like semantic search but with learned embeddings? Which route should I take? Thank you in advance.
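One common route is to embed both the labeled items and chunks of the new document, then rank by cosine similarity. A sketch using the pre-1.0 openai Python client and its ada embedding model; `labeled_items` and `doc_chunks` are placeholders for your own data:
```
import numpy as np
import openai

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

item_vecs = embed(labeled_items)    # placeholder: your extracted, labeled items
chunk_vecs = embed(doc_chunks)      # placeholder: the unseen document split into chunks

# Cosine similarity between every chunk and every labeled item
sims = (chunk_vecs @ item_vecs.T) / (
    np.linalg.norm(chunk_vecs, axis=1, keepdims=True) * np.linalg.norm(item_vecs, axis=1)
)
best_item_per_chunk = sims.argmax(axis=1)
```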
/r/LanguageTechnology
https://redd.it/124h7sy
(Soon) NLP graduate and feel completely inferior on the job market
I am a master student in NLP/Computational linguistics and currently looking for jobs after graduation. Prepare for long panicked post, hope this is the right place to ask/vent..
Both my bachelor and master were a specialized NLP degree. Especially the bachelor was pretty general: I took all the same intro to linguistics (syntax, phonetics, morphology etc.) classes as the theoretical linguists. I had a lot of „traditional“ NLP methods such as parsing based on formal languages, automata theory, search algorithms. Basic maths, statistics, linear algebra. Specialized seminars on coreference, sentiment analysis etc., but those were mostly in the style of reading-papers-and-discussing-them. My master offered more technical and applied courses, but I did not feel well prepared since I never learned how to program neural networks myself except for a very basic numpy and pandas based classifier, but suddenly everyone was talking about transformer models. I had theoretical ML classes, but somehow we were just expected to know how to implement them into our projects too?

I am now doing my thesis where I am using an existing system (pytorch-based) and adapting and tuning it for a slightly different task. While I (thought I) know how to program and the basics of how machine learning works, the reality is I feel soooo out of place. I have a hard time even understanding the pytorch documentation, and I feel like there are a million things to consider. Shapes don’t match, cuda out of memory, suddenly I need to do gradient clipping which I feel I was taught about in 30min 2 years ago maybe. I usually make it work somehow after 5 nervous breakdowns, but I constantly feel like I am half-assing everything, just trying to get it to run at least. If I were to build such a system, even a way simpler one, from scratch, I would die.
Now looking at jobs, most of those that advertise with NLP require „practical machine learning experience with frameworks such as TensorFlow, PyTorch…“, and nearly every job is also equally directed at graduates from EITHER data science, mathematics, computer science, NLP … How can I keep up with data scientists in this aspect? Did I mess up by not practicing how to actually code and understand complex systems during my degree? I know a few other students who expressed similar concerns, at least from my school. I definitely see potential for me in areas with highly specialized use cases/messy/non-standard data, but wonder if this really needed >3 years of linguistic basics. Will employers actually care about my linguistic background compared to a data scientist with some NLP experience? Currently I feel like I would have done better doing a data science degree and then taking a few classes on linguistics later on to specialize…. I guess I will find a job one way or another but I am already scared of interviews because of these inadequacies.
/r/LanguageTechnology
https://redd.it/11zvsnj
Best GPUs for pretraining RoBERTa-size LLMs with a $50K budget: 4x RTX A6000 vs. 4x A6000 Ada vs. 2x A100 80GB
Hi folks,
Our lab plans to purchase a server with some decent GPUs to perform some pretraining tasks for program code. We won't work on very large LLMs and we may not even try the T5 model. Currently, we want to first try the RoBERTa model. We have a $50K budget. And it's our first time purchasing GPU servers.
I did some preliminary study and found the suggested GPU is A6000 ADA which has 48 GB GPU memory, according to https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/. Since our tasks require lots of GPU memory, we think a GPU with more than 32 GB will be good for us. So our alternative choices are RTX A6000 and A100 80GB HBM2 cards.
Based on these, we got three server specs from Exxact ( https://www.exxactcorp.com/TWS-115999024/configurator), (1) a $43K spec with 4 A6000 ADA cards, (2) a $32K spec with 4 RTX A6000 cards, and (3) a $41K spec with 2 A100 80GB cards. The other parts in the specs, e.g., CPU and RAM, are almost the same. I have attached the specs in screenshots.
Now, I have some questions.
1. The A6000 Ada removed NVLink (https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-nv-link-even-on-pro-gpus/230874), which is very important for performance boosting and GPU memory pooling. Does this mean it's still a good choice to have multiple A6000 Ada cards in one server?
2. The A6000 Ada is a very new GPU, improved over the RTX A6000. But the RTX A6000 does have NVLink, which means the server's GPU memory can reach 4 * 48 GB when connecting 4 RTX A6000 cards. However, we are going to use the GPU server for several years, and for IT products it's usually better to purchase the latest ones. Is that true for GPU cards? The A6000 Ada also has more Tensor and CUDA cores than the RTX A6000.
3. For the A100 80GB spec, we can only have 2 cards considering the budget. For LLM pretraining, more cards usually mean more parallelism and faster training. Based on my research, the A6000 Ada has comparable performance to the A100 on DL benchmarks. Is this A100 80GB spec a good choice?
4. Aside from the above-mentioned specs, what else would you recommend for our pretraining tasks, especially for GPUs?
Thanks for your time! We really appreciate any suggestions.
/r/deeplearning
https://redd.it/11vb220
Finno-Ugric open-source machine translation
We here at the University of Tartu created an NMT engine for 23 Finno-Ugric languages, targeting low-resource languages: Livonian, Komi, Udmurt, Võro and several others. Most of the covered low-res languages are not part of Meta's M2M100 or NLLB, nor are they part of Google Translate, Bing Translator or DeepL yet.
FairSeq translation model and full list of supported languages here: https://huggingface.co/tartuNLP/smugri3-finno-ugric-nmt. Online demo here: https://translate.ut.ee/, submitting corrected translations is also supported, in case you speak any of these languages - we are hoping to use the feedback to improve translation quality in the near future.
/r/LanguageTechnology
https://redd.it/11r0izu
DirectStorage - Loading data to GPU directly from the SSD drive, almost without using CPU
Hello forum!
Is it possible in major frameworks (TF or PyTorch) to load all data to GPU directly from a fast SSD disc, without using the CPU? Is it possible today? I read about it 2-3 years ago. As far as I remember, Linux was the first to deliver such an option.
I am talking about DirectStorage :
Nvidia: How to connect GPUs direct to SSDs for a speed boost • The Register
https://www.theregister.com/2022/03/14/nvidia_gpu_data/
GPUDirect Storage: A Direct Path Between Storage and GPU Memory | NVIDIA Technical Blog
https://developer.nvidia.com/blog/gpudirect-storage/
DirectStorage API: Does it allow direct SSD-to-GPU data streaming? | ResetEra
https://www.resetera.com/threads/directstorage-api-does-it-allow-direct-ssd-to-gpu-data-streaming.278642/
Is it possible today? I am asking because I am thinking about a fast disc, like Samsung 990 PRO 2TB.
I have read that NVME is needed, SATA3 is not sufficient.
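As far as I know the frameworks themselves don't do this out of the box, but RAPIDS kvikio wraps NVIDIA's cuFile (GPUDirect Storage) API for Python. A rough sketch, assuming a GDS-capable NVMe setup with kvikio/CuPy installed; the file path is hypothetical:
```
import cupy
import kvikio

buf = cupy.empty(1024 * 1024, dtype=cupy.uint8)   # destination buffer already in GPU memory
f = kvikio.CuFile("/data/shard0.bin", "r")        # hypothetical path on the NVMe drive
f.read(buf)                                       # reads from SSD into GPU memory, bypassing the CPU bounce buffer
f.close()
```
The resulting CuPy buffer can then be handed to PyTorch via DLPack without a host copy; without GDS support, the practical fallback is pinned host memory plus non-blocking transfers.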
Regards
/r/deeplearning
https://redd.it/13aks7g
Jetson/DeepStream Learning Resources
I recently got into the world of Nvidia DeepStream and Jetson modules and I'm finding it difficult to find solid explanations of how all of this stuff works together. Most of the videos on YouTube are marketing stuff, 2 hour long seminars, or people just flashing a Jetson module and then opening the demo apps.
I am a self-taught Python programmer of about 8 years but came from the 3D world, so I have a pretty abstract understanding of ML. I really want to understand this stuff on a lower level, but I feel like I am getting tripped up on the VAST amount of terms and tools out there in the Nvidia CV space.
So does anyone know of any really good resources for learning this stuff? I'm more of a visual learner so videos would be great, but well done documentation can sometimes be just as good.
I know there's the Nvidia DLI courses but most cost money and seem pretty pricey. Is it worth it or is it just more dry talking that leaves you more confused after? Any specific ones that people would recommend?
I would really appreciate any advice anyone can give. Feel free to dump as many links as you have saved lol. Sorry for the long post. Thanks in advance!
/r/computervision
https://redd.it/13867rb
How to get started in a language not that common, not in Latin script?
Hi all,
I have been trying to work on improving NLP for Gujarati and am at my wit’s end on how to begin/contribute.
The language is not exactly in a script that is common, and there’s a lot more that can be done here. I need help on what the right way/place to contribute would be.
My skills - English and Gujarati, Python, ML and NLP
Interests - trying to uplift Gujarati translation/audio/generative models from where it is today!
/r/LanguageTechnology
https://redd.it/13403we
“Track-Anything”: video object tracking & segmentation tool
https://github.com/gaomingqi/Track-Anything
Track-Anything is a flexible and interactive tool for video object tracking and segmentation. It is built upon Segment Anything and can track and segment anything specified via user clicks only. During tracking, users can flexibly change the objects they want to track or correct the region of interest if there are any ambiguities. These characteristics make Track-Anything suitable for:
- Video object tracking and segmentation with shot changes.
- Visualized development and data annotation for video object tracking and segmentation.
- Object-centric downstream video tasks, such as video inpainting and editing.
/r/computervision
https://redd.it/12z1av5
Segment Anything Model (SAM) explained in detail
Here is a video explaining the latest SAM model from Meta AI. It covers the model training, data engine, SA-1B data collection and finally the results. https://youtu.be/qa3uK3Ewd9Q
Hope it's useful.
/r/deeplearning
https://redd.it/12ssm42
Reminder: Use the report button and read the rules!
/r/MachineLearning
https://redd.it/120f4oy
[D] Simple Questions Thread
Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
Thanks to everyone for answering questions in the previous thread!
/r/MachineLearning
https://redd.it/12gls93
Meta's new Segment Anything Model Explained
https://youtu.be/bx0He5eE8fE
/r/deeplearning
https://redd.it/12dn9t5
The Loss Functions I used were DICE & Tversky. DICE is simply Intersection over Union between our predicted segmentation map and the ground truth segmentation map (code in repo & below).
def DICE_loss(input, target, eps=1e-5):
    '''
    Args:
        input:
            Predicted Tensor to gauge accuracy of. Same size as target.
        target:
            Target Tensor to use as ground truth. Same size as input.
        eps:
            Smoothing value to ensure no division by zero.
    Desc:
        DICE Loss function computes 1 - DICE coefficient. DICE coefficient
        is a representation of Intersection over Union. Formula is:
            2 * |Input && Target| / ( |Input| + |Target| )
        For |...| symbolizing cardinality of a set.
        Since input can include soft probabilities as well as hard 1/0,
        the cardinality of an input is the sum.
    Returns:
        Tensor containing 1 - DICE coefficient, optimal when minimized @ 0
    '''
    intersection = (input * target).view(input.shape[0], -1).sum(axis=-1)
    union = input.view(input.shape[0], -1).sum(axis=-1) + target.view(target.shape[0], -1).sum(axis=-1)
    return (1 - 2*intersection/(union + eps)).sum()
Tversky is similar, but more fine tuned. Tversky breaks down the Union term into False Positive + False Negative + True Positive. We can then add alpha & beta parameters to the False Positive & False Negative terms & guide our model's learning dynamically based on the mistakes it is making. Here is the code (also in repo).
def tversky_loss(input, target, eps=1, alpha=.5, beta=.5):
    '''
    Args:
        input:
            Predicted Tensor to gauge accuracy of. Same size as target.
        target:
            Target Tensor to use as ground truth. Same size as input.
        eps:
            Smoothing value to ensure no division by zero.
        alpha:
            Weight to put on False Positives. Higher value penalizes more.
            Value of .5 for alpha & beta results in standard DICE loss.
        beta:
            Weight to put on False Negatives. Higher value penalizes more.
            Value of .5 for alpha & beta results in standard DICE loss.
    Desc:
        Tversky Loss is DICE Loss (IoU) with separate weights put on
        False Positives and False Negatives. The Union calculation for
        the denominator is framed as:
            Union = True Positive + False Positive + False Negative
        This allows us to put separate weights on False Positives and
        False Negatives, leading to the calculation:
            Union = True Positive + alpha * False Positive + beta * False Negative
        Values of .5 for both parameters create the standard DICE loss.
        Values lie in domain (0, inf).
    Returns:
        Tensor containing 1 - Tversky coefficient, optimal when minimized @ 0.
    '''
    # Flattens mask to single binary image since all 3 channels are the same
    # for all masks in the batch
    target = target[:,0,:,:].reshape(-1)
    input = input.reshape(-1)
    true_pos = (input * target).sum()
    false_pos = ((1-target) * input).sum()
    false_neg = (target * (1-input)).sum()
    tversky_coef = (true_pos + eps) / (true_pos + alpha*false_pos + beta*false_neg + eps)
    return 1 - tversky_coef
The model, like I mentioned before, is a simple U-Net architecture that looks like this:
[Image created in NN-SVG & MS Paint](https://preview.redd.it/m6eo4v1pzura1.png?width=1236&format=png&auto=webp&v=enabled&s=7596f1d05baed4a8a613ab7b3d43181f2f4c41cc)
The PyTorch code for the model can be found in the [repo](https://github.com/Saswatm123/3D-Brain-Tumor-Segmentation-PyTorch).
Vectorised Object Mapping for Neural Field SLAM
By: Xin Kong, Shikun Liu, Marwan Taher, Andrew Davison (Dyson Robotics Lab, Imperial College London)
https://kxhit.github.io/vMAP
TL;DR: We present vMAP, an object-level real-time mapping system, with each object represented by a separate MLP neural field model, and object models are optimised in parallel via vectorised training.
We present vMAP, an object-level dense SLAM system using neural field representations. Each object is represented by a small MLP, enabling efficient, watertight object modelling without the need for 3D priors. As an RGB-D camera browses a scene with no prior information, vMAP detects object instances on-the-fly, and dynamically adds them to its map. Specifically, thanks to the power of vectorised training, vMAP can optimise as many as 50 individual objects in a single scene, with an extremely efficient training speed of a 5Hz map update. We experimentally demonstrate significantly improved scene-level and object-level reconstruction quality compared to prior neural field SLAM systems.
/r/computervision
https://redd.it/126scqw
Should I specialize in NLP considering the advent of Large Language Models?
I feel that most cutting-edge research work is being done in a handful of companies. In that case, what does the future look like, say 5 years down the line, for somebody specialising in NLP research? It seems like models like ChatGPT can do many NLP tasks and are so far ahead of the curve that it will be difficult to beat them. What do job prospects look like in NLP?
/r/LanguageTechnology
https://redd.it/121gv4c
OpenAI Whisper: How to not include silences in the timestamps returned?
Hello
I'm using Whisper, and the timestamps include the silence.
With a video where the speaker starts his speech at sec 10, I'm getting the first timestamp at sec 1 instead of sec 10.
Here is my config:
POST /v1/audio/transcriptions
Config
```
{
  model: "whisper-1",
  file: "...mp3",
  response_format: "srt",
  prompt: "Hello, welcome to my lecture"
}
```
Output:
```
1
00:00:01,000 --> 00:00:14,000
Why are there both successful and struggling entrepreneurs?

2
00:00:15,000 --> 00:00:23,000
Many customers prefer to watch videos to enjoy online content.

3
00:00:24,000 --> 00:00:32,000
an other sentences.
```
* I believe `1` should be `00:00:10,000 --> 00:00:14,000`, since no one is talking at all for the first 10 sec.
* Also, for `3`, the speaker starts talking again at sec 28, but I'm getting the timestamp at sec 24. The silence is simply included in the timestamp with Whisper.
Any idea how I could fix that, maybe using a prompt?
Thanks!
/r/LanguageTechnology
https://redd.it/11xdnvd
Fine tune BERT for domain-specific information retrieval.
Hi guys, I'm a little lost on how to start a little side project.
So I want to take a BERT model, fine-tune it on additional information about a specific domain which it was not initially trained on, and then it should be able to answer questions regarding that topic. The way I understand it, I would need to put an additional question answering head on top of the fine-tuned model in order for it to be able to answer questions and not just put out "random" sentences related to my query. Is this thinking correct?
I question this because all I find on the internet is fine-tuning a model on QA data, that is, a labeled dataset with questions and answers. My dataset, on the other hand, consists of only text data, hence the title "information retrieval".
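For what it's worth, in Hugging Face transformers a "question answering head" is just an extra span-prediction layer on top of the encoder, so the thinking is right for extractive QA; note, though, that fine-tuning on raw domain text (MLM) only adapts the encoder and leaves that head untrained. A minimal sketch, using a plain BERT checkpoint as an example:
```
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The QA head (a start/end span classifier) is added on top of BERT and is
# randomly initialized here; it still needs labeled question/answer spans,
# or you switch to a retrieval-style setup with sentence embeddings instead.
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
```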
Thanks for your insights!
/r/LanguageTechnology
https://redd.it/11sxkj0
Is deep learning really so annoying?
I'm currently a master's student considering doing a PhD in computer vision afterwards (I haven't fully decided yet).
I really like 3D geometry. I also find some recent works combining 3D geometry with deep learning very elegant.
However, training deep networks annoys me. I feel like it blocks my productivity and I don't have the patience to wait for days to see how one parameter change affects the model. It seems that I don't produce much, I mostly test. Also, as a NN is a black box, its behaviour seems super random.
I'm wondering:
1. Is this the sign that I should simply look for a 3D geometry job in industry? Or can it be that the frustration goes away when one gains experience in deep learning?
2. Over time, does one develop better intuition in deep learning about what works and what doesn't or does it remain largely guesswork?
Thanks in advance!
/r/computervision
https://redd.it/11pqwoi