Reddit DevOps. #devops Thanks @reddit2telegram and @r_channels
What are the types of things you do as a DevOps manager?
I'm assuming some of the people that work here are in Management Roles. And I get the general gist of it, but what have you been up to the past year, maybe something concrete, any stumbling blocks. Just looking to hear some stories.
https://redd.it/1ls69qd
@r_devops
Looking for a small team to build and learn together this summer
Hey r/devops,
I’m hoping to find a few people interested in teaming up to work on a practical project this summer. Something hands-on around infrastructure, automation, or tooling, where we can learn from each other and get real experience.
I’ve been mostly working with cloud tools and some scripting lately, but want to try collaborating with others instead of working solo. No pressure or fancy plans, just a group of folks who want to build and improve together.
If this sounds like your vibe, please reply or DM. I’d love to hear what you’re working on or want to try.
https://redd.it/1ls2k8h
@r_devops
4-month global builder challenge for DevOps engineers — teams, mentorship, grants, and prizes
Hey r/devops,
Wanted to share an opportunity that might resonate with those who enjoy building scalable, reliable infrastructure and automated pipelines.
The **World Computer Hacker League (WCHL)** is a 4-month global builder challenge focused on **open internet infrastructure, AI, and blockchain**. Many teams are working on projects involving deployment automation, infrastructure as code, CI/CD pipelines, monitoring, and decentralized ops tooling.
Here’s what’s on offer:
* 👥 Team-based projects only — no solo entries, but you can find teammates on Discord
* 🧠 Weekly workshops and mentorship from experienced engineers
* 💰 Grants, bounties, and milestone-based rewards
* 🌍 Open to students and independent engineers worldwide
* ⚙️ Tech and stack-agnostic — build with the tools and frameworks that fit your vision
If you’re interested in applying DevOps best practices to decentralized systems, automating cloud deployments, or managing secure infrastructure at scale, this could be a great place to experiment and build.
📌 If you’re in **Canada or the US**, register through **ICP HUB Canada & US** so we can support you directly during the challenge:
https://wchl25.worldcomputer.com?utm_source=ca_ambassadors
Feel free to reach out if you want to discuss project ideas or find collaborators. Would love to see some strong DevOps projects in the lineup!
https://redd.it/1lryojc
@r_devops
What is GitOps: A Full Example with Code
https://lukasniessen.medium.com/what-is-gitops-a-full-example-with-code-9efd4399c0ea
https://redd.it/1lrsn3k
@r_devops
Devops as a college student
I have DevOps as an ability enhancement course, and the next semester starts in mid-August, so I have approximately 1.5 months. Where should I learn DevOps so that I can apply these skills by the end of the semester?
https://redd.it/1lrq9g7
@r_devops
CKA / CKS discussions
Hi guys, I’m preparing to take the CKA cert, and after that I’ll be preparing for the CKS.
I would like to know if there is some sort of Discord, group discussion of any kind, or even people interested in sharing some knowledge and brainstorming for the exam.
Thanks!
https://redd.it/1lrkb49
@r_devops
Helping an AI engineer friend get DevOps skills, what roadmap would you suggest?
Hey r/devops 👋
I’m a DevOps/SRE engineer and I want to help a good friend of mine who works in AI/ML but is struggling to land better roles — a lot of AI engineering jobs now ask for:
* Kubernetes
* CI/CD pipelines
* Containers (Docker/Podman)
* Infrastructure-as-Code (Ansible, Terraform)
* Some Linux and networking knowledge
He’s strong in Python and ML frameworks but lacks hands-on experience with infrastructure, automation, and deployment workflows.
I’d like to design a **series of enablement sessions** (maybe 1–2 hours per week for a few months) where we do **hands-on, real-world DevOps tasks** together. My current rough plan looks like this:
1. Linux & basic networking tools (SSH, systemd, DNS, etc.)
2. Digital certificates (OpenSSL, TLS, HTTPS intros)
3. Containers (Dockerfiles, Podman, images, volumes)
4. CI/CD with GitLab or GitHub Actions (test, build, deploy pipelines)
5. IaC with Ansible and Terraform (just enough to be productive)
6. Kubernetes (local setup with kind/minikube, basic manifests, Helm)
7. Secrets management (Vault, sealed-secrets, etc.)
8. Monitoring/logging basics (Prometheus, Grafana, Loki)
**Questions for you all:**
* What would *you* add or remove?
* Any good beginner-friendly but realistic projects to tie this together?
* How would you avoid overwhelming him while still covering what matters?
* Any great open-source repos or free hands-on labs you’d recommend?
Thanks in advance for any suggestions — really want to set him up for success! 🙏
https://redd.it/1lris69
@r_devops
I wrote a tool to prevent OOM-killed builds on our CI runners
Hey /r/devops,
I wanted to share a solution for a problem I'm sure some of you have faced: flaky CI builds caused by memory exhaustion.
The Problem:
We have build agents with plenty of CPU cores, but memory can be a bottleneck. When a pipeline kicks off a big parallel build (`make -j`, `cmake`, etc.), it can spawn dozens of compiler processes, eat all the available RAM, and then the kernel's OOM killer steps in. It terminates a critical process, failing the entire pipeline. Diagnosing and fixing these flaky, resource-based failures is a huge pain.
The Existing Solutions:
Memory limits (`cgroups`/Docker/K8s): We can set a hard `memory` limit on the container or pod. But this is a kill switch. The goal isn't just to kill the build when it hits a limit, but to let it finish successfully.
Reduce Parallelism: We could hardcode `make -j8` instead of `make -j32` in our build scripts, but that feels like hamstringing our expensive hardware and slowing down every single build just to prevent a rare failure.
My Solution: Memstop
To solve this, I created Memstop, a simple `LD_PRELOAD` library written in C. It acts as a lightweight process gatekeeper.
Here’s how it works:
1. You preload it before running your build command.
2. Before `make` (or another parent process) launches a new child process, Memstop hooks in.
3. It quickly checks `/proc/meminfo` for the system's available memory.
4. If the available memory is below a configurable threshold (e.g., 10%), it simply sleeps and waits until another process has finished and freed up memory.
The result is that the build process naturally self-regulates based on real-time memory pressure. It prevents the OOM killer from ever being invoked, turning a flaky, failing build into a reliable, successful one that just might take a little longer to complete.
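The check-and-wait logic from the steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Memstop's actual code (which is a C library loaded via `LD_PRELOAD`); the function names, the polling interval, and the injectable `read_meminfo`/`sleep` parameters are assumptions made for the example.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_percent(text):
    """Return MemAvailable as a percentage of MemTotal."""
    fields = parse_meminfo(text)
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def wait_for_memory(threshold_percent=10.0, poll_seconds=1.0,
                    read_meminfo=None, sleep=time.sleep):
    """Block until available memory is at or above the threshold.

    read_meminfo and sleep are injectable (hypothetical parameters) so the
    loop can be exercised without touching the real /proc filesystem.
    """
    if read_meminfo is None:
        def read_meminfo():
            with open("/proc/meminfo") as f:
                return f.read()
    while available_percent(read_meminfo()) < threshold_percent:
        sleep(poll_seconds)
```

A gatekeeper would call `wait_for_memory()` just before spawning each child process, so new compiler jobs start only when memory pressure has eased.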
How to Integrate it:
You can easily integrate this into your `Dockerfile` when creating a build image, or just call it in the `script:` section of your `.gitlab-ci.yml`, `Jenkinsfile`, GitHub Actions workflow, etc.
Usage is simple:
export MEMSTOP_PERCENT=15
LD_PRELOAD=/usr/local/lib/memstop.so make -j32
I'm sharing it here because I think it could be a useful, lightweight addition to the DevOps toolkit for improving pipeline reliability without adding a lot of complexity. The project is open-source (GPLv3) and on GitHub.
Link: https://github.com/surban/memstop
I'd love to hear your feedback. How do you all currently handle this problem on your runners? Have you found other elegant solutions?
https://redd.it/1lrg0x4
@r_devops
How do you enforce steps across all of your org's pipelines?
I'm using Azure DevOps but I guess that question works for other platforms too.
How do you make sure every build pipeline runs, for example, a CVE scan? Is there some kind of policy as code that sets rules for all pipelines?
https://redd.it/1lrcbrs
@r_devops
Moving from Jenkins to Harness, any advice and experience you could share?
So I have to learn more about Harness, and our org is moving from Jenkins to Harness.
Some pain points I have heard are that it doesn't work as easily with Terraform as Jenkins declarative pipelines do, that build artifacts do not persist within the same build run, and that you have to post/copy artifacts to S3 (for example) after or as part of the build in order to persist them beyond a pipeline run. I really hope the last 2 items on artifact persistence are not accurate.
If it does not work so smoothly with Terraform, is that because Harness is so brand new and thus underdeveloped/under supported, or so that they can get you more dependent on their ecosystem and moving away from Terraform (or both)?
Just sharing here in case anyone has any advice or anything they might caution about such a move in general, and those 3 points above. I like the declarative pipeline approach, and now there's a lot of clicking and UI work here (and apparently lots and lots of yaml).
Harness looks like it is highly configurable, but also over-engineered. We use GitHub for code repository by the way.
PS: Is the best way to learn - outside of simply using it - their free courses or just going straight to doc reading? Not sure which might be more well done.
https://redd.it/1lqynqk
@r_devops
Looking for DQL/USQL Query Examples - Mobile App Focus
Hey everyone! Just started using Dynatrace and I'm looking for some solid DQL and USQL queries that work well in practice.
Coming from New Relic, I really miss their dedicated community forum where users shared queries that we could use to build custom dashboards. Does something similar exist for Dynatrace? If so, please point me in the right direction!
Our environment is very mobile app heavy, and while I'm super jealous of all the amazing out-of-the-box backend service and infrastructure dashboards that DT provides, I'm struggling to find good mobile-focused examples.
Would love to see queries for:
* Mobile app performance metrics
* User experience monitoring
* Crash analytics
* Network performance for mobile
* Custom mobile KPIs
Any recommendations for query repositories, community resources, or your personal go-to queries would be hugely appreciated!
Thanks in advance! 🙏
https://redd.it/1lqz3l0
@r_devops
Got Rejected from Amazon DevOps Role — How Can I Level Up My Scripting and Interview Skills?
I got an opportunity to interview for a DevOps role at Amazon. The process started with an OA, which had basic logic questions, some Linux commands, Docker basics, and behavioral questions.
After a week I got a call from the recruiter, and she told me about the onsite interviews ahead. The first round was a live coding round. It was mostly DSA and OOP; the questions were easy to medium, I would say: a binary search, a prefix/suffix multiplication problem, and the pillars of OOP. As this role was around JDKs, the interviewer also asked about basic Java things like final, finally, and finalize, and about the diamond problem in inheritance and how to deal with it. The first round went quite well, and I qualified for the next round.
The next round was a scripting and troubleshooting round. The interviewer asked whether I was aware that this was a position requiring around 2+ years of experience; I said yes, I was quite aware of that, and then he started questioning me. I won't say that I am the best at Bash scripting, but I know my way around.
I was able to give him scripts for accessing files and logs and other basic stuff, but he kept asking me if this was the best approach. I honestly told him that, from my experience and knowledge, these scripts would work, but that I was also sure there might be a better approach.
Obviously he has been working at Amazon for 5+ years and must have more hands-on experience, but my scripts were not up to par according to him. Within a week I got the rejection mail.
So now I want to ask all those who read through my rant: how do I improve my scripting skills, given that I mostly use things like Python and AWS CDK at my work? And what else should I do if the interviewer doesn't approve of my answer?
TL;DR:
Cleared Amazon OA and first live coding round (DSA + Java OOPs), but got rejected after the scripting/troubleshooting round. Interviewer felt my Bash scripts weren’t optimal, though they worked. I was honest about my approach and limitations. I usually work with Python and AWS CDK. Now I’m looking for solid ways to improve my Bash scripting and handle tough interviewer pushback better. Any advice?
https://redd.it/1lqwo4l
@r_devops
Deployment environment from scratch - OpenTofu or Terraform?
Hello friends,
some time ago, I started a new job in a company providing a SaaS platform + some customer managed installations on various cloud providers. The entire infrastructure is deployed and managed through Ansible. Recently we started a project for a new platform which will be hosted entirely in Azure, our first time with this provider, and I started designing the infrastructure and integration into our deployment env. This became a huge pain pretty quickly. Ansible modules for Azure have a lot of missing functionality and bugs and, as should come as a surprise to no one, Ansible itself is not really suitable for IaC.
I finally managed to convince my superior to build a new deployment environment from scratch, with Terraform/OpenTofu for IaC and Ansible for config management on top, but I have no experience with either of them.
Would you choose Terraform or OpenTofu? Did you switch from one to the other? - And why?
I know some comparisons can be found online, but I'm more interested in real world experiences.
https://redd.it/1lqslqs
@r_devops
On-prem deployment for a monolith with database and a broker
I have been looking into the deployment cycle of our application, currently we are deploying to just normal Windows Client OS but I really don't like the idea of whole manufacturers relying on windows.
We really just want to deploy the system and leave it be, maybe for particular clients we want to watch how they are using the system, for example some new features etc with just some basic OpenTelemetry or something.
Currently we are deploying by installing manually the database and the broker and configuring them manually and then just use github runners for the actual deployment to IIS. We have no actual way to view telemetry data on production systems which I would like to have since I want to know how the users are interacting with our system.
I have already set up Aspire for local development which is really nice imho but the deployment options from there are just kubernetes which is overkill in my opinion.
I have looked into portainer which is a really nice option but it is really expensive in my opinion, what I'm left with is either moving to linux server + docker compose, linux server + native deployment or just continue what we are currently doing.
Also note that we do not have many clients, and Windows Client OS has been a problem for us in the past, for example with updates and the fact that some clients are running Windows 10, which is being deprecated in October/November.
I'm not sure which way we should go. What are others currently doing for on-prem deployments?
https://redd.it/1lqoqz4
@r_devops
How do you identify new attack vectors that target your cloud setup?
Cloud security is a whole different beast compared to on-prem, isn't it? It feels like you're constantly trying to keep up with new services, features, and configurations across multiple accounts or even different providers. The sheer scale and rapid pace of change can make it incredibly difficult to ensure every corner of your environment is locked down and compliant, leading to that nagging feeling that something might be overlooked.
Whether it's managing endless IAM policies, keeping tabs on configuration drift, or just getting a truly unified view of your risks, there's always something that feels like an uphill battle. What's the one aspect of cloud security posture management that consistently gives you the biggest headache? Appreciate any insights you can share!
https://redd.it/1lqlymi
@r_devops
Suggestions Required: How are you handling alerting for high-volume Lambda APIs without expensive tools like Datadog?
I run 8 AWS Lambda functions that collectively serve around 180 REST API endpoints. These Lambdas also make calls to various third-party services as part of their logic. Logs currently go to AWS CloudWatch, and on an average day, the system handles roughly 15 million API calls from frontends and makes about 10 million outbound calls to third-party services.
I want to set up alerting so that I’m notified when something meaningful goes wrong — for example:
* Error rates spike on a specific endpoint
* Latency increases beyond normal for certain APIs
* A third-party service becomes unavailable
* Traffic suddenly spikes or drops abnormally
I’m curious to know what you all are using for alerting in similar setups, or any suggestions/recommendations — especially those running on Lambdas and a tight budget (i.e., avoiding expensive tools like Datadog, New Relic, CW Metrics, etc.).
Here’s what I’m planning to implement:
* Lambdas emit structured metric data to SQS
* A small EC2 instance acts as a consumer, processes the metrics
* That EC2 exposes metrics via `/metrics`, and Prometheus scrapes it
* AlertManager will handle the actual alert rules and notifications
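As a rough sketch of the consumer step in that plan, the aggregation and the `/metrics` body could look something like the following. The event shape (`endpoint`, `status`, `latency_ms`) and the metric names are assumptions for illustration, not anything the post specifies; the SQS polling and the HTTP server are left out, and label values are not escaped.

```python
from collections import Counter


class MetricsAggregator:
    """Aggregate per-request events into counters for a /metrics endpoint."""

    def __init__(self):
        self.requests = Counter()        # (endpoint, status class) -> count
        self.latency_sum_ms = Counter()  # endpoint -> summed latency in ms

    def record(self, event):
        """Fold one metric event (hypothetical shape) into the counters."""
        endpoint = event["endpoint"]
        status_class = f"{event['status'] // 100}xx"
        self.requests[(endpoint, status_class)] += 1
        self.latency_sum_ms[endpoint] += event["latency_ms"]

    def render(self):
        """Render the counters in the Prometheus text exposition format."""
        lines = ["# TYPE api_requests_total counter"]
        for (endpoint, status_class), count in sorted(self.requests.items()):
            lines.append(
                f'api_requests_total{{endpoint="{endpoint}",status="{status_class}"}} {count}'
            )
        lines.append("# TYPE api_latency_ms_sum counter")
        for endpoint, total in sorted(self.latency_sum_ms.items()):
            lines.append(f'api_latency_ms_sum{{endpoint="{endpoint}"}} {total}')
        return "\n".join(lines) + "\n"
```

Keeping only counters on the consumer leaves the rate and ratio math (for example, the share of 5xx responses per endpoint) to Prometheus, where the alert rules live.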
Has anyone done something similar? Any tools, patterns, or gotchas you’d recommend for high-throughput Lambda monitoring on a budget?
https://redd.it/1ls42jv
@r_devops
Ship tools as standalone static binaries
After OpenAI decided to rewrite their CLI tool from TypeScript to Rust, I decided to post about why static binaries are a superior end-user experience.
I presumed it was obvious, but it seems it isn't, so I wrote in detail about why tools should be shipped as static binaries.
https://redd.it/1ls28s6
@r_devops
How do you manage environments in Helm charts?
I always like to write my helm charts as if they might be released publicly, meaning no company/domain-specific logic in the chart. I usually have environment-specific values-<env>.yaml files living in a separate gitops repo. The issue with this is that it doesn't scale, because these values-env.yaml need to exist for every environment. They typically contain values that could be derived from the environment name, e.g. hostnames for ingresses which contain the environment name, references to secrets with the environment name etc. This means when something changes there's a lot of strings to update. Now I could just add a variable named 'env' or something to the chart, construct the strings I need from that, and call it a day, but this would couple the chart to our particular setup. I don't want to maintain a separate chart just for internal use. How do you handle this?
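One possible direction, sketched here only as an illustration: keep the chart generic and generate the environment-derived values in the gitops repo, then layer hand-written overrides on top. The key names (`ingress.host`, `secrets.name`), the `example.com` domain, and the `myapp` name are all hypothetical.

```python
def derive_values(env, base_domain="example.com", app="myapp"):
    """Build the Helm values that follow mechanically from the environment name."""
    return {
        "ingress": {"host": f"{app}.{env}.{base_domain}"},
        "secrets": {"name": f"{app}-{env}-credentials"},
    }


def merge(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With this shape, a per-environment file only needs to hold what truly differs (say, replica counts), and a naming change is made once in `derive_values` rather than in every values file.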
https://redd.it/1lrv9c2
@r_devops
Where do you use Go over python
I've been working as DevOps, whatever that means, for many years now and even though I do see the performance benefits of using Go, there was hardly any scenario where it seemed like a better option than a simpler language such as Python.
There is also the fact that I would like my less experienced team members to be able to read the code easily.
Despite all that, I'm seeing more and more job ads asking for Go skills.
Is there something I'm missing or is it just a trend that will fade?
https://redd.it/1lrs5mx
@r_devops
CKS 2025 out of killer.sh questions
Hey guys, I'm going to take my CKS exam in 3 days. I'm getting through the mock exams pretty fast and I can complete the killer.sh mock exam. The thing is, I know that it covers about 80% of the real exam. Does OPA come up? Or do you remember any tricky questions (for example, the /dev/mem Falco rule one)?
https://redd.it/1lrptpv
@r_devops
Is Using AI web builders a Good way to learn web development?
I am a beginner, and every time I look for material to learn web development it feels really overwhelming.
So I thought to myself, why not learn web dev while using AI web builders: prompt it to do something, then study the code to see how and why it did it that way.
Not sure it's a smart way to do it, but yeah.
Also, what are the best options out there that I can use?
Thanks in advance
https://redd.it/1lrjjqx
@r_devops
Update on my project going global and being taken over by another team
Original post
---
Had a meeting with my manager where he gave me more context to the whole situation.
Turns out the team trying to reverse-engineer my work is entirely from a company we recently acquired. They first tried getting the code from my manager, but he stalled by telling them to go through proper channels first by having their manager contact our regional manager (his N+2). At the same time, my manager reached out to our regional manager behind the scenes informing them what happened, and the reply he got back was literally "…"
Eventually, their manager formally asked our regional manager for permission to "expand this innovation globally." Our regional manager replied saying similar discussions were already underway between us and another region but that we could "definitely" find some time if capacity allows it.
My manager showed me all these emails and said that the go-ahead has essentially been given. He also mentioned that this new team needs a win since our company is currently making layoffs in the newly acquired division. The project they've taken from us could help shield them from being affected. Said it's better they support the global rollout anyways since when we worked on it, he had in mind that it's a project with a start and end. Told me to not treat it like my baby as "it's grown up now and leaving." He also then bluntly said in this company only your manager and your N+2 matter when it comes to career growth, salary, and promotions. No one else will help you besides sending a thank-you email.
So I asked if the global impact of my project could justify renegotiating my recent salary raise. Note that I was informed of this raise just a week ago, before corporate leadership saw my work and requested a global rollout. I asked if it was possible for a job grade bump (guaranteeing me an additional 10% raise). He swiftly declined, saying it was too soon, and a job grade promotion on top of my 15% merit-based increase would cause a ruckus as other managers in his team would start questioning why I got both an increase and promotion 10 months into the job. Note that promotions and raises happen in the same period, so now I'll have to wait another 12 months until I can "officially" renegotiate. And yes, while 15% might seem significant in certain countries perhaps, it's actually not a substantial amount where I come from and thus won't feel a difference.
He ended by telling me to support them as much as possible so they don't end up complaining to their manager, who would then escalate it to the corporate leadership. And so I've been holding 1-2 hour long workshops and updating the documentation with even more intricacy so that it can serve as a global reference point to even the technically-limited. And hey, at least this documentation will show my name and contributions when future people reference it I guess.
TL;DR My work is going global, I'll have to support it in the very short term, but looks like I won't get much out of it. Looking around the market in the meantime and will probably jump ship if I land a 25–30% salary bump
https://redd.it/1lrjxdd
@r_devops
Update on My CLI Tool- Smarter Suggestions, Safer Commands, and History Navigation!
https://www.reddit.com/gallery/1lr2q1v
https://redd.it/1lrdr8t
@r_devops
iOS 18 finally allows real time caller ID. I built a privacy focused app to use it
Apple never allowed real time caller ID on iPhone until iOS 18. With the new API, it finally became possible to surface caller info as the phone rings.
I built Livecaller, a lightweight app that uses a new API to display real time caller ID without requiring an account or contact access. Everything runs locally on device, and the experience is designed to be as privacy friendly as possible.
Launched today on Product Hunt:
🔗 https://www.producthunt.com/posts/livecaller-3?utm_source=other&utm_medium=social
Happy to answer questions if anyone’s curious about the API, performance considerations, or build process.
https://redd.it/1lrbfo8
@r_devops
Ansible-Nexus, Automated setup of Sonatype Nexus with SSL/TLS
https://github.com/gebz97/ansible-nexus
Please give it a try and tell me what you think:)
https://redd.it/1lr1abj
@r_devops
Skipping builds on push to primary branch? Jenkins and Bitbucket
What’s the best or most common release build practice for build tools that auto-increment a version number?
We have builds with `gradle-release` and/or `npm version` that do the major/minor/patch + snapshot edits of their various properties or JSON files. With an Org folder and multi-branch pipeline, we get the webhook event and the builds happen just fine. But then the build automation commits and pushes the version change back to the primary branch… and another event triggers another build.
We’ve put in shared library code to abort the build based on author or commit message, but that seems inelegant and causes the “last build” to always appear aborted.
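The author-or-message check described above might look like the following predicate. It is written in Python purely for illustration (a Jenkins shared library would be Groovy); the bot account name `ci-bot` and the `[ci skip]` marker are hypothetical.

```python
def should_skip_build(author, message, bot_authors=("ci-bot",), marker="[ci skip]"):
    """Return True when a commit looks like the build's own version bump."""
    return author in bot_authors or marker in message.lower()
```

The predicate only decides whether to skip; it does nothing about the "last build appears aborted" problem, which depends on how the job ends once the check returns true.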
The readme on `github-scm-trait-commit-skip` and `bitbucket-scm-trait-commit-skip` (same code base) states:
>
This seems to exactly exclude what seems to me to be the very reason for such a filter.
Am I doing it wrong? Is the idea of a release build from the primary branch all backwards? If I want a PR approval to trigger a release build, what is the rest of the world doing that I’m missing?
https://redd.it/1lqwatp
@r_devops
Has anyone here transitioned from contractor to FTE at Google in a DevOps role?
Hi everyone,
I’m currently working as a contractor at Google in a DevOps position. It’s been my long-time dream to become an FTE at Google, and I’m curious to know if anyone here has successfully made that transition.
If you have:
• What did your journey look like?
• Did you get converted internally, or did you reapply and go through the regular FTE hiring process?
• Any tips for standing out as a contractor?
• How did you prepare — technically or otherwise — to clear the FTE interviews?
• Any pitfalls or gotchas I should watch out for?
I’d really appreciate any advice or personal stories. This community’s insights would mean a lot as I try to plan my next steps!
Thanks so much in advance!
https://redd.it/1lqsr6b
@r_devops
Sharing a template for deploying Python(Django) apps to Kubernetes
Link: https://github.com/denibertovic/hellok8s-django/
Just sharing in case anyone finds this useful or educational.
The emphasis isn't on the app code itself (although there are a few best practices there as well) but rather on the surrounding devops tooling (nix/devenv for local environment, sops for secrets management, helm, kubernetes and github actions etc). And everything is pretty much transferable to other stacks...I'll probably do nextjs ... just need to polish a few things. Maybe I do one for actually setting up a cluster...but haven't decided yet.
I've been doing this for a long time so all of this is kind of second nature at this point and I sometimes feel silly sharing.... but friends tell me there's quite a lot of stuff in there to get their heads around. So anyway, yeah hope you find it useful.
https://redd.it/1lqrvmt
@r_devops
Building a Tool to Automate Architecture Diagrams – I’d Love Your Feedback!
Hi everyone!
As the title says, I'm building this tool to help developers save hours on creating technical diagrams.
Right now, it can generate diagrams for AWS, Azure, and Google Cloud.
I'd love for you to try it out and share your honest feedback—what worked well and what didn’t. Your input will really help me improve the tool!
It’s completely free to use :)
Here’s the link: https://www.rapidcharts.ai/
ps: The next step, once I’m confident the diagram generation works well, is to have it automatically update based on the codebase!
https://redd.it/1lqlco6
@r_devops
Cloud to Local Server - Should we do Openstack?
Hi,
I work at a startup with a small platform team who are currently running on AWS cloud. We rely on AWS mostly for Aurora Mysql, EKS, Load Balancers. We also have Site-to-Site VPNs, DXs but they are confined to higher environments. We use Kafka for queues but we manage it on our own using strimzi kafka cluster in the EKS cluster. Similarly we also manage our own observability and siem solutions deployed in the EKS cluster.
Recently we have been contemplating moving our lower test environments out of the cloud to save a few thousand dollars a month. Our customers would also be happy at the EOD, as we usually pass the cloud bill on to them. So I'm stuck with the questions below:
1. If we were to do this and move out of cloud for lower environments:
1. Should we look at solutions like OpenStack, since we would want an exact replica of the environment we have in AWS, so that devs get the same environment and everyone can find platform-related bugs? Or will this overcomplicate things for us?
2. Instead of OpenStack should we deploy our own EKS cluster and Mysql somehow and manage the rest of the things like we already do in AWS.
2. Should we not go to bare-metal and instead move the lower environments to cheaper clouds like DigitalOcean?
3. Should we even do this? Are the cost savings not worth the effort that the platform team puts in managing multiple cloud/bare-metal environments? Currently we pay around 3-5k USD per month in AWS costs for test environment per customer.
PS: We are a team of 4 engineers who manage devops, cloud, db management and kafka automation frameworks, observability and siem.
Thanks in advance for your insights.
https://redd.it/1lqln1d
@r_devops