![Reddit logo](https://static.remoteliz.com/static/companies/company-redditinc.com-logo.jpg)
Staff Software Engineer - ML Training Platform
Reddit
Job Summary
As a Staff Software Engineer, Training Platform, on Reddit's Machine Learning Platform team, you will work on foundational ML infrastructure that powers key features like Feeds Ranking and Recommendations. Your role involves building systems that enable machine learning engineers and data scientists, improving the ML software development lifecycle, and delivering a self-service platform for continuous iteration on ML models. You'll lead the design of high-performance solutions, optimize large-scale workflows, mentor team members, and work with management on strategic goals. The position offers comprehensive benefits including healthcare, 401k matching, workspace support, professional development funds, family planning support, flexible vacation, wellness days, parental leave, and paid volunteer time off.
Company Benefits
- ✓Comprehensive Healthcare Benefits
- ✓401k Matching
- ✓Workspace benefits for your home office
- ✓Personal & Professional development funds
- ✓Family Planning Support
- ✓Flexible Vacation
- ✓Reddit Global Wellness Days
- ✓4+ months paid Parental Leave
- ✓Paid Volunteer time off
Location: This role is completely remote-friendly. If you happen to live close to one of our physical office locations, our doors are open for you to come into the office as often as you'd like.
Who We Are: The Machine Learning Platform team at Reddit is a high-impact team that owns the infrastructure powering recommendations, content discovery, and user and content quantification, while directly impacting other teams such as Growth, Ads, Feeds, and Core Machine Learning.
What You’ll Do: As a Staff Software Engineer, Training Platform, you will work on our wider Machine Learning Platform team and be instrumental in architecting, implementing, and maintaining the foundational ML infrastructure that powers Feeds Ranking, Content Understanding, Recommendations, and much more, in service of Reddit’s mission of bringing community and belonging to everyone in the world. You will build systems and tools that enable machine learning engineers (MLEs) and data scientists (DSs), and you will continuously improve the ML software development lifecycle. You will deliver a self-service ML platform that enables the continuous iteration and improvement of systems that use ML techniques including Deep Learning, Natural Language Processing, Recommendation Systems, Representation Learning, and Computer Vision.
- Lead the building, testing, and maintenance of ML infrastructure at Reddit
- Propose, design, and implement high-performance ML platform solutions that significantly advance the deployment of models serving millions of redditors and that provide a seamless experience for MLEs
- Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
- Design and implement solutions that significantly advance the architecture of the ML Platform
- Analyze bottlenecks in distributed systems and optimize for performance and cost-efficiency
- Work with management on team goal setting, planning, and de-risking project execution
- Mentor other team members in adopting a rigorous DevOps approach to maintain and improve the health and quality of ML platform components and services
Who You Might Be:
- 8+ years of work experience in a production software development environment or building data systems, plus a degree in ML, Engineering, Computer Science, or another relevant discipline
- Experience with the design and architecture of large-scale ML systems
- Experience with ML frameworks such as TensorFlow, PyTorch, or JAX
- Experience with training workflows, hyperparameter tuning, and resource optimization on CPU and GPU
- Experience with MLOps practices and tools such as Ray and MLflow
- Hands-on experience with Kubernetes, Docker, or other container orchestration systems
- Experience building production-quality, object-oriented code in Python and/or Go that incorporates testing, evaluation, and monitoring
- Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems
Benefits:
- Comprehensive Healthcare Benefits
- 401k Matching
- Workspace benefits for your home office
- Personal & Professional development funds
- Family Planning Support
- Flexible Vacation (please use it!) & Reddit Global Wellness Days
- 4+ months paid Parental Leave
- Paid Volunteer time off