ICONIQ Capital · San Francisco, CA

Mujtaba Jafri — IT Engineer · AI & Infrastructure

Currently building AI-powered cybersecurity and automated access management at ICONIQ Capital. Previously embedded with Meta's Physics AI research team — running a 64-node, 512-GPU NVIDIA DGX cluster and executing dual 10G→100G WAN/LAN upgrades while researchers kept shipping.

View selected work Get in touch

Years

11+

GPUs

512

Backbone

100G

Meta rank

Top 5%

mujtaba@meta-dgx-01 — zshlink up · 100G

mujtaba@meta-dgx-01:~$

dgx-cluster · sausalito

64 nodes · 512 GPUs

util

node-01 → node-64 live workload

GPUs maintained

QSFP modules terminated

Tickets closed @ Meta

Customer satisfaction

// experience

Where I've kept the lights on.

Seven years across AI research, fintech, enterprise, and healthcare. Each role chosen for the same reason: complex infrastructure, no excuse for downtime.

ICONIQ Capital

IT Engineer

Mar 2026 — Present

San Francisco, CA

Current

Status

▸Building custom automation for distribution list and access management with Slack approval integration — replacing manual IAM workflows with self-service tooling.
▸Integrating AI into cybersecurity operations for phishing detection and prevention — leveraging ML models to analyze email patterns and flag threats in real time.
▸Developing AI-powered documentation and knowledge base integration for Jira — enabling intelligent search and auto-suggested resolutions from historical tickets.
▸Full-spectrum IT engineering across a premier investment firm — endpoint management, network operations, security tooling, and infrastructure automation.

AutomationAI/MLCybersecuritySlack APIJiraDL/Access Management

Network & Infrastructure Engineer

Physics AI Research · via Insight Global

Aug 2025 — Feb 2026

Sausalito, CA

512

GPUs maintained

DGX nodes

175

QSFP modules terminated

13+

Unplanned outage hrs avoided

▸Primary on-site infra engineer for Meta's Physics AI research team — kept researchers running simulations without interruption.
▸Architected dual 10G→100G WAN and LAN upgrades: 14 switches, 175 QSFP modules, multi-team downtime orchestration.
▸Administered 64-node NVIDIA DGX fleet (512 GPUs): drive swaps, kernel rollbacks, crash diagnostics, RMA pipelines.
▸Delivered full data center PDU upgrade with zero risk to active research workloads.
▸Built cross-site network for motion capture + event production teams supporting hundreds of real-time users.
▸Authored Chef remediation scripts that eliminated recurring driver and OS-level failures across the compute fleet.
▸Led Windows 11 compliance migration for 117 lab machines — on time, zero workflow disruption.
▸Hardened air-gapped lab perimeters: compliance check-in, patch enforcement across cameras, 3D printers, lab gear.
▸Closed 464+ tickets spanning switch imaging, hardware swaps, on-call response, and storage provisioning.

NVIDIA DGX100G WAN/LANChefCiscoPalo AltoJuniper

Prosper Marketplace

Desktop Support Engineer

Sep 2023 — Jan 2025

San Francisco, CA

40%+

Tickets handled

−15%

Resolution time cut

+20%

Team efficiency gain

20-30

Offboards / mo

▸Automated IAM workflows across AD and Azure AD with RBAC + scripted provisioning — +20% team efficiency.
▸Built AI-powered and logic-driven Slack bots from scratch for ticket triage and self-service — −15% resolution time.
▸Owned 40%+ of all company tickets; top-3 in productivity; 100% SLA on PCI-compliant offboards.
▸Led PCI compliance audits with security; owned full Google Workspace + M365 user lifecycle.
▸Full Mac/Windows support, imaging, encryption, and A/V production for board meetings and all-hands.

Active DirectoryAzure ADSlack botsPCI-DSSOkta

Enterprise Support Technician

May 2022 — May 2023

Menlo Park, CA

2,504

Cases closed

98%

CSAT

Top 5%

Global rank

6,000+

Technicians trained

▸Closed 2,504 cases at 98% CSAT — ranked top 5% of 6,000+ Meta technicians worldwide.
▸Designed enterprise training on Zero Trust, DevOps automation, and the software lifecycle — −18% ramp time.
▸Frontline incident commander during outages — cut unplanned downtime by 5-8 hours / month.
▸Ran enterprise mobile ops across AT&T, Verizon, T-Mobile, Sprint — +12% efficiency, −10% carrier costs.
▸Partnered with security to deploy patches, MDM tooling, and automation across the fleet.

Zero TrustDevOpsMDMiOS/Android fleet

Stanford HealthCare

Senior Field Service Technician

Sep 2018 — May 2022

Palo Alto, CA

1,000+

Users supported

500+ users

Relocation cutover

+20%

Helpdesk efficiency

24/7

Critical uptime

▸Built and maintained advanced lab environments with Abbott, Thermo Scientific, Illumina, and Bio-Rad for gene slicing, DNA analysis, and vaccine testing.
▸Directed end-to-end relocation for 500+ users — 100% on-time cutover, zero first-day disruption.
▸Engineered ServiceNow automation for device deployment, signage, and ticketing — +20% efficiency.
▸Enforced HIPAA full-disk encryption + Imprivata EAM across ORs, ERs, and patient rooms — 24/7 uptime.
▸Scaled Mac/Windows, VoIP, wireless printers, and barcode scanners across 1,000+ users.

HIPAAServiceNowImprivata EAMLab infra

Box IT

IT Engineer

Apr 2017 — Jun 2018

San Francisco, CA

3,500+

Tickets completed

300+

Companies supported

6,000+

Users supported

87%

Same-day resolution

▸Provided multi-tenant IT support for 6,000+ users across 300+ companies — remote and onsite.
▸Deployed servers, VMs, and workstations; installed and configured operating systems at scale.
▸Completed 3,500+ tickets with 87% same-day resolution and 94% CSAT.
▸Configured Cisco and Ubiquiti networking gear: access points, routers, switches, and gateways.

MSPCiscoUbiquitiVMwareMulti-tenant

SunRun

IT Specialist

Mar 2017 — Oct 2017

San Francisco, CA

▸Desktop and phone support for 5,000+ employees — remote and onsite.
▸Imaged machines and enrolled users in Active Directory and Okta SSO.
▸Managed site inventory: asset tagging, AD binding, and documentation.
▸Resolved Tier I–III escalations and supported network admins on time-sensitive projects.
▸Enhanced organization-wide security measures — +25% customer satisfaction improvement.

Active DirectoryOkta SSOImagingAsset management

Cernx

Network & Systems Administrator

Sep 2015 — Apr 2017

Emeryville, CA

350

Users managed

▸Administered Active Directory, workstations, and network security for 350 users across multiple office and warehouse locations.
▸Created network topologies and implemented servers and networking equipment — provided VPN access to remote workstations.
▸Engineered new servers to optimize IT infrastructure, minimize hardware load, and reduce software issues.
▸Configured, installed, and maintained routers, switches, and firewalls.
▸Advised the board of directors on IT needs, planning future expansion and establishing trust between management and IT.

Active DirectoryNetworkingServer adminVPN

Saweedo

Lead Web Developer

Apr 2012 — Jun 2013

El Cerrito, CA

▸Built and maintained web pages using HTML/CSS with advanced scripting practices.
▸Engineered layouts and UIs; integrated data from back-end services and databases.
▸Delivered installation, configuration, and support for web-based applications.

HTML/CSSJavaScriptWeb appsBackend integration

// selected work

Projects that shipped without drama.

A traditional resume can't show the scale. These are the things I'm proudest of.

ICONIQ Capital

AI Cybersecurity — Phishing Detection

Integrating AI/ML models into cybersecurity operations for real-time phishing detection and prevention — analyzing email patterns and flagging threats before they reach users.

AI/MLCybersecurityEmail security

ICONIQ Capital

Automated DL & Access Management

Custom-built automation for distribution list and access management with Slack approval integration — replacing manual IAM workflows with self-service tooling.

AutomationSlack APIIAM

ICONIQ Capital

AI Knowledge Base for Jira

AI-powered documentation and knowledge base integration for Jira — enabling intelligent search and auto-suggested resolutions from historical ticket data.

AIJiraKnowledge base

Meta · Physics AI

10G → 100G WAN/LAN Upgrade

Dual-path backbone uplift across multiple sites. 14 switches, 175 QSFPs, choreographed downtime — 13+ outage hours eliminated.

100GQSFP28CiscoJuniper

Meta · Physics AI

64-Node DGX Cluster Operations

Hands-on for 512 GPUs powering AI simulation. Diagnostics, RMA, kernel ops, and patch discipline that kept researchers shipping.

NVIDIA DGXLinuxRMAKernel

Meta · Physics AI

Chef Auto-Remediation

Fleet-wide recipes that detect and fix driver conflicts, package failures, and OS instability before they wake a human.

ChefBashReliability

Prosper Marketplace

AI-Powered Slack Bots

Triage + self-service bots that cut resolution time 15% and absorbed the boring half of L1 support.

AISlack APIAutomation

Meta · Physics AI

Air-Gapped Lab Hardening

Compliance check-in and patching across cameras, 3D printers, and lab gear inside isolated research networks.

SecurityComplianceLab Infra

Stanford HealthCare

500-User Relocation Cutover

Full network reconfig, hardware deploy, printer + drive maps — 100% on time with zero first-day tickets.

Project leadNetworkEndpoints

Personal

Self-Hosted AI Portfolio

This website — a full-stack SSR app with a self-hosted LLM chatbot running on bare metal. Docker, Nginx, custom Node.js server, streaming AI responses.

ReactNode.jsLLMDocker

// stack

The toolchain.

From rack-and-stack to RBAC. I pick the tool that keeps things boring and uptime high.

Cisco◆Meraki◆Palo Alto◆Juniper◆Ubiquiti◆DHCP◆DNS◆VLANs◆VPN◆10G/100G WAN/LAN◆Fiber termination◆NVIDIA DGX◆PDU management◆NAS / SAN◆Drive replacement◆RMA pipelines◆Rack & stack◆Fiber cabling◆AWS◆Azure◆GCP◆VMware◆Docker◆VirtualBox◆Citrix◆Linux (CentOS, Ubuntu, Fedora, Kali)◆Windows◆macOS◆iOS / iPadOS◆Android◆Chef◆Bash◆Python◆PowerShell◆AI-powered Slack bots◆ServiceNow workflows◆Jira integrations◆Okta◆SSO◆CrowdStrike◆Carbon Black◆FileVault◆BitLocker◆HIPAA◆PCI-DSS◆Zero Trust◆Imprivata EAM◆FIDO2◆Duo◆AI phishing detection◆ML-based threat analysis◆AI knowledge base integration◆LLM deployment◆Self-hosted inference◆Active Directory◆Azure AD◆Google Workspace◆Microsoft 365◆ServiceNow◆Jira◆IBM BigFix◆MySQL◆HTML/CSS◆JavaScript◆React◆Node.js◆Docker◆Nginx◆SSR◆Cisco◆Meraki◆Palo Alto◆Juniper◆Ubiquiti◆DHCP◆DNS◆VLANs◆VPN◆10G/100G WAN/LAN◆Fiber termination◆NVIDIA DGX◆PDU management◆NAS / SAN◆Drive replacement◆RMA pipelines◆Rack & stack◆Fiber cabling◆AWS◆Azure◆GCP◆VMware◆Docker◆VirtualBox◆Citrix◆Linux (CentOS, Ubuntu, Fedora, Kali)◆Windows◆macOS◆iOS / iPadOS◆Android◆Chef◆Bash◆Python◆PowerShell◆AI-powered Slack bots◆ServiceNow workflows◆Jira integrations◆Okta◆SSO◆CrowdStrike◆Carbon Black◆FileVault◆BitLocker◆HIPAA◆PCI-DSS◆Zero Trust◆Imprivata EAM◆FIDO2◆Duo◆AI phishing detection◆ML-based threat analysis◆AI knowledge base integration◆LLM deployment◆Self-hosted inference◆Active Directory◆Azure AD◆Google Workspace◆Microsoft 365◆ServiceNow◆Jira◆IBM BigFix◆MySQL◆HTML/CSS◆JavaScript◆React◆Node.js◆Docker◆Nginx◆SSR◆

Networking

CiscoMerakiPalo AltoJuniperUbiquitiDHCPDNSVLANsVPN10G/100G WAN/LANFiber termination

Infra & Data Center

NVIDIA DGXPDU managementNAS / SANDrive replacementRMA pipelinesRack & stackFiber cabling

Cloud & Virtualization

AWSAzureGCPVMwareDockerVirtualBoxCitrix

Operating Systems

Linux (CentOS, Ubuntu, Fedora, Kali)WindowsmacOSiOS / iPadOSAndroid

Automation & Scripting

ChefBashPythonPowerShellAI-powered Slack botsServiceNow workflowsJira integrations

Security & Compliance

OktaSSOCrowdStrikeCarbon BlackFileVaultBitLockerHIPAAPCI-DSSZero TrustImprivata EAMFIDO2Duo

AI & Cybersecurity

AI phishing detectionML-based threat analysisAI knowledge base integrationLLM deploymentSelf-hosted inference

Tools & Platforms

Active DirectoryAzure ADGoogle WorkspaceMicrosoft 365ServiceNowJiraIBM BigFixMySQL

Web Development

HTML/CSSJavaScriptReactNode.jsDockerNginxSSR

certifications & education

Google IT Support Professional

Google / Coursera · Apr 2020

Systems Networking & Management

West Contra Costa USD · May 2015

Digital Media & Web Design

West Contra Costa USD · May 2015

High School Diploma

El Cerrito High School · Jul 2013

// ask the AI

Talk to a self-hosted AI about me.

This isn't a third-party widget. The model runs on my own hardware, with my resume injected as a system prompt. Ask it anything — experience, projects, skills, or whether I'd be a fit for your team.

jafri.ai — v1/chat/completions

Tell me about the 100G upgrade.

jafri.ai

model online · streaming · zero tokens logged

Resume-Grounded

My full work history, projects, and skills are injected as the system prompt. Every answer is factually tied to real experience.

Self-Hosted

Runs on my own GPU node behind a local vLLM/llama.cpp server. No OpenAI keys, no data leaving my network.

End-to-End Encrypted

TLS 1.3 from Cloudflare edge to origin. API keys are server-side secrets — never exposed to the browser.

Streaming Tokens

Real-time SSE streaming. You see the model think word-by-word, just like ChatGPT — but on my metal.

Try: "What's his experience with DGX?" · "Does he know OSPF?" · "Why hire him?"

// running this site

The stack behind this page.

This site, the AI chatbot, and the network it rides on — all self-hosted on my own metal. Same operational discipline I bring to production infra at work.

metal

Compute

Bare-metal that hosts both this site and the LLM serving the chatbot.

▸GPU node — NVIDIA RTX-class GPU for inference (CUDA + cuDNN)
▸CPU host — Multi-core x86_64, ECC RAM, NVMe storage
▸Platform — Unraid — Docker containers for web, model, observability
▸OS — Unraid OS (Slackware-based), Docker-managed services

inference

AI Serving

Self-hosted OpenAI-compatible model server — the brain behind the chat widget.

▸Runtime — vLLM / llama.cpp — OpenAI-compatible /v1 endpoint
▸Model — Open-weight LLM (Llama/Qwen class), quantized for the GPU
▸Context injection — Resume + projects passed as system prompt server-side
▸Streaming — Server-sent events, token-by-token render

delivery

Web & Edge

TanStack Start app, running as a Docker container behind Nginx on Unraid.

▸Framework — TanStack Start (React 19, Vite 7, server functions)
▸Styling — Tailwind v4 + custom OKLCH design tokens
▸Reverse proxy — Nginx — TLS termination, gzip, proxy to Node.js container
▸Containerized — Docker — multi-stage build, Node 22 Alpine, port 3000

wire

Network

Same discipline I run at work — segmented, monitored, no flat networks.

▸Reverse proxy — Nginx — SSL termination, routing to Docker containers
▸Firewall — OPNsense / UniFi — stateful, IDS/IPS, geo-blocks
▸Segmentation — VLANs for lab / IoT / mgmt; AI server in restricted VLAN
▸DNS — Internal Pi-hole + Unbound; public via Cloudflare

hardening

Security

Defense in depth. API keys live in secrets, never in the browser.

▸Secrets — Endpoint URL + API key stored as server-side env vars in Docker
▸Transport — TLS 1.3 end-to-end — Nginx → Node.js container
▸App — Strict input caps, message-history trim, no PII logged
▸Access — SSH key-only, fail2ban, MFA on every admin surface

running it

Ops & Observability

If I can't see it, I can't fix it before users notice.

▸Metrics — Prometheus + Grafana dashboards (GPU temp, tokens/sec, latency)
▸Logs — Loki — structured logs from web + model server
▸Uptime — Uptime Kuma — public status, multi-region probes
▸Backups — Restic → off-site encrypted snapshots, weekly restore tests

// initiate connection

Got an infrastructure problem that can't go down?

Open to staff / senior infra and network roles in the Bay Area, remote-friendly. Fastest way to reach me is email.

[email protected]

Phone

(415) 318-6665

linkedin.com/in/mujtabajafri

Based in

San Francisco, CA