Identify website visitors with Clearbit Reveal and create HubSpot companies using code

High complexity · Cost: $0 · Recommended

Prerequisites
  • Python 3.9+ or Node.js 18+
  • Clearbit Reveal API key (legacy) or HubSpot account with Breeze Intelligence add-on
  • HubSpot private app token with crm.objects.companies.read and crm.objects.companies.write scopes
  • Access to server logs or an analytics pipeline that captures visitor IP addresses
Clearbit is now Breeze Intelligence

Clearbit was acquired by HubSpot and rebranded as Breeze Intelligence. The standalone Reveal API is being sunset. This guide covers both the legacy API approach (for existing Clearbit customers) and the HubSpot-native Breeze approach. New users should start with Breeze.

Why code?

Code gives you the most flexibility for parsing log formats, filtering IPs, and batch processing. You can handle server logs in any format (Apache, nginx, JSON), implement custom ICP scoring logic, and process thousands of IPs in a single run. The script can also run for free on GitHub Actions.

The trade-off is setup complexity. You need server access to capture IPs, a log parsing pipeline, and comfort with Python or Node.js. But once set up, the script runs reliably on a schedule with no per-execution cost.

How it works

  • Log parser extracts unique IPs from server logs, filtering for high-intent pages (pricing, demo, contact)
  • Clearbit Reveal API resolves each IP to a company (name, domain, industry, employee count)
  • ICP filter checks company size and sector against your criteria, discarding non-matches
  • HubSpot API deduplicates by domain and creates new company records with enrichment data

Step 1: Set up the project

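# Test your Clearbit API key.
# 203.0.113.42 is a reserved documentation address and is unlikely to resolve
# to a company; substitute a real public IP to see a company payload.
curl -s "https://reveal.clearbit.com/v1/companies/find?ip=203.0.113.42" \
  -H "Authorization: Bearer $CLEARBIT_API_KEY" | head -c 300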
 
# Test your HubSpot token
curl -s "https://api.hubapi.com/crm/v3/objects/companies?limit=1" \
  -H "Authorization: Bearer $HUBSPOT_ACCESS_TOKEN" | head -c 200

Step 2: Extract unique IPs from your logs

Before calling Clearbit, extract and deduplicate visitor IPs. This example reads from a common log format, but adapt it to your analytics pipeline.

import re
 
def extract_ips_from_log(log_path, min_visits=2):
    """Extract IPs that visited key pages multiple times (shows intent)."""
    ip_pages = {}
    target_pages = ["/pricing", "/demo", "/contact", "/enterprise"]
 
    with open(log_path) as f:
        for line in f:
            match = re.match(r'^(\d+\.\d+\.\d+\.\d+).*"GET (\S+)', line)
            if not match:
                continue
            ip, page = match.groups()
            if any(page.startswith(p) for p in target_pages):
                ip_pages.setdefault(ip, []).append(page)
 
    # Only return IPs with multiple visits to high-intent pages
    return {ip: pages for ip, pages in ip_pages.items() if len(pages) >= min_visits}
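
If your logs are JSON lines rather than the common log format, the same filter can be sketched as below. The field names `remote_addr` and `request_uri` are assumptions, since nginx and most log shippers let you name these fields yourself; substitute the names from your own schema.

```python
import json

TARGET_PAGES = ("/pricing", "/demo", "/contact", "/enterprise")


def extract_ips_from_json_lines(lines, min_visits=2,
                                ip_field="remote_addr", path_field="request_uri"):
    """Group high-intent page hits by IP from an iterable of JSON log lines.

    ip_field and path_field are hypothetical names; match them to your log schema.
    """
    ip_pages = {}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed or partial lines
        if not isinstance(entry, dict):
            continue
        ip = entry.get(ip_field)
        path = entry.get(path_field)
        if not ip or not isinstance(path, str):
            continue
        # str.startswith accepts a tuple of prefixes
        if path.startswith(TARGET_PAGES):
            ip_pages.setdefault(ip, []).append(path)
    # Same rule as the regex version: keep IPs with repeated high-intent hits
    return {ip: pages for ip, pages in ip_pages.items() if len(pages) >= min_visits}
```

Because the function takes any iterable of strings, it can be fed an open file handle directly, for example `extract_ips_from_json_lines(open(log_path))`.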

Step 3: Resolve IPs to companies via Clearbit Reveal

import requests
import os
import time
 
CLEARBIT_API_KEY = os.environ["CLEARBIT_API_KEY"]
HUBSPOT_ACCESS_TOKEN = os.environ["HUBSPOT_ACCESS_TOKEN"]
HS_HEADERS = {"Authorization": f"Bearer {HUBSPOT_ACCESS_TOKEN}", "Content-Type": "application/json"}
 
def reveal_company(ip):
    """Resolve an IP to a company via Clearbit Reveal."""
    resp = requests.get(
        "https://reveal.clearbit.com/v1/companies/find",
        params={"ip": ip},
        headers={"Authorization": f"Bearer {CLEARBIT_API_KEY}"},
        timeout=10,
    )
    if resp.status_code == 404:
        return None  # no company matched this IP
    resp.raise_for_status()
    data = resp.json()

    # Reveal reports the match type ("company", "education", "government", "isp")
    # at the top level of the response; company["type"] holds values such as
    # "private" or "public", so it cannot be used for this check.
    company = data.get("company")
    if not company or data.get("type") != "company":
        return None

    # Nested objects can be null, so fall back to empty dicts.
    category = company.get("category") or {}
    metrics = company.get("metrics") or {}
    geo = company.get("geo") or {}

    return {
        "domain": company.get("domain"),
        "name": company.get("name"),
        "industry": category.get("industry"),
        "employees": metrics.get("employees"),
        "city": geo.get("city"),
        "state": geo.get("state"),
        "country": geo.get("country"),
        "description": company.get("description"),
    }
Expect a low match rate

Only 20-30% of B2B visitor IPs resolve to a company. Consumer ISPs (Comcast, AT&T), VPNs, and mobile carriers almost never resolve. Filter your IP list to corporate-looking traffic before calling the API to save credits.
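
One inexpensive pre-filter, sketched below with Python's standard `ipaddress` module, drops addresses that cannot belong to an outside visitor: private, loopback, link-local, and other non-public ranges. It is only a first pass. Spotting consumer ISPs or VPNs would need an ASN or ISP dataset, which this sketch does not include.

```python
import ipaddress


def is_public_ip(ip_string):
    """Return True only for syntactically valid, publicly routable addresses."""
    try:
        ip = ipaddress.ip_address(ip_string)
    except ValueError:
        return False  # not a valid IPv4/IPv6 address
    return ip.is_global


def prefilter_ips(ip_pages):
    """Keep only entries whose IP is public; ip_pages maps IP -> list of pages."""
    return {ip: pages for ip, pages in ip_pages.items() if is_public_ip(ip)}
```

Running `prefilter_ips(ip_pages)` on the output of Step 2 before the Reveal loop avoids spending lookups on internal health checks or office NAT addresses that leaked into the logs.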

Step 4: Filter for ICP and deduplicate against HubSpot

def matches_icp(company, min_employees=50):
    """Check if a resolved company matches your ICP criteria."""
    if not company.get("domain"):
        return False
    employees = company.get("employees") or 0
    return employees >= min_employees
 
 
def company_exists_in_hubspot(domain):
    """Check if a company with this domain already exists in HubSpot."""
    resp = requests.post(
        "https://api.hubapi.com/crm/v3/objects/companies/search",
        headers=HS_HEADERS,
        json={
            "filterGroups": [{"filters": [{
                "propertyName": "domain",
                "operator": "EQ",
                "value": domain,
            }]}],
        },
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return results[0]["id"] if results else None
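
The `matches_icp` function above checks only headcount. A sector check could be layered on as in this sketch. The industry names in `TARGET_INDUSTRIES` are placeholders, not a confirmed list of Clearbit values, so replace them with the industries that actually appear in your Reveal responses.

```python
# Placeholder industry names; replace with values seen in your own data.
TARGET_INDUSTRIES = {"internet software & services", "financial services"}


def matches_icp_with_sector(company, min_employees=50, industries=TARGET_INDUSTRIES):
    """Return True when a company has a domain, enough employees, and a target industry."""
    if not company.get("domain"):
        return False
    employees = company.get("employees") or 0
    if employees < min_employees:
        return False
    # Compare case-insensitively; a missing industry never matches.
    industry = (company.get("industry") or "").strip().lower()
    return industry in industries
```

Companies with no industry value are rejected here. If you would rather review those manually, return a separate status for them instead of `False`.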

Step 5: Create companies in HubSpot

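def create_hubspot_company(company, pages_visited):
    """Create a new company in HubSpot from the resolved Clearbit data.

    pages_visited is not written to HubSpot here; storing it would need a
    custom company property, which this script does not assume exists.
    """
    properties = {
        "domain": company["domain"],
        "name": company["name"],
        # HubSpot's default industry property is an enumeration. If this
        # request fails with a validation error, map or omit this value.
        "industry": company.get("industry"),
        "numberofemployees": company.get("employees"),
        "city": company.get("city"),
        "state": company.get("state"),
        "country": company.get("country"),
        "description": company.get("description"),
    }
    # Drop empty values so None is never sent as the string "None".
    properties = {k: str(v) for k, v in properties.items() if v not in (None, "")}

    resp = requests.post(
        "https://api.hubapi.com/crm/v3/objects/companies",
        headers=HS_HEADERS,
        json={"properties": properties},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["id"]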
 
 
# --- Main execution ---
ip_pages = extract_ips_from_log("/var/log/nginx/access.log")
print(f"Found {len(ip_pages)} IPs with high-intent visits")
 
created = 0
skipped = 0
unresolved = 0
 
for ip, pages in ip_pages.items():
    company = reveal_company(ip)
    if not company:
        unresolved += 1
        continue
 
    if not matches_icp(company):
        skipped += 1
        continue
 
    existing = company_exists_in_hubspot(company["domain"])
    if existing:
        print(f"  EXISTS: {company['name']} ({company['domain']})")
        skipped += 1
        continue
 
    company_id = create_hubspot_company(company, pages)
    print(f"  CREATED: {company['name']} ({company.get('employees', '?')} employees), visited {', '.join(pages)}")
    created += 1
    time.sleep(0.2)
 
print(f"\nDone. Created: {created}, Skipped: {skipped}, Unresolved: {unresolved}")
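
The fixed `time.sleep(0.2)` above only pauses after a company is created. Both APIs can answer with HTTP 429 when a rate limit is exceeded, so a retry wrapper along these lines may help on larger runs. This is a sketch: `call_with_retry` is a hypothetical helper, and it assumes the response object exposes `status_code` and `headers` the way a `requests.Response` does.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is any zero-argument callable returning an object with
    .status_code and .headers (for example a lambda wrapping requests.get).
    """
    for attempt in range(max_attempts):
        resp = send()
        if resp.status_code != 429:
            return resp
        if attempt == max_attempts - 1:
            break  # out of attempts; hand the 429 back to the caller
        # Prefer the server's Retry-After hint; otherwise back off exponentially.
        retry_after = resp.headers.get("Retry-After")
        try:
            delay = float(retry_after)
        except (TypeError, ValueError):
            delay = base_delay * (2 ** attempt)
        sleep(delay)
    return resp
```

Inside `reveal_company`, the existing `requests.get(...)` call could be wrapped in a lambda and passed as `send`. The caller still decides what to do if the final response is a 429, for instance by calling `raise_for_status()` as the script already does.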

Step 6: Schedule with cron or GitHub Actions

# .github/workflows/identify-visitors.yml
name: Identify Website Visitors
on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 8 AM UTC
  workflow_dispatch: {}
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install requests
      - run: python identify_visitors.py
        env:
          CLEARBIT_API_KEY: ${{ secrets.CLEARBIT_API_KEY }}
          HUBSPOT_ACCESS_TOKEN: ${{ secrets.HUBSPOT_ACCESS_TOKEN }}
Log access in CI

If your logs aren't accessible from GitHub Actions, pipe IPs to a file (S3, GCS) during the day and download it in the workflow. Or use a webhook-based approach where your server pushes IPs to an API endpoint in real time.

Breeze Intelligence alternative

If you're using HubSpot Breeze Intelligence, the IP-to-company resolution happens automatically within HubSpot — no code needed for that step.

What code adds on top of Breeze:

  1. Custom ICP filtering — Breeze identifies all visitors, but you may want stricter filters
  2. Custom properties — Enrich the auto-created records with data from your logs (pages visited, visit count, referrer)
  3. Routing logic — Assign companies to sales reps based on territory, industry, or company size
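
# Example: Enrich Breeze-created companies with visit metadata
# Poll for companies created in the last 24 hours and update them.
# Reuses `requests` and HS_HEADERS from Step 3.
import time

# HubSpot date filters take Unix timestamps in milliseconds.
twenty_four_hours_ago_ms = int((time.time() - 24 * 60 * 60) * 1000)

resp = requests.post(
    "https://api.hubapi.com/crm/v3/objects/companies/search",
    headers=HS_HEADERS,
    json={
        "filterGroups": [{"filters": [{
            "propertyName": "createdate",
            "operator": "GTE",
            "value": str(twenty_four_hours_ago_ms),
        }]}],
        "properties": ["domain", "name"],
        "limit": 100,
    },
    timeout=10,
)
resp.raise_for_status()
recent_companies = resp.json().get("results", [])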
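
Once the search returns matching companies, the visit data from your logs can be written back. The sketch below only builds the request body for HubSpot's batch update endpoint (`POST /crm/v3/objects/companies/batch/update`). The property names `high_intent_pages` and `high_intent_visit_count` are hypothetical custom properties that you would need to create in HubSpot first, and `visits_by_domain` is an assumed mapping of domain to pages built from your own logs.

```python
def build_visit_updates(search_results, visits_by_domain):
    """Build a batch-update body for companies that have visit data.

    search_results: the "results" list from a HubSpot company search.
    visits_by_domain: assumed mapping of domain -> list of visited pages.
    """
    inputs = []
    for record in search_results:
        domain = (record.get("properties") or {}).get("domain")
        pages = visits_by_domain.get(domain)
        if not pages:
            continue  # no log data for this company
        inputs.append({
            "id": record["id"],
            "properties": {
                # Hypothetical custom properties; create them in HubSpot first.
                "high_intent_pages": ", ".join(sorted(set(pages))),
                "high_intent_visit_count": str(len(pages)),
            },
        })
    return {"inputs": inputs}
```

Keeping the payload construction separate from the HTTP call makes it easy to inspect or unit-test the body before sending it with `requests.post(..., json=body)`.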

Troubleshooting

Common questions

What percentage of IPs will resolve to a company?

Expect 20-30% for B2B traffic. Consumer ISPs, VPNs, mobile carriers, and work-from-home traffic almost never resolve. If you're getting under 10%, check that you're reading the correct header for the client IP (not your load balancer's IP).
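
Behind a load balancer or CDN, the socket address is the proxy's, and the visitor's address usually arrives in the `X-Forwarded-For` header. The sketch below walks that header from right to left and returns the first address that is not one of your own proxies. `TRUSTED_PROXIES` is an assumed configuration value, and the network listed is a placeholder for your infrastructure's ranges.

```python
import ipaddress

# Placeholder range; replace with your load balancer / CDN networks.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def client_ip_from_xff(xff_header, remote_addr, trusted=TRUSTED_PROXIES):
    """Return the right-most address in the chain that is not a trusted proxy."""
    chain = [part.strip() for part in (xff_header or "").split(",") if part.strip()]
    chain.append(remote_addr)  # the direct peer is the last hop
    for candidate in reversed(chain):
        try:
            ip = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # ignore malformed entries
        if not any(ip in net for net in trusted):
            return str(ip)
    return remote_addr  # every hop was a trusted proxy
```

Reading from the right is deliberate: the left-most entries are supplied by the client and can be forged, while the right-most entries are appended by proxies you control.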

How do I handle log rotation?

Implement a checkpoint — save the last-processed log line offset or timestamp to a file, then resume from that point on the next run. Or rotate logs daily and only process the current day's file with a cron job that runs at end of day.
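
A byte-offset checkpoint could look like the sketch below. The checkpoint file location is a hypothetical choice, and the reset-on-shrink rule assumes that rotation replaces the log with a new, shorter file.

```python
import os


def read_new_lines(log_path, checkpoint_path):
    """Return lines appended since the last run and advance the checkpoint."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = int(f.read().strip() or 0)

    # If the log is now smaller than the saved offset, assume it was rotated.
    if os.path.getsize(log_path) < offset:
        offset = 0

    # Read in binary mode so the offset is an exact byte position.
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()

    with open(checkpoint_path, "w") as f:
        f.write(str(new_offset))

    return data.decode("utf-8", errors="replace").splitlines()
```

The returned lines can be passed to the same regex matching used in Step 2. This sketch does not guard against a line that is only half written at the moment of reading, so running it shortly after a quiet period or after rotation is safer.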

How much does Clearbit Reveal cost?

The legacy Reveal API is volume-based, typically starting around $99/mo for 2,500 lookups. Breeze Intelligence (the HubSpot-native replacement) is included with Professional+ plans and priced per credit. Check your HubSpot contract for specifics.

Cost

  • Hosting: Free on GitHub Actions or ~$5/mo on Railway
  • Clearbit Reveal (legacy): Volume-based pricing, typically starting ~$99/mo for 2,500 lookups
  • Breeze Intelligence: Included with HubSpot Professional+, priced per credit
  • HubSpot API: Free with any plan that supports private apps
