Quickstart

Build a worker, attach it from Haybarn, and call a function. About five minutes.

Prerequisites

  • JDK 21+ (JDK 22+ to enable the shared-memory transport).
  • Haybarn (or any DuckDB engine with the vgi extension) — see the callout below.
  • The examples/ project from this repo.

Don't have Haybarn yet?

Haybarn is Query Farm's DuckDB-derived engine; it ships the vgi extension in its community channel. Run its shell with whichever tool you already have — no separate install step:

bash
npx haybarn@rc      # via Node (the @rc tag is the current release)
uvx haybarn-cli     # via uv (install: curl -LsSf https://astral.sh/uv/install.sh | sh)

Inside the shell, enable the extension once per session:

sql
INSTALL vgi FROM community;
LOAD vgi;

The vgi extension currently ships for Haybarn; a DuckDB release is on the way, and a worker you write now will work with it unchanged.

1. Add the dependency

A worker needs exactly one compile-time dependency, plus any SLF4J binding at runtime for logging.

New to Gradle?

Gradle is the build tool most JVM projects use. You don't install it — the examples/ project ships a wrapper script (./gradlew) that downloads the right version on first run. The build.gradle.kts file below declares your project: where to fetch libraries (mavenCentral()), which ones (dependencies { … }), and how to package it (application). The coordinate farm.query:vgi:0.1.0 is group:artifact:version — Gradle resolves it from Maven Central. Running ./gradlew installDist then produces a self-contained, runnable worker.

kotlin
plugins { application }

repositories { mavenCentral() }

dependencies {
    implementation("farm.query:vgi:0.1.0")
    runtimeOnly("org.slf4j:slf4j-simple:2.0.16")   // any SLF4J binding
}

application {
    mainClass.set("farm.query.vgi.examples.AllInOneWorker")
    applicationDefaultJvmArgs = listOf(
        "--add-opens=java.base/java.nio=ALL-UNNAMED",
        "--enable-native-access=ALL-UNNAMED",
    )
}

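Those two JVM flags are required: --add-opens gives Arrow the java.nio access it needs, and --enable-native-access permits the native calls made by the shared-memory transport. The -parameters compiler flag is also required, although it does not appear in the build file above; see JVM flags.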
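
If you want to confirm that the flags actually reached your compiler and JVM, a small self-check using only the standard library can help. This is a sketch, not part of VGI: the class FlagCheck and its sample method are hypothetical names made up for illustration. It does not check --enable-native-access.

```java
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of VGI: reports whether -parameters and
// --add-opens=java.base/java.nio=ALL-UNNAMED are in effect for this program.
final class FlagCheck {

    // A method whose parameter name we look up reflectively.
    static void sample(String value) { }

    /** True when this class was compiled with the -parameters flag. */
    static boolean parameterNamesPresent() {
        try {
            Method m = FlagCheck.class.getDeclaredMethod("sample", String.class);
            return m.getParameters()[0].isNamePresent();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True when java.base opens java.nio to this class's module. */
    static boolean javaNioOpened() {
        Module javaBase = ByteBuffer.class.getModule();
        return javaBase.isOpen("java.nio", FlagCheck.class.getModule());
    }

    public static void main(String[] args) {
        System.out.println("-parameters in effect: " + parameterNamesPresent());
        System.out.println("java.nio opened:       " + javaNioOpened());
    }
}
```

Run it through the same Gradle or Maven setup as your worker, so that it sees the same compiler and JVM arguments.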

Prefer Maven?

The dependency is the same coordinate; only the build file differs. In pom.xml:

xml
<dependency>
  <groupId>farm.query</groupId>
  <artifactId>vgi</artifactId>
  <version>0.1.0</version>
</dependency>

Pass the JVM flags via the exec-maven-plugin (or your run script) and the -parameters flag through maven-compiler-plugin's <parameters>true</parameters>. The Gradle examples/ project is the supported path; Maven works identically at the library level.

2. Write a worker

A worker is a class whose main method registers functions and calls runFromArgs:

java
// VGI-Java example: a scalar function.
//
// A scalar function maps each input row to one output row. You extend
// `ScalarFn` and write a single `compute()` method; the framework reads its
// parameter annotations to derive the SQL signature, the output type, and the
// per-batch dispatch. There is no schema boilerplate to write by hand.
//
// Run it on its own:
//   ./gradlew runScalar --args="--unix /tmp/scalar.sock --idle-timeout 60"
// then from Haybarn:
//   ATTACH 'demo' AS demo (TYPE vgi, LOCATION 'launch:/abs/path/bin/runScalar');
//   SELECT demo.upper_case('hello');   -- HELLO
package farm.query.vgi.examples;

import farm.query.vgi.Worker;
import farm.query.vgi.scalar.ScalarFn;
import farm.query.vgi.scalar.Vector;
import org.apache.arrow.vector.VarCharVector;

import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** {@code upper_case(value VARCHAR) -> VARCHAR}: locale-independent Unicode uppercase. */
public final class ScalarExample extends ScalarFn {

    @Override public String name() { return "upper_case"; }
    @Override public String description() { return "Uppercase a string"; }

    // One `@Vector` input column + one trailing (unannotated) output vector.
    // The framework allocates `result`, sized to the batch row count, and
    // writes whatever you put into it back across the wire.
    //
    // Parameter rules in one breath:
    //   @Vector  -> a per-row input column (the Arrow vector type is the SQL type)
    //   @Const   -> a bind-time constant arg (long/double/String/boolean/byte[])
    //   @Setting -> a session setting (SET demo.foo = ...)
    //   last unannotated vector = the output (framework-allocated)
    public void compute(@Vector VarCharVector value, VarCharVector result) {
        int rows = value.getValueCount();
        result.allocateNew();
        for (int i = 0; i < rows; i++) {
            if (value.isNull(i)) { result.setNull(i); continue; }
            String up = new String(value.get(i), StandardCharsets.UTF_8).toUpperCase(Locale.ROOT);
            byte[] bytes = up.getBytes(StandardCharsets.UTF_8);
            result.setSafe(i, bytes, 0, bytes.length);
        }
    }

    public static void main(String[] args) {
        Worker.builder()
                .catalogName("demo")
                .registerScalar(new ScalarExample())
                .runFromArgs(args);   // handles --unix / --http / --idle-timeout / stdio
    }
}
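
The per-row logic inside compute() can be tried without Arrow or a running worker. The sketch below is a plain-Java stand-in, not VGI API: UpperCaseSketch and upperCaseColumn are hypothetical names, and a byte[][] with null entries stands in for a nullable VarCharVector. It also shows why the example passes Locale.ROOT: under a Turkish locale, uppercasing "i" gives a dotted capital I (U+0130) rather than "I".

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for the worker's compute() loop, using arrays
// instead of Arrow vectors. A null element plays the role of SQL NULL.
final class UpperCaseSketch {

    static byte[][] upperCaseColumn(byte[][] values) {
        byte[][] result = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            if (values[i] == null) { continue; }   // NULL in, NULL out
            String up = new String(values[i], StandardCharsets.UTF_8)
                    .toUpperCase(Locale.ROOT);     // locale-independent
            result[i] = up.getBytes(StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[][] in = { "hello".getBytes(StandardCharsets.UTF_8), null };
        byte[][] out = upperCaseColumn(in);
        System.out.println(new String(out[0], StandardCharsets.UTF_8)); // HELLO
        System.out.println(out[1] == null);                             // true

        // Locale-sensitive uppercasing differs from Locale.ROOT for Turkish.
        String turkish = "i".toUpperCase(Locale.forLanguageTag("tr"));
        System.out.println(turkish.equals("\u0130"));                   // true
        System.out.println("i".toUpperCase(Locale.ROOT));               // I
    }
}
```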

3. Build it

bash
cd examples
./gradlew installDist

That produces a launch script at build/install/vgi-java-examples/bin/vgi-java-examples, relative to examples/. (./run.sh runs the same build and prints the SQL for you.)

4. Attach from Haybarn

sql
INSTALL vgi FROM community;
LOAD vgi;

-- Use the ABSOLUTE path to the launch script.
ATTACH 'demo' AS demo (TYPE vgi,
    LOCATION 'launch:/abs/path/build/install/vgi-java-examples/bin/vgi-java-examples');

SELECT demo.upper_case('hello');   -- HELLO
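
Because LOCATION needs an absolute path, it can be handy to generate the ATTACH statement rather than type it. The following is a minimal sketch that mirrors the statement shown above; it is not the repo's run.sh, and AttachSql and forLaunch are hypothetical names. It assumes the catalog name is a simple identifier that needs no quoting.

```java
import java.nio.file.Path;

// Hypothetical helper, not part of VGI: builds the ATTACH statement for a
// launch script, resolving the path against the current working directory.
final class AttachSql {

    static String forLaunch(String catalog, Path launchScript) {
        Path abs = launchScript.toAbsolutePath().normalize();
        String location = "launch:" + abs;
        return "ATTACH '" + quote(catalog) + "' AS " + catalog
                + " (TYPE vgi, LOCATION '" + quote(location) + "');";
    }

    // Escape single quotes for use inside a SQL string literal.
    private static String quote(String s) {
        return s.replace("'", "''");
    }

    public static void main(String[] args) {
        Path script = Path.of("build/install/vgi-java-examples/bin/vgi-java-examples");
        System.out.println(forLaunch("demo", script));
    }
}
```

Run it from the examples/ directory so the relative path resolves to the script that installDist produced.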

Why launch:?

A cold JVM takes seconds to start. The launch: LOCATION scheme starts the worker once behind a flock-coordinated Unix socket and reuses it across every query — and across every engine process on the machine. Without it, each query would pay the full JVM startup cost. You almost always want launch:.

Other LOCATION schemes exist (a bare path forks a subprocess per attach; http://host:port talks to a long-running HTTP worker). See CLI & environment.

5. Try all five kinds

The AllInOneWorker from the examples registers one function of each kind:

sql
-- VGI-Java quickstart — run in a Haybarn shell.
--
-- Prereq: build the worker first (`./gradlew installDist` in ../), then replace
-- the LOCATION path below with the absolute path printed by `../run.sh`.
--
-- The vgi extension must be available:
INSTALL vgi FROM community;
LOAD vgi;

-- 'launch:' starts the JVM worker once and pools it across queries.
ATTACH 'demo' AS demo (TYPE vgi,
    LOCATION 'launch:/ABSOLUTE/PATH/TO/build/install/vgi-java-examples/bin/vgi-java-examples');

-- scalar — one row in, one row out
SELECT demo.upper_case('hello');                              -- HELLO

-- table — a set-returning generator, streamed in batches
SELECT * FROM demo.numbers(5) ORDER BY n;                     -- 0,1,2,3,4
SELECT count(*) FROM (SELECT * FROM demo.numbers(1000000) LIMIT 7);  -- 7 (LIMIT pushdown)

-- table-in-out — a streaming relation transform
SELECT n FROM demo.echo((SELECT * FROM demo.numbers(3))) ORDER BY n;  -- 0,1,2

-- aggregate — parallel partial aggregation
SELECT g, demo.vgi_sum(v)
  FROM (VALUES (1,10),(1,20),(2,5)) t(g,v) GROUP BY g ORDER BY g;     -- 1->30, 2->5

-- buffering — must see all input before producing output
SELECT n FROM demo.collect((SELECT * FROM demo.numbers(4))) ORDER BY n;  -- 0,1,2,3

DETACH demo;
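
"Parallel partial aggregation" means each slice of the input is summed on its own and the partial results are then merged. The sketch below illustrates that idea in plain Java on the same three rows as the vgi_sum query. It is not the VGI aggregate API: PartialSumSketch, Row, partial, and combine are hypothetical names, and the split into two chunks is arbitrary.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of partial aggregation, not VGI API.
final class PartialSumSketch {

    record Row(int g, long v) { }

    /** Sum one chunk of rows per group; this is a partial state. */
    static Map<Integer, Long> partial(List<Row> chunk) {
        Map<Integer, Long> state = new TreeMap<>();
        for (Row r : chunk) {
            state.merge(r.g(), r.v(), Long::sum);
        }
        return state;
    }

    /** Merge two partial states into one without mutating either. */
    static Map<Integer, Long> combine(Map<Integer, Long> a, Map<Integer, Long> b) {
        Map<Integer, Long> out = new TreeMap<>(a);
        b.forEach((g, v) -> out.merge(g, v, Long::sum));
        return out;
    }

    public static void main(String[] args) {
        // The rows (1,10),(1,20),(2,5), split across two chunks.
        List<Row> chunk1 = List.of(new Row(1, 10), new Row(2, 5));
        List<Row> chunk2 = List.of(new Row(1, 20));
        System.out.println(combine(partial(chunk1), partial(chunk2))); // {1=30, 2=5}
    }
}
```

Because addition is associative and commutative, the merged result does not depend on how the rows were split or in which order the partial states are combined.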

What just happened

Three things, and together they're the whole protocol in miniature:

  • Worker.builder()...runFromArgs(args) parsed the --unix and --idle-timeout arguments that launch: adds, then served requests over an AF_UNIX socket.
  • The engine called the worker's init/bind RPCs to learn each function's schema, then streamed Arrow batches for execution.
  • Your compute() saw whole Arrow vectors and wrote whole Arrow vectors back — no row-by-row marshalling anywhere in the path.

Next: how a worker is wired together →