Scalar functions
A scalar function maps each input row to one output row: upper_case('hi') -> 'HI'. It's the simplest kind of function and the best place to learn the annotation model the whole library is built on.
The model
Extend ScalarFn, give it a name(), and write a single compute() method. The framework reads compute()'s parameter annotations to derive the SQL signature, the output type, and the per-batch dispatch. You write a loop; you never write schema-marshalling code.
// VGI-Java example: a scalar function.
//
// A scalar function maps each input row to one output row. You extend
// `ScalarFn` and write a single `compute()` method; the framework reads its
// parameter annotations to derive the SQL signature, the output type, and the
// per-batch dispatch. There is no schema boilerplate to write by hand.
//
// Run it on its own:
// ./gradlew runScalar --args="--unix /tmp/scalar.sock --idle-timeout 60"
// then from Haybarn:
// ATTACH 'demo' AS demo (TYPE vgi, LOCATION 'launch:/abs/path/bin/runScalar');
// SELECT demo.upper_case('hello'); -- HELLO
package farm.query.vgi.examples;

import farm.query.vgi.Worker;
import farm.query.vgi.scalar.ScalarFn;
import farm.query.vgi.scalar.Vector;
import org.apache.arrow.vector.VarCharVector;

import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** {@code upper_case(value VARCHAR) -> VARCHAR}: ASCII/Unicode uppercase. */
public final class ScalarExample extends ScalarFn {
    @Override public String name() { return "upper_case"; }
    @Override public String description() { return "Uppercase a string"; }

    // One `@Vector` input column + one trailing (unannotated) output vector.
    // The framework allocates `result`, sized to the batch row count, and
    // writes whatever you put into it back across the wire.
    //
    // Parameter rules in one breath:
    //   @Vector  -> a per-row input column (the Arrow vector type is the SQL type)
    //   @Const   -> a bind-time constant arg (long/double/String/boolean/byte[])
    //   @Setting -> a session setting (SET demo.foo = ...)
    //   last unannotated vector = the output (framework-allocated)
    public void compute(@Vector VarCharVector value, VarCharVector result) {
        int rows = value.getValueCount();
        result.allocateNew();
        for (int i = 0; i < rows; i++) {
            if (value.isNull(i)) { result.setNull(i); continue; }
            String up = new String(value.get(i), StandardCharsets.UTF_8).toUpperCase(Locale.ROOT);
            byte[] bytes = up.getBytes(StandardCharsets.UTF_8);
            result.setSafe(i, bytes, 0, bytes.length);
        }
    }

    public static void main(String[] args) {
        Worker.builder()
            .catalogName("demo")
            .registerScalar(new ScalarExample())
            .runFromArgs(args); // handles --unix / --http / --idle-timeout / stdio
    }
}
Attach and call it:
SELECT demo.upper_case('hello'); -- HELLO
SELECT demo.upper_case(x) FROM (VALUES ('a'),(NULL)) t(x); -- A, NULL
Parameter rules
compute() parameters are read positionally and by annotation:
- @Vector SomeVector v: a per-row input column. The Arrow vector class fixes the SQL type: BigIntVector → BIGINT, VarCharVector → VARCHAR, Float8Vector → DOUBLE, and so on.
- @Vector(any = true) FieldVector v: an input column of any type (resolve the real type in outputType).
- @Vector(varargs = true) List<X> vs: varargs of typed columns.
- @Const <java type> c: a bind-time constant argument. Type mapping: long/int → INT64, double → FLOAT64, String → UTF8, boolean → BOOL, byte[] → BINARY.
- @Setting <java type> s: a session setting (SET demo.x = …); same type mapping, optional default_.
- @OutputLength int n: the batch row count, injected (for functions with no input column).
- last unannotated vector: the output, framework-allocated and sized to the row count.
compute() returns void; you fill the output vector. Parameter names become the SQL argument names, which is why the -parameters compiler flag is mandatory.
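As an illustration of the rules above, here is a minimal sketch of how parameter roles and names can be read from compute() by reflection. It is not the library's implementation: the `@Vector` and `@Const` annotations are local stand-ins declared inside the example, `describe` and `MultiplyBy` are hypothetical names, and `long[]` stands in for an Arrow vector. Only standard `java.lang.reflect` calls are used.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;
import java.util.ArrayList;
import java.util.List;

public class SignatureSketch {
    // Local stand-ins for the library's annotations (hypothetical, illustration only).
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER) public @interface Vector {}
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER) public @interface Const {}

    // A stand-in function class; long[] plays the role of an Arrow vector here.
    public static class MultiplyBy {
        public void compute(@Vector long[] value, @Const long factor, long[] result) {
            for (int i = 0; i < value.length; i++) result[i] = value[i] * factor;
        }
    }

    /** Classify each compute() parameter by its annotation, in positional order. */
    public static List<String> describe(Class<?> fn) {
        for (Method m : fn.getDeclaredMethods()) {
            if (!m.getName().equals("compute")) continue;
            List<String> out = new ArrayList<>();
            for (Parameter p : m.getParameters()) {
                String role = p.isAnnotationPresent(Vector.class) ? "vector"
                            : p.isAnnotationPresent(Const.class) ? "const"
                            : "output"; // unannotated: treated as the output
                // getName() is the source name only if compiled with -parameters;
                // otherwise the JVM synthesizes arg0, arg1, ...
                out.add(role + " " + p.getName());
            }
            return out;
        }
        throw new IllegalArgumentException("no compute() method on " + fn.getName());
    }

    public static void main(String[] args) {
        System.out.println(describe(MultiplyBy.class));
    }
}
```

Compiled with `javac -parameters`, the printed names are value, factor and result. Without the flag they come back as arg0, arg1 and arg2, which is the failure mode the flag prevents.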
A constant and a setting
// multiply_by(value BIGINT, factor BIGINT) using a session-tunable cap
public void compute(
        @Vector BigIntVector value,
        @Const long factor,
        @Setting(default_ = "9223372036854775807") long cap,
        BigIntVector result) {
    int rows = value.getValueCount();
    result.allocateNew(rows);
    for (int i = 0; i < rows; i++) {
        if (value.isNull(i)) { result.setNull(i); continue; }
        result.set(i, Math.min(value.get(i) * factor, cap));
    }
}
SET demo.cap = 1000;
SELECT demo.multiply_by(x, 3) FROM ...;
Dynamic output types
When the output type depends on the input type or a const arg, override outputType(). This is how a numeric double(x) returns BIGINT for an integer input but DOUBLE for a floating-point input, and how it validates its argument at bind time:
public final class Double extends ScalarFn {
    @Override public String name() { return "double"; }

    // accept any numeric column; reject the rest at bind time
    public void compute(
            @Vector(any = true, typeBound = TypeBoundPredicate.IS_ADDABLE) FieldVector value,
            FieldVector result) { /* … type-dispatched loop … */ }

    @Override
    protected ArrowType outputType(Schema inputSchema, Arguments args) {
        ArrowType in = inputSchema.getFields().get(0).getType();
        // promote int widths up one size, float32 -> float64, etc.
        return promote(in);
    }
}
A typeBound violation is reported at bind time with a SQL-typed message, e.g. double: value must be numeric (got VARCHAR), before any data moves.
For non-flat outputs (STRUCT / LIST / FixedSizeList), override outputSchema() instead, declaring the child fields. See the geo fixtures in vgi-example-worker.
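Returning to the flat case: the promote helper called from outputType() is not shown above. The sketch below is one possible shape for such a rule, under stated assumptions. It uses a local `SqlType` enum as a hypothetical stand-in for ArrowType so that it runs without Arrow on the classpath, and it implements the simpler rule from the prose (any integer becomes BIGINT, any float becomes DOUBLE) rather than the width-by-width widening mentioned in the code comment.

```java
public class PromoteSketch {
    // Stand-in for Arrow types, named after their SQL types (hypothetical, illustration only).
    public enum SqlType { TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, VARCHAR }

    /** Pick the output type from the input type, failing early on non-numeric input. */
    public static SqlType promote(SqlType in) {
        switch (in) {
            case TINYINT: case SMALLINT: case INTEGER: case BIGINT:
                return SqlType.BIGINT;   // any integer width -> BIGINT
            case FLOAT: case DOUBLE:
                return SqlType.DOUBLE;   // any float width -> DOUBLE
            default:
                // Same message format as the bind-time error quoted above.
                throw new IllegalArgumentException(
                    "double: value must be numeric (got " + in + ")");
        }
    }

    public static void main(String[] args) {
        System.out.println(promote(SqlType.INTEGER)); // prints BIGINT
        System.out.println(promote(SqlType.FLOAT));   // prints DOUBLE
        try {
            promote(SqlType.VARCHAR);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints the "must be numeric" message
        }
    }
}
```

Because the decision depends only on the declared input type, it can be made once at bind time, which is why a bad argument type can be rejected before any batch is processed.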
Null handling
A row is null if vector.isNull(i). You decide what a null input means: pass it through (result.setNull(i)), or treat it as an identity value (for example, 0 in a sum). Note that the engine short-circuits an all-literal-NULL call before it reaches the worker, so typeof(demo.double(NULL::INT)) is NULL; that's the engine, not your code.
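The two null policies can be compared side by side in a small sketch. This is plain Java rather than Arrow code: `Long[]` with null entries stands in for a nullable BIGINT column, and the class and method names are hypothetical.

```java
import java.util.Arrays;

public class NullPolicySketch {
    /** Pass-through: a null on either side makes the output row null. */
    public static Long[] addPropagating(Long[] a, Long[] b) {
        Long[] out = new Long[a.length];
        for (int i = 0; i < a.length; i++) {
            if (a[i] == null || b[i] == null) {
                out[i] = null;            // the analogue of result.setNull(i)
            } else {
                out[i] = a[i] + b[i];
            }
        }
        return out;
    }

    /** Identity: a null input counts as 0, the identity for addition. */
    public static Long[] addWithIdentity(Long[] a, Long[] b) {
        Long[] out = new Long[a.length];
        for (int i = 0; i < a.length; i++) {
            long x = (a[i] == null) ? 0L : a[i];
            long y = (b[i] == null) ? 0L : b[i];
            out[i] = x + y;
        }
        return out;
    }

    public static void main(String[] args) {
        Long[] a = {1L, null, 3L};
        Long[] b = {10L, 20L, null};
        System.out.println(Arrays.toString(addPropagating(a, b)));  // [11, null, null]
        System.out.println(Arrays.toString(addWithIdentity(a, b))); // [11, 20, 3]
    }
}
```

Which policy is right depends on the function's meaning. Pass-through matches the usual SQL expectation for arithmetic, while the identity policy suits accumulate-style functions where a missing value should not erase the others.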
Performance notes
- The framework reuses a per-thread output VectorSchemaRoot across batches, so steady-state scalar dispatch allocates nothing on the hot path.
- Presize the output when you can. result.allocateNew(rows) (fixed-width) or result.allocateNew(dataBytes, rows) (varlen) avoids repeated grow-and-copy inside setSafe.
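To see why presizing matters, the sketch below models grow-and-copy with a toy buffer that doubles its capacity when full and counts how many bytes it has to re-copy. It is an illustration of the general cost, not a model of Arrow's actual allocator; `GrowBuffer` and `copiedWhenFilling` are hypothetical names.

```java
public class PresizeSketch {
    /** A tiny append-only byte buffer that doubles when full and counts re-copied bytes. */
    public static final class GrowBuffer {
        private byte[] data;
        private int size = 0;
        public long copiedBytes = 0;

        public GrowBuffer(int initialCapacity) {
            data = new byte[Math.max(1, initialCapacity)];
        }

        public void append(byte[] src) {
            int needed = size + src.length;
            if (needed > data.length) {
                int cap = data.length;
                while (cap < needed) cap *= 2;              // grow by doubling
                byte[] bigger = new byte[cap];
                System.arraycopy(data, 0, bigger, 0, size); // the grow-and-copy step
                copiedBytes += size;
                data = bigger;
            }
            System.arraycopy(src, 0, data, size, src.length);
            size += src.length;
        }
    }

    /** Append `rows` values of `width` bytes each and report how many bytes were re-copied. */
    public static long copiedWhenFilling(int initialCapacity, int rows, int width) {
        GrowBuffer buf = new GrowBuffer(initialCapacity);
        byte[] value = new byte[width];
        for (int i = 0; i < rows; i++) buf.append(value);
        return buf.copiedBytes;
    }

    public static void main(String[] args) {
        int rows = 1024, width = 8;
        System.out.println("unsized:  " + copiedWhenFilling(16, rows, width));           // 8176
        System.out.println("presized: " + copiedWhenFilling(rows * width, rows, width)); // 0
    }
}
```

Starting from 16 bytes, filling 8192 bytes re-copies almost as much data again (8176 bytes over nine reallocations), whereas a buffer sized up front copies nothing.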
Going further
The full scalar surface — varargs, any-typed columns, nested STRUCT/LIST outputs, binary packing, secret accessors — is exercised by vgi-example-worker/src/main/java/farm/query/vgi/example/scalar/ in the vgi-java repo (Double, AddValues, Multiply, BinaryPacket, the geo centroid/distance trio, and more).
Next: table functions →
