I’m writing a library that generates Java code at runtime, and so far it’s producing good results.
The problem: some legacy software (read: "adopting polymorphism now is kinda impossible") needs to parse large volumes of data. The conversion logic from PojoA to PojoB is always the same, so writing a simple mapper that specifically implements the logic to move data from PojoA to PojoB, and reusing it to parse a BIG List<PojoA> to List<PojoB>, would be ideal... but we have ~10,000 different Pojo classes (and an unknown number of meaningful source-destination combinations).
Currently, this "parsing" is done via reflection in a ~30k-line Java file that examines POJO variable names and types (which, to be fair, are remarkably consistent across files). Based on source vs destination comparisons, one can often predict the required transformation. For example, if the source POJO has a variable like:
public int variableUniqueID_date_yyyy_MM_dd_nullable;
...and the destination POJO has something like:
public String variableUniqueID_date_MonthNameALLCAPS_notNullable_default_JAN;
...you know there’s a match (yay!), but parsing is required. This is currently handled with reflection and a lot of if statements.
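To make that concrete, here's a minimal sketch of what that reflection-plus-if-statements style looks like. All names (SrcPojo, DstPojo, LegacyStyleMapper) and the month conversion are hypothetical, just mirroring the naming convention above; the real file is ~30k lines of branches like this:

```java
import java.lang.reflect.Field;

// Hypothetical POJOs following the naming convention described above.
class SrcPojo {
    public int variableUniqueID_date_yyyy_MM_dd_nullable = 20240115;
}
class DstPojo {
    public String variableUniqueID_date_MonthNameALLCAPS_notNullable_default_JAN;
}

public class LegacyStyleMapper {
    static final String[] MONTHS = {
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
    };

    // Match fields by the shared prefix before the first '_', then dispatch
    // on the format hints encoded in the rest of the name.
    public static void map(Object src, Object dst) throws Exception {
        for (Field s : src.getClass().getFields()) {
            String key = s.getName().split("_", 2)[0];
            for (Field d : dst.getClass().getFields()) {
                if (!d.getName().split("_", 2)[0].equals(key)) continue;
                if (s.getName().contains("yyyy_MM_dd")
                        && d.getName().contains("MonthNameALLCAPS")) {
                    int yyyyMMdd = s.getInt(src);                // e.g. 20240115
                    d.set(dst, MONTHS[(yyyyMMdd / 100) % 100 - 1]);
                }
                // ... hundreds more if-branches for other name/type pairs
            }
        }
    }
}
```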
This approach has worked for ~20 years (I think... I didn’t write it, the project wasn’t even in Git back then, and I’ve only been at the company a few years). But as data volume has grown exponentially, what used to take ~10 minutes now takes ~7 hours. This is expected, since we’re essentially moving data using reflection and huge switch statements based on variable names and types.
My first solution was to write a program that compared all possible source vs destination POJO files and generated a .java file with exact conversion logic for each pair. After a lot of trial and error, I tested the approach with randomly-picked POJOs and real production data. It worked like a charm! I even ran it overnight to compare results with the legacy solution, and they matched.
However, I didn’t implement this version in production because it would have generated nearly 100 million files... not something our build system could realistically handle :D.
I came up with a better solution: at runtime, I generate the code for the needed POJO pair, compile it using the Java Compiler API, cache the resulting class bytecode in an LRU cache (keyed by the source/destination pair), and begin parsing.
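For anyone curious how the compile-and-cache step can look, here's a minimal, self-contained sketch using the standard javax.tools API, with an access-ordered LinkedHashMap playing the role of the LRU cache. Class and method names (MapperCompiler, getMapper, MAX_CACHED) are illustrative, not my actual library's API:

```java
import javax.tools.*;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.*;

// Compiles generated mapper source in memory and caches the loaded Class.
public class MapperCompiler {
    private static final int MAX_CACHED = 256; // assumption: tune to taste
    private final Map<String, Class<?>> cache =
        new LinkedHashMap<>(16, 0.75f, true) {   // access-order = LRU
            protected boolean removeEldestEntry(Map.Entry<String, Class<?>> e) {
                return size() > MAX_CACHED;
            }
        };

    public synchronized Class<?> getMapper(String className, String source)
            throws Exception {
        Class<?> cached = cache.get(className);
        if (cached != null) return cached;

        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        ByteArrayOutputStream bytecode = new ByteArrayOutputStream();

        // Source held in memory instead of a .java file on disk.
        JavaFileObject src = new SimpleJavaFileObject(
                URI.create("string:///" + className + ".java"),
                JavaFileObject.Kind.SOURCE) {
            public CharSequence getCharContent(boolean ignore) { return source; }
        };
        // Redirect the compiler's .class output into the byte array.
        StandardJavaFileManager std =
            compiler.getStandardFileManager(null, null, null);
        JavaFileManager mgr = new ForwardingJavaFileManager<JavaFileManager>(std) {
            public JavaFileObject getJavaFileForOutput(Location loc, String name,
                    JavaFileObject.Kind kind, FileObject sibling) {
                return new SimpleJavaFileObject(
                        URI.create("mem:///" + name + ".class"), kind) {
                    public OutputStream openOutputStream() { return bytecode; }
                };
            }
        };
        if (!compiler.getTask(null, mgr, null, null, null, List.of(src)).call())
            throw new IllegalStateException("compilation failed: " + className);

        byte[] bytes = bytecode.toByteArray();
        Class<?> loaded = new ClassLoader(MapperCompiler.class.getClassLoader()) {
            Class<?> define() {
                return defineClass(className, bytes, 0, bytes.length);
            }
        }.define();
        cache.put(className, loaded);
        return loaded;
    }
}
```

In my real code the cache key is the source/destination pair rather than a single class name, but the mechanics are the same.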
The generated code is quite performant because everything happens inside a single loop with minimal external function calls (only parsing helpers). The generated structure looks like:
List<PojoA> source = ...;
List<PojoB> dest = new ArrayList<>(source.size());
for (int index = 0; index < source.size(); index++) {
    PojoA current = source.get(index);
    PojoB parsed = new PojoB();
    dest.add(parsed);
    // parse each variable
    {
        // logic resulting in parsed.setFirstVariableName(something);
    }
    {
        // logic resulting in parsed.setSecondVariableName(something);
    }
    // ...
}
Debugging generated code is never pleasant, but it shouldn't be too bad here. Still, the for-loop body can get quite long, so I was thinking about moving most (if not all) of the parsing into publicly accessible functions, so the result would look like:
List<PojoA> source = ...;
List<PojoB> dest = new ArrayList<>(source.size());
for (int index = 0; index < source.size(); index++) {
    PojoA current = source.get(index);
    PojoB parsed = new PojoB();
    dest.add(parsed);
    // parse each variable
    parsed.setFirstVariableName(parseFunctionX(current.getFirstVariableName()));
    parsed.setSecondVariableName(parseFunctionY(current.getSecondVariableName()));
    // ...
}
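To show what I mean by a helper, here's a hypothetical generated mapper in that style. The POJOs, the getter/setter shapes, and parseFunctionX's actual conversion (nullable Integer to non-null String with a default, matching the nullable/notNullable/default naming hints) are all illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJOs; only one field each to keep the sketch short.
class PojoA {
    private Integer firstVariableName; // nullable in the source
    public Integer getFirstVariableName() { return firstVariableName; }
    public void setFirstVariableName(Integer v) { firstVariableName = v; }
}
class PojoB {
    private String firstVariableName;  // notNullable in the destination
    public String getFirstVariableName() { return firstVariableName; }
    public void setFirstVariableName(String v) { firstVariableName = v; }
}

public class GeneratedMapper {
    // parseFunctionX: nullable Integer -> non-null String with a default.
    public static String parseFunctionX(Integer value) {
        return value == null ? "0" : value.toString();
    }

    public static List<PojoB> mapAll(List<PojoA> source) {
        List<PojoB> dest = new ArrayList<>(source.size());
        for (int index = 0; index < source.size(); index++) {
            PojoA current = source.get(index);
            PojoB parsed = new PojoB();
            dest.add(parsed);
            parsed.setFirstVariableName(parseFunctionX(current.getFirstVariableName()));
        }
        return dest;
    }
}
```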
This is more readable but might slightly impact performance. To mitigate that, I thought about using the @ForceInline annotation on my parsing functions.
Despite the warning inside the ForceInline class (https://github.com/openjdk/jdk/blob/7ae52ce572794f9d17446c66381f703ea1bb8b7c/src/java.base/share/classes/jdk/internal/vm/annotation/ForceInline.java#L35) I believe my use case is valid. Would you agree?
Also, referencing anything in jdk.internal.vm.annotation.* requires adding:
--add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED
to the compiler args. My other questions:
- Do I need to specify this at runtime too?
- If another application imports my code as a library, should they also compile/run with --add-exports?
- Why ALL-UNNAMED? Can I restrict the export to only my code-generation module? Or should the class containing the to-be-inlined function live inside that module?
- I've read that I need to name my module and declare exports in a module-info.java file; how does that work?
- Is this compatible with Maven (my current build system)?