Question

After automatically modifying some Python comments from Java, I would like to verify that the file still contains valid Python syntax. How can I do that from Java without actually running any Python code in an interpreter? (To be explicit: I am looking for a Java-only solution, not one that calls some other code from inside Java to determine whether the syntax is valid.)

I tried building the AST of the file using ANTLR; however, that seems like a non-trivial task for arbitrary Python files, as explained in this answer. Another suggestion is to simply run the file and see whether it runs, but that is unsafe for arbitrary files. Alternatively, one could call Python code from Java that verifies the file is runnable (as shown in this answer), but that also relies on executing external (controlled) code, which I would prefer not to do.

MWE

Below is an MWE that still requires/assumes you have Python installed somewhere in your system:

package com.something.is_valid_python_syntax;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenSource;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import com.doctestbot.is_valid_python_syntax.generated.PythonParser;
import com.doctestbot.is_valid_python_syntax.generated.PythonLexer;

public class IsValidPythonSyntax {
  
  
  public static PythonParser getPythonParser(String pythonCode) {
    // Create a CharStream from the Python code
    CharStream charStream = CharStreams.fromString(pythonCode);

    // Create the lexer
    PythonLexer lexer = new PythonLexer(charStream);

    // Create a token stream from the lexer
    CommonTokenStream tokenStream = new CommonTokenStream((TokenSource) lexer);

    // Create the parser
    return new PythonParser(tokenStream);
  }

  public static boolean isValidPythonSyntax(String pythonCode) {

    PythonParser parser = getPythonParser(pythonCode);

    // Parse the input and get the tree

    // Redirect standard error stream
    PrintStream originalErr = System.err;
    ByteArrayOutputStream errStream = new ByteArrayOutputStream();
    System.setErr(new PrintStream(errStream));

    try {
      ParseTree tree = parser.file_input();

    } finally {
      // Restore the original standard error stream
      System.setErr(originalErr);
    }

    // Check if there were any errors in the error stream
    String errorOutput = errStream.toString();
    if (!errorOutput.isEmpty()) {
      System.out.println("Invalid Python syntax:");
      System.out.println(errorOutput);
      return false;
    } else {
      System.out.println("Valid Python syntax");
      return true;
    }
  }
}

However, it reports the following Python code as having invalid syntax:

def foo():
    print("hello world.")
foo()

Based on the following Antlr error message:

Invalid Python syntax:
line 1:3 extraneous input ' ' expecting {'match', '_', NAME}

Searching for this error leads to suggestions on how to adapt the grammar; however, the parser was autogenerated from the Python 3.12 Antlr grammar.

Issue

It seems that the Antlr error message system does not distinguish between warnings and errors. For example, on the following code (missing closing bracket in the print call):

def foo():
    print("hello world."
foo()

It outputs:

Invalid Python syntax:
line 1:3 extraneous input ' ' expecting {'match', '_', NAME}
line 2:9 no viable alternative at input '('
line 3:0 no viable alternative at input 'foo'

I do not know how many different error messages Antlr can produce when parsing Python code, which of them I should take seriously, or whether deciding on valid/invalid Python syntax based on Antlr parsing errors is context dependent.
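For reference, each line in the output above follows the format of ANTLR's default console listener: `line <line>:<column> <message>`. As far as I am aware, that listener only prints syntax errors, one per line, so every such line counts as an error rather than a warning. The sketch below is a hypothetical stdlib-only helper (the class `AntlrErrorLines` and its `Entry` type are names made up for illustration, not part of ANTLR) that turns captured output into structured entries, which makes it easier to inspect what was reported and where.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: parses captured console error output.
final class AntlrErrorLines {

    /** One reported syntax error: line, column, and the message text. */
    static final class Entry {
        final int line;
        final int column;
        final String message;

        Entry(int line, int column, String message) {
            this.line = line;
            this.column = column;
            this.message = message;
        }
    }

    // Assumed format of the default console output: "line <line>:<column> <message>".
    private static final Pattern FORMAT = Pattern.compile("^line (\\d+):(\\d+) (.*)$");

    /** Returns one Entry per line that matches the assumed format; other lines are skipped. */
    static List<Entry> parse(String errorOutput) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : errorOutput.split("\\R")) {
            Matcher m = FORMAT.matcher(raw.trim());
            if (m.matches()) {
                entries.add(new Entry(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    m.group(3)));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // The three lines reported for the snippet with the missing bracket.
        String captured = String.join("\n",
            "line 1:3 extraneous input ' ' expecting {'match', '_', NAME}",
            "line 2:9 no viable alternative at input '('",
            "line 3:0 no viable alternative at input 'foo'");
        for (Entry e : parse(captured)) {
            System.out.println(e.line + ":" + e.column + " -> " + e.message);
        }
    }
}
```

If any entry is present, the parser rejected the input. Note that the `line 1:3` entry appears for both the valid and the invalid snippet, which points at the lexer/grammar setup rather than at the Python code itself.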

5 Comments
  • You said you wanted to make sure the Python you produce "still contains valid Python syntax." That's the purpose of parsing. Use Antlr to generate a parser for Python3 from github.com/antlr/grammars-v4/tree/master/python/python3_12_0. Write a Java program to call the parser on the input file (see github.com/kaby76/AntlrExamples/tree/…). Don't confuse an abstract syntax tree (AST) with a concrete syntax tree (CST, also known as a parse tree, PT). Your problem doesn't even require a PT or AST, just whether the input parses. Commented Jan 26, 2024 at 12:07
  • @kaby76 thank you, I have updated the question to include the approach using the Python 3.12 Antlr parser in Java, and it seems to throw error messages on valid Python syntax as well. I have dedicated the subsection ## Issue, to this behaviour. Commented Feb 5, 2024 at 13:22
  • "After automatically modifying some Python comments" How could that break the code? Commented Feb 6, 2024 at 8:28
  • I appreciate your curiosity. To avoid trailing off into a discussion on whether some edge case may or may not occur, I would like to share that I consider it better to verify that it does not occur than to think it cannot. Commented Feb 6, 2024 at 8:42
  • Can we see how the "automatically modifying some Python comments" works? Some code perhaps? Commented Feb 8, 2024 at 23:24

2 Answers

2
+100

I believe the issue is with the Python 3.12 grammar in the grammars-v4 repository. I used your code as a base, and was able to get it working properly using the grammars in the ANTLR4-parser-for-Python-3.12 repository.

Here's the full working code:

package playground;

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import playground.antlr.generated.PythonLexer;
import playground.antlr.generated.PythonParser;

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.List;

public class PythonValidator {

    public static void main(String[] args) {
        List<String> snippets = List.of(
            // Hello world as a function
            """
            def foo():
                print("hello world.")
            foo()
            """,

            // Program to generate a random number between 0 and 9
            """
            # Program to generate a random number between 0 and 9
            import random
                        
            print(random.randint(0,9))
            """,

            // Reverse a number
            """
            num = 1234
            reversed_num = 0
                        
            while num != 0:
                digit = num % 10
                reversed_num = reversed_num * 10 + digit
                num //= 10
                        
            print("Reversed Number: " + str(reversed_num))    
            """
        );
        PythonValidator validator = new PythonValidator();
        for (String snippet : snippets) {
            boolean valid = validator.isValidSyntax(snippet);
            System.out.println("Valid? " + valid);
        }
    }

    public boolean isValidSyntax(String pythonCode) {
        PythonParser parser = getPythonParser(pythonCode);

        // Redirect standard error stream
        PrintStream originalErr = System.err;
        ByteArrayOutputStream errStream = new ByteArrayOutputStream();
        System.setErr(new PrintStream(errStream));

        try {
            parser.file_input();
        } finally {
            // Restore the original standard error stream
            System.setErr(originalErr);
        }

        // Check if there were any errors in the error stream
        String errorOutput = errStream.toString();
        if (!errorOutput.isEmpty()) {
            System.out.println(errorOutput);
            return false;
        } else {
            return true;
        }
    }

    private PythonParser getPythonParser(String pythonCode) {
        // Create a CharStream from the Python code
        CharStream charStream = CharStreams.fromString(pythonCode);

        // Create the lexer
        PythonLexer lexer = new PythonLexer(charStream);

        // Create a token stream from the lexer
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);

        // Create the parser
        return new PythonParser(tokenStream);
    }
}

Here is a working example.

Just clone the repo and run:

./gradlew generateGrammarSource
./gradlew run
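
The stderr redirection used in isValidSyntax can also be factored into a small reusable helper. The following is only a sketch using the standard library; `StderrCapture` and `capture` are hypothetical names introduced here, not part of ANTLR or of the linked repository. It swaps `System.err` for an in-memory buffer, runs an action, and restores the original stream even if the action throws.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: captures whatever an action writes to System.err.
final class StderrCapture {

    static String capture(Runnable action) {
        PrintStream original = System.err;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream replacement = new PrintStream(buffer, true, StandardCharsets.UTF_8);
        System.setErr(replacement);
        try {
            action.run();
        } finally {
            // Flush pending output and restore the original stream in all cases.
            replacement.flush();
            System.setErr(original);
        }
        return buffer.toString(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String noisy = capture(() -> System.err.println("line 1:3 extraneous input"));
        String quiet = capture(() -> { });
        System.out.println("Noisy run captured: " + noisy.strip());
        System.out.println("Quiet run empty? " + quiet.isEmpty());
    }
}
```

With such a helper, the check could be written along the lines of `capture(() -> parser.file_input()).isEmpty()`. One caveat: `System.setErr` is process-wide, so output from other threads that write to stderr at the same time would also end up in the buffer. ANTLR's `Parser.getNumberOfSyntaxErrors()` is an alternative that avoids touching stderr at all.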

1 Comment

Thank you for sharing your solution and providing an MWE that runs so easily! I verified that your code works, and that it also returns false on invalid syntax.
1

I tried with the official ANTLR grammars and it worked for me.

I used the latest ANTLR version (4.13.1) to generate the files:

java -jar antlr-4.13.1-complete.jar *.g4

Here is my code:

import java.io.IOException;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.ParseCancellationException;

public class PythonChecker
{
    public static PythonParser getPythonParser(String file) throws IOException
    {
        CharStream charStream = CharStreams.fromFileName(file);
        PythonLexer lexer = new PythonLexer(charStream);
        lexer.removeErrorListeners();
        CommonTokenStream tokenStream = new CommonTokenStream(lexer);
        PythonParser parser = new PythonParser(tokenStream);
        parser.removeErrorListeners();
        parser.setErrorHandler(new BailErrorStrategy());
        return parser;
    }

    public static boolean isValidPythonSyntax(String file) throws IOException
    {
        PythonParser parser = getPythonParser(file);
        try
        {
            parser.file_input();
            return true;
        }
        catch(ParseCancellationException e)
        {
            return false;
        }
    }

    public static void main(String[] args) throws IOException
    {
        String file = args[0];
        if(isValidPythonSyntax(file))
            System.out.println(file + ": OK");
        else
            System.out.println(file + ": invalid");
    }
}

Two things to note:

  • I called removeErrorListeners() on the lexer and the parser to suppress the error messages sent to stderr

  • I set a BailErrorStrategy on the parser, which throws a ParseCancellationException as soon as a parse error is encountered

When checking this Python script (script1.py):

def foo():
    print("hello world.")
foo()

I get:

script1.py: OK

When checking this one (script2.py):

def foo():
    print("hello world."
foo()

I get:

script2.py: invalid
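
To check more than one file at a time, the per-file check can be applied to every .py file under a directory. The sketch below uses only the standard library and takes the checker as a `Predicate<Path>`; the class `BatchSyntaxCheck`, the method `findInvalid`, and the placeholder predicate in `main` are assumptions made for illustration, and a real checker (such as the `isValidPythonSyntax` method above) would have to be plugged in.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical batch wrapper around a per-file syntax check.
final class BatchSyntaxCheck {

    /** Returns the .py files under root (recursively) that the checker rejects, sorted by path. */
    static List<Path> findInvalid(Path root, Predicate<Path> isValid) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".py"))
                .filter(isValid.negate())
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        // Placeholder that accepts every file; replace it with a real syntax check.
        Predicate<Path> placeholder = p -> true;
        for (Path p : findInvalid(root, placeholder)) {
            System.out.println(p + ": invalid");
        }
    }
}
```

Because `isValidPythonSyntax` declares `IOException`, a lambda that calls it would need to catch that exception, for example by rethrowing it as an `UncheckedIOException`.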

1 Comment

Thank you for your detailed answer, and notes on how you resolved it. The answer by @blacktide was submitted first and has been validated, so I decided to award the bonus to that answer. I hope you can appreciate that.
