tregex.quote(): Sanitizing strings for Tregex patterns

Unfortunately, TregexPattern doesn't have a quote() function yet. So I wrote a small function to quote my own strings. It is not complete, but it works for most strings found in news articles:

    /**
     * Escapes most sequences which cause problems in Tregexpressions.
     * @param unSanitizedWord The un-escaped string
     * @return The sanitized word.
     */
    private static String sanitizeWord(String unSanitizedWord) {
        String sanitizedWord = unSanitizedWord;
        boolean putSlashes = false;
        sanitizedWord = ((sanitizedWord.equals("(")) ? "/\\(/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(")")) ? "/\\)/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("%")) ? "/\\%/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("&")) ? "/\\&/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(":")) ? "/\\:/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals("$")) ? "/\\$/" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(".")) ? "/\\./" : sanitizedWord);
        sanitizedWord = ((sanitizedWord.equals(",")) ? "/\\,/" : sanitizedWord);
        if (sanitizedWord.matches("^[0-9].*")) {
            // Words starting with a digit are wrapped in slashes.
            putSlashes = true;
        }
        if (sanitizedWord.matches(".*[*+?].*")) {
            // The word contains regex quantifiers; escape them and wrap in slashes.
            putSlashes = true;
            sanitizedWord = sanitizedWord.replaceAll("\\*", "\\\\*");
            sanitizedWord = sanitizedWord.replaceAll("\\?", "\\\\?");
            sanitizedWord = sanitizedWord.replaceAll("\\+", "\\\\+");
        }
        if (sanitizedWord.endsWith(".")) {
            // The word ends with a period (typically an abbreviation); escape the trailing period.
            sanitizedWord = sanitizedWord.substring(0, sanitizedWord.length() - 1) + "\\.";
            putSlashes = true;
        }
        return ((putSlashes) ? "/" + sanitizedWord + "/" : sanitizedWord);
    }

I suspect that braces might cause problems too, but I have never encountered them in a news article.
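For strings that do contain braces or other special characters, a more general approach might be to escape every regex metacharacter and always wrap the word in slashes. The following is only a minimal sketch under a few assumptions, not a tested replacement for the function above: the class `TregexQuote` and the method `quote` are hypothetical names, the set of characters is the usual `java.util.regex` metacharacters plus `/`, and I am assuming that Tregex accepts `\/` inside a slash-delimited regex and searches for the regex within the node label (hence the `^` and `$` anchors).

```java
public class TregexQuote {
    // Characters with special meaning in java.util.regex, plus '/',
    // which delimits a regex inside a Tregex pattern (assumption: it can be escaped as \/).
    private static final String SPECIAL = "\\^$.|?*+()[]{}/";

    /**
     * Hypothetical helper: escapes every special character with a backslash
     * and wraps the anchored result in slashes.
     */
    static String quote(String word) {
        StringBuilder sb = new StringBuilder("/^");
        for (char c : word.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append("$/").toString();
    }

    public static void main(String[] args) {
        System.out.println(quote("U.S."));   // /^U\.S\.$/
        System.out.println(quote("{3+4}"));  // /^\{3\+4\}$/
    }
}
```

Unlike sanitizeWord, this sketch wraps every word in slashes, including plain ones, which keeps the logic to a single loop at the cost of slightly noisier patterns.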

musically_ut 2014-04-03