In this post, we will look at how to parse an Apache log file in Java. We will also walk through each part of the regular expression that does the parsing.
The file format was designed for human inspection, not for easy parsing. The problem is that a log line uses several different delimiters: square brackets for the date, quotes for the request line, and spaces sprinkled all through. If you try to use a StringTokenizer, you might get it working, but you would spend a lot of time fiddling with it. A regex saves a lot of lengthy code, so let's see how it works.
A sample Apache log line looks like this:
String ApacheLogSample = "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] \"GET /java/javaResources.html "+ "HTTP/1.0\" 200 10450 \"-\" \"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
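To see why tokenizing on spaces alone is awkward for this line, here is a minimal sketch that runs a StringTokenizer over the sample. The class name NaiveTokenizerDemo and the helper tokenize are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class NaiveTokenizerDemo {
    // Hypothetical helper: splits a log line on spaces only,
    // with no awareness of the brackets or quotes around fields.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, " ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] \"GET /java/javaResources.html "
                + "HTTP/1.0\" 200 10450 \"-\" \"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
        List<String> tokens = tokenize(line);
        System.out.println("Token count: " + tokens.size());
        // The date field is cut in two because it contains a space.
        System.out.println("Date pieces: " + tokens.get(3) + " and " + tokens.get(4));
    }
}
```

For the sample line this produces 19 tokens, with the date split into [27/Oct/2000:09:27:09 and -0400] and the request line and user agent scattered over several tokens each. Stitching those pieces back together is the fiddling that the regex avoids.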
And here is the regex for parsing that log line:
String regex = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"(.+?)\"";
- ([\d.]+)
It matches one or more characters that are each either a digit or a dot (.), e.g. 123.45.67.89. This captures the IP address at the start of the line.
- (\S+)
It matches one or more characters that are not whitespace. It appears twice, once for the identity field and once for the user name (both are "-" in the sample).
- \[([\w:/]+\s[+-]\d{4})\]
[\w:/]+ matches one or more word characters, colons (:), or slashes (/). It covers 27/Oct/2000:09:27:09 in the ApacheLogSample string. \s[+-] means a whitespace character followed by either plus (+) or minus (-), and \d{4} means exactly four digits; together they cover the -0400 offset. The escaped \[ and \] match the literal square brackets around the date.
- (.+?)
It matches any characters up to the next quote. We can't use (.+) here, because the greedy form grabs as much as it can and can run on to the quote at the end of the line.
- (\d{3})
It matches exactly three digits, e.g. 200 for the status code. Because a space must follow the group, neither 12 nor 1234 would match here.
- (\d+)
It matches one or more digits, here the number of bytes sent.
- ([^"]+)
It matches one or more characters other than a double quote (").
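The difference between the lazy (.+?) and the greedy (.+) is easiest to see in isolation. The sketch below applies both to a shortened fragment of a log line; the class name QuantifierDemo, the helper firstGroup, and the fragment itself are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns capture group 1 of the first match,
    // or null if the regex does not match anywhere in the input.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String fragment = "\"GET /index.html HTTP/1.0\" 200 10450 \"-\" \"Mozilla/4.6\"";
        // Lazy: stops at the first closing quote it can.
        System.out.println("Lazy:   " + firstGroup("\"(.+?)\"", fragment));
        // Greedy: runs to the last quote in the input.
        System.out.println("Greedy: " + firstGroup("\"(.+)\"", fragment));
    }
}
```

The lazy version captures only GET /index.html HTTP/1.0, while the greedy version captures everything from the first quote to the last one, including the status code, byte count, and the other quoted fields.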
Now that we understand the regex, let's look at the program that parses the log line in Java. In the Java string literal, each regex backslash is written as a double backslash ( \\ ), because the backslash is also the escape character in Java strings.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogParser {
    public static void main(String[] args) {
        // Groups: 1 = IP, 2 = identity, 3 = user, 4 = date/time, 5 = request,
        // 6 = status code, 7 = bytes sent, 8 = referer, 9 = user agent
        String regex = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"(.+?)\"";
        String ApacheLogSample = "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] \"GET /java/javaResources.html "
                + "HTTP/1.0\" 200 10450 \"-\" \"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
        Pattern p = Pattern.compile(regex);
        System.out.println("Apache log input line: " + ApacheLogSample);
        Matcher matcher = p.matcher(ApacheLogSample);
        if (matcher.find()) {
            System.out.println("IP Address: " + matcher.group(1));
            System.out.println("UserName: " + matcher.group(3));
            System.out.println("Date/Time: " + matcher.group(4));
            System.out.println("Request: " + matcher.group(5));
            System.out.println("Response: " + matcher.group(6));
            System.out.println("Bytes Sent: " + matcher.group(7));
            // A referer of "-" means none was sent, so skip it
            if (!matcher.group(8).equals("-")) {
                System.out.println("Referer: " + matcher.group(8));
            }
            System.out.println("User-Agent: " + matcher.group(9));
        }
    }
}
The output of the program (the Referer line is skipped because its value is "-"):
Apache log input line: 123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
IP Address: 123.45.67.89
UserName: -
Date/Time: 27/Oct/2000:09:27:09 -0400
Request: GET /java/javaResources.html HTTP/1.0
Response: 200
Bytes Sent: 10450
User-Agent: Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)
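The Date/Time value comes back as plain text. If you need it as a real date, one possible follow-up is to parse group 4 with java.time. This is a minimal sketch, not part of the program above: the class name LogTimestampDemo, the helper parseTimestamp, and the format pattern are assumptions chosen to fit the sample timestamp.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class LogTimestampDemo {
    // Assumed pattern for values such as 27/Oct/2000:09:27:09 -0400.
    // Locale.ENGLISH is set explicitly so that month abbreviations like "Oct"
    // parse the same way regardless of the machine's default locale.
    private static final DateTimeFormatter APACHE_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Hypothetical helper: converts the text captured by group 4 into an OffsetDateTime.
    // Throws DateTimeParseException if the text does not fit the pattern.
    static OffsetDateTime parseTimestamp(String text) {
        return OffsetDateTime.parse(text, APACHE_TIME);
    }

    public static void main(String[] args) {
        OffsetDateTime time = parseTimestamp("27/Oct/2000:09:27:09 -0400");
        System.out.println("Parsed: " + time);
        System.out.println("In UTC: " + time.toInstant());
    }
}
```

For the sample value this prints 2000-10-27T09:27:09-04:00 as the parsed time and 2000-10-27T13:27:09Z as the UTC instant. Keeping the offset (rather than dropping it) makes it possible to compare entries from servers in different time zones.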
So, that's it. This is all you have to do to parse an Apache log line using Java and regex.
Reference: Java Cookbook
We hope that you find it helpful. If you have any doubts or concerns, feel free to write to us in the comments or mail us at admin@codekru.com.