How to choose a website parsing solution: classification and a comprehensive overview of programs, services, and frameworks
Contents:
- Overview
- java.io.PrintStream
- Filtering
- HTML Parsing in Java using JSoup
- Math expressions parser — supported frameworks:
- Math Expression Parser — Main functionalities:
- Math ExpressionParser — High flexibility functionalities
- Math Expression Parser — Project documentation
- What is JSoup Library
- Modifying
- java.text.SimpleDateFormat
- Package installation
- Jsoup at a Glance
- Extracting
- Java Program to parse HTML Document
- Loading
- java.util.Scanner
Overview
Jsoup is an open source Java library used mainly for extracting data from HTML. It also allows you to manipulate and output HTML. It has a steady development line, great documentation, and a fluent and flexible API. Jsoup can also be used to parse and build XML.
In this tutorial, we’ll use the Spring Blog to illustrate a scraping exercise that demonstrates several features of jsoup:
- Loading: fetching and parsing the HTML into a Document
- Filtering: selecting the desired data into Elements and traversing it
- Extracting: obtaining attributes, text, and HTML of nodes
- Modifying: adding/editing/removing nodes and editing their attributes
java.io.PrintStream
The java.io.PrintStream class lets you write formatted data to any stream. You will rarely need to create instances of this class yourself; far more often you will use ready-made ones, such as System.out.
PrintStream has print and println methods, overloaded for every primitive type and for Object (in which case the object's toString() method is used).
PrintStream never throws an IOException. Instead, it sets an internal flag that can be checked with the method public boolean checkError().
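The error-flag behavior can be sketched with a stream that always fails. The class name CheckErrorDemo, the helper method, and the deliberately broken anonymous OutputStream are assumptions made up for this illustration:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintStream;

class CheckErrorDemo {

    // Hypothetical helper: writes to a stream that always fails and
    // reports whether PrintStream recorded the failure.
    static boolean errorFlagAfterFailedWrite() {
        OutputStream broken = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new IOException("simulated failure");
            }
        };
        PrintStream out = new PrintStream(broken);
        out.print("hello");      // the IOException is swallowed, nothing is thrown here
        return out.checkError(); // flushes the stream, then returns the internal error flag
    }

    public static void main(String[] args) {
        System.out.println(errorFlagAfterFailedWrite());
    }
}
```

Because print never throws, code that cares about write failures has to call checkError() explicitly, as the helper does.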
The following methods deserve special attention:
Java
public PrintStream format(String format, Object... args)
public PrintStream format(Locale l, String format, Object... args)
public PrintStream printf(String format, Object... args)
public PrintStream printf(Locale l, String format, Object... args)
These methods write formatted data to the stream. Here format is the format string template.
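A short sketch of format and printf, writing into an in-memory buffer so the result is easy to inspect. The class name PrintStreamFormatDemo and the sample values are assumptions chosen for illustration; an explicit Locale is passed to every call so the output does not depend on the machine's default locale:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Locale;

class PrintStreamFormatDemo {

    // Formats a few values into an in-memory buffer and returns the text.
    static String render() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true);
        out.printf(Locale.ROOT, "%s=%d;", "items", 3);  // string and integer conversions
        out.format(Locale.US, "%.2f;", 1234.5);         // two decimals, US decimal point
        out.format(Locale.GERMANY, "%,.2f", 1234.5);    // grouping and decimal comma, German style
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(render()); // items=3;1234.50;1.234,50
    }
}
```

printf and format behave identically; both return the PrintStream itself, so calls can also be chained.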
Filtering
Now that we have the HTML converted into a Document, it’s time to navigate it and find what we are looking for. This is where the resemblance with jQuery/JavaScript is more evident, as its selectors and traversing methods are similar.
5.1. Selecting
The Document select method receives a String representing the selector, using the same selector syntax as CSS or JavaScript, and returns the list of matching Elements. This list can be empty but not null.
Let’s take a look at some selections using the select method:
You can also use more explicit methods inspired by the browser DOM instead of the generic select:
Since Element is a superclass of Document, you can learn more about working with the selection methods in the Document and Element Javadocs.
5.2. Traversing
Traversing means navigating across the DOM tree. Jsoup provides methods that operate on the Document, on a set of Elements, or on a specific Element, allowing you to navigate to a node’s parents, siblings, or children.
Also, you can jump to the first, the last, and the nth (using a 0-based index) Element in a set of Elements:
You can also iterate through selections. In fact, anything of type Elements can be iterated:
You can make a selection restricted to a previous selection (sub-selection):
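The selection and traversal calls described in this section can be sketched as follows. This is a minimal example that assumes the jsoup library is on the classpath; the HTML fragment, the class name FilteringDemo, and the helper methods are hypothetical stand-ins for a real blog page:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class FilteringDemo {

    // Hypothetical page fragment standing in for a real blog listing.
    static final String HTML =
          "<nav><ul>"
        + "<li class=\"navbar-link\"><a href=\"/blog\">Blog</a></li>"
        + "<li class=\"navbar-link\"><a href=\"/docs\">Docs</a></li>"
        + "</ul></nav>"
        + "<article id=\"a1\"><h2><a href=\"/post/1\">First post</a></h2><p>Intro one</p></article>"
        + "<article id=\"a2\"><h2><a href=\"/post/2\">Second post</a></h2><p>Intro two</p></article>";

    // Selecting: run a CSS selector, then iterate over the resulting Elements.
    static List<String> articleTitles(Document doc) {
        List<String> titles = new ArrayList<>();
        for (Element link : doc.select("article h2 a")) {
            titles.add(link.text());
        }
        return titles;
    }

    // DOM-style lookup by id, then traversal to the next sibling element.
    static String idOfArticleAfter(Document doc, String id) {
        Element next = doc.getElementById(id).nextElementSibling();
        return next == null ? null : next.id();
    }

    // Sub-selection: search only inside the first matched article.
    static String firstArticleLink(Document doc) {
        Elements articles = doc.select("article");
        return articles.first().select("a[href]").attr("href");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(articleTitles(doc));
        System.out.println(doc.select("li.navbar-link").size());
        System.out.println(doc.select("article").eq(1).attr("id")); // eq uses a 0-based index
        System.out.println(idOfArticleAfter(doc, "a1"));
        System.out.println(firstArticleLink(doc));
    }
}
```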
HTML Parsing in Java using JSoup
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Login Page</title>
</head>
<body>
    <div id="login" class="simple" >
        <form action="login.do">
            Username : <input id="username" type="text" /><br>
            Password : <input id="password" type="password" /><br>
            <input id="submit" type="submit" />
            <input id="reset" type="reset" />
        </form>
    </div>
</body>
</html>
HTML parsing is very simple with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. Jsoup provides several overloaded parse() methods to read HTML from a String, a File, a URL, or an InputStream, optionally with a base URI. You can also specify the character encoding to read HTML files correctly in case they are not in “UTF-8” format.
Math expressions parser — supported frameworks:
- JAVA: 1.5, 1.6, 1.7, 1.8 (separate binaries available if necessary)
- Android: tested with mxparser_jdk1.7.jar
- .NET/MONO: 2.0, 3.0, 3.5, 4.0, 4.5, 4.6 (separate binaries available if necessary)
- .NET Core: 1.0, 1.1
- .NET Standard: 1.0, 1.6
- .NET PCL: portable45, win8, wpa81
- Xamarin.Android: 1.0, 6.0
- Xamarin.iOS: 1.0
Math Expressions Parser Features:
- rich built-in library of operators, constants, math functions;
- user defined: arguments, functions, recursive functions and general recursion (direct / indirect);
- grammar and internal syntax checking;
Math Expressions Parser — features list with examples
Functionality | Example | Support level |
---|---|---|
Simple calculator | i.e.: 2+3, n! | Full support |
Binary relations | i.e.: a*b | Full support |
Boolean operators | i.e.: a&b | Full support |
Built-in constants | i.e.: 2+pi | Full support |
User defined constants | i.e.: 3tau, where tau = 2pi | Full support |
Built-in unary functions | i.e.: sin(2) | Extensive collection |
Built-in binary functions | i.e.: log(a,b) | Main functions |
Built-in n-arguments functions | i.e.: gcd(a,b,c,d,…) | Special functions |
Evaluating conditions | i.e.: if(a=true, then b, else c) | Full support |
Cases functions | i.e.: iff(case1, then a1, case2, then a2, …) | Full support |
User defined arguments | i.e.: x = 5, cos(x) | Full support |
User defined dependent arguments | i.e.: x=2, y=x^2 | Full support |
Iterated operators — SIGMA summation | i.e.: sum( 1, n, f(…,i) {step} ) | Full support |
Iterated operators — PI product | i.e.: prod( 1, n, f(…,i) {step} ) | Full support |
Derivatives | i.e.: der( sin(x), x ) | Full support |
Integrals | i.e.: 2*int( sqrt(1-x^2), x, -1, 1) | Full support |
User defined functions | i.e.: f(x,y) = sin(x+y) | Full support |
Fast (limited) recursion | i.e.: fib(n) = fib(n-1) + fib(n-2), addBaseCase(0, 0), addBaseCase(1, 1) | Full support |
Recursion, any kind | i.e.: Cnk(n,k) = if( k>0, if( k<n, Cnk(n-1,k-1)+Cnk(n-1,k), 1), 1) | Full support |
Syntax checking | checkSyntax() | Full support |
Getting computing time | getComputingTime() | Full support |
Verbose mode | setVerboseMode() | Full support |
mXparser — deliverables
Language / Framework | Documentation | Library | Source code |
---|---|---|---|
JAVA | Yes | Yes | Yes |
Android | Yes | Yes | Yes |
C# .NET | Yes | Yes | Yes |
Visual Basic .NET(CLS) | Yes | Yes | |
C++/CLI .NET(CLS) | Yes | Yes | |
F# .NET(CLS) | Yes | Yes | |
Other .NET languages | Yes | Yes, not tested | |
MONO | Yes | Yes, dll tested | C# code |
Math Expression Parser — Main functionalities:
- basic operators, i.e.: +,- , *, ^, !
- Boolean logic operators i.e.: or, and, xor
- binary relations i.e.: =, <, >
- math functions (large library of unary, binary, 3-args, and n-args functions) i.e.: sin, cos, Stirling numbers, log, inverse functions
- constants (large library), i.e.: pi, e, golden ratio
- n-arguments functions, i.e.:
- iterated summation and product operators
- differentiation and integration
Math ExpressionParser — High flexibility functionalities
- user defined constants and arguments (both free and dependent) + possibility of use in functions
- user defined functions (both free and dependent)
- user defined recursive arguments + simple (controlled) recursion (1 recursive argument)
- user defined recursive functions / expressions (any) — complex, many arguments, no limitation
- internal syntax checking
- internal help
- random numbers, random variables and probability distributions
- expression string tokenizer
- other useful functionalities, e.g. expression description and verbose mode.
Math Expression Parser — Project documentation
Mariusz Gromada
What is JSoup Library
Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. Jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers like Chrome and Firefox do. Here are some of the useful features of jsoup library :
- Jsoup can scrape and parse HTML from a URL, file, or string
- Jsoup can find and extract data, using DOM traversal or CSS selectors
- Jsoup allows you to manipulate the HTML elements, attributes, and text
- Jsoup can clean user-submitted content against a safe whitelist, to prevent XSS attacks
- Jsoup can also output tidy HTML
Jsoup is designed to deal with all the varieties of HTML found in the real world, from properly validated markup to incomplete, invalid tag soup. One of the core strengths of Jsoup is that it’s very robust.
Modifying
Modifying encompasses setting attributes, text, and HTML of elements, as well as appending and removing elements. It is done to the DOM tree previously generated by jsoup – the Document.
7.1. Setting Attributes and Inner Text/HTML
As in jQuery, the methods to set attributes, text, and HTML bear the same names but also receive the value to be set:
- attr() – sets an attribute’s values (it creates the attribute if it does not exist)
- text() – sets element inner text, replacing content
- html() – sets element inner HTML, replacing content
Let’s look at a quick example of these methods:
7.2. Creating and Appending Elements
To add a new element, you need to build it first by instantiating Element. Once the Element has been built, you can append it to another Element using the appendChild method. The newly created and appended Element will be inserted at the end of the element where appendChild is called:
7.3. Removing Elements
To remove elements, you need to select them first and run the remove method.
For example, let’s remove all <li> tags that contain the “navbar-link” class from Document, and all images from the first article:
7.4. Converting the Modified Document to HTML
Finally, since we were changing the Document, we might want to check our work.
To do this, we can explore the Document DOM tree by selecting, traversing, and extracting using the presented methods, or we can simply extract its HTML as a String using the html() method:
The resulting String is tidy HTML.
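The modification steps from this section can be sketched in one pass. This is a minimal example that assumes the jsoup library is on the classpath; the class name ModifyingDemo, the sample markup, and the attribute name data-edited are hypothetical:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

class ModifyingDemo {

    // Hypothetical markup: a navigation list and one article with an image.
    static final String HTML =
          "<ul><li class=\"navbar-link\">Blog</li><li>Keep</li></ul>"
        + "<article><h2>Old title</h2><p>Intro</p><img src=\"a.png\"></article>";

    static Document modify(String html) {
        Document doc = Jsoup.parse(html);
        Element article = doc.select("article").first();

        // Set an attribute (created if missing) and replace inner text and HTML.
        article.attr("data-edited", "true");
        article.select("h2").first().text("Renamed post");
        article.select("p").first().html("<em>Edited</em> intro");

        // Build a new element and append it at the end of the article.
        Element link = new Element(Tag.valueOf("a"), "")
                .text("Read more")
                .attr("href", "/post/1#more");
        article.appendChild(link);

        // Remove every element matched by a selection.
        doc.select("li.navbar-link").remove();
        article.select("img").remove();
        return doc;
    }

    public static void main(String[] args) {
        // Print the modified body as HTML to check the result.
        System.out.println(modify(HTML).body().html());
    }
}
```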
java.text.SimpleDateFormat
The java.text.SimpleDateFormat class extends java.text.DateFormat and lets you specify a custom formatting pattern.
Constructors:
Java
public SimpleDateFormat(String pattern)
public SimpleDateFormat(String pattern, Locale locale)
public SimpleDateFormat(String pattern, DateFormatSymbols formatSymbols)
The constructor that takes DateFormatSymbols creates a formatter that uses custom rules, for example custom month and day names.
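The pattern can contain the following special characters:
Letter | Date or time component | Presentation | Examples |
---|---|---|---|
G | Era | Text | AD |
y | Year | Year | 1996; 96 |
Y | Week year | Year | 2009; 09 |
M | Month in year (context sensitive) | Month | July; Jul; 07 |
L | Month in year (standalone form) | Month | July; Jul; 07 |
w | Week in year | Number | 27 |
W | Week in month | Number | 2 |
D | Day in year | Number | 189 |
d | Day in month | Number | 10 |
F | Day of week in month | Number | 2 |
E | Day name in week | Text | Tuesday; Tue |
u | Day number of week (1 = Monday, ..., 7 = Sunday) | Number | 1 |
a | Am/pm marker | Text | PM |
H | Hour in day (0-23) | Number | 0 |
k | Hour in day (1-24) | Number | 24 |
K | Hour in am/pm (0-11) | Number | 0 |
h | Hour in am/pm (1-12) | Number | 12 |
m | Minute in hour | Number | 30 |
s | Second in minute | Number | 55 |
S | Millisecond | Number | 978 |
z | Time zone | General time zone | Pacific Standard Time; PST; GMT-08:00 |
Z | Time zone | RFC 822 time zone | -0800 |
X | Time zone | ISO 8601 time zone | -08; -0800; -08:00 |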
Description of the "Presentation" column:
- Text: If the pattern has 4 or more letters, the full form is used; otherwise a short or abbreviated form is used. When parsing, both forms are accepted regardless of the number of pattern letters.
- Number: The number of pattern letters is the minimum number of digits; shorter numbers are zero-padded. When parsing, the number of letters is ignored unless it is needed to separate two adjacent fields.
- Year: If the formatter's Calendar is the Gregorian calendar, the following rules apply. When formatting, if the number of pattern letters is 2, the year is truncated to 2 digits; otherwise it is interpreted as a number. When parsing, if the number of pattern letters is more than 2, the year is interpreted literally, regardless of the number of digits. So with the pattern "MM/dd/yyyy", the string "01/11/12" parses to January 11, 12 A.D. When parsing with the abbreviated year pattern ("y" or "yy"), SimpleDateFormat interprets the abbreviated year relative to some century. It adjusts dates to fall within the range from 80 years before to 20 years after the moment the SimpleDateFormat instance was created. During parsing, only strings consisting of exactly two digits are interpreted relative to that default century. Any other numeric string, such as a one-digit string or a string of three or more digits, is interpreted as a full year.
- Month: If the number of pattern letters is 3 or more, the month is interpreted as text; otherwise it is interpreted as a number. The letter M produces context-sensitive month names. If a DateFormatSymbols was passed to the constructor or set with setDateFormatSymbols, the month names are taken from that DateFormatSymbols. The letter L produces the standalone form of month names.
- General time zone: Time zones are interpreted as text if they have names. When an offset is used, the time zone is written in the form GMT+01:30 or GMT-12:33.
- RFC 822 time zone: A four-digit offset is used: -0800 or +1200.
- ISO 8601 time zone: Two digits, four digits, or hours and minutes separated by a colon are used: -08; -0800; -08:00.
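The pattern rules above can be tried out with a short sketch. The class name SimpleDateFormatDemo and its helpers are assumptions for illustration. The locale and time zone are fixed, and set2DigitYearStart pins the two-digit-year window to 1970 through 2069, so the results do not depend on where or when the code runs (by default the window starts 80 years before the formatter is created):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class SimpleDateFormatDemo {

    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Builds a formatter with a fixed locale, time zone, and two-digit-year window.
    static SimpleDateFormat formatter(String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.US);
        f.setTimeZone(UTC);
        f.set2DigitYearStart(new Date(0L)); // window: 1970-01-01 up to, not including, 2070-01-01
        return f;
    }

    // Formats the epoch instant (1970-01-01T00:00:00 UTC) with the given pattern.
    static String formatEpoch(String pattern) {
        return formatter(pattern).format(new Date(0L));
    }

    // Parses the text and returns the calendar year of the result.
    static int parsedYear(String pattern, String text) {
        try {
            Calendar c = new GregorianCalendar(UTC, Locale.US);
            c.setTime(formatter(pattern).parse(text));
            return c.get(Calendar.YEAR);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formatEpoch("EEE, d MMM yyyy HH:mm:ss Z"));   // Thu, 1 Jan 1970 00:00:00 +0000
        System.out.println(formatEpoch("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")); // 1970-01-01T00:00:00.000Z
        System.out.println(parsedYear("MM/dd/yyyy", "01/11/12"));        // 12: year taken literally
        System.out.println(parsedYear("MM/dd/yy", "01/11/12"));          // 2012: placed inside the window
    }
}
```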
Package installation
Maven
mXparser is an easy-to-use, rich, fast, and highly flexible math expression parser library (a parser and evaluator of mathematical expressions and formulas provided as plain text). The software delivers an easy-to-use API for JAVA, Android, and C# .NET/MONO (Common Language Specification compliant: F#, Visual Basic, C++/CLI). The expression parser comes with extensive documentation, an easy-to-follow tutorial, "Hello World!" projects for 5 different languages explained with many screenshots, and a performance test summary. The formula evaluator is distributed under the Simplified BSD license, which means the software is completely free.
Jsoup at a Glance
Jsoup loads the page HTML and builds the corresponding DOM tree. This tree works the same way as the DOM in a browser, offering methods similar to jQuery and vanilla JavaScript to select, traverse, manipulate text/HTML/attributes and add/remove elements.
If you’re comfortable with client-side selectors and DOM traversing/manipulation, you’ll find jsoup very familiar. Check how easy it is to print the paragraphs of a page:
Bear in mind that jsoup interprets HTML only — it does not interpret JavaScript. Therefore changes to the DOM that would normally take place after page loads in a JavaScript-enabled browser will not be seen in jsoup.
Extracting
We now know how to reach specific elements, so it’s time to get their content — namely their attributes, HTML, or child text.
Take a look at this example that selects the first article from the blog and gets its date, its first section text, and finally, its inner and outer HTML:
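A minimal sketch of these extraction calls, assuming the jsoup library is on the classpath. The markup, the date value, and the class name ExtractingDemo are hypothetical; pretty-printing is switched off so that html() and outerHtml() return the markup on a single line:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ExtractingDemo {

    // Hypothetical article markup standing in for a real blog post.
    static final String HTML =
          "<article><time datetime=\"2024-01-15\">January 15</time>"
        + "<section><p>Hello <b>world</b></p></section></article>";

    // Returns: attribute value, text, inner HTML, outer HTML.
    static String[] extract() {
        Document doc = Jsoup.parse(HTML);
        doc.outputSettings().prettyPrint(false); // no added line breaks or indentation
        Element article = doc.select("article").first();
        Element paragraph = article.select("section p").first();
        return new String[] {
            article.select("time").attr("datetime"), // attribute of the first match
            paragraph.text(),                         // combined text of the element and its children
            paragraph.html(),                         // inner HTML
            paragraph.outerHtml()                     // the element's own tag plus its inner HTML
        };
    }

    public static void main(String[] args) {
        for (String part : extract()) {
            System.out.println(part);
        }
    }
}
```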
Here are some tips to bear in mind when choosing and using selectors:
- Rely on “View Source” feature of your browser and not only on the page DOM as it might have changed (selecting at the browser console might yield different results than jsoup)
- Know your selectors as there are a lot of them and it’s always good to have at least seen them before; mastering selectors takes time
- Use a playground for selectors to experiment with them (paste a sample HTML there)
- Be less dependent on page changes: aim for the smallest and least compromising selectors (e.g. prefer id-based selectors)
Java Program to parse HTML Document
Here is a complete Java program that parses an HTML String, an HTML page downloaded from the internet, and an HTML file from the local file system. You can run it from Eclipse, from any other IDE, or from the command prompt. In Eclipse, create a new Java project, copy the code, right-click the src folder, and paste; Eclipse creates the package and a source file with the matching name for you. The program shows three examples of parsing and traversing HTML: the first parses a String with HTML content, the second parses a page fetched from a URL, and the third loads and parses an HTML document from the local file system.
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents using the Jsoup library.
 * Jsoup is an open source library which allows Java developers to parse HTML
 * files and extract elements, manipulate data, and change style using DOM, CSS
 * and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {

        // Example 1 - parse an HTML String
        String HTMLSTring = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(HTMLSTring);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + HTMLSTring);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Example 2 - read an HTML page from a URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            title = doc.title();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("Jsoup Can read HTML page from URL, title : " + title);

        // Example 3 - parse an HTML file
        // Wrong: Jsoup.parse("login.html", "ISO-8859-1") treats "login.html" as the HTML itself
        Document htmlFile;
        try {
            htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1"); // right
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to inspect if the file could not be read
        }

        title = htmlFile.title();
        Element div = htmlFile.getElementById("login");
        String cssClass = div.className(); // getting the class from the HTML element

        System.out.println("Jsoup can also parse HTML file directly");
        System.out.println("title : " + title);
        System.out.println("class of div tag : " + cssClass);
    }
}
Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup Can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple
The Jsoup HTML parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It can handle the following mistakes:
- unclosed tags (e.g. <p>Java <p>Scala is parsed as <p>Java</p> <p>Scala</p>)
- implicit tags (e.g. a naked <td>Java is Great</td> is wrapped into a <table><tr><td>)
- reliably creating the document structure (html containing a head and body, and only appropriate elements within the head).
Jsoup is an excellent and robust open source library which makes reading html documents, body fragments, html strings and directly parsing html content from the web, extremely easy.
Reference: 3 Examples of Parsing HTML File in Java using Jsoup from our JCG partner Javin Paul at the Javarevisited blog.
Loading
The loading phase comprises the fetching and parsing of the HTML into a Document. Jsoup guarantees the parsing of any HTML, from the most invalid to the totally validated ones, as a modern browser would do. It can be achieved by loading a String, an InputStream, a File or a URL.
Let’s load a Document from the Spring Blog URL:
Notice the get method: it represents an HTTP GET call. You could also do an HTTP POST with the post method (or you could use a method which receives the HTTP method type as a parameter).
If you need to detect abnormal status codes (e.g. 404), you should catch the HttpStatusException exception:
Sometimes, the connection needs to be a bit more customized. Jsoup.connect(…) returns a Connection which allows you to set, among other things, the user agent, referrer, connection timeout, cookies, post data, and headers:
Since the connection follows a fluent interface, you can chain these methods before calling the desired HTTP method:
You can learn more about the Connection settings by browsing the corresponding Javadoc.
java.util.Scanner
The java.util.Scanner class breaks formatted input into tokens and converts the tokens to the appropriate data type.
By default, a scanner uses whitespace (spaces, tabs, and line terminators) to separate tokens. Consider the following code:
Java
import java.io.*;
import java.util.Scanner;

public class ScanXan {
    public static void main(String[] args) throws IOException {
        Scanner s = null;
        try {
            s = new Scanner(new BufferedReader(new FileReader("xanadu.txt")));
            while (s.hasNext()) {
                System.out.println(s.next());
            }
        } finally {
            if (s != null) {
                s.close();
            }
        }
    }
}
If the file "xanadu.txt" contains the following text:
In Xanadu did Kubla Khan
A stately pleasure-dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.
then the program prints:
In
Xanadu
did
Kubla
Khan
A
stately
pleasure-dome
…
To use a different token separator, call useDelimiter and pass it a regular expression. For example, suppose we want the separator to be a comma, optionally followed by whitespace:
Java
s.useDelimiter(",\\s*");
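A self-contained sketch of this delimiter, scanning a String instead of a file. The class name DelimiterDemo, the split helper, and the sample input are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {

    // Splits the input on commas, each optionally followed by whitespace.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(",\\s*");
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("alpha, beta,gamma,   delta")); // [alpha, beta, gamma, delta]
    }
}
```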
java.util.Scanner supports all Java primitive types (except char), as well as java.math.BigInteger and java.math.BigDecimal. Scanner uses an instance of java.util.Locale to convert strings to these data types. Example:
ScanSum.java
Java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Locale;
import java.util.Scanner;

public class ScanSum {
    public static void main(String[] args) throws IOException {
        double sum = 0;
        // try-with-resources closes the scanner even if reading fails and
        // avoids calling close() on null when the file cannot be opened
        try (Scanner s = new Scanner(new BufferedReader(new FileReader("usnumbers.txt")))) {
            s.useLocale(Locale.US);
            while (s.hasNext()) {
                if (s.hasNextDouble()) {
                    sum += s.nextDouble();
                } else {
                    s.next(); // skip tokens that are not numbers
                }
            }
        }
        System.out.println(sum);
    }
}