How to choose a website scraping solution: a classification and broad overview of programs, services, and frameworks

Overview

Jsoup is an open source Java library used mainly for extracting data from HTML. It also allows you to manipulate and output HTML. It has a steady development line, great documentation, and a fluent and flexible API. Jsoup can also be used to parse and build XML.

In this tutorial, we’ll use the Spring Blog to illustrate a scraping exercise that demonstrates several features of jsoup:

  • Loading: fetching and parsing the HTML into a Document
  • Filtering: selecting the desired data into Elements and traversing it
  • Extracting: obtaining attributes, text, and HTML of nodes
  • Modifying: adding/editing/removing nodes and editing their attributes

java.io.PrintStream

The java.io.PrintStream class writes formatted data to any output stream. You will rarely need to create instances of it yourself; far more often you will use ready-made ones such as System.out. PrintStream has print and println methods overloaded for primitive types, strings, and Object (in which case the object's toString() method is used).

PrintStream never throws IOException. Instead it sets an internal flag that can be checked with the public boolean checkError() method.

The following methods deserve special attention:

public PrintStream format(String format, Object... args)

public PrintStream format(Locale l, String format, Object... args)

public PrintStream printf(String format, Object... args)

public PrintStream printf(Locale l, String format, Object... args)

These methods write formatted data to the stream. Here format is the format string (the template), which is described in detail in the section «».
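
As a minimal sketch of the methods above, the following writes two formatted lines into an in-memory buffer and then checks the error flag. The class name PrintStreamFormatDemo and the sample values are illustrative assumptions, not part of the original tutorial.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Locale;

class PrintStreamFormatDemo {

    // Writes two formatted lines into an in-memory buffer and returns them.
    static String render() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true);

        // printf and format behave identically; both return the same PrintStream.
        out.printf(Locale.US, "%s has %d items%n", "cart", 3);
        out.format(Locale.US, "total: %.2f%n", 12.5);

        // PrintStream never throws IOException, so the flag has to be checked explicitly.
        if (out.checkError()) {
            throw new IllegalStateException("writing to the stream failed");
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(render());
    }
}
```

Passing an explicit Locale keeps the decimal separator stable across machines; without it the default locale of the JVM is used.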

Filtering

Now that we have the HTML converted into a Document, it’s time to navigate it and find what we are looking for. This is where the resemblance with jQuery/JavaScript is more evident, as its selectors and traversing methods are similar.

5.1. Selecting

The Document select method receives a String representing the selector, using the same selector syntax as CSS or JavaScript, and retrieves the matching list of Elements. This list can be empty but not null.

Let’s take a look at some selections using the select method:
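
The original selections are not reproduced here, so the following is only a sketch of typical select calls. It parses a small inline HTML string (a hypothetical stand-in for the downloaded page) so that it runs without network access; the class name SelectDemo and the markup are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectDemo {
    // Hypothetical markup standing in for a downloaded page.
    static final String HTML =
        "<nav><a href='/blog'>Blog</a><a href='/files/guide.pdf'>Guide</a></nav>"
        + "<article class='post'><h2>First</h2><p>Hello</p></article>"
        + "<article class='post featured'><h2>Second</h2></article>";

    // Number of elements matching a CSS selector; select never returns null.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        Elements matches = doc.select(cssQuery);
        return matches.size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));                 // every link
        System.out.println(count("article.featured"));  // tag combined with a class
        System.out.println(count("a[href$=.pdf]"));     // attribute value ends with ".pdf"
        System.out.println(count("article > h2"));      // h2 that is a direct child of article
        System.out.println(count("table"));             // no match gives an empty list
    }
}
```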

You can also use more explicit methods inspired by the browser DOM instead of the generic select:
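
A short sketch of the DOM-style lookups, run against a trimmed-down login form. The markup and the class name DomMethodsDemo are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomMethodsDemo {
    // A trimmed-down, hypothetical login form.
    static final String HTML =
        "<div id='login' class='simple'><form>"
        + "<input id='username' type='text'>"
        + "<input id='password' type='password'>"
        + "</form></div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();
        Element login = doc.getElementById("login");         // one Element, or null if absent
        Elements inputs = doc.getElementsByTag("input");     // all <input> elements
        Elements simple = doc.getElementsByClass("simple");  // elements carrying class "simple"
        Elements passwords = doc.getElementsByAttributeValue("type", "password");

        System.out.println(login.className());
        System.out.println(inputs.size() + " " + simple.size() + " " + passwords.size());
    }
}
```

Unlike select, getElementById returns a single Element and yields null when nothing matches, so it is worth checking the result before using it.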

Since Element is a superclass of Document, you can learn more about working with the selection methods in the Document and Element Javadocs.

5.2. Traversing

Traversing means navigating across the DOM tree. Jsoup provides methods that operate on the Document, on a set of Elements, or on a specific Element, allowing you to navigate to a node’s parents, siblings, or children.

Also, you can jump to the first, the last, and the nth (using a 0-based index) Element in a set of Elements:
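
The navigation methods can be sketched on a three-item list. The markup and the class name TraverseDemo are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class TraverseDemo {
    // Selects the three list items of a hypothetical menu.
    static Elements items() {
        Document doc = Jsoup.parse("<ul id='menu'><li>one</li><li>two</li><li>three</li></ul>");
        return doc.select("#menu li");
    }

    public static void main(String[] args) {
        Elements items = items();

        // Jumping inside a set of Elements.
        System.out.println(items.first().text());
        System.out.println(items.last().text());
        Element second = items.get(1); // 0-based index
        System.out.println(second.text());

        // Moving from one Element to its relatives.
        System.out.println(second.parent().id());
        System.out.println(second.previousElementSibling().text());
        System.out.println(second.nextElementSibling().text());
        System.out.println(second.parent().children().size());
    }
}
```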

You can also iterate through selections. In fact, anything of type Elements can be iterated:

You can make a selection restricted to a previous selection (sub-selection):
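
Iteration and sub-selection combine naturally, as in this sketch: the outer loop walks every article and the inner select only looks inside the current one. The markup and the class name SubSelectDemo are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SubSelectDemo {
    static final String HTML =
        "<article id='a1'><p>x</p><p>y</p></article>"
        + "<article id='a2'><p>z</p></article>"
        + "<p>outside any article</p>";

    // Returns one "id:paragraphCount" entry per article.
    static List<String> summarize() {
        Document doc = Jsoup.parse(HTML);
        Elements articles = doc.select("article");

        List<String> summary = new ArrayList<>();
        for (Element article : articles) {            // Elements is iterable
            Elements paragraphs = article.select("p"); // searches only inside this article
            summary.add(article.id() + ":" + paragraphs.size());
        }
        return summary;
    }

    public static void main(String[] args) {
        System.out.println(summarize());
    }
}
```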

HTML Parsing in Java using JSoup

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>Login Page</title>
    </head>
    <body>
        <div id="login" class="simple" >
            <form action="login.do">
                Username : <input id="username" type="text" /><br>
                Password : <input id="password" type="password" /><br>
                <input id="submit" type="submit" />
                <input id="reset" type="reset" />
            </form>
        </div>
    </body>
</html>

HTML parsing is very simple with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. Jsoup provides several overloaded parse() methods to read HTML from a String, a File, a URL, or an InputStream, optionally with a base URI. You can also specify the character encoding so that HTML files that are not in UTF-8 are read correctly.
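
Two of these overloads are sketched below: parsing a String together with a base URI, and parsing a File with an explicit charset. The temporary file, the sample markup, and the class name ParseOverloadsDemo are assumptions for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseOverloadsDemo {

    // parse(String html, String baseUri): the base URI lets jsoup resolve relative links.
    static String absoluteLink() {
        Document doc = Jsoup.parse("<a href='/about'>About</a>", "https://example.com/");
        return doc.select("a").first().absUrl("href");
    }

    // parse(File in, String charsetName): decodes the file with the given charset.
    static String titleFromLatin1File() {
        try {
            File file = File.createTempFile("jsoup-demo", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(),
                "<html><head><title>Caf\u00e9</title></head><body></body></html>"
                    .getBytes(StandardCharsets.ISO_8859_1));
            return Jsoup.parse(file, "ISO-8859-1").title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(absoluteLink());
        System.out.println(titleFromLatin1File());
    }
}
```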

Math expressions parser — supported frameworks:

    • JAVA: 1.5, 1.6, 1.7, 1.8 (separate binaries available if necessary)
    • Android: tested with mxparser_jdk1.7.jar
    • .NET/MONO: 2.0, 3.0, 3.5, 4.0, 4.5, 4.6 (separate binaries available if necessary)
    • .NET Core: 1.0, 1.1
    • .NET Standard: 1.0, 1.6
    • .NET PCL: portable45, win8, wpa81
    • Xamarin.Android: 1.0, 6.0
    • Xamarin.iOS: 1.0

Math Expressions Parser Features:

      • rich built-in library of operators, constants, math functions;
      • user defined: arguments, functions, recursive functions and general recursion (direct / indirect);
      • grammar and internal syntax checking;

Math Expressions Parser — features list with examples

Functionality | Example | Support level
Simple calculator | e.g. 2+3, n! | Full support
Binary relations | e.g. a*b | Full support
Boolean operators | e.g. a&b | Full support
Built-in constants | e.g. 2+pi | Full support
User defined constants | e.g. 3tau, where tau = 2pi | Full support
Built-in unary functions | e.g. sin(2) | Extensive collection
Built-in binary functions | e.g. log(a,b) | Main functions
Built-in n-arguments functions | e.g. gcd(a,b,c,d,…) | Special functions
Evaluating conditions | e.g. if(a=true, then b, else c) | Full support
Cases functions | e.g. iff(case1, then a1, case2, then a2, …) | Full support
User defined arguments | e.g. x = 5, cos(x) | Full support
User defined dependent arguments | e.g. x=2, y=x^2 | Full support
Iterated operators: SIGMA summation | e.g. sum( 1, n, f(…,i) {step} ) | Full support
Iterated operators: PI product | e.g. prod( 1, n, f(…,i) {step} ) | Full support
Derivatives | e.g. der( sin(x), x ) | Full support
Integrals | e.g. 2*int( sqrt(1-x^2), x, -1, 1) | Full support
User defined functions | e.g. f(x,y) = sin(x+y) | Full support
Fast (limited) recursion | e.g. fib(n) = fib(n-1) + fib(n-2), addBaseCase(0, 0), addBaseCase(1, 1) | Full support
Recursion, any kind | e.g. Cnk(n,k) = if( k>0, if( k<n, Cnk(n-1,k-1)+Cnk(n-1,k), 1), 1) | Full support
Syntax checking | checkSyntax() | Full support
Getting computing time | getComputingTime() | Full support
Verbose mode | setVerboseMode() | Full support

mXparser — deliverables

Language / Framework | Documentation | Library | Source code
JAVA | Yes | Yes | Yes
Android | Yes | Yes | Yes
C# .NET | Yes | Yes | Yes
Visual Basic .NET (CLS) | Yes | Yes |
C++/CLI .NET (CLS) | Yes | Yes |
F# .NET (CLS) | Yes | Yes |
Other .NET languages | Yes | Yes, not tested |
MONO | Yes | Yes, dll tested | C# code

Math Expression Parser — Main functionalities:

      • basic operators, e.g. +, -, *, ^, !
      • Boolean logic operators, e.g. or, and, xor
      • binary relations, e.g. =, <, >
      • math functions (a large library of unary, binary, 3-argument, and n-argument functions), e.g. sin, cos, Stirling numbers, log, inverse functions
      • constants (a large library), e.g. pi, e, golden ratio
      • n-arguments functions, i.e.:
      • iterated summation and product operators
      • differentiation and integration

Math Expression Parser — High flexibility functionalities

      • user defined constants and arguments (both free and dependent), which can also be used in functions
      • user defined functions (both free and dependent)
      • user defined recursive arguments + simple (controlled) recursion (1 recursive argument)
      • user defined recursive functions / expressions (any) — complex, many arguments, no limitation
      • internal syntax checking
      • internal help
      • random numbers, random variables and probability distributions
      • expression string tokenizer
      • other useful functionalities, i.e.: , expression description, verbose mode.

Math Expression Parser — Project documentation

Mariusz Gromada

What is JSoup Library

Jsoup is an open source Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. Jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers such as Chrome and Firefox. Some useful features of the jsoup library:

  • Jsoup can scrape and parse HTML from a URL, file, or string
  • Jsoup can find and extract data, using DOM traversal or CSS selectors
  • Jsoup lets you manipulate HTML elements, attributes, and text
  • Jsoup can clean user-submitted content against a safe whitelist, to prevent XSS attacks
  • Jsoup can also output tidy HTML

Jsoup is designed to deal with all varieties of HTML found in the real world, from properly validated markup to incomplete, invalid tag soup. One of its core strengths is robustness.

Modifying

Modifying encompasses setting attributes, text, and HTML of elements, as well as appending and removing elements. It is done to the DOM tree previously generated by jsoup – the Document.

7.1. Setting Attributes and Inner Text/HTML

As in jQuery, the methods to set attributes, text, and HTML bear the same names but also receive the value to be set:

  • attr() – sets an attribute’s values (it creates the attribute if it does not exist)
  • text() – sets element inner text, replacing content
  • html() – sets element inner HTML, replacing content

Let’s look at a quick example of these methods:
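
A minimal sketch of the three setters, assuming a tiny hypothetical fragment; the class name SetDemo and the attribute values are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SetDemo {
    static Document modified() {
        Document doc = Jsoup.parse(
            "<a id='home' href='/'>Home</a><div id='box'><b>old</b></div>");

        Element link = doc.getElementById("home");
        // attr returns the Element itself, so calls can be chained.
        link.attr("href", "https://example.com/").attr("target", "_blank");
        // text replaces the element's content with plain text.
        link.text("Start");

        // html replaces the element's content with parsed markup.
        Element box = doc.getElementById("box");
        box.html("<p>new <i>content</i></p>");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(modified().body().html());
    }
}
```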

7.2. Creating and Appending Elements

To add a new element, you need to build it first by instantiating Element. Once the Element has been built, you can append it to another Element using the appendChild method. The newly created and appended Element will be inserted at the end of the element where appendChild is called:
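
One possible way to do this is sketched below, using the Element(Tag, String) constructor. The link text, the URL, and the class name AppendDemo are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

class AppendDemo {
    static Element articleWithLink() {
        Document doc = Jsoup.parse("<article><p>Body text</p></article>");
        Element article = doc.select("article").first();

        // Build the element first; the empty string is its base URI.
        Element link = new Element(Tag.valueOf("a"), "")
            .text("Read more")
            .attr("href", "https://example.com/more");

        // appendChild places the new element after the article's existing children.
        article.appendChild(link);
        return article;
    }

    public static void main(String[] args) {
        System.out.println(articleWithLink().outerHtml());
    }
}
```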

7.3. Removing Elements

To remove elements, you need to select them first and run the remove method.

For example, let’s remove all <li> tags that contain the “navbar-link” class from Document, and all images from the first article:
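
A sketch of that removal on a small hypothetical page (the markup and the class name RemoveDemo are assumptions). The main method also prints the resulting body with html() so the change can be inspected.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class RemoveDemo {
    static Document cleaned() {
        Document doc = Jsoup.parse(
            "<ul><li class='navbar-link'>Home</li><li class='navbar-link'>Blog</li><li>Other</li></ul>"
            + "<article><img src='a.png'><p>First</p></article>"
            + "<article><img src='b.png'><p>Second</p></article>");

        // Remove every <li> carrying the navbar-link class.
        doc.select("li.navbar-link").remove();
        // Remove images, but only inside the first article.
        doc.select("article").first().select("img").remove();
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(cleaned().body().html());
    }
}
```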

7.4. Converting the Modified Document to HTML

Finally, since we were changing the Document, we might want to check our work.

To do this, we can explore the Document DOM tree by selecting, traversing, and extracting using the presented methods, or we can simply extract its HTML as a String using the html() method:

The String output is a tidy HTML.

java.text.SimpleDateFormat

The java.text.SimpleDateFormat class extends java.text.DateFormat and lets you specify a custom formatting pattern.

Constructors:

public SimpleDateFormat(String pattern)

public SimpleDateFormat(String pattern, Locale locale)

public SimpleDateFormat(String pattern, DateFormatSymbols formatSymbols)

The constructor that takes DateFormatSymbols creates a formatter that uses the supplied date format symbols instead of the locale defaults.

The pattern may contain the following special letters:

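Letter | Date or time component | Presentation | Examples
G | Era | Text | AD
y | Year | Year | 1996; 96
Y | Week year | Year | 2009; 09
M | Month in year (context sensitive) | Month | July; Jul; 07
L | Month in year (standalone form) | Month | July; Jul; 07
w | Week in year | Number | 27
W | Week in month | Number | 2
D | Day in year | Number | 189
d | Day in month | Number | 10
F | Day of week in month | Number | 2
E | Day name in week | Text | Tuesday; Tue
u | Day number of week (1 = Monday, ..., 7 = Sunday) | Number | 1
a | Am/pm marker | Text | PM
H | Hour in day (0-23) | Number | 0
k | Hour in day (1-24) | Number | 24
K | Hour in am/pm (0-11) | Number | 0
h | Hour in am/pm (1-12) | Number | 12
m | Minute in hour | Number | 30
s | Second in minute | Number | 55
S | Millisecond | Number | 978
z | Time zone | General time zone | Pacific Standard Time; PST; GMT-08:00
Z | Time zone | RFC 822 time zone | -0800
X | Time zone | ISO 8601 time zone | -08; -0800; -08:00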

Description of the "Presentation" column:

  • Text: If the pattern has 4 or more letters, the full form is used; otherwise a short or abbreviated form is used. When parsing, both forms are accepted regardless of the number of pattern letters.
  • Number: The number of pattern letters is the minimum number of digits; shorter numbers are zero-padded. When parsing, the number of letters is ignored unless it is needed to separate two adjacent fields.
  • Year: If the formatter's Calendar is the Gregorian calendar, the following rules apply. When formatting, if the number of pattern letters is 2, the year is truncated to 2 digits; otherwise it is interpreted as a Number. When parsing, if the number of pattern letters is more than 2, the year is interpreted literally, regardless of the number of digits. So with the pattern "MM/dd/yyyy", the string "01/11/12" parses to January 11, 12 A.D. When parsing with the abbreviated year pattern ("y" or "yy"), SimpleDateFormat must interpret the abbreviated year relative to some century. It does this by adjusting dates to fall within 80 years before and 20 years after the time the SimpleDateFormat instance was created. During parsing, only strings consisting of exactly two digits are interpreted into the default century. Any other numeric string, such as a one-digit string or a string of three or more digits, is interpreted as a full year.
  • Month: If the number of pattern letters is 3 or more, the month is interpreted as text; otherwise it is interpreted as a number. The letter M produces context-sensitive month names. If a DateFormatSymbols was passed to the constructor or set with setDateFormatSymbols, the month names are taken from that DateFormatSymbols. The letter L produces the standalone form of month names.
  • General time zone: Time zones are interpreted as text if they have names. When given as an offset, the time zone is written as GMT+01:30 or GMT-12:33.
  • RFC 822 time zone: Four digits are used: -0800 or +1200.
  • ISO 8601 time zone: Two digits, four digits, or hours and minutes separated by a colon: -08; -0800; -08:00.
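
The rules above can be tried out with a short sketch. It pins the time zone to UTC and the locale to Locale.US so that the results do not depend on the machine; the class name DatePatternDemo and the helper methods are illustrative assumptions.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

class DatePatternDemo {
    static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Formats an instant (milliseconds since the epoch) with the given pattern.
    static String format(String pattern, long epochMillis) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.US);
        formatter.setTimeZone(UTC);
        return formatter.format(new Date(epochMillis));
    }

    // Parses the text with the given pattern and returns the resulting calendar year.
    static int parsedYear(String pattern, String text) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern, Locale.US);
        formatter.setTimeZone(UTC);
        try {
            Calendar calendar = new GregorianCalendar(UTC, Locale.US);
            calendar.setTime(formatter.parse(text));
            return calendar.get(Calendar.YEAR);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Number fields are zero-padded to the number of pattern letters.
        System.out.println(format("yyyy-MM-dd HH:mm:ss.SSS", 0L)); // 1970-01-01 00:00:00.000
        // Three letters give the abbreviated text form, four or more the full form.
        System.out.println(format("EEE, d MMM yy", 0L));           // Thu, 1 Jan 70
        System.out.println(format("EEEE, MMMM d", 0L));            // Thursday, January 1
        // With more than two year letters, "12" is taken literally as the year 12.
        System.out.println(parsedYear("MM/dd/yyyy", "01/11/12"));  // 12
    }
}
```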

Package installation

Maven

mXparser is an easy-to-use, feature-rich, fast, and highly flexible math expression parser library: a parser and evaluator of mathematical expressions and formulas provided as plain text (strings). It offers an API for Java, Android, and C# .NET/Mono (Common Language Specification compliant, so it also works from F#, Visual Basic, and C++/CLI). The parser comes with extensive documentation, a tutorial, "Hello World!" projects for 5 different languages explained with screenshots, and a performance test summary. The formula evaluator is distributed under the Simplified BSD license, which means the software is completely free.

Jsoup at a Glance

Jsoup loads the page HTML and builds the corresponding DOM tree. This tree works the same way as the DOM in a browser, offering methods similar to jQuery and vanilla JavaScript to select, traverse, manipulate text/HTML/attributes and add/remove elements.

If you’re comfortable with client-side selectors and DOM traversing/manipulation, you’ll find jsoup very familiar. Check how easy it is to print the paragraphs of a page:
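
A minimal sketch of that idea follows. To stay runnable offline it parses an HTML string; for a live page the Document would come from Jsoup.connect(url).get() instead. The class name ParagraphPrinter and the sample markup are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParagraphPrinter {
    // Collects the text of every <p> element found in the given HTML.
    static List<String> paragraphs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("p")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p>one</p><div><p>two</p></div>"
            // The script is never executed, so its paragraph does not appear in the DOM.
            + "<script>document.write('<p>three</p>')</script>";
        for (String text : paragraphs(html)) {
            System.out.println(text);
        }
    }
}
```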

Bear in mind that jsoup interprets HTML only — it does not interpret JavaScript. Therefore changes to the DOM that would normally take place after page loads in a JavaScript-enabled browser will not be seen in jsoup.

Extracting

We now know how to reach specific elements, so it’s time to get their content — namely their attributes, HTML, or child text.

Take a look at this example that selects the first article from the blog and gets its date, its first section text, and finally, its inner and outer HTML:
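
The blog markup itself is not shown here, so this sketch uses a hypothetical article with a time element and one section; the class name ExtractDemo and the datetime attribute are assumptions about the page structure.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ExtractDemo {
    // Returns the first article of a small, hypothetical blog page.
    static Element firstArticle() {
        Document doc = Jsoup.parse(
            "<article><time datetime='2024-01-15'>Jan 15</time>"
            + "<section><p>Intro <b>text</b></p></section></article>");
        return doc.select("article").first();
    }

    public static void main(String[] args) {
        Element article = firstArticle();

        String date = article.select("time").attr("datetime");         // attribute value
        String sectionText = article.select("section").first().text(); // text of all descendants
        String innerHtml = article.html();       // markup inside the article
        String outerHtml = article.outerHtml();  // markup including the article tag itself

        System.out.println(date);
        System.out.println(sectionText);
        System.out.println(innerHtml);
        System.out.println(outerHtml);
    }
}
```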

Here are some tips to bear in mind when choosing and using selectors:

  • Rely on “View Source” feature of your browser and not only on the page DOM as it might have changed (selecting at the browser console might yield different results than jsoup)
  • Know your selectors as there are a lot of them and it’s always good to have at least seen them before; mastering selectors takes time
  • Use a playground for selectors to experiment with them (paste a sample HTML there)
  • Be less dependent on page changes: aim for the smallest and least compromising selectors (e.g. prefer ID-based ones)

Java Program to parse HTML Document

Here is a complete Java program that parses an HTML String, an HTML page downloaded from the internet, and an HTML file from the local file system. You can run it from Eclipse, any other IDE, or the command prompt. In Eclipse, create a new Java project, copy the code, right-click the src folder, and paste; Eclipse creates the package and a source file with the matching name for you. The program shows three examples of parsing and traversing HTML: the first parses a String with HTML content directly, the second parses a page downloaded from a URL, and the third loads and parses an HTML document from the local file system.

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * Java program to parse/read HTML documents from a String, a URL, and a File
 * using the Jsoup library. Jsoup is an open source library that lets Java
 * developers parse HTML, extract elements, and manipulate data using DOM, CSS,
 * and jQuery-like methods.
 *
 * @author Javin Paul
 */
public class HTMLParser {

    public static void main(String[] args) {

        // Example 1 - parse an HTML String using Jsoup
        String htmlString = "<!DOCTYPE html>"
                + "<html>"
                + "<head>"
                + "<title>JSoup Example</title>"
                + "</head>"
                + "<body>"
                + "<table><tr><td><h1>HelloWorld</h1></tr>"
                + "</table>"
                + "</body>"
                + "</html>";

        Document html = Jsoup.parse(htmlString);
        String title = html.title();
        String h1 = html.body().getElementsByTag("h1").text();

        System.out.println("Input HTML String to JSoup :" + htmlString);
        System.out.println("After parsing, Title : " + title);
        System.out.println("After parsing, Heading : " + h1);

        // Example 2 - read an HTML page from a URL
        try {
            Document doc = Jsoup.connect("http://google.com/").get();
            // Print only when the download succeeded.
            System.out.println("Jsoup can read HTML page from URL, title : " + doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Example 3 - parse an HTML file from the local file system.
        // Jsoup.parse("login.html", "ISO-8859-1") would be wrong here: that overload
        // treats the first argument as HTML text and the second as the base URI.
        try {
            Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
            Element div = htmlFile.getElementById("login");
            String cssClass = div.className(); // read the class attribute of the element

            System.out.println("Jsoup can also parse HTML file directly");
            System.out.println("title : " + htmlFile.title());
            System.out.println("class of div tag : " + cssClass);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
Output:
Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html>
After parsing, Title : JSoup Example
After parsing, Heading : HelloWorld
Jsoup can read HTML page from URL, title : Google
Jsoup can also parse HTML file directly
title : Login Page
class of div tag : simple

The Jsoup HTML parser makes every attempt to create a clean parse from the HTML you provide, whether or not it is well-formed. It handles the following cases:

  • unclosed tags (e.g. <p>Java <p>Scala is parsed as <p>Java</p> <p>Scala</p>)
  • implicit tags (e.g. a naked <td>Java is Great</td> is wrapped into a <table><tr><td>)
  • reliably creating the document structure (an html element containing a head and a body, with only appropriate elements within the head)

Jsoup is an excellent and robust open source library which makes reading html documents, body fragments, html strings and directly parsing html content from the web, extremely easy.

Reference: 3 Examples of Parsing HTML File in Java using Jsoup from our JCG partner Javin Paul at the Javarevisited blog.

Loading

The loading phase comprises the fetching and parsing of the HTML into a Document. Jsoup guarantees the parsing of any HTML, from the most invalid to the totally validated ones, as a modern browser would do. It can be achieved by loading a String, an InputStream, a File or a URL.

Let’s load a Document from the Spring Blog URL:

Notice the get method: it represents an HTTP GET call. You could also do an HTTP POST with the post method (or use a method that receives the HTTP method type as a parameter).

If you need to detect abnormal status codes (e.g. 404), you should catch the HttpStatusException exception:
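
A sketch of loading a page and handling error statuses is shown below. The URL is passed in by the caller, and the class name SafeFetch and its helper methods are illustrative assumptions rather than the tutorial's own code.

```java
import java.io.IOException;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SafeFetch {
    // Returns the page title, or a short diagnostic if the request fails.
    static String titleOrStatus(String url) {
        try {
            Document doc = Jsoup.connect(url).get(); // HTTP GET
            return doc.title();
        } catch (HttpStatusException e) {
            // Thrown for error responses such as 404 or 500.
            return describe(e);
        } catch (IOException e) {
            // Any other network or parsing failure.
            return "I/O failure: " + e.getMessage();
        }
    }

    static String describe(HttpStatusException e) {
        return "HTTP " + e.getStatusCode() + " for " + e.getUrl();
    }

    public static void main(String[] args) {
        if (args.length > 0) {
            System.out.println(titleOrStatus(args[0]));
        }
    }
}
```

HttpStatusException is a subclass of IOException, so it has to be caught first.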

Sometimes, the connection needs to be a bit more customized. Jsoup.connect(…) returns a Connection which allows you to set, among other things, the user agent, referrer, connection timeout, cookies, post data, and headers:

Since the connection follows a fluent interface, you can chain these methods before calling the desired HTTP method:
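
As a sketch of such chaining, the following builds a configured Connection without sending anything; all values (user agent string, URLs, cookie, parameter) and the class name ConnectionSetup are placeholders.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class ConnectionSetup {
    // Builds a configured connection; no request is sent here.
    static Connection configured() {
        return Jsoup.connect("https://example.com/")
            .userAgent("Mozilla/5.0 (compatible; demo)")
            .referrer("https://example.com/start")
            .timeout(5000)                      // milliseconds
            .cookie("session", "abc")
            .header("Accept-Language", "en")
            .data("q", "jsoup");                // request parameter
    }

    public static void main(String[] args) {
        Connection connection = configured();
        // The request goes out only when get(), post(), or execute() is called.
        System.out.println(connection.request().timeout());
    }
}
```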

You can learn more about the Connection settings by browsing the corresponding Javadoc.

java.util.Scanner

The java.util.Scanner class breaks formatted input into tokens and converts the tokens to the appropriate data type.

By default, a scanner uses whitespace (spaces, tabs, and line terminators) to separate tokens. Consider the following code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public class ScanXan {
    public static void main(String[] args) throws IOException {
        Scanner s = null;

        try {
            s = new Scanner(new BufferedReader(new FileReader("xanadu.txt")));

            // By default, tokens are separated by whitespace.
            while (s.hasNext()) {
                System.out.println(s.next());
            }
        } finally {
            // Closing the scanner also closes the underlying reader.
            if (s != null) {
                s.close();
            }
        }
    }
}

If the file xanadu.txt contains the following text:

In Xanadu did Kubla Khan
A stately pleasure-dome decree:
Where Alph, the sacred river, ran
Through caverns measureless to man
Down to a sunless sea.

then the program prints:

In
Xanadu
did
Kubla
Khan
A
stately
pleasure-dome
...

To use a different token separator, call useDelimiter and pass it a regular expression. For example, suppose we want the delimiter to be a comma optionally followed by whitespace:

s.useDelimiter(",\\s*");
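
A small sketch of this delimiter in action, scanning a String instead of a file so that it runs on its own; the class name DelimiterDemo and the sample input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Splits the input on a comma optionally followed by whitespace.
    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(",\\s*");
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("red, green,blue,  yellow")); // [red, green, blue, yellow]
    }
}
```

Scanner implements Closeable, so try-with-resources closes it without an explicit finally block.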

java.util.Scanner supports all Java primitive types (except char), as well as java.math.BigInteger and java.math.BigDecimal. Scanner uses an instance of java.util.Locale to convert strings to these types. Example:

ScanSum.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Locale;
import java.util.Scanner;

public class ScanSum {
    public static void main(String[] args) throws IOException {
        Scanner s = null;
        double sum = 0;

        try {
            s = new Scanner(new BufferedReader(new FileReader("usnumbers.txt")));
            // Parse numbers using US conventions (period as the decimal separator).
            s.useLocale(Locale.US);

            while (s.hasNext()) {
                if (s.hasNextDouble()) {
                    sum += s.nextDouble();
                } else {
                    // Skip tokens that are not numbers.
                    s.next();
                }
            }
        } finally {
            // Guard against a failed constructor, which would leave s null.
            if (s != null) {
                s.close();
            }
        }

        System.out.println(sum);
    }
}
