Forum PHP.pl _ Object-oriented programming _ Merged data

Posted by: SN@JPER^ 24.11.2017, 17:44:49

I wrote the following class:

[PHP]
 
<?php
 
class Scrapper{
 
    public $url;
    private $data;
    private $dataAfter;
    private $doc;
    private $xpath;
    private $ch;
 
    function __construct($url){
 
        if (preg_match('/^http/', $url)) {
 
            libxml_use_internal_errors(true);
 
            $this->url = $url;
            $this->data = $this->curl($this->url);
 
 
            $this->doc = new \DOMDocument();
            $this->doc->loadHTML($this->data);
 
            $this->xpath = new DOMXPath($this->doc);
 
        }
    }
 
    public function queryTag($query){
 
        if(!empty($query)){
 
            $this->data = $this->xpath->query($query);
 
            return $this;
        }
    }
 
    public function getData($noHTML = false, $removeAttribute = false){
 
        foreach ($this->data as $dataNodes){
 
            if($removeAttribute === true) {
                $dataNodes->removeAttribute('style');
                $dataNodes->removeAttribute('class');
                $dataNodes->removeAttribute('id');
            }
 
            if($noHTML === true){
                $this->dataAfter .= $dataNodes->nodeValue;
            }else{
                $this->dataAfter .= $dataNodes->ownerDocument->saveHTML($dataNodes);
            }
 
        }
 
        return $this->dataAfter;
    }
 
    private function curl($url){
        if(!empty($url)) {
 
            $options = array(
                CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
                CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
                CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
                CURLOPT_CONNECTTIMEOUT => 120,   // Setting the amount of time (in seconds) before the request times out
                CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time for cURL to execute queries
                CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
                CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
                CURLOPT_URL => $this->url, // Setting cURL's URL option with the $url variable passed into the function
            );
 
            $this->ch = curl_init();
            curl_setopt_array($this->ch, $options);
            $this->data = curl_exec($this->ch);
 
            return $this->data;
        }
    }
 
    function __destruct(){
 
        curl_close($this->ch);
 
    }
 
}
 
 
$class = new \Scrapper('http://www.....');
 
$pic = $class->queryTag('//div[@id="left"]//img[@class="pic"]/@src')->getData();
$title = $class->queryTag('//div[@id="left"]//h2')->getData(true);
$text = $class->queryTag('//div[@id="left"]/p | //center')->getData(false, true);
 
echo $title;
echo '<hr>';
echo $pic;
echo '<hr>';
echo $text;
echo '<hr>';
 
[PHP]

After instantiating the class, I assign each value I am looking for to its own variable: the image, the title, and the text.

Unfortunately, the title also contains the image URL string, and the text additionally contains the image and the title. Where am I going wrong? How do I separate them?

I would also appreciate suggestions on what I could improve in the class itself.

Posted by: trueblue 24.11.2017, 18:59:11

Show a piece of the structure you are parsing.

Posted by: Pyton_000 24.11.2017, 19:05:40

As far as I'm concerned, this class as a whole should be scrapped.

Posted by: SN@JPER^ 24.11.2017, 19:51:31

Quote(trueblue @ 24.11.2017, 18:59:11 )

Show a piece of the structure you are parsing.

A simple example: https://www.tehplayground.com/SCtDYOUp67t0EPHt

Quote(Pyton_000 @ 24.11.2017, 19:05:40 )

As far as I'm concerned, this class as a whole should be scrapped.

What do you suggest?

Posted by: trueblue 24.11.2017, 20:08:08

You keep appending data to dataAfter, and it is never cleared between calls.
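
To illustrate, here is a minimal sketch of the same pattern with the cURL and DOM parts stripped out. The class name AccumulatingBuffer and the sample strings are made up for this example; only the $dataAfter property and the .= append mirror the class from the first post.

```php
<?php

// Hypothetical stand-in for the Scrapper class: no network, no DOM,
// just the property that getData() keeps appending to.
class AccumulatingBuffer
{
    private $dataAfter = '';

    public function getData(array $parts)
    {
        foreach ($parts as $part) {
            // Appended to the object property, which survives between calls.
            $this->dataAfter .= $part;
        }

        return $this->dataAfter;
    }
}

$buffer = new AccumulatingBuffer();

$first  = $buffer->getData(['pic.jpg']); // made-up "image URL"
$second = $buffer->getData(['Title']);   // made-up "title"

echo $first, "\n";  // pic.jpg
echo $second, "\n"; // pic.jpgTitle (the first result is still in the buffer)
```

This is the same symptom as in the question: the second result starts with the first one, and a third call would carry both.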

Posted by: SN@JPER^ 24.11.2017, 20:15:40

It works after I changed it to:

[PHP]
 
public function getData($noHTML = false, $removeAttribute = false){
 
        $data_after1 = '';
        foreach ($this->data as $dataNodes){
 
            if($removeAttribute === true) {
                $dataNodes->removeAttribute('style');
                $dataNodes->removeAttribute('class');
                $dataNodes->removeAttribute('id');
            }
 
            if($noHTML === true){
                $data_after1 .= $dataNodes->nodeValue;
            }else{
                $data_after1 .= $dataNodes->ownerDocument->saveHTML($dataNodes);
            }
 
        }
 
        return $data_after1;
    }
 
[PHP]
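
For anyone who wants to check the behaviour without fetching a page, below is a rough sketch of the same loop as a standalone function run against an inline HTML string. The function name collectNodes and the markup are assumptions made for this example (the markup is only loosely modelled on the XPath queries from the first post); the point is just that a local buffer starts empty on every call.

```php
<?php

// Hypothetical helper: the same loop as the fixed getData(),
// but with the buffer kept in a local variable.
function collectNodes(DOMXPath $xpath, $query, $noHTML = false)
{
    $result = ''; // local, so every call starts from an empty string

    foreach ($xpath->query($query) as $node) {
        if ($noHTML === true) {
            $result .= $node->nodeValue;
        } else {
            $result .= $node->ownerDocument->saveHTML($node);
        }
    }

    return $result;
}

// Made-up markup for the example; not the page from the original post.
$html = '<div id="left"><img class="pic" src="pic.jpg"><h2>Title</h2><p>Text</p></div>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Text-only mode is used for all three calls to keep the output simple.
$pic   = collectNodes($xpath, '//div[@id="left"]//img[@class="pic"]/@src', true);
$title = collectNodes($xpath, '//div[@id="left"]//h2', true);
$text  = collectNodes($xpath, '//div[@id="left"]/p', true);

echo $pic, "\n";   // pic.jpg
echo $title, "\n"; // Title
echo $text, "\n";  // Text
```

With the buffer local to the call, the $dataAfter property from the original class is no longer needed.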
