I wrote myself this class:
<?php
class Scrapper{
    public $url;
    private $data;
    private $dataAfter;
    private $doc;
    private $xpath;
    private $ch;

    function __construct($url){
        if (preg_match('/^http/', $url)) {
            libxml_use_internal_errors(true);
            $this->url = $url;
            $this->data = $this->curl($this->url);
            $this->doc = new \DOMDocument();
            $this->doc->loadHTML($this->data);
            $this->xpath = new DOMXPath($this->doc);
        }
    }

    public function queryTag($query){
        if(!empty($query)){
            $this->data = $this->xpath->query($query);
            return $this;
        }
    }

    public function getData($noHTML = false, $removeAttribute = false){
        foreach ($this->data as $dataNodes){
            if($removeAttribute === true) {
                $dataNodes->removeAttribute('style');
                $dataNodes->removeAttribute('class');
                $dataNodes->removeAttribute('id');
            }
            if($noHTML === true){
                $this->dataAfter .= $dataNodes->nodeValue;
            }else{
                $this->dataAfter .= $dataNodes->ownerDocument->saveHTML($dataNodes);
            }
        }
        return $this->dataAfter;
    }

    private function curl($url){
        if(!empty($url)) {
            $options = array(
                CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
                CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
                CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
                CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
                CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
                CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
                CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
                CURLOPT_URL => $this->url, // Setting cURL's URL option with the $url variable passed into the function
            );
            $this->ch = curl_init();
            curl_setopt_array($this->ch, $options);
            $this->data = curl_exec($this->ch);
            return $this->data;
        }
    }

    function __destruct(){
        curl_close($this->ch);
    }
}

$class = new \Scrapper('http://www.....');
$pic = $class->queryTag('//div[@id="left"]//img[@class="pic"]/@src')->getData();
$title = $class->queryTag('//div[@id="left"]//h2')->getData(true);
$text = $class->queryTag('//div[@id="left"]/p | //center')->getData(false, true);

echo $title;
echo '<hr>';
echo $pic;
echo '<hr>';
echo $text;
echo '<hr>';
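Show a piece of the structure you are parsing.
As far as I'm concerned, this class as a whole should be thrown out and rewritten.
You keep appending data to $dataAfter, so every getData() call also returns the output of the earlier calls.
It works after I changed it to: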
public function getData($noHTML = false, $removeAttribute = false){
    $data_after1 = '';
    foreach ($this->data as $dataNodes){
        if($removeAttribute === true) {
            $dataNodes->removeAttribute('style');
            $dataNodes->removeAttribute('class');
            $dataNodes->removeAttribute('id');
        }
        if($noHTML === true){
            $data_after1 .= $dataNodes->nodeValue;
        }else{
            $data_after1 .= $dataNodes->ownerDocument->saveHTML($dataNodes);
        }
    }
    return $data_after1;
}
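For anyone who wants to check this without hitting a real site, here is a minimal sketch of the same idea: the result buffer is local to each call, so consecutive queries do not leak into each other. The $html snippet and the collectText() helper are made up for illustration (the real page markup was never posted), and the sketch only covers the plain-text branch of getData().

```php
<?php
// Hypothetical stand-in for the scraped page; the real markup is unknown.
$html = '<div id="left"><h2>Title</h2><p class="x">Body</p></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Hypothetical helper: gathers the text of every matched node
// into a buffer that is created fresh on each call.
function collectText(DOMXPath $xpath, string $query): string
{
    $out = '';
    foreach ($xpath->query($query) as $node) {
        $out .= $node->nodeValue;
    }
    return $out;
}

$title = collectText($xpath, '//div[@id="left"]//h2');
$text  = collectText($xpath, '//div[@id="left"]/p');

echo $title, "\n"; // Title
echo $text, "\n";  // Body
```

With a shared property as the buffer, the second result would have been "TitleBody" instead of "Body", which is the symptom from the first post.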