Hello!
I wrote a simple regex to validate the DOCTYPE
of an HTML document (W3C HTML 5.1 standard):
$regex=<<<'REGEX'
@(*UTF8)^
<!((?i)DOCTYPE)
(?<space_characters>[\x20\x09\x0A\x0C\x0D])+
((?i)HTML)
(
\g<space_characters>+
(
( # DOCTYPE legacy string
((?i)SYSTEM)
\g<space_characters>+
(?<quote_mark>["'])
about:legacy-compat
\k<quote_mark>
)|( # obsolete permitted DOCTYPE string
((?i)PUBLIC)
\g<space_characters>+
(?<first_quote_mark>["'])
(
(
-//W3C//DTD\ HTML\ 4\.0//EN
\k<first_quote_mark>
(
\g<space_characters>+
(?<third_quote_mark_1>["'])
http://www\.w3\.org/TR/REC-html40/strict\.dtd
\k<third_quote_mark_1>
)?
)|(
-//W3C//DTD\ HTML\ 4\.01//EN
\k<first_quote_mark>
(
\g<space_characters>+
(?<third_quote_mark_2>["'])
http://www\.w3\.org/TR/html4/strict\.dtd
\k<third_quote_mark_2>
)?
)|(
-//W3C//DTD\ XHTML\ 1\.0\ Strict//EN
\k<first_quote_mark>
\g<space_characters>+
(?<third_quote_mark_3>["'])
http://www\.w3\.org/TR/xhtml1/DTD/xhtml1-strict\.dtd
\k<third_quote_mark_3>
)|(
-//W3C//DTD\ XHTML\ 1\.1//EN
\k<first_quote_mark>
\g<space_characters>+
(?<third_quote_mark_4>["'])
http://www\.w3\.org/TR/xhtml11/DTD/xhtml11\.dtd
\k<third_quote_mark_4>
)
)
)
)
)?
\g<space_characters>*
>
$@suxDX
REGEX;
// The label printed before each result is the expected return value of preg_match().
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL SYSTEM \'about:legacy-compat\' >'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL SYSTEM "about:legacy-compat">'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL SYSTEM "about:legacy-compat\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE htmL PUBLIC "about:legacy-compat">'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//en">'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//PL">'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" \'http://www.w3.org/TR/REC-html40/strict.dtd\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" \'http://www4w3.org/TR/REC-html40/strict.dtd\'>'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" \'http://www.w3.org/TR/REC-html40/strict.dtd\'>'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'));
echo "1 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >'));
echo "0 - "; var_dump(preg_match($regex, '<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">'));
However, I noticed that the code in the
"obsolete permitted DOCTYPE string" branch is largely repeated, so here is my question:
how can this expression be simplified?
I thought about caching a single
third_quote_mark group and reusing it, but I cannot do that for an array type.
Do you have any ideas?
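
One direction that might work, as a rough sketch rather than a finished answer: instead of caching the quote group under one name, build the pattern in PHP from a table of identifiers. The helper names `quoted` and `buildDoctypeRegex` are hypothetical and made up for this illustration. The sketch assumes that the relative back-reference `\g{-1}` (which points at the capture group opened just before it) is an acceptable replacement for the named `third_quote_mark_N` groups. It also drops the `x` and `X` modifiers, because `preg_quote()` does not escape spaces.

```php
<?php
/**
 * Hypothetical helper: sub-pattern for $literal wrapped in a matching pair of quotes.
 * \g{-1} is a relative back-reference to the capture group opened just before it,
 * so every quoted string can reuse the same snippet without unique group names.
 */
function quoted(string $literal): string
{
    return '(["\'])' . preg_quote($literal, '@') . '\g{-1}';
}

/** Hypothetical builder: assembles the whole DOCTYPE pattern from a table. */
function buildDoctypeRegex(): string
{
    $ws = '[\x20\x09\x0A\x0C\x0D]';

    // [public identifier, system identifier, is the system identifier required?]
    $table = [
        ['-//W3C//DTD HTML 4.0//EN', 'http://www.w3.org/TR/REC-html40/strict.dtd', false],
        ['-//W3C//DTD HTML 4.01//EN', 'http://www.w3.org/TR/html4/strict.dtd', false],
        ['-//W3C//DTD XHTML 1.0 Strict//EN', 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd', true],
        ['-//W3C//DTD XHTML 1.1//EN', 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd', true],
    ];

    // One alternative per table row; only the literals and the optional "?" differ.
    $branches = [];
    foreach ($table as [$public, $system, $systemRequired]) {
        $branches[] = quoted($public)
            . '(?:' . $ws . '+' . quoted($system) . ')'
            . ($systemRequired ? '' : '?');
    }

    $legacy = '(?i:SYSTEM)' . $ws . '+' . quoted('about:legacy-compat');
    $obsolete = '(?i:PUBLIC)' . $ws . '+(?:' . implode('|', $branches) . ')';

    // No x modifier: preg_quote() leaves spaces unescaped, so they must stay significant.
    return '@^<!(?i:DOCTYPE)' . $ws . '+(?i:HTML)'
        . '(?:' . $ws . '+(?:' . $legacy . '|' . $obsolete . '))?'
        . $ws . '*>$@suD';
}

var_dump(preg_match(buildDoctypeRegex(), '<!DOCTYPE html>'));
```

The test strings from the post can be passed to `preg_match(buildDoctypeRegex(), ...)` unchanged. Adding another permitted DOCTYPE then means adding one row to `$table` rather than another copy of the quoting logic.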