We receive blood test results for clients as HTML files, and I am trying to finish some PHP code that strips and cleans the markup (using string and preg functions) so that I can assemble multiple files into a spreadsheet. The problem is that the HTML is not cooperating. If anyone can help get the (non-table) elements into an array, that would be most awesome.
Supplied HTML code (snippet):
<HR>
<PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>
HAEMOGLOBIN (g/L) 144 g/L 115 - 155
HCT 0.424 0.33 - 0.45
RED CELL COUNT 4.79 x10^12/L 3.95 - 5.15
MCV 88.5 fL 80 - 99
MCH 30.1 pg 27.0 - 33.5
Please note new reference range.
MCHC (g/L) 340 g/L 300 - 350
RDW 13.2 11.5 - 15.0
PLATELET COUNT <FONT Color="red"><B>* 407 x10^9/L 150 - 400</B></FONT>
MPV 9.6 fL 7 - 13
WHITE CELL COUNT 6.16 x10^9/L 3.0 - 10.0
Neutrophils 60.3% 3.71 x10^9/L 2.0 - 7.5
Lymphocytes 29.9% 1.84 x10^9/L 1.2 - 3.65
Monocytes 6.7% 0.41 x10^9/L 0.2 - 1.0
Eosinophils 2.1% 0.13 x10^9/L 0.0 - 0.4
Basophils 1.0% 0.06 x10^9/L 0.0 - 0.1
All cell populations appear normal.
<B><U><FONT COLOR="BLUE">BIOCHEMISTRY</FONT></U></B>
I have used a combination of str_replace, preg_replace and removing code to get to an output like this (using var_dump):
22 => string 'HAEMOGLOBIN 160 130' (length=98)
23 => string '170' (length=3)
24 => string 'HCT 0.468 0.37' (length=122)
25 => string '0.50' (length=4)
26 => string 'RED CELL COUNT 4.88 x10^12/L 4.40' (length=104)
27 => string '5.80' (length=4)
28 => string 'MCV 95.9 fL ' (length=117)
29 => string '80' (length=2)
30 => string '99' (length=2)
31 => string 'MCH 32.8 pg 27.0' (length=121)
32 => string '33.5' (length=4)
33 => string ' Please note new reference range.' (length=94)
34 => string 'MCHC 342 300' (length=106)
35 => string '350' (length=3)
36 => string 'RDW 12.4 11.5' (length=123)
37 => string '15.0' (length=4)
38 => string 'PLATELET COUNT 251 x10^9/L 150' (length=105)
39 => string '400' (length=3)
40 => string 'MPV 9.5 fL ' (length=118)
41 => string '7' (length=1)
42 => string '13' (length=2)
43 => string 'WHITE CELL COUNT 3.97 x10^9/L 3.0' (length=103)
My code is not elegant…
$fileString = file_get_contents($fileURL);
$parts = $fileString;
$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
// Remove HTML tags
$part_regex = '/(<)(.*?)(>)/';
$parts = preg_replace($part_regex, '', $parts);
// Remove unnecessary delimiters
$parts = str_replace('|', '', $parts);
$parts = str_replace('-', '', $parts);
$parts = str_replace('(g/L)', '', $parts);
$parts = str_replace('g/L', '', $parts);
$parts = str_replace(' ', ' ', $parts);
// Split the file string on runs of two or more whitespace characters
$regex = '/\s\s+/';
$parts = preg_split($regex, $parts, -1, $flags);
// Collect the non-empty, trimmed pieces
$cleanfile = [];
foreach ($parts as $part) {
    //$part = str_replace(' ', '|', $part);
    $part = trim($part);
    if ($part !== '') {
        array_push($cleanfile, $part);
    }
}
var_dump($cleanfile);
I have tried various preg_replace options as well as HTML decoding, but cannot get an output that consistently splits the table as required. I am loath to split on string position, as the supplied files seem to change format and my code needs to flex to that.
[update]
I would like the original HTML code to be split into an array as below:
Currently:
22 => string 'HAEMOGLOBIN 160
130' (length=98)
Ideal array output:
22 => string 'HAEMOGLOBIN' (length...)
23 => string '160' (length...)
24 => string '130' (length...)
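One possible way to get from a result line to separate array elements is to match each line against a single pattern instead of splitting on whitespace. The sketch below is only an illustration under assumptions: `parseResultLine` is a hypothetical helper name, the field layout (name, optional `*` flag, optional percentage, value, optional unit, `low - high` range) is inferred from the snippet above, and the tags are assumed to have been stripped already. Lines that do not fit that layout, such as headings and free-text notes, return null.

```php
<?php
// Hypothetical helper: parse one plain-text result line into named fields.
// Returns null for lines that are not results (headings, free-text notes).
function parseResultLine(string $line): ?array
{
    // name, optional "*" flag, optional percentage, value, optional unit, "low - high"
    $pattern = '/^(?<name>.*?)\s+(?<flag>\*\s+)?(?:(?<percent>[\d.]+)%\s+)?'
             . '(?<value>[\d.]+)(?:\s+(?<unit>\S+))?\s+'
             . '(?<low>[\d.]+)\s*-\s*(?<high>[\d.]+)\s*$/';

    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }

    // Optional groups that did not take part in the match come back as ''.
    $optional = function (string $key) use ($m): ?string {
        return ($m[$key] ?? '') !== '' ? $m[$key] : null;
    };

    return [
        'name'    => trim($m['name']),
        'flagged' => $optional('flag') !== null,
        'percent' => $optional('percent'),
        'value'   => $m['value'],
        'unit'    => $optional('unit'),
        'low'     => $m['low'],
        'high'    => $m['high'],
    ];
}

// Usage on lines taken from the snippet above (tags already removed).
$sample = [
    'HAEMATOLOGY',
    'HAEMOGLOBIN (g/L) 144 g/L 115 - 155',
    'Please note new reference range.',
    'Neutrophils 60.3% 3.71 x10^9/L 2.0 - 7.5',
];
foreach ($sample as $line) {
    $row = parseResultLine($line);
    if ($row !== null) {
        echo implode(' | ', [$row['name'], $row['value'], $row['low'], $row['high']]), PHP_EOL;
    }
}
```

Because the pattern is anchored on the numeric range at the end of the line, it does not depend on column positions, which may help if the supplied layout shifts between files. It would need adjusting if a lab reports values such as `<0.1` or ranges without a hyphen.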
Don't use string operations to process HTML. Use a DOM parser library, such as DOMDocument or simple-php-dom.
@Barmar although that would be the normal method I would push, there is little DOM markup in there anyway.
Thanks for the comments. I am not familiar with a DOM parser; how would I split HAEMOGLOBIN (g/L) 144 g/L 115 - 155 into elements in an array without any markup to hook onto?
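For what the DOM suggestion might look like here: since the results sit as plain text inside a PRE element, a parser can only hand back that text, and the per-line splitting still has to happen afterwards. A minimal sketch, assuming the dom extension is available and using a made-up sample string modelled on the snippet (with a closing `</PRE>` added):

```php
<?php
// Sample markup modelled on the snippet in the question; not a real file.
$html = <<<'HTML'
<HR>
<PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>
HAEMOGLOBIN (g/L)    144 g/L    115 - 155
PLATELET COUNT   <FONT Color="red"><B>* 407 x10^9/L   150 - 400</B></FONT>
</PRE>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings about old-style markup quiet
$doc->loadHTML($html);
libxml_clear_errors();

$lines = [];
foreach ($doc->getElementsByTagName('pre') as $pre) {
    // textContent drops the tags and keeps the text, line breaks included.
    foreach (preg_split('/\R/', $pre->textContent) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
}
print_r($lines);
```

The gain over a tag-stripping regex is mainly that entities are decoded and nested FONT/B tags are handled by the parser; each entry in `$lines` is still a raw text line that needs its own splitting step.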
I suppose my really simple questions are: what are the white spaces between characters on each line, and how do I remove or split on them? I have tried etc etc and cannot find a way to replace them with a single space/character.
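One way to investigate this is to dump the raw bytes of a single line and see what the "spaces" actually are. The sketch below assumes, without having seen the real files, that the padding may be `&nbsp;` entities or non-breaking spaces (U+00A0), which a plain `\s` does not necessarily match; that would also be consistent with the reported string lengths being much longer than the visible text. `normaliseWhitespace` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: turn every run of horizontal whitespace into one delimiter.
function normaliseWhitespace(string $text, string $delimiter = ' '): string
{
    // Decode entities such as &nbsp; first, so they become real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0 (bytes C2 A0 in UTF-8); map it to an ordinary space.
    $text = str_replace("\xC2\xA0", ' ', $text);
    // Collapse runs of spaces and tabs, leaving line breaks alone.
    return preg_replace('/[ \t]+/', $delimiter, trim($text));
}

// Inspect the bytes of a suspicious line:
// 20 = space, 09 = tab, c2a0 = non-breaking space, 266e6273703b = literal "&nbsp;".
echo bin2hex("MCV&nbsp;&nbsp;88.5"), PHP_EOL;

echo normaliseWhitespace("MCV&nbsp;&nbsp; 88.5 \t fL"), PHP_EOL;
echo normaliseWhitespace("MCV&nbsp;&nbsp; 88.5 \t fL", '|'), PHP_EOL;
```

Note that collapsing everything to a single delimiter also splits multi-word names such as RED CELL COUNT, so it may be better to split on runs of two or more spaces before collapsing, if the columns in the original files are padded that way.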
@Thefourthbird – I looked at the regex link and it does split the groups as I would like – thank you (apologies for not saying that initially).