Cannot remove whitespace from an HTML file in order to preg_split it

We receive blood test results for clients as HTML files, and I am trying to finish some PHP code that strips and cleans the markup and then splits the text with preg_split, so that I can assemble multiple files into a spreadsheet. The issue is that the HTML file is not playing ball. If anyone can help get the (non-)table elements into an array, that would be most awesome.

Supplied HTML code (snippet):

<HR>
    <PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>
HAEMOGLOBIN (g/L)              144                  g/L        115  - 155
HCT                            0.424                           0.33 - 0.45
RED CELL COUNT                 4.79                 x10^12/L   3.95 - 5.15
MCV                            88.5                 fL           80 - 99
MCH                            30.1                 pg         27.0 - 33.5
                               Please note new reference range.
MCHC (g/L)                     340                  g/L         300 - 350
RDW                            13.2                            11.5 - 15.0
PLATELET COUNT                 <FONT Color="red"><B>* 407                x10^9/L    150  - 400</B></FONT>
MPV                            9.6                  fL            7 - 13
WHITE CELL COUNT               6.16                 x10^9/L     3.0 - 10.0
  Neutrophils                  60.3%  3.71          x10^9/L     2.0 - 7.5
  Lymphocytes                  29.9%  1.84          x10^9/L     1.2 - 3.65
  Monocytes                     6.7%  0.41          x10^9/L     0.2 - 1.0
  Eosinophils                   2.1%  0.13          x10^9/L     0.0 - 0.4
  Basophils                     1.0%  0.06          x10^9/L     0.0 - 0.1
                               All cell populations appear normal.

<B><U><FONT COLOR="BLUE">BIOCHEMISTRY</FONT></U></B>

I have used a combination of str_replace, preg_replace and stripping out markup to get to an output like this (using var_dump):

22 => string 'HAEMOGLOBIN               160                          130' (length=98)
  23 => string '170' (length=3)
  24 => string 'HCT                            0.468                           0.37' (length=122)
  25 => string '0.50' (length=4)
  26 => string 'RED CELL COUNT                 4.88                 x10^12/L   4.40' (length=104)
  27 => string '5.80' (length=4)
  28 => string 'MCV                            95.9                 fL         ' (length=117)
  29 => string '80' (length=2)
  30 => string '99' (length=2)
  31 => string 'MCH                            32.8                 pg         27.0' (length=121)
  32 => string '33.5' (length=4)
  33 => string '                               Please note new reference range.' (length=94)
  34 => string 'MCHC                      342                           300' (length=106)
  35 => string '350' (length=3)
  36 => string 'RDW                            12.4                            11.5' (length=123)
  37 => string '15.0' (length=4)
  38 => string 'PLATELET COUNT                 251                  x10^9/L    150' (length=105)
  39 => string '400' (length=3)
  40 => string 'MPV                            9.5                  fL         ' (length=118)
  41 => string '7' (length=1)
  42 => string '13' (length=2)
  43 => string 'WHITE CELL COUNT               3.97                 x10^9/L     3.0' (length=103)
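The reported lengths are noticeably larger than the number of visible characters (for example length=122 for the HCT entry), which would be consistent with each padding character taking two bytes, as a UTF-8 non-breaking space (U+00A0, bytes C2 A0) does. A minimal sketch for checking this, assuming a hand-built sample string rather than the real file:

```php
<?php
// Hypothetical sample: "HCT", three UTF-8 non-breaking spaces, then "0.468".
$sample = "HCT" . str_repeat("\u{00A0}", 3) . "0.468";

// strlen() counts bytes; each U+00A0 is two bytes (C2 A0) in UTF-8.
$byteLength = strlen($sample);

// bin2hex() exposes the raw bytes so the padding character can be identified.
$hex = bin2hex($sample);

// Same pattern as in the question: byte-oriented, no /u modifier.
$naive = preg_split('/\s\s+/', $sample, -1, PREG_SPLIT_NO_EMPTY);

echo $byteLength, PHP_EOL;
echo $hex, PHP_EOL;
echo count($naive), PHP_EOL;
```

Running bin2hex() on one of the real array entries would show whether the padding is c2a0 or something else entirely.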

My code is not elegant…

$fileString = file_get_contents($fileURL);
$parts = $fileString;
// Collects the cleaned fields
$cleanfile = [];

$flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;
// Remove HTML tags
$part_regex = '/(<)(.*?)(>)/';
$parts = preg_replace($part_regex, '', $parts);
// Remove unnecessary delimiters
$parts = str_replace('|', '', $parts);
$parts = str_replace('-', '', $parts);
$parts = str_replace('(g/L)', '', $parts);
$parts = str_replace('g/L', '', $parts);
$parts = str_replace('&nbsp;', ' ', $parts);
// Split the file string on runs of two or more whitespace characters
$regex = '/\s\s+/';
$parts = preg_split($regex, $parts, -1, $flags);
foreach ($parts as $part) {
    $part = trim($part);
    // Keep only non-empty parts
    if ($part !== '') {
        $cleanfile[] = $part;
    }
}

var_dump($cleanfile);
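One possible reason the split above misses the padding is that `\s` without the `/u` modifier works on bytes and does not see a multi-byte non-breaking space as whitespace. The sketch below assumes the padding is U+00A0 (not confirmed by the question) and uses a hypothetical sample line; it lists `\x{00A0}` explicitly in a UTF-8 aware pattern:

```php
<?php
// Hypothetical input line: padding is U+00A0, words are separated by plain spaces.
$nbsp = "\u{00A0}";
$line = "RED CELL COUNT" . str_repeat($nbsp, 17) . "4.88" . str_repeat($nbsp, 17)
      . "x10^12/L" . str_repeat($nbsp, 3) . "3.95 - 5.15";

// /u makes the pattern UTF-8 aware; \x{00A0} is listed explicitly so the
// non-breaking space is matched alongside ordinary whitespace.
$fields = preg_split('/[\s\x{00A0}]{2,}/u', $line, -1, PREG_SPLIT_NO_EMPTY);

var_dump($fields);
```

Single plain spaces, such as those inside "RED CELL COUNT", are left alone because the pattern only matches runs of two or more.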

I have tried various preg_replace options as well as HTML entity decoding, but cannot get an output that consistently splits the table as required. I am loath to split on string position, as the files supplied seem to change format and my code needs to flex to that.

[update]

I would like the original HTML code to be split into an array as below:

Currently:

22 => string 'HAEMOGLOBIN               160                          130' (length=98)

Ideal array output:

22 => string 'HAEMOGLOBIN' (length...)
23 => string '160' (length...)
24 => string '130' (length...)
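A rough sketch of one way to get from a raw line to that shape: peel the reference range off the end first, then split what is left on runs of spaces. The function name `parseResultLine` is hypothetical, and the sketch assumes the padding is either plain spaces or U+00A0:

```php
<?php
// Hypothetical helper, not part of the original code: turns one result line
// into [name, value, low, high], or returns null for notes and headings.
function parseResultLine(string $line): ?array
{
    // Treat UTF-8 non-breaking spaces as ordinary spaces and drop any tags.
    $line = str_replace("\u{00A0}", ' ', strip_tags($line));

    // Peel the reference range ("115  - 155") off the end of the line first.
    if (!preg_match('/^(.*?)\s+([\d.]+)\s*-\s*([\d.]+)\s*$/', $line, $m)) {
        return null;
    }

    // What remains is name, value and an optional unit, padded by 2+ spaces.
    $fields = preg_split('/\s{2,}/', trim($m[1]), -1, PREG_SPLIT_NO_EMPTY);
    if (count($fields) < 2) {
        return null;
    }

    // Remove a trailing "(g/L)"-style unit from the name and a "*" flag from the value.
    $name  = trim(preg_replace('/\s*\([^)]*\)$/', '', $fields[0]));
    $value = ltrim($fields[1], '* ');

    return [$name, $value, $m[2], $m[3]];
}

var_dump(parseResultLine('HAEMOGLOBIN (g/L)              144                  g/L        115  - 155'));
```

The differential lines (Neutrophils and so on) carry an extra percentage column, so with this sketch the second element would be the percentage rather than the absolute count; those lines would need their own handling.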

  • Don’t use string operations to process HTML. Use a DOM parser library, such as DOMDocument or simple-php-dom.


  • @Barmar although that would be the normal method I would push – there is little DOM markup in there anyway.


  • Thanks for the comments – I am not familiar with a DOM parser; how would I look to split HAEMOGLOBIN (g/L) 144 g/L 115 – 155 into elements in an array without any markup to hook onto?


  • I suppose my really simple questions for this are: what are the whitespace characters between the values on each line, and how do I remove or split on them? I have tried &nbsp; and so on, and cannot find a way to replace them with a single space/character.


  • @Thefourthbird – I looked at the regex link and it does split the groups as I would like – thank you (apologies for not saying that initially).
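For the DOM parser suggestion in the comments, a minimal sketch of how DOMDocument could at least isolate the text inside the PRE block before any splitting is done. The snippet below is a shortened, hypothetical copy of the supplied HTML in plain ASCII; a real file may need its character set declared for loadHTML to decode it correctly:

```php
<?php
// Hypothetical snippet mirroring the supplied HTML.
$html = <<<HTML
<HR>
<PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>
HCT                            0.424                           0.33 - 0.45
PLATELET COUNT                 <FONT Color="red"><B>* 407                x10^9/L    150  - 400</B></FONT>
</PRE>
HTML;

$doc = new DOMDocument();
// Legacy markup can trigger parser warnings; collect them instead of printing.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('pre') as $pre) {
    // textContent drops the tags and decodes entities such as &nbsp;.
    foreach (preg_split('/\R/u', $pre->textContent) as $line) {
        // Normalise non-breaking spaces, then skip blank lines.
        $line = trim(str_replace("\u{00A0}", ' ', $line));
        if ($line !== '') {
            $rows[] = preg_split('/\s{2,}/', $line);
        }
    }
}

print_r($rows);
```

The DOM step only removes the markup; the columns themselves still have to be split with a regex, since the layout inside PRE is plain padded text.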
