Cite this as
Lin YJ, Chang HH (2015) Investigation of Significant Features Based on Image Texture Analysis for Automated Denoising in MR Images. Arch Biomed Sci Eng 1(1): 001-005. DOI: 10.17352/abse.000001

Introduction: In magnetic resonance (MR) image analysis, noise is one of the main sources of quality deterioration, not only for visual inspection but also in computerized processing such as tissue classification, segmentation, and registration. Consequently, noise removal in MR images is important and essential for a wide variety of subsequent processing applications. In the literature, abundant denoising algorithms have been proposed, most of which require laborious tuning of parameters that are often sensitive to specific image features and textures. Automating these parameters through artificial intelligence techniques would be highly beneficial. However, automation raises another problem: selecting appropriate, meaningful attributes from among a huge number of image characteristics. This paper systematically investigates significant attributes from image texture features to facilitate subsequent automation processes.
Methods: In our approach, a total of 60 image texture attributes are considered, drawn from three categories: 1) image statistics, 2) the gray-level co-occurrence matrix (GLCM), and 3) the 2-D discrete wavelet transform (DWT). To identify the most significant attributes, a paired-samples t-test is applied to each individual image feature computed in every image. The evaluation is based on the ability to distinguish between noise levels, intensity distributions, and anatomical geometries.
Results: A wide variety of images, including the BrainWeb image data with various levels of noise and intensity non-uniformity, were adopted to evaluate the proposed methods. Experimental results indicated that an optimal set of seven image features performed best in distinguishing MR images with various combinations of noise levels and slice positions: the contrast and dissimilarity features from the GLCM category and five norm energy and standard deviation features from the 2-D DWT category.
Conclusions: We have introduced a new framework to systematically investigate significant attributes from various image features and textures for automating the denoising of MR images. Sixty image texture features were computed in every image, followed by a paired-samples t-test for the discrimination evaluation. Seven texture features, two from the GLCM category and five from the 2-D DWT category, performed best and can be incorporated into denoising procedures for future automation.
Magnetic resonance imaging (MRI) has been one of the most frequently used medical imaging modalities due to its high contrast among different soft tissues, high spatial resolution across the entire field of view, and multi-spectral characteristics [1,2]. In MR image analysis, noise is one of the main sources of quality deterioration not only for visual inspection but also in computerized processing such as tissue classification, segmentation and registration [3,4]. Consequently, noise removal in MR images is important and essential for a wide variety of subsequent processing applications.
Over the decades, Gaussian filters have been widely used in many MR image processing applications due to their simplicity [5]. Although the Gaussian filter smooths noise quite efficiently, it blurs edges significantly. To preserve sharpness, a nonlinear method called the anisotropic diffusion filter [6] was proposed, in which pixel values are averaged over neighborhoods whose size and shape depend on local image variation measured at every point. One promising technique that attempted to improve on the Gaussian filter was the bilateral filter [7]. The essence of this approach is to combine geometric closeness in the spatial domain with gray-value similarity in the range domain as a nonlinear filter for image denoising. The bilateral filter has been shown to perform effectively in MR image noise suppression and has been the object of many further studies [8,9].
Indeed, most denoising algorithms require laborious tuning of parameters that are often sensitive to specific image features and textures. Automating these parameters through artificial intelligence techniques would be highly beneficial. However, automation raises another problem: selecting appropriate, meaningful attributes from among a huge number of image characteristics. This paper systematically investigates significant attributes from image texture features to facilitate subsequent automation processes.
Statistical features of gray levels were among the earliest methods used to classify textures. The gray-level co-occurrence matrix (GLCM) [10] extracts second-order statistics based on the repeated occurrence of particular gray-level configurations in an image. Such configurations vary rapidly with distance in fine texture regions and slowly in coarse texture regions. More specifically, the GLCM is defined as a matrix of frequencies at which two pixels, separated by a certain displacement vector, occur in the image. For an image of L gray levels, the distribution in the L × L matrix depends on the angular and distance relationship between pixels, based on gray-tone spatial dependencies.
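To make the idea concrete, the following Python sketch (our illustration, not the authors' implementation) builds a GLCM for a single displacement vector and derives the contrast and dissimilarity features referred to above; the function names and the toy image are ours.

```python
import numpy as np

def glcm(image, dx=1, dy=0, levels=8):
    """Co-occurrence counts of gray-level pairs separated by the vector (dx, dy)."""
    h, w = image.shape
    mat = np.zeros((levels, levels), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                mat[image[y, x], image[y2, x2]] += 1
    return mat / mat.sum()                    # normalize to joint probabilities

def glcm_features(p):
    """Contrast and dissimilarity, two of the GLCM features named in the Results."""
    i, j = np.indices(p.shape)
    contrast = np.sum(p * (i - j) ** 2)
    dissimilarity = np.sum(p * np.abs(i - j))
    return contrast, dissimilarity

# toy 8-level image
img = np.random.randint(0, 8, size=(64, 64))
print(glcm_features(glcm(img, dx=1, dy=0)))
```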
Risk factors are the smallest units indicating the existence of a disease. A syndrome, on the other hand, is a collection, set, or cluster of concurrent risk factors which together indicate the presence and nature of the disease. The main question here is what the relation is between these risk factors and a specific syndrome. Classification and clustering are therefore basic concerns in medicine. Classification depends on the definition of the classes and on the required degree of affiliation of their elements [20].
Clustering algorithms are generally divided into two groups. Hard partitioning algorithms are based on classical set theory; they require that an object either does or does not belong to a cluster. Soft clustering methods, however, allow objects to belong to several clusters simultaneously with different degrees of membership [21]. Fuzzy clustering methods are among the best-known soft clustering methods and are widely used in solving medical diagnosis problems. In medicine, conditions are usually imprecise and classes highly overlapping, so fuzzy methods seem more suitable than crisp ones [20].
The FCM algorithm was first introduced by Bezdek as an enhancement of classical K-means clustering [22]. The algorithm estimates the membership u_ik of object k in cluster i so as to minimize the following cost function [23]:

J_m = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ik)^m (d_ik)^2
where d_ik is the distance of the kth data point from the ith cluster center, and the parameters c, n, and m ≥ 1 are the number of clusters, the number of objects in the dataset, and the fuzzy coefficient, respectively. In probabilistic FCM, the following constraint must be met when optimizing the above cost function:

Σ_{i=1}^{c} u_ik = 1, for every k = 1, …, n
The FCM algorithm iteratively estimates the cluster centers and the membership functions to minimize the cost function J_m; the details can be found elsewhere [23].
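As a rough illustration of these update rules (our sketch under the standard FCM formulation, not the authors' code), one iteration can be written as follows:

```python
import numpy as np

def fcm_step(X, centers, m=2.0):
    """One iteration of standard (unsupervised) FCM: memberships, then centers."""
    # distances of every object k to every cluster center i, shape (c, n)
    d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
    # membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
    ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
    u = 1.0 / ratio.sum(axis=1)                       # columns sum to 1
    # center update: weighted mean of the data with weights u_ik^m
    w = u ** m
    centers_new = (w @ X) / w.sum(axis=1, keepdims=True)
    cost = np.sum(w * d ** 2)                         # J_m
    return u, centers_new, cost

# toy usage: 2 clusters, 6 two-dimensional points
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
centers = np.array([[0.5, 0.5], [5.5, 5.5]])
u, centers, cost = fcm_step(X, centers)
```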
Despite all the benefits of using FCM as the core clustering algorithm, it is still a blind method and may misclassify the input data. It is therefore necessary to train the algorithm in a way that induces a meaningful convergence, which makes the classification more accurate.
In the original FCM, the distances of the data to the cluster centers are normally calculated using the standard Euclidean distance. In our case, each object is a vector of multiple risk factors with various measurement scale types and ranges, so the classic Euclidean distance is not appropriate [24]. Basically, there are three major data types in clinical data sets: nominal, discrete ordinal, and interval. Nominal scales are used only for non-ranked qualitative classification, e.g., gender, blood type, and health condition. A discrete-ordinal scale is a nominal variable whose states are ordered in a meaningful sequence, e.g., the slope of the peak exercise ST segment. Interval scales are measured on a linear scale, e.g., BMI (body mass index) and age. It is important to define a distance measure that balances all these differences so that no feature diminishes the effect of the others, or vice versa.
The class labels provide useful guidance during the training procedure. Hence, it is necessary to use the labeled samples in the training phase and the unlabeled samples in the testing phase to improve the performance of FCM. This idea led to the development of a new algorithm called the Supervised Fuzzy C-Means (SFCM) algorithm, a slight modification of FCM [25].
The main goal of SFCM is to use the labeled data samples to guide the iterative optimization procedure. In this method, a known fixed set of categories and category-labeled training data are used to induce a classification function. The determination of the fuzzy partition matrix U (dividing N data samples into C classes) using Supervised Fuzzy C-Means clustering is an iterative optimization procedure. The objective function of SFCM classification is defined as:

J_m(U, V) = Σ_{i=1}^{C} Σ_{k=1}^{N} (u_ik)^m (d_ik)^2 + a Σ_{i=1}^{C} Σ_{k=1}^{N} (u_ik − f_ik)^m (d_ik)^2
where U is the fuzzy partition matrix, V is the set of cluster centers, and f_ik is the membership degree of the kth labeled sample in the ith cluster (either 0 or 1).
The coefficient a denotes the scaling factor. Its role is to maintain a balance between the supervised and unsupervised components within the optimization procedure, while the parameter m controls the amount of fuzziness in the classification. The typical value of m is 2 and a = L/n, where L denotes the number of labeled samples [25]. However, it is better to tune these two parameters based on the properties of the dataset. The function J_m can take a wide range of values, the smallest of which is associated with the best clustering.
An effective algorithm for supervised fuzzy classification is discussed herein. The steps of the algorithm are as follows [20,25] (a code sketch of the iteration appears after the list):
1) Initialize the fuzzy partition matrix, U(0), with random values between 0 and 1, and fix the number of cluster centers to the number of outcome classes.
2) Start the iterative procedure and set the iteration counter to one.
3) Calculate the cluster centers using the following equation:

V_ij(t) = Σ_{k=1}^{N} (u_ik(t))^m Z_kj(train) / Σ_{k=1}^{N} (u_ik(t))^m
where V_ij(t) represents the jth feature of the ith cluster center, with j ranging from 1 to the number of features, and Z_kj(train) is the kth training instance of the jth selected feature variable.
4) Calculate the distance between the ith cluster center and the kth data sample, measured with the Euclidean distance, as follows:

d_ik(t) = ‖Z_k − V_i(t)‖ = ( Σ_{j} (Z_kj − V_ij(t))^2 )^{1/2}
5) Update the fuzzy partition matrix for the next iteration, for the labeled training samples, using the following equation:

u_ik(t+1) = (1/(1+a)) [ ( Σ_{j=1}^{C} (d_ik(t)/d_jk(t))^{2/(m−1)} )^{-1} + a f_ik ]
For test set samples, whose class labels are unknown, the fuzzy partition matrix is calculated as follows:

u_ik(t+1) = ( Σ_{j=1}^{C} (d_ik(t)/d_jk(t))^{2/(m−1)} )^{-1}
6) When ‖U(t+1) − U(t)‖ ≤ ε (where ε is the iterative accuracy) is achieved, stop the iteration; the outputs are V (the cluster centers) and U (the fuzzy partition matrix). Otherwise, return to step 3.
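The following Python sketch shows one way the iteration above could be organized. The weighted-mean center update and the blended membership update for labeled samples are assumptions on our part, chosen so that the procedure reduces to standard FCM when a = 0; they are not necessarily the exact formulas of [25].

```python
import numpy as np

def sfcm_fit(X, F, m=2.0, a=0.8, eps=1e-5, max_iter=100, rng=None):
    """Sketch of a supervised fuzzy c-means loop.
    X: (n, p) training data; F: (c, n) crisp label matrix (f_ik in {0, 1}).
    The blended membership update below is an assumed formulation."""
    rng = np.random.default_rng(rng)
    c, n = F.shape
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                # enforce sum_i u_ik = 1
    for _ in range(max_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)    # step 3: weighted-mean centers
        d = np.linalg.norm(X[None] - V[:, None], axis=2) + 1e-12   # step 4
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
        U_unsup = 1.0 / ratio.sum(axis=1)             # standard FCM membership
        U_new = (U_unsup + a * F) / (1.0 + a)         # step 5: assumed supervised blend
        if np.linalg.norm(U_new - U) <= eps:          # step 6: convergence test
            return V, U_new
        U = U_new
    return V, U
```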
In many classification problems, especially in the biomedical domain, high-dimensional data with few observations are used [26]. This can lead to lower classification accuracy and clusters of poor quality. High-dimensional data are also a serious problem for many classification algorithms because of their high computational cost, memory usage, and the curse of dimensionality [27].
Since most of the features are redundant or irrelevant, a feature selection (FS) method is used to pick a subset of features that are relevant to the target concept [28]. In this work, a statistical FS method, multiple logistic regression (MLR), was used; it is widely applied to identify relevant risk factors in epidemiological studies. MLR, known as the feature vector machine in machine learning, can be used to select statistically significant features: it not only selects features that provide acceptable discrimination between the two classes but also takes the correlation between features into account. After running MLR on the input features (excluding the intercept from the analysis), the selected features were used in the tested classifier [2,29,30].
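As an example of this kind of screening (a minimal sketch using statsmodels, which is our library choice, not necessarily the authors'; the column names in the usage line are placeholders drawn from Table 1):

```python
import pandas as pd
import statsmodels.api as sm

def mlr_select(X: pd.DataFrame, y: pd.Series, alpha=0.05):
    """Fit a multiple logistic regression and keep features with p < alpha.
    The intercept is excluded from the screening, as in the text."""
    model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    pvals = model.pvalues.drop("const")
    return list(pvals[pvals < alpha].index), pvals

# usage with a hypothetical risk-factor table `df` and binary outcome `cad`
# selected, pvals = mlr_select(df[["age", "trestbps", "thalach", "cp"]], df["cad"])
```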
It is impossible to perform arithmetic operations on nominal data because they have no order; the only operation defined is equality. The distance between two nominal values A and B is 0 if A equals B, and 1 otherwise.
For interval scales, it is possible to calculate the distance with standard norm definitions. The distance between two data samples A and B from a given interval I is defined as |A − B|. Since the interval size can differ between features, it is important to normalize the distance by the interval size, giving |A − B| / |I|.
A discrete-ordinal scale is a nominal variable whose states are ordered in a meaningful sequence. Ordinal data have an order, but the intervals between scale points may be uneven. Nevertheless, the distance between two such samples is computed in the same way as the interval distance, i.e., |A − B| / |I|. Each feature is thus normalized to a value between 0 and 1.
Note that |I| is calculated here as the difference between the maximum and minimum values. The l1-norm is then used to combine the distances between the transformed features, a construction known simply as the GMM in the literature [31,32]. The GMM distance between two feature vectors A and B is then given as:

GMM(A, B) = Σ_{k=1}^{d} C_k Ψ_k(A_k, B_k)
where d is the number of dimensions, Ψ is the per-feature distance function, which varies according to the feature's measurement scale, and the C_k are weights. The weights can either be set to unity or tuned using an optimization algorithm. Accordingly, the feature weights were set to unity when using GMM+SFCM with or without MLR. Alternatively, instead of using SFS, all of the features were used, their weights were estimated, and features with small weights were discarded. The latter approach requires an efficient optimization algorithm, discussed in the next section.
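A minimal sketch of such a mixed-type distance, applying the per-feature rules described above, is given below; the function and the example attribute vectors are ours, and the weights default to unity as in the base classifier.

```python
import numpy as np

def gmm_distance(a, b, scales, intervals, weights=None):
    """Weighted l1 combination of per-feature distances (the GMM of the text).
    scales: 'nominal', 'ordinal', or 'interval' per feature.
    intervals: per-feature |I| = max - min (ignored for nominal features)."""
    d = len(a)
    weights = np.ones(d) if weights is None else np.asarray(weights)
    total = 0.0
    for k in range(d):
        if scales[k] == "nominal":
            psi = 0.0 if a[k] == b[k] else 1.0        # equality is the only operation
        else:                                          # ordinal and interval scales
            psi = abs(a[k] - b[k]) / intervals[k]      # normalized to [0, 1]
        total += weights[k] * psi
    return total

# hypothetical example: [gender (nominal), chest-pain type (ordinal), age (interval)]
print(gmm_distance(["M", 3, 54], ["F", 1, 61],
                   ["nominal", "ordinal", "interval"], [None, 3, 48]))
```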
DSA is an optimization algorithm developed by P. Civicioglu that simulates the Brownian-like random-walk movement an organism uses to migrate [33]. Like many population-based stochastic optimization algorithms, DSA was inspired by nature. Many living organisms show annual migration, during which a superorganism consisting of a large number of individuals is formed. The movement of a superorganism can be described by a Brownian-like random-walk model [33,34]. In DSA, the population contains random solutions, and the migration is performed toward the global optimum of the cost function. At each iteration, some members of the population are selected and moved according to a Brownian-like random-walk model [34]. DSA is simple to implement, has shown acceptable performance on a variety of optimization problems compared with other traditional optimization algorithms, and is not overly sensitive to the initialization of its parameters [33].
We used DSA to optimize the parameters of the GMM and SFCM. The cost function was the absolute error rate of the classifier on the training set. The initial settings of the DSA in our study followed those of P. Civicioglu [33]: briefly, the population size was set to 30, and the maximum number of function evaluations (i.e., the number of times the cost function was called in the program), used as the only stopping criterion, was 2,000,000.
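A full DSA implementation is beyond the scope of a short snippet; the sketch below only shows how such a parameter-tuning loop can be wired up, using a naive random search as a stand-in for DSA. In the paper, the cost would be the absolute training error of the GMM+SFCM classifier as a function of the vector [m, a, w_1, ..., w_d]; the toy sphere function here is only for demonstration.

```python
import numpy as np

def naive_search(cost, lower, upper, budget=2000, rng=None):
    """Stand-in for DSA: uniform random search over the box [lower, upper].
    In the paper, `cost` would be the absolute training error of the
    GMM+SFCM classifier for a candidate parameter vector."""
    rng = np.random.default_rng(rng)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    best_x, best_f = None, np.inf
    for _ in range(budget):
        x = rng.uniform(lower, upper)
        f = cost(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# toy demonstration: minimize a sphere function over [0, 1]^5
best_x, best_f = naive_search(lambda v: float(np.sum((v - 0.3) ** 2)),
                              lower=np.zeros(5), upper=np.ones(5))
print(best_x.round(2), round(best_f, 4))
```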
The performance of a classifier can be evaluated by counting the correctly recognized CAD subjects (TP, true positives), the correctly recognized healthy subjects (TN, true negatives), and the examples that were either incorrectly assigned to the CAD class (FP, false positives) or missed as class examples (FN, false negatives). These four counts form the basis of the information-theoretic formulas used to measure classification performance [35,36].
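For reference, these counts translate into the accuracy, sensitivity, and specificity reported below as follows:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from the four basic counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # recall on the CAD class
    specificity = tn / (tn + fp)          # recall on the healthy class
    return accuracy, sensitivity, specificity

# e.g. classification_metrics(tp=120, tn=145, fp=19, fn=19)
```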
The demographic information of the Cleveland CAD data is shown in Table 1 for the case (CAD) and control (healthy) groups. To assess the performance of the base classifier, the main dataset was randomly divided into two datasets of roughly equal size, dataset 1 and dataset 2 (hold-out validation [23]). The best accuracy (Acc) achieved when tuning on dataset 1 and testing on dataset 2, and vice versa, over ten runs of the SFCM algorithm with the GMM distance measure (unity weights) is shown in Table 2. The maximum number of iterations was set to 100 in all the classifiers. The scaling factor a was tuned on the training set using an exhaustive grid search (a = 0.8), while the fuzziness parameter m was set to 2 in the base classifier. The overall percentage accuracy (the average Acc of the classifier on dataset 2 when tuned on dataset 1, and vice versa) of the base classifier was 79%. The average sensitivity and specificity of the base classifier were 71% and 84%, respectively.
Table 3 shows the results of running the algorithm with SFS. MLR revealed the following significant features: gender, cp (chest pain type), trestbps (resting systolic blood pressure), thalach (maximum exercise heart rate achieved), slope (the slope of the peak exercise ST segment), ca (number of major vessels (0-3) colored by fluoroscopy), and thal (thallium-201 stress scintigraphy). The overall accuracy of SFCM-SFS was 82%, and its average sensitivity and specificity were 85% and 82%, respectively.
Finally, the DSA optimization method was used to tune the feature weights, the fuzziness parameter, and the scaling factor. The cost function was the absolute error rate of the classifier (SFCM+GMM), i.e., 1 − Acc, on the training set. To guard against Type III error [37], 10-fold cross-validation [23] was used to assess the performance of the proposed hybrid classifier (Table 4). The average accuracy, sensitivity, and specificity of this classifier were 88%, 86%, and 88%, respectively. McNemar's test [23] revealed that the performance of the hybrid classifier was significantly better than that of the base classifier (p-value < 0.05) but comparable with that of SFS+SFCM. Multi-fold cross-validation was used instead of leave-one-out because it has been shown to perform better in terms of accuracy and efficiency [23].
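For completeness, a minimal continuity-corrected McNemar computation from the two classifiers' per-subject correctness could look like this (a standard textbook formula, not the authors' code):

```python
from scipy.stats import chi2

def mcnemar_test(correct_a, correct_b):
    """Continuity-corrected McNemar statistic on paired classifier outcomes.
    correct_a, correct_b: boolean sequences, one entry per test subject."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    stat = (abs(b - c) - 1) ** 2 / (b + c) if (b + c) > 0 else 0.0
    p_value = chi2.sf(stat, df=1)
    return stat, p_value
```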
The values of the parameters of the hybrid classifier tuned using DSA are shown in Table 5. The most important features (feature weight w > 0.5), in descending order, were: the number of major vessels colored by fluoroscopy (w = 1), the family history of CAD (w = 1), peak exercise systolic blood pressure (w = 0.89), maximum exercise heart rate achieved (w = 0.87), chest pain type (w = 0.82), resting heart rate (w = 0.67), fasting blood sugar (w = 0.64), and gender (w = 0.54). The least significant attributes (w < 0.1) were age, resting blood pressure, number of cigarettes per day, resting electrocardiographic results, peak exercise diastolic blood pressure, and the slope of the peak exercise ST segment. For the definitions and number of categories of these attributes, the reader is referred to Table 1.
The overall performance of the hybrid classifier is shown in Table 6, which includes the contingency table (confusion matrix) for all 303 subjects. The agreement between the results of this classifier and those of the gold standard (CAD diagnosis using angiography) was assessed using Cohen's kappa coefficient [38]; substantial agreement was found between the outcomes of the proposed hybrid classifier and angiography (kappa = 0.73) [39].
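Cohen's kappa can be computed directly from the 2×2 contingency table of Table 6; a minimal sketch:

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa between a classifier and the gold standard,
    computed from the 2x2 contingency table."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n                                 # observed agreement
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)          # chance agreement on CAD
    p_no = ((tn + fn) / n) * ((tn + fp) / n)           # chance agreement on healthy
    pe = p_yes + p_no
    return (po - pe) / (1 - pe)
```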
In this paper, three classification systems were designed for non-invasive CAD diagnosis, among which the hybrid classifier showed the best performance (Tables 2-4). This diagnosis system was based on the SFCM classifier, in which the distances between objects were calculated using the GMM and the parameters of the system (SFCM parameters and GMM weights) were estimated using DSA optimization. Other optimization approaches such as PSO [40] were also tried, but DSA gave more accurate results. The Type I error and the power of the hybrid classifier were 0.1 and 86%, respectively. Since the dataset was not fully balanced (the numbers of cases and controls were not identical), the F1-score may be a more informative measure than accuracy; the average F1-score over the 10-fold cross-validation was 85±10 (%), indicating that the proposed system is accurate. A comparison between the performance of the proposed system and that of other systems designed on the CAD dataset is shown in Table 7. Some methods had higher accuracy than the proposed system. We compared our result with the method proposed by Muthukaruppan et al. [12]; although its accuracy was 93%, McNemar's test showed that it was not significantly higher than that of our hybrid system (p-value > 0.05). Another observation is that, among the methods listed in Table 7, those using fuzzy classification showed higher accuracies. Since most or all classificatory concepts in medicine are fuzzy, a fuzzy taxonomy was used in our study. Moreover, it is very difficult to define sharp borders between various symptoms in the set of all symptoms and between various diseases in the set of diseases [41]. Thus, the framework of fuzzy systems is very useful for dealing with the absence of sharp boundaries in the sets of symptoms, diagnoses, and phenomena of diseases [19,42].
The significant features selected by the DSA are known to be directly involved in CAD. Fluoroscopy is one of the most popular non-invasive CAD diagnosis methods, whose accuracy ranges between 35% and 75% compared with the gold standard (angiography) in the literature [43,44]. We performed a univariate classification (on the number of major vessels (0-3) colored by fluoroscopy) based on the receiver operating characteristic (ROC) plot. Its accuracy was 75% (area under the curve, AUC = 0.75; cut-off = 0.5). The average number of colored vessels was statistically different between the CAD and normal groups (independent-samples t-test; p-value < 0.05). The high GMM weight of this feature is in agreement with the statistical test. Thus, it is a suitable feature, but not sufficient on its own for accurate classification. The other traditional non-invasive CAD diagnosis method is thallium-201 stress scintigraphy. The prevalence of CAD in the three scintigraphy groups was statistically different (chi-square test; p-value < 0.05). A decision-tree classifier built on the scintigraphy feature reached an accuracy of 76%. However, due to the directional correlation between fluoroscopy and scintigraphy (Eta = 0.3), DSA estimated the fluoroscopy and scintigraphy weights as 1.00 and 0.28, respectively. Another clinical variable is the ST segment depression used in electrocardiography; its weight was zero, indicating that no further information could be extracted by adding this variable. In the literature, the ranked order of CAD predictors is cardiac fluoroscopy score, thallium score, and extent of electrocardiographic changes (e.g., ST segment depression) [45,46], which is in agreement with our findings.
CAD is associated with higher morbidity and mortality in women than in men [47]. It has also been shown that the incidence of CAD in women aged less than 70 years is lower than in their male counterparts [48]. In our study, the percentages of men and women having CAD were 84% and 16%, respectively. Considering that the women in the CAD group were 66 years old or younger, this is in agreement with our study. However, women usually develop CAD 7 to 10 years later than men [49]; in our dataset, the average ages of women and men who had CAD were 60±5 and 55±8 years, respectively. Moreover, gender was a significant feature (w = 0.539), whereas age by itself was not (Table 5; w = 0.085). This is consistent with the fact that the average ages of people in the CAD and normal groups were 56±8 and 53±9 years, respectively (Table 1), and is related to the stratified age sampling used in our study.
Although high blood pressure is proven to increase the risk of CAD [50], it was not significant in our study. Meanwhile, an elevated resting heart rate is a known CAD risk factor, in agreement with our findings (w = 0.674) [51]. In the literature, a family history of CAD is a major CAD risk factor in adults [52], which agrees with our findings, where it had the highest GMM weight (w = 1; Table 5). Fasting blood sugar is likewise known as an important determinant of CAD [53], again in agreement with our findings, where its GMM weight was estimated at 0.641. A high total cholesterol level can increase the risk of cardiovascular disease; however, decisions about when to treat high cholesterol are usually based on the level of LDL or HDL cholesterol rather than on total cholesterol. This may explain why the weight of cholesterol was 0.289 in our study. Moreover, the total cholesterol to high-density lipoprotein cholesterol ratio has been shown to be associated with CAD rather than cholesterol by itself [54,55].
Chest pain type (GMM weight 0.818) was divided into the following categories: typical angina pectoris, atypical angina, non-anginal pain, and no pain. Typical angina (pain in the anterior thorax, neck, shoulders, jaw, or arms that is precipitated by exertion and relieved within 20 min by rest) is the most common symptom of CAD [18]. It occurs when blood flow to an area of the heart decreases, impairing the delivery of oxygen and vital nutrients to the heart muscle cells; the byproduct of this inefficient metabolism is lactic acid, which builds up in the muscle and causes pain [56].
The key to a good classification is a dataset containing all of the potentially relevant features (the risk factors mentioned in the literature) with enough cases (a suitable sample size). Although the sample size of the Cleveland dataset is rather large, some important features such as BMI, LDL, and HDL are missing. Meanwhile, we plan to design an automated CAD risk assessment program based on the findings of this study, in collaboration with the Isfahan Healthy Heart Program [57]. Such a large database would allow us to investigate the accuracy of the proposed diagnosis system in a broader sense. Another issue is that the performance of the base classifier with and without FS (Tables 2, 3) differed considerably between the first and second scenarios. Having calculated the cluster representatives for the healthy and CAD groups in the first and second datasets, dataset 1 showed better discrimination than dataset 2 on all 20 features and also on those selected by the MLR. This is why the performance of the base classifier with and without FS was higher on dataset 1 throughout the training and test procedure. Also, the discrimination with and without FS was not very different, which in fact shows that the FS selected features with significant discriminative power.
Another step would be to develop a web-based online system with which patients and medical doctors could assess the risk of having CAD at home. Such web-based diagnostic decision support systems have recently received attention in medicine and have proven valuable in identifying the correct diagnosis in complicated cases [58]. There are two possible approaches to improving the performance of the proposed diagnosis system. First, further features could be defined by considering the interactions between the input risk factors/predictors [59], e.g., simply multiplying predictors. Second, multiple clusters could be formed for each of the healthy and CAD classes by using mixed-type data clustering methods [24]; supervised FCM could then be used with multiple clusters corresponding to the two classes. Extracting supervised classification rules on groups of similar objects could potentially reduce the misclassification rate, especially close to the class borderlines. These two approaches will be the focus of our future work.
The hybrid classifier showed an average accuracy of 87%. The power of the designed diagnosis system was 86%, the Type I error (α) was 0.1, and the F-score was 85%. Although the power of the method is acceptable, the Type I error must be reduced to 0.05 to yield a reliable and accurate clinical test, which is the focus of future work. One possible strategy to improve the accuracy of the proposed diagnosis system is classifier fusion: combining different reliable classifiers might improve accuracy through the fusion procedure. In conclusion, we designed an automated non-invasive CAD diagnosis system based on fuzzy theory. The results show that the proposed system is promising; however, further improvements are needed before it can be used in clinical laboratories.
This work was supported by the University of Isfahan (MN, SJ, HM) and Isfahan University of Medical Sciences (MM).