Search Immortality Topics:

Page 8«..78910..2030..»


Category Archives: Machine Learning

Plagiarism Detection Tools Offer a False Sense of Accuracy The Markup – The Markup

When Katherine Pickering Antonova became a history professor in 2008, she got access to the plagiarism detection software tools Turnitin and SafeAssign. At first blush, she thought the technology would be great. She had just finished a graduate program where she had manually graded papers as a teaching assistant, meticulously checking students suspect phrases to see if any showed up elsewhere.

But her first use of the plagiarism checkers gave her a jolt. The software suggested the majority of her students had copied portions of their essays.

Soon she realized the lie in how the tools were described to her. Its not tracking plagiarism at all, Pickering Antonova said. Its just flagging matching text. Those two concepts have different standards; plagiarism is a subjective assessment of misconduct, but scholars may have matching words in their academic articles for a variety of legitimate reasons.

Plagiarism checkers are built into The City University of New Yorks learning management system, where faculty members post assignments and students submit them. As at many colleges throughout the country, scanning for plagiarism in submitted assignments is the default. But fed up with false flags and the countless hours required to check potentially plagiarized passages against the source material Turnitin and SafeAssign highlight, Pickering Antonova gave up on the tools entirely a couple years ago.

The bots are literally worse than useless, she said. They do harm, and they dont find anything I couldnt find by myself.

Some experts agree that Claudine Gay, Harvards ousted president and a widely respected political scientist, recently became the latest victim of this technology. She was forced to step down from the presidency after an accuser flagged nearly 50 examples from her writing that they called plagiarism. But many of the examples looked a lot like what Pickering Antonova considered a waste of her time when she was grading student work.

The Voting Rights Act of 1965 is often cited as one of the most significant pieces of civil rights legislation passed in our nations history, Gay wrote in one paper. Her accuser says she plagiarized David Canons description of the landmark lawbut as the Washington Free Beacon reported in publishing the allegations, Canon himself disagrees, arguing Gay had done nothing wrong.

The controversy over Gays alleged plagiarism has roiled the academic community, and while much of the attention has been on the political maneuvering behind her ouster and the definition of plagiarism, some scholars have commented on the detection software that was likely behind it. The fact is, however, that students, not academics, bear the brunt of the tools shoddy analyses. Turnitin is the industry leader in marshaling text analysis tools to assess academic integrity, boasting partnerships with more than 20,000 institutions globally and a repository of over 1.8 billion student paper submissions (and still counting).

The companies that are marketing plagiarism detection tools tend to acknowledge their limitations. While they may be referred to as plagiarism checkers, the products are described as highlighting text similarities or duplicate content. They scan billions of webpages and scholarly articles looking for those matches and surface them for a reviewer. Some, like Grammarlys, are marketed to writers and offer to help people add proper citations where they may have forgotten them. It isnt meant to police plagiarism, but rather help writers avoid it. Turnitin specifically says its Similarity Report does not check for plagiarism.

Still, the tools are frequently used to justify giving students zeroes on their assignmentsand the students most likely to get such dismissive grading are those at less-selective institutions, where faculty are overstretched and underpaid.

For her part, Pickering Antonova came to feel guilty about putting students through the stress of seeing their Turnitin results.

They see their paper is showing up 60 percent plagiarized, and they have a heart attack, she said.

Plagiarism does not carry a legal definition. Institutions create their own plagiarism policies, and academic fields have norms about how to credit and cite sources in scholarly text. Plagiarism checkers are not designed with such nuance. It is up to users to follow up their algorithmic output with good, human judgment.

Jo Guldi, a professor of quantitative methods at Emory University, recently published The Dangerous Art of Text Mining: A Methodology for Digital History and jumped into the Gay plagiarism controversy with a now-deleted post on X before Christmas. She pointed out that computers can search for five-word overlaps in text but argued that such repetition does not equal plagiarism: the technology of text mining can be used to destroy the career of any scholar at any time, she wrote.

By phone, Guldi said that while she didnt cover plagiarism detection in her book, the parallel is clear. Her book traces bad conclusions reached because people fail to critically analyze the data. She, too, has used Turnitin in her classes and recognized the findings cannot be taken at face value.

You look at them and you see you have to apply judgment, she said. Its always a judgment call.

Many scholars, including those Gay is supposed to have plagiarized, have come to Gays defense over the course of the last month, arguing the text similarities highlighted do not rise to the level of plagiarism.

Machine Learning

Stanford study found AI detectors are biased against non-native English speakers

Yet her accuser has identified nearly 50 examples of overlap, pairing her writing with that of other scholars and insisting there is a pattern of academic misconduct. The sheer number of examplesand promise of more to comehelped seal Gays fate. And some scholars worry anyone with enemies could be next.

Ian Bogost, a professor at Washington University in St. Louis, mulled in The Atlantic what a full-bore plagiarism war could look like, running his own dissertation through iThenticate, a checker run by the same company as Turnitin that is marketed to researchers, publishers, and scholars.

Bill Ackman, a billionaire Harvard megadonor, signaled his commitment to participating in such a war after Business Insider launched its own grenade, publishing an analysis last week that accused his wife, Neri Oxman, of plagiarizing parts of her dissertation. Oxman got her Ph.D. at MIT in 2010 before joining the faculty and then leaving to become an entrepreneur. Suspecting someone from MIT encouraged Business Insider to take a closer look at her dissertation, Ackman posted on X that he was going to begin a review of the work of all current @MIT faculty members, President Kornbluth, other officers of the Corporation, and its board members for plagiarism.

He later added, Why would we stop at MIT? Dont we have to do a deep dive into academic integrity at Harvard as well? What about Yale, Princeton, Stanford, Penn, Dartmouth? You get the point.

Its unclear which tool Gays accuser used to identify their examples, but experts agree the accusations seem to come from a text comparison algorithm. A Markup analysis of five of Gays papers in the Grammarly and EasyBib plagiarism checkers did not turn up any of the plagiarism accusations that have surfaced in recent months. Grammarlys tool did flag instances of text overlap between Gays writing and other scholars, sometimes because they were citing her paper, but sometimes because the two authors were simply describing similar things. Gays 2017 political science paper A Room for Ones Own? is the subject of more than half a dozen accusations of plagiarism that Grammarly didnt flagbut the tool did, for example, suggest her line The estimated coefficients and standard errors from the may have been plagiarized from an article about diabetes in Bali.

Analyzing the same paper, Turnitin ignored several of the lines included in complaints against her but it did flag four from two academic papers. It also found other similarities, suggesting, for example, that the phrase receive a 10-year stream of tax credits warranted review.

Credit:Turnitin

David Smith, an associate professor of computer science at Northeastern University, has studied natural language processing and computational linguistics. He said plagiarism detection tools tend to start with what is called a null model. The algorithm is given very few assumptions and simply told to identify matching words across texts. To find examples in Gays writing, he said, it basically took people looking through the really low-precision output of these models.

Machine Learning

A Markup examination of a typical college shows how students are subject to a vast and growing array of watchful tech, including homework trackers, test-taking software, and even license platereaders

Somebody could have trained a better model that had higher precision, Smith said. That doesnt seem to be how it went in this case.

The result was a long list of plagiarism accusations most scholars found baffling.

Turnitin introduced its similarity check in 2000. Since then, plagiarism analyses have become the norm for editors of some academic journals as well as many college and university faculty members. Yet the tool is not universal. Many users, like Pickering Antonova, have decided the software isnt worth the time and dont align with their teaching goals. This has created two distinct classes of people: those who are subjected to plagiarism checkers and those who are not. For professional academics, Gays case highlights the concern that anyone with a high profile who makes the wrong enemy could quickly become part of the former group.

For students, its often just a matter of their schools norms. Plagiarism checkers can seem like a straightforward assessment of the originality of student work, reporting a percentage of the paper that may have been plagiarized. For faculty members who dont have the time to look at the dozens of false flags, it can be easy to rely on the total percentage and grade accordingly.

This behavior worries Smith, the computer scientist. Getting a quantification makes it easier to just judge a lot of student papers at scale, he said. Thats not whats going on in the Claudine Gay case but is troubling about whats going on with students subjection to these methods.

Tech companies have produced a steady stream of new tools for educators concerned with students cheating, including AI detectors that followed the widespread adoption of ChatGPT. With each new tool comes a promise of scientific accuracy and cutting-edge analysis of unbiased data.

But as Claudine Gays case demonstratesand the threat of the plagiarism wars promisesplagiarism detection is far from precise.

Follow this link:
Plagiarism Detection Tools Offer a False Sense of Accuracy The Markup - The Markup

Posted in Machine Learning | Comments Off on Plagiarism Detection Tools Offer a False Sense of Accuracy The Markup – The Markup

Minimizing the Reality Gap in Quantum Devices with Machine Learning – AZoQuantum

A major obstacle facing quantum devices has been solved by a University of Oxford study that leveraged machine learning capabilities. The results show how to bridge the reality gap, or the discrepancy between expected and observed behavior from quantum devices, for the first time. Physical Review X has published the findings.

Image Credit:metamorworks/Shutterstock.com

Numerous applications, such as drug development, artificial intelligence, financial forecasting, and climate modeling, might be significantly improved by quantum computing. However, this will necessitate efficient methods for combining and scaling separate quantum bits (also known as qubits). Inherent variability, which occurs when even seemingly similar units display distinct behaviors, is a significant obstacle to this.

It is assumed that nanoscale flaws in the materials utilized to create quantum devices are the source of functional variability. This internal disorder cannot be represented in simulations since these cannot be measured directly, which accounts for the discrepancy between expected and observed results.

The study team addressed this by indirectly inferring certain disease traits through the use of a physics-informed machine learning technique. This was predicated on how the devices intrinsic instability impacted the electron flow.

As an analogy, when we play crazy golf the ball may enter a tunnel and exit with a speed or direction that doesnt match our predictions. But with a few more shots, a crazy golf simulator, and some machine learning, we might get better at predicting the balls movements and narrow the reality gap.

Natalia Ares, Study Lead Researcher and Associate Professor, Department of Engineering Science, University of Oxford

One quantum dot device was used as a test subject, and the researchers recorded the output current across it at various voltage settings. A simulation was run using the data to determine the difference between the measured current and the theoretical current in the absence of an internal disturbance.

The simulation was forced to discover an internal disorder arrangement that could account for the results at all voltage levels by monitoring the current at numerous distinct voltage settings. Deep learning was combined with statistical and mathematical techniques in this method.

Ares added, In the crazy golf analogy, it would be equivalent to placing a series of sensors along the tunnel, so that we could take measurements of the balls speed at different points. Although we still cant see inside the tunnel, we can use the data to inform better predictions of how the ball will behave when we take the shot.

The novel model not only identified appropriate internal disorder profiles to explain the observed current levels, but it also demonstrated the ability to precisely forecast the voltage settings necessary for particular device operating regimes.

Most importantly, the model offers a fresh way to measure the differences in variability between quantum devices. This could make it possible to predict device performance more precisely and aid in the development of ideal materials for quantum devices. It could guide compensatory strategies to lessen the undesirable consequences of material flaws in quantum devices.

Similar to how we cannot observe black holes directly but we infer their presence from their effect on surrounding matter, we have used simple measurements as a proxy for the internal variability of nanoscale quantum devices. Although the real device still has greater complexity than the model can capture, our study has demonstrated the utility of using physics-aware machine learning to narrow the reality gap.

David Craig, Study Co-Author and PhD Student, Department of Materials, University of Oxford

Craig, D. L., et. al. (2023) Bridging the Reality Gap in Quantum Devices with Physics-Aware Machine Learning. Physical Review X. doi:10.1103/PhysRevX.14.011001

Source: https://www.ox.ac.uk/

See the rest here:
Minimizing the Reality Gap in Quantum Devices with Machine Learning - AZoQuantum

Posted in Machine Learning | Comments Off on Minimizing the Reality Gap in Quantum Devices with Machine Learning – AZoQuantum

Machine Learning for Predicting Oliguria in Intensive Care Units | Healthcare News – Medriva

Intensive care units (ICUs) are critical environments that deal with high-risk patients, where early detection of complications can significantly improve patient outcomes. Oliguria, a condition characterized by low urine output, is a common concern in ICUs and often signals acute kidney injury (AKI). Early prediction of oliguria can lead to timely intervention and better management of patients. Recent studies have shown that machine learning, a branch of artificial intelligence, can be effectively used to predict the onset of oliguria in ICU patients.

A retrospective cohort study aimed to develop and evaluate a machine learning algorithm for predicting oliguria in ICU patients. The study used electronic health record data from 9,241 patients admitted to the ICU between 2010 and 2019. The machine learning model demonstrated high accuracy in predicting the onset of oliguria at 6 hours and 72 hours with Area Under the Curve (AUC) values of 0.964 and 0.916, respectively. This suggests that the machine learning model can be a valuable tool for early identification of patients at risk of developing oliguria, enabling prompt intervention and optimal management of AKI.

The machine learning model identified several important variables for predicting oliguria. These included urine values, severity scores (SOFA score), serum creatinine, oxygen partial pressure, fibrinogen, fibrin degradation products, interleukin 6, and peripheral temperature. By taking into account these variables, the model was able to provide accurate predictions. The use of machine learning also allows for the continuous update and improvement of the model as more data becomes available, increasing its predictive accuracy over time.

Interestingly, the models accuracy varied based on several factors, including sex, age, and furosemide administration. This highlights the complex nature of predicting oliguria and the need for personalized, patient-specific models. It also underlines the potential of machine learning to adapt and learn from varying patient characteristics, providing more precise and individualized predictions.

The utilization of machine learning is not limited to predicting oliguria. Another study aimed to develop a machine learning model for early prediction of adverse events and treatment effectiveness in patients with hyperkalemia, a condition characterized by high levels of potassium in the blood. This study, too, achieved promising results, underscoring the potential of machine learning to revolutionize various aspects of patient care in the ICU setting.

The use of machine learning models in healthcare, and particularly in intensive care units, is a promising avenue for improving patient outcomes. By predicting the onset of conditions like oliguria, these models can provide critical early warnings that allow healthcare providers to intervene promptly. However, its crucial to remember that these models are tools to assist clinicians and not replace their judgment. As research continues and more data becomes available, these models are expected to become even more accurate and valuable in the future.

View post:
Machine Learning for Predicting Oliguria in Intensive Care Units | Healthcare News - Medriva

Posted in Machine Learning | Comments Off on Machine Learning for Predicting Oliguria in Intensive Care Units | Healthcare News – Medriva

Machine Learning in Business: 5 things a Data Science course won’t teach you – Towards Data Science

The author shares some important aspects of Applied Machine Learning that can be overlooked in formal Data Science education.

If you feel that I used a clickbaity title for this article, Id agree with you but hear me out! I have managed multiple junior data scientists over the years and in the last few years I have been teaching an applied Data Science course to Masters and PhD students. Most of them have great technical skills but when it comes to applying Machine Learning to real-world business problems, I realized there were some gaps.

Below are the 5 elements that I wish data scientists were more aware of in a business context:

Im hoping that reading this will be helpful to junior and mid-level data scientists to grow their career!

In this piece, I will focus on a scenario where data scientists are tasked with deploying machine learning models to predict customer behavior. Its worth noting that the insights can be applicable to scenarios involving product or sensor behaviors as well.

Lets start with the most critical of all: the What that you are trying to predict. All subsequent steps data cleaning, preprocessing, algorithm, feature engineering, hyperparameters optimization become futile unless you are focusing on the right target.

In order to be actionable, the target must represent a behavior, not a data point.

Ideally, your model aligns with a business use case, where actions or decisions will be based on its output. By making sure the target you are using is a good representation of a customer behavior, it is easy for the business to understand and utilize these models outputs.

Read more from the original source:
Machine Learning in Business: 5 things a Data Science course won't teach you - Towards Data Science

Posted in Machine Learning | Comments Off on Machine Learning in Business: 5 things a Data Science course won’t teach you – Towards Data Science

Genomic prediction in multi-environment trials in maize using statistical and machine learning methods | Scientific … – Nature.com

Phenotypic data

The data are composed of 265 single cross hybrids from the maize breeding program of Embrapa Maize and Sorghum evaluated in eight combinations of trials/locations/years under irrigated trials (WW) and water stress (WS) conditions at two locations in Brazil (JanabaMinas Gerais and TeresinaPiau) over two years (2010 and 2011). The hybrids were obtained from crosses between 188 inbred lines and two testers. The inbred lines belong to heterotic groups: dent (85 inbred lines), flint (86 inbred lines), and an additional group, referred to as group C (17 inbred lines), which is unrelated to the dent and flint origins. The two testers are inbred lines belonging to the flint (L3) and dent (L228-3) groups. Among the inbred lines, 120 were crossed with both testers, 52 were crossed with the L228-3 tester only, and 16 lines were crossed with the L3 tester only. Silva et al. (2020) evaluated the genetic diversity and heterotic groups in the same database. These authors showed the existence of subgroups within each heterotic group. Therefore, once these groups were not genetically well defined and the breeding program from Embrapa Maize was in the beginning, the effect of allelic substitution in both groups are assumed to be the same. More details on the experimental design and procedures can be found in Dias et al.13,30.

The experiment originally included 308 entries, but hybrids that were not present in all environments were also removed to evaluate the genomic prediction within each environment, resulting in a total of 265 hybrids for analysis. Each trial consisted of 308 maize single cross hybrids, randomly divided into six sets: sets 13 for crosses with L3 (61, 61, and 14 hybrids each), and sets 46 for crosses with L228-3 (80, 77, and 15 hybrids each). Four checks (commercial maize cultivars) were included in each set, and the experiment was designed in completely randomized blocks. Between trials, hybrids within each set remained the same, but hybrids and checks were randomly allocated into groups of plots within each set. This allocation varied between replicates of sets and between trials. The WS trials had three replications, except for the set containing 15 hybrids and the trials evaluated in 2010, which had two replications. All WW trials, except for the trial in 2011, had two replicates.

Two agronomic traits related to drought tolerance were analyzed: grain yield (GY) and female flowering time (FFT). GY was determined by weighing all grains in each plot, adjusted for 13% grain moisture, and converted to tons per hectare (t/ha), accounting for differences in plot sizes across trials. FFT was measured as the number of days from sowing until the stigmas appeared in 50% of the plants. A summary of means, standard deviations, and ranges of both evaluated traits are available in Table 1.

To conduct the analyses, hybrids considered as outliers were removed (i.e., hybrids that presented phenotypic values greater than 1.5interquartile range above the third quartile or below the first quartile) for the GY and FFT traits. The variations in predictive abilities among hybrids of T2, T1, and T0 are widely recognized31. However, the primary aim of our study was to compare different prediction methodologies in MET assays. In this study, there were 240 T2 hybrids and 68 T1 hybrids, with T2 hybrids had both parents evaluated in different hybrid combinations, while hybrids being single-cross hybrids sharing one parent with the tested hybrids. Given the realistic nature of our scenario, we have a limited and imbalanced distribution of these hybrid groups, making a fair comparison challenging. Consequently, we opted to construct a training set comprising T2 and T0 hybrids.

To correct the phenotypic values for experimental design effects, each trial (WW and WS) and environment were analyzed independently to obtain the Best Linear Unbiased Estimator (eBLUEs) for each hybrid, for the two traits evaluated. The estimates were obtained based on the following model:

$${varvec{y}} = 1mu + user2{ X}_{1} {varvec{r}} + {varvec{X}}_{2} {varvec{s}} + {varvec{X}}_{3} {varvec{h}} + {varvec{e}}$$

(1)

where (user2{y }left( {n times 1} right)) is the phenotype vector for (f) replicates, (t) sets of (p) hybrids, and (n) is the number of observations; (mu) is the mean; (user2{r }left( {f times 1} right)) is the fixed effect vector of the replicates; (user2{s }left( {t times 1} right)) is the fixed effect vector of the sets; ({varvec{h}}) (left( {p times 1} right)) is the fixed effect vector of the hybrids; and (user2{e }left( {k times 1} right)) is the residue vector, with (user2{ e} sim ,MVNleft( {0,{varvec{I}}sigma_{e}^{2} } right)), where ({varvec{I}}) is an identity matrix of corresponding order, and (sigma_{e}^{2}) the residual variance. ({varvec{X}}_{{1user2{ }}} left( {k times f} right)), ({varvec{X}}_{{2user2{ }}} left( {k times t} right)) e ({varvec{X}}_{{3user2{ }}} left( {k times p} right)) represents incidence matrices for their respective effects. The eBLUES of each environment were used in further analyses.

A total of 57,294 Single Nucleotide Polymorphisms (SNPs) markers were obtained from 188 inbred lines, and two testers used as parents of the 265 single cross hybrids. The genotyping by sequencing (GBS) strategyare detailed in Dias et al.13. For the quality control, SNPs were discarded if: the minor allele frequency was smaller than 5%, more than 20% of missing genotypes were found, and/or there were more than 5% of heterozygous genotypes. After filtering, missing data were imputed using NPUTE. Then, for each SNP, the genotypes of the hybrids were inferred based on the genotype of their parents (inbred line and tester). The number of SNPs per chromosome ranged from 3121 (chromosome 10) to 7705 (chromosome 1), totalizing 47,127 markers.

The additive and dominance genomic relationship matrices were constructed32 based on information from the SNPs using the package AGHmatrix33, following VanRaden34 and Vitezica et al., respectively.

Genomic predictions were performed using the Genomic Best Line Unbiased Prediction (GBLUP) method using the package AsReml v. 436. Two groups were considered: the first group comprised four environments under WW conditions, and the second included four environments under WS conditions. The linear model is described below:

$$overline{user2{y}} = mu 1 + user2{ Xb} + {varvec{Z}}_{1} {varvec{u}}_{{varvec{a}}} + {varvec{Z}}_{2} {varvec{u}}_{{varvec{d}}} + {varvec{e}}$$

(2)

where (user2{overline{y} }left( {pq times 1} right)) is the vector of eBLUES previously estimated for each environment with (p) hybrids and (q) environments;(mu) is the mean; (user2{b }left( {q times 1} right)) is the vector of environmental effects (fixed); ({varvec{u}}_{{user2{a }}} left( {pq times 1} right)) is the vector of individual additive genetic values nested within environments (random), with ({varvec{u}}_{{varvec{a}}} sim MVNleft( {0,left[ {{varvec{I}}_{{varvec{q}}} sigma_{{u_{a} }}^{2} + rho_{a} left( {{varvec{J}}_{{varvec{q}}} - {varvec{I}}_{{varvec{q}}} } right)} right] otimes {varvec{A}}} right)), where ({varvec{A}}) is the genomic relationship matrix between individuals for additive effects, (rho_{a}) is the additive genetic correlation coefficient between environments, ({varvec{I}}_{{varvec{q}}} user2{ }left( {q times q} right)) is an identity matrix, ({varvec{J}}_{{varvec{q}}} user2{ }left( {q times q} right)) is a matrix of ones, and (otimes) denotes the Kronecker product; ({varvec{u}}_{{varvec{d}}}) (left( {pq times 1} right)) is the vector of individual dominance genetic values nested within environments (random), with ({varvec{u}}_{{varvec{d}}} sim MVNleft( {0,left[ {{varvec{I}}_{{varvec{q}}} sigma_{{u_{d} }}^{2} + rho_{d} left( {{varvec{J}}_{{varvec{q}}} - {varvec{I}}_{{varvec{q}}} } right)} right] otimes {varvec{D}}} right)), where ({varvec{D}}) is the genomic relationship matrix between individuals for dominance effects, (rho_{{varvec{d}}}) is the dominance correlation coefficient between environments; ({varvec{e}}) (left( {pq times 1} right)) is the random residuals vector with ({varvec{e}}sim MVNleft( {0,{varvec{I}}sigma_{e}^{2} } right)). The capital letters (user2{X }left( {pq times q} right),user2{ Z}_{1} left( {pq times pq} right)) and ({varvec{Z}}_{2} user2{ }left( {pq times pq} right)) represent the incidence matrices for their respective effects, (1user2{ }left( {pq times 1} right)) is a vector of ones. The (co)variance components were obtained using the residual maximum likelihood method (REML)37.

Two alternative models were also used. The first for genomic prediction retained only additive effects by removing ({varvec{u}}_{{varvec{d}}}) from Eq.(2). The second model was used to estimate the genetic parameters within each environment separately.

The significance of random effects was tested using the Likelihood Ratio Test (LRT)38, given by:

$$LRT = 2*left( {LogL_{c} - LogL_{r} } right)$$

(3)

where (LogL_{c}) is the logarithm of the likelihood function of the complete model (with all effects included), and (LogL_{r}) is the logarithm of the restricted likelihood function of the reduced model (without the effect under test). Effect significance was tested by LRT using the chi-square (X2) probability density function with a degree of freedom and significance level of 5%39.

The narrow-sense heritability (({ }h^{2})), the proportion of variance explained by dominance effects ((d^{2})), and the broad-sense heritability (left( {H^{2} } right)) for each trait were estimated following Falconer and Mackay 199635.

Similar to the previous topic, the trials were divided between WW and WS conditions, and the potential of regression trees (RT) was explored using the following three algorithms: bagging, random forest, and boosting22. Bagging (Bag) is a methodology that aims to reduce the RT variance22. In other words, it consists of obtaining D samples with available sampling replacement, thus obtaining D models (hat{f}^{1} left( x right), hat{f}^{2} left( x right), ldots , hat{f}^{D} left( x right)), and finally use the generated models to obtain an average, given by:

$$hat{f}_{medio} left( x right) = frac{1}{D}mathop sum limits_{d = 1}^{D} hat{f}^{d} left( x right)$$

(4)

This decreases the variability obtained in the decision trees. The number of trees used in Bag is not a parameter that will result in overfitting of the model. In practice, a number of trees is used until the error has stabilized22. The number of trees sampled for Bag was set at 500 trees.

Random forest (RF) was proposed by HO40 and it is an improvement of Bag to avoid the high correlation of the trees and to improve the accuracy in the selection of individuals. RF changes only the number of predictor variables used in each split. That is, each time a split in a tree is considered, a random sample of (m) variables is chosen as candidates from the complete set of (p) variables. Hastie et al.21 suggest that the number of predictor variables used in each partition is equal to (m = frac{p}{3}) for regression trees. The number of trees for the RF was set at 500.

Boosting uses RT by adjusting the residual of an initial model. The residual is updated with each tree that grows sequentially from the previous tree's residual, and the response variable involves a combination of a large number of trees, such that:

$$hat{f}left( x right) = mathop sum limits_{b = 1}^{B} {uplambda } hat{f}^{b} left( x right)$$

(5)

The function (hat{f}left( . right)) refers to the final tree combined with sequentially adjusted trees, and is the shrinkage parameter that controls the learning rate of the method. Furthermore, this method needs to be adjusted with several splits in each of the trees. This parameter controls the complexity of the Boost and is known as the depth. For Boosting, the number of trees sampled was 250, with a learning rate of 0.1 and a depth of 3.

To perform hybrid prediction for each environment based on MET dataset, we propose the incorporation of location and year information in which the experiments were carried out as factors in the data input file together with SNPs markers as predictors in machine learning methodologies. As a response variable, the eBLUEs previously estimated by Eq.(1) were used.

For the construction of the bagging and random forest models, the randomForest function from the package randomForest41 was used. Finally, the package's gbm function gbm42 was used for boosting. All analyzes were implemented in the software R43.

Genomic predictions were carried out following Burgueo et al.16, considering two different prediction problems, CV1 and CV2, which simulate two possible scenarios a breeder can face. In CV1, the ability of the algorithms to predict the performance of hybrids that have not yet been evaluated in any field trial was evaluated. Thus, predictions derived from the CV1 scenario are entirely based on phenotypic and genotypic records from other related hybrids. In CV2, the ability of the algorithms to predict the performance of hybrids using data collected in other environments was evaluated. It simulates the prediction problem found in incomplete MET trials. Here, information from related individuals is used, and the prediction can benefit from genetic relationships between hybrids and correlations between environments. Within the CV2 scenario, two different situations of data imbalance were evaluated. In the first, called CV2 (50%), the tested hybrids were not present in half of the environments, while in the second, called CV2 (25%), the tested hybrids were not present in only 25% of the environments. Table 2 provides a hypothetical representation of this CV1, CV2 (50%), and CV2 (25%) validation scheme.

To separate the training and validation sets, the k-folds procedure was used, considering (k = 5). The set of 265 hybrids was divided into five groups, with 80% of the hybrids considered as the training population, and the remaining 20% hybrids considered as the validation population. The hybrids were separated into sets proportionally containing all the crosses performed (DentDent, DentFlint, FlintFlint, CDent, CFlint). The cross-validation process was performed separately for each trait, condition (WS or WW) and scenario (CV1, CV2-50% and CV2-25%) and was repeated five times to assess the predictive ability of the analyses.

The predictive ability within each environment for the conditions (WS and WW) was estimated by the Pearson correlation coefficient44 between the corrected phenotypic values (eBLUES) of Eq.(1) for each environment and the GEBVs predicted by each fitted method.

The authors confirm that all methods were carried out by relevant guidelines in the method section. The authors also confirm that the handling of the plant materials used in the study complies with relevant institutional, national, and international guidelines and legislation.

The authors confirm that the appropriate permissions and/or licenses for collection of plant or seed specimens are taken.

Read more:
Genomic prediction in multi-environment trials in maize using statistical and machine learning methods | Scientific ... - Nature.com

Posted in Machine Learning | Comments Off on Genomic prediction in multi-environment trials in maize using statistical and machine learning methods | Scientific … – Nature.com

The 11 Best AI Tools for Data Science to Consider in 2024 – Solutions Review

Solutions Reviews listing of the best AI tools for data science is an annual sneak peek of the top tools included in our Buyers Guide for Data Science and Machine Learning Platforms. Information was gathered via online materials and reports, conversations with vendor representatives, and examinations of product demonstrations and free trials.

The editors at Solutions Review have developed this resource to assist buyers in search of the best AI tools for data science to fit the needs of their organization. Choosing the right vendor and solution can be a complicated process one that requires in-depth research and often comes down to more than just the solution and its technical capabilities. To make your search a little easier, weve profiled the best AI tools for data science all in one place. Weve also included platform and product line names and introductory software tutorials straight from the source so you can see each solution in action.

Note: The best AI tools for data science are listed in alphabetical order.

Platform: DataRobot Enterprise AI Platform

Related products: Paxata Data Preparation, Automated Machine Learning, Automated Time Series, MLOps

Description: DataRobot offers an enterprise AI platform that automates the end-to-end process for building, deploying, and maintaining AI. The product is powered by open-source algorithms and can be leveraged on-prem, in the cloud or as a fully-managed AI service.DataRobotincludesseveralindependent but fully integrated tools (PaxataData Preparation,Automated Machine Learning, Automated Time Series,MLOps, and AI applications), and each can be deployed in multiple ways to match business needs and IT requirements.

Platform: H2O Driverless AI

Related products: H2O 3, H2O AutoML for ML, H2O Sparkling Water for Spark Integration, H2O Wave

Description: H2O.ai offers a number of AI and data science products, headlined by its commercial platform H2O Driverless AI. Driverless AI is a fully open-source, distributed in-memory machine learning platform with linearscalability. H2O supports widely used statistical and machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more. H2O has also developedAutoMLfunctionality that automatically runs through all the algorithms to produce a leaderboard of the best models.

Platform: IBM Watson Studio

Related products: IBM Cloud Pak for Data, IBM SPSS Modeler, IBM Decision Optimization, IBM Watson Machine Learning

Description: IBM Watson Studio enables users to build, run, and manage AI models at scale across any cloud. The product is a part of IBM Cloud Pak for Data, the companys main data and AI platform. The solution lets you automate AI lifecycle management, govern and secure open-source notebooks, prepare and build models visually, deploy and run models through one-click integration, and manage and monitor models with explainable AI. IBM Watson Studio offers a flexible architecture that allows users to utilize open-source frameworks likePyTorch, TensorFlow, and scikit-learn.

https://www.youtube.com/watch?v=rSHDsCTl_c0

Platform: KNIME Analytics Platform

Related products: KNIME Server

Description: KNIME Analytics is an open-source platform for creating data science. It enables the creation of visual workflows via a drag-and-drop-style graphical interface that requires no coding. Users can choose from more than 2000 nodes to build workflows, model each step of analysis, control the flow of data, and ensure work is current. KNIME can blend data from any source and shape data to derive statistics, clean data, and extract and select features. The product leverages AI and machine learning and can visualize data with classic and advanced charts.

Platform: Looker

Related products: Powered by Looker

Description: Looker offers a BI and data analytics platform that is built on LookML, the companys proprietary modeling language. The products application for web analytics touts filtering and drilling capabilities, enabling users to dig into row-level details at will. Embedded analytics in Powered by Looker utilizes modern databases and an agile modeling layer that allows users to define data and control access. Organizations can use Lookers full RESTful API or the schedule feature to deliver reports by email or webhook.

Platform: Azure Machine Learning

Related products:Azure Data Factory, Azure Data Catalog, Azure HDInsight, Azure Databricks, Azure DevOps, Power BI

Description: The Azure Machine Learning service lets developers and data scientists build, train, and deploy machine learning models. The product features productivity for all skill levels via a code-first and drag-and-drop designer, and automated machine learning. It also features expansiveMLopscapabilities that integrate with existing DevOps processes. The service touts responsible machine learning so users can understand models with interpretability and fairness, as well as protect data with differential privacy and confidential computing. Azure Machine Learning supports open-source frameworks and languages likeMLflow, Kubeflow, ONNX,PyTorch, TensorFlow, Python, and R.

Platform: Qlik Analytics Platform

Related products: QlikView, Qlik Sense

Description: Qlik offers a broad spectrum of BI and analytics tools, which is headlined by the companys flagship offering, Qlik Sense. The solution enables organizations to combine all their data sources into a single view. The Qlik Analytics Platform allows users to develop, extend and embed visual analytics in existing applications and portals. Embedded functionality is done within a common governance and security framework. Users can build and embed Qlik as simple mashups or integrate within applications, information services or IoT platforms.

Platform: RapidMiner Studio

Related products:RapidMiner AI Hub, RapidMiner Go, RapidMiner Notebooks, RapidMiner AI Cloud

Description: RapidMiner offers a data science platform that enables people of all skill levels across the enterprise to build and operate AI solutions. The product covers the full lifecycle of the AI production process, from data exploration and data preparation to model building, model deployment, and model operations. RapidMiner provides the depth that data scientists needbut simplifies AI for everyone else via a visual user interface that streamlines the process of building and understanding complex models.

Platform: SAP Analytics Cloud

Related products:SAP BusinessObjects BI, SAP Crystal Solutions

Description: SAP offers a broad range of BI and analytics tools in both enterprise and business-user driven editions. The companys flagship BI portfolio is delivered via on-prem (BusinessObjects Enterprise), and cloud (BusinessObjects Cloud) deployments atop the SAP HANA Cloud. SAP also offers a suite of traditional BI capabilities for dashboards and reporting. The vendors data discovery tools are housed in the BusinessObjects solution, while additional functionality, including self-service visualization, are available through the SAP Lumira tool set.

Platform: Sisense

Description: Sisense makes it easy for organizations to reveal business insight from complex data in any size or format. The product allows users to combine data and uncover insights in a single interface without scripting, coding or assistance from IT. Sisense is sold as a single-stack solution with a back end for preparing and modeling data. It also features expansive analytical capabilities, and a front-end for dashboarding and visualization. Sisense is most appropriate for organizations that want to analyze large amounts of data from multiple sources.

Platform: Tableau Desktop

Related products:Tableau Prep, Tableau Server, Tableau Online, Tableau Data Management

Description: Tableau offers an expansive visual BI and analytics platform, and is widely regarded as the major player in the marketplace. The companys analytic software portfolio is available through three main channels: Tableau Desktop, Tableau Server, and Tableau Online. Tableau connects to hundreds of data sources and is available on-prem or in the cloud. The vendor also offers embedded analytics capabilities, and users can visualize and share data with Tableau Public.

Visit link:
The 11 Best AI Tools for Data Science to Consider in 2024 - Solutions Review

Posted in Machine Learning | Comments Off on The 11 Best AI Tools for Data Science to Consider in 2024 – Solutions Review