** Python Certification Training: https://www.edureka.co/python **
This Edureka tutorial on "Python Tutorial for Beginners" (Python Blog Series: https://goo.gl/nKQJHQ) covers all the basics of Python. It includes Python programming examples, so try them yourself and mention in the comments section if you have any doubts. The following topics are included in this PPT:
Introduction to Python
Reasons to choose Python
Installing and running Python
Development Environments
Basics of Python Programming
Starting with code
Python Operators
Python Lists
Python Tuples
Python Sets
Python Dictionaries
Conditional Statements
Looping in Python
Python Functions
Python Arrays
Classes and Objects (OOP)
Conclusion
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
This document provides an overview of tools and techniques for data analysis in Python. It discusses popular Python libraries for data analysis like NumPy, pandas, and matplotlib. It also provides examples of importing datasets, working with Series and DataFrames, merging datasets, and using GroupBy to aggregate data. The document is intended as a tutorial for getting started with data analysis and visualization using Python.
This document discusses using the Seaborn library in Python for data visualization. It covers installing Seaborn, importing libraries, reading in data, cleaning data, and creating various plots including distribution plots, heatmaps, pair plots, and more. Code examples are provided to demonstrate Seaborn's functionality for visualizing and exploring data.
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014 (Julien Le Dem)
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through techniques like delta encoding, dictionary encoding, run-length encoding and binary packing designed for CPU and cache optimizations. Benchmark results show Parquet provides much better compression and faster query performance than other formats like text, Avro and RCFile. The project is developed as an open source community with contributions from many organizations.
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag... (Databricks)
Structured Streaming provides stateful stream processing capabilities in Spark SQL through built-in operations like aggregations and joins as well as user-defined stateful transformations. It handles state automatically through watermarking to limit state size by dropping old data. For arbitrary stateful logic, MapGroupsWithState requires explicit state management by the user.
pandas: a Foundational Python Library for Data Analysis and Statistics (Wes McKinney)
Pandas is a Python library for data analysis and manipulation. It provides high performance tools for structured data, including DataFrame objects for tabular data with row and column indexes. Pandas aims to have a clean and consistent API that is both performant and easy to use for tasks like data cleaning, aggregation, reshaping and merging of data.
Python is the language of choice for data analysis.
The aim of these slides is to provide a comprehensive learning path for people new to Python for data analysis, giving an overview of the steps you need to take to use Python for data analysis.
This document provides an overview of Python for data analysis using the pandas library. It discusses key pandas concepts like Series and DataFrames for working with one-dimensional and multi-dimensional labeled data structures. It also covers common data analysis tasks in pandas such as data loading, aggregation, grouping, pivoting, filtering, handling time series data, and plotting.
Pandas is a powerful Python library for data analysis and manipulation. It provides rich data structures for working with structured and time series data easily. Pandas allows for data cleaning, analysis, modeling, and visualization. It builds on NumPy and provides data frames for working with tabular data similarly to R's data frames, as well as time series functionality and tools for plotting, merging, grouping, and handling missing data.
Python Pandas is a powerful library for data analysis and manipulation. It provides rich data structures and methods for loading, cleaning, transforming, and modeling data. Pandas allows users to easily work with labeled data and columns in tabular structures called Series and DataFrames. These structures enable fast and flexible operations like slicing, selecting subsets of data, and performing calculations. Descriptive statistics functions in Pandas allow analyzing and summarizing data in DataFrames.
Apache Phoenix: Past, Present and Future of SQL over HBase (enissoz)
HBase, as the NoSQL database of choice in the Hadoop ecosystem, has already proven itself at scale and in many mission-critical workloads in hundreds of companies. Phoenix, as the SQL layer on top of HBase, has increasingly become the tool of choice as the perfect complement to HBase. Phoenix is now being used more and more for very low-latency querying and fast analytics across a large number of users in production deployments. In this talk, we will cover what makes Phoenix attractive among current and prospective HBase users, such as SQL support, JDBC, data modeling, secondary indexing, and UDFs, and also go over recent improvements like Query Server, ODBC drivers, ACID transactions, and Spark integration. We will conclude by looking at items in the pipeline and how Phoenix and HBase interact with other engines like Hive and Spark.
Introduction to Python Pandas for Data Analytics (Phoenix)
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, medical...
The document provides a cheat sheet summarizing key concepts in Python, Pandas, NumPy, Scikit-Learn, and data visualization with Matplotlib and Seaborn. It includes sections on Python basics like data types, variables, lists, dictionaries, functions, modules, and control flow. Sections on Pandas cover data structures, loading/saving data, selection, merging, grouping, pivoting and visualization. NumPy sections cover arrays, array math, manipulation, aggregation and subsetting. Scikit-Learn sections cover the machine learning workflow and common algorithms.
Pandas is an open-source Python library used for data manipulation and analysis. It allows users to extract data from files like CSVs into DataFrames and perform statistical analysis on the data. DataFrames are the primary data structure and allow storage of heterogeneous data in tabular form with labeled rows and columns. Pandas can clean data by removing missing values, filter rows/columns, and visualize data using Matplotlib. It supports Series, DataFrames, and Panels for 1D, 2D, and 3D labeled data structures.
Exploratory data analysis in R - Data Science Club (Martin Bago)
How do you analyse a new dataset in R? Which libraries and commands should you use? How can you understand your dataset in a few minutes? Read my presentation for the Data Science Club by Exponea and find out!
With Lakehouse as the future of data architecture, Delta becomes the de facto data storage format for all data pipelines. By using Delta to build curated data lakes, users achieve efficiency and reliability end to end. Curated data lakes involve multiple hops in the end-to-end data pipeline, which are executed regularly (mostly daily) depending on the need. As data travels through each hop, its quality improves and becomes suitable for end-user consumption. Real-time capabilities are also key for any business and an added advantage; luckily, Delta has seamless integration with Structured Streaming, which makes it easy for users to achieve real-time capability using Delta. Overall, Delta Lake as a streaming source is a marriage made in heaven for various reasons, and we are already seeing a rise in adoption among our users.
In this talk, we will discuss the various functional components of Structured Streaming with Delta as a streaming source. We will deep-dive into Query Progress Logs (QPL) and their significance for operating streams in production: how to track the progress of any streaming job and map it to the source Delta table using QPL, what exactly gets persisted in the checkpoint directory, and how the contents of the checkpoint directory map to the QPL metrics for Delta streams.
Graph databases are well-suited for storing and querying multi-relational data. They provide better performance, flexibility, and agility than relational databases for such data. Tests showed graph databases like Neo4j outperforming relational databases by returning results faster and for more records as depth and complexity of queries increased. Cypher is the query language for Neo4j that allows starting queries, matching patterns, returning and filtering results through clauses like START, MATCH, RETURN, and WHERE. Graph databases are used successfully by many large companies needing to handle complex relationships in data.
These slides present how DBT, Coral, and Iceberg can provide a novel data management experience for defining SQL workflows. In this UX, users define their workflows as a cascade of SQL queries, which then get auto-materialized and incrementally maintained. Applications of this user experience include Declarative DAG workflows, streaming/batch convergence, and materialized views.
Presto is a distributed SQL query engine that allows users to run SQL queries against various data sources. It consists of three main components - a coordinator, workers, and clients. The coordinator manages query execution by generating execution plans, coordinating workers, and returning final results to the client. Workers contain execution engines that process individual tasks and fragments of a query plan. The system uses a dynamic query scheduler to distribute tasks across workers based on data and node locality.
How to use Parquet as a basis for ETL and analytics (Julien Le Dem)
Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration into most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self-describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.
Odoo Experience 2018 - How to Break Odoo Security (or how to prevent it) (ElínAnna Jónasdóttir)
Odoo's security model uses multi-level access controls to restrict data access through groups, access control lists (ACLs), and rules at both the model and field level. Common vulnerabilities include injection, improper access controls, information leaks, and cross-site scripting. To break Odoo's security, one would try to exploit vulnerabilities like SQL injection, accessing data without proper permissions, or leaking sensitive information through unsafe domain combinations.
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you'll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook's performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings compared to Hive. You'll also hear about real-world applications of bucketing, like loading of cumulative tables with daily deltas, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
Gurpreet Singh from Microsoft gave a talk on scaling Python for data analysis and machine learning using DASK and Apache Spark. He discussed the challenges of scaling the Python data stack and compared options like DASK, Spark, and Spark MLlib. He provided examples of using DASK and PySpark DataFrames for parallel processing and showed how DASK-ML can be used to parallelize Scikit-Learn models. Distributed deep learning with tools like Project Hydrogen was also covered.
The document discusses various operators and control flow statements in Python. It covers arithmetic, comparison, logical, assignment and membership operators. It also covers if-else conditional statements, while and for loops, and break, continue and pass statements used with loops. The key points are:
- Python supports operators like +, -, *, / etc. for arithmetic and ==, !=, >, < etc. for comparison.
- Control flow statements include if-else for conditional execution, while and for loops for repetition, and break/continue to control loop flow.
- The while loop repeats as long as the condition is true. for loops iterate over sequences like lists, tuples using a loop variable.
The document provides an overview of a Python programming course taught by Dr. C. Sreedhar. The course covers topics like the history of Python, installing Python, data types, operators, expressions, functions, and more. It includes code examples for basic programs to calculate area and perimeter, check if a number is even or odd, and determine if a number is divisible by 4 and 9. The document contains lecture slides with explanations and syntax for various Python concepts.
The document discusses various data structures and their operations. It begins with an introduction to linear data structures like arrays, linked lists, stacks and queues. It describes array implementation using sequential storage and linked list implementation using pointers. Common operations on these structures like traversal, insertion, deletion and searching are discussed. The document also covers non-linear data structures like trees and graphs and basic tree traversals. It provides examples of applications of different data structures and concludes with definitions of key terms.
This is the third part of the Business Intelligence and Data Warehouse course.
If you have any questions, remarks or suggestions, feel free to send them to me by email:
[email protected].
Happy reading.
Python provides similar functionality to R for data analysis and machine learning tasks. Key differences include using import statements to load packages rather than library(), and minor syntactic variations such as brackets [] instead of parentheses (). Common data analysis operations like reading data, creating data frames, applying machine learning algorithms, and visualizing results can be performed in both languages.
This document provides an agenda for an R programming presentation. It includes an introduction to R, commonly used packages and datasets in R, basics of R like data structures and manipulation, looping concepts, data analysis techniques using dplyr and other packages, data visualization using ggplot2, and machine learning algorithms in R. Shortcuts for the R console and IDE are also listed.
"Practical Data Science": the R programming language and Jupyter notebooks are used in this tutorial. However, the concepts are generic and can be applied by Python or other programming language users as well.
R is a free programming language and software environment for statistical analysis and graphics. It contains functions for data manipulation, calculation, and graphical displays. Some key features of R include being free, running on multiple platforms, and having extensive statistical and graphical capabilities. Common object types in R include vectors, matrices, data frames, and lists. R also has packages that add additional functions.
R is a free software environment for statistical analysis and graphics. It allows importing, cleaning, analyzing, and visualizing data. Key features include its ability to read various data formats, perform statistical analyses and modeling, and produce publication-quality graphs. R has a steep learning curve but is highly extensible and supports a wide range of statistical techniques through its packages. This document provides an introduction to obtaining and installing R, performing basic tasks like importing data and help functions, and using R for descriptive statistics, statistical modeling, and multivariate analyses.
Slides on introduction to R by ArinBasu MD (SonaCharles2)
R is a free software environment for statistical analysis and graphics. It allows importing, cleaning, analyzing, and visualizing data. Key features include its ability to read various data formats, perform statistical analyses and modeling, and produce publication-quality graphs. R has a steep learning curve but is highly extensible and supports a wide range of statistical techniques through its packages. This document provides an introduction to obtaining and installing R, performing basic tasks like importing data and help functions, and using R for descriptive statistics, statistical modeling, and multivariate analyses.
R is a free software environment for statistical analysis and graphics. It allows importing, cleaning, analyzing, and visualizing data. Key features include its ability to handle many types of data, produce high-quality graphs, and implement a wide variety of statistical techniques like regression. R has a steep learning curve but a strong user community and implements advanced statistical methods. It can effectively store, manipulate, and summarize data.
This document introduces the R programming language. It covers obtaining and installing R, reading and exporting data, and performing basic statistical analyses and econometrics. R can be used for statistical analysis, modeling, and data visualization. It has a steep learning curve but is free, open source software with a strong user community and implements many advanced statistical methods.
This document provides an introduction to data analysis techniques using Python. It discusses key Python libraries for data analysis like NumPy, Pandas, SciPy, Scikit-Learn and libraries for data visualization like matplotlib and Seaborn. It covers essential concepts in data analysis like Series, DataFrames and how to perform data cleaning, transformation, aggregation and visualization on data frames. It also discusses statistical analysis, machine learning techniques and how big data and data analytics can work together. The document is intended as an overview and hands-on guide to getting started with data analysis in Python.
The document discusses various Python libraries used for data science tasks. It describes NumPy for numerical computing, SciPy for algorithms, Pandas for data structures and analysis, Scikit-Learn for machine learning, Matplotlib for visualization, and Seaborn which builds on Matplotlib. It also provides examples of loading data frames in Pandas, exploring and manipulating data, grouping and aggregating data, filtering, sorting, and handling missing values.
Advanced Data Analytics with R Programming.ppt (Anshika865276)
R is a software environment for statistical analysis and graphics. It allows users to import, clean, analyze, and visualize data. Key features include importing data from various sources, conducting descriptive statistics and statistical modeling, and creating publication-quality graphs. R has a steep learning curve but is highly extensible and supports a wide range of statistical techniques through its base functionality and contributed packages.
Exploratory Data Analysis (EDA) is an approach to gain insights from data through cleaning, statistical analysis, and visualization. The basic steps of EDA involve importing data into a workspace, calculating descriptive statistics like mean and median, detecting and handling missing values, and creating visualizations like bar graphs and scatter plots to identify patterns and outliers. EDA in Pandas includes importing data, finding descriptive statistics, removing null values, and visualizing the data distribution and relationships between variables.
This document provides a side-by-side comparison of code samples in R and Python for common data science tasks. It covers topics like IDEs, data operations, manipulation, visualization, machine learning, text mining and more. For each topic, the document lists the main packages/functions used in R and Python and provides brief code examples. The goal is to give beginners a basic introduction to how similar tasks are accomplished in both languages.
R is a programming language and free software used for statistical analysis and graphics. It allows users to analyze data, build statistical models and visualize results. Key features of R include its extensive library of statistical and graphical methods, machine learning algorithms, and ability to handle large and complex data. R is widely used in academia and industry for data science tasks like data analysis, modeling, and communicating results.
R is a programming language and free software used for statistical analysis and graphics. It allows users to analyze data, create visualizations and build predictive models. Key features of R include its extensive library of statistical and machine learning methods, ability to handle large datasets, and functionality for data wrangling, modeling, visualization and communication of results. The document provides instructions on downloading and installing R and RStudio, loading and installing packages, and introduces basic R programming concepts like vectors, matrices, data frames and factors.
R is a programming language and free software used for statistical analysis and graphics. It allows users to analyze data, build statistical models and visualize results. Key features of R include its extensive library of statistical and graphical methods, machine learning algorithms, and ability to handle large and complex data. R is widely used in academia and industry for data science tasks like data analysis, modeling, and communicating results.
This document provides an overview of the R programming language. It describes R as a functional programming language for statistical computing and graphics that is open source and has over 6000 packages. Key features of R discussed include matrix calculation, data visualization, statistical analysis, machine learning, and data manipulation. The document also covers using R Studio as an IDE, reading and writing different data types, programming features like flow control and functions, and examples of correlation, regression, and plotting in R.
This document provides an overview of exploratory data analysis (EDA) for machine learning applications. It discusses identifying data sources, collecting data, and the machine learning process. The main part of EDA involves cleaning, preprocessing, and visualizing data to gain insights through descriptive statistics and data visualizations like histograms, scatter plots, and boxplots. This allows discovering patterns, errors, outliers and missing values to understand the dataset before building models.
- R is a free software environment for statistical computing and graphics. It has an active user community and supports graphical capabilities.
- R can import and export data, perform data manipulation and summaries. It provides various plotting functions and control structures to control program flow.
- Debugging tools in R include traceback, debug, browser and trace which help identify and fix issues in functions.
In three sentences or less:
R is a popular language for data science that can be used for data manipulation, calculation, and graphical display. It includes facilities for data handling, mathematical and statistical analysis, and data visualization. R has an effective programming language and is widely used for tasks like machine learning, statistical modeling, and data analysis.
This document provides an introduction to using R for data science and analytics. It discusses what R is, how to install R and RStudio, statistical software options, and how R can be used with other tools like Tableau, Qlik, and SAS. Examples are given of how R is used in government, telecom, insurance, finance, pharma, and by companies like ANZ bank, Bank of America, Facebook, and the Consumer Financial Protection Bureau. Key statistical concepts are also refreshed.
Social Media and Fake News in the 2016 Election (Ajay Ohri)
This document discusses fake news and its potential impact on the 2016 US presidential election. It begins with background on the definition and history of fake news, noting its long existence but arguing it is growing as an issue today due to lower barriers to media entry, the rise of social media, declining trust in mainstream media, and increasing political polarization. It then presents new data on fake news consumption prior to the 2016 election, finding that fake news was widely shared on social media and heavily tilted towards supporting Trump. While estimates vary, the average American may have seen or remembered one or a few fake news stories. Education level, age, and total media consumption were associated with more accurate assessment of true vs. fake news headlines.
The document shows code for installing PySpark and loading the iris dataset to analyze it using PySpark. It loads the iris CSV data into an RDD and DataFrame. It performs data cleaning and wrangling like changing column names and data types. It runs aggregation operations like calculating mean sepal length grouped by species. This provides an end-to-end example of loading data into PySpark and exploring it using RDDs and DataFrames/SQL.
This book provides a comparative introduction and overview of the R and Python programming languages for data science. It offers concise tutorials with command-by-command translations between the two languages. The book covers topics like data input, inspection, analysis, visualization, statistical modeling, machine learning, and more. It is designed to help practitioners and students that know one language learn the other.
This document provides instructions for installing Spark on Windows 10 by:
1. Installing Java 8, Scala, Eclipse Mars, Maven 3.3, and Spark 1.6.1
2. Setting environment variables for each installation
3. Creating a sample WordCount project in Eclipse using Maven, adding Spark dependencies, and compiling and running the project using spark-submit.
Ajay Ohri is an experienced principal data scientist with 14 years of experience. He has expertise in R, Python, machine learning, data visualization, SAS, SQL and cloud computing. Ohri has extensive experience in financial services domains including credit cards, loans, and insurance. He is proficient in data science tasks like exploratory data analysis, regression modeling, and data cleaning. Ohri has worked on significant projects for government and private clients. He also publishes books and articles on data science topics.
This document provides an overview of key concepts in statistics for data science, including:
- Descriptive statistics like measures of central tendency (mean, median, mode) and variation (range, variance, standard deviation).
- Common distributions like the normal, binomial, and Poisson distributions.
- Statistical inference techniques like hypothesis testing, t-tests, and the chi-square test.
- Bayesian concepts like Bayes' theorem and how to apply it in R.
- How to use R and RCommander for exploring and visualizing data and performing statistical analyses.
R is an open source programming language and software environment for statistical analysis and graphics. It is widely used among data scientists for tasks like data manipulation, calculation, and graphical data analysis. Some key advantages of R include that it is open source and free, has a large collection of statistical tools and packages, is flexible, and has strong capabilities for data visualization. It also has an active user community and can integrate with other software like SAS, Python, and Tableau. R is a popular and powerful tool for data scientists.
This document provides an introduction and overview of a summer school course on business analytics and data science. It begins by introducing the instructor and their qualifications. It then outlines the course schedule and topics to be covered, including introductions to data science, analytics, modeling, Google Analytics, and more. Expectations and support resources are also mentioned. Key concepts from various topics are then defined at a high level, such as the data-information-knowledge hierarchy, data mining, CRISP-DM, machine learning techniques like decision trees and association analysis, and types of models like regression and clustering.
This document summarizes intelligence techniques known as "tradecraft". It defines tradecraft as the techniques used in modern espionage, including general methods like dead drops and specific techniques of organizations like NSA encryption. It provides examples of intelligence technologies like microdots, covert cameras, and concealment devices. It also describes analytical, operational, and technological tradecraft methods such as agent handling, black bag operations, cryptography, cutouts, and honey traps.
The document describes the game of craps and various bets that can be made. It provides the rules and probabilities associated with different outcomes. For a standard craps bet that pays even money, the probability of winning is 5/9 and losing is 4/9. Simulation of 1,000 $1 bets results in an expected net loss, with actual results varying randomly based on dice rolls. Bets with higher payouts have lower probabilities of winning to offset the house advantage.
This document provides a tutorial on data science in Python. It discusses Python's history and the Jupyter notebook interface. It also demonstrates how to import Python packages, load data, inspect data, and munge data for analysis. Specific techniques shown include importing datasets, checking data types and dimensions, selecting rows and columns, and obtaining summary information about the data.
How does cryptography work? by Jeroen Ooms (Ajay Ohri)
This document provides a conceptual introduction to cryptographic methods. It explains that cryptography works by using the XOR operator and one-time pads or stream ciphers to encrypt messages. With one-time pads, a message is XOR'd with random data and can only be decrypted by someone with the pad. Stream ciphers generate pseudo-random streams from a key and nonce to encrypt messages. Public-key encryption uses Diffie-Hellman key exchange to allow parties to establish a shared secret to encrypt messages.
Using R for Social Media and Sports Analytics (Ajay Ohri)
Sqor is a social network focused on sports that uses various technologies like Python, R, Erlang, and SQL in its data pipeline. R is used exclusively for machine learning and statistics tasks like clustering, classification, and predictive analytics. Sqor has developed prediction algorithms in R to identify influential athletes on social media and collaborate with them. Their prediction algorithms appear to be working effectively so far based on results. Sqor is also building an Erlang/R bridge to allow R scripts to be run and scaled from Erlang for tasks like predictive modeling.
Can you teach coding to kids in a mobile game app in local languages? Do you need to be good at English to learn coding in R or Python?
How young can we train people in coding?
This is something we worked on for six months, but we are now giving up on it due to lack of funds.
Feel free to use the idea; it is licensed CC BY-SA.
This document provides an overview of analyzing data using open source tools and techniques to cut costs and improve metrics. It demonstrates tools like R, Python, and Spark that can be used for tasks like data exploration, predictive modeling, and clustering. Common techniques are discussed like examining median, mode, and standard deviation instead of just means. The document also gives examples of use cases like churn prediction, conversion propensity, and web/social network analytics. It concludes by encouraging the systematic collection and use of data to make decisions and that visualizing data through graphs is very helpful.
Cox Communications is an American company that provides digital cable television, telecommunications, and home automation services in the United States. Gary Bonneau is a senior manager for product operations at Cox Business (the business side of Cox Communications).
Gary has been working in the telecommunications industry for over two decades and — after following the topic for many years — is a bit of a process mining veteran as well. Now, he is putting process mining to use to visualize his own fulfillment processes. The business life cycles are very complex and multiple data sources need to be connected to get the full picture. At camp, Gary shared the dos and don'ts and take-aways of his experience.
Important JavaScript Concepts Every Developer Must Know (yashikanigam1)
Mastering JavaScript requires a deep understanding of key concepts like closures, hoisting, promises, async/await, event loop, and prototypal inheritance. These fundamentals are crucial for both frontend and backend development, especially when working with frameworks like React or Node.js. At TutorT Academy, we cover these topics in our live courses for professionals, ensuring hands-on learning through real-world projects. If you're looking to strengthen your programming foundation, our best online professional certificates in full-stack development and system design will help you apply JavaScript concepts effectively and confidently in interviews or production-level applications.
Dr. Robert Krug - Expert In Artificial Intelligence (Dr. Robert Krug)
Dr. Robert Krug is a New York-based expert in artificial intelligence, with a Ph.D. in Computer Science from Columbia University. He serves as Chief Data Scientist at DataInnovate Solutions, where his work focuses on applying machine learning models to improve business performance and strengthen cybersecurity measures. With over 15 years of experience, Robert has a track record of delivering impactful results. Away from his professional endeavors, Robert enjoys the strategic thinking of chess and urban photography.
Oak Ridge National Laboratory (ORNL) is a leading science and technology laboratory under the direction of the Department of Energy.
Hilda Klasky is part of the R&D Staff of the Systems Modeling Group in the Computational Sciences & Engineering Division at ORNL. To prepare the data of the radiology process from the Veterans Affairs Corporate Data Warehouse for her process mining analysis, Hilda had to condense and pre-process the data in various ways. Step by step she shows the strategies that have worked for her to simplify the data to the level that was required to be able to analyze the process with domain experts.
Wil van der Aalst gave the closing keynote at camp. He started with giving an overview of the progress that has been made in the process mining field over the past 20 years. Process mining unlocks great potential but also comes with a huge responsibility. Responsible data science focuses on positive technological breakthroughs and aims to prevent “pollution” by “bad data science”.
Wil gave us a sneak peek at current responsible process mining research from the area of ‘fairness’ (how to draw conclusions from data that are fair without sacrificing accuracy too much) and ‘confidentiality’ (how to analyze data without revealing secrets). While research can provide some solutions by developing new techniques, understanding these risks is a responsibility of the process miner.
Today's children are growing up in a rapidly evolving digital world, where digital media play an important role in their daily lives. Digital services offer opportunities for learning, entertainment, accessing information, discovering new things, and connecting with other peers and community members. However, they also pose risks, including problematic or excessive use of digital media, exposure to inappropriate content, harmful conducts, and other online safety concerns.
In the context of the International Day of Families on 15 May 2025, the OECD is launching its report How’s Life for Children in the Digital Age? which provides an overview of the current state of children's lives in the digital environment across OECD countries, based on the available cross-national data. It explores the challenges of ensuring that children are both protected and empowered to use digital media in a beneficial way while managing potential risks. The report highlights the need for a whole-of-society, multi-sectoral policy approach, engaging digital service providers, health professionals, educators, experts, parents, and children to protect, empower, and support children, while also addressing offline vulnerabilities, with the ultimate aim of enhancing their well-being and future outcomes. Additionally, it calls for strengthening countries’ capacities to assess the impact of digital media on children's lives and to monitor rapidly evolving challenges.
Python for R Users
1. Python for R Users
By Chandan Routray
As part of an internship at www.decisionstats.com
2. Basic Commands
Dec 2014. Copyright www.decisionstats.com. Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Function: R | Python
Downloading and installing a package: install.packages('name') | pip install name
Load a package: library('name') | import name as other_name
Checking working directory: getwd() | import os; os.getcwd()
Setting working directory: setwd() | os.chdir()
List files in a directory: dir() | os.listdir()
List all objects: ls() | globals()
Remove an object: rm('name') | del name
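A minimal, runnable sketch of the Python column above (the directory path is only an illustrative assumption; replace it with one that exists on your machine):

import os

print(os.getcwd())             # current working directory, like getwd()
os.chdir("/tmp")               # change working directory, like setwd(); "/tmp" is just an example path
print(os.listdir("."))         # list files in a directory, like dir()

x = 42
print(list(globals().keys()))  # names defined so far, roughly what ls() shows in R
del x                          # remove an object, like rm('x')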
3. Data Frame Creation
R vs Python (using the pandas package*)

Task: create a data frame "df" of dimension 6x4 (6 rows and 4 columns) containing random numbers.

R:
A <- matrix(runif(24, 0, 1), nrow=6, ncol=4)
df <- data.frame(A)
Here,
• the runif function generates 24 random numbers between 0 and 1
• the matrix function creates a matrix from those random numbers; nrow and ncol set the number of rows and columns of the matrix
• data.frame converts the matrix to a data frame

Python:
import numpy as np
import pandas as pd
A = np.random.randn(6, 4)
df = pd.DataFrame(A)
Here,
• np.random.randn generates a matrix of 6 rows and 4 columns; this function is part of the numpy** library
• pd.DataFrame converts the matrix into a data frame

*To install the pandas library visit http://pandas.pydata.org/; to import the pandas library type: import pandas as pd
**To import the numpy library type: import numpy as np
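One detail worth noting when running the Python side: R's runif draws uniform numbers in [0, 1], so np.random.rand is the closer analogue, whereas np.random.randn (used on the slide) draws from a standard normal distribution. A small sketch under that assumption:

import numpy as np
import pandas as pd

A = np.random.rand(6, 4)   # uniform in [0, 1), matching runif(24, 0, 1) more closely than randn
df = pd.DataFrame(A)
print(df.shape)            # (6, 4)
print(df.head())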
4. Data Frame Creation (R and Python examples shown side by side)
5. Data Frame: Inspecting and Viewing Data
R vs Python (using the pandas package*)

Getting the names of rows and columns of data frame "df":
R: rownames(df) returns the row names; colnames(df) returns the column names
Python: df.index returns the row names; df.columns returns the column names

Seeing the top and bottom "x" rows of data frame "df":
R: head(df, x) returns the top x rows; tail(df, x) returns the bottom x rows
Python: df.head(x) returns the top x rows; df.tail(x) returns the bottom x rows

Getting the dimensions of data frame "df":
R: dim(df) returns them in the format: rows, columns
Python: df.shape returns them in the format: (rows, columns)

Length of data frame "df":
R: length(df) returns the number of columns in the data frame
Python: len(df) returns the number of rows; use len(df.columns) or df.shape[1] for the number of columns
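A short runnable sketch of the inspection calls above, using a random data frame as an example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4))
print(df.index)     # row labels, like rownames(df)
print(df.columns)   # column labels, like colnames(df)
print(df.head(2))   # top 2 rows, like head(df, 2)
print(df.tail(2))   # bottom 2 rows, like tail(df, 2)
print(df.shape)     # (rows, columns), like dim(df)
print(len(df))      # number of rows, unlike R's length(df) which counts columns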
6. Data Frame: Inspecting and Viewing Data (R and Python examples shown side by side)
7. Data Frame: Inspecting and Viewing Data
R vs Python (using the pandas package*)

Getting a quick summary (mean, standard deviation, etc.) of the data in data frame "df":
R: summary(df) returns the mean, median, maximum, minimum, first quartile and third quartile
Python: df.describe() returns the count, mean, standard deviation, maximum, minimum, 25%, 50% and 75%

Setting the row names and column names of data frame "df":
R: rownames(df) <- c("A", "B", "C", "D", "E", "F") sets the row names to A, B, C, D, E and F
   colnames(df) <- c("P", "Q", "R", "S") sets the column names to P, Q, R and S
Python: df.index = ["A", "B", "C", "D", "E", "F"] sets the row names to A, B, C, D, E and F
        df.columns = ["P", "Q", "R", "S"] sets the column names to P, Q, R and S
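A minimal sketch of the summary and renaming calls, again using a random data frame as an example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4))
print(df.describe())                          # count, mean, std, min, 25%, 50%, 75%, max

df.index = ["A", "B", "C", "D", "E", "F"]     # set row labels, like rownames(df) <- ...
df.columns = ["P", "Q", "R", "S"]             # set column labels, like colnames(df) <- ...
print(df)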
8. Data Frame: Inspecting and Viewing Data (R and Python examples shown side by side)
9. Data Frame: Sorting Data
R vs Python (using the pandas package*)

Sorting the data in data frame "df" by column "P":
R: df[order(df$P), ]
Python: df.sort(['P']) (note: df.sort was later removed from pandas; current versions use df.sort_values('P'))
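A small runnable example with the current pandas API; the data frame contents are illustrative:

import pandas as pd

df = pd.DataFrame({"P": [3, 1, 2], "Q": ["c", "a", "b"]})
print(df.sort_values("P"))                   # ascending sort by column P, like df[order(df$P), ]
print(df.sort_values("P", ascending=False))  # descending sort, like df[order(-df$P), ]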
10. Data Frame: Sorting Data (R and Python examples shown side by side)
11. Data Frame: Data Selection
R vs Python (using the pandas package*)

Slicing the rows of a data frame from row no. x to row no. y (including rows x and y)
  R: df[x:y,]
  Python: df[x-1:y] (Python starts counting from 0)

Slicing the columns named "X" and "Y" of a data frame "df"
  R: myvars <- c("X", "Y"); newdata <- df[myvars]
  Python: df.loc[:, ['X', 'Y']]

Selecting the data from row no. x to y and column no. a to b
  R: df[x:y, a:b]
  Python: df.iloc[x-1:y, a-1:b]

Selecting the element at row no. x and column no. y
  R: df[x, y]
  Python: df.iat[x-1, y-1]
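A concrete sketch of the Python selection idioms above, using a 6x4 DataFrame with columns P-S; the row and column positions are chosen purely for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=["P", "Q", "R", "S"])
print(df[0:3])                 # rows 1 to 3 in R terms (Python counts from 0)
print(df.loc[:, ["P", "Q"]])   # columns selected by name
print(df.iloc[1:4, 0:2])       # rows 2 to 4 and columns 1 to 2 in R terms
print(df.iat[0, 1])            # single element: first row, second column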
12. Data Frame: Data Selection
(R and Python code screenshots)
13. Data Frame: Data Selection
R vs Python (using the pandas package*)

Using a single column's values to select rows, here column "A"
  R: subset(df, A > 0) selects all the rows whose value in column A is greater than 0
  Python: df[df.A > 0] does the same as the R expression
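A minimal sketch of the boolean selection, with an illustrative column A:

import pandas as pd

df = pd.DataFrame({"A": [-1.5, 0.3, 2.0], "B": [10, 20, 30]})
print(df[df.A > 0])   # keeps only the rows whose value in column A is positive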
14. Mathematical Functions
Functions: R vs Python (import the math and numpy libraries)

Sum: sum(x) in R; math.fsum(x) in Python
Square root: sqrt(x) in R; math.sqrt(x) in Python
Standard deviation: sd(x) in R; numpy.std(x, ddof=1) in Python (numpy.std defaults to the population formula, so ddof=1 is needed to match R's sd)
Log: log(x) in R; math.log(x[, base]) in Python
Mean: mean(x) in R; numpy.mean(x) in Python
Median: median(x) in R; numpy.median(x) in Python
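A runnable sketch of the Python column, with an illustrative list x:

import math
import numpy as np

x = [1.0, 2.0, 3.0, 4.0]
print(math.fsum(x))        # sum
print(math.sqrt(16))       # square root
print(np.std(x, ddof=1))   # sample standard deviation, matching R's sd
print(math.log(8, 2))      # log of 8 to base 2
print(np.mean(x))          # mean
print(np.median(x))        # median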
15. Mathematical Functions
(R and Python code screenshots)
16. Data Manipulation
Functions: R vs Python (import the math and numpy libraries)

Convert a character variable to a numeric variable
  R: as.numeric(x)
  Python: int(x), long(x) or float(x) for a single value; list(map(int, x)) or list(map(float, x)) for lists, vectors etc.

Convert a factor/numeric variable to a character variable
  R: paste(x)
  Python: str(x) for a single value; list(map(str, x)) for lists, vectors etc.

Check for a missing value in an object
  R: is.na(x)
  Python: math.isnan(x)

Delete missing values from an object
  R: na.omit(list)
  Python: cleanedList = [x for x in list if str(x) != 'nan']

Calculate the number of characters in a character value
  R: nchar(x)
  Python: len(x)
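A small, runnable sketch of the Python column, with illustrative values:

import math

nums = list(map(int, ["1", "2", "3"]))             # character -> numeric for a list
chars = list(map(str, nums))                       # numeric -> character for a list
print(math.isnan(float("nan")))                    # True: the value is missing
data = [1.0, float("nan"), 3.0]
cleaned = [v for v in data if not math.isnan(v)]   # drop the missing values
print(cleaned)
print(len("python"))                               # number of characters: 6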
17. Date & Time Manipulation
Functions: R (import the lubridate library) vs Python (import the datetime library)

Getting the time and date at an instant
  R: Sys.time()
  Python: datetime.datetime.now()

Parsing date and time in the format YYYY MM DD HH:MM:SS
  R: d <- Sys.time()
     d_format <- ymd_hms(d)
  Python: d = datetime.datetime.now()
          format = "%Y %m %d %H:%M:%S"
          d_format = d.strftime(format)
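A runnable sketch of the Python side; the timestamp string used for parsing is illustrative:

import datetime

d = datetime.datetime.now()                 # current date and time
fmt = "%Y %m %d %H:%M:%S"                   # YYYY MM DD HH:MM:SS
print(d.strftime(fmt))                      # format the timestamp as text
parsed = datetime.datetime.strptime("2014 12 01 10:30:00", fmt)
print(parsed)                               # parse the same format back into a datetime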
18. Data Visualization
Functions: R vs Python (import the matplotlib library**)

Scatter plot of variable1 vs variable2
  R: plot(variable1, variable2)
  Python: plt.scatter(variable1, variable2); plt.show()

Boxplot for Var
  R: boxplot(Var)
  Python: plt.boxplot(Var); plt.show()

Histogram for Var
  R: hist(Var)
  Python: plt.hist(Var); plt.show()

Pie chart for Var
  R: pie(Var)
  Python: plt.pie(Var); plt.show()
** To import matplotlib library type: import matplotlib.pyplot as plt
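A self-contained sketch of the four Python calls above, using randomly generated data as a stand-in for variable1, variable2 and Var:

import numpy as np
import matplotlib.pyplot as plt

variable1 = np.random.randn(100)
variable2 = np.random.randn(100)
Var = np.random.randn(100)

plt.scatter(variable1, variable2)   # scatter plot
plt.show()

plt.boxplot(Var)                    # box plot
plt.show()

plt.hist(Var)                       # histogram
plt.show()

plt.pie([30, 45, 25])               # pie chart of three illustrative shares
plt.show()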
19. Data Visualization: Scatter Plot (R and Python screenshots)
20. Data Visualization: Box Plot (R and Python screenshots)
21. Data Visualization: Histogram (R and Python screenshots)
22. Data Visualization: Line Plot (R and Python screenshots)
23. Data Visualization: Bubble (R and Python screenshots)
24. Data Visualization: Bar (R and Python screenshots)
25. Data Visualization: Pie Chart (R and Python screenshots)
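The line, bar and bubble charts shown on the screenshot slides are not covered by the table on slide 18; a minimal matplotlib sketch of those three chart types, with illustrative data:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
y = x ** 2

plt.plot(x, y)                      # line plot
plt.show()

plt.bar(x, y)                       # bar chart
plt.show()

sizes = np.random.rand(10) * 300    # marker areas for the "bubbles"
plt.scatter(x, y, s=sizes)          # bubble chart: scatter with varying point sizes
plt.show()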
27. Coming up
● Data Mining in Python and R (see draft slides afterwards)
28. Machine Learning: SVM on Iris Dataset
*To know more about svm function in R visit: http://cran.r-project.org/web/packages/e1071/
** To install the sklearn library visit: http://scikit-learn.org/; to know more about sklearn svm visit: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
R (using the svm* function) | Python (using the sklearn** library)
library(e1071)
data(iris)
trainset <- iris[1:149,]
testset <- iris[150,]
svm.model <- svm(Species ~ ., data = trainset, cost = 100, gamma = 1,
                 type = 'C-classification')
svm.pred <- predict(svm.model, testset[-5])
svm.pred
#Loading the library
from sklearn import svm
#Importing the datasets module
from sklearn import datasets
#Calling SVM
clf = svm.SVC()
#Loading the iris dataset
iris = datasets.load_iris()
#Constructing the training data (all rows except the last one)
X, y = iris.data[:-1], iris.target[:-1]
#Fitting the SVM
clf.fit(X, y)
#Testing the model on the held-out last row
print(clf.predict(iris.data[-1:]))
R output: virginica | Python output: 2, which corresponds to virginica
29. Linear Regression: Iris Dataset
*To know more about lm function in R visit: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html
** To know more about sklearn linear regression visit: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
R (using the lm* function) | Python (using the sklearn** library)
data(iris)
total_size <- dim(iris)[1]
num_target <- c(rep(0, total_size))
for (i in 1:length(num_target)) {
  if (iris$Species[i] == 'setosa') {
    num_target[i] <- 0
  } else if (iris$Species[i] == 'versicolor') {
    num_target[i] <- 1
  } else {
    num_target[i] <- 2
  }
}
iris$Species <- num_target
train_set <- iris[1:149,]
test_set <- iris[150,]
fit <- lm(Species ~ 0 + Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
          data = train_set)
coefficients(fit)
predict.lm(fit, test_set)
from sklearn import linear_model
from sklearn import datasets

iris = datasets.load_iris()
regr = linear_model.LinearRegression()
X, y = iris.data[:-1], iris.target[:-1]
regr.fit(X, y)
print(regr.coef_)
print(regr.predict(iris.data[-1:]))
R output: 1.64 | Python output: 1.65
30. Random forest: Iris Dataset
*To know more about randomForest package in R visit: http://cran.r-project.org/web/packages/randomForest/
** To know more about sklearn random forest visit: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
R (using the randomForest* package) | Python (using the sklearn** library)
library(randomForest)
data(iris)
total_size <- dim(iris)[1]
num_target <- c(rep(0, total_size))
for (i in 1:length(num_target)) {
  if (iris$Species[i] == 'setosa') {
    num_target[i] <- 0
  } else if (iris$Species[i] == 'versicolor') {
    num_target[i] <- 1
  } else {
    num_target[i] <- 2
  }
}
iris$Species <- num_target
train_set <- iris[1:149,]
test_set <- iris[150,]
iris.rf <- randomForest(Species ~ ., data = train_set, ntree = 100,
                        importance = TRUE, proximity = TRUE)
print(iris.rf)
predict(iris.rf, test_set[-5], predict.all = TRUE)
from sklearn import ensemble
from sklearn import datasets

clf = ensemble.RandomForestClassifier(n_estimators=100, max_depth=10)
iris = datasets.load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
R output: 1.845 | Python output: 2
31. Decision Tree: Iris Dataset
*To know more about rpart package in R visit: http://cran.r-project.org/web/packages/rpart/
** To know more about sklearn decision tree visit: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
R (using the rpart* package) | Python (using the sklearn** library)
library(rpart)
data(iris)
sub <- c(1:149)
fit <- rpart(Species ~ ., data = iris, subset = sub)
fit
predict(fit, iris[-sub,], type = "class")
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
R output: virginica | Python output: 2, which corresponds to virginica
32. Gaussian Naive Bayes: Iris Dataset
*To know more about e1071 package in R visit: http://cran.r-project.org/web/packages/e1071/
** To know more about sklearn Naive Bayes visit: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
R (using the e1071* package) | Python (using the sklearn** library)
library(e1071)
data(iris)
trainset <- iris[1:149,]
testset <- iris[150,]
classifier <- naiveBayes(trainset[,1:4], trainset[,5])
predict(classifier, testset[,-5])
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
clf.fit(X, y)
print(clf.predict(iris.data[-1:]))
R output: virginica | Python output: 2, which corresponds to virginica
33. K Nearest Neighbours: Iris Dataset
*To know more about kknn package in R visit:
** To know more about sklearn k nearest neighbours visit: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
R (using the kknn* package) | Python (using the sklearn** library)
library(kknn)
data(iris)
trainset <- iris[1:149,]
testset <- iris[150,]
iris.kknn <- kknn(Species ~ ., trainset, testset, distance = 1,
                  kernel = "triangular")
summary(iris.kknn)
fit <- fitted(iris.kknn)
fit
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
iris = load_iris()
X, y = iris.data[:-1], iris.target[:-1]
knn.fit(X, y)
print(knn.predict(iris.data[-1:]))
R output: virginica | Python output: 2, which corresponds to virginica